
CASE Study
Sometimes we want to populate a table with random data, mainly for testing purposes. For example, let’s consider the following table:
Transact-SQL
CREATE TABLE Sales.Orders ( Id INT NOT NULL IDENTITY(1,1) , CustomerId INT NOT NULL , DateAndTime DATETIME2(0) NOT NULL , OrderStatus NVARCHAR(50) NULL ); GO
1
2
3
4
5
6
7
8
9
CREATE TABLE
Sales.Orders
(
Id INT NOT NULL IDENTITY(1,1) ,
CustomerId INT NOT NULL ,
DateAndTime DATETIME2(0) NOT NULL ,
OrderStatus NVARCHAR(50) NULL
);
GO
Let’s say we want to insert 100,000 rows with random data. We want to generate random values between 1 and 1,000,000 for the “CustomerId” column, and we want to generate random date & time values in the past 5 years for the “DateAndTime” column. For now, let’s ignore the “OrderStatus” column. Here is how I would do it:
Transact-SQL
INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) , SYSDATETIME ())
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2;
GO
Notice a few things:
I’m using the TABLOCK hint in order to use minimal logging. For more information about minimally logged inserts, I recommend this article by Itzik Ben-Gan.
I’m retrieving rows from sys.all_columns cross joined to itself. It really doesn’t matter what table/s I’m using in this case. All I need is to make sure that the data set contains at least 100,000 rows. The TOP clause limits the number of rows returned to 100,000.
I’m using the NEWID function in order to generate random values. If you try to use the RAND function, it will return the same value for the whole recordset, because the RAND function is evaluated only once for the query. If you want a different value for each row (of course you do), then you need to use the NEWID function.
Here is how the table looks like (10 first rows) after the insert:IdCustomerIdDateAndTimeOrderStatus11276932009-09-30 02:18:25NULL21357642008-04-26 07:25:29NULL35163462010-04-17 13:19:24NULL45236682009-05-20 21:12:51NULL52498782010-04-14 10:43:30NULL67201862008-11-20 07:00:06NULL7807922011-03-08 02:39:24NULL8461222011-08-24 10:47:45NULL97370562009-12-15 02:14:43NULL103043842008-03-17 04:51:10NULL
Now, let’s add the “OrderStatus” column to the game. This column should contain one of the following values: “New”, “Open”, “Closed”, “Cancelled” and “Archived”. Generating random values for this column is a bit trickier. We can achieve this by using the CASE statement. But first, a few words about this statement…
The CASE statement has two flavors. One is called “Simple CASE”, and it compares a single expression to a set of simple expressions. The second is called “Searched CASE”, and it evaluates a set of general Boolean expressions.
Here is an example of a simple CASE:
Transact-SQL
CASE CustomerTypeId WHEN 1 THEN N'Regular' WHEN 2 THEN N'VIP' WHEN 3 THEN N'New' WHEN 4 THEN N'Blocked' ELSE N'Unknown' END
1
2
3
4
5
6
7
CASE CustomerTypeId
WHEN 1 THEN N'Regular'
WHEN 2 THEN N'VIP'
WHEN 3 THEN N'New'
WHEN 4 THEN N'Blocked'
ELSE N'Unknown'
END
And here is an example of a searched CASE:
Transact-SQL
CASE WHEN Price <= $100.00 THEN N'Cheap' WHEN Price > $100.00 AND ProductType = 1 THEN N'Medium' WHEN Price > $100.00 AND ProductType = 2 THEN N'Expensive' WHEN Price IS NULL THEN N'Unknown Price' END
1
2
3
4
5
6
CASE
WHEN Price <= $100.00 THEN N'Cheap'
WHEN Price > $100.00 AND ProductType = 1 THEN N'Medium'
WHEN Price > $100.00 AND ProductType = 2 THEN N'Expensive'
WHEN Price IS NULL THEN N'Unknown Price'
END
In both cases, the first condition that evaluates to TRUE determines the result. The ELSE clause is optional. If all conditions evaluate to FALSE and there is no ELSE clause, then the result is NULL.
Now, let’s go back to our example. In order to generate a random value for the “OrderStatus” column, we can generate a random value between 1 and 5, and use the simple CASE statement in order to translate these numbers into text values, like this:
Transact-SQL
TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1 WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
TRUNCATE TABLE
Sales.Orders;
GO
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) ,
SYSDATETIME ()) ,
OrderStatus = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1
WHEN 1 THEN N'New'
WHEN 2 THEN N'Open'
WHEN 3 THEN N'Closed'
WHEN 4 THEN N'Cancelled'
WHEN 5 THEN N'Archived'
END
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2;
GO
Here are the results (first 10 rows):IdCustomerIdDateAndTimeOrderStatus11160622008-09-24 19:50:01NULL234692008-02-24 00:59:53Cancelled36353182009-07-23 09:44:25Open44131572010-04-27 01:14:15NULL55050982009-08-20 22:42:10NULL61139482010-01-26 21:47:57NULL79819392007-10-20 21:12:36NULL82585772007-05-20 20:59:16Open97048822007-10-01 16:13:24Cancelled106682012010-04-08 15:12:40NULL
Notice that there are many NULL values in the “OrderStatus” column. This is strange, cause the expression we used (ABS (CHECKSUM (NEWID ())) % 5 + 1) always returns a number between 1 and 5, so one of the conditions in the CASE clause should have been evaluated to TRUE. What’s going on?
Well, I’m glad you asked, because this is what I wanted to blog about this time. SQL Server doesn’t really have two flavors of the CASE statement. It’s actually two interfaces for our convenience. When we use the simple CASE statement, SQL Server translates it into the searched CASE format, because this is actually the only format that exists for the query processor. So the previous INSERT statement is translated by SQL Server to this:
Transact-SQL
TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1 THEN N'New' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2 THEN N'Open' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3 THEN N'Closed' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4 THEN N'Cancelled' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
TRUNCATE TABLE
Sales.Orders;
GO
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) ,
SYSDATETIME ()) ,
OrderStatus = CASE
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1
THEN N'New'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2
THEN N'Open'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3
THEN N'Closed'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4
THEN N'Cancelled'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5
THEN N'Archived'
END
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2;
GO
Now you can see that the expression ABS (CHECKSUM (NEWID ())) % 5 + 1 is calculated separately for each condition in the Searched CASE statement. For the first condition, a random value between 1 and 5 is calculated and compared to “1”. The probability of a match is 20%. If it’s FALSE, then we move one to the second condition, for which another (different) random value between 1 and 5 is calculated and, this time, compared to “2”. Again, we have a probability of 20% for a match, and so on. It is very likely that all conditions will evaluate to FALSE, and in that case the result will be NULL. Actually, the probability of getting NULL from this CASE expression is 0.85 = 32.768%. If you count the number of NULL values in the table, you’ll find that there are around 32,768 such rows (out of 100,000 rows in total).