• Eitan Blumin

CASE Study

Sometimes we want to populate a table with random data, mainly for testing purposes. For example, let’s consider the following table:

Transact-SQL

CREATE TABLE Sales.Orders ( Id INT NOT NULL IDENTITY(1,1) , CustomerId INT NOT NULL , DateAndTime DATETIME2(0) NOT NULL , OrderStatus NVARCHAR(50) NULL ); GO

1

2

3

4

5

6

7

8

9

CREATE TABLE

Sales.Orders

(

Id INT NOT NULL IDENTITY(1,1) ,

CustomerId INT NOT NULL ,

DateAndTime DATETIME2(0) NOT NULL ,

OrderStatus NVARCHAR(50) NULL

);

GO

Let’s say we want to insert 100,000 rows with random data. We want to generate random values between 1 and 1,000,000 for the “CustomerId” column, and we want to generate random date & time values in the past 5 years for the “DateAndTime” column. For now, let’s ignore the “OrderStatus” column. Here is how I would do it:

Transact-SQL

INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

INSERT INTO

Sales.Orders WITH (TABLOCK)

(

CustomerId ,

DateAndTime

)

SELECT TOP (100000)

CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,

DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))

% (60 * 60 * 24 * 365 * 5) , SYSDATETIME ())

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2;

GO

Notice a few things:

I’m using the TABLOCK hint in order to use minimal logging. For more information about minimally logged inserts, I recommend this article by Itzik Ben-Gan.

I’m retrieving rows from sys.all_columns cross joined to itself. It really doesn’t matter what table/s I’m using in this case. All I need is to make sure that the data set contains at least 100,000 rows. The TOP clause limits the number of rows returned to 100,000.

I’m using the NEWID function in order to generate random values. If you try to use the RAND function, it will return the same value for the whole recordset, because the RAND function is evaluated only once for the query. If you want a different value for each row (of course you do), then you need to use the NEWID function.

Here is how the table looks like (10 first rows) after the insert:IdCustomerIdDateAndTimeOrderStatus11276932009-09-30 02:18:25NULL21357642008-04-26 07:25:29NULL35163462010-04-17 13:19:24NULL45236682009-05-20 21:12:51NULL52498782010-04-14 10:43:30NULL67201862008-11-20 07:00:06NULL7807922011-03-08 02:39:24NULL8461222011-08-24 10:47:45NULL97370562009-12-15 02:14:43NULL103043842008-03-17 04:51:10NULL

Now, let’s add the “OrderStatus” column to the game. This column should contain one of the following values: “New”, “Open”, “Closed”, “Cancelled” and “Archived”. Generating random values for this column is a bit trickier. We can achieve this by using the CASE statement. But first, a few words about this statement…

The CASE statement has two flavors. One is called “Simple CASE”, and it compares a single expression to a set of simple expressions. The second is called “Searched CASE”, and it evaluates a set of general Boolean expressions.

Here is an example of a simple CASE:

Transact-SQL

CASE CustomerTypeId WHEN 1 THEN N'Regular' WHEN 2 THEN N'VIP' WHEN 3 THEN N'New' WHEN 4 THEN N'Blocked' ELSE N'Unknown' END

1

2

3

4

5

6

7

CASE CustomerTypeId

WHEN 1 THEN N'Regular'

WHEN 2 THEN N'VIP'

WHEN 3 THEN N'New'

WHEN 4 THEN N'Blocked'

ELSE N'Unknown'

END

And here is an example of a searched CASE:

Transact-SQL

CASE WHEN Price <= $100.00 THEN N'Cheap' WHEN Price > $100.00 AND ProductType = 1 THEN N'Medium' WHEN Price > $100.00 AND ProductType = 2 THEN N'Expensive' WHEN Price IS NULL THEN N'Unknown Price' END

1

2

3

4

5

6

CASE

WHEN Price <= $100.00 THEN N'Cheap'

WHEN Price > $100.00 AND ProductType = 1 THEN N'Medium'

WHEN Price > $100.00 AND ProductType = 2 THEN N'Expensive'

WHEN Price IS NULL THEN N'Unknown Price'

END

In both cases, the first condition that evaluates to TRUE determines the result. The ELSE clause is optional. If all conditions evaluate to FALSE and there is no ELSE clause, then the result is NULL.

Now, let’s go back to our example. In order to generate a random value for the “OrderStatus” column, we can generate a random value between 1 and 5, and use the simple CASE statement in order to translate these numbers into text values, like this:

Transact-SQL

TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1 WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

TRUNCATE TABLE

Sales.Orders;

GO

INSERT INTO

Sales.Orders WITH (TABLOCK)

(

CustomerId ,

DateAndTime ,

OrderStatus

)

SELECT TOP (100000)

CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,

DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))

% (60 * 60 * 24 * 365 * 5) ,

SYSDATETIME ()) ,

OrderStatus = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1

WHEN 1 THEN N'New'

WHEN 2 THEN N'Open'

WHEN 3 THEN N'Closed'

WHEN 4 THEN N'Cancelled'

WHEN 5 THEN N'Archived'

END

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2;

GO

Here are the results (first 10 rows):IdCustomerIdDateAndTimeOrderStatus11160622008-09-24 19:50:01NULL234692008-02-24 00:59:53Cancelled36353182009-07-23 09:44:25Open44131572010-04-27 01:14:15NULL55050982009-08-20 22:42:10NULL61139482010-01-26 21:47:57NULL79819392007-10-20 21:12:36NULL82585772007-05-20 20:59:16Open97048822007-10-01 16:13:24Cancelled106682012010-04-08 15:12:40NULL

Notice that there are many NULL values in the “OrderStatus” column. This is strange, cause the expression we used (ABS (CHECKSUM (NEWID ())) % 5 + 1) always returns a number between 1 and 5, so one of the conditions in the CASE clause should have been evaluated to TRUE. What’s going on?

Well, I’m glad you asked, because this is what I wanted to blog about this time. SQL Server doesn’t really have two flavors of the CASE statement. It’s actually two interfaces for our convenience. When we use the simple CASE statement, SQL Server translates it into the searched CASE format, because this is actually the only format that exists for the query processor. So the previous INSERT statement is translated by SQL Server to this:

Transact-SQL

TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1 THEN N'New' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2 THEN N'Open' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3 THEN N'Closed' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4 THEN N'Cancelled' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

TRUNCATE TABLE

Sales.Orders;

GO

INSERT INTO

Sales.Orders WITH (TABLOCK)

(

CustomerId ,

DateAndTime ,

OrderStatus

)

SELECT TOP (100000)

CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,

DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))

% (60 * 60 * 24 * 365 * 5) ,

SYSDATETIME ()) ,

OrderStatus = CASE

WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1

THEN N'New'

WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2

THEN N'Open'

WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3

THEN N'Closed'

WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4

THEN N'Cancelled'

WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5

THEN N'Archived'

END

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2;

GO

Now you can see that the expression ABS (CHECKSUM (NEWID ())) % 5 + 1 is calculated separately for each condition in the Searched CASE statement. For the first condition, a random value between 1 and 5 is calculated and compared to “1”. The probability of a match is 20%. If it’s FALSE, then we move one to the second condition, for which another (different) random value between 1 and 5 is calculated and, this time, compared to “2”. Again, we have a probability of 20% for a match, and so on. It is very likely that all conditions will evaluate to FALSE, and in that case the result will be NULL. Actually, the probability of getting NULL from this CASE expression is 0.85 = 32.768%. If you count the number of NULL values in the table, you’ll find that there are around 32,768 such rows (out of 100,000 rows in total).

OK, so we learned that the Simple CASE is actually a Searched CASE behind the scenes, and that for this reason, using a random expression with the Simple CASE is problematic. That’s good to know (I think), but I still want to populate the “OrderStatus” column with random values. How do I do that?

Well, I’m glad you asked again. There are several options to solve this problem. One of them is to use a small table with a single column containing the 5 text values, cross join to this table as well and sort the whole recordset randomly. Like this:

Transact-SQL

CREATE TABLE Sales.OrderStatuses ( OrderStatus NVARCHAR(50) NOT NULL ); GO INSERT INTO Sales.OrderStatuses ( OrderStatus ) VALUES (N'New') , (N'Open') , (N'Closed') , (N'Cancelled') , (N'Archived'); GO TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = OrderStatus FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2 CROSS JOIN Sales.OrderStatuses ORDER BY NEWID () ASC; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

43

44

45

46

47

48

49

CREATE TABLE

Sales.OrderStatuses

(

OrderStatus NVARCHAR(50) NOT NULL

);

GO

INSERT INTO

Sales.OrderStatuses

(

OrderStatus

)

VALUES

(N'New') ,

(N'Open') ,

(N'Closed') ,

(N'Cancelled') ,

(N'Archived');

GO

TRUNCATE TABLE

Sales.Orders;

GO

INSERT INTO

Sales.Orders WITH (TABLOCK)

(

CustomerId ,

DateAndTime ,

OrderStatus

)

SELECT TOP (100000)

CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,

DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))

% (60 * 60 * 24 * 365 * 5) ,

SYSDATETIME ()) ,

OrderStatus = OrderStatus

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2

CROSS JOIN

Sales.OrderStatuses

ORDER BY

NEWID () ASC;

GO

This method works, but it’s very inefficient, because the query processor has to complete all the join operations first, then calculate the NEWID function for every row, then sort the whole recordset according to the values of NEWID, and only then retrieve the top 100,000 rows. Since the recordset is huge (around 134,000,000 in my case), the sort operation is going to take forever. You can try to replace sys.all_columns with another table that has fewer rows in order to get a recordset of a size close to 100,000, but it’s still not going to be an easy task.

Another solution to the problem is to insert the 100,000 rows one by one in a loop. Inside the loop we can calculate the random value once and put it in a variable, and then we can use the Simple CASE statement with the variable. Like this:

Transact-SQL

DECLARE @Counter AS INT = 100000 , @RandomValueBetween1And5 AS TINYINT; TRUNCATE TABLE Sales.Orders; WHILE @Counter > 0 BEGIN SET @RandomValueBetween1And5 = ABS (CHECKSUM (NEWID ())) % 5 + 1; INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE @RandomValueBetween1And5 WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END; SET @Counter -= 1; END; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

DECLARE

@Counter AS INT = 100000 ,

@RandomValueBetween1And5 AS TINYINT;

TRUNCATE TABLE

Sales.Orders;

WHILE

@Counter > 0

BEGIN

SET @RandomValueBetween1And5 =

ABS (CHECKSUM (NEWID ())) % 5 + 1;

INSERT INTO

Sales.Orders WITH (TABLOCK)

(

CustomerId ,

DateAndTime ,

OrderStatus

)

SELECT

CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,

DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM

(NEWID ())) % (60 * 60 * 24 * 365 * 5)

, SYSDATETIME ()) ,

OrderStatus = CASE @RandomValueBetween1And5

WHEN 1 THEN N'New'

WHEN 2 THEN N'Open'

WHEN 3 THEN N'Closed'

WHEN 4 THEN N'Cancelled'

WHEN 5 THEN N'Archived'

END;

SET @Counter -= 1;

END;

GO

This method also works, of course, but it is also inefficient, because there are 100,000 INSERT operations instead of one.

There are more inefficient ways to solve this problem, but let’s look at an efficient one. Let’s go back to the original query with the Simple CASE statement. We can still use the same query, but we need to make sure that the expression in the Simple CASE statement is calculated only once, and then compared to the other values (1 through 5). We can achieve this by performing the calculation of the random value in a derived table, like this:

Transact-SQL

TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE RandomValueTable.RandomValue WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2 CROSS JOIN ( SELECT RandomValue = ABS (CHECKSUM (NEWID ())) % 5 + 1 ) AS RandomValueTable; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

TRUNCATE TABLE

Sales.Orders;

GO

INSERT INTO

Sales.Orders WITH (TABLOCK)

(

CustomerId ,

DateAndTime ,

OrderStatus

)

SELECT TOP (100000)

CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,

DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))

% (60 * 60 * 24 * 365 * 5) ,

SYSDATETIME ()) ,

OrderStatus = CASE RandomValueTable.RandomValue

WHEN 1 THEN N'New'

WHEN 2 THEN N'Open'

WHEN 3 THEN N'Closed'

WHEN 4 THEN N'Cancelled'

WHEN 5 THEN N'Archived'

END

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2

CROSS JOIN

(

SELECT

RandomValue = ABS (CHECKSUM (NEWID ())) % 5 + 1

)

AS

RandomValueTable;

GO

Now the random value is calculated once inside the derived table, and it is exposed to the outer query as a column. When SQL Server transforms the Simple CASE statement into a Searched CASE format, it’s still OK, because the “RandomValue” column is not calculated separately for each condition. It’s actually similar to the solution using a loop and a variable to hold the random value. Here, we use a column in a derived table to hold the random value instead of a variable.

Now we have a solution that works and is also very efficient. Here are the results:IdCustomerIdDateAndTimeOrderStatus13691112011-01-01 03:31:42Archived22043922009-05-02 09:00:00Cancelled39974232010-12-09 00:22:51Archived45423082009-12-04 03:10:26Cancelled59608222009-03-12 07:26:03Closed62167702007-05-02 21:03:30Closed713142008-07-23 22:26:21Archived88868492007-08-27 15:56:29Archived91764552009-08-31 20:43:28Archived109491952010-07-31 02:41:29New

#programming #tsql

CONTACT US

 3 Rapaport St. Kfar Saba, Israel

JOIN OUR MAILING LIST

 +972-9-7400101

  • Google+ - White Circle
  • Facebook - White Circle
  • Twitter - White Circle
  • LinkedIn - White Circle