Sometimes we want to populate a table with random data, mainly for testing purposes. For example, let’s consider the following table:
Transact-SQL
CREATE TABLE Sales.Orders ( Id INT NOT NULL IDENTITY(1,1) , CustomerId INT NOT NULL , DateAndTime DATETIME2(0) NOT NULL , OrderStatus NVARCHAR(50) NULL ); GO
1
2
3
4
5
6
7
8
9
CREATE TABLE
Sales.Orders
(
Id INT NOT NULL IDENTITY(1,1) ,
CustomerId INT NOT NULL ,
DateAndTime DATETIME2(0) NOT NULL ,
OrderStatus NVARCHAR(50) NULL
);
GO
Let’s say we want to insert 100,000 rows with random data. We want to generate random values between 1 and 1,000,000 for the “CustomerId” column, and we want to generate random date & time values in the past 5 years for the “DateAndTime” column. For now, let’s ignore the “OrderStatus” column. Here is how I would do it:
Transact-SQL
INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) , SYSDATETIME ())
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2;
GO
Notice a few things:
I’m using the TABLOCK hint in order to use minimal logging. For more information about minimally logged inserts, I recommend this article by Itzik Ben-Gan.
I’m retrieving rows from sys.all_columns cross joined to itself. It really doesn’t matter what table/s I’m using in this case. All I need is to make sure that the data set contains at least 100,000 rows. The TOP clause limits the number of rows returned to 100,000.
I’m using the NEWID function in order to generate random values. If you try to use the RAND function, it will return the same value for the whole recordset, because the RAND function is evaluated only once for the query. If you want a different value for each row (of course you do), then you need to use the NEWID function.
Here is how the table looks like (10 first rows) after the insert:IdCustomerIdDateAndTimeOrderStatus11276932009-09-30 02:18:25NULL21357642008-04-26 07:25:29NULL35163462010-04-17 13:19:24NULL45236682009-05-20 21:12:51NULL52498782010-04-14 10:43:30NULL67201862008-11-20 07:00:06NULL7807922011-03-08 02:39:24NULL8461222011-08-24 10:47:45NULL97370562009-12-15 02:14:43NULL103043842008-03-17 04:51:10NULL
Now, let’s add the “OrderStatus” column to the game. This column should contain one of the following values: “New”, “Open”, “Closed”, “Cancelled” and “Archived”. Generating random values for this column is a bit trickier. We can achieve this by using the CASE statement. But first, a few words about this statement…
The CASE statement has two flavors. One is called “Simple CASE”, and it compares a single expression to a set of simple expressions. The second is called “Searched CASE”, and it evaluates a set of general Boolean expressions.
Here is an example of a simple CASE:
Transact-SQL
CASE CustomerTypeId WHEN 1 THEN N'Regular' WHEN 2 THEN N'VIP' WHEN 3 THEN N'New' WHEN 4 THEN N'Blocked' ELSE N'Unknown' END
1
2
3
4
5
6
7
CASE CustomerTypeId
WHEN 1 THEN N'Regular'
WHEN 2 THEN N'VIP'
WHEN 3 THEN N'New'
WHEN 4 THEN N'Blocked'
ELSE N'Unknown'
END
And here is an example of a searched CASE:
Transact-SQL
CASE WHEN Price <= $100.00 THEN N'Cheap' WHEN Price > $100.00 AND ProductType = 1 THEN N'Medium' WHEN Price > $100.00 AND ProductType = 2 THEN N'Expensive' WHEN Price IS NULL THEN N'Unknown Price' END
1
2
3
4
5
6
CASE
WHEN Price <= $100.00 THEN N'Cheap'
WHEN Price > $100.00 AND ProductType = 1 THEN N'Medium'
WHEN Price > $100.00 AND ProductType = 2 THEN N'Expensive'
WHEN Price IS NULL THEN N'Unknown Price'
END
In both cases, the first condition that evaluates to TRUE determines the result. The ELSE clause is optional. If all conditions evaluate to FALSE and there is no ELSE clause, then the result is NULL.
Now, let’s go back to our example. In order to generate a random value for the “OrderStatus” column, we can generate a random value between 1 and 5, and use the simple CASE statement in order to translate these numbers into text values, like this:
Transact-SQL
TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1 WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
TRUNCATE TABLE
Sales.Orders;
GO
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) ,
SYSDATETIME ()) ,
OrderStatus = CASE ABS (CHECKSUM (NEWID ())) % 5 + 1
WHEN 1 THEN N'New'
WHEN 2 THEN N'Open'
WHEN 3 THEN N'Closed'
WHEN 4 THEN N'Cancelled'
WHEN 5 THEN N'Archived'
END
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2;
GO
Here are the results (first 10 rows):IdCustomerIdDateAndTimeOrderStatus11160622008-09-24 19:50:01NULL234692008-02-24 00:59:53Cancelled36353182009-07-23 09:44:25Open44131572010-04-27 01:14:15NULL55050982009-08-20 22:42:10NULL61139482010-01-26 21:47:57NULL79819392007-10-20 21:12:36NULL82585772007-05-20 20:59:16Open97048822007-10-01 16:13:24Cancelled106682012010-04-08 15:12:40NULL
Notice that there are many NULL values in the “OrderStatus” column. This is strange, cause the expression we used (ABS (CHECKSUM (NEWID ())) % 5 + 1) always returns a number between 1 and 5, so one of the conditions in the CASE clause should have been evaluated to TRUE. What’s going on?
Well, I’m glad you asked, because this is what I wanted to blog about this time. SQL Server doesn’t really have two flavors of the CASE statement. It’s actually two interfaces for our convenience. When we use the simple CASE statement, SQL Server translates it into the searched CASE format, because this is actually the only format that exists for the query processor. So the previous INSERT statement is translated by SQL Server to this:
Transact-SQL
TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1 THEN N'New' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2 THEN N'Open' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3 THEN N'Closed' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4 THEN N'Cancelled' WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
TRUNCATE TABLE
Sales.Orders;
GO
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) ,
SYSDATETIME ()) ,
OrderStatus = CASE
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 1
THEN N'New'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 2
THEN N'Open'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 3
THEN N'Closed'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 4
THEN N'Cancelled'
WHEN ABS (CHECKSUM (NEWID ())) % 5 + 1 = 5
THEN N'Archived'
END
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2;
GO
Now you can see that the expression ABS (CHECKSUM (NEWID ())) % 5 + 1 is calculated separately for each condition in the Searched CASE statement. For the first condition, a random value between 1 and 5 is calculated and compared to “1”. The probability of a match is 20%. If it’s FALSE, then we move one to the second condition, for which another (different) random value between 1 and 5 is calculated and, this time, compared to “2”. Again, we have a probability of 20% for a match, and so on. It is very likely that all conditions will evaluate to FALSE, and in that case the result will be NULL. Actually, the probability of getting NULL from this CASE expression is 0.85 = 32.768%. If you count the number of NULL values in the table, you’ll find that there are around 32,768 such rows (out of 100,000 rows in total).
OK, so we learned that the Simple CASE is actually a Searched CASE behind the scenes, and that for this reason, using a random expression with the Simple CASE is problematic. That’s good to know (I think), but I still want to populate the “OrderStatus” column with random values. How do I do that?
Well, I’m glad you asked again. There are several options to solve this problem. One of them is to use a small table with a single column containing the 5 text values, cross join to this table as well and sort the whole recordset randomly. Like this:
Transact-SQL
CREATE TABLE Sales.OrderStatuses ( OrderStatus NVARCHAR(50) NOT NULL ); GO INSERT INTO Sales.OrderStatuses ( OrderStatus ) VALUES (N'New') , (N'Open') , (N'Closed') , (N'Cancelled') , (N'Archived'); GO TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = OrderStatus FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2 CROSS JOIN Sales.OrderStatuses ORDER BY NEWID () ASC; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
CREATE TABLE
Sales.OrderStatuses
(
OrderStatus NVARCHAR(50) NOT NULL
);
GO
INSERT INTO
Sales.OrderStatuses
(
OrderStatus
)
VALUES
(N'New') ,
(N'Open') ,
(N'Closed') ,
(N'Cancelled') ,
(N'Archived');
GO
TRUNCATE TABLE
Sales.Orders;
GO
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) ,
SYSDATETIME ()) ,
OrderStatus = OrderStatus
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2
CROSS JOIN
Sales.OrderStatuses
ORDER BY
NEWID () ASC;
GO
This method works, but it’s very inefficient, because the query processor has to complete all the join operations first, then calculate the NEWID function for every row, then sort the whole recordset according to the values of NEWID, and only then retrieve the top 100,000 rows. Since the recordset is huge (around 134,000,000 in my case), the sort operation is going to take forever. You can try to replace sys.all_columns with another table that has fewer rows in order to get a recordset of a size close to 100,000, but it’s still not going to be an easy task.
Another solution to the problem is to insert the 100,000 rows one by one in a loop. Inside the loop we can calculate the random value once and put it in a variable, and then we can use the Simple CASE statement with the variable. Like this:
Transact-SQL
DECLARE @Counter AS INT = 100000 , @RandomValueBetween1And5 AS TINYINT; TRUNCATE TABLE Sales.Orders; WHILE @Counter > 0 BEGIN SET @RandomValueBetween1And5 = ABS (CHECKSUM (NEWID ())) % 5 + 1; INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE @RandomValueBetween1And5 WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END; SET @Counter -= 1; END; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
DECLARE
@Counter AS INT = 100000 ,
@RandomValueBetween1And5 AS TINYINT;
TRUNCATE TABLE
Sales.Orders;
WHILE
@Counter > 0
BEGIN
SET @RandomValueBetween1And5 =
ABS (CHECKSUM (NEWID ())) % 5 + 1;
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM
(NEWID ())) % (60 * 60 * 24 * 365 * 5)
, SYSDATETIME ()) ,
OrderStatus = CASE @RandomValueBetween1And5
WHEN 1 THEN N'New'
WHEN 2 THEN N'Open'
WHEN 3 THEN N'Closed'
WHEN 4 THEN N'Cancelled'
WHEN 5 THEN N'Archived'
END;
SET @Counter -= 1;
END;
GO
This method also works, of course, but it is also inefficient, because there are 100,000 INSERT operations instead of one.
There are more inefficient ways to solve this problem, but let’s look at an efficient one. Let’s go back to the original query with the Simple CASE statement. We can still use the same query, but we need to make sure that the expression in the Simple CASE statement is calculated only once, and then compared to the other values (1 through 5). We can achieve this by performing the calculation of the random value in a derived table, like this:
Transact-SQL
TRUNCATE TABLE Sales.Orders; GO INSERT INTO Sales.Orders WITH (TABLOCK) ( CustomerId , DateAndTime , OrderStatus ) SELECT TOP (100000) CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 , DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ())) % (60 * 60 * 24 * 365 * 5) , SYSDATETIME ()) , OrderStatus = CASE RandomValueTable.RandomValue WHEN 1 THEN N'New' WHEN 2 THEN N'Open' WHEN 3 THEN N'Closed' WHEN 4 THEN N'Cancelled' WHEN 5 THEN N'Archived' END FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2 CROSS JOIN ( SELECT RandomValue = ABS (CHECKSUM (NEWID ())) % 5 + 1 ) AS RandomValueTable; GO
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
TRUNCATE TABLE
Sales.Orders;
GO
INSERT INTO
Sales.Orders WITH (TABLOCK)
(
CustomerId ,
DateAndTime ,
OrderStatus
)
SELECT TOP (100000)
CustomerId = ABS (CHECKSUM (NEWID ())) % 1000000 + 1 ,
DateAndTime = DATEADD (SECOND , - ABS (CHECKSUM (NEWID ()))
% (60 * 60 * 24 * 365 * 5) ,
SYSDATETIME ()) ,
OrderStatus = CASE RandomValueTable.RandomValue
WHEN 1 THEN N'New'
WHEN 2 THEN N'Open'
WHEN 3 THEN N'Closed'
WHEN 4 THEN N'Cancelled'
WHEN 5 THEN N'Archived'
END
FROM
sys.all_columns AS T1
CROSS JOIN
sys.all_columns AS T2
CROSS JOIN
(
SELECT
RandomValue = ABS (CHECKSUM (NEWID ())) % 5 + 1
)
AS
RandomValueTable;
GO
Now the random value is calculated once inside the derived table, and it is exposed to the outer query as a column. When SQL Server transforms the Simple CASE statement into a Searched CASE format, it’s still OK, because the “RandomValue” column is not calculated separately for each condition. It’s actually similar to the solution using a loop and a variable to hold the random value. Here, we use a column in a derived table to hold the random value instead of a variable.
Now we have a solution that works and is also very efficient. Here are the results:IdCustomerIdDateAndTimeOrderStatus13691112011-01-01 03:31:42Archived22043922009-05-02 09:00:00Cancelled39974232010-12-09 00:22:51Archived45423082009-12-04 03:10:26Cancelled59608222009-03-12 07:26:03Closed62167702007-05-02 21:03:30Closed713142008-07-23 22:26:21Archived88868492007-08-27 15:56:29Archived91764552009-08-31 20:43:28Archived109491952010-07-31 02:41:29New
Comentarios