• Madeira Team

How to Compare Tables in SQL Server

One of our customers has recently asked me the following question (I changed just the business case):

The Problem

“I have a “MemberSessions” table, which stores web sessions made by members. The table contains the “MemberId” column, as well as other columns that describe the session, such as “LoginDateTime”. I would like to find all members with the exact same sessions.”

Let’s create the table and populate it with some test data:

Transact-SQL

CREATE SCHEMA Operation; GO CREATE TABLE Operation.MemberSessions ( Id INT IDENTITY (1,1) NOT NULL , MemberId INT NOT NULL , LoginDateTime DATETIME2(0) NOT NULL , EndDateTime DATETIME2(0) NULL , EndReasonId TINYINT NULL , CONSTRAINT pk_MemberSessions_c_Id PRIMARY KEY CLUSTERED (Id ASC) ); GO INSERT INTO Operation.MemberSessions ( MemberId , LoginDateTime , EndDateTime , EndReasonId ) VALUES (1 , '2015-12-03 13:01:10' , '2015-12-03 15:50:53' , 1) , (1 , '2015-12-08 12:47:53' , '2015-12-08 13:11:43' , 1) , (2 , '2015-12-03 13:01:10' , '2015-12-03 15:50:53' , 1) , (2 , '2015-12-08 12:47:53' , '2015-12-08 13:11:43' , 1) , (3 , '2015-12-14 09:45:01' , '2015-12-14 09:48:52' , 1) , (3 , '2015-12-15 10:33:41' , '2015-12-15 11:02:17' , 2) , (4 , '2012-08-09 11:17:05' , '2012-08-09 15:02:07' , 1) , (5 , '2012-11-23 19:51:44' , '2012-11-23 20:15:16' , 1) , (5 , '2014-09-28 04:10:26' , '2014-09-28 06:00:37' , 3) , (5 , '2006-10-29 17:10:47' , '2006-10-29 20:33:40' , 2); GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

39

40

41

42

CREATE SCHEMA

Operation;

GO

CREATE TABLE

Operation.MemberSessions

(

Id INT IDENTITY (1,1) NOT NULL ,

MemberId INT NOT NULL ,

LoginDateTime DATETIME2(0) NOT NULL ,

EndDateTime DATETIME2(0) NULL ,

EndReasonId TINYINT NULL ,

CONSTRAINT

pk_MemberSessions_c_Id

PRIMARY KEY CLUSTERED

(Id ASC)

);

GO

INSERT INTO

Operation.MemberSessions

(

MemberId ,

LoginDateTime ,

EndDateTime ,

EndReasonId

)

VALUES

(1 , '2015-12-03 13:01:10' , '2015-12-03 15:50:53' , 1) ,

(1 , '2015-12-08 12:47:53' , '2015-12-08 13:11:43' , 1) ,

(2 , '2015-12-03 13:01:10' , '2015-12-03 15:50:53' , 1) ,

(2 , '2015-12-08 12:47:53' , '2015-12-08 13:11:43' , 1) ,

(3 , '2015-12-14 09:45:01' , '2015-12-14 09:48:52' , 1) ,

(3 , '2015-12-15 10:33:41' , '2015-12-15 11:02:17' , 2) ,

(4 , '2012-08-09 11:17:05' , '2012-08-09 15:02:07' , 1) ,

(5 , '2012-11-23 19:51:44' , '2012-11-23 20:15:16' , 1) ,

(5 , '2014-09-28 04:10:26' , '2014-09-28 06:00:37' , 3) ,

(5 , '2006-10-29 17:10:47' , '2006-10-29 20:33:40' , 2);

GO

In this example, there are 5 members. Notice that members #1 and #2 have the exact same sessions with all data exactly the same (LoginDateTime, EndDateTime and EndReasonId). All other members have different sessions.

So the customer wanted to get the following result:MemberGroupMemberId1112

Where “MemberGroup” is an incremental number that identifies different groups of similar members. If there was another group of similar members (in terms of session data), then they would have the same member group value (2), and so on…

The Solution

There are several ways to try and solve this problem. For example, we could join between the table to itself and find all pairs of sessions that match. Then we can aggregate the data per member and join between members. But this is a very complicated and inefficient solution.

A much simpler solution is to use hash values. We can calculate a hash value (a single integer value) over all the sessions of each member, and then we can compare between the hash values at the member level. Instead of dealing with all the rows and columns in the “MemberSessions” table, we only need to deal with a single hash value per member. This value represents all the sessions of that member in a single integer value.

In order to achieve that, we are going to use the CHECKSUM_AGG function, which returns the checksum (hash value) of all the values in a column. For example, if you run the following query, it will calculate the checksum of all the values in the “EndReasonId” column:

Transact-SQL

SELECT ChecksumValue = CHECKSUM_AGG (EndReasonId) FROM Operation.MemberSessions; GO

1

2

3

4

5

SELECT

ChecksumValue = CHECKSUM_AGG (EndReasonId)

FROM