Updated: Jan 26, 2020
Sometimes you need to compare data between two tables with many rows and columns. There are several ways to achieve this. Here is one of them...
One of our customers has recently asked me the following question (I changed just the business case):
“I have a “MemberSessions” table, which stores web sessions made by members. The table contains the “MemberId” column, as well as other columns that describe the session, such as “LoginDateTime”. I would like to find all members with the exact same sessions.”
Let’s create the table and populate it with some test data:
CREATE SCHEMA Operation; GO CREATE TABLE Operation.MemberSessions ( Id INT IDENTITY (1,1) NOT NULL , MemberId INT NOT NULL , LoginDateTime DATETIME2(0) NOT NULL , EndDateTime DATETIME2(0) NULL , EndReasonId TINYINT NULL , CONSTRAINT pk_MemberSessions_c_Id PRIMARY KEY CLUSTERED (Id ASC) ); GO INSERT INTO Operation.MemberSessions ( MemberId , LoginDateTime , EndDateTime , EndReasonId ) VALUES (1 , '2015-12-03 13:01:10' , '2015-12-03 15:50:53' , 1) , (1 , '2015-12-08 12:47:53' , '2015-12-08 13:11:43' , 1) , (2 , '2015-12-03 13:01:10' , '2015-12-03 15:50:53' , 1) , (2 , '2015-12-08 12:47:53' , '2015-12-08 13:11:43' , 1) , (3 , '2015-12-14 09:45:01' , '2015-12-14 09:48:52' , 1) , (3 , '2015-12-15 10:33:41' , '2015-12-15 11:02:17' , 2) , (4 , '2012-08-09 11:17:05' , '2012-08-09 15:02:07' , 1) , (5 , '2012-11-23 19:51:44' , '2012-11-23 20:15:16' , 1) , (5 , '2014-09-28 04:10:26' , '2014-09-28 06:00:37' , 3) , (5 , '2006-10-29 17:10:47' , '2006-10-29 20:33:40' , 2); GO
In this example, there are 5 members. Notice that members #1 and #2 have the exact same sessions with all data exactly the same (LoginDateTime, EndDateTime and EndReasonId). All other members have different sessions.
So the customer wanted to get the following result:
Where “MemberGroup” is an incremental number that identifies different groups of similar members. If there was another group of similar members (in terms of session data), then they would have the same member group value (2), and so on…
There are several ways to try and solve this problem. For example, we could join between the table to itself and find all pairs of sessions that match. Then we can aggregate the data per member and join between members. But this is a very complicated and inefficient solution.
A much simpler solution is to use hash values. We can calculate a hash value (a single integer value) over all the sessions of each member, and then we can compare between the hash values at the member level. Instead of dealing with all the rows and columns in the “MemberSessions” table, we only need to deal with a single hash value per member. This value represents all the sessions of that member in a single integer value.
In order to achieve that, we are going to use the CHECKSUM_AGG function, which returns the checksum (hash value) of all the values in a column. For example, if you run the following query, it will calculate the checksum of all the values in the “EndReasonId” column:
SELECT ChecksumValue = CHECKSUM_AGG (EndReasonId) FROM Operation.MemberSessions; GO
And the result is:
Msg 8117, Level 16, State 1, Line 51 Operand data type tinyint is invalid for checksum_agg operator.
Oops! I forgot to mention… The CHECKSUM_AGG function accepts only integer expressions.
So we can, of course, do this:
SELECT ChecksumValue = CHECKSUM_AGG (CAST (EndReasonId AS INT)) FROM Operation.MemberSessions; GO
And now the result is: 2.
This is a hash value representing the entire “EndReasonId” column.
CHECKSUM_AGG is an aggregate function, so we can add a GROUP BY clause and calculate the hash value for each member, like this:
SELECT MemberId = MemberId , ChecksumValue = CHECKSUM_AGG (CAST (EndReasonId AS INT)) FROM Operation.MemberSessions GROUP BY MemberId ORDER BY MemberId ASC; GO
Here is the result:
First, we can see that members #1 and #2 have the same hash value, which is what we expected, because they have the same sessions data. Each one has two rows with the values “1” in column “EndReasonId”. We can also see that members #3 and #4 have different hash values, because they have different sets of sessions. But member #5, who also has a different set of sessions, gets the same hash value (0) as members #1 and #2. What is going on?
This is an important aspect of hash functions. When a hash function is applied on the same list of values, it will always produce the same result (like the case of members #1 and #2), but it also might return the same result for a different list of values (like member #5). This is statistically rare, but it can happen. And as the range of possible input values gets larger, the probability of getting the same hash values for different inputs gets lower. Since in our case we’re calculating a hash value over a single column of the TINYINT data type (the range of values is 0-255) with a small number of rows, then the probability is relatively high.
But this is not even the main problem we have. The main problem is that so far we only calculated a hash value per member based on the “EndReasonId” column. But we also want to take into account the “LoginDateTime” and the “EndDateTime” columns. The problem is that the CHECKSUM_AGG function only accepts a single expression or column name.
Luckily, the CHECKSUM_AGG function has a cousin. Her name is CHECKSUM. This function is not an aggregate function, like its relative, CHECKSUM_AGG. It’s a scalar function that accepts multiple expressions of various data types, and calculates a single hash value (of data type INT). So we are going to use this function in order to calculate a hash value for each row in the “MemberSessions” table, and then we are going to calculate a hash value for each member by aggregating the row-level hash values using the CHECKSUM_AGG function.
The CHECKSUM function has two properties that are exactly what we need:
It accepts multiple values, so we can calculate a hash value over the 3 columns – “LoginDateTime”, “EndDateTime” and “EndReasonId”.
It accepts any data type, except TEXT, NTEXT, IMAGE and XML. This solves the need to cast every value to INT, as with the CHECKSUM_AGG function. The result of the CHECKSUM function is always INT, and we are going to use that value as input to the CHECKSUM_AGG function, so we’ve got the data types issue covered.
Here is the code:
SELECT MemberId = MemberId , ChecksumValue = CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId)) FROM Operation.MemberSessions GROUP BY MemberId ORDER BY MemberId ASC; GO
And here is the result:
Now, these checksum values represent the entire sessions data for each member. Again, we can see that members #1 and #2 have the same values as expected. But we can also see that members #4 and #5 have the same checksum values, although they have completely different sets of sessions. Again, this is because of the properties of hash functions. It’s a statistical issue. Now, you might think that it is a common thing based on these examples, but it’s actually quite rare. I carefully picked the values for members #4 and #5 in order to reproduce this behavior for the sake of this demo. In reality, it’s quite rare to get the same checksum value for different sets of inputs.
Don't rely on a hash function to produce distinct values for distinct inputs
I wanted to demonstrate this behavior, because it’s important to understand that you cannot rely on a hash function to produce distinct values for distinct inputs. If it is critical to achieve accurate results, then you need to implement a second step in order to filter out those statistical mistakes. One option is to manually inspect each group of members to make sure their sessions data is indeed identical. Another option is to run a second step over the results of the first step, which will compare all values and filter out the statistical mistakes. Such a comparison is expensive, but it only has to be applied to a very small set of member groups, because the hashing algorithm has already ruled out most members.
OK, back to our story. We’re almost done. Now we need to return only the groups of identical members, and we also want to rank the groups and assign numbers to them.
Here is the code:
WITH MembersWithChecksum ( MemberId , ChecksumValue , ChecksumCount ) AS ( SELECT MemberId = MemberId , ChecksumValue = CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId)) , ChecksumCount = COUNT (*) OVER (PARTITION BY CHECKSUM_AGG (CHECKSUM (LoginDateTime , EndDateTime , EndReasonId))) FROM Operation.MemberSessions GROUP BY MemberId ) SELECT MemberGroup = DENSE_RANK () OVER (ORDER BY ChecksumValue ASC) , MemberId = MemberId FROM MembersWithChecksum WHERE ChecksumCount > 1 ORDER BY MemberGroup ASC , MemberId ASC; GO
And here is the result:
We can examine members #1 and #2, and verify that their sessions are indeed identical. We can also examine members #4 and #5, and realize that they are a statistical mistake. But you can be assured that if there was another group of identical members, you would have known about it.
This blog was initially published at our Linkedin blog