LOOP, HASH and MERGE Join Types
Today I’ll talk about the available JOIN operator types in SQL Server (Nested Loops, Hash and Merge Joins), their differences, best practices and complexity.
For the samples in this post, we’ll use the free AdventureWorks database sample available here: http://msftdbprodsamples.codeplex.com/releases/view/4004
Introduction: What are Join Operators?
A join operator is a type of an algorithm which the SQL Server Optimizer chooses in order to implement logical joins between two sets of data.
The SQL Server Optimizer may choose a different algorithm for different scenarios based on the requested query, available indexes, statistics and number of estimated rows in each data set.
It’s possible to find the operator which was used by looking at the execution plan that SQL Server has prepared for your query.
For more information on execution plans and how to read them, I recommend checking out the first chapter out of Grant Fritchey’s excellent book: http://www.simple-talk.com/sql/performance/execution-plan-basics/
“Nested Loops” is the simplest operator of the bunch.
We’ll take the following query as an example, which gets some order detail columns for orders placed during July 2001 (assuming the OrderDate column only includes dates without time):
SELECT OH.OrderDate, OD.OrderQty, OD.ProductID, OD.UnitPrice FROM Sales.SalesOrderHeader AS OH JOIN Sales.SalesOrderDetail AS OD ON OH.SalesOrderID = OD.SalesOrderID WHERE OH.OrderDate BETWEEN '2001-07-01' AND '2001-07-31'
OH.OrderDate, OD.OrderQty, OD.ProductID, OD.UnitPrice
Sales.SalesOrderHeader AS OH
Sales.SalesOrderDetail AS OD
OH.SalesOrderID = OD.SalesOrderID
OH.OrderDate BETWEEN '2001-07-01' AND '2001-07-31'
The resulting execution plan looks like this:
The operator on the top right is called the outer input and the one just below it is called the inner input.
What the “Nested Loops” operator basically does is: For each record from the outer input – find matching rows from the inner input.
Technically, this means that the clustered index scan you see as the outer input is executed once to get all the relevant records, and the clustered index seek you see below it is executed for each record from the outer input.
We’ll verify this information by placing the cursor over the Clustered Index Scan operator and looking at the tooltip:
We can see that the estimated number of executions is 1. We’ll look at the tooltip of the Clustered Index Seek:
This time we can see that the estimated number of executions is 179 which is the approximate number of rows returned from the outer input.
In terms of complexity (assume N is the number of rows from the outer output and M is the total number of rows in the SalesOrderDetail table): The complexity of this query is: O(NlogM) where “logM” is the complexity of each seek in the inner input.
The SQL Server Optimizer will prefer to choose this operator type when the outer input is small and the inner input has an index on the column(s) by which the two data sets are joined. The bigger the difference in number of rows between the outer and inner inputs, the more benefit this operator will provide over the other operator types.
Having indexes and up-to-date statistics is crucial for this join type, because you don’t want SQL Server to accidently think there’s a small number of rows in one of the inputs when in fact there are a whole lot. For example: Performing 10 times index seek is nothing like performing 100,000 times index seek, especially if the table size of the inner input is around 120,000 and you’d be better off doing one table scan instead.
The “Merge” algorithm is the most efficient way to join between two very large sets of data which are both sorted on the join key.
We’ll use the following query as an example (which returns a list of customers and their sale order identifiers):
SELECT OC.CustomerID, OH.SalesOrderID FROM Sales.SalesOrderHeader AS OH JOIN Sales.Customer AS OC ON OH.CustomerID = OC.CustomerID
Sales.SalesOrderHeader AS OH
Sales.Customer AS OC
OH.CustomerID = OC.CustomerID
The execution plan for this query is:
First, we’ll notice that both of the data sets are sorted on the CustomerID column: The Customer table because that’s the clustered primary key, and the SalesOrderHeader table because there’s a nonclustered index on the CustomerID column.
By the width of the arrows between the operators (and by placing the cursor over them), we can see that the number of rows returned from each set is rather large.
In addition, we used the equality operator (=) in the ON clause (the Merge join requires an equality operator).
These three factors cause SQL Server Optimizer to choose the Merge Join for this query.
The biggest performance gain from this join type is that both input operators are executed only once. We can verify this by placing the cursor over them and see that the number of executions is 1 for both of the operators. Also, the algorithm itself is very efficient:
The Merge Join simultaneously reads a row from each input and compares them using the join key. If there’s a match, they are returned. Otherwise, the row with the smaller value can be discarded because, since both inputs are sorted, the discarded row will not match any other row on the other set of data.
This repeats until one of the tables is completed. Even if there are still rows on the other table, they will clearly not match any rows on the fully-scanned table, so there is no need to continue. Since both tables can potentially be scanned, the maximum cost of a Merge Join is the sum of both inputs. Or in terms of complexity: O(N+M)
If the inputs are not both sorted on the join key, the SQL Server Optimizer will most likely not choose the Merge join type and instead prefer the Hash join (will be explained soon). However if it does anyway (whether it’s because it was forced to due to a join hint, or because it was still the most efficient), then SQL will need to sort the table which is not already sorted on the join key.
The “Hash” join type is what I call “the go-to guy” of the join operators. It’s the one operator chosen when the scenario doesn’t favor in any of the other join types. This happens when the tables are not properly sorted, and/or there are no indexes. When SQL Server Optimizer chooses the Hash join type, it’s usually a bad sign because something probably could’ve been done better (for example, adding an index). However, in some cases (complex queries mostly), there’s simply no other way.
We’ll use the following query as an example (which gets the first and last names of contacts starting with “John” and their sales order identifiers):