
LOOP, HASH and MERGE Join Types

Written By: Eitan Blumin 05/01/2012

Today I’ll talk about the available JOIN operator types in SQL Server (Nested Loops, Hash and Merge Joins), their differences, best practices and complexity.

For the samples in this post, we’ll use the free AdventureWorks database sample available here: http://msftdbprodsamples.codeplex.com/releases/view/4004

Introduction: What are Join Operators?

A join operator is the physical algorithm that the SQL Server Optimizer chooses in order to implement a logical join between two sets of data.

The SQL Server Optimizer may choose a different algorithm for different scenarios based on the requested query, the available indexes, the statistics, and the estimated number of rows in each data set.

It’s possible to find the operator which was used by looking at the execution plan that SQL Server has prepared for your query.

For more information on execution plans and how to read them, I recommend checking out the first chapter of Grant Fritchey’s excellent book:
http://www.simple-talk.com/sql/performance/execution-plan-basics/

NESTED LOOPS

“Nested Loops” is the simplest operator of the bunch.

We’ll take the following query as an example, which gets some order detail columns for orders placed during July 2001 (assuming the OrderDate column only includes dates without time):
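
(The query below is a sketch; the exact SELECT list is illustrative, while the join on SalesOrderID and the July 2001 date range are what drive the plan discussed next.)

SELECT      SOD.SalesOrderID,
            SOD.ProductID,
            SOD.OrderQty,
            SOD.LineTotal
FROM        Sales.SalesOrderHeader AS SOH
INNER JOIN  Sales.SalesOrderDetail AS SOD
        ON  SOH.SalesOrderID = SOD.SalesOrderID
WHERE       SOH.OrderDate BETWEEN '2001-07-01' AND '2001-07-31';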

The resulting execution plan looks like this:

[Execution plan: a Nested Loops join whose outer input is a Clustered Index Scan and whose inner input is a Clustered Index Seek]

The operator on the top right is called the outer input and the one just below it is called the inner input.

What the “Nested Loops” operator basically does is this: for each record from the outer input, find the matching rows in the inner input.

Technically, this means that the clustered index scan you see as the outer input is executed once to get all the relevant records, and the clustered index seek you see below it is executed for each record from the outer input.

We’ll verify this information by placing the cursor over the Clustered Index Scan operator and looking at the tooltip:

We can see that the estimated number of executions is 1. We’ll look at the tooltip of the Clustered Index Seek:

[Tooltip of the Clustered Index Seek operator]

This time we can see that the estimated number of executions is 179, which is the approximate number of rows returned from the outer input.

In terms of complexity (where N is the number of rows from the outer input and M is the total number of rows in the SalesOrderDetail table), the complexity of this query is O(NlogM), where logM is the cost of each seek in the inner input.

The SQL Server Optimizer will prefer to choose this operator type when the outer input is small and the inner input has an index on the column(s) by which the two data sets are joined. The bigger the difference in number of rows between the outer and inner inputs, the more benefit this operator will provide over the other operator types.

Having indexes and up-to-date statistics is crucial for this join type, because you don’t want SQL Server to accidentally think there’s a small number of rows in one of the inputs when in fact there are a whole lot. For example: performing an index seek 10 times is nothing like performing it 100,000 times, especially if the inner input’s table holds around 120,000 rows and you’d be better off doing a single table scan instead.

MERGE Join

The “Merge” algorithm is the most efficient way to join between two very large sets of data which are both sorted on the join key.

We’ll use the following query as an example (which returns a list of customers and their sale order identifiers):
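
(Again, the query below is a sketch; the exact SELECT list is illustrative, and the join on CustomerID is what matters for the plan.)

SELECT      C.CustomerID,
            SOH.SalesOrderID
FROM        Sales.Customer AS C
INNER JOIN  Sales.SalesOrderHeader AS SOH
        ON  C.CustomerID = SOH.CustomerID;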

The execution plan for this query is:

  • First, we’ll notice that both data sets are sorted on the CustomerID column: the Customer table because CustomerID is its clustered primary key, and the SalesOrderHeader table because it has a nonclustered index on the CustomerID column.
  • By the width of the arrows between the operators (and by placing the cursor over them), we can see that the number of rows returned from each set is rather large.
  • In addition, we used the equality operator (=) in the ON clause (the Merge join requires an equality operator).

These three factors cause the SQL Server Optimizer to choose the Merge join for this query.

The biggest performance gain from this join type is that both input operators are executed only once. We can verify this by placing the cursor over them and seeing that the number of executions is 1 for both operators. Also, the algorithm itself is very efficient:

The Merge Join simultaneously reads a row from each input and compares them using the join key. If there’s a match, they are returned. Otherwise, the row with the smaller value can be discarded because, since both inputs are sorted, the discarded row will not match any other row on the other set of data.

This repeats until one of the tables is completed. Even if there are still rows on the other table, they will clearly not match any rows on the fully-scanned table, so there is no need to continue. Since both tables can potentially be scanned, the maximum cost of a Merge Join is the sum of both inputs. Or in terms of complexity: O(N+M)

If the inputs are not both sorted on the join key, the SQL Server Optimizer will most likely not choose the Merge join type and will prefer the Hash join instead (explained below). However, if it chooses the Merge join anyway (whether because a join hint forced it to, or because it was still the most efficient option), SQL Server will first need to sort whichever input is not already sorted on the join key.

HASH Match

The “Hash” join type is what I call “the go-to guy” of the join operators. It’s the operator chosen when the scenario doesn’t favor any of the other join types. This happens when the tables are not properly sorted and/or there are no indexes. When the SQL Server Optimizer chooses the Hash join type, it’s usually a bad sign, because something probably could’ve been done better (for example, adding an index). However, in some cases (mostly complex queries), there’s simply no other way.

We’ll use the following query as an example (which gets the first and last names of contacts starting with “John” and their sales order identifiers):
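
(The query below is a sketch; the exact SELECT list and the FirstName filter are illustrative, while the join on ContactID, which has no supporting index in SalesOrderHeader, is what matters for the plan.)

SELECT      C.FirstName,
            C.LastName,
            SOH.SalesOrderID
FROM        Person.Contact AS C
INNER JOIN  Sales.SalesOrderHeader AS SOH
        ON  C.ContactID = SOH.ContactID
WHERE       C.FirstName LIKE 'John%';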

The execution plan looks like this:

Because there’s no index on the ContactID column in the SalesOrderHeader table, SQL Server chooses the Hash join type.

Before diving into the example, I’ll first explain two important concepts: A ”Hashing” function and a “Hash Table”.

“Hashing” is a programmatic function which takes one or more values and converts them to a single symbolic value (usually numeric). The function is usually one-way, meaning you can’t convert the symbolic value back to its original value(s), but it’s deterministic, meaning that if you provide the same value as input, you will always get the same symbolic value as output. Also, it’s possible for several different inputs to produce the same output hash value (meaning the function isn’t necessarily unique).
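
As a quick illustration of the deterministic property, T-SQL’s built-in CHECKSUM function returns the same value every time it is given the same input (CHECKSUM is used here purely as a familiar example of a hash function; it is not the internal hash function the Hash Match operator uses):

SELECT  CHECKSUM(N'John')   AS HashValue1,    -- same input...
        CHECKSUM(N'John')   AS HashValue2,    -- ...same hash value
        CHECKSUM(N'Johnny') AS HashValue3;    -- different input, usually a different value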

A “Hash Table” is a data structure that divides all rows into equal-sized “buckets”, where each “bucket” is represented by a hash value. This means that when you activate the hash function on some row, using the result you’ll know immediately to which bucket it belongs.

As with the Merge join, the two input operators are executed only once. We can verify this by looking at the tooltips of the input operators.

Using available statistics, SQL Server will choose the smaller of the two inputs to serve as the build input and it will be the one used to build the hash table in memory. If there’s not enough memory for the hash table, SQL Server will use physical disk space in TEMPDB. After the hash table is built, SQL Server will get the data from the larger table, called the probe input, compare it to the hash table using a hash match function, and return any matched rows. In graphical execution plans, the build input will always be the one on top, and the probe input will be the one below.

As long as the smaller table is very small, this algorithm can be very efficient. But if both tables are very large, this can be a very costly execution plan.

The SQL Server Optimizer uses statistics to figure out the cardinality of the values. Using that information, it dynamically creates a hash function which divides the data into buckets that are as equal in size as possible.

Because it’s so dynamic, it’s difficult to estimate the complexity of the creation of the hash table and the complexity of each hash match. Because SQL Server Optimizer performs this dynamic operation during execution time and not during compilation time, sometimes the values you see in the execution plan are incorrect. In some cases, you could compare a hash join and a nested loops join, see that according to the execution plans the nested loop is more expensive (in terms of logical reads etc.), when in fact the hash join executes much slower (because its cost estimation is incorrect).

For our complexity terms, we’ll assume hc is the complexity of the hash table creation, and hm is the complexity of the hash match function. Therefore, the complexity of the Hash join will be O(N*hc + M*hm + J) where N is the smaller data set, M is the larger data set and J is a “joker” complexity addition for the dynamic calculation and creation of the hash function.

Microsoft has an interesting page in Books Online that describes further Hash Join sub-types and additional aspects about them. I highly recommend you read it: http://msdn.microsoft.com/en-us/library/ms189313.aspx

Query Hints

Using Query Hints, it’s possible to force SQL Server to use specific join types. However, doing so is highly discouraged, especially in production environments, because the same join type may not remain the best choice forever (data changes over time), and the SQL Server Optimizer usually gets it right (assuming statistics are up to date).

I’ll talk about them just so you can experiment with the different join types on your test environment and see the differences.

To force SQL Server to use specific join types using query hints, you add the OPTION clause at the end of the query, and use the keywords LOOP JOIN, MERGE JOIN or HASH JOIN.

Try executing the above queries with different join hints and see what happens:
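
For example, appending the hint to the Nested Loops query sketched earlier forces the Hash algorithm (swap in LOOP JOIN or MERGE JOIN to compare the resulting plans):

SELECT      SOD.SalesOrderID,
            SOD.ProductID,
            SOD.OrderQty,
            SOD.LineTotal
FROM        Sales.SalesOrderHeader AS SOH
INNER JOIN  Sales.SalesOrderDetail AS SOD
        ON  SOH.SalesOrderID = SOD.SalesOrderID
WHERE       SOH.OrderDate BETWEEN '2001-07-01' AND '2001-07-31'
OPTION (HASH JOIN);   -- also try OPTION (LOOP JOIN) and OPTION (MERGE JOIN)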

Summary

Nested Loops

  • Complexity: O(NlogM)
  • Usually used when one table is significantly smaller than the other
  • The larger table has an index which allows seeking it using the join key

Merge Join

  • Complexity: O(N+M)
  • Both inputs are sorted on the join key
  • An equality operator is used
  • Excellent for very large tables

Hash Match

  • Complexity: O(N*hc+M*hm+J)
  • Last-resort join type
  • Uses a hash table and a dynamic hash match function to match rows


29 responses to “LOOP, HASH and MERGE Join Types”

  1. Syed Sami Ur Rehman says:

    Good one, Eitan.
    It helped me a lot. Thanks for taking the time to write such a good article.

  2. Deepak Kumar says:

    Really nice one. Must Read.
    Thanks for sharing it.

  3. Susie says:

    Thank you, Eitan. You have nailed the key difference on three join algorithms. Wonderful work.

  4. Joseph says:

    This is the best explanation I have read

  5. Dmitriy Zaretsky says:

    Eitan, thanks for the good job!
    Very good article with a clear explanation.

  6. dharmesh says:

    Curious about the Hash join example: why does it need to build a hash table for the Person table (the smaller input) when there is already a primary key on ContactID on that table? Shouldn’t it just do an index seek?
    Also, our WHERE clause is on the Person table (FirstName), so why does it need to probe everything from the order table when it only needs a subset?

    • Eitan Blumin says:

      Hello dharmesh and thank you for your comment.

      In regards to your question regarding Hash table for the Person table:
      An index seek is actually more expensive row-for-row than a hash match operation.
      An index seek is dependent on the number of rows in the table. That’s why its complexity is about NlogM, where M is the total number of rows in the table you seek, and N is the number of times the seek is performed (i.e. the number of rows in the outer input).
      A hash match, however, is a fixed calculation (the hash function) plus a direct pointer to where the hash bucket is located in memory. The complexity of the hash match operation is O(1). In other words, what I called “Hm” is actually a constant.

      However, the additional price of the hash join is the construction of the hash table beforehand, because this is what SQL Server needs to do in order to allow that O(1) constant-time lookup using direct memory pointers. What I called “Hc” is also a constant.

      So the total complexity of the hash match algorithm is basically the number of times the hash match is performed (i.e. N in our case), plus the construction of the hash table (i.e. M in our case): O(N+M)

      So when SQL Server Optimizer compares between O(N+M) and O(NlogM), it’s very possible that O(N+M) will be smaller, and therefore, the faster choice.

      As for your other question, I’m not sure what you meant.
      The JOIN operator is actually what DEFINES and CREATES the “subset” of rows to be returned.
      Before SQL Server begins the JOIN algorithm, it doesn’t actually know which subset it needs. That’s what the join algorithm is for.

      Please let me know whether I answered your questions to your satisfaction, dharmesh.
      And if needed, I’ll do my best to clarify whatever needs clarification.

  7. osman says:

    It’s a good article, thanks.

  8. srikar says:

    Very informative article.
    Thanks a lot.

  9. srikar says:

    hi,

    One quick question.
    Are these joins used by sql server only when there is a join operator in our query?
    Please assist.

    Thanks in advance.

    • Eitan Blumin says:

      Hello Srikar and thanks for your question.

      The answer is no, join algorithms are not used only when the “join” operator is used within SQL Server.
      There are also other ways to cause SQL Server to use the join algorithms, such as:
      IN / NOT IN
      EXISTS / NOT EXISTS
      CROSS / OUTER APPLY
      Correlated Sub-Queries
      And implicit join queries as well (where you separate the tables in the FROM clause using a comma (,) and add joining conditions within the WHERE clause).

      In other words: Any sort of query that needs to do any sort of joining or comparison between two or more tables.

  10. srikar says:

    ” In some cases, you could compare a hash join and a nested loops join, see that according to the execution plans the nested loop is more expensive (in terms of logical reads etc.), when in fact the hash join executes much slower (because its cost estimation is incorrect).

    The above are picked from your description on hash join.

    You mentioned that the hash join will be selected when there are no indexes on the columns used in the query.
    But for the optimizer to select a nested loops join, we need indexes on the columns used in the query, right?
    How can we compare the hash join and the nested loops join? Please assist.

    Thanks in advance.

    • Eitan Blumin says:

      Hi Srikar,

      I’m not sure what you mean in your question.
      What do you mean when you say “compare hash join and nested loop”?
      I can only guess that you wish to compare the performance of these two algorithms on the same query, perhaps?

      If so then you can use the Query Hints which I mention in the last part of this blog post. You can use them to force your query to use a specific join algorithm and then take a look at the execution plans and compare them.

      You can also use Join Hints which are specified per JOIN clause. More info here:
      https://msdn.microsoft.com/en-us/library/ms173815.aspx

  11. Kevin says:

    Excellent summary of a complex topic.

    Thanks for sharing.

  12. Nathan says:

    Hi Eitan,
    Thanks a lot for this article.
    It really helps me understand the issue.

  13. Evan says:

    Great article! I don’t think anyone could explain it to me more clearly.

    I think there’s a tiny error. The description of the first query says you’re looking for orders “during July 2001”. But then the query itself says
    BETWEEN ‘2001-07-01’ AND ‘2001-07-31’
    which will give only the first 30 days. It’s a DATETIME rather than a string giving only the date, so anything that happens on the 31st (except at midnight?) will fall outside the BETWEEN. Instead, I suggest
    BETWEEN ‘2001-07-01’ AND ‘2001-08-01’
    which is a correct midnight-to-midnight range.

    I wouldn’t normally point out an error this small, especially one that is irrelevant to your main point. But your readers will learn one more thing correctly if you fix this.

    • Eitan Blumin says:

      Hi Evan, thank you for your comment!

      Technically you’re right, of course.
      If the OrderDate column includes time, then the query wouldn’t return orders made during the last day of July.
      However, in the particular case of the AdventureWorks database that I was using, the OrderDate column only includes the date portion, without the time.
      So the query is still correct (within the context of the specific database I was querying, of course).

      But I’ll take your remark to heart anyway, and add a note about this in the blog post.

      Thank you!

  14. wpo says:

    Great explanation. Thank you!

  15. Avinash says:

    Wow, what an explanation. Thank you. Hope to see many more articles from you.

  16. Samaneh says:

    It is a wonderful explanation, thanks a lot.

  17. shashi kumar says:

    That’s great explanation. Thank you dear….

  18. Chandrajit Samanta says:

    Explained complex and unclear topics in a lucid way.
    Key points are summarized properly.
    The description also explains the use cases for choosing a particular join.
    Troubleshooting and resolutions for inappropriate join selection in poor queries are elaborated.
    It helped me a lot.
    Thank You

  19. Tissa Rathnayake says:

    Great post, especially about the Hash join.

  20. sunil says:

    nicely written and joins are explained with good examples

  22. Moe says:

    I’m preparing myself for the 70-461 exam and I was looking for join hints when I found this awesome post.
    Thank you very much.

  23. Praki says:

    Thx a lot for the great explanation.
