Nested Loops, Parallelism and Confusion
I’m 30 today.
In addition, it’s been 10 years since I started working with SQL Server.
In my twenties, I fell in love with T-SQL and SQL Server, got to a point I felt I needed a break, and came back fully energized.
One of my favorite parts of being a DBA has always been performance tuning, and what’s performance tuning without analyzing execution plans?
The moment I understood I needed a break was when I examined an execution plan at work and just didn’t feel like making it look better.
Well, that feeling went away very quickly. For T-SQL Tuesday, hosted this month by Rob Farley (Blog | Twitter), here’s something I learned about the nested loops operator during an optimization session that shows why I love performance tuning so much.
First, let’s create and populate the two tables we’ll use. Please watch out if you run the script, since it populates a good amount of data.
–Create tables CREATE TABLE #ItemAttributes ( ItemId BIGINT IDENTITY(1,1), ItemAttributeTypeId INT, Value VARCHAR(MAX), CONSTRAINT PK_ItemAttributes PRIMARY KEY(ItemAttributeTypeId,ItemId) ) GO
CREATE TABLE #ItemAttributeTypes ( ItemAttributeTypeId INT IDENTITY PRIMARY KEY, ItemAttributeTypeName VARCHAR(100) ) GO
–Insert data BEGIN TRAN DECLARE @I INT = 1 WHILE @I<=1000 BEGIN INSERT INTO #ItemAttributeTypes(ItemAttributeTypeName) SELECT ‘Type’ + CAST(@I AS VARCHAR(100)) SET @I+=1 END COMMIT
–Populate the major amount of data using –Itzik Ben-Gan’s trick: WITH L0 AS(SELECT 1 AS C UNION ALL SELECT 1 AS O), — 2 rows L1 AS(SELECT 1 AS C FROM L0 AS A CROSS JOIN L0 AS B), — 4 rows L2 AS(SELECT 1 AS C FROM L1 AS A CROSS JOIN L1 AS B), — 16 rows L3 AS(SELECT 1 AS C FROM L2 AS A CROSS JOIN L2 AS B), — 256 rows L4 AS(SELECT 1 AS C FROM L3 AS A CROSS JOIN L3 AS B), — 65,536 rows L5 AS(SELECT 1 AS C FROM L4 AS A CROSS JOIN L4 AS B), — 4,294,967,296 rows Nums AS(SELECT ROW_NUMBER() OVER(ORDER BY (SELECT NULL)) AS N FROM L5)
INSERT INTO #ItemAttributes SELECT TOP 1000000 1000, REPLICATE(CAST(N%100 AS VARCHAR(10)),4000) FROM Nums ORDER BY N;
–Insert our "Bingo" row. We will search for it INSERT #ItemAttributes (ItemAttributeTypeId, Value) SELECT 1000,’Bingo’
Our goal is to find the row of Type 1000 and a value of ‘Bingo’. Let’s try:
The query takes 8 seconds to complete and generates the following execution plan:
Not good enough..
SELECT * FROM #ItemAttributes a WHERE a.ItemAttributeTypeId = 1000 –Type1000’s Id and a.Value = ‘Bingo’
The second query takes only 1 second to complete and generates the following plan:
So what’s the difference? The first query is fairly straightforward and completely legitimate. Also, the amounts of IO generated by the two queries were almost the same.
The answer lies in the way the nested loops operator works with parallelism. Or does it?
Based on this post by Craig Freedman, my assumption was this:
The outer input of the loop can be separated to different threads for parallel work. On the other hand, the inner input of the loop cannot, even if the operators on the inner side can use parallelism. So as in our case, when a certain value of the outer input has to go through lots of data in the inner input, it can painful.
The second query worked well because other operators like index seek, index scan and even a table scan can be separated properly to a few threads, when each thread performs a portion of the work. By eliminating the need for a nested loops join, we allowed parallelism to do its magic.
Well, not quite.
Adam Machanic (Blog | Twitter) corrected me and stated that the inner input of the loop can in fact be processed in parallel (see the original comment below). The actual problem was a statistics problem due to the heavy skewed data.
By using the following query, we can actually see parallelism on the inner input of the loop:
SELECT * FROM #ItemAttributes a INNER JOIN #ItemAttributeTypes t ON a.ItemAttributeTypeId = t.ItemAttributeTypeId WHERE t.ItemAttributeTypeName = ‘Type1000’ AND a.Value = ‘Bingo’ AND a.ItemAttributeTypeId = 1000
So, what do I learn from this?
First, when you’re not 100% sure of something, maybe it’s worth posting a short question on #sqlhelp before writing a post about it.
Second, I’m currently a bit confused about when parallelism can be applied on the inner input. To add to the confusion, check out this post by Paul White, stating that “the inner side almost always runs multiple threads serially”. There isn’t a lot of material about the subject, so maybe it’s a good topic for someone out there who understands it