Recently, while working with a customer and tuning some queries, we spotted a query that seemed odd. Something about it wasn't right. After some more investigation, the developer recognized the query as a one generated by Entity-Framework and using the SplitQuery feature. This was new to me. It's the first time I encountered this feature, so I went to learn about it.
Now that I know what it is and how it works, I can tell you that it's a terrible feature in most cases. Developers should avoid using it, unless there is a good reason to use it (which I doubt).
So what is the SplitQuery feature in Entity-Framework?
You can read about it here. According to that article, if you have a one-to-many relationship between two or more entities (tables), and you want to query both entities, then EF will generate a query with a join between the tables.
Here is an example query from the article:
SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url], [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title] FROM [Blogs] AS [b] LEFT JOIN [Post] AS [p] ON [b].[BlogId] = [p].[BlogId] ORDER BY [b].[BlogId], [p].[PostId]
The article then describes a problem with this approach, called "cartesian explosion" problem:
If a typical blog has multiple related posts, rows for these posts will duplicate the blog's information. This duplication leads to the so-called "cartesian explosion" problem. As more one-to-many relationships are loaded, the amount of duplicated data may grow and adversely affect the performance of your application.
So as a developer, what you can do to avoid this problem is to use the SplitQuery feature. If you instruct EF to split the above query, it will result in two queries executed one after the other:
SELECT [b].[BlogId], [b].[OwnerId], [b].[Rating], [b].[Url] FROM [Blogs] AS [b] ORDER BY [b].[BlogId] SELECT [p].[PostId], [p].[AuthorId], [p].[BlogId], [p].[Content], [p].[Rating], [p].[Title], [b].[BlogId] FROM [Blogs] AS [b] INNER JOIN [Post] AS [p] ON [b].[BlogId] = [p].[BlogId] ORDER BY [b].[BlogId]
The idea is to reduce the amount of data that the database sends to the client. All the columns from the [Blogs] table will be returned only once per blog instead of being duplicated across posts. This results in less network traffic and less memory footprint on the client.
But this is bad for so many reasons:
It results in two queries instead of just one. These queries are executed serially, one after the other. The Blogs table is accessed twice. All of this leads to more load on the database, and in most cases it will degrade performance instead of improving it.
If there is a foreign key between the tables (on the BlogId column), then the second query doesn't even have to join to the Blogs table. It only needs data from the Posts table, so it could have been much simpler. Unfortunately, as it seems, EF generates the second query with a join, even when there is a foreign key.
The client application now has two datasets, and it is responsible for joining them together. The database engine is an expert in joining tables. This is what it does best. It has several join algorithms to choose from, and it can leverage all kinds of data structures, like indexes and bitmaps, to run joins faster. It would be a waste to implement the join in the client application instead of letting the database handle it.
When the single query is split into two queries, there is a risk of data inconsistency. If data changes in between the executions of the two queries, you might get results from the first query that are inconsistent with the results from the second query.
When would this feature make sense?
I would say that it might make sense to use it if all the following conditions apply:
You need to return many columns and/or columns with large data types from the parent table, such that the average row size is very large.
You need to return a large number of rows from the parent table.
On average, there are many rows in the child table for each row in the parent table (e.g. many posts per blog).
You don't care about data consistency, or it can't happen with your workload.
Even if all of these conditions apply, I would still test performance with and without the SplitQuery feature. My guess is that in most cases, it would be better not to use this feature.
And even if it runs better when splitting the query, I would still prefer to write a stored procedure with an optimized SQL batch to handle this case, rather than let EF to generate the queries. But that's me.
Bottom line: avoid using the SplitQuery feature in Entity-Framework, unless you have a good reason and you know what you're doing.