Recently I’ve stumbled upon an apparently little known fact about SQL Server – specifically SQL Server collations and how they affect performance.
During a normalization upgrade to their database, one of my clients had weird and inconsistent performance problems and couldn’t find a solution for it for quite a while.
They had two queries:
The Original (Un-Normalized) Query: Scanned 2.5 million rows and executed for 3 seconds.
The New (Normalized) Query: Scanned 10 million similar rows and executed for 50 seconds.
Each query was executed against a different table, but the data in the tables was identical (the difference was due to the normalization). The column we searched against was varchar(255) in both tables.
For a while we couldn’t wrap our heads around the reason for the huge difference in performance.
Something just didn’t seem right.
During my investigation I noticed an intriguing difference in the execution plans, and that is in the first query (the fast one) an implicit conversion was performed for the varchar column, but no such conversion occurred in the slow query. At first I didn’t pay attention to it because, as a DBA, I was trained to think that implicit or explicit conversions HURT performance, and not the other way around – so this couldn’t have been the reason.
But eventually (and with a little push from Shy Engelberg – Thank you Shy!) I took a closer look to find out why this conversion was happening in the first place.
We found that the collation of the varchar column was different between the two tables.
The first table had a Latin SQL collation, yet the second table had an ANSI collation (or windows collation if you prefer to call it that).
We tried to change the collation of the second table to the SQL collation as well – and presto! The query now executed for 5 seconds – 10 times faster! (and only 2 times slower than the 3 second query, a lot more logical difference)
I looked for materials regarding this issue, and couldn’t find much except this (apparently important) white paper:
The highlight:
Unicode sorting rules are much more complex than the rules for a non-Unicode SQL sort order. When SQL Server compares Unicode data, the characters are assigned a weight that is dynamically modified based on the collation’s locale. The data is also modified by comparison style settings such as width, accent, or Kana-sensitivity. The Unicode sort routines support more intelligent sort behaviors like word sorting.
[…]
• If you are storing and handling your data by using non-Unicode data types (char, varchar, text), and you are using a SQL collation, string comparisons will be performed with a non-Unicode SQL sort order.
• If you are storing and handling your data by using non-Unicode data types (char, varchar, text), and you are using a Windows collation, string comparisons will be performed with the Unicode sorting rules. This may cause certain operations that are unusually dependent on string sorting performance to take longer and to use more CPU than a similar operation that is performed with a SQL collation.
• If you are using Unicode data types (nchar, nvarchar, ntext), there is no difference in the sorting behavior for the SQL and the Windows collations. Both will use Unicode sorting rules.
In short, windows collations perform as slow as Unicode data types, and SQL collations perform much faster.
However, the sorting order must be taken into consideration because it could change the results.
Though in the case of this client, the first table was the original way to go anyway, so we’re actually “fixing” the sorting order difference (in case such existed).
Later we also tested the performance of a binary collation and saw that it was even faster in queries that required comparison and grouping, but it performed worse in queries that required sorting. This makes sense since binary collations must be parsed textually before being sorted – a heavier operation than doing the same for SQL collations – as opposed to a comparison operation which can be as fast as doing XOR of two values.
Conclusion
SQL or Windows Collations seem like another consideration we need to take when designing a database and checking for performance – but beware! Changing the collation may change the sorting order of your text, so if your application depends on a certain form of sorting – you need to make sure the new collation matches it.
The same white paper mentioned above claims that changing indexes or how a query is written will make a bigger difference than changing collations, but it’s surprising what a huge difference in performance a collation caused in the case of my client, and we did try to rewrite the query several times and changing indexes, but nothing made a big difference such as the collation.
Did you encounter a similar scenario? Did you have similar (or even more interestingly – different) results? Tell us about it in the comments!
Thanks again to Shy Engelberg for his very helpful advice on this one.
Comments