Paul White and the Code That Made Me Wow - Trigram Search
top of page

Paul White and the Code That Made Me Wow - Trigram Search

This month's T-SQL Tuesday topic is "Code That Made You Feel A Way", hosted by Erik Darling. Here is the invitation blog post.

T-SQL Tuesday

If you're not familiar with T-SQL Tuesday, then you should. It's a wonderful community initiative started by Adam Machanic many years ago and led by Steve Jones. It strengthens the SQL Server blog community in a creative and engaging way, and it's been very successful over the years.


Erik invited us to write about a "lightbulb moment" when we saw a piece of code written by someone else, which made us feel a thing.



The Background - Wildcard String Search in SQL Server


A few weeks ago someone asked in one of the Israeli forums about ideas to run a wildcard string search in SQL Server in a performant way. The goal is to be able to search for an arbitrary substring within the contents of a string column in a large table. If you do it the straightforward way (LIKE '%substring%'), then it results in a long and expensive scan. This is an old challenge with lots of different solutions.


String Search

In many cases, the best solution would be simply not to do it in SQL Server. Other platforms, such as Azure Data Explorer or Elastic, might be better tools for the job. Of course, if you're only using SQL Server today, then this might be an extreme solution that involves a lot of effort and has a learning curve.


Sometimes the right solution is to mix several platforms, replicate the data, perform the search outside of SQL Server, and then return the results to SQL Server for further processing.


If you're bound to SQL Server only (whether it's for the right reasons or not), then in some cases Full-Text Search can do the trick. This is a service that is part of the SQL Server package, and it allows you to index the string column and run full-text queries against it. For some workloads, it can work really well and yield much better results compared to a simple wildcard string search.


But sometimes even that's not a practical solution, and you need a T-SQL solution within the boundaries of the SQL Server database engine. There are still some things you can do.


Someone else in the forum responded with a link to a blog post from 2017 by Paul White - Trigram Wildcard String Search in SQL Server. The concept of n-grams was familiar to me (in general), but I haven't read this blog post before.


Well, maybe I have read it in the past, and I just don't remember, cause it was a bit familiar. One of the advantages of having poor memory is that you can enjoy things again and again as if it's the first time. Think about your favorite book or your favorite movie of all times. You can watch the movie again and again as if it's new, and every time you will go: "Wow, this is the best movie ever!". This is my life. ๐Ÿ˜Š


I love reading Paul White's blog posts. They are always deep, precise, and full of explanations, code samples, and performance analysis. I always learn a lot from reading his blog posts. So imagine my excitement seeing a blog post by Paul White that I (supposedly) hadn't read before.



The Solution - Trigram Search


I won't go into all the implementation details here. If you're interested in this type of solution, then I urge you to read Paul's blog post thoroughly.


The general idea is to break the strings in the column into trigrams - contiguous sequences of 3 characters. The trigrams are then saved in a separate table with a proper index. And then the original wildcard search is replaced with a more complicated but much more performant search against the trigrams table.


Trigram

Of course, there are a lot more implementation details, and Paul White covers them thoroughly and beautifully in his blog post.


As usual, Paul provides the code to implement the full solution, and he explains the code step by step. By the end of the blog post, I thought: "Wow, this is beautiful!".


I had to test it myself, so I set up a test environment with 10 million rows. I compared Paul's solution with the simple wildcard search, and it worked really well.


This was one of those Wow moments. Thank you, Paul!



Conclusion


Other members of the forum reported implementing the trigram solution or variations of it. Other people asked for some clarifications about it, and there was an interesting discussion about it in the forum.


We even talked about it in the last SQL Server Radio episode.


Next time you challenge yourself with the wildcard String Search in SQL Server, remember to read Paul's blog post about trigrams. Read it as if it's the first time you see it. ๐Ÿ˜Š

0 comments

STAY IN TOUCH

Get New posts delivered straight to your inbox

Thank you for subscribing!

bottom of page