
Behold! A Wild Latch-Convoy Problem Appears!

Updated: Sep 1, 2023

The "Last Page Insert Contention" in SQL Server, also known as "Latch Convoy Problem", also known as "PageLatchEx Contention" is one of those extremely rare use cases that are very difficult to see in real-world scenarios.

Evidently, it was impactful enough that Microsoft implemented a solution for this problem back in SQL Server 2019 in the form of the new OPTIMIZE_FOR_SEQUENTIAL_KEY index option, which reportedly fixes it.

It's not a catch-all option, though, as it carries some performance overhead. So, it's not something you'd want to enable wholesale for all of your indexes.

But the use case for it is so specific and so extreme that it's very rare to actually see it in the real world. I mean, to create this sort of last-page contention, you'd have to bombard that index with rapid, concurrent insertions like CRAZY.
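To illustrate the pattern, here's a purely hypothetical sketch: a table clustered on an ever-increasing IDENTITY key, with many sessions inserting into it at the same time. Every insert lands on the same "last" page of the B-tree, and every session needs an exclusive latch on that page. (The table and column names below are made up for illustration.)

-- Hypothetical table: the clustered key always increases,
-- so every new row goes to the last page of the index
CREATE TABLE dbo.OrderEvents
(
    EventId   BIGINT IDENTITY(1,1) NOT NULL,
    OrderId   INT NOT NULL,
    EventTime DATETIME2 NOT NULL DEFAULT SYSUTCDATETIME(),
    Payload   NVARCHAR(400) NULL,
    CONSTRAINT PK_OrderEvents PRIMARY KEY CLUSTERED (EventId)
);

-- Hundreds of sessions running inserts like this one concurrently
-- all compete for an exclusive latch (PAGELATCH_EX) on the same last page:
INSERT INTO dbo.OrderEvents (OrderId, Payload) VALUES (42, N'some payload');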

Last-Page Insert Contention in SQL Server, AKA The Latch Convoy Problem

I personally had never seen this specific scenario happen before. At least not outside of demos and simulations. In fact, it's so rare that I had even started to lose hope of ever seeing it in my lifetime...

Until today.

Today it happened. 😲

And it caught me unprepared.

One of our customers complained about their system being slow.

Identifying the Contention

At first, when I looked at the performance metrics, I didn't connect the dots. It looked like a server resource issue to me. The workload on the server had simply increased gradually over several months, until it hit some kind of ceiling in terms of the server resources it could utilize... At least, that's what I thought. I mean, that last-page contention thing was so far from my mind that I didn't even consider it. The best quick-impact recommendation I could make was scaling the server up to a stronger machine.

Performance analysis graphs over time, showing a gradual trend of increasing workload and waits

... But that didn't help.

High memory (latch) contention
Their main insertion process still suffered from slowness. That's the INSERT INTO command in the tooltip.

Zooming in, we could see that the main wait type was PAGELATCH_EX, with PAGELATCH_SH and PAGELATCH_UP close behind, along with intermittent spikes of WRITELOG waits:


High PAGELATCH_EX and PAGELATCH_SH contention
There it was... Staring me right in the face, and I didn't recognize it at all...
High WRITELOG waits
Like a majestic beast, prowling in the wild... And I was none the wiser.
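
For what it's worth, the same picture can be sanity-checked outside of the monitoring tool with a quick look at the accumulated wait stats. A minimal sketch (requires the VIEW SERVER STATE permission):

-- Top accumulated latch / log waits since the last restart or stats clear
SELECT TOP (10)
       wait_type,
       waiting_tasks_count,
       wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PAGELATCH%'
   OR wait_type = 'WRITELOG'
ORDER BY wait_time_ms DESC;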

Resolving the Contention

And then - the customer decided to try out the OPTIMIZE_FOR_SEQUENTIAL_KEY index option (embarrassingly, that recommendation didn't come from me).

To do this, they recreated a massive index, which halted their entire environment for half an hour because the affected table is so central to their operation.

High lock waits while the index is being created
Those huge turquoise mountains of waits? Those are lock waits caused by the index being recreated. They're so huge at this graph scale that you can't even see the latch wait spikes that preceded them.

But, eventually, it was completed.

IMPORTANT SIDE NOTE: The customer wasn't aware of it at the time, but you don't actually need to rebuild an index to enable the OPTIMIZE_FOR_SEQUENTIAL_KEY option. You can simply run a metadata-only command that looks like this:

ALTER INDEX [IX_IndexName] ON [dbo].[MyTableName]
SET ( OPTIMIZE_FOR_SEQUENTIAL_KEY = ON );
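
And to verify that the option actually took effect (SQL Server 2019 and later; using the same placeholder index and table names as above):

-- optimize_for_sequential_key = 1 means the option is enabled
SELECT i.name, i.optimize_for_sequential_key
FROM sys.indexes AS i
WHERE i.object_id = OBJECT_ID(N'dbo.MyTableName')
  AND i.name = N'IX_IndexName';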

Afterwards - over the period of an hour or so - the contention got reduced... more... and more... lower, and lower... until there was no contention anymore!

Latch contention going down, going down, going down...
going, going, going... aaaaaaaaand gone.

Interestingly, after enabling the optimized index option, we saw an increase in a different wait type instead: BTREE_INSERT_FLOW_CONTROL.

But these are usually short-lived, and we still noticed that the overall performance was significantly improved.

Conclusion

I decided that I would never again be so skeptical about this use case. Never again will it catch me unprepared.

Thinking back on why I didn't recognize this use case sooner, I concluded that it's because it's not one of those things you tend to recognize just from staring at SQL wait graphs, nor from looking at the top SQL commands and their resource consumption.

Identifying the "last-page insert contention" scenario is a bit more complicated than that. The PAGELATCH_EX wait could mean a lot of things. It simply means that there was write contention on a data page in memory. But it could be any type of data page. It could be an IAM page, a GAM page, or something else.

The way to detect last-page insert contention is by checking how much contention there is PER RESOURCE. That way, you can see that the problem resides in a specific table, and on a specific page (i.e., the last page at the end of the index). It's not something you can spot just from looking at charts; you'll need to go deeper than that.
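
As a minimal sketch of that idea (a simplified version, not the exact sample from Microsoft's article), you can count the currently-waiting requests per page resource and see whether one specific page dominates:

-- Count active PAGELATCH_EX waiters per page; one page hoarding most of the
-- waits on an ever-increasing index is the classic last-page signature
SELECT wait_resource,
       COUNT(*) AS waiting_requests
FROM sys.dm_exec_requests
WHERE wait_type = 'PAGELATCH_EX'
GROUP BY wait_resource
ORDER BY waiting_requests DESC;

From there, the hot page can be mapped back to its table and index (for example, with sys.dm_db_page_info on SQL Server 2019 and later) to confirm it really is the last page of a sequential index.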

This was a humbling experience. But this event got me so excited that I decided to construct an alert that would automatically detect these scenarios, based on the sample query provided by Microsoft in this article, which performs this check by counting the PAGELATCH_EX contention per resource.

Once I have this alert automated, it will be doing all the heavy lifting for me.

You're not gonna get me next time, last-page insert contention! 💪

The real-world use case described in this article was detected for one of our customers as part of our managed remote DBA service. The charts and metrics in the screenshots were taken from SolarWinds SQL Sentry. Following the findings of this event, we've added an automated alert that would be able to detect similar scenarios for all of our other managed customers as well. This is how we continuously grow and improve our managed service. ❤

