Eitan Blumin

T-SQL Tuesday 128 - Learn from Others

Updated: Sep 11, 2020

This month's #tsql2sday is hosted by the @AirborneGeek (t|b), who asks us to take a lesson from something pilots do routinely: learning from accidents and mistakes made by others.

As a long-time SQL Server Consultant DBA, I have learned from quite a lot of mistakes made (mostly) by others, seeing as a significant part of my job is to come in and fix them. So, today I'll use this opportunity to talk about one such interesting incident.

She came into the dimly lit office

Our story begins with a production environment consisting of two Azure VMs running SQL Server 2016 Standard Edition, joined in an AlwaysOn Basic Availability Group. The database in question holds critical CRM and financial data, implemented using Microsoft Dynamics NAV (for those who aren't familiar, it's Microsoft's long-standing ERP application).
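
As an aside: if you want to see what a setup like this looks like on your own servers, the AG topology and replica health are visible through the built-in HADR catalog views and DMVs. This is a generic sketch, not anything specific to this customer's environment:

    -- Quick look at the AG topology and replica health (run on either replica).
    SELECT
        ag.name                          AS ag_name,
        ar.replica_server_name,
        ars.role_desc,
        ars.connected_state_desc,
        ars.synchronization_health_desc
    FROM sys.availability_groups AS ag
    JOIN sys.availability_replicas AS ar
        ON  ar.group_id = ag.group_id
    JOIN sys.dm_hadr_availability_replica_states AS ars
        ON  ars.replica_id = ar.replica_id;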

In any case, this customer contacted us with complaints about recurring performance issues. They couldn't quite put their finger on anything specific, but they reported the following symptoms:

  1. End-users intermittently getting "a general sense of slowness while using their system".

  2. Users experiencing sudden disconnections from the database.

  3. Maintenance jobs failing occasionally.

Fumbling in the dark

Looking into the history of the failed jobs, we found that the failures appeared to be caused by disconnections in the Availability Group, and these correlated with corresponding events in the AlwaysOn_Health extended events session (as did the disconnections reported by end-users): state changes into RESOLVING and back, WSFC disconnections... the works.
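
If you want to dig through the AlwaysOn_Health session yourself, the raw events can be pulled straight out of its .xel files. This is a minimal sketch assuming the default file target name pattern; adjust the path if your session writes elsewhere:

    -- Read raw events from the built-in AlwaysOn_health extended events session
    -- (assumes the default file target pattern 'AlwaysOn_health*.xel').
    WITH xe AS
    (
        SELECT CAST(event_data AS XML) AS x
        FROM sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL)
    )
    SELECT
        x.value('(event/@name)[1]', 'nvarchar(128)')        AS event_name,
        x.value('(event/@timestamp)[1]', 'datetimeoffset')  AS event_time_utc,
        x                                                   AS event_xml
    FROM xe
    ORDER BY event_time_utc DESC;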

However, we couldn't pinpoint the cause of these disconnections.

Asking the informant

The servers were hosted by Microsoft Azure, so we used that to our advantage.

We contacted Microsoft Azure support, and they let us know that the disconnections happened because... the servers were being throttled! Something caused them to reach the disk's maximum throughput limit periodically. We even narrowed our search down to the tempdb files (located on the VM's temporary storage disk), after finding corresponding messages in the SQL Server error log repeatedly warning that SQL Server has encountered I/O requests taking longer than 15 seconds to complete (for the tempdb files).
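
Two quick checks can corroborate this kind of finding on any SQL Server instance. Both rely only on built-in facilities (xp_readerrorlog is undocumented but widely used), and the search strings below are simply what we were hunting for here:

    -- 1) Search the current error log for the long-I/O warnings
    --    (parameters: log number, log type 1 = SQL Server, two search strings).
    EXEC master.dbo.xp_readerrorlog 0, 1, N'I/O requests taking longer', N'tempdb';

    -- 2) Cumulative I/O latency per tempdb file since the last restart.
    SELECT
        mf.name                                               AS file_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)   AS avg_read_latency_ms,
        vfs.num_of_writes,
        vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0)  AS avg_write_latency_ms
    FROM sys.dm_io_virtual_file_stats(DB_ID('tempdb'), NULL) AS vfs
    JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id;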

Finally, we were making progress... But we needed more detailed performance metrics, gathered over a longer period of time. We needed better visibility.

This called for the installation of SentryOne, the database monitoring platform that we love using as part of our managed database services solution.


The culprit reveals itself

Now that we knew exactly what to look for, we headed to SentryOne and looked for the Top SQL commands that had the highest TempDB utilization.
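
SentryOne surfaces this in its Top SQL view, but if you don't have a monitoring tool in place, you can get a rough point-in-time approximation from the standard DMVs. Here's a minimal sketch ranking currently running requests by their tempdb allocations (for parallel queries you'll see one row per worker task):

    -- Currently running requests ranked by tempdb allocations (8 KB pages -> MB).
    SELECT
        tsu.session_id,
        (tsu.user_objects_alloc_page_count
         + tsu.internal_objects_alloc_page_count) * 8 / 1024  AS tempdb_alloc_mb,
        st.text                                               AS sql_text
    FROM sys.dm_db_task_space_usage AS tsu
    JOIN sys.dm_exec_requests AS er
        ON  er.session_id = tsu.session_id
        AND er.request_id = tsu.request_id
    CROSS APPLY sys.dm_exec_sql_text(er.sql_handle) AS st
    WHERE tsu.session_id > 50          -- skip system sessions
    ORDER BY tempdb_alloc_mb DESC;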

Looking at the time window of a recent disconnection event, we saw the following:

[Screenshot: UPDATE STATISTICS eating up TempDB user space]
Wait, what? UPDATE STATISTICS?

Was this the culprit? An UPDATE STATISTICS maintenance job? Could it really be?

We widened our search to a larger time window and aggregated the results... We saw the following:

[Screenshot: lots of UPDATE STATISTICS executions taking up a LOT of reads and TempDB user space]
Holy cow! What a monster!

The picture started becoming clear... A periodic UPDATE STATISTICS maintenance job was wreaking havoc on TempDB specifically, and on disk throughput in general.

We looked closer at the UPDATE STATISTICS maintenance job and the horror revealed itself before our eyes:

  • It was the vanilla maintenance plan type of job, the kind that you create in SSMS with an SSIS implementation behind the scenes.

  • The UPDATE STATISTICS task was configured with FULLSCAN (a full scan of every row, rather than a sampled scan); see the sketch after this list.

  • It was running for all tables in the database.

  • It was scheduled to run every single night.

  • Based on its execution history, its average duration was around 32 hours.
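
To put that in perspective, the "Update Statistics" maintenance plan task configured this way boils down to issuing a statement like the following for every single table in the database, night after night (the table name here is just an illustrative placeholder):

    -- What the maintenance plan was effectively running, table after table:
    UPDATE STATISTICS dbo.[SomeLargeNavTable] WITH FULLSCAN;  -- reads every row, for every statistic on the table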

In essence, this meant that some kind of UPDATE STATISTICS job was running in the background for about 70% of each day (SQL Server Agent won't start a new run while the previous one is still executing, so a nightly schedule with a 32-hour average duration works out to roughly 32 out of every 48 hours), scanning untold amounts of data pages from disk and memory. And every time an additional BI process or a heavy query came along, it brought the server to its knees, causing the IO to reach its maximum throughput, and subsequently causing the throttling and AG disconnections.

Solving the issue

Once we knew the cause of the problem, we knew how to deal with it:

  • We changed the maintenance job to the one implemented by Ola Hallengren (his free SQL Server Maintenance Solution); an illustrative configuration is sketched below.
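
For reference, a typical IndexOptimize call from Ola's solution looks something like the sketch below. This is an illustrative configuration rather than the customer's exact job; the key differences from the old maintenance plan are that statistics are only updated if they were actually modified, and with a default sampled scan instead of FULLSCAN:

    -- Illustrative IndexOptimize configuration (Ola Hallengren's Maintenance Solution).
    EXECUTE dbo.IndexOptimize
        @Databases              = 'USER_DATABASES',
        @FragmentationLow       = NULL,
        @FragmentationMedium    = 'INDEX_REORGANIZE,INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
        @FragmentationHigh      = 'INDEX_REBUILD_ONLINE,INDEX_REBUILD_OFFLINE',
        @UpdateStatistics       = 'ALL',
        @OnlyModifiedStatistics = 'Y',   -- skip statistics with no modifications since the last update
        @LogToTable             = 'Y';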