This month's #tsql2sday is hosted by the @AirborneGeek (t|b), who asks us to take a lesson from something pilots do frequently: learning from accidents and mistakes made by others.
As a long-time SQL Server Consultant DBA, I have learned from quite a lot of mistakes made (mostly) by others, seeing as a significant part of my job description is to come over and fix such mistakes. So, today I'll use this opportunity to talk about one such interesting incident.
She came into the dimly lit office
Our story begins with a production environment, consisting of two Azure VMs with SQL Server 2016 Standard Edition, and an AlwaysOn Basic Availability Group. The database in question holds critical CRM and financial data, implemented using Microsoft Dynamics NAV (for those who aren't familiar, it's a long-time standard ERP app from Microsoft).
In any case, this customer contacted us with complaints about recurring performance issues. They couldn't quite put their finger on anything specific, but they reported the following symptoms:
End-users intermittently getting "a general sense of slowness while using their system".
Users experiencing sudden disconnections from the database.
Maintenance jobs failing occasionally.
Fumbling in the dark
Looking into the history of the failed jobs, we found that they were caused by disconnections in the Availability Group, and these correlated with corresponding events in the AlwaysOn_health extended events session (as did the disconnections reported by end-users): state changes into RESOLVING and back, WSFC disconnections... The works.
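For those who want to run a similar check themselves, here is a rough sketch of how the built-in AlwaysOn_health session can be queried for those state-change events (the file name pattern assumes the session's .xel files are in the instance's default LOG folder; adjust the path if yours differs):

-- Read availability-group state changes and errors from the AlwaysOn_health session.
-- If the .xel files are not in the default LOG directory, supply a full path instead.
SELECT  xed.event_data.value('(event/@name)[1]', 'nvarchar(60)')      AS event_name,
        xed.event_data.value('(event/@timestamp)[1]', 'datetime2(3)') AS event_time_utc
FROM    sys.fn_xe_file_target_read_file('AlwaysOn_health*.xel', NULL, NULL, NULL) AS xef
CROSS APPLY (SELECT CAST(xef.event_data AS xml) AS event_data) AS xed
WHERE   xed.event_data.value('(event/@name)[1]', 'nvarchar(60)')
            IN ('availability_replica_state_change', 'error_reported')
ORDER BY event_time_utc;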
However, we couldn't pinpoint the cause of these disconnections.
Asking the informant
The servers were hosted by Microsoft Azure, so we used that to our advantage.
We contacted Microsoft Azure support, and they let us know that the disconnections happened because... The servers were being throttled! Something was periodically pushing them to the disk's maximum throughput limit. We even narrowed our search down to the tempdb files (located on the VM's temporary storage disk), after finding messages in the SQL Server error log repeatedly warning that SQL Server has encountered I/O requests taking longer than 15 seconds to complete on those files.
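If you want to check for the same symptom on your own servers, a quick way to search the current error log is something along these lines (xp_readerrorlog is undocumented but widely used; the two search strings are combined with AND):

-- Search the current SQL Server error log (log 0, type 1) for slow-I/O warnings on tempdb.
EXEC master.dbo.xp_readerrorlog 0, 1, N'I/O requests taking longer than', N'tempdb';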
Finally, we were making progress... But we needed more detailed performance metrics, and to gather them over a longer period of time. We needed better visibility.
This called for the installation of SentryOne, the database monitoring platform that we love using as part of our managed database services solution.
The culprit reveals itself
Now that we knew exactly what to look for, we headed to SentryOne and looked for the Top SQL commands that had the highest TempDB utilization.
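As an aside, if you don't have a monitoring tool in place, a rough point-in-time approximation can be pieced together from the DMVs; the catch is that it only shows requests running right now, which is exactly why historical monitoring is so valuable:

-- Snapshot of currently running requests, ranked by their tempdb page allocations.
SELECT  r.session_id,
        t.text                                  AS sql_text,
        ts.user_objects_alloc_page_count,
        ts.internal_objects_alloc_page_count
FROM    sys.dm_db_task_space_usage AS ts
INNER JOIN sys.dm_exec_requests    AS r
        ON  r.session_id = ts.session_id
        AND r.request_id = ts.request_id
CROSS APPLY sys.dm_exec_sql_text(r.sql_handle) AS t
ORDER BY ts.internal_objects_alloc_page_count DESC;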
Looking at the time window of a recent disconnection event, we saw the following:

Was this the culprit? An UPDATE STATISTICS maintenance job? Could it really be?
We widened our search to a larger time window and aggregated the results... We saw the following:

The picture started becoming clear... A periodic UPDATE STATISTICS maintenance job was wreaking havoc on TempDB specifically, and on disk throughput in general.
We looked closer at the UPDATE STATISTICS maintenance job and the horror revealed itself before our eyes:
It was the vanilla maintenance plan type of job, the kind you create in SSMS, with an SSIS package implementation behind the scenes.
The UPDATE STATISTICS task was configured with a sample rate of FULLSCAN.
It was running against every table in the database.
It was scheduled to run every single night.
Based on its execution history, its average duration was around 32 hours.
In essence, this meant that for roughly 70% of every day, some UPDATE STATISTICS job was running in the background, scanning untold amounts of data pages from disk and memory. And every time an additional BI process or a heavy query came along, it brought the server to its knees, pushing the disk I/O to its maximum throughput, and subsequently causing the throttling and the AG disconnections.
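For a sense of scale, the maintenance plan's Update Statistics task essentially boils down to running a statement of this shape against each and every table, night after night (the table name below is made up for illustration):

-- Roughly what the nightly task issued, once per table, for the entire database:
UPDATE STATISTICS dbo.SomeLargeNavTable WITH FULLSCAN;  -- reads every row of the table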
Solving the issue
Once we knew the cause of the problem, we knew how to deal with it:
We changed the maintenance job to the one implemented by Ola Hallengren, which can take the statistics' modification rate into account (a configuration sketch follows this list). This should reduce the duration of each execution, and make sure that we're only updating statistics that actually need updating.
We modified the sampling rate of the statistics update. Instead of a FULLSCAN, we changed it to 20%. The customer's data is rather uniform, so this was quite enough for our use case.
We reduced the frequency of the job - instead of running every single night, we scheduled it to run once every weekend. This should put it out of harm's way by avoiding overlap with heavy end-user queries and/or the daily BI processes.
We still had the SentryOne monitoring in place, so if any performance degradation resulted from bad statistics, we'd be able to detect it and adjust our configurations accordingly.
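To give an idea of what that looks like in practice, here is a minimal sketch of such a statistics-only job using Ola Hallengren's IndexOptimize procedure (https://ola.hallengren.com); the parameter values are illustrative, not the customer's exact configuration:

-- Statistics-only maintenance job, sketched with Ola Hallengren's IndexOptimize.
EXECUTE dbo.IndexOptimize
    @Databases              = 'USER_DATABASES',
    @FragmentationLow       = NULL,   -- no index maintenance in this job
    @FragmentationMedium    = NULL,
    @FragmentationHigh      = NULL,
    @UpdateStatistics       = 'ALL',  -- update both index and column statistics
    @OnlyModifiedStatistics = 'Y',    -- skip statistics with no modifications
    @StatisticsSample       = 20;     -- 20 percent sample instead of FULLSCAN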
After we made these changes, we saw an almost immediate effect. Getting rid of the "background noise" of a near-constantly running statistics update job significantly reduced the stress on the server, and improved overall performance.

The SQL Server wait stats didn't show much improvement, but the workload stress on memory and disk was reduced significantly, and memory utilization stabilized and improved as well.
And, most importantly, the customer no longer experienced disconnections every time a heavy query or BI process ran. Not even during the weekends, when the statistics job was running again.
Everyone was happy, everything was good.
So, what did we learn?
To summarize what we've learned from this use case:
First and foremost, always have a proper database monitoring solution recording your SQL Server performance for retroactive investigation and tuning. SentryOne is our personal favorite, but there are many others as well.
When using Azure VMs for SQL Server estates, be wary of the maximum disk throughput limits. These can get you right when you least expect it, and cause availability issues which, quite ironically, can be accentuated when using Availability Groups (a quick I/O-stall check is included at the end of this post). See the "additional resources" section below for relevant articles on this topic.
Be mindful of the scheduling you set for your database maintenance jobs. Statistics update operations can be just as resource-intensive as index rebuilds and integrity checks, depending on your data volume and hardware performance. Don't run them so frequently that they overlap with end-user activity (but don't neglect them either... they're there for a good reason).
Be mindful of the configurations for your database maintenance jobs. Leaving everything at its defaults may not be ideal for every use case, and neither are overly inclusive settings (such as rebuilding all indexes, updating all statistics, etc.). This is a whole topic on its own, but there are quite a lot of resources available out there covering it.
All in all: Don't do things blindly. A SQL Server estate is not something you can just play "fire and forget" with. It requires constant care by a professional DBA, in order to get the best value out of it (especially when dealing with high traffic and/or high volume systems).
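As a parting practical note on that disk throughput point: cumulative I/O stalls per database file can be checked with a simple DMV query like the one below. The numbers accumulate since the last restart, so trend them over time - which, again, is where a monitoring tool earns its keep:

-- Cumulative I/O stalls per database file since the instance last restarted.
SELECT  DB_NAME(vfs.database_id)  AS database_name,
        mf.physical_name,
        vfs.num_of_reads,
        vfs.num_of_writes,
        vfs.io_stall_read_ms,
        vfs.io_stall_write_ms
FROM    sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
INNER JOIN sys.master_files AS mf
        ON  mf.database_id = vfs.database_id
        AND mf.file_id     = vfs.file_id
ORDER BY vfs.io_stall_read_ms + vfs.io_stall_write_ms DESC;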