This series of articles will cover the first steps you should take when entering a new environment, aiming to improve overall performance.
Since there are endless things and approaches, we will present the tools, methods, and tips and divide the possible scenarios you might run into and how to deal with any of them.
Each chapter focus on a different aspect you should be familiar with when facing a performance issue, either a current crisis or a general "slowness." This guide is aimed at junior-intermediate experience level DBAs but might be interesting for experts.
In our previous post, we've discussed the different strategies we'll consider, depending on the time of the performance downgrade. This chapter will focus on:
Being familiar with typical scenarios and how to handle each of them can save precious time in a crisis. Sometimes, the words we expect to hear will align perfectly with an already known scenario. Occasionally, we will receive an entirely wrong description, from which we will need to detect what is happening.
For example, a common phrase we often hear is “I have deadlocks.” Sometimes this is really a deadlock, sometimes, it is a simple lock, and sometimes it is neither.
Another example is “I get timeouts from the database.” Well, there is no such thing as a timeout in the database, except for specific configurations for specific items, which are not the case. Usually, the timeout is configured at the app - .net, for example, has a default timeout of 30 seconds for connections to SQL Server.
Let’s cover a few of these scenarios:
1. “Nothing is working, the database does not respond at all, everything is down” (usually with screams in the background).
The bigger the crisis, the easier the solution is. If nothing responds, it is rarely a performance issue.
Make sure the service is running.
Check for CPU issues; maybe other apps are installed on the server and consume the CPU.
Use DMVs to understand what is currently happening. Maybe a lock-chain, maybe one/several processes consuming too many resources.
Eventually - Find the root cauand se, understand what are your options to handle it fast to release the bottleneck (Can you kill the session? Is it essential? Will it be over soon? ...).
2. “My session is blocked,” “I’m getting timeouts,” “Things feel slower than yesterday.”
All these phrases tell us that something is different now.
“Now,” as explained in chapter 3, is relatively easier since DMVs will usually cover all the details you need without being dependent on monitoring tools (that we don’t always have). Understand what is happening. Ask questions and run queries on DMVs simultaneously.
“Different than before” can result from locks, bad statistics (that were changed for endless reasons), changes that affected something, and more.
Find out what is slower.
Find out why it is slower, what it is waiting for, and its status.
For these questions, I have a query prepared and attached “General Requests Analysis.” It is a great “first aid” to understand, track and follow the behavior of the instance.
3. Lock Chain.
One session locking another, that is locking several others, that are locking plenty more others, and so on.
That can also result in “nothing is working,” slowness and timeouts. Handling that will be finding the root session causing the entire lock-chain. For that, use DMVs such as sys.dm_exec_sessions, sys.dm_exec_requests, sys.dm_tran_locks, or simply read Guy Glantser’s blogpost (with a script for a query written just for that).
4. “I have deadlocks” – about solving deadlocks, I could fill a whole different document, but to keep it simple, assuming this is really a deadlock case (not just locks, rephrased by someone who does not understand the differences).
Deadlocks are not a cause of performance issues but the results of a performance issue.
Find the deadlock – Use Extended Events system health sessions/monitoring tools.
If you don’t have any monitoring tool collecting deadlocks / the event has already been deleted from the extended events system health session, and it is expected to happen again – configure a monitor that will collect it the next time (profiler/ extended events/…).
*Both links regard SQL Server Profiler. Graphs in different 3rd party tools are usually similar. XMLs are identical. There are things easier to do with graphs and others with XMLs, and it differs from one person to another.
After understanding the deadlock, there are mainly 4 approaches:
Harden the locking to prevent a scenario of transactions approaching the same object. That will prevent a deadlock but increase locks and might result in overall slower performance. For example, using TablockX and Holdlock query hints.
Minimize the locking to enable higher concurrency. In that case, you minimize the chances for deadlocks but cause the locking mechanism to “cost more.” Handling locks has a cost. More lock, higher cost. That also might cause performance issues.
Redesign. When two (or more) transactions collide and result in a deadlock, there are often ways to prevent it by changing the code/ its scheduling/and its implementation. It will not always be easy to change the architecture of essential processes in a production environment, but it has minimal risk and high protentional compared to a and b.
A common approach is to improve performance directly. If each of the processes will take less time and io, the chances of them colliding decrease as well. A classic example would be both processes using a Full Scan of the same table. By changing/adding an index/ changing the query, you might achieve Index Seeks that will not “bother” each other.
5. And the last one, my favorite. The developer calls you complaining, “My code/query ran fast yesterday; now it is really slow; I swear I didn’t change anything.” Things do not change from an empty void.
Something has changed. Main 4 scenarios:
1. The developer changed something and thinks it is irrelevant, but it is.
2. Someone else changed something.
3. Some background processes caused the differences.
4. Nothing has changed; something is happening now and affecting the execution.
As explained in chapter 2, investigating, understanding the scenario, and isolating the problem is crucial. Only by that can you match a strategy to solve the issue. But here are some examples of “changes” and things that might correlate with the typical “nothing changed, but it is slower” phone call:
· Code changes.
· Schema changes.
· PLE drop - an indication of something that might be relevant.
· Clear cache - plans recompilations.
· Currently running transactions (locks/ pressure).
· Sleeping sessions with open transactions - holding locks.
· Statistics update due to auto-update-statistics thresholds.
· Statistics update due to a scheduled plan.
· Data distribution/Capacity changes.
To sum up this chapter and the entire series,
endless possible scenarios might cause performance issues. The key is always to dig as deep as possible to understand what is happening fully. You can quickly start implementing your way to solve a problem, waste hours, and catch nothing. Or even gathering data, analyzing it, and eventually understanding it doesn't give you the answer to WHY but just shows you what you already knew. Trust me; it is way more frustrating than dedicating 10 more minutes to asking questions. And of course, knowing in advance the tools you have, is always a bonus. For that, here is a list of some tools mentioned throughout the series that I find important to be familiar with:
Handling with Extended Events SessionsMonitoring Deadlocks with system_health XE Session Selecting Extended Events Session (system_health) SQL Server Logs Querying Error Logs Event Viewer Performance Monitor