Written By: Guy Glantser 26/02/2015
One of the most common tasks that DBAs are required to perform is monitoring. This is not just a task. Monitoring is a strategy, and a good DBA should have a well-defined plan for carrying out this strategy. But I’m not here to talk about monitoring as a whole. I would like to talk about one of the common monitoring areas, which is query statistics.
The reason query statistics is so common is because most of the problems that users complain about are related to query performance. It might be a specific report that is taking forever or the whole system that is being slow. Whatever it is, you need to monitor query performance over time, find the problematic queries, and then do your magic in order to improve the performance of these queries.
So I’m here to talk about the part of finding the problematic queries. There are many ways to do this, which differ in their efficiency and complexity. One of the use cases that I encounter a lot as a consultant, is when the customer complains about some performance problem that occurred in the past or that is currently happening. The problem is that the customer has no monitoring strategy, and there is currently no process collecting performance metrics of any kind. So you can start such a process, but then you’ll have to wait for a while in order to let the process collect enough data to analyze. You also might not be able to identify the problem that the customer was talking about, which occurred in the past.
Fortunately, SQL Server maintains query statistics through a DMV – sys.dm_exec_query_stats. This view returns aggregated performance data about execution plans that are currently in the plan cache. It accumulates data since the last time the SQL Server instance was started, but only as long as the plan is in cache. So this is a good start for the use case described earlier. You can immediately begin to analyze performance data about the queries in the system, which has been collected and maintained for you. And the nice part is that you can do this by writing simple queries against the DMVs. No need to set up anything or to perform any complex tasks.
Let’s look at some of the columns that this view returns:
|sql_handle||This is a token that refers to the text of the batch corresponding to the plan in cache|
|statement_start_offset||The offset in bytes from the beginning of the batch text to the beginning of the text of the query represented by this row|
|statement_end_offset||The offset in bytes from the beginning of the batch text to the end of the text of the query represented by this row|
|plan_generation_num||The number of times this plan has been recompiled since it first entered the plan cache|
|plan_handle||This is a token that refers to the batch plan in cache|
|creation_time||The date & time at which the plan was last compiled|
|last_execution_time||Last date & time at which the query started executing|
|execution_count||Number of times that the plan has been executed since the last compile (creation_time)|
All the rest of the columns represent aggregated statistics about the executions of the queries.
There are 7 areas that you can monitor through this view:
For each area you can query 4 different metrics: total, last, min and max. So, for example, you can view the total aggregated worker time and the max number of logical reads for some query. You can easily calculate the average (e.g. average elapsed time) by dividing the total value by the execution count.
There are a few important things you should know about this view…
Each row represents a single query or statement in a batch. If a batch contains multiple statements, then each statement will have its own row in this view. All the columns described in the table above describe the batch as a whole, except for “statement_start_offset”, “statement_end_offset” and “last_execution_time”, so they will be duplicated across all the rows that belong to the same plan. All the rest of the columns (the statistics data) represent a single statement inside the batch.
You can use the sys.dm_exec_sql_text function to retrieve the text of the batch based on a specific “sql_handle”. You can also use the sys.dm_exec_query_plan function to retrieve the plan in XML format based on a specific “plan handle”. So the following query returns all the plans currently in cache including the text of each corresponding batch and the plan itself:
BatchText = BatchTexts.text ,
BatchPlan = BatchPlans.query_plan ,
CreationDateTime = QueryStats.creation_time ,
LastExecutionDateTime = QueryStats.last_execution_time
sys.dm_exec_query_stats AS QueryStats
sys.dm_exec_sql_text (QueryStats.sql_handle) AS BatchTexts
sys.dm_exec_query_plan (QueryStats.plan_handle) AS BatchPlans;
But this is not useful, because the goal of this view is to provide the statistics about query executions, and we need a way to present the specific statement text and the specific statement plan for each row.
The statement text can be obtained by the values of “statement_start_offset” and “statement_end_offset”. I wrote a simple scalar function that accepts the values of “sql_handle”, “statement_start_offset” and “statement_end_offset”, and returns the text of the relevant statement:
@inSQLHandle AS VARBINARY(64) ,
@inStatementStartOffset AS INT ,
@inStatementEndOffset AS INT
DECLARE @nvcResult AS NVARCHAR(MAX);
@nvcResult = SUBSTRING
@inStatementStartOffset / 2 ,
WHEN -1 THEN LEN ([text])
ELSE @inStatementEndOffset / 2
- @inStatementStartOffset / 2
In order to retrieve the plan of a specific statement, you can use the sys.dm_exec_text_query_plan function, which accepts a “plan_handle”, “statement_start_offset” and “statement_end_offset”. Notice that this function returns the plan in text format, so you need to convert it to XML. Here is the same query as above, but this time each row presents the text and plan of a single statement along with the average elapsed time for that statement:
StatementText = Execution.udf_s_StatementLevelText (QueryStats.sql_handle , QueryStats.statement_start_offset , QueryStats.statement_end_offset) ,
StatementPlan = CAST (StatementPlans.query_plan AS XML) ,
AverageElapsedTime = CAST ((CAST (QueryStats.total_elapsed_time AS DECIMAL(19,2)) / CAST (QueryStats.execution_count AS DECIMAL(19,2))) AS DECIMAL(19,2))
sys.dm_exec_query_stats AS QueryStats
sys.dm_exec_text_query_plan (QueryStats.plan_handle , QueryStats.statement_start_offset , QueryStats.statement_end_offset) AS StatementPlans;
The “last_execution_time” column is updated when a specific query begins to execute. All the other columns, such as the “execution_count” and the statistics columns, are updated when the query completes successfully. So if you have a query that fails with an error many times, the “last_execution_time” will show you that the query has been executed recently, but the “execution_count” will show a low value, which correspond only to the successful executions of the query, and the statistics will be misleading, because they will only take into account successful executions.
One common scenario is a query that suffers from such a poor performance, which causes it to time out by the application most of the times. This is exactly the query that you are looking for in order to monitor its statistics. Let’s say the query is executed 5 times each second. If you query the DMV, you’ll see that the last execution time is just now, but the execution count won’t be even close to reflect 5 executions per second. Also, the statistics will show you how the query performs when it doesn’t time out, which is the opposite from what you’re looking for…
So if this is your case, this DMV won’t be very helpful to you, and you should seek other methods, such as SQL Trace and Extended Events.
It’s important to remember that this DMV maintains statistics for a query as long as its plan is in cache. If a plan is evicted from cache, or if a plan is recompiled, then all corresponding statistics are no longer available. So you should always remember to check out the value of “creation_time”, which is the time the plan was last compiled. This is also the time at which statistics started accumulating. It is also a good idea to check out the value of “execution_count”, which is the number of times the plan was executed since the last compile.
There are 3 problems with this DMV:
In order to solve these problems, you need to take snapshots of the DMV over time and store them in your own table. This way, you will have a history of query statistics, and you won’t have to worry about losing statistics.
Once you have these snapshots, you can query your table in order to monitor query statistics in a given period of time. For example, if you take a snapshot every 5 minutes, and you maintain history for the last 10 days, then you can monitor query statistics for any given period in the last 10 days, e.g. yesterday between 09:00 and 12:00. The way to do it is to find the snapshot taken at 09:00 and the snapshot taken at 12:00, and then calculate the difference between the statistics. If a plan appears in both snapshots, then you can calculate the difference between them. If a plan appears only in the second snapshot (12:00), then you can just use the statistics since “creation_time”, which must be sometime between 09:00 and 12:00. If a plan appears only in the first snapshot, then it means it was evicted or recompiled sometime between 09:00 and 12:00, and you need to try a smaller period of time in order to capture its statistics.
So I prepared a script for you that does all that:
This script does the following:
You should adjust the schedule of the job to match your needs. On one hand, you should take as much snapshots as possible in order to increase accuracy and the ability to monitor statistics in more granular time periods. On the other hand, each snapshot contains a lot of data and consumes a lot of space, so you don’t want to take snapshots too often. After some testing and adjustments, you’ll find your sweet spot.
One thing that is not included in this script is data retention management. You should create another job to clean old data from your tables. Otherwise, it’s going to grow too much. Again, you need to decide between keeping as much history as possible and saving space.
That’s it. Everything you need to do now is to execute the script and start monitoring and analyzing query statistics the right way!