Monitoring Heartbeats in SQL Sentry

SolarWinds SQL Sentry is the leading monitoring and alerting platform for Microsoft Data Platform DBAs, and it is our platform of choice for our managed service customers. It uses a Windows service to monitor its targets. That's all great, but... What if this monitoring service is unavailable for some reason? How could you be alerted when your alerting service is down? Who is watching the watchers? Who is monitoring the monitors?

As a SQL Sentry MSP, we found a solution for this a long time ago. Join me in this blog post to learn how to do it yourself!

The Heartbeat

SQL Sentry has a built-in "heartbeat" mechanism, supposedly made exactly for this purpose. Every time a monitoring service connects to the SQL Sentry repository, it also updates a table row representing itself, with the current UTC time, to save its latest "heartbeat".

There is even a "Failsafe Condition" called "Monitoring Service: Send Heartbeat Error". But that is actually helpful only when the Monitoring Service itself is still up! After all, the monitoring service itself is the only entity that actually triggers alerts in SQL Sentry!

However, this "monitoring service heartbeat" is still something that could be of great use for us, as it is nothing more than a database table field.

This data is saved in the SQL Sentry table "ManagementEngine", in the column called "HeartbeatDateTime":

Checking the Pulse

Checking the last heartbeat for each monitoring service would be as easy as querying from a table. Although, it's often not as easy as that, as we also need to take into consideration when was the last time that an alert was already dispatched and whether we need to re-alert again or not.

Luckily for you, I have already prepared a T-SQL script for you that implements (almost) all the necessary logic for you:

-- Run this in the right SQL Sentry database
--USE [SentryOne];
--USE [SQLSentry];
GO
IF OBJECT_ID('[dbo].[heartbeat_log]') IS NULL
BEGIN
CREATE TABLE [dbo].[heartbeat_log](
[servername] [nvarchar](300) CONSTRAINT PK_Heartbeat_Log PRIMARY KEY CLUSTERED WITH(IGNORE_DUP_KEY=ON,DATA_COMPRESSION=PAGE),
[heartbeatdate] [datetime] NULL,
[ActualHeartbeatDate] [datetime] NULL
);
END
GO
DECLARE
@HeartbeatThresholdMinutes INT = 5,
@HeartbeatRepeatAlertThresholdHours INT = 1,
@GlobalObjID UNIQUEIDENTIFIER

SELECT @GlobalObjID = ObjectID FROM dbo.vwObjects_Global;
DECLARE @SubjectTitle NVARCHAR(255), @BodyText NVARCHAR(4000);
SET @SubjectTitle = N'Monitoring Service: Heartbeat Failure'

-- Delete any irrelevant heartbeat failures from log
DELETE HL
-- optionally use this output to implement an "auto-resolve" logic:
-- OUTPUT deleted.servername INTO @RestoredHeartbeats
FROM dbo.heartbeat_log AS HL
INNER JOIN dbo.ManagementEngine AS E
ON HL.servername = E.servername
WHERE E.HeartbeatDateTime > HL.HeartbeatDate

-- Get all monitoring servers with outdated heartbeats and are not snoozed
DECLARE @site NVARCHAR(200), @servername NVARCHAR(200), @HeartbeatDateTime datetime
, @text NVARCHAR(200), @title NVARCHAR(200), @HeartbeatDateTimeLocal datetime;

DECLARE serverlist CURSOR
LOCAL FAST_FORWARD
FOR
SELECT T.[Name],E.[servername], ISNULL(E.HeartbeatDateTime, E.LastInitializationDateTime)
FROM dbo.ManagementEngine E
INNER JOIN [dbo].[Site] T ON T.ID = E.SiteID
WHERE ISNULL(E.HeartbeatDateTime, E.LastInitializationDateTime) < DATEADD(MINUTE, -@HeartbeatThresholdMinutes, GETUTCDATE())
AND NOT EXISTS (SELECT st.ObjectID FROM dbo.SnoozeStatus st WHERE st.ObjectID IN (E.ObjectID,T.ObjectID,@GlobalObjID) )

OPEN serverlist

WHILE 1=1
BEGIN
FETCH NEXT FROM serverlist INTO @site, @servername, @HeartbeatDateTime
IF @@FETCH_STATUS <> 0 BREAK;

SET @HeartbeatDateTimeLocal = CONVERT(datetime, @HeartbeatDateTime AT TIME ZONE 'UTC' AT TIME ZONE 'Israel Standard Time' )

SET @title = @site + ': '+ QUOTENAME(@servername) + ' ' + @SubjectTitle
SET @text = N'Monitor server ' + @servername + N' has lost connection to the repository. Last heartbeat was at ' + convert(nvarchar(200),@HeartbeatDateTimeLocal, 121) + N' (IST)'

SET @BodyText = N'[Message]: ' + @text
+ N' [Start Time (Local)]: ' + CONVERT(nvarchar(25), @HeartbeatDateTimeLocal, 121)
+ N' [Start Time (UTC)]: ' + CONVERT(nvarchar(25), @HeartbeatDateTime, 121)
+ N' [Duration]: ' + ISNULL(NULLIF(CONVERT(varchar(100), DATEDIFF(dd,0, GETUTCDATE()-@HeartbeatDateTime)), 0) + ' day(s), ', '')
+ CONVERT(varchar(100), GETUTCDATE()-@HeartbeatDateTime, 114)
+ N' ----------------------------------------------------------------------'
+ N' [Timestamp (Local)]: ' + CONVERT(nvarchar(25), GETDATE(), 121)
+ N' [Timestamp (UTC)]: ' + CONVERT(nvarchar(25), GETUTCDATE(), 121)
+ N' [Generated By]: ' + @@SERVERNAME
+ N' [Monitor Type]: SQLSentryHeartbeatCheck [Condition]: ' + @SubjectTitle
;
WITH trgt AS (SELECT * FROM dbo.heartbeat_log WHERE servername = @servername)
MERGE INTO trgt
USING (SELECT * FROM (VALUES(@servername, GETUTCDATE())) AS v(servername, TimeNow)) AS src
ON trgt.servername = src.servername
WHEN NOT MATCHED BY TARGET THEN
INSERT (servername, heartbeatdate)
VALUES (servername, TimeNow)
WHEN MATCHED AND src.TimeNow > dateadd(hour, @HeartbeatRepeatAlertThresholdHours, trgt.heartbeatdate) THEN
UPDATE SET heartbeatdate = src.TimeNow
;

IF @@ROWCOUNT > 0
BEGIN
PRINT N'Heartbeat failure for ' + QUOTENAME(@servername)

-- Get additional metadata about the server if it's also a SQL Server
DECLARE @SQLVersion NVARCHAR(4000), @SQLEdition NVARCHAR(4000)
SELECT
@SQLVersion = CASE
WHEN MajorVersionNumber = 10 AND MinorVersionNumber = 50 THEN '2008R2'
WHEN EngineEdition IN (5,6) THEN 'sql_azure'
ELSE
CASE MajorVersionNumber
WHEN 8 THEN '2000'
WHEN 9 THEN '2005'
WHEN 10 THEN '2008'
WHEN 11 THEN '2012'
WHEN 12 THEN '2014'
WHEN 13 THEN '2016'
WHEN 14 THEN '2017'
WHEN 15 THEN '2019'
WHEN 16 THEN '2022'
ELSE ''
END
END
,@SQLEdition = CASE [EngineEdition] -- https://docs.microsoft.com/en-us/sql/t-sql/functions/serverproperty-transact-sql
WHEN 2 THEN 'standard' -- also web and BI
WHEN 3 THEN 'enterprise' -- also datacenter, developer and evaluation
WHEN 4 THEN 'express'
WHEN 5 THEN 'azure_sql_database'
WHEN 6 THEN 'azure_sql_data_warehouse'
WHEN 8 THEN 'azure_managed_instance'
ELSE ''
END
FROM [dbo].[vwSqlServer]
WHERE [ObjectName] = @servername

/* ---- TODO: Add your alert logic below this line ---- */

/*-- Example 1: Sending an e-mail using DBMail:
EXEC msdb..sp_send_dbmail
@recipients = 'dba@acme.com'
, @subject = @title
, @body = @BodyText
, @body_format = 'HTML'
, @importance = 'high'
*/

/*-- Example 2: Throwing an error to fail the job:
--RAISERROR(N'%s', 16, 1, @BodyText);
*/

/*-- Example 3: Whatever else:
EXECUTE SendSNMPAlertCLR
@SourceServer = @servername,
@Subject = @title,
@Body = @BodyText,
@Severity = 3,
@SQLVersion = @SQLVersion,
@SQLEdition = @SQLEdition,
@Condition = @SubjectTitle,
@Site = @site
*/

/* ---- TODO: Add your alert logic above this line ---- */
END
END

CLOSE serverlist
DEALLOCATE serverlist

It's also available in my GitHub Gists here

What this script does, essentially, is as follows:

If it doesn't exist already, create a table called "heartbeat_log" which will save any triggered heartbeat alerts.
Delete from the heartbeat_log table any records related to monitoring services that already updated their heartbeat since their alert was triggered.
Declare a cursor to traverse any monitoring services with an outdated heartbeat time. This will NOT include monitoring services inside a snoozed site or when all targets are snoozed.
For each such monitoring service, construct an alert message, and add or update a record in the heartbeat_log table.
Once the alert message is ready, do something with it (such as sending an e-mail, raising an error, or whatever else you want).

The above script is supposed to be executed by a SQL Agent Job that you would create on the same SQL Server instance where you have your SQL Sentry repository database (this database by default is called "SentryOne" or "SQLSentry").

This "Heartbeat Monitoring" job should be scheduled with a frequency of a few minutes apart, as per your requirements.

Modify the script to customize it for your needs however you want (it's only a starting point, after all).

And that's it! You now have not only a monitoring service that monitors your SQL Servers, but also a SQL Server that monitors your monitoring service!

To learn more about SQL Sentry Troubleshooting, you can check out this article at the official SQL Sentry documentation site.