SolarWinds SQL Sentry is the leading monitoring and alerting platform for Microsoft Data Platform DBAs, and it is our platform of choice for our managed service customers. It uses a Windows service to monitor its targets. That's all great, but... What if this monitoring service is unavailable for some reason? How could you be alerted when your alerting service is down? Who is watching the watchers? Who is monitoring the monitors?
As a SQL Sentry MSP, we found a solution for this a long time ago. Join me in this blog post to learn how to do it yourself!
SQL Sentry has a built-in "heartbeat" mechanism, supposedly made exactly for this purpose. Every time a monitoring service connects to the SQL Sentry repository, it also updates a table row representing itself, with the current UTC time, to save its latest "heartbeat".
There is even a "Failsafe Condition" called "Monitoring Service: Send Heartbeat Error". But that is actually helpful only when the Monitoring Service itself is still up! After all, the monitoring service itself is the only entity that actually triggers alerts in SQL Sentry!
However, this "monitoring service heartbeat" is still something that could be of great use for us, as it is nothing more than a database table field.
This data is saved in the SQL Sentry table "ManagementEngine", in the column called "HeartbeatDateTime":
Checking the Pulse
Checking the last heartbeat for each monitoring service would be as easy as querying from a table. Although, it's often not as easy as that, as we also need to take into consideration when was the last time that an alert was already dispatched and whether we need to re-alert again or not.
Luckily for you, I have already prepared a T-SQL script for you that implements (almost) all the necessary logic for you:
-- Run this in the right SQL Sentry database --USE [SentryOne]; --USE [SQLSentry]; GO IF OBJECT_ID('[dbo].[heartbeat_log]') IS NULL BEGIN CREATE TABLE [dbo].[heartbeat_log]( [servername] [nvarchar](300) CONSTRAINT PK_Heartbeat_Log PRIMARY KEY CLUSTERED WITH(IGNORE_DUP_KEY=ON,DATA_COMPRESSION=PAGE), [heartbeatdate] [datetime] NULL, [ActualHeartbeatDate] [datetime] NULL ); END GO DECLARE @HeartbeatThresholdMinutes INT = 5, @HeartbeatRepeatAlertThresholdHours INT = 1, @GlobalObjID UNIQUEIDENTIFIER SELECT @GlobalObjID = ObjectID FROM dbo.vwObjects_Global; DECLARE @SubjectTitle NVARCHAR(255), @BodyText NVARCHAR(4000); SET @SubjectTitle = N'Monitoring Service: Heartbeat Failure' -- Delete any irrelevant heartbeat failures from log DELETE HL -- optionally use this output to implement an "auto-resolve" logic: -- OUTPUT deleted.servername INTO @RestoredHeartbeats FROM dbo.heartbeat_log AS HL INNER JOIN dbo.ManagementEngine AS E ON HL.servername = E.servername WHERE E.HeartbeatDateTime > HL.HeartbeatDate -- Get all monitoring servers with outdated heartbeats and are not snoozed DECLARE @site NVARCHAR(200), @servername NVARCHAR(200), @HeartbeatDateTime datetime , @text NVARCHAR(200), @title NVARCHAR(200), @HeartbeatDateTimeLocal datetime; DECLARE serverlist CURSOR LOCAL FAST_FORWARD FOR SELECT T.[Name],E.[servername], ISNULL(E.HeartbeatDateTime, E.LastInitializationDateTime) FROM dbo.ManagementEngine E INNER JOIN [dbo].[Site] T ON T.ID = E.SiteID WHERE ISNULL(E.HeartbeatDateTime, E.LastInitializationDateTime) < DATEADD(MINUTE, -@HeartbeatThresholdMinutes, GETUTCDATE()) AND NOT EXISTS (SELECT st.ObjectID FROM dbo.SnoozeStatus st WHERE st.ObjectID IN (E.ObjectID,T.ObjectID,@GlobalObjID) ) OPEN serverlist WHILE 1=1 BEGIN FETCH NEXT FROM serverlist INTO @site, @servername, @HeartbeatDateTime IF @@FETCH_STATUS <> 0 BREAK; SET @HeartbeatDateTimeLocal = CONVERT(datetime, @HeartbeatDateTime AT TIME ZONE 'UTC' AT TIME ZONE 'Israel Standard Time' ) SET @title = @site + ': '+ QUOTENAME(@servername) + ' ' + @SubjectTitle SET @text = N'Monitor server <b>' + @servername + N'</b> has lost connection to the repository.<br/>Last heartbeat was at <b>' + convert(nvarchar(200),@HeartbeatDateTimeLocal, 121) + N' (IST)</b>' SET @BodyText = N'[Message]: ' + @text + N'<br/><br/>[Start Time (Local)]: ' + CONVERT(nvarchar(25), @HeartbeatDateTimeLocal, 121) + N'<br/>[Start Time (UTC)]: ' + CONVERT(nvarchar(25), @HeartbeatDateTime, 121) + N'<br/>[Duration]: ' + ISNULL(NULLIF(CONVERT(varchar(100), DATEDIFF(dd,0, GETUTCDATE()-@HeartbeatDateTime)), 0) + ' day(s), ', '') + CONVERT(varchar(100), GETUTCDATE()-@HeartbeatDateTime, 114) + N'<br/>----------------------------------------------------------------------' + N'<br/>[Timestamp (Local)]: ' + CONVERT(nvarchar(25), GETDATE(), 121) + N'<br/>[Timestamp (UTC)]: ' + CONVERT(nvarchar(25), GETUTCDATE(), 121) + N'<br/>[Generated By]: ' + @@SERVERNAME + N'<br/>[Monitor Type]: SQLSentryHeartbeatCheck<br/>[Condition]: ' + @SubjectTitle ; WITH trgt AS (SELECT * FROM dbo.heartbeat_log WHERE servername = @servername) MERGE INTO trgt USING (SELECT * FROM (VALUES(@servername, GETUTCDATE())) AS v(servername, TimeNow)) AS src ON trgt.servername = src.servername WHEN NOT MATCHED BY TARGET THEN INSERT (servername, heartbeatdate) VALUES (servername, TimeNow) WHEN MATCHED AND src.TimeNow > dateadd(hour, @HeartbeatRepeatAlertThresholdHours, trgt.heartbeatdate) THEN UPDATE SET heartbeatdate = src.TimeNow ; IF @@ROWCOUNT > 0 BEGIN PRINT N'Heartbeat failure for ' + QUOTENAME(@servername) -- Get additional metadata about the server if it's also a SQL Server DECLARE @SQLVersion NVARCHAR(4000), @SQLEdition NVARCHAR(4000) SELECT @SQLVersion = CASE WHEN MajorVersionNumber = 10 AND MinorVersionNumber = 50 THEN '2008R2' WHEN EngineEdition IN (5,6) THEN 'sql_azure' ELSE CASE MajorVersionNumber WHEN 8 THEN '2000' WHEN 9 THEN '2005' WHEN 10 THEN '2008' WHEN 11 THEN '2012' WHEN 12 THEN '2014' WHEN 13 THEN '2016' WHEN 14 THEN '2017' WHEN 15 THEN '2019' WHEN 16 THEN '2022' ELSE '' END END ,@SQLEdition = CASE [EngineEdition] -- https://docs.microsoft.com/en-us/sql/t-sql/functions/serverproperty-transact-sql WHEN 2 THEN 'standard' -- also web and BI WHEN 3 THEN 'enterprise' -- also datacenter, developer and evaluation WHEN 4 THEN 'express' WHEN 5 THEN 'azure_sql_database' WHEN 6 THEN 'azure_sql_data_warehouse' WHEN 8 THEN 'azure_managed_instance' ELSE '' END FROM [dbo].[vwSqlServer] WHERE [ObjectName] = @servername /* ---- TODO: Add your alert logic below this line ---- */ /*-- Example 1: Sending an e-mail using DBMail: EXEC msdb..sp_send_dbmail @recipients = 'firstname.lastname@example.org' , @subject = @title , @body = @BodyText , @body_format = 'HTML' , @importance = 'high' */ /*-- Example 2: Throwing an error to fail the job: --RAISERROR(N'%s', 16, 1, @BodyText); */ /*-- Example 3: Whatever else: EXECUTE SendSNMPAlertCLR @SourceServer = @servername, @Subject = @title, @Body = @BodyText, @Severity = 3, @SQLVersion = @SQLVersion, @SQLEdition = @SQLEdition, @Condition = @SubjectTitle, @Site = @site */ /* ---- TODO: Add your alert logic above this line ---- */ END END CLOSE serverlist DEALLOCATE serverlist
What this script does, essentially, is as follows:
If it doesn't exist already, create a table called "heartbeat_log" which will save any triggered heartbeat alerts.
Delete from the heartbeat_log table any records related to monitoring services that already updated their heartbeat since their alert was triggered.
For each such monitoring service, construct an alert message, and add or update a record in the heartbeat_log table.
Once the alert message is ready, do something with it (such as sending an e-mail, raising an error, or whatever else you want).
The above script is supposed to be executed by a SQL Agent Job that you would create on the same SQL Server instance where you have your SQL Sentry repository database (this database by default is called "SentryOne" or "SQLSentry").
This "Heartbeat Monitoring" job should be scheduled with a frequency of a few minutes apart, as per your requirements.
Modify the script to customize it for your needs however you want (it's only a starting point, after all).
And that's it! You now have not only a monitoring service that monitors your SQL Servers, but also a SQL Server that monitors your monitoring service!
To learn more about SQL Sentry Troubleshooting, you can check out this article at the official SQL Sentry documentation site.