The Strange Story About the MERGE that Killed the Cluster

Madeira Team
Feb 20, 2014
2 min read

I don’t like the MERGE statement. The syntax is weird, it’s tricky in terms of locks, and it has a history of weird errors. The other day a developer wanted my help in troubleshooting a MERGE that caused an ACCESS VIOLATION error and a stack dump generation. That led me to tweet this:

It seems the MERGE statement will eventually destroy the universe — Matan Yungman (@MatanYungman) February 11, 2014

I may have exaggerated a bit. It might be limited just to our galaxy (thanks Doc Brown). Anyways, we started troubleshooting and eventually figured out the above error happens because of a specific code section (the whole scenario is dependent on other factors and is very hard to reproduce. We opened a bug and I’ll update when I have more details).

The problematic code block looked like this:

Transact-SQL

DECLARE @Var int = 0; MERGE TargetTable AS target USING (Column_A, Column_B, Column_C FROM SourceTable) AS source (Column_A, Column_B, Column_C) ON (target.Column_A = source.Column_A) WHEN MATCHED THEN UPDATE SET @Var = 1 WHEN NOT MATCHED THEN INSERT (Column_A, Column_B, Column_C) VALUES (Column_A, Column_B, Column_C);

DECLARE @Var int = 0;

MERGE TargetTable AS target

USING (Column_A, Column_B, Column_C FROM SourceTable)

AS source (Column_A, Column_B, Column_C)

ON (target.Column_A = source.Column_A)

WHEN MATCHED THEN

UPDATE SET @Var = 1

WHEN NOT MATCHED THEN

INSERT (Column_A, Column_B, Column_C)

VALUES (Column_A, Column_B, Column_C);

Unlike me, the developer likes MERGE, and what he accomplished here is “Insert where not exists”. But for some reason, updating the variable was problematic.

Meanwhile:

We started getting alerts about an Availability Group that kept failing. I went to Failover Cluster Manager and saw the Availability Group was offline. I brought it back online, but it failed again almost immediately. I did it a few more times, but the Availability Group refused to stay online.

And then I recalled I once saw a video by Jonathan Kehayias, where he talked about “FailureConditionLevel”, a Windows Cluster configuration option that determines the severity at which there will be a cluster Failover:

0 – No automatic failover
1 – Failover/Restart when the primary node is down
2 – Failover/Restart when the SQL Server service is up but not responding
3 – Failover/Restart on critical errors like a stack dump
4 – Failover/Restart when a resource is unhealthy, for example, memory
5 – Failover/Restart on any qualifying condition and when there’s a query processing error

* Based on the output of sp_server_diagnostics. Click here for a more detailed exaplanation

The value in our cluster was the default – 3. Obviously the cluster kept failing and stayed offline because of the stack dumps generated by the MERGE statement. I changed the value to 1, and the cluster resource stopped failing.

We then moved on to fixing the MERGE code, which was very simple: Omitting the WHEN MATCHED part.