This month's #TSQL2sday is hosted by Brent Ozar who invites us to write about the Most Recent Issue You Closed.
One of my most recent issues was an interesting one - involving AlwaysOn Availability Groups and a valuable lesson to be learned about cluster configurations.
The incident started with a late-night phone call from one of our customers (it's always a late-night phone call, isn't it?).
They reported that during a DR exercise on their production environment (Chaos Engineering, anyone?) their entire cluster failed and they weren't able to bring any of the replicas back online.
Their description of their DR exercise was as follows:
"We wanted to make sure that our DR site can be operational when our main site is down. So, we gradually took our main site down until only the DR was left, but instead of remaining operational, it shut down entirely, and now we can't bring anything back even after turning the main site back online."
To make things a bit easier to understand, I prepared the diagram below to illustrate what their main site and DR site consisted of:
On their "main" production site, they had 2 SQL Server replicas: PROD-1 and PROD-2.
On their "DR" site, they had only one SQL Server replica: PROD-3-DR (to save costs. "it's only a DR, after all").
And to sweeten the deal, they also had a "File Share Witness" as part of the cluster Quorum, accessible from both sites.
So, the steps they took during the DR exercise were in fact:
Fail-over from PROD-1 to PROD-2
Shut down PROD-1
Fail-over from PROD-2 to PROD-3-DR
Shut down PROD-2
and this is when PROD-3-DR suddenly took itself down - and everything else with it.
But... Why?
So, what happened here? Why did PROD-3-DR suddenly take a nose-dive without anyone asking?
Well, this behavior is actually intentional: it's called a "Quorum Failure", and it's part of how this sort of cluster is supposed to work.
It works this way to prevent something called a "Split-Brain Scenario".
What is a "Split-Brain Scenario"?
It's what happens when a node in the cluster thinks that all of the other nodes are down, concludes that it's the 'last one standing', and therefore decides it should become the new primary node. The problem is that this can technically happen to more than one node at the same time - for example, when the network connection between the nodes is lost. In such a scenario, several nodes are online, but none of them knows that the others are online too, so you end up with more than one node acting as the primary, writeable node.
In other words, you get more than one node acting as the "brain" of the cluster - hence the "split-brain" scenario. If such a scenario were allowed, it could cause serious data corruption and data loss.
Coming back to our incident, PROD-3-DR remained the 'last one standing'. It was only itself and the fileshare witness that remained active.
That means a vote of 2 (one DR replica, and the fileshare witness, each being worth 1 vote), versus 2 other replicas that were down.
However, with all due respect to Democracy, 2 versus 2 is not a 'vote majority' (a majority of 4 votes requires at least 3). Technically, it's possible that PROD-1 and PROD-2 were actually still online and only the network got disconnected, isolating PROD-3-DR and the fileshare witness. In that case, either PROD-1 or PROD-2 could have become the primary node as well, causing the split-brain scenario.
Instead, the Quorum mechanism simply concludes that, because there is no 'vote majority', quorum has been lost: to avoid data corruption, it takes the entire cluster down and prevents any replica from coming online (recovering from this is what the documentation calls disaster recovery through "forced quorum").
In this scenario, ALL replicas are essentially "stuck" in the "Resolving" state, even after you bring all of the instances back online.
How to bring it back online?
So how do we recover the cluster from this quorum failure with minimal data loss?
By executing the following command on the replica that was the last one acting as the primary:
ALTER AVAILABILITY GROUP your_AG_name_here FORCE_FAILOVER_ALLOW_DATA_LOSS;
You run this on a secondary replica to FORCE it to become the new primary - basically hitting "I ACCEPT" on the disclaimer that this might mean data loss (which is why it's important to run it on the replica that was the most recent primary, to minimize that data loss).
In our case, this replica would be PROD-3-DR because it was the last one acting as primary before the cluster got shut down.
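One important caveat: if the other nodes are still down, the WSFC cluster service on the surviving node won't start on its own (it has no quorum), so it may first need to be started in forced-quorum mode - that's the "forced quorum" part. Here's a minimal sketch of what that might look like, assuming we're running it on PROD-3-DR itself (the node name is just a placeholder for your own environment):

# Run in an elevated PowerShell session on the surviving node (PROD-3-DR in our case).
# Force the cluster service to start without a quorum majority:
Start-ClusterNode -Name "PROD-3-DR" -FixQuorum

# The classic command-line equivalent would be: net start clussvc /forcequorum

# Verify the node is up before moving on to the T-SQL force-failover command:
Get-ClusterNode -Name "PROD-3-DR" | Select-Object Name, State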
That's not enough, though: after forcing the failover, we would also need to manually resume each suspended database on each secondary replica.
(source)
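For reference, here's a minimal sketch of what that looks like, assuming a placeholder database named YourDatabase - it should be run on each secondary replica, once per suspended database:

-- Find databases whose data movement is currently suspended on this replica:
SELECT DB_NAME(database_id) AS database_name, is_suspended
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1 AND is_suspended = 1;

-- Resume data movement for each suspended database (the database name is a placeholder):
ALTER DATABASE [YourDatabase] SET HADR RESUME;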
Once we finish resuming all databases on all secondary replicas, we can fail-over to whichever replica we actually want to become primary again, and continue on with our lives.
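A sketch of that final, regular (non-forced) fail-over, using the same placeholder availability group name as above - run it on the replica that should become the new primary, once it's back in a SYNCHRONIZED state:

-- Run on the target replica once its databases are synchronized (requires synchronous-commit mode):
ALTER AVAILABILITY GROUP your_AG_name_here FAILOVER;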
Preventing this from recurring
So, how are we to prevent this scenario from recurring? Specifically, preventing the entire cluster from taking itself down?
Well, first of all, we must avoid having an even number of votes in the cluster.
In this specific incident, there were 4 total votes in the cluster:
PROD-1
PROD-2
PROD-3-DR
The Fileshare Witness (accessible on both sites)
As long as there is an even number of votes, you can end up with a tie (2 vs. 2 in this case), and a tie is not a majority - which is exactly how the whole cluster ended up taking itself down.
The best practice would be to always have an odd number of total votes in your cluster, so that there would always be a tie-breaker.
In our case, this could be done by doing one of the following:
Remove PROD-2 from the cluster, or:
Add a fourth server to the DR site - like PROD-4-DR, or:
Remove the quorum vote from one of the main-site replicas - for example, set the NodeWeight of PROD-2 to 0 so that it no longer counts towards the quorum (a node's vote can only be 0 or 1, so you can't simply give PROD-3-DR a double vote) - this is typically done via command line or PowerShell, as shown in the sketch below.
This way, the DR site could function even if all replicas in the main site are down, and vice versa - the main site could function even if the entire DR site is down.
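For the third option, here's a minimal sketch of what that vote adjustment might look like using the FailoverClusters PowerShell module (the node names are placeholders from the diagram above):

# List the current vote assignment of each node (NodeWeight: 1 = has a vote, 0 = no vote):
Get-ClusterNode | Select-Object Name, State, NodeWeight

# Remove the quorum vote from PROD-2:
(Get-ClusterNode -Name "PROD-2").NodeWeight = 0

# Verify the change:
Get-ClusterNode | Select-Object Name, NodeWeight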
DR or HA? Can't we have both?
An interesting question that was raised by this client was this:
Could we make this cluster work even in case one of the replicas in one site is down and also the other site is down? For example, can the cluster work if PROD-1 is down, the entire DR site is down, and only PROD-2 and the fileshare witness are online?
Well, if we were to leave the cluster configuration as is, we would have the same quorum failure taking the cluster down because, once again, we'd get a 2 vs. 2 tie (PROD-2 + fileshare witness vs. PROD-1 + PROD-3-DR).
Even if we were to fix the total vote count to an odd number (for example, by removing PROD-2's vote as suggested above), we would still find ourselves in a bad spot: in this scenario the only surviving vote would be the fileshare witness - 1 out of 3 - which is still not a majority. This, too, would cause the cluster to lose quorum and shut itself down.
The only way to have PROD-2 stay online as the only primary node in this scenario would be by changing the Quorum Mode entirely.
You see, in our original scenario, the quorum mode we were working with is called "Node and File Share Majority".
There are a few other Quorum Modes supported in Windows clusters, but there's only one mode that can support the scenario this client was asking for:
No Majority: Disk Only.
In this Quorum Mode, a shared disk cluster resource is designated as the one and only witness, and connectivity by any node to that shared disk is counted as an affirmative vote (or rather "heartbeat").
(source)
In other words: If you're one of the cluster nodes and you can't access the shared disk, then you ain't living. You're out of the game.
There is no 'vote majority' here, no 'vote minority', and no 'split-brain'. Either you're "alive" or you're not.
In this case, if PROD-2 is the only living replica then it can definitely become the primary replica, even if it's in the "minority". Because there's no "minority" or "majority" here. There's just life, and death, and the Shared Quorum Disk sitting in the middle deciding everyone's fate.
Well, I guess that sometimes, Democracy just isn't the right way to go - at least when it comes to cluster quorums ;)
This also means, of course, that your shared Quorum disk becomes the single point of failure. So, you may need to work hard to keep it highly available.
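For completeness, here's a minimal sketch of switching to this quorum mode with the FailoverClusters PowerShell module, assuming the shared disk cluster resource is named "Cluster Disk 1" (a placeholder):

# Show the current quorum configuration:
Get-ClusterQuorum

# Switch to the "No Majority: Disk Only" quorum mode, using the shared disk as the only witness:
Set-ClusterQuorum -DiskOnly "Cluster Disk 1"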
Recommended Adjustments to Quorum Voting
Microsoft documentation offers the general recommendations below.
When enabling or disabling a given WSFC node's vote, follow these guidelines:
No vote by default. Assume that each node should not vote without explicit justification.
Include all primary replicas. Each WSFC node that hosts an availability group primary replica or is the preferred owner of an FCI should have a vote.
Include possible automatic failover owners. Each node that could host a primary replica, as the result of an automatic availability group failover or FCI failover, should have a vote. If there is only one availability group in the WSFC cluster and availability replicas are hosted only by standalone instances, this rule includes only the secondary replica which is the automatic failover target.
Exclude secondary site nodes. In general, do not give votes to WSFC nodes that reside at a secondary disaster recovery site. You do not want nodes in the secondary site to contribute to a decision to take the cluster offline when there is nothing wrong with the primary site.
Odd number of votes. If necessary, add a witness file share, a witness node, or a witness disk to the cluster and adjust the quorum mode to prevent possible ties in the quorum vote.
Re-assess vote assignments post-failover. You do not want to fail over into a cluster configuration that does not support a healthy quorum.
(source)
Dynamic Quorum Management
In Windows Server 2012, as an advanced quorum configuration option, you can choose to enable dynamic quorum management by cluster. For more details on how dynamic quorum works, see this explanation about Dynamic Quorum Behaviour.
With dynamic quorum management, it is also possible for a cluster to run on the last surviving cluster node. By dynamically adjusting the quorum majority requirement, the cluster can sustain sequential node shutdowns to a single node.
(source)
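If you're curious whether dynamic quorum is enabled on your own cluster (it is by default starting with Windows Server 2012), here's a small sketch using the FailoverClusters PowerShell module; note that the per-node DynamicWeight column is only available on Windows Server 2012 R2 and later:

# 1 = dynamic quorum management is enabled, 0 = disabled:
(Get-Cluster).DynamicQuorum

# See each node's assigned vote (NodeWeight) and its current dynamically-adjusted vote (DynamicWeight):
Get-ClusterNode | Select-Object Name, State, NodeWeight, DynamicWeight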
Want to learn more? Here are some additional resources:
Disaster Recovery through forced quorum - SQL Server Always On | Microsoft Learn
WSFC quorum modes & voting configuration - SQL Server Always On | Microsoft Learn
Manually force a failover of an availability group - SQL Server Always On | Microsoft Learn
Failover Cluster Step-by-Step Guide: Configuring the Quorum in a Failover Cluster | Microsoft Learn
Configure and manage the quorum in a failover cluster | Microsoft Learn
Configure cluster quorum NodeWeight settings - SQL Server Always On | Microsoft Learn
Configure cluster quorum - SQL Server on Azure VMs | Microsoft Learn
Windows Failover Cluster Quorum Modes in SQL Server Always On Availability Groups (sqlshack.com)
Understanding Quorum in a Failover Cluster - Microsoft Community Hub