One of our clients in the gaming industry wanted to set up a high availability solution in their data center in the UK. They had a single standalone SQL Server 2014 instance (Enterprise Edition) running on Windows Server 2012 R2, and they wanted to make sure that if anything happened to the server, the website could immediately continue to operate as usual. In addition, due to regulations, they needed to maintain a separate copy of the data in a different data center in Malta. Another important requirement was to implement a solution that does not require a domain controller. They didn’t use Active Directory, and they preferred not to start using it, if possible.
So what would you do? What would you offer the client to do?
After some discussion with the client, we decided to set up an AlwaysOn Availability Group with 3 replicas: a primary replica in the UK, a secondary replica in the UK with synchronous commit and automatic failover, and a third replica in Malta with asynchronous commit to satisfy the regulatory requirement. By leveraging the capability of the availability group to have multiple replicas in different networks with different synchronization modes, we could address both requirements (high availability and regulation) in a single solution.
But what about the requirement to do all this without a domain controller? Setting up an availability group requires a Windows Server Failover Cluster (WSFC), and the cluster requires Active Directory… Fortunately, Windows Server 2016 has a new feature that allows you to create a workgroup cluster without Active Directory. SQL Server 2016 can leverage this new feature, and it supports availability groups in such an environment. So we decided to set up 3 new servers with Windows Server 2016 and SQL Server 2016 (Enterprise Edition).
Creating the Windows Server Failover Cluster
I installed Windows Server 2016 on all 3 servers with the exact same version, edition and build number. It’s important for all the members of the cluster to have the same configuration. It’s even more important to verify this in the case of a workgroup cluster, because there is no option to enforce a policy on all nodes through Active Directory.
I also made sure that all 3 servers have the same disk layout with the same drive letters. I used C: for the operating system, D: for the data files, E: for the log files and F: for the backup files. This layout was consistent on all servers.
Initially the two sites (UK and Malta) weren’t connected through a VPN, but rather through the internet. This required opening a lot of ports between the servers in order to create the cluster. After some discussions, we decided to set up a VPN between the sites. This is the recommended approach, and it made life much easier and also more secure.
I added the “Failover Clustering” Windows feature to all 3 servers.
I added the same primary DNS suffix to all 3 servers. This operation required a restart of the servers.
I chose an available IP address in each subnet (UK and Malta) to be the cluster IP addresses. The client didn’t use any internal DNS server, so I added to the hosts file of each server the other two servers as well as the two cluster IP addresses. Each server had 4 entries in the hosts file (excluding itself).
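As an illustration, the hosts file on Server A might have looked roughly like this — the names and addresses below are placeholders, not the client’s actual values:

```text
# Other cluster nodes (placeholder names and addresses)
10.0.1.11    ServerB.DNSSuffix
10.0.2.11    ServerC.DNSSuffix
# Cluster IP addresses, one per subnet (UK and Malta)
10.0.1.20    ClusterName
10.0.2.20    ClusterName
```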
I created a new Windows account on all 3 servers (with the exact same name and password) to create and manage the cluster. I also added this account to the local Administrators group. Then I logged in with the new account.
I executed the following PowerShell script to allow the new Windows account to create the cluster:
new-itemproperty -path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System -Name LocalAccountTokenFilterPolicy -Value 1
I created the cluster using the following PowerShell script:
new-cluster -name ClusterName -Node ServerA.DNSSuffix,ServerB.DNSSuffix,ServerC.DNSSuffix -StaticAddress UKIPAddress,MaltaIPAddress -NoStorage -AdministrativeAccessPoint DNS
I ran all validation tests against the cluster. All tests passed successfully. There were some warnings, but none of them was critical. Here are the important warnings that I extracted from the report, along with my comments:
“All the servers are not joined to a domain. Ensure that the nodes have a consistent configuration. Multi-domain and Workgroup clusters (clusters with non-domain joined nodes) introduce higher risk of configuration drift. When deploying ensure that the same set of Windows patches are applied to all nodes in the cluster. Also, if group policies are rolled out to the cluster nodes, they should not be conflicting. Finally, ensure that the cluster node and network names are replicated to the DNS servers authoritative for the cluster nodes.” As mentioned above, it is important to make sure that all 3 servers have the same OS configuration. This is also important going forward. If we install any patches or use any group policies, we should apply them on all 3 servers. As for the DNS servers, the client doesn’t use them in this environment. We use IP addresses or the hosts files, so this is not an issue.
“The DNS Suffix Search Lists do not match for all nodes or is unconfigured for some nodes.” This warning was displayed because we were not using a DNS server. Again, this is not an issue, because we don’t rely on a DNS server in this environment.
“Node ServerA is reachable from Node ServerB by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.” This warning was generated for every pair of nodes in every direction (6 warnings altogether). It’s quite self-explanatory…
“The cluster is not configured with a quorum witness. As a best practice, configure a quorum witness to help achieve the highest availability of the cluster.” This is an important issue. We discussed this with the client. For now, we decided not to use a quorum witness, but rather use the 3 servers with equal votes. This gives us an odd number of votes, which is good for high availability in case of a single server failure. In the future, we might remove the vote from Server C (Malta) and add a cloud witness instead. This is a better solution, because it does not rely on Server C being up and running in order to provide high availability for the servers in the UK.
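The validation report mentioned above can be generated from PowerShell. This is just a sketch — the node names are the same placeholders used earlier, and the cmdlet produces an HTML report in the current user’s temp folder by default:

```powershell
# Run the full cluster validation suite against all 3 nodes
# and review the generated HTML report for warnings
Test-Cluster -Node ServerA.DNSSuffix, ServerB.DNSSuffix, ServerC.DNSSuffix
```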
Installing SQL Server
I created a local Windows account with the same name and password on all 3 servers, which I would use as the logon account for all the SQL Server services.
Then, I installed SQL Server 2016 Enterprise Edition on all servers:
I chose the appropriate disks and folders for the data, log and backup files.
I chose the same collation on all 3 instances.
On the services tab, I used the newly created account mentioned above as the logon account for all the SQL Server services.
I chose mixed authentication, and specified the same password for the sa account on all 3 instances.
I enabled Instant File Initialization.
Finally, I downloaded and installed service pack 1 for SQL Server 2016.
Creating the Availability Group
I copied all the application SQL Server logins from the current production instance to the two UK instances (we don’t need these logins in Server C, because it is not intended to serve the application). I used this article to accomplish this task. This method copies the SIDs and the passwords with the logins.
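The key point when copying the logins this way is to preserve the original SID and password hash, so that the database users in the restored databases map to the logins without becoming orphaned. A generated statement looks roughly like this — the login name is a placeholder, and the hash and SID values are elided:

```sql
-- A sketch of a generated login script (placeholder name; hash and SID
-- values come from the source instance and are elided here)
CREATE LOGIN [AppLogin]
    WITH PASSWORD = 0x0200... HASHED,
    SID = 0x8A3D...,
    CHECK_POLICY = OFF;
```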
I enabled the Availability Groups feature in SQL Server Configuration Manager on all 3 instances. This operation required a restart of the SQL Server service.
I created a master key and a certificate in the master database of each instance. Then, I backed up the public key of the certificate.
I created a database mirroring endpoint on each instance.
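Since there is no Active Directory, the endpoints authenticate each other with certificates instead of Windows authentication. A sketch of these two steps on one instance might look like this — the certificate name, endpoint name, passwords, paths and port are placeholders, not necessarily the ones I used:

```sql
-- On each instance (names, paths and passwords are placeholders)
USE master;

CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPassword!1';

CREATE CERTIFICATE ServerA_Cert WITH SUBJECT = 'HADR endpoint certificate';

-- Back up the public key so the other servers can import it
BACKUP CERTIFICATE ServerA_Cert TO FILE = 'F:\Certs\ServerA_Cert.cer';

-- Database mirroring endpoint authenticated by the certificate
CREATE ENDPOINT Hadr_endpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATABASE_MIRRORING (
        AUTHENTICATION = CERTIFICATE ServerA_Cert,
        ENCRYPTION = REQUIRED ALGORITHM AES,
        ROLE = ALL
    );
```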
I copied the 3 certificate backup files to all 3 servers to the same locations, so that each server had the 3 certificate backups in the same folder.
On each instance I did the following for each one of the other instances:
Create a SQL Server login for the other server.
Create a user for the login in the master database.
Import the certificate of the other server from the backup, and make the login its owner.
Grant CONNECT permission on the endpoint to the login.
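On Server A, for example, the four steps for Server B might look like this (the login, user, certificate and endpoint names, as well as the path and password, are placeholders):

```sql
-- On Server A, for Server B (placeholder names, path and password)
USE master;

CREATE LOGIN ServerB_Login WITH PASSWORD = 'StrongPassword!2';

CREATE USER ServerB_User FOR LOGIN ServerB_Login;

-- Import Server B's public key and make the login its owner
CREATE CERTIFICATE ServerB_Cert
    AUTHORIZATION ServerB_User
    FROM FILE = 'F:\Certs\ServerB_Cert.cer';

GRANT CONNECT ON ENDPOINT::Hadr_endpoint TO ServerB_Login;
```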
I verified that all the relevant user databases (there were 5 of them) were in the Full recovery model and with the AUTO_CLOSE property disabled.
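A quick way to verify this is a query against sys.databases that returns only the databases that are not yet eligible — a sketch:

```sql
-- User databases that are not yet eligible for an availability group:
-- not in the Full recovery model, or with AUTO_CLOSE enabled
SELECT name, recovery_model_desc, is_auto_close_on
FROM sys.databases
WHERE database_id > 4
  AND (recovery_model_desc <> 'FULL' OR is_auto_close_on = 1);
```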
We stopped the application, and then I took a full backup of the databases on the current production instance. I copied the database backups to Server A and restored them. We then modified the connection strings in the application to use the IP address of Server A, and brought the application back online. The databases were not too large, so we only had a downtime of around 20 minutes. This was acceptable to the client. If it wasn’t, we could, of course, have chosen another method, such as log shipping, in order to move the databases to Server A with minimal downtime.
I took another full backup of all the databases in Server A.
On Server A, I created the availability group on all 5 databases and all 3 replicas. I configured Server A and Server B as synchronous with automatic failover, and neither of them allows read connections as a secondary. As for Server C, I configured it as asynchronous, and it allows all connections as a secondary (because the client is going to use it for reporting).
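The replica configuration described above can be sketched as follows — the availability group name, database names, server names and port are placeholders:

```sql
-- On Server A (all names and the port are placeholders)
CREATE AVAILABILITY GROUP AppAG
FOR DATABASE DB1, DB2, DB3, DB4, DB5
REPLICA ON
    'ServerA' WITH (
        ENDPOINT_URL = 'TCP://ServerA.DNSSuffix:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)),
    'ServerB' WITH (
        ENDPOINT_URL = 'TCP://ServerB.DNSSuffix:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)),
    'ServerC' WITH (
        ENDPOINT_URL = 'TCP://ServerC.DNSSuffix:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));
```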
I added a listener to the availability group with new available IP addresses in UK and Malta, and with port 1433.
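Because the replicas sit in two different subnets, the listener needs one IP address per subnet. A sketch, with placeholder names, addresses and subnet masks:

```sql
-- Multi-subnet listener (name, IP addresses and masks are placeholders)
ALTER AVAILABILITY GROUP AppAG
ADD LISTENER 'AppListener' (
    WITH IP (
        ('10.0.1.30', '255.255.255.0'),   -- UK subnet
        ('10.0.2.30', '255.255.255.0')),  -- Malta subnet
    PORT = 1433);
```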
I then joined Server B and Server C to the availability group.
I took a log backup of all the user databases in Server A.
I copied all the backup files (full and log) to the same location in the two other servers (Server B and Server C).
Then, I restored the full backups of all 5 databases with NORECOVERY on Server B and Server C, and then I restored the log backups of all 5 databases with NORECOVERY.
I moved the 5 databases in Server B and Server C into the availability group.
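The join-restore-join sequence on each secondary can be sketched like this for one database (the group name, database name and backup paths are placeholders):

```sql
-- On Server B and Server C (names and paths are placeholders)
ALTER AVAILABILITY GROUP AppAG JOIN;

-- Restore the full and log backups taken on Server A, keeping the
-- database in the restoring state
RESTORE DATABASE DB1 FROM DISK = 'F:\Backup\DB1.bak' WITH NORECOVERY;
RESTORE LOG DB1 FROM DISK = 'F:\Backup\DB1.trn' WITH NORECOVERY;

-- Move the database into the availability group
ALTER DATABASE DB1 SET HADR AVAILABILITY GROUP = AppAG;
```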
At this point, the availability group was up and running. I verified its health through the SSMS dashboard, some DMVs and the Failover Cluster Manager. I also verified that I can use the listener to access the primary replica, that I can’t read from Server B, and that I can read from Server C.
Here is the full script I used (more or less) in order to create the availability group.
I copied the maintenance plans from the old production instance to Server A and Server B (export from msdb and import into msdb). Then, I changed the server connection on all the tasks and recreated the schedules. I then disabled the corresponding jobs on Server B.
I scripted and copied all the jobs from the old production instance to Server A and Server B, except for the maintenance plan jobs and the default “syspolicy_purge_history” job. In addition, I verified that they all have the same owner, and I disabled all the jobs on Server B.
I created an alert for availability group role change (error 1480) on both instances (Server A and Server B). This alert will fire a new job, which I called “HandleJobs”. This job will execute the “dbo.usp_EnableOrDisableJobs” stored procedure in the master database. The procedure will use the “dbo.fn_hadr_group_is_primary” function (also in the master database) to check whether the current replica is the primary replica or not. If it is, then the procedure will enable the relevant jobs. If the current replica is not the primary replica, then the procedure will simply disable all jobs.
I created the “dbo.fn_hadr_group_is_primary” function, the “dbo.usp_EnableOrDisableJobs” stored procedure and the “HandleJobs” job on both instances (Server A and Server B). Here is the full script.
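The core of the function is a lookup against the availability group DMVs. This is only a sketch of the idea, not necessarily the exact script linked above:

```sql
-- Returns 1 if this instance currently hosts the primary replica
-- of the given availability group (a sketch of the idea)
CREATE FUNCTION dbo.fn_hadr_group_is_primary (@AGName sysname)
RETURNS bit
AS
BEGIN
    DECLARE @PrimaryReplica sysname;

    SELECT @PrimaryReplica = ags.primary_replica
    FROM sys.dm_hadr_availability_group_states AS ags
    INNER JOIN sys.availability_groups AS ag
        ON ag.group_id = ags.group_id
    WHERE ag.name = @AGName;

    IF @PrimaryReplica = @@SERVERNAME
        RETURN 1;

    RETURN 0;
END;
```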
I also used the role change alert to send an email notification to the operations team. In order to do that, I created a new database mail profile and account on both instances, and enabled that profile for SQL Server Agent. I tested the mail profile, and it sent emails successfully. Of course, I also created an operator for the operations team.
I created two additional alerts: one for when data movement is suspended (error 35264), and the other when data movement is resumed (error 35265).
For all 3 alerts, I configured a delay between responses of 60 seconds.
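The three alerts can be sketched with sp_add_alert — the alert names below are placeholders, while the error numbers, the 60-second delay and the "HandleJobs" job name come from the steps above:

```sql
-- Alert on availability group role change (error 1480),
-- invoking the "HandleJobs" job as a response
EXEC msdb.dbo.sp_add_alert
    @name = N'AG Role Change',
    @message_id = 1480,
    @delay_between_responses = 60,
    @job_name = N'HandleJobs';

-- Alerts on data movement suspended / resumed
EXEC msdb.dbo.sp_add_alert
    @name = N'AG Data Movement Suspended',
    @message_id = 35264,
    @delay_between_responses = 60;

EXEC msdb.dbo.sp_add_alert
    @name = N'AG Data Movement Resumed',
    @message_id = 35265,
    @delay_between_responses = 60;
```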
We copied all other server resources, such as folders, files and SSIS packages, from the old production instance to Server A and Server B.
I verified that there are no other instance-level objects that should be copied from the old production instance to Server A and Server B.
We changed all the connection strings in the application from the IP address of Server A to the listener IP address. I verified that everything is working properly.
We performed a failover test by manually failing over from Server A to Server B. The test was successful.
We then performed a second test for failing back to Server A. In this test, I stopped the SQL Server service on Server B in order to test that the cluster was working properly. The test was successful, but the jobs still weren’t enabled/disabled properly. This was caused by two different issues. On Server A, an alert was raised at the time of failover, and as a response to the alert, the “HandleJobs” job was invoked to enable/disable all the other jobs. The job was supposed to enable the other jobs only if the current instance was the primary replica. But at this point, the state of the replica was “Resolving”, so the jobs weren’t enabled. As a solution, I added a 5-second delay before checking the state of the current instance. On Server B, the alert didn’t fire, because the service was down, so there was nothing to invoke the job. As a solution, I added a schedule to the job, so that it runs automatically when the SQL Server Agent service starts. I have already included both fixes in the script (see step 4 above).
Windows Server 2016 allows you to create a Workgroup Cluster between servers that are not joined to a domain. SQL Server 2016 can leverage this new capability, allowing you to create an availability group between such servers without the requirement for Active Directory.
SQL Server also allows you to create an availability group over multiple replicas in different subnets, while configuring each replica separately with properties such as synchronization mode, automatic failover, readability as a secondary, etc. These capabilities are not new in SQL Server 2016, but they certainly helped me in this scenario.
In this post I described, step by step, how I created an availability group on top of 3 replicas between two different sites, without a domain controller.