Availability Group on SQL Server 2016
One of our clients in the gaming industry wanted to set up a high availability solution in their data center in the UK. They had a single standalone SQL Server 2014 instance (Enterprise Edition) running on Windows Server 2012 R2, and they wanted to make sure that if anything happened to the server, the website could immediately continue to operate as usual. In addition, due to regulations, they needed to maintain a separate copy of the data in a different data center in Malta. Another important requirement was to implement a solution that does not require a domain controller. They didn’t use Active Directory, and they preferred not to start using one, if possible.
So what would you do? What would you offer the client to do?
After some discussion with the client, we decided to set up an AlwaysOn Availability Group with 3 replicas: a primary replica in the UK, a secondary replica in the UK with synchronous commit and automatic failover, and a third replica in Malta with asynchronous commit to satisfy the regulatory requirement. By leveraging the capability of the availability group to have multiple replicas in different networks with different synchronization modes, we can address both requirements (high availability and regulation) in a single solution.
But what about the requirement to do all this without a domain controller? Setting up an availability group requires a Windows Server Failover Cluster (WSFC), and the cluster requires Active Directory… Fortunately, Windows Server 2016 has a new feature that allows you to create a workgroup cluster without Active Directory. SQL Server 2016 can leverage this new feature, and it supports availability groups in such an environment. So we decided to set up 3 new servers with Windows Server 2016 and SQL Server 2016 (Enterprise Edition).
Creating the Windows Server Failover Cluster
I installed Windows Server 2016 on all 3 servers with the exact same version, edition and build number. It’s important for all the members in the cluster to have the same configuration. It’s even more important to verify this in the case of a workgroup cluster, because there is no option to configure a policy for all nodes through Active Directory.
I also made sure that all 3 servers have the same disk layout with the same drive letters. I used C: for the operating system, D: for the data files, E: for the log files and F: for the backup files. This layout was consistent on all servers.
Initially the two sites (UK and Malta) weren’t connected through a VPN, but rather through the internet. This required opening a lot of ports between the servers in order to create the cluster. After some discussions, we decided to set up a VPN between the sites. This is the recommended approach, and it made life much easier and also more secure.
I added the “Failover Clustering” Windows feature to all 3 servers.
I added the same primary DNS suffix to all 3 servers. This operation required a restart of the servers.
I chose an available IP address in each subnet (UK and Malta) to be the cluster IP addresses. The client didn’t use any internal DNS server, so on each server I added entries to the hosts file for the other two servers as well as for the two cluster IP addresses. Each server had 4 entries in the hosts file (excluding itself).
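As a sketch, the hosts file on one of the UK servers looked something like this (all names and addresses here are made-up placeholders, not the client’s real values):

```
# C:\Windows\System32\drivers\etc\hosts on Server A (example values)
10.0.0.11    ServerB.DNSSuffix    ServerB
10.0.1.11    ServerC.DNSSuffix    ServerC
10.0.0.100   ClusterName          # cluster IP address in the UK subnet
10.0.1.100   ClusterName          # cluster IP address in the Malta subnet
```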
I created a new Windows account on all 3 servers (with the exact name and password) to create and manage the cluster. I also added this account to the Windows Administrators group. Then I logged in with the new account.
I executed the following Powershell script to allow the new Windows account to create the cluster:
New-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\Windows\CurrentVersion\Policies\System -Name LocalAccountTokenFilterPolicy -Value 1
I created the cluster using the following Powershell script:
New-Cluster -Name ClusterName -Node ServerA.DNSSuffix,ServerB.DNSSuffix,ServerC.DNSSuffix -StaticAddress UKIPAddress,MaltaIPAddress -NoStorage -AdministrativeAccessPoint DNS
I ran all validation tests against the cluster. All tests passed successfully. There were some warnings, but none of them was critical. Here are the important warnings that I extracted from the report along with my comments:
“All the servers are not joined to a domain. Ensure that the nodes have a consistent configuration. Multi-domain and Workgroup clusters (clusters with non-domain joined nodes) introduce higher risk of configuration drift. When deploying ensure that the same set of Windows patches are applied to all nodes in the cluster. Also, if group policies are rolled out to the cluster nodes, they should not be conflicting. Finally, ensure that the cluster node and network names are replicated to the DNS servers authoritative for the cluster nodes.” As mentioned above, it is important to make sure that all 3 servers have the same OS configuration. This is also important going forward. If we install any patches or use any group policies, we should apply them on all 3 servers. As for the DNS servers, the client doesn’t use them in this environment. We use IP addresses or the hosts files, so this is not an issue.
“The DNS Suffix Search Lists do not match for all nodes or is unconfigured for some nodes.” This warning was displayed because we were not using a DNS server. Again, this is not an issue, because we don’t rely on a DNS server in this environment.
“Node ServerA is reachable from Node ServerB by only one pair of network interfaces. It is possible that this network path is a single point of failure for communication within the cluster. Please verify that this single path is highly available, or consider adding additional networks to the cluster.” This warning was generated for every pair of nodes in every direction (6 warnings altogether). It’s quite self-explanatory…
“The cluster is not configured with a quorum witness. As a best practice, configure a quorum witness to help achieve the highest availability of the cluster.” This is an important issue. We discussed it with the client. For now, we decided not to use a quorum witness, but rather use the 3 servers with equal votes. This gives us an odd number of votes, which is good for high availability in case of a single server failure. In the future, we might remove the vote from Server C (Malta) and add a cloud witness instead. This is a better solution, because it does not rely on Server C being up and running in order to provide high availability for the servers in the UK.
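If we do make that change in the future, it would look roughly like the following PowerShell sketch (the storage account name and key are placeholders; a cloud witness requires an Azure Storage account):

```powershell
# Remove Server C's vote so that quorum depends only on the UK nodes and the witness
(Get-ClusterNode -Name ServerC).NodeWeight = 0

# Configure a cloud witness in Azure (placeholder account name and key)
Set-ClusterQuorum -CloudWitness -AccountName "mystorageaccount" -AccessKey "<storage-account-key>"
```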
Installing SQL Server
I created a local Windows account with the same password on all 3 servers, which I will use as the logon account for all the SQL Server services.
Then, I installed SQL Server 2016 Enterprise Edition on all servers:
I chose the appropriate disks and folders for the data, log and backup files.
I chose the same collation on all 3 instances.
On the services tab, I used the newly created account mentioned above as the logon account for all the SQL Server services.
I chose mixed authentication, and specified the same password for the sa account on all 3 instances.
I enabled Instant File Initialization.
Finally, I downloaded and installed service pack 1 for SQL Server 2016.
Creating the Availability Group
I copied all the application SQL Server logins from the current production instance to the two UK instances (we don’t need these logins in Server C, because it is not intended to serve the application). I used this article to accomplish this task. This method copies the SIDs and the passwords with the logins.
I enabled the Availability Groups feature in SQL Server Configuration Manager on all 3 instances. This operation required a restart of the SQL Server service.
I created a master key and a certificate in the master database of each instance. Then, I backed up the public key of the certificate.
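This step looks roughly like the following (the certificate name, passwords and file path are placeholders; run the equivalent on each instance with its own certificate name):

```sql
USE master;

-- The master key protects the certificate's private key
CREATE MASTER KEY ENCRYPTION BY PASSWORD = 'StrongPassword!123';

-- Certificate used to authenticate the database mirroring endpoint
CREATE CERTIFICATE Hadr_ServerA_Cert
    WITH SUBJECT = 'HADR endpoint certificate for Server A';

-- Back up the public key so the other servers can import it
BACKUP CERTIFICATE Hadr_ServerA_Cert
    TO FILE = 'D:\Certificates\Hadr_ServerA_Cert.cer';
```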
I created a database mirroring endpoint on each instance.
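Since there is no Active Directory, the endpoint authenticates with the certificate rather than with Windows authentication. A sketch of the endpoint definition (names and port are placeholders; each instance references its own certificate):

```sql
CREATE ENDPOINT Hadr_Endpoint
    STATE = STARTED
    AS TCP (LISTENER_PORT = 5022, LISTENER_IP = ALL)
    FOR DATABASE_MIRRORING (
        AUTHENTICATION = CERTIFICATE Hadr_ServerA_Cert,
        ENCRYPTION = REQUIRED ALGORITHM AES,
        ROLE = ALL
    );
```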
I copied the 3 certificate backup files to all 3 servers to the same locations, so that each server had the 3 certificate backups in the same folder.
On each instance I did the following for each one of the other instances:
Create a SQL Server login for the other server.
Create a user for the login in the master database.
Import the certificate of the other server from the backup, and make the login its owner.
Grant CONNECT permission on the endpoint to the login.
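For example, on Server A the steps for Server B would look something like this (login name, user name, password and file path are placeholders; repeat for Server C, and do the equivalent on the other two servers):

```sql
USE master;

-- Login representing the remote server
CREATE LOGIN ServerB_Login WITH PASSWORD = 'AnotherStrongPassword!123';
CREATE USER ServerB_User FOR LOGIN ServerB_Login;

-- Import the remote server's certificate and make the login its owner
CREATE CERTIFICATE Hadr_ServerB_Cert
    AUTHORIZATION ServerB_User
    FROM FILE = 'D:\Certificates\Hadr_ServerB_Cert.cer';

-- Allow the remote server to connect to the local endpoint
GRANT CONNECT ON ENDPOINT::Hadr_Endpoint TO ServerB_Login;
```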
I verified that all the relevant user databases (there were 5 of them) were in the Full recovery model and with the AUTO_CLOSE property disabled.
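A quick way to verify this (the database name in the ALTER statements is a placeholder):

```sql
-- Check the recovery model and AUTO_CLOSE setting of the user databases
SELECT name, recovery_model_desc, is_auto_close_on
FROM sys.databases
WHERE database_id > 4;

-- Fix any database that needs it
ALTER DATABASE MyDatabase SET RECOVERY FULL;
ALTER DATABASE MyDatabase SET AUTO_CLOSE OFF;
```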
We stopped the application, and then I took a full backup of the databases on the current production instance. I copied the database backups to Server A and restored them. We then modified the connection strings in the application to use the IP address of Server A, and brought the application back online. The databases were not too large, so we only had a downtime of around 20 minutes. This was acceptable to the client. If it wasn’t acceptable, then we could, of course, choose another method, such as log shipping, in order to move the databases to Server A with minimum downtime.
I took another full backup of all the databases in Server A.
On Server A, I created the availability group on all 5 databases and all 3 replicas. I configured Server A and Server B as synchronous with automatic failover, and neither of them allows connections as a secondary. As for Server C, I configured it as asynchronous, and it allows all connections as a secondary.
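A sketch of that definition (the availability group name, database names and DNS suffix are placeholders; the endpoint URLs use the primary DNS suffix registered in the hosts files):

```sql
CREATE AVAILABILITY GROUP MyAG
FOR DATABASE Db1, Db2, Db3, Db4, Db5
REPLICA ON
    'ServerA' WITH (
        ENDPOINT_URL = 'TCP://ServerA.DNSSuffix:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)),
    'ServerB' WITH (
        ENDPOINT_URL = 'TCP://ServerB.DNSSuffix:5022',
        AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
        FAILOVER_MODE = AUTOMATIC,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = NO)),
    'ServerC' WITH (
        ENDPOINT_URL = 'TCP://ServerC.DNSSuffix:5022',
        AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT,
        FAILOVER_MODE = MANUAL,
        SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));
```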