Last week, I gave a presentation about SQL Server’s data warehouse capabilities, and specifically about APS (Analytics Platform System) and Azure SQL Data Warehouse.
The best way to learn about a subject is to teach it, and indeed, I learned a lot. Let’s talk about it a bit.
APS (Analytics Platform System) is an appliance. This means that when you buy it, you get both software and hardware (you can choose between HP, Dell and Quanta). You then get a big rack of servers that is brought into your data center.
The appliance is divided into two – the PDW (Parallel Data Warehouse) part and the Hadoop part.
The PDW part is built out of a few SQL Servers of two types: The control node and the compute nodes.
The control node is essentially a pipe which directs queries to the compute nodes, gathers the results from them, and returns the results to the client.
Each compute node holds a part of the data. If, let’s say, you have a 10 billion rows Fact table, you can divide those rows between a few nodes, which will execute a query on this table in parallel.
The data is distributed between the nodes by applying a hash function on one of the table columns. Because of that, it’s important to select a column that will be distributed well across nodes and don’t generate a data skew which will result in one node doing most of the work. The appliance also leverages Columnstore in order to compress the data and return it as fast as possible.
As for the hardware of each node, it’s not something very bombastic. Currently they use 16 core machines with 256GB RAM and non-SSD storage treated as JBOD. The disks are then striped using Windows Server Storage Spaces. This can give you an idea that the power of the appliance is not in the hardware but in the architecture. By using commodity hardware and software solutions, Microsoft can keep the appliance relatively cheap.
The Hadoop part is Microsoft’s version of Hadoop – HDInsight, which is built on top of the Hortonworks distribution of Hadoop.
In-between, we have a component called Polybase, which connects between SQL Server to Hadoop. It allows us to execute T-SQL queries that will go and grab data from Hadoop, without requiring us to learn how to write Map-Reduce jobs (Polybase is also a part of SQL Server 2016).
If you already have a Hadoop cluster, you can but APS without the Hadoop part and connect the appliance to your existing cluster.
Because there can be some data movement between the nodes (whether they are Hadoop nodes or SQL Server nodes), the appliance is equipped with fast 56 GB/s infiniBand switches.
The Next Level – Azure SQL Data Warehouse
With Azure SQL Data Warehouse, Microsoft took things one step further and separated the compute part from the storage part. By doing this, Microsoft allows us a great elasticity and the ability to quickly spin up a bunch of compute nodes, perform a big process and then take the compute nodes back down. The data will still be there the whole time on storage, even when we don’t need the compute nodes.
Another addition in SQL Data Warehouse is the ability to distribute data between nodes in a round-robin algorithm rather than using a hash function.
Why should you care?
The first and more apparent reason is so that you should know they exist and use them if you need to (and can afford it).
The second reason, and perhaps the more important one, is that by knowing how APS and Azure SQL Data Warehouse are built, you learn how to architect modern database systems.
Those systems leverage the knowledge that by using many commodity-class machines and letting each machine perform a small part of the job, you can do much more and get much better performance than using a single machine, even if it’s very strong.
A great analogy I heard from Matt Usher was that “if you need to hammer 100 nails, letting 100 people hammer 1 nail each will always be faster than giving one person hammer 100 nails, even if he’s the best construction guy in the world and he has the best hammer in the world”.
As stated above, Azure SQL Data Warehouse can also give an idea of the capabilities of the cloud, especially the elasticity part.
The world of data is changing. Using commodity hardware on many machines is used for years by other systems like Hadoop, and the elasticity of the cloud is used by other systems as well. Those two trends now come to SQL Server, and it’s exciting!