Compression – Much more than saving space

I remember when SQL server 2008 was first released and I have examined some of its features, when I came across Compression, I was very impressed and thought “cool !” Actually i have found it much more than that: friendly to use , easy to maintain , saves storage space and here is the best part, can bring better performance (yes queries performance). But somehow I have neglected this feature, as lots of other DBA, with no particular reason. In this post I’m going to present Compression in SQL Server 2008 Enterprise (only in enterprise edition) and explain how it works, and show you a short demo.

So what does it mean compression?

Compression – the process of encoding information using fewer bits (or other information-bearing units) than an unencoded representation would use, through use of specific encoding schemes. Meaning, storing the same data with less space.

We all know compressions like ZIP files on any type of file or using JPEG instead of BMP or MP3 instead of WAV. While some of those types of compressions affect the quality of data being saved, other works on a static file that is being compressed one time and doesn’t change. Of course SQL needs a costume solution that can compress data while it is generated. SQL server offers Backup compression and data Compression. In this post I’ll focus on Data Compression.

There are 2 types of data compression SQL server offers: Row compression and page compression – will explain each one later on. So here are some questions regarding compression, you probably asked yourself:

On what kind of data, compression can be applied?

We can compress table, index or specific partition. Data compression can be applied only during creation or rebuild of a table, index or partition.

How does it work?

Row compression– Reduces metadata overhead associated with the row. It uses variable-length storage format for numeric type. NULL and 0 values across all data types are optimized and take no bytes

Page compression– Page compression consists of 3 operations:

  1. Row Compression The same as described earlier

  2. Prefix Compression For each column, a repeated prefix value that can reduce storage space is identified and stored in the CI (Compression Information) structure that follows the page header. For each value in the column an attempt is made to match partial or full prefix. The matched value is then replaced by a reference to the corresponding prefix Have a look at this example from BOL. This is how an ordinary page looks like:

Now we will see the same page after prefix compression:

The compression algorithm found the longest prefix that fits each column (in the page) so each column now refer to the prefix and how many chars it has in common with it. For example in column 2, the longest prefix is aaaacc. First values has the first 4 chars in common so the first 4 chars are replaced with the number 4 , the second value has no chars in common, so it is replaced with the number 0 and the last value is a perfect match, so no value there

  1. Dictionary Compression Occurs after prefix compression. Repeated values are searched all over the page and replaced by references that are stored in the CI area of the page

In this stage repeated values are searched all over the page and if exact matches are found, then the matched value is written in the CI area of the page. Remember that it doesn’t mean that the values replaced were the same values before the compression, for instance the value 4b, appears in 2 columns but it’s not the same original value (since the prefix compression changed it).

Why using this type of method to compress data?

When fetching a compressed page we don’t want to wait for a long time in order to read its uncompressed data, because it will mean that the compression overhead is too much to bear. Microsoft needed to find a balance between good compression method and the time it takes to compress and uncompress data. Considering the kind of data that is stored in tables and indexes (usually same type of data which make sense (not gibberish) and sometimes it is sorted) so prefix compression as first step is a good idea.

How can compression bring better query performance?

It seems that when working with compression we have much more overhead, since we need to compress every page and when fetching it or looking for data inside the compressed page, we will need to uncompress it. So it seems that working with a compressed page takes longer time. But in fact it takes more CPU time, since the work of un/compression is made by the CPU. Compression saves us storage space, meaning the same data can be stored in less pages, so when fetching this data it will take less I/O and hence usually less time. Today most systems are I/O bound and have plenty of free CPU, so in those kinds of systems, using compression can bring better performance.

What about queries that return a small result set?

Even if a query returns a small result set, it still can process a lot of data. For example have a look at the attached script:  Compression Demo I have created 2 tables with 2 million customers with randomly generated names based on the 20 most popular na