Contact us

Madeira Data Solutions

Your Data, Our Solutions

When Should You Use Identity as a Clustered Index Key?

Written By: Matan Yungman 16/12/2015

A few days ago, while delivering my session, The Data-Loading Performance Presentation, I was asked when, performance-wise, Identity should be used as the clustered index key, and when it shouldn’t.

So first, starting SQL Server 2012, a Sequence is actually a better choice than identity. It’s more flexible, more adjustable, and since it’s external to the table, it allows easier migrations.

After we got that out the way, let’s dive to performance.

Traditionally, we want our clustered index key to be:

Narrow: So that the clustered index and the non-clustered indexes which point to it will be as slim as possible.

Static: So that we don’t have to update the clustered and non-clustered indexes, generate fragmentation by doing it and incur locks and potentially deadlocks.

Ever-increasing: So that we don’t generate page splits and fragmentation while inserting data to the table and thus hurt both insert and query performance.

 

For the vast majority of systems (something like 95% of them), this will be a very good choice.

The problem starts when MANY sessions insert data to the table in parallel.

Looking at the structure of an index, when working with an ever-increasing key, all of the rows will go to the same page, which is the last page of the index.

Tail Insert Problem

When inserting data into any page, a session has to acquire a Latch on that page. A Latch is synchronization object for in memory structures like data pages that are in the buffer pool. When inserting data into the page, each session has to acquire an exclusive latch so that no other session can change the structure of the page while the session is inserting data. This means that only one session can insert data into the page at any given time, and if you have many sessions that write to the table at the same time, each one of them will have to wait a considerable amount of time until its turn reaches to insert data. This problem is called the Tail Insert problem.

How to know if you have a problem:

  1. While inserts are running against the suspicious table, query sys.dm_os_waiting_tasks
  2. Check whether you have many sessions waiting with a PAGELATCH wait on the same page
    PAGELATCH Waits
  3. Enable trace flag 3604 using the syntax:

4. Run DBCC PAGE in order to see the contents of the relevant page. In our case:

5. In the results pane, look for ObjectId:

DBCC PAGE ObjectId

6. Then take that ObjectId and find the associated table using the Object_Name function (in our case, 1013578649):

7. Is that your table? You’re most likely suffering from the Tail Insert problem.

Update:

Kenneth Fisher (Blog | Twitter) has suggested a better way for replacing steps 3 to 6 and finding the associated table faster.

We will use the resource description details we grabbed above with the sys.dm_database_page_allocations function. 8 is the database id, 1 is the file id, and 10078 is the latched page id. This brings us to the following query:

 

So when should I use Identity?

If you don’t suffer from the Tail Insert problem, you’re most likely good to go with an ever-increasing key like Identity. If you do suffer from it, you have to start thinking about other solutions.

Solutions:

When you have this problem, it’s time to leave Identity (or Sequence) behind and work with other methods. There a few main options:

  1. Load to a Heap, which works better with parallel loading
  2. Work with In-Memory OLTP, which uses Lock-Free data structures in order to mitigate latches
  3. Work with a non-sequential key like GUID
  4. Work with Hash Partitioning, which separates the Clustered Index to a few separate physical index trees
  5. Pad the table so every page holds only a single row (if your table is small)
  6. Use a business key that spreads all over the index and not in a specific spot in it
  7. Implement a “reverse index”

 

In the next post, I go over the solutions in detail.

10 responses to “When Should You Use Identity as a Clustered Index Key?”

  1. Eitan Blumin says:

    Can’t wait for the next post, Matan.
    I’m curious about the additional solutions you mentioned 🙂

  2. ScottPletcher says:

    As I’ve tune indexes over the years, I’ve found that identity is generally *not* the best clustering key for the majority of tables it’s specified on, excepting typical “master tables”, such as customer_master, where customer_id is the natural clustering key. You have to consider the overall performance of the table, not just the initial INSERT. Keep in mind that a row is typically read at least 10 times, and often 100x or 1000x, after its *1 time* insert.

    Consider an order items table. It’s almost always best to cluster that table on ( order_id, $identity ) rather than on just ( $identity ). Is the key wider? Sure, but it’s vastly more effective in joins.

    Thus, “narrow, static and ever-increasing” should be taken as general rules and not as absolute commandments. In fact, I believe the idea that (virtually) every table should be clustered on identity is the most damaging myth in physical table design (and even logical table design, as many people are prone to include the “automatic” identity in logical modeling, though of course it doesn’t belong there).

    Also critical is that the wrong clustering key almost inevitably forces creation of more “covering” indexes, further damaging overall performance.

    • Matan Yungman says:

      Hi Scott,
      This post took came to analyze things from the insert side.
      Of course the querying side should also be taken into account, and there are various options for that.
      A lot of times creating a clustered index on a date column is good, but it will still have latch issues if many sessions load to the table at the same time and the date is ever increasing.
      As in real life, every option has its pros and cons.

  3. Gus says:

    This article has awesome timing. I have been recently tasked with evaluating some stored procedures for performance issues on a database I have never seen before. I’ve discovered that 61% (123) of the tables in the database have no clustered index. The primary keys on tables with no clustered index are all on non-clustered indexes. At least one of these tables has over 1 million rows, others are several hundred thousand and others have tens of thousands of rows. What are your general thoughts on this situation?

    • Matan Yungman says:

      Hi Gus,

      It’s usually better to work with clustered tables rather than heaps, but there are always exceptions.
      As a general rule, clustered indexes should support range queries, joins to other tables and key lookups from the non clustered indexes.
      Also take into account whether the column can be updated or not, and the other points raised in this post.

      I suggest you start with your most problematic tables – learn how the users interact with it in terms of inserting and querying data. Most likely one of the columns users filter on the most will be your clustered index key.

  4. Uwe Ricken says:

    Hi Matan,

    great subsumption of the problem with increasing key values as clustered key! But I wouldn’t agree that a heap is “always” a better solution. As you have described very good in your post; you will generate another hot spot in your scenarios: It is the PFS Page.

    Please keep in mind that every single transaction (if it hits the barrier of 50, 85, 95, 100 percent) needs to update the PFS-Page which covers the information of the filling grade of the pages of a heap.

    The only benefit of a heap (instead of a CI) is the flat structure because no intermediate levels need to be updated 🙂

    Best to you from Germany, Uwe

    • Matan Yungman says:

      Hey Uwe,
      You are completely right, and I will cover the pros and cons of each solution in the next post (including PFS contention in the case of heaps).

      Matan

  5. Ivan Radchenko says:

    Great post, Matan, but it doesn’t cover another important case: when you have another column, besides identity/sequence, that is also static and ever-increasing, such as date (datetime) or date_id (integer). This scenario might not be as wide-spread and not as narrow as identity/sequence , but sometimes a datetime column is certainly more useful in non-clustered indexes than some meaningless number.

    • Matan Yungman says:

      Hi Ivan,
      The post focuses on Identity/sequence, but if you have latch problems with those, you will generally have this problem for any ever-increasing column, such as Datetime.

Leave a Reply

Your email address will not be published. Required fields are marked *