Foreign keys and how they help the optimizer

Madeira Team
Feb 17, 2012
6 min read

I love foreign keys. In my opinion they just make perfect sense.

They keep my data “clean” and consistent. Whenever two tables need to correspond with each other, I like to make sure that they do. For me as a DBA, the easiest and most intuitive way to do that is to create a FK constraint. Some of you (maybe even most of you) will say that FKs can be replaced by the application. Let it do all of the “dirty work” of making sure that the data in our DB makes sense business wise. But if there’s something I hate, it’s cleaning up after some programmer’s bug. Data fixes are a pain in the ***.

However, there is another reason why FKs are important. They can help the optimizer make better and more efficient decisions when creating an execution plan. To explain exactly how and why, let first see what a FK is.

What’s a Foreign Key ?

A foreign key represents a relation between two tables in a database. The relation is determined by matching the value in a column (or combination of columns) in the referencing table to the values in the columns of the referenced table. The values in the referenced table must be unique, either a primary key or a unique constraint.

These two conditions make sure two things will always be true:

For every value in the FK there is a matching record in the referenced table
There is only one match

How does it help the optimizer ?

Naturally, the optimizer knows how to take these two trues into consideration when creating the execution plan. To demonstrate I’ve created two tables with a FK relation between them.

The NameList table contains 100 names and Ids with the Id as a primary key for the table.

Transact-SQL

-- Create the list tables CREATE TABLE dbo.NameList ( Id INT NOT NULL , Name VARCHAR(10) NOT NULL , CONSTRAINT pk_NameId_c PRIMARY KEY CLUSTERED (Id) ); INSERT INTO dbo.NameList(Id, Name) SELECT TOP 100 ROW_NUMBER()OVER(ORDER BY T1.name) , 'Name_'+CAST(ROW_NUMBER()OVER(ORDER BY T1.name) AS VARCHAR(4)) FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO

-- Create the list tables

CREATE TABLE

dbo.NameList

(

Id INT NOT NULL ,

Name VARCHAR(10) NOT NULL ,

CONSTRAINT pk_NameId_c PRIMARY KEY CLUSTERED (Id)

);

INSERT INTO

dbo.NameList(Id, Name)

SELECT TOP 100

ROW_NUMBER()OVER(ORDER BY T1.name) ,

'Name_'+CAST(ROW_NUMBER()OVER(ORDER BY T1.name) AS VARCHAR(4))

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2;

The RandomNames table contains 100,000 records with 4 columns. The Id is the PK for the table. The other 3 columns hold values between 1 and 100 and the values are exactly the same for all 3. The difference between them is that the first column is constrained by a FK, the second column isn’t, and the third also constrained by a FK but is nullable.

Transact-SQL

-- Create the data table CREATE TABLE dbo.RandomNames ( Id INT NOT NULL IDENTITY(1,1) , NameId INT NOT NULL , NameId_NoFK INT NOT NULL , NameId_NULL INT NULL , CONSTRAINT pk_RandNameId_c PRIMARY KEY CLUSTERED (Id) , CONSTRAINT fk_NameID FOREIGN KEY (NameId) REFERENCES dbo.NameList (Id) , CONSTRAINT fk_NameID_NULL FOREIGN KEY (NameId_NULL) REFERENCES dbo.NameList (Id) ); INSERT INTO dbo.RandomNames ( NameId , NameId_NoFK , NameId_NULL ) SELECT RandId , RandId , RandId FROM( SELECT TOP 100000 ABS(CHECKSUM(NEWID())%100)+1 AS RandId FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2) AS T; GO

-- Create the data table

CREATE TABLE

dbo.RandomNames

(

Id INT NOT NULL IDENTITY(1,1) ,

NameId INT NOT NULL ,

NameId_NoFK INT NOT NULL ,

NameId_NULL INT NULL ,

CONSTRAINT pk_RandNameId_c PRIMARY KEY CLUSTERED (Id) ,

CONSTRAINT fk_NameID FOREIGN KEY (NameId)

REFERENCES dbo.NameList (Id) ,

CONSTRAINT fk_NameID_NULL FOREIGN KEY (NameId_NULL)

REFERENCES dbo.NameList (Id)

);

INSERT INTO

dbo.RandomNames

(

NameId ,

NameId_NoFK ,

NameId_NULL

)

SELECT

RandId ,

RandId

FROM(

SELECT TOP 100000

ABS(CHECKSUM(NEWID())%100)+1 AS RandId

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2) AS T;

Now, let’s join the two tables, every time using a different column (NameId, NameId_NoFK, NameId_NULL), and see how the execution plan differs for every query. Notice that I’m only selecting values from the RandomNames table.

Transact-SQL

SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId = NL.Id; GO SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NoFK = NL.Id; GO SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NULL = NL.Id; GO

SELECT TOP 100

RN.Id

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId = NL.Id;

SELECT TOP 100

RN.Id

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NoFK = NL.Id;

SELECT TOP 100

RN.Id

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NULL = NL.Id;

When using the FK column, the optimizer recognizes that the inner join is meaningless in this query. Because there must be a match and only one match, every record in the RandomNames must return and no duplicate can be created, so the optimizer simple scans the table. When using the NoFK column the optimizer has no way of knowing this and must perform the join. Notice that the nullable column behaves just like the NoFK column. This is because the optimizer can’t say every row as a match so the column is treated like it doesn’t have a FK (at least when it comes to executing the query).

Now you might say a better query would eliminate the need for the join. If you know your schema well enough you can just SELECT TOP 100 Id FROM dbo.RandomNames; and get the same result (that’s what the optimizer did), but a lot of times the person writing the query doesn’t know the schema very well. Other time the query is already written for us in a form of a view.

Transact-SQL

CREATE VIEW dbo.VW_RandomNames AS SELECT RN.Id , NL.Name FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId = NL.Id; GO CREATE VIEW dbo.VW_RandomNames_NoFK AS SELECT RN.Id , NL.Name FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NoFK = NL.Id; GO CREATE VIEW dbo.VW_RandomNames_NULL AS SELECT RN.Id , NL.Name FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NULL = NL.Id; GO

CREATE VIEW

dbo.VW_RandomNames

SELECT

RN.Id ,

NL.Name

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId = NL.Id;

CREATE VIEW

dbo.VW_RandomNames_NoFK

SELECT

RN.Id ,

NL.Name

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NoFK = NL.Id;

CREATE VIEW

dbo.VW_RandomNames_NULL

SELECT

RN.Id ,

NL.Name

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NULL = NL.Id;

Now when we use the view, there is no way of rewriting the query.

Transact-SQL

SELECT TOP 100 Id FROM dbo.VW_RandomNames; GO SELECT TOP 100 Id FROM dbo.VW_RandomNames_NoFK; GO SELECT TOP 100 Id FROM dbo.VW_RandomNames_NULL; GO

SELECT TOP 100 Id

FROM dbo.VW_RandomNames;

SELECT TOP 100 Id

FROM dbo.VW_RandomNames_NoFK;

SELECT TOP 100 Id

FROM dbo.VW_RandomNames_NULL;

Good thing the optimizer’s there to help us out. The execution plans are just like before.

The FK is also helpful when we simply want to filter out records that don’t have a match.

Transact-SQL

SELECT TOP 100 Id FROM dbo.RandomNames WHERE NameId IN (SELECT Id FROM dbo.NameList); GO SELECT TOP 100 Id FROM dbo.RandomNames WHERE NameId_NoFK IN (SELECT Id FROM dbo.NameList); GO SELECT TOP 100 Id FROM dbo.RandomNames WHERE NameId_NULL IN (SELECT Id FROM dbo.NameList); GO

SELECT TOP 100 Id

FROM dbo.RandomNames

WHERE NameId IN (SELECT Id FROM dbo.NameList);

SELECT TOP 100 Id

FROM dbo.RandomNames

WHERE NameId_NoFK IN (SELECT Id FROM dbo.NameList);

SELECT TOP 100 Id

FROM dbo.RandomNames

WHERE NameId_NULL IN (SELECT Id FROM dbo.NameList);

Again, the optimizer knows there is no need to check the NameList table when a FK is present (and it’s not nullable).

However, an odd thing happened when I tried to run these queries with a little change. I used NOT IN instead of IN to get only the records that don’t have a match.

Transact-SQL

SELECT TOP 100 Id FROM dbo.RandomNames WHERE NameId NOT IN (SELECT Id FROM dbo.NameList); GO SELECT TOP 100 Id FROM dbo.RandomNames WHERE NameId_NoFK NOT IN (SELECT Id FROM dbo.NameList); GO SELECT TOP 100 Id FROM dbo.RandomNames WHERE NameId_NULL NOT IN (SELECT Id FROM dbo.NameList); GO

SELECT TOP 100 Id

FROM dbo.RandomNames

WHERE NameId NOT IN (SELECT Id FROM dbo.NameList);

SELECT TOP 100 Id

FROM dbo.RandomNames

WHERE NameId_NoFK NOT IN (SELECT Id FROM dbo.NameList);

SELECT TOP 100 Id

FROM dbo.RandomNames

WHERE NameId_NULL NOT IN (SELECT Id FROM dbo.NameList);

I thought that the optimizer would immediately recognized that when a FK is present this is not a possible scenario and would automatically return an empty record set without scanning any table. I was wrong…

The optimizer still goes over both tables and there is no difference between the FK column and the NoFK column. It gets even worse in the nullable FK column. It’s clear that the optimizer made a wrong decision, and I must say I couldn’t figure out why. My only guess is that the optimizer is not very effective when NULL values might be involved.

Bottom line

FK can help the optimizer. They make queries run faster and prevent unnecessary reads (less shared locks). This is true as long as no NULL values are involved. You can avoid that by using an Id that is not used instead of using NULL (-1 for example). That will eliminate the need for a nullable FK column.

It’s true that FKs make inserts and updates run slower, so the overall benefits of FKs are not that great performance wise. But we are still left with my main reason for using FKs. Consistent, “Clean” data.

As I’ve said before… “Data fixes are a pain in the ***”.

#optimizer #foreignkeys #executionplans #performancetuning #performancetuningservice