• Madeira Team

Foreign keys and how they help the optimizer

I love foreign keys. In my opinion they just make perfect sense.

They keep my data “clean” and consistent. Whenever two tables need to correspond with each other, I like to make sure that they do. For me as a DBA, the easiest and most intuitive way to do that is to create a FK constraint. Some of you (maybe even most of you) will say that FKs can be replaced by the application. Let it do all of the “dirty work” of making sure that the data in our DB makes sense business wise. But if there’s something I hate, it’s cleaning up after some programmer’s bug. Data fixes are a pain in the ***.

However, there is another reason why FKs are important. They can help the optimizer make better and more efficient decisions when creating an execution plan. To explain exactly how and why, let first see what a FK is.

What’s a Foreign Key ?

A foreign key represents a relation between two tables in a database. The relation is determined by matching the value in a column (or combination of columns) in the referencing table to the values in the columns of the referenced table. The values in the referenced table must be unique, either a primary key or a unique constraint.

These two conditions make sure two things will always be true:

  1. For every value in the FK there is a matching record in the referenced table

  2. There is only one match

How does it help the optimizer ?

Naturally, the optimizer knows how to take these two trues into consideration when creating the execution plan. To demonstrate I’ve created two tables with a FK relation between them.

The NameList table contains 100 names and Ids with the Id as a primary key for the table.

Transact-SQL

-- Create the list tables CREATE TABLE dbo.NameList ( Id INT NOT NULL , Name VARCHAR(10) NOT NULL , CONSTRAINT pk_NameId_c PRIMARY KEY CLUSTERED (Id) ); INSERT INTO dbo.NameList(Id, Name) SELECT TOP 100 ROW_NUMBER()OVER(ORDER BY T1.name) , 'Name_'+CAST(ROW_NUMBER()OVER(ORDER BY T1.name) AS VARCHAR(4)) FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

-- Create the list tables

CREATE TABLE

dbo.NameList

(

Id INT NOT NULL ,

Name VARCHAR(10) NOT NULL ,

CONSTRAINT pk_NameId_c PRIMARY KEY CLUSTERED (Id)

);

INSERT INTO

dbo.NameList(Id, Name)

SELECT TOP 100

ROW_NUMBER()OVER(ORDER BY T1.name) ,

'Name_'+CAST(ROW_NUMBER()OVER(ORDER BY T1.name) AS VARCHAR(4))

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2;

GO

The RandomNames table contains 100,000 records with 4 columns. The Id is the PK for the table. The other 3 columns hold values between 1 and 100 and the values are exactly the same for all 3. The difference between them is that the first column is constrained by a FK, the second column isn’t, and the third also constrained by a FK but is nullable.

Transact-SQL

-- Create the data table CREATE TABLE dbo.RandomNames ( Id INT NOT NULL IDENTITY(1,1) , NameId INT NOT NULL , NameId_NoFK INT NOT NULL , NameId_NULL INT NULL , CONSTRAINT pk_RandNameId_c PRIMARY KEY CLUSTERED (Id) , CONSTRAINT fk_NameID FOREIGN KEY (NameId) REFERENCES dbo.NameList (Id) , CONSTRAINT fk_NameID_NULL FOREIGN KEY (NameId_NULL) REFERENCES dbo.NameList (Id) ); INSERT INTO dbo.RandomNames ( NameId , NameId_NoFK , NameId_NULL ) SELECT RandId , RandId , RandId FROM( SELECT TOP 100000 ABS(CHECKSUM(NEWID())%100)+1 AS RandId FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2) AS T; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

-- Create the data table

CREATE TABLE

dbo.RandomNames

(

Id INT NOT NULL IDENTITY(1,1) ,

NameId INT NOT NULL ,

NameId_NoFK INT NOT NULL ,

NameId_NULL INT NULL ,

CONSTRAINT pk_RandNameId_c PRIMARY KEY CLUSTERED (Id) ,

CONSTRAINT fk_NameID FOREIGN KEY (NameId)

REFERENCES dbo.NameList (Id) ,

CONSTRAINT fk_NameID_NULL FOREIGN KEY (NameId_NULL)

REFERENCES dbo.NameList (Id)

);

INSERT INTO

dbo.RandomNames

(

NameId ,

NameId_NoFK ,

NameId_NULL

)

SELECT

RandId ,

RandId ,

RandId

FROM(

SELECT TOP 100000

ABS(CHECKSUM(NEWID())%100)+1 AS RandId

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2) AS T;

GO

Now, let’s join the two tables, every time using a different column (NameId, NameId_NoFK, NameId_NULL), and see how the execution plan differs for every query. Notice that I’m only selecting values from the RandomNames table.

Transact-SQL

SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId = NL.Id; GO SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NoFK = NL.Id; GO SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NULL = NL.Id; GO

1

2

3

4

5

6

7

8

9

10

11

12

13