Foreign keys and how they help the optimizer

I love foreign keys. In my opinion they just make perfect sense.

They keep my data “clean” and consistent. Whenever two tables need to correspond with each other, I like to make sure that they do. For me as a DBA, the easiest and most intuitive way to do that is to create a FK constraint. Some of you (maybe even most of you) will say that FKs can be replaced by the application. Let it do all of the “dirty work” of making sure that the data in our DB makes sense business wise. But if there’s something I hate, it’s cleaning up after some programmer’s bug. Data fixes are a pain in the ***.

However, there is another reason why FKs are important. They can help the optimizer make better and more efficient decisions when creating an execution plan. To explain exactly how and why, let first see what a FK is.

What’s a Foreign Key ?

A foreign key represents a relation between two tables in a database. The relation is determined by matching the value in a column (or combination of columns) in the referencing table to the values in the columns of the referenced table. The values in the referenced table must be unique, either a primary key or a unique constraint.

These two conditions make sure two things will always be true:

  1. For every value in the FK there is a matching record in the referenced table

  2. There is only one match

How does it help the optimizer ?

Naturally, the optimizer knows how to take these two trues into consideration when creating the execution plan. To demonstrate I’ve created two tables with a FK relation between them.

The NameList table contains 100 names and Ids with the Id as a primary key for the table.

Transact-SQL

-- Create the list tables CREATE TABLE dbo.NameList ( Id INT NOT NULL , Name VARCHAR(10) NOT NULL , CONSTRAINT pk_NameId_c PRIMARY KEY CLUSTERED (Id) ); INSERT INTO dbo.NameList(Id, Name) SELECT TOP 100 ROW_NUMBER()OVER(ORDER BY T1.name) , 'Name_'+CAST(ROW_NUMBER()OVER(ORDER BY T1.name) AS VARCHAR(4)) FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

-- Create the list tables

CREATE TABLE

dbo.NameList

(

Id INT NOT NULL ,

Name VARCHAR(10) NOT NULL ,

CONSTRAINT pk_NameId_c PRIMARY KEY CLUSTERED (Id)

);

INSERT INTO

dbo.NameList(Id, Name)

SELECT TOP 100

ROW_NUMBER()OVER(ORDER BY T1.name) ,

'Name_'+CAST(ROW_NUMBER()OVER(ORDER BY T1.name) AS VARCHAR(4))

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2;

GO

The RandomNames table contains 100,000 records with 4 columns. The Id is the PK for the table. The other 3 columns hold values between 1 and 100 and the values are exactly the same for all 3. The difference between them is that the first column is constrained by a FK, the second column isn’t, and the third also constrained by a FK but is nullable.

Transact-SQL

-- Create the data table CREATE TABLE dbo.RandomNames ( Id INT NOT NULL IDENTITY(1,1) , NameId INT NOT NULL , NameId_NoFK INT NOT NULL , NameId_NULL INT NULL , CONSTRAINT pk_RandNameId_c PRIMARY KEY CLUSTERED (Id) , CONSTRAINT fk_NameID FOREIGN KEY (NameId) REFERENCES dbo.NameList (Id) , CONSTRAINT fk_NameID_NULL FOREIGN KEY (NameId_NULL) REFERENCES dbo.NameList (Id) ); INSERT INTO dbo.RandomNames ( NameId , NameId_NoFK , NameId_NULL ) SELECT RandId , RandId , RandId FROM( SELECT TOP 100000 ABS(CHECKSUM(NEWID())%100)+1 AS RandId FROM sys.all_columns AS T1 CROSS JOIN sys.all_columns AS T2) AS T; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

-- Create the data table

CREATE TABLE

dbo.RandomNames

(

Id INT NOT NULL IDENTITY(1,1) ,

NameId INT NOT NULL ,

NameId_NoFK INT NOT NULL ,

NameId_NULL INT NULL ,

CONSTRAINT pk_RandNameId_c PRIMARY KEY CLUSTERED (Id) ,

CONSTRAINT fk_NameID FOREIGN KEY (NameId)

REFERENCES dbo.NameList (Id) ,

CONSTRAINT fk_NameID_NULL FOREIGN KEY (NameId_NULL)

REFERENCES dbo.NameList (Id)

);

INSERT INTO

dbo.RandomNames

(

NameId ,

NameId_NoFK ,

NameId_NULL

)

SELECT

RandId ,

RandId ,

RandId

FROM(

SELECT TOP 100000

ABS(CHECKSUM(NEWID())%100)+1 AS RandId

FROM

sys.all_columns AS T1

CROSS JOIN

sys.all_columns AS T2) AS T;

GO

Now, let’s join the two tables, every time using a different column (NameId, NameId_NoFK, NameId_NULL), and see how the execution plan differs for every query. Notice that I’m only selecting values from the RandomNames table.

Transact-SQL

SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId = NL.Id; GO SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NoFK = NL.Id; GO SELECT TOP 100 RN.Id FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NULL = NL.Id; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

SELECT TOP 100

RN.Id

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId = NL.Id;

GO

SELECT TOP 100

RN.Id

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NoFK = NL.Id;

GO

SELECT TOP 100

RN.Id

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NULL = NL.Id;

GO


FK optimizer Exec_1

When using the FK column, the optimizer recognizes that the inner join is meaningless in this query. Because there must be a match and only one match, every record in the RandomNames must return and no duplicate can be created, so the optimizer simple scans the table. When using the NoFK column the optimizer has no way of knowing this and must perform the join. Notice that the nullable column behaves just like the NoFK column. This is because the optimizer can’t say every row as a match so the column is treated like it doesn’t have a FK (at least when it comes to executing the query).

Now you might say a better query would eliminate the need for the join. If you know your schema well enough you can just SELECT TOP 100 Id FROM dbo.RandomNames; and get the same result (that’s what the optimizer did), but a lot of times the person writing the query doesn’t know the schema very well. Other time the query is already written for us in a form of a view.

Transact-SQL

CREATE VIEW dbo.VW_RandomNames AS SELECT RN.Id , NL.Name FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId = NL.Id; GO CREATE VIEW dbo.VW_RandomNames_NoFK AS SELECT RN.Id , NL.Name FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NoFK = NL.Id; GO CREATE VIEW dbo.VW_RandomNames_NULL AS SELECT RN.Id , NL.Name FROM dbo.RandomNames AS RN INNER JOIN dbo.NameList AS NL ON RN.NameId_NULL = NL.Id; GO

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

20

21

22

23

24

25

26

27

28

29

30

31

32

33

34

35

36

37

38

CREATE VIEW

dbo.VW_RandomNames

AS

SELECT

RN.Id ,

NL.Name

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId = NL.Id;

GO

CREATE VIEW

dbo.VW_RandomNames_NoFK

AS

SELECT

RN.Id ,

NL.Name

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NoFK = NL.Id;

GO

CREATE VIEW

dbo.VW_RandomNames_NULL

AS

SELECT

RN.Id ,

NL.Name

FROM

dbo.RandomNames AS RN

INNER JOIN

dbo.NameList AS NL

ON RN.NameId_NULL = NL.Id;

GO