A Useful Script to Delete Data from Multiple Tables


Multiple Tables

Sometimes you need to delete data from multiple tables in a database. For example, you might have a multi-tenant database, and you need to delete all the data that belong to some tenant.


The problem is that there are many tables in the database that contain data, which you need to delete. If you have a column like "TenantId" in all tables, then your life is easier, because you have a simple predicate to apply to your DELETE statements against all tables. But even then, if there are foreign keys between tables to enforce referential integrity, then things get more complicated.



The Problem


As an example, consider the following diagram:


Hierarchy of Tables with Referential Integrity
Hierarchy of Tables with Referential Integrity

In this example, the Tenants table contains a row for each tenant, and it includes the TenantId column with a primary key. There is a one-to-many relationship between that table and the Customers table. The Customers table has a TenantId column with a foreign key referencing the primary key of the Tenants table.


In the same way, there is a primary key in the Customers table (CustomerId) and a foreign key in the Orders table (CustomerId) referencing that primary key.


Now, let's say you need to delete all the data that belong to tenant 28. If you try to delete the row with TenantId = 28 in the Tenants table first, the statement will fail, because there are customers that belong to that tenant. In other words: there are rows in the Customers table that reference the row that you are trying to delete from the Tenants table.



Possible Solutions


One way to solve this problem is to add the ON DELETE CASCADE option to all foreign keys. With this option, when you delete a row that has referencing rows in another table, instead of failing the operation, the database engine will delete the row and all the referencing rows as well.


If you have multiple levels in the hierarchy, as in our example, and all foreign keys use this option, then all the relevant rows will be deleted from all tables recursively and transparently. All you need to do is delete the single row from the Tenants table, and the database engine will do all the rest behind the scenes.


This is nice, but it can be dangerous. You might not realize exactly how many rows you are going to delete in total, and this number might surprise you. If you are going to end up deleting a large number of rows from multiple tables in a single operation, then you might run into other problems related to transaction log throughput, disk space, and recovery time. In general, I don't like things that happen in the background, just like with triggers or plan guides. It can be quite confusing and difficult when troubleshooting stuff.


Another option is to delete the data from the tables in the correct order, from the lowest level to the highest. In our simple example, you can delete all the rows from the Orders table first, then from the Customers table, and only then delete the single row from the Tenants table.


When you have many tables with many references, it can become quite complicated to sort the tables in the correct order. Sometimes, you may not even be able to do it, because there is a closed loop of references, so there isn't really a "lowest" table.


Also, If you don't have a simple predicate like a TenantId column in all tables, then the only way to decide which rows to delete is by joining each table all the way up to the root table. In our example, you will need to delete data from the Orders table first. You will need to join the Orders table to the Customers table, and then delete all the rows from the Orders table that reference rows in the Customers table that have the value 28 in the TenantId column. Only then you can delete rows from the Customers table, and so on.


It's not that complicated when you have only 3 tables, but what if you have 10 levels in the hierarchy with multiple paths? It can become pretty ugly.



So What Do You Suggest?


I suggest another solution to this problem. It is composed of 4 steps:

  1. Drop all the foreign keys in the database.

  2. Delete the row from the top table (Tenants.TenantId = 28).

  3. For each (dropped) foreign key, delete all the rows from the child table that have no corresponding rows in the parent table according to the foreign key definition.

  4. Recreate all the foreign keys.

In step 1, we need to also store the definitions of all foreign keys in a table. We will use these definitions later in step 3 and in step 4.


Notice that we need to execute step 3 multiple times until there are no more foreign key violations (child rows without corresponding parent rows). Only then we can move to step 4 and recreate the foreign keys.


Doing all that manually is a lot of work, and quite error-prone. And if you need to do it multiple times with different tenants and different environments, then it really doesn't make sense to do it manually.


Developing a script that will do all that is also quite a complicated task. But guess what? I have already done that for you. I wrote such a script for one of our customers, and we have already used it many times on multiple environments, so it's well-tested.


In addition to the 4 steps mentioned above, the script also creates an audit table and generates an audit of all the operations it performed along with row count of each DELETE statement. It also calculates and compares the row counts in each table before and after the script execution, and the total number of rows deleted from each table.


The script includes 11 steps:

  1. Create the TableRowCount table to store row counts in all tables

  2. Populate the TableRowCount table before we delete any rows

  3. Store the foreign key information in a table

  4. Create an audit table

  5. Drop all the foreign keys

  6. Delete data based on your logic...

  7. Delete rows that violate foreign keys

  8. Recreate the foreign keys

  9. Populate the TableRowCount table after we deleted the rows

  10. Compare the number of rows before and after the delete

  11. View the audit


This script can potentially run a lot of DDL and DML statements on many tables, affecting many rows. So I wouldn't use it on a production database while there is activity against the tables. First, it might cause heavy blocking. Second, there are no foreign key constraints for some period of time, so the database can be exposed to data integrity issues. I would run the script during a maintenance window, when there is no other activity against the tables, or at least during a low activity timeframe.


All is left for you to do is to download the script from our Madeira Toolbox, insert your logic in step 6, and execute the script.


Notice that the script assumes that all foreign keys have a single key. This was the case of my customer. If you have foreign keys with multiple keys, then you need to make some changes to the script to support this case.

0 comments

STAY IN TOUCH

Get New posts delivered straight to your inbox

Thank you for subscribing!