Change data capture on Azure Data Factory revisited

Chen Hirsh
Jul 31, 2023

A few months ago I wrote a post about the new change data capture (CDC) feature in Azure Data Factory (ADF): https://www.madeiradata.com/post/the-wind-of-change-change-data-capture-in-data-factory

Change data capture, as the name suggests, captures the data changes on one system and replicates them to another. Since this is a task that data engineers do a lot, it was a very welcome addition to ADF.

In this post, we’ll explore what is new on this front.

New entities

ADF can now create the destination tables for you. Under targets, you can click on new target, and add a table to be created.

Keys

You can now specify key columns for your table.

If you do not specify a key, all changes will be appended. When you specify a key, if the same key is changed on the source, it will update the destination rather than adding a new row.
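To make the difference concrete, here is a minimal Python sketch of the two behaviors. It is only an illustration of append versus update-by-key semantics, not ADF's actual implementation; the function name `apply_changes` and the sample column names are hypothetical.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def apply_changes(target: List[Row], changes: List[Row],
                  key: Optional[str] = None) -> List[Row]:
    """Return a new target table after applying a batch of changes.

    Hypothetical helper for illustration only.
    - key is None: every change is appended as a new row.
    - key is set:  a change whose key already exists replaces that row,
                   otherwise it is appended.
    """
    result = [dict(row) for row in target]  # copy, leave the input untouched
    if key is None:
        result.extend(dict(change) for change in changes)
        return result

    # Map each key value to its row position in the result
    position = {row[key]: i for i, row in enumerate(result)}
    for change in changes:
        value = change[key]
        if value in position:
            result[position[value]] = dict(change)  # update existing row
        else:
            position[value] = len(result)
            result.append(dict(change))  # new key, so add a row
    return result


destination = [{"id": 1, "name": "old"}]
incoming = [{"id": 1, "name": "new"}, {"id": 2, "name": "other"}]

appended = apply_changes(destination, incoming)            # no key: 3 rows
upserted = apply_changes(destination, incoming, key="id")  # key: 2 rows
```

Without a key the destination ends up holding both the old and the new version of row 1; with `id` as the key it holds only the latest version.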

Real time

In my previous post, this option was grayed out. Now you can choose “real time” under “Latency”.

What it actually means is every few seconds, but that’s still very useful.

Please note:

This feature is still in preview. Use it carefully.
CDC will only take rows where the update date is later than the moment you enable it. It will not pick up your historical data, which means you have to copy it manually (I have no idea why Microsoft built it that way. Maybe they’ll fix it in some future version).
Continuing on point 1, if you choose to create a new table, it will only be created on the first actual data change.
To monitor and replicate changes, ADF brings up a Spark cluster. If you use real-time latency, this cluster is always up, so the cost may be high.
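The history limitation in the notes above can be pictured with a small Python sketch. It assumes each source row carries an update timestamp in a column called `updated_at` (a hypothetical name) and simply splits rows around the moment CDC was enabled. This illustrates the behavior described here; it is not how ADF implements it.

```python
from datetime import datetime
from typing import Any, Dict, List, Tuple

Row = Dict[str, Any]


def split_by_enable_time(rows: List[Row], enabled_at: datetime,
                         update_column: str = "updated_at"
                         ) -> Tuple[List[Row], List[Row]]:
    """Split rows into (history, captured).

    Hypothetical helper for illustration only.
    history:  updated at or before CDC was enabled, must be copied manually.
    captured: updated after CDC was enabled, picked up by CDC.
    """
    history = [r for r in rows if r[update_column] <= enabled_at]
    captured = [r for r in rows if r[update_column] > enabled_at]
    return history, captured


enabled_at = datetime(2023, 7, 1, 12, 0, 0)
source_rows = [
    {"id": 1, "updated_at": datetime(2023, 6, 30, 9, 0, 0)},
    {"id": 2, "updated_at": datetime(2023, 7, 1, 12, 0, 5)},
]

history, captured = split_by_enable_time(source_rows, enabled_at)
# history holds row 1 (needs a manual copy), captured holds row 2
```

In practice this means a one-time copy of everything in the "history" part before you rely on CDC for ongoing changes.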

Despite these limitations, this is still a useful feature, and it's nice to see that we get all kinds of improvements. Let's wait and see what we'll get next!
