A few months ago I wrote a post about the new change data capture (CDC) feature in Azure Data Factory (ADF): https://www.madeiradata.com/post/the-wind-of-change-change-data-capture-in-data-factory
Change data capture, as the name suggests, captures the data changes in one system and replicates them to another. Since this is a task that data engineers do a lot, it was a very welcome addition to ADF.
In this post, we’ll explore what is new on this front.
ADF can now create the destination tables for you. Under targets, you can click on new target, and add a table to be created.
You can now specify key columns for your table.
If you do not specify a key, all changes will be appended. When you specify a key and a row with that key changes on the source, the matching row in the destination is updated rather than a new row being added.
In my previous post, this option was grayed out. Now you can choose “real time” under “Latency”.
In practice, “real time” means every few seconds, but that’s still very useful.
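To make the key column behavior described above a bit more concrete, here is a minimal sketch in Python. It is only my mental model of append versus update, not how ADF implements it; the function name `apply_changes` and the row layout are made up for illustration, and deletes are not modeled at all.

```python
from typing import Any, Dict, List, Optional

Row = Dict[str, Any]


def apply_changes(
    target: List[Row],
    changes: List[Row],
    key_columns: Optional[List[str]] = None,
) -> List[Row]:
    """Return a new target table after applying a batch of changes.

    Hypothetical helper: illustrates the semantics only, not ADF internals.
    """
    if not key_columns:
        # No key specified: every change is simply appended as a new row.
        return target + changes

    result = list(target)
    # Map each key value (as a tuple) to the position of its row in the target.
    index = {tuple(row[c] for c in key_columns): i for i, row in enumerate(result)}
    for change in changes:
        key = tuple(change[c] for c in key_columns)
        if key in index:
            # Key already exists: update the existing row in place.
            result[index[key]] = change
        else:
            # New key: add the row and remember where it lives.
            index[key] = len(result)
            result.append(change)
    return result


target = [{"id": 1, "name": "old"}]
changes = [{"id": 1, "name": "new"}, {"id": 2, "name": "other"}]

appended = apply_changes(target, changes)          # no key: 3 rows
upserted = apply_changes(target, changes, ["id"])  # key on id: 2 rows
```

Without a key, the changed row with `id` 1 shows up twice in the destination; with `id` as the key, the destination keeps one row per `id`.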
This feature is still in preview. Use it carefully.
CDC will only take rows where the update date is later than the moment you enable it. It will not pick up your historical data, which means you have to copy it manually. (I have no idea why Microsoft built it that way. Maybe they’ll fix it in some future version.)
Continuing on the first new feature (letting ADF create the destination table for you): if you choose to create a new table, it will only be created on the first actual data change.
To monitor and replicate changes, ADF brings up a Spark cluster. If you use real-time latency, this cluster is always up, so the cost may be high.
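The history limitation above is easy to picture with a small Python sketch. This is just an illustration under my own assumptions: the column name `updated_at` and both helper functions are hypothetical, and ADF's actual change detection may work differently.

```python
from datetime import datetime
from typing import Any, Dict, List

Row = Dict[str, Any]


def rows_picked_up(rows: List[Row], enabled_at: datetime,
                   date_column: str = "updated_at") -> List[Row]:
    """Rows CDC would replicate: changed strictly after CDC was enabled."""
    return [r for r in rows if r[date_column] > enabled_at]


def rows_to_backfill(rows: List[Row], enabled_at: datetime,
                     date_column: str = "updated_at") -> List[Row]:
    """Historical rows CDC skips, which you would have to copy manually."""
    return [r for r in rows if r[date_column] <= enabled_at]


enabled_at = datetime(2024, 1, 10, 12, 0, 0)  # hypothetical moment CDC was enabled
source = [
    {"id": 1, "updated_at": datetime(2023, 12, 1, 8, 0, 0)},    # history
    {"id": 2, "updated_at": datetime(2024, 1, 10, 12, 0, 0)},   # exactly at enable time
    {"id": 3, "updated_at": datetime(2024, 1, 11, 9, 30, 0)},   # changed afterwards
]

replicated = rows_picked_up(source, enabled_at)
backfill = rows_to_backfill(source, enabled_at)
```

Here only the row with `id` 3 would be replicated; the other two are the history you would need to copy yourself, for example with a one-off copy before relying on CDC.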
Despite these limitations, this is still a useful feature, and it's nice to see that we get all kinds of improvements. Let's wait and see what we'll get next!