Updated: Aug 29, 2021
Data management tools are evolving at a great speed, and this can be overwhelming.
Data volumes and variety evolve and grow as well.
Data Engineers are required to transform those waterfalls of data into business insights.
The data is arriving from a vast range of sources, like social media networks, 3rd party partners or internal microservices. If you are experienced in SQL Server DBA, you know how versatile the product is.
It is very tempting and feels correct to use the tool that you know the best.
We can use SQL Server for almost any data management task. We can use SQL Server to watch over the storage for new unprocessed files.
We can load the raw data into the SQL Server Staging Area SQL Server database.
We can efficiently clean, enrich and aggregate the data using highly expensive relational database resources (even if you are not using SQL Database in the cloud, every Enterprise edition core still cost about $7K)
SQL Server Relational engine's high cost echoes the product complexity and extraordinary features that help to manage data consistency and integrity under high transactional load. Those expensive resources should be mainly dedicated to OLTP workloads as well as to condensed, well aggregated and focused Data Marts.
I want to suggest adding to your DBA toolbelt an emerging class of data management technologies called Data Lakes.
Data Lakes are designed to hold, process and store data and they are very cost-effective. Data Lakes usually use affordable hardware and data sets do not need to be indexed. Data can be stored either in raw format (Data scientists in many cases will prefer to analyze original data) as well as in cleaned, enriched and aggregated formats.
Aggregated data can be loaded into your Data Warehouse, which can be our beloved SQL Server, serving Data Analysts queries and reports. Aggregated data sets are smaller and you will not need to dramatically increase the size of the data warehouse every few months.
Delta Lake is an open-source project that enables building data warehouses on top of storage systems, it's a storage layer from Spark which runs on top of an Azure Data Lake, ensuring data integrity with ACID transactions. Delta lake technology leverages efficient Apache Parquet files compression, allows to specify and enforce schema to make sure data types are correct and all required columns are present. Delta Lake enables table schema changes, data versioning and a full audit trail of the changes.
In Azure cloud, we have the following services that integrate with Delta Lake
1. Azure Databricks – has integrated Delta Lake into their managed services, it allows you to use SQL, Python, R or Scala to query the Delta Lake
2. Azure Synapse Analytics – serverless data pools utilize Spark to read and write data from your Delta Lake in PySpark, .NET and Scala
3. Azure Data Factory – supports Delta Lake connector in Copy activity, in Data Flow and Databricks Notebook activities
4. Azure HDInsight: in Spark and Hive you can download and install Delta Lake.