Microsoft recently announced the general availability of Microsoft Fabric, which bundles all (or most) of Microsoft's cloud data analytics services. This is a good opportunity to compare it with another popular data platform that is also available in Azure (and on other clouds): Databricks.
Before we start, I should note that Fabric is quite new, so it is still hard to evaluate its performance and stability. Also, both products have many features, and I will only discuss the main differences.
When you create an Azure Databricks service, a storage account is created for you, but it contains (or should contain) only system data. To store your own data, Databricks recommends creating another storage account and mounting it to a Databricks workspace. Mounting is the process of defining the storage location and the credentials Databricks will use to access it. In Fabric, by contrast, all connections to storage are handled for you.
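To make the mounting step concrete, here is a minimal sketch of mounting a blob container with a SAS token in a Databricks notebook. The storage account, container, mount point, and token below are hypothetical placeholders, and `dbutils` only exists inside a Databricks notebook, so the actual mount call is shown as a comment:

```python
# Sketch of mounting Azure storage in Databricks (names are placeholders).

def build_mount_args(account: str, container: str, mount_point: str, sas_token: str) -> dict:
    """Build the arguments for dbutils.fs.mount using SAS-token auth."""
    source = f"wasbs://{container}@{account}.blob.core.windows.net"
    conf_key = f"fs.azure.sas.{container}.{account}.blob.core.windows.net"
    return {
        "source": source,
        "mount_point": mount_point,
        "extra_configs": {conf_key: sas_token},
    }

args = build_mount_args("mystorageacct", "raw", "/mnt/raw", "<sas-token>")

# Inside a Databricks notebook you would then run:
# dbutils.fs.mount(**args)
# and read files through the mount point, e.g. spark.read.parquet("/mnt/raw/sales")
```

After the mount exists, every cluster in the workspace can read the data through the `/mnt/raw` path; in Fabric there is no equivalent step to perform.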
Another difference is that Fabric has the concept of OneLake, a central place that holds all your data (built on an Azure Data Lake Storage account) without the need to copy or duplicate it. Theoretically, you can do something similar in Databricks, but you'll have to keep all your data in a single storage account, or mount multiple storage accounts into your workspace.
In a Lakehouse environment, you have to start a compute resource to access your data. Databricks lets you create and configure a cluster, but starting one usually takes about 4-5 minutes. There is also an option to use pools (compute resources waiting to be used), but they incur a cost even when idle. SQL warehouse clusters start quickly, but they cost more than regular clusters and are limited to SQL. On the other hand, Databricks gives you a lot of control over your cluster size and settings.
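To illustrate the kind of control Databricks gives you, here is a sketch of a payload for the Databricks Clusters API (`POST /api/2.0/clusters/create`). The cluster name, runtime version, and VM size are example values, not recommendations:

```python
# Example cluster spec for the Databricks Clusters API.
# Values are illustrative placeholders; pick a runtime and VM size for your workload.
cluster_spec = {
    "cluster_name": "etl-cluster",           # placeholder name
    "spark_version": "13.3.x-scala2.12",     # a Databricks runtime version
    "node_type_id": "Standard_DS3_v2",       # Azure VM size for driver and workers
    "autoscale": {"min_workers": 2, "max_workers": 8},  # scale with the workload
    "autotermination_minutes": 30,           # shut down when idle to save cost
}
```

None of these knobs (VM size, autoscaling bounds, auto-termination) are exposed in Fabric, where capacity is managed for you.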
In Fabric, compute is handled in the background for you and is always available, or starts within a few seconds. However, you have far less control over its settings.
Ease of use is where the difference is obvious. Fabric, like many Microsoft products, is very user-friendly and allows almost no-code operation. A beginner developer will find it easy to start working with the platform almost immediately. For example:
- Right-click a file in a Lakehouse to get Python code that loads it into a DataFrame.
- The new Data Wrangler helps you prepare data: you pick the transformations you need in a UI, and it writes the required Python code for you. See details here - https://learn.microsoft.com/en-us/fabric/data-science/data-wrangler
- Visual SQL query - an add-on that helps you write SQL code. See details here - https://learn.microsoft.com/en-us/fabric/data-warehouse/visual-query-editor
- Data Factory pipelines - develop ETL processes with a graphical UI and minimal code writing.
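To give a flavor of the low-code tooling above, here is a sketch of the kind of pandas code Data Wrangler can generate for a few common cleaning steps. The column names and data are made up for illustration:

```python
import pandas as pd

# Toy data standing in for a file loaded from a Lakehouse (columns are made up).
df = pd.DataFrame({
    "customer": ["Ann", "Ann", "Bob", None],
    "amount": [10.0, 10.0, 25.5, 7.0],
})

# Typical transformations Data Wrangler can emit for you:
df = df.drop_duplicates()                          # remove duplicate rows
df = df.dropna(subset=["customer"])                # drop rows missing a customer
df = df.rename(columns={"amount": "amount_usd"})   # rename a column
```

In Fabric you would click these steps in the Data Wrangler UI and copy the generated code into your notebook, rather than writing it by hand.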
Databricks has many tools to help developers, such as IntelliSense and code suggestions, but it is a code-based solution: to use it, you need to know (or learn) a programming language (it supports Python, SQL, Scala, and R). This is also an advantage - with code, you have more freedom to do things your way.
- Power BI items and other Fabric items are managed in the same place (a workspace).
- Power BI can read data directly from the Lakehouse without importing (processing) it. This is called Direct Lake mode and, according to Microsoft, is supposed to be really fast.
- Each Lakehouse has a default Power BI dataset that you can build reports on.
In Databricks, by contrast, Power BI connects to the workspace as it would to any other data source.
Fabric wins on ease of use and low-code development and will let you start building more quickly. Databricks requires some coding knowledge but gives you more freedom as a programmer. If you are working with Power BI, Fabric's integration is a great advantage. Databricks is a stable and mature product, whereas Fabric still has to prove itself and may need some time and bug fixes to get there.
Interested in learning more about Fabric? Join our series of sessions (in Hebrew).