Optimizing Query Performance in Databricks (Part 1): Exploring Data Types, Limiting Result Sets and the Explain Command

Query performance is an issue on every data platform. You always want your customers to get the results as quickly as possible, and poor query run times will frustrate them (and might get the data guy fired...).

If you come, like me, from a relational database background (SQL Server, in my case), the answer to poor query performance usually involves looking at the query execution plan and working with indexes and statistics.

What do we do in a data lake situation where there are no indexes?

This is the first in a series of blog posts that will discuss just that: how can we improve query performance on Databricks Delta Lake tables?


But first of all, we need a dataset to work with. Luckily, Databricks ships with some built-in sample datasets.

To view the samples list, use the following command:

display(dbutils.fs.ls('/databricks-datasets'))

I'll use the people sample and write it to my own Delta table (I created the dwh database beforehand):

people_df = spark.read.parquet("/databricks-datasets/learning-spark-v2/people/people-10m.parquet")
people_df.write.saveAsTable("dwh.people")

The code above created a folder called people under the dwh folder, which holds the table's files.

The table amounts to about 200 MB.
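If you want to double-check the size and format without browsing the storage, one quick way (a minimal sketch; the exact numbers will differ in your environment) is to ask Delta for the table details:

# DESCRIBE DETAIL returns the table format, file count and total size in bytes
detail_df = spark.sql("DESCRIBE DETAIL dwh.people")
display(detail_df.select("format", "numFiles", "sizeInBytes"))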


Let's use the explain command to see the execution plan and costs.

explain cost
select *
from dwh.people

These are the results:

== Optimized Logical Plan ==
Relation spark_catalog.dwh.people[id#1272,firstName#1273,middleName#1274,lastName#1275,gender#1276,birthDate#1277,ssn#1278,salary#1279] parquet, Statistics(sizeInBytes=207.6 MiB, ColumnStat: N/A)


To return all the data in the table, Databricks will need to scan all the files - which is 207 MB.

What if we only wanted some of the columns?

explain cost
select id,birthDate
from dwh.people

== Optimized Logical Plan ==
Project [id#1456, birthDate#1461], Statistics(sizeInBytes=33.5 MiB, ColumnStat: N/A)
+- Relation spark_catalog.dwh.people[id#1456,firstName#1457,middleName#1458,lastName#1459,gender#1460,birthDate#1461,ssn#1462,salary#1463] parquet, Statistics(sizeInBytes=207.6 MiB, ColumnStat: N/A)


Databricks only needs to read these two columns, which amount to just 33.5 MB. Less data to read = faster response.

Delta tables are based on the Parquet file format. Parquet files are columnar, meaning the data is stored column by column, so we can read only the columns we need and save time and compute.

If we try to read only some of the columns from a row-based file like CSV, or from a row-based SQL table, the compute would still have to read the entire file or table even though it only returns a few of the columns.

So our first lesson is: read only the columns you need and avoid select *.
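The same applies when working with DataFrames. Here is a minimal PySpark sketch (using the people table from above) that reads only the two columns we need and limits the rows returned:

# Select only the columns we actually need, and limit the preview rows
slim_df = spark.table("dwh.people").select("id", "birthDate")
display(slim_df.limit(10))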


Now let's check whether all data types are created equal.

Let's create a table with two columns, one of data type double and one of data type string. Then we'll insert some data.

create table dwh.test_data_types (
  number_as_double double,
  number_as_string string
);
insert into dwh.test_data_types(number_as_double,number_as_string) values(18.987,'18.987');
insert into dwh.test_data_types(number_as_double,number_as_string) values(159.485,'159.485');
insert into dwh.test_data_types(number_as_double,number_as_string) values(456.987,'456.987');
insert into dwh.test_data_types(number_as_double,number_as_string) values(159.357,'159.357');
insert into dwh.test_data_types(number_as_double,number_as_string) values(741.321,'741.321');
insert into dwh.test_data_types(number_as_double,number_as_string) values(852.369,'852.369');
insert into dwh.test_data_types(number_as_double,number_as_string) values(961.743,'961.743');

As you can see, this is the same data, once as double and once as string.

Let's compare the size and cost:

explain cost
select number_as_double
from dwh.test_data_types

== Optimized Logical Plan ==
Project [number_as_double#3376], Statistics(sizeInBytes=2.4 KiB, ColumnStat: N/A)
+- Relation spark_catalog.dwh.test_data_types[number_as_double#3376,number_as_string#3377] parquet, Statistics(sizeInBytes=5.4 KiB, ColumnStat: N/A)

explain cost
select number_as_string
from dwh.test_data_types

== Optimized Logical Plan ==
Project [number_as_string#3417], Statistics(sizeInBytes=4.2 KiB, ColumnStat: N/A)
+- Relation spark_catalog.dwh.test_data_types[number_as_double#3416,number_as_string#3417] parquet, Statistics(sizeInBytes=5.4 KiB, ColumnStat: N/A)


The same data stored as a string takes 4.2 KiB, which is 75% more than the 2.4 KiB it takes as a double.


So, our second lesson of the day: when you build your tables, use the right data type for each column.
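If a numeric column already landed as a string, you can fix it with a cast. Here is a minimal PySpark sketch using the test table above (the target table name dwh.test_data_types_fixed is just an example):

from pyspark.sql.functions import col

# Cast the string column back to double and save the corrected table
fixed_df = (spark.table("dwh.test_data_types")
            .withColumn("number_as_double_fixed", col("number_as_string").cast("double")))
fixed_df.write.mode("overwrite").saveAsTable("dwh.test_data_types_fixed")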


