DuckDB: The Data Engineer’s Secret Weapon for High-Performance Analytics

Explore DuckDB’s features, its rising popularity among data engineers, and how it integrates seamlessly with Microsoft Fabric for efficient data processing.

DuckDB has been making waves in the data engineering community, and for good reason. This open-source, in-process analytical database is designed for high performance on complex queries while remaining lightweight and easy to integrate. Let’s explore what makes DuckDB stand out, why it’s gaining popularity among data engineers, and how it fits into the Microsoft Fabric ecosystem.

What’s So Great About DuckDB?

DuckDB is a columnar, relational database management system (RDBMS) tailored for online analytical processing (OLAP) workloads. Unlike traditional databases that often require a separate server process, DuckDB operates entirely within the host application. This means you can embed it directly into applications without the overhead of managing a separate database server.

Key Features:

  • High Performance: DuckDB’s vectorised query execution engine allows it to process data at impressive speeds, making it suitable for complex analytical queries.
  • Ease of Use: With no external dependencies and a simple installation process, DuckDB is straightforward to set up and use. Its SQL syntax is familiar, reducing the learning curve for new users (see the minimal sketch after this list).
  • Portability: DuckDB runs on various platforms, including Linux, macOS, Windows, Android, and iOS, and supports multiple programming languages like Python, R, and Java.
  • Extensibility: The database supports extensions, allowing users to add new functionality such as custom data types, functions, and file formats.
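
To make the ease-of-use and extensibility points concrete, here's a minimal sketch (the httpfs extension is just one example of what the extension mechanism can load):

import duckdb

# No server to start or manage: connect() gives an in-process, in-memory database
con = duckdb.connect()

# Familiar SQL, executed by DuckDB's vectorised engine
print(con.sql("SELECT 42 AS answer").fetchall())  # [(42,)]

# Extensions add capabilities, e.g. httpfs for reading files over HTTP/S3
con.install_extension("httpfs")
con.load_extension("httpfs")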

The data engineering community has embraced DuckDB for several reasons:

  • Integration with Data Frames: DuckDB integrates seamlessly with popular data analysis tools like Pandas and Polars, enabling efficient in-memory operations and SQL queries directly on DataFrame objects (a short example follows this list).
  • Handling of Large Datasets: Despite being lightweight, DuckDB can handle large datasets efficiently, making it a go-to choice for tasks that require processing substantial amounts of data without the need for a full-fledged database server.
  • Flexibility in Deployment: Its embeddable nature means data engineers can deploy DuckDB in various environments, from local machines to cloud-based platforms, without significant reconfiguration.
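
As a quick sketch of the DataFrame integration: DuckDB can query a Pandas DataFrame in the local scope simply by referring to it by name, with no loading step (the sales data here is made up for illustration):

import duckdb
import pandas as pd

# DuckDB resolves 'sales' to the DataFrame of that name in the local scope
sales = pd.DataFrame({
    "region": ["EMEA", "APAC", "EMEA"],
    "amount": [120, 80, 200],
})

totals = duckdb.sql("SELECT region, SUM(amount) AS total FROM sales GROUP BY region").df()
print(totals)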

Using DuckDB with Microsoft Fabric

As you can probably tell from my other posts, I'm working quite heavily with Microsoft Fabric at the moment. This all-in-one analytics platform integrates various services, including data engineering, data science, and data warehousing.

When it first launched, the coding experience in Microsoft Fabric was built around Apache Spark notebooks. More recently, Microsoft has added the ability to run T-SQL notebooks (for Data Warehouse instances) and now Python notebooks. While I won’t go into the pros and cons of PySpark vs Python here, it’s worth noting that pure Python notebooks avoid the overhead of spinning up a Spark session, which pays off for smaller workloads.

⚠️
Note that at the time of writing, the Python notebooks are in preview.

In addition, the Python experience in Fabric comes with various pre-installed libraries, such as Polars and DuckDB.
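
The two play nicely together, too. For instance, a DuckDB result can be materialised straight into a Polars DataFrame:

import duckdb

# .pl() converts the query result into a Polars DataFrame (Polars must be
# installed, which it is in Fabric's Python notebooks)
df = duckdb.sql("SELECT 1 AS id, 'duck' AS name").pl()
print(df)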

So what's the advantage of using DuckDB inside Microsoft Fabric?

  • Python Notebooks: Because Fabric’s Python notebooks come with the DuckDB library pre-installed, users can get straight into data exploration and transformation tasks. The integration enables reading and writing Delta Lake data, facilitating seamless data operations within Fabric (a write sketch follows the note below).
  • Performance Comparison: Community benchmarks have shown that DuckDB, along with libraries like Polars, can outperform traditional data processing tools in specific scenarios within Microsoft Fabric. This makes it a compelling choice for data engineers looking to optimise their workflows.
  • Community Tools: Open-source projects like msfabricutils provide utilities for integrating DuckDB with Microsoft Fabric Lakehouses, offering features such as automatic table registration and support for Delta Lake tables.

🗒️
Note that DuckDB tends to be the faster option on smaller datasets. Once you are into the hundreds of gigabytes, Spark is probably the better fit.
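
On the reading-and-writing point above: delta_scan covers reads (as in the example below), and one way to handle writes is to hand a DuckDB result to the deltalake package. A rough sketch, assuming deltalake is available and using a hypothetical target path:

import duckdb
from deltalake import write_deltalake  # assumption: the deltalake package is installed

# Read a Delta table with DuckDB and expose the result as an Arrow table
arrow_table = duckdb.sql("""
    SELECT * FROM delta_scan('/lakehouse/default/Tables/airports/')
""").arrow()

# Hypothetical target path; mode="overwrite" replaces any existing table
write_deltalake("/lakehouse/default/Tables/airports_copy", arrow_table, mode="overwrite")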

Example time

Here's an example of running a DuckDB query in a Python notebook in Fabric:

import duckdb

# Pull the first 1,000 rows of the 'airports' Delta table from the attached lakehouse
display(duckdb.sql("""
    SELECT *
    FROM delta_scan('/lakehouse/default/Tables/airports/')
    LIMIT 1000
""").df())

In the above example, I have a default lakehouse attached to my Python notebook. There's a table in it called airports, containing roughly 83,000 records. Not a huge table.

  • In the Python notebook, with the kernel already warm, this query takes under 1 second (915 ms).
  • Run from "cold", without the kernel primed and ready, it takes about 11 seconds.
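
If you want to reproduce these timings yourself, the standard Jupyter %%time cell magic (not Fabric-specific) is an easy way to do it:

%%time
import duckdb

display(duckdb.sql("""
    SELECT *
    FROM delta_scan('/lakehouse/default/Tables/airports/')
    LIMIT 1000
""").df())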

As a comparison, I ran the same query in a PySpark notebook against the same table.

# Same query, this time through the Spark SQL engine via the lakehouse catalog
df = spark.sql("SELECT * FROM lh_silver.airports LIMIT 1000")
display(df)

  • With the Spark session already started, this took about 4 seconds.
  • Without the Spark session already running, it took 30 seconds.

In summary, DuckDB’s combination of high performance, ease of use, and flexibility makes it a valuable tool for data engineers. Its integration with platforms like Microsoft Fabric further enhances its appeal, providing a robust option for modern data processing challenges. But as with any tool, assess whether it suits your data volumes, and choose accordingly.