Introduction to Delta Lake on Azure Databricks

June 25, 2023

Introduction

Introduced by Databricks[1], Delta Lake is an open-source storage layer that brings reliability to data lakes. It provides ACID transactions, scalable metadata handling, and unifies streaming and batch data processing. Delta Lake runs on top of your existing data lake and is fully compatible with Apache Spark APIs. Delta Lake has rapidly gained popularity among data engineers and data scientists as it provides a reliable and scalable way to manage data.




Delta Table Format

Delta Lake introduces a data storage format called the Delta table format. Delta tables store their data as Parquet files together with a transaction log, which adds functionality that Parquet alone does not have: ACID transactions, schema evolution, schema enforcement, and data versioning. Because Delta tables are designed to work with Spark SQL, you can easily run SQL commands to interact with your data.
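For instance, a minimal sketch in PySpark on Azure Databricks (the table and column names here are illustrative, not from a real workload) could create and query a Delta table with plain SQL:

```python
from pyspark.sql import SparkSession

# In an Azure Databricks notebook `spark` is already defined; the builder call
# only matters when running elsewhere with the delta-spark package installed.
spark = SparkSession.builder.getOrCreate()

# Create a Delta table (name and columns are illustrative).
spark.sql("""
    CREATE TABLE IF NOT EXISTS suppliers (
        supplier_id INT,
        name        STRING
    ) USING DELTA
""")

# Insert and query it with ordinary SQL.
spark.sql("INSERT INTO suppliers VALUES (1, 'Fabrikam'), (2, 'Contoso')")
spark.sql("SELECT * FROM suppliers ORDER BY supplier_id").show()
```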

ACID transactions

Delta tables provide ACID transactions[2]; ACID stands for Atomicity, Consistency, Isolation, and Durability.

ACID transactions ensure that data is processed reliably and consistently. With ACID transactions, you can update, delete, and insert data into Delta tables, and Delta Lake will handle the transactional guarantees for you.
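As a hedged sketch, those operations could look like this with the Delta Lake Python API (the `suppliers` table and its values are hypothetical, carried over from the example above); each call runs as its own atomic, isolated transaction:

```python
from delta.tables import DeltaTable

# `spark` is the active session and `suppliers` is the hypothetical table above.
suppliers = DeltaTable.forName(spark, "suppliers")

suppliers.update(
    condition="supplier_id = 1",
    set={"name": "'Fabrikam Inc.'"},   # values are SQL expressions, hence the inner quotes
)
suppliers.delete("supplier_id = 2")
spark.sql("INSERT INTO suppliers VALUES (3, 'Northwind')")
```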

Schema evolution

Schema evolution[3] lets you modify the structure and format of your data over time while still maintaining the integrity of your existing data.

For example, let’s say you have a Delta table for supplier information, and you decide to add a new field for the supplier’s address. With schema evolution, you can modify the structure of the table to include the new field, and existing rows will simply return a null value for it. This means that you don’t lose any of your existing data, and you can continue to add new data with the updated structure.
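A minimal sketch of this scenario, assuming the hypothetical `suppliers` table with `supplier_id` and `name` columns, enables schema evolution on an append by setting the `mergeSchema` option:

```python
# New rows include an `address` column that the existing table does not have yet.
new_rows = spark.createDataFrame(
    [(4, "Adventure Works", "1 Main St, Redmond")],
    ["supplier_id", "name", "address"],
)

(new_rows.write
    .format("delta")
    .mode("append")
    .option("mergeSchema", "true")   # let Delta Lake add the new column to the schema
    .saveAsTable("suppliers"))

# Rows written before the change now read back with address = NULL.
```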

Schema evolution also covers changes to a column’s data type, such as moving from a string field to a numeric field, although these changes are more restricted. Delta Lake can apply a few safe type widenings automatically, but a conversion like string to numeric usually means casting the column and rewriting the table with the new schema. Once that is done, you can continue to add new data with the updated structure and data type.
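As a sketch, assuming a hypothetical `rating` column currently stored as a string, such a change can be made by casting the column and rewriting the table with the `overwriteSchema` option:

```python
from pyspark.sql.functions import col

# Cast the hypothetical `rating` column from string to int and rewrite the table.
suppliers_df = spark.table("suppliers")

(suppliers_df
    .withColumn("rating", col("rating").cast("int"))
    .write
    .format("delta")
    .mode("overwrite")
    .option("overwriteSchema", "true")   # accept the changed column type
    .saveAsTable("suppliers"))
```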

Schema enforcement

Delta tables provide schema enforcement, which means that you can define a specific structure or format for your data, and Delta Lake will make sure that any data that does not match that structure is rejected.

Suppose you have a table for customer information, and you want to make sure that every row in the table has a name, an email address, and a phone number. With schema enforcement, you can define that structure, and Delta Lake will reject any data that does not have those three fields. This helps ensure that your data is consistent and of high quality, as you can be confident that every row in the table has the same structure.
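A small sketch of that behaviour, assuming a hypothetical `customers` table with exactly the columns `name`, `email`, and `phone`, shows a write with an unexpected column being rejected:

```python
# The hypothetical `customers` table has exactly: name, email, phone.
bad_rows = spark.createDataFrame(
    [("Alice", "alice@example.com", "555-0100", 42)],
    ["name", "email", "phone", "loyalty_points"],   # extra column not in the table
)

try:
    bad_rows.write.format("delta").mode("append").saveAsTable("customers")
except Exception as err:
    # Delta Lake raises an analysis error describing the schema mismatch.
    print(f"Write rejected by schema enforcement: {err}")
```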

An example of how schema enforcement works can be found in Microsoft’s documentation: Delta Lake schema validation.

Data versioning

Delta tables provide data versioning, which means that you can keep track of changes to your data over time. Every time you make changes to your data, Delta Lake creates a new version of the data. This allows you to keep track of changes over time and easily revert to a previous version if necessary.

Suppose you have a table of sales data and you want to make some changes to it. With data versioning, Delta Lake will automatically create a new version of the data every time you make changes. This means that you can always go back to a previous version of the data if you need to, such as if you made a mistake or if you need to audit changes to the data.
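For example, a sketch using Spark SQL (the `sales` table and the version number are hypothetical) can inspect the table history and restore an earlier version:

```python
# List every committed version of the hypothetical `sales` table.
spark.sql("DESCRIBE HISTORY sales").show(truncate=False)

# Roll the table back to an earlier version if a change was a mistake.
spark.sql("RESTORE TABLE sales TO VERSION AS OF 5")
```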

Time travel

Time travel in Delta tables means that you can query data as it existed at a specific point in time in the past. This allows you to easily access historical data and see how it has changed over time.

Suppose you have a table of customer orders and you want to see how many orders were placed on a specific day in the past. With time travel, you can query the table as it existed on that day and see the data exactly as it was at that time. This is useful for auditing purposes, as you can see exactly what data was present at a specific point in time.
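A minimal sketch of such a query, assuming a hypothetical `orders` table, uses Delta Lake’s `TIMESTAMP AS OF` and `VERSION AS OF` clauses:

```python
# Query the hypothetical `orders` table as it existed on a past date.
orders_jan_15 = spark.sql("SELECT * FROM orders TIMESTAMP AS OF '2023-01-15'")
print(orders_jan_15.count())

# The same works with an explicit version number.
orders_v42 = spark.sql("SELECT * FROM orders VERSION AS OF 42")
```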

Time travel can also be useful for debugging, as you can see how data has changed over time and identify any issues that may have arisen. Additionally, it can be helpful for compliance purposes, as you can easily retrieve historical data for regulatory purposes.

Overall, time travel in Delta tables provides a way to query historical data and see how it has changed over time, making it a useful feature for auditing, debugging, and compliance purposes.

Streaming and batch processing

Delta Lake allows you to work with data in real time as it’s generated (streaming) as well as process large sets of data in a batch (batch processing) using the same code. This means that you don’t need to write separate code to process data in different scenarios, making it easier to work with data.

As an example, let’s say you have a website that generates customer orders in real-time, and you also have a batch of historical customer orders that you want to process. With Delta Lake, you can use the same code to process both the real-time customer orders and the historical customer orders.

Delta Lake also provides support for structured streaming, which is a way to process streaming data using Spark SQL. This means that you can write SQL-like queries to process the real-time customer orders as they are generated. For example, you could use structured streaming to filter and aggregate the customer orders in real time, and store the results in a Delta table for further analysis.
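A sketch of this pattern, with hypothetical table names and checkpoint path, reads a Delta table as a stream, aggregates orders per product per minute, and writes the running results to another Delta table:

```python
from pyspark.sql.functions import window, col

# Read the hypothetical `orders_raw` Delta table as a stream of new rows.
orders_stream = spark.readStream.format("delta").table("orders_raw")

# Aggregate orders per product per minute as the data arrives.
orders_per_minute = (orders_stream
    .groupBy(window(col("order_time"), "1 minute"), col("product_id"))
    .count())

# Write the running aggregates to another Delta table for further analysis.
(orders_per_minute.writeStream
    .format("delta")
    .outputMode("complete")
    .option("checkpointLocation", "/tmp/checkpoints/orders_per_minute")
    .toTable("orders_per_minute"))
```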

Conclusion

Compared to traditional data lakes, Delta Lake offers significant advantages in terms of reliability and scalability. Delta Lake’s ACID transactions, schema evolution, schema enforcement, data versioning, and time travel features make it easier to manage data and ensure data consistency, quality, and integrity. These features are not typically available in traditional data lakes, which can lead to data quality issues and inconsistencies. Additionally, Delta Lake’s support for both streaming and batch processing using the same code, as well as its support for structured streaming using Spark SQL, make it a more versatile and efficient tool for data processing. Overall, Delta Lake’s features and capabilities make it a better choice for organizations looking to manage and process large amounts of data with reliability and scalability.

References: