Data Science and AI

Delta Lake – Drivers, Features and a Perspective

Big data architecture has evolved considerably over the last decade, and so have big data technologies and products. The Lambda architecture is one very popular example: it provides big data processing with high throughput and low latency, and is aimed at near-real-time applications. The Lambda architecture consists of a Batch layer, a Speed layer and a Serving layer.


As a concrete example of such an architecture, the speed layer for near-real-time applications can be implemented with Azure Stream Analytics, and the batch layer with Azure Data Lake Storage Gen2 plus Databricks for AI/ML modelling. For the serving layer, PolyBase can be used to run SQL queries.


Such an architecture offered multiple advantages:

  • A mix of speed and reliability, thanks to the two distinct layers
  • Low latency and high throughput
  • Scalable and fault tolerant


Enterprises, however, started to face challenges with such an architecture, mainly for the following reasons:

  • Two channels had to be maintained at the same time, which significantly increased OPEX / run-the-business (RTB) costs
  • Change Advisory Board (CAB) processes were extensive, since every schema change had to be propagated to downstream systems in both layers
  • Batch jobs frequently had to be reprocessed
  • Migrating to another technology was a challenge
  • Compliance requirements introduced over time demanded excessive rework
  • Version management and metadata management were difficult
  • Data governance and data quality were hard to enforce


In early 2019, Databricks announced the Delta Lake architecture at the Spark + AI Summit, and the project was subsequently donated to the Linux Foundation for hosting. According to the Linux Foundation website, Delta Lake has been adopted by over 4,000 organisations and processes over two exabytes of data each month (for a quick sense of scale: 1 exabyte = 1 billion GB).


Delta Lake sits as a storage layer on top of an existing core data lake and works with the Apache Spark APIs.
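As a minimal sketch of what this looks like in practice, the PySpark snippet below writes and reads a Delta table through the ordinary Spark DataFrame API. It assumes a Spark session with the open-source delta-spark package available on the classpath; the /tmp/delta/events path is purely illustrative.

    from pyspark.sql import SparkSession

    # Configure Spark to use the Delta Lake extension and catalog
    spark = (
        SparkSession.builder
        .appName("delta-quickstart")
        .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
        .config("spark.sql.catalog.spark_catalog",
                "org.apache.spark.sql.delta.catalog.DeltaCatalog")
        .getOrCreate()
    )

    # Write a DataFrame as a Delta table on ordinary data lake storage
    spark.range(0, 5).write.format("delta").mode("overwrite").save("/tmp/delta/events")

    # Read it back with the same DataFrame API used for Parquet or CSV
    spark.read.format("delta").load("/tmp/delta/events").show()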


According to the official documentation, Delta Lake provides the following (a short sketch of two of these features follows the list):

  • ACID (atomicity, consistency, isolation, durability) transactions on Spark
  • Scalable metadata handling
  • Streaming and batch unification
  • Schema enforcement
  • Time travel
  • Upserts and deletes
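The sketch below illustrates two of these features, time travel and upserts, against the illustrative /tmp/delta/events table from the earlier snippet; the version number and id ranges are assumptions made for demonstration.

    from delta.tables import DeltaTable

    # Time travel: read an earlier version of the table by version number
    v0 = spark.read.format("delta").option("versionAsOf", 0).load("/tmp/delta/events")

    # Upsert via MERGE: update rows that match, insert the ones that do not
    target = DeltaTable.forPath(spark, "/tmp/delta/events")
    updates = spark.range(3, 8)  # ids 3-4 already exist, ids 5-7 are new

    (target.alias("t")
     .merge(updates.alias("u"), "t.id = u.id")
     .whenMatchedUpdateAll()
     .whenNotMatchedInsertAll()
     .execute())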


Delta Lake thus unifies stream and batch processing, unlike the Lambda architecture, which required two separate channels to be maintained. Data is organised into three quality levels, conventionally called Bronze, Silver and Gold; the most refined data sits in the Gold layer and is used for business-level aggregates.
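As a hedged sketch of how this unification can look in code, the snippet below streams raw events from an assumed Bronze table into Silver with Structured Streaming, then computes Gold aggregates as a plain batch query over the same Silver table. All paths and the id column are illustrative, not taken from the article.

    from pyspark.sql import functions as F

    # Bronze -> Silver: a streaming job that continuously refines raw events;
    # the same Silver table could just as well be written by a batch job
    bronze = spark.readStream.format("delta").load("/tmp/delta/bronze")

    silver = (bronze
              .where(F.col("id").isNotNull())  # minimal cleansing step
              .writeStream
              .format("delta")
              .option("checkpointLocation", "/tmp/delta/_chk/silver")
              .start("/tmp/delta/silver"))

    # Silver -> Gold: business-level aggregates as a batch query
    # (in practice, run once the stream has produced data)
    (spark.read.format("delta").load("/tmp/delta/silver")
     .groupBy("id").count()
     .write.format("delta").mode("overwrite").save("/tmp/delta/gold"))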


Delta Lake can be used with Databricks, and for a quick POC, DBFS directories can serve as the Bronze, Silver and Gold layers. To maintain data lineage, ACID transactions, scalable metadata handling and versioning (known as time travel in Delta Lake), each layer is simply a Delta table directory; the file-level layout is sketched below.
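The sketch below, with assumed /FileStore/delta/... DBFS paths, creates the three layers as Delta table directories and inspects the commit history; the commented tree indicates the typical on-disk layout of Parquet data files plus the _delta_log transaction log.

    # Create Bronze, Silver and Gold as Delta table directories on DBFS
    for layer in ["bronze", "silver", "gold"]:
        spark.range(0, 3).write.format("delta").mode("overwrite") \
            .save(f"/FileStore/delta/{layer}")

    # Each write above is one versioned, ACID commit in the transaction log
    spark.sql("DESCRIBE HISTORY delta.`/FileStore/delta/bronze`").show(truncate=False)

    # The resulting table directory looks roughly like:
    #   /FileStore/delta/bronze/part-00000-....snappy.parquet         <- data files
    #   /FileStore/delta/bronze/_delta_log/00000000000000000000.json  <- commit log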