When we hear about Delta Lake, the first question that comes to mind is
“What is Delta Lake and how does it work?”.
“Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads”
But how is it possible to maintain transactions in the big data world? The answer is simple: the Delta format.
Delta Lake stores data in the Delta format, a versioned Parquet format with scalable metadata. It stores the data internally as Parquet files and tracks every change made to the data in a metadata file (the transaction log), so the metadata grows along with the data.
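To make the "versioned Parquet plus metadata" idea concrete, here is a toy sketch in plain Python of how a Delta table directory is laid out and how the log is replayed. This is an illustration of the concept, not the real Delta Lake implementation: the file names, action fields, and replay logic are simplified assumptions modeled on the transaction-log design described above.

```python
import json
import os
import tempfile

# A Delta table directory holds Parquet data files plus a _delta_log
# folder of JSON commit files; each commit records one table version.
table_dir = tempfile.mkdtemp()
log_dir = os.path.join(table_dir, "_delta_log")
os.makedirs(log_dir)

# Version 0: the initial write adds one data file.
commit_0 = [
    {"commitInfo": {"operation": "WRITE"}},
    {"add": {"path": "part-00000.parquet", "dataChange": True}},
]
# Version 1: an overwrite removes the old file and adds a new one.
commit_1 = [
    {"commitInfo": {"operation": "OVERWRITE"}},
    {"remove": {"path": "part-00000.parquet", "dataChange": True}},
    {"add": {"path": "part-00001.parquet", "dataChange": True}},
]

for version, actions in enumerate([commit_0, commit_1]):
    # Commit files are named by a zero-padded version number,
    # so sorting the file names sorts the versions.
    name = f"{version:020d}.json"
    with open(os.path.join(log_dir, name), "w") as f:
        f.write("\n".join(json.dumps(a) for a in actions))

# Replaying the log in version order yields the current set of live files.
live_files = set()
for name in sorted(os.listdir(log_dir)):
    with open(os.path.join(log_dir, name)) as f:
        for line in f:
            action = json.loads(line)
            if "add" in action:
                live_files.add(action["add"]["path"])
            elif "remove" in action:
                live_files.discard(action["remove"]["path"])

print(live_files)  # {'part-00001.parquet'}
```

Because readers reconstruct the table state only from committed log files, a write that never finishes its commit is simply invisible, which is the essence of how the log enables atomic operations.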
The Delta format solves several major challenges in the big data lake world. Some of them are listed below:
- Transaction management
- Incremental Load
- UPSERT and DELETE operations
- Schema Enforcement and Schema Evolution
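Two items on this list, UPSERT and DELETE, are worth a quick illustration. The sketch below shows their semantics over keyed rows in plain Python; the function names (`upsert`, `delete_where`) are illustrative stand-ins, not the Delta Lake API, which exposes this behaviour through MERGE and DELETE operations on tables.

```python
# Semantic sketch of an UPSERT (MERGE) and a conditional DELETE,
# the row-level operations Delta Lake brings to a data lake.

def upsert(target, updates, key):
    """MERGE semantics: update rows whose key matches, insert the rest."""
    merged = {row[key]: row for row in target}
    for row in updates:
        merged[row[key]] = row  # update if key exists, else insert
    return list(merged.values())

def delete_where(target, predicate):
    """DELETE semantics: keep only rows that do not match the predicate."""
    return [row for row in target if not predicate(row)]

table = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]
table = upsert(table, [{"id": 2, "name": "B"}, {"id": 3, "name": "c"}], key="id")
table = delete_where(table, lambda r: r["id"] == 1)
print(table)  # [{'id': 2, 'name': 'B'}, {'id': 3, 'name': 'c'}]
```

In Delta Lake these operations do not rewrite rows in place: they write new Parquet files and record the change in the transaction log, which is what makes them transactional.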
In the rest of this post, I will explain each of these features and dig deeper into the internals of Delta Lake.