delta lake tutorial with code examples


example 1: creating and writing to a delta table

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("delta_example").getOrCreate()
df = spark.read.csv("s3://data-source/sales.csv", header=True, inferSchema=True)
df.write.format("delta").mode("overwrite").save("s3://delta-lake/sales")

explanation

  1. initialize spark session – spark = SparkSession.builder.appName("delta_example").getOrCreate() starts a spark session to enable processing of delta lake tables.

  2. read data from csv – df = spark.read.csv("s3://data-source/sales.csv", header=True, inferSchema=True) loads raw csv data from amazon s3 into a structured dataframe.

  3. write data to delta – df.write.format("delta").mode("overwrite").save("s3://delta-lake/sales") stores data in delta format, ensuring transaction support and schema enforcement.

  4. ensure data consistency – delta lake enforces ACID transactions, preventing partial writes and ensuring reliable storage for analytics and machine learning workloads. a short sketch of schema enforcement on append follows this list.
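
schema enforcement in practice: a minimal sketch, assuming the delta table written above and a hypothetical follow-up file new_sales.csv with the same columns. an append with a matching schema commits atomically, while a mismatched schema is rejected rather than silently corrupting the table.

# hypothetical follow-up batch; the file name is an assumption for illustration
new_df = spark.read.csv("s3://data-source/new_sales.csv", header=True, inferSchema=True)

# the append succeeds only if new_df's schema matches the existing delta table;
# otherwise delta lake rejects the write with a schema-mismatch error
new_df.write.format("delta").mode("append").save("s3://delta-lake/sales")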


example 2: reading and querying a delta table

df = spark.read.format("delta").load("s3://delta-lake/sales")
df.createOrReplaceTempView("sales_data")
result = spark.sql("SELECT product, SUM(amount) FROM sales_data GROUP BY product")
result.show()

explanation

  1. read delta table – df = spark.read.format("delta").load("s3://delta-lake/sales") loads structured delta lake data, ensuring consistency and optimized performance for queries.

  2. create sql view – df.createOrReplaceTempView("sales_data") registers the dataset as a temporary sql table, enabling querying with spark sql.

  3. execute sql query – result = spark.sql("SELECT product, SUM(amount) FROM sales_data GROUP BY product") retrieves aggregated sales data by grouping products and summing sales amounts; a dataframe api equivalent is sketched after this list.

  4. display query output – result.show() prints the computed sales totals per product, making real-time analytics possible within a databricks or spark environment.
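
the same aggregation can also be expressed with the dataframe api instead of spark sql. a minimal sketch using the df loaded above; the column names product and amount come from the query in this example.

from pyspark.sql import functions as F

# group by product and sum the amount column, mirroring the sql query above
result_df = df.groupBy("product").agg(F.sum("amount").alias("total_amount"))
result_df.show()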


example 3: implementing time travel in delta lake

df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("s3://delta-lake/sales")
df_v1.show()

df_date = spark.read.format("delta").option("timestampAsOf", "2024-02-24").load("s3://delta-lake/sales")
df_date.show()

explanation

  1. retrieve past version – df_v1 = spark.read.format("delta").option("versionAsOf", 1).load("s3://delta-lake/sales") loads an earlier version of the delta table for historical analysis.

  2. display old records – df_v1.show() prints the dataset as it existed at version 1, enabling debugging and rollback capabilities in delta lake.

  3. query by timestamp – df_date = spark.read.format("delta").option("timestampAsOf", "2024-02-24").load("s3://delta-lake/sales") retrieves data as it was on a specific date.

  4. verify historical state – df_date.show() allows comparing past dataset states, ensuring consistency and enabling audits in regulatory or analytical workflows. a sketch after this list shows how to find the versions and timestamps available for time travel.
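
before time traveling, it helps to know which versions and timestamps exist. a minimal sketch using DeltaTable.history() from the delta-spark python api, which returns the table's commit log as a dataframe.

from delta.tables import DeltaTable

# inspect the commit history to find versions and timestamps available for time travel
delta_table = DeltaTable.forPath(spark, "s3://delta-lake/sales")
delta_table.history().select("version", "timestamp", "operation").show(truncate=False)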


example 4: optimizing and vacuuming a delta table

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://delta-lake/sales")
delta_table.optimize().executeCompaction()
delta_table.vacuum(168)

explanation

  1. initialize delta table – delta_table = DeltaTable.forPath(spark, "s3://delta-lake/sales") loads a delta lake table for maintenance operations like optimization and cleanup.

  2. run compaction – delta_table.optimize().executeCompaction() merges smaller files into larger ones, reducing i/o operations and improving query performance.

  3. remove old files – delta_table.vacuum(168) deletes obsolete data files older than 168 hours (7 days), reclaiming storage space while maintaining access to recent records.

  4. boost query speed – optimization and vacuuming reduce data fragmentation, making analytics queries faster and more cost-efficient in large-scale data environments. an optional z-ordering sketch follows this list.
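
as an optional follow-up, the same optimize builder can z-order data by a frequently filtered column. a minimal sketch assuming a delta lake release that supports these apis (2.0+ for executeZOrderBy, 2.1+ for DeltaTable.detail()); the column "product" is chosen only because it appears in the earlier query.

# cluster the compacted files by product so queries filtering on it scan fewer files
delta_table.optimize().executeZOrderBy("product")

# optional: inspect table metadata to confirm file count and size after maintenance
delta_table.detail().select("numFiles", "sizeInBytes").show()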


these four examples demonstrate essential delta lake functionality, covering data storage, querying, time travel, and performance optimization.