databricks tutorial with code examples


example 1: creating an etl pipeline in databricks

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("etl_pipeline").getOrCreate()
df = spark.read.csv("s3://data-source/customers.csv", header=True, inferSchema=True)
df.write.mode("overwrite").parquet("s3://data-warehouse/customers.parquet")

explanation

  1. initialize a spark session – spark = SparkSession.builder.appName("etl_pipeline").getOrCreate() gets or creates the spark session in databricks, enabling distributed data processing for large-scale etl workloads.

  2. read data from s3 – df = spark.read.csv("s3://data-source/customers.csv", header=True, inferSchema=True) loads a csv dataset from amazon s3, treating the first row as column headers and automatically inferring the schema.

  3. write data to parquet – df.write.mode("overwrite").parquet("s3://data-warehouse/customers.parquet") saves the data as parquet files in an s3 data warehouse location, a columnar format that optimizes storage and scans.

  4. ensure efficient querying – parquet files improve performance, and mode("overwrite") ensures new runs replace old data, preventing duplication issues in analytics queries (a minimal transform step is sketched after this list).
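
the example above extracts from s3 and loads to parquet but applies no transformation. the snippet below is a minimal sketch of a transform step that could sit between the read and the write, assuming the dataset has the age and income columns used in example 2; the derived income_bracket column is purely illustrative.

from pyspark.sql import functions as F

# drop rows with missing values in the assumed age and income columns
cleaned = df.dropna(subset=["age", "income"])
# derive a simple illustrative column before writing
enriched = cleaned.withColumn("income_bracket", F.when(F.col("income") >= 75000, "high").otherwise("standard"))
enriched.write.mode("overwrite").parquet("s3://data-warehouse/customers.parquet")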


example 2: training a machine learning model in databricks

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler

df = spark.read.parquet("s3://data-warehouse/customers.parquet")
features = VectorAssembler(inputCols=["age", "income"], outputCol="features").transform(df)
model = LogisticRegression(labelCol="churn", featuresCol="features").fit(features)
model.save("s3://models/churn_model")

explanation

  1. load data from parquet – df = spark.read.parquet("s3://data-warehouse/customers.parquet") reads structured data from the optimized parquet storage written in example 1, ensuring fast and efficient retrieval.

  2. assemble features – features = VectorAssembler(inputCols=["age", "income"], outputCol="features").transform(df) combines the numerical columns into the single feature vector that spark ml models require.

  3. train a logistic regression model – model = LogisticRegression(labelCol="churn", featuresCol="features").fit(features) trains a classification model on customer churn data using logistic regression in databricks.

  4. save the trained model – model.save("s3://models/churn_model") persists the trained model so it can be reused for predictions without retraining (a reload-and-predict sketch follows this list).
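
to show how the persisted model would be reused, here is a minimal sketch that reloads it with LogisticRegressionModel.load and scores a new batch of customers; the new_customers path is a hypothetical input.

from pyspark.ml.classification import LogisticRegressionModel
from pyspark.ml.feature import VectorAssembler

# reload the model saved in the example above
model = LogisticRegressionModel.load("s3://models/churn_model")

# score a fresh batch of customers (hypothetical input path, same feature columns)
new_df = spark.read.parquet("s3://data-warehouse/new_customers.parquet")
new_features = VectorAssembler(inputCols=["age", "income"], outputCol="features").transform(new_df)
predictions = model.transform(new_features)
predictions.select("prediction", "probability").show(5)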


example 3: running a real-time streaming job in databricks

from pyspark.sql.functions import from_json, col
from pyspark.sql.types import StructType, StringType, IntegerType

schema = StructType().add("user_id", IntegerType()).add("action", StringType())
# kafka.bootstrap.servers is required for the kafka source; the broker address below is a placeholder
stream = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "broker:9092").option("subscribe", "user_events").load()
parsed = stream.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*")
parsed.writeStream.format("console").start()

explanation

  1. define the streaming schema – schema = StructType().add("user_id", IntegerType()).add("action", StringType()) specifies the structure of incoming kafka messages for real-time processing.

  2. connect to kafka – spark.readStream.format("kafka") with the kafka.bootstrap.servers and subscribe options establishes a streaming connection to the user_events kafka topic to ingest live data in databricks; the broker address in the example is a placeholder.

  3. parse streaming data – parsed = stream.select(from_json(col("value").cast("string"), schema).alias("data")).select("data.*") extracts and structures incoming kafka messages according to the defined schema.

  4. write the stream to the console – parsed.writeStream.format("console").start() outputs processed streaming data to the console, enabling real-time monitoring and debugging of incoming user events (a delta-sink variant is sketched after this list).
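
console output is useful for debugging, but production jobs normally write to durable storage. the sketch below, with assumed s3 paths for the table and checkpoint location, shows the same parsed stream written to a delta table with a checkpoint so the query can recover after a restart.

# write the parsed stream to a delta table (table and checkpoint paths are assumptions)
query = (parsed.writeStream
         .format("delta")
         .outputMode("append")
         .option("checkpointLocation", "s3://delta-lake/checkpoints/user_events")
         .start("s3://delta-lake/user_events"))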


example 4: optimizing queries using databricks delta lake

from delta.tables import DeltaTable

delta_table = DeltaTable.forPath(spark, "s3://delta-lake/sales")
delta_table.optimize().executeCompaction()
delta_table.vacuum(168)

explanation

  1. initialize the delta table – delta_table = DeltaTable.forPath(spark, "s3://delta-lake/sales") loads a delta lake table for transactional data management within databricks.

  2. run optimization – delta_table.optimize().executeCompaction() compacts small files into larger ones, improving query performance by reducing i/o overhead in databricks.

  3. clean up old files – delta_table.vacuum(168) removes data files no longer referenced by the table that are older than 168 hours (7 days), reducing unnecessary storage costs.

  4. ensure query speed – delta lake optimization enables faster analytics by structuring data efficiently, supporting low-latency queries on large datasets within databricks (a z-ordering sketch follows this list).
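
when queries frequently filter on one column, compaction can be combined with z-ordering to co-locate related records. the sketch below assumes the sales table has a customer_id column (an assumption) and clusters data files by it, then uses delta time travel to read an earlier version of the table.

# compact and z-order by a frequently filtered column (customer_id is an assumed column)
delta_table.optimize().executeZOrderBy("customer_id")

# delta retains table history, so earlier versions stay queryable (time travel)
previous = spark.read.format("delta").option("versionAsOf", 0).load("s3://delta-lake/sales")
previous.show(5)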


these four examples demonstrate essential databricks functionalities, covering etl pipelines, machine learning, streaming, and delta lake optimization.