Lesson 2: Spark Tutorial

Example 1: Loading Data from CSV and Exploring It

This example loads a CSV file into a DataFrame, displays its schema, and shows some basic operations like counting rows and previewing data.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson2CSV").getOrCreate()

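# Load CSV into DataFrame; inferSchema asks Spark to detect column types
# (without it, every column is read as a string)
df = spark.read.option("header", True).option("inferSchema", True).csv("data.csv")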

# Show first 10 rows
df.show(10)

# Print schema
df.printSchema()

# Count rows
print(f"Total rows: {df.count()}")

# Select specific columns
df.select("Name", "Age").show()


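  1. Initializing SparkSession: spark = SparkSession.builder.appName("Lesson2CSV").getOrCreate() returns the active Spark session, or creates one named "Lesson2CSV" if none exists. The session is the entry point for reading data and running queries.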

  2. Loading CSV into DataFrame: spark.read with option("header", True) followed by csv("data.csv") loads the file and treats the first row as column headers. Unless the inferSchema option is also enabled, Spark reads every column as a string.

  3. Previewing Data: df.show(10) displays the first 10 rows of the loaded DataFrame to check the structure, and df.count() returns the total number of rows.

  4. Schema and Column Selection: df.printSchema() prints each column's name and data type, and df.select("Name", "Age").show() displays only the "Name" and "Age" columns.
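
The steps above assume that a data.csv file exists in the working directory. A minimal sketch for creating one with Python's csv module follows. The rows are hypothetical sample data, and the City column is an assumption based on its use in the later examples.

```python
import csv

# Hypothetical sample rows; the column names match those used in this lesson.
# Empty fields are read as null by Spark's CSV reader by default, and "N/A"
# is a placeholder string of the kind that Example 2 converts to null.
rows = [
    {"Name": "Alice", "Age": "30", "City": "NY"},
    {"Name": "Bob", "Age": "", "City": "LA"},
    {"Name": "Carol", "Age": "41", "City": ""},
    {"Name": "Dan", "Age": "22", "City": "N/A"},
]


def write_sample_csv(path="data.csv"):
    """Write the sample rows to `path`, preceded by a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["Name", "Age", "City"])
        writer.writeheader()
        writer.writerows(rows)
    return path


write_sample_csv()
```

Running this once before Example 1 gives spark.read a small file with a header row and a few missing values to work with.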

Example 2: Data Cleaning and Handling Missing Values

This example demonstrates how to handle missing values using Spark’s DataFrame API.

from pyspark.sql.functions import col

# Fill missing values with default
df_filled = df.fillna({"Age": 0, "City": "Unknown"})

# Drop rows with any missing values
df_cleaned = df.na.drop()

# Replace specific values
df_replaced = df.replace(["N/A", "NA", "null"], None)

# Show results
df_filled.show()
df_cleaned.show()
df_replaced.show()


  1. Filling Missing Values: df_filled = df.fillna({"Age": 0, "City": "Unknown"}) replaces null values in "Age" with 0 and in "City" with "Unknown".

  2. Dropping Null Rows: df_cleaned = df.na.drop() removes all rows where at least one column contains a null value.

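  3. Replacing Specific Values: df_replaced = df.replace(["N/A", "NA", "null"], None) converts the placeholder strings "N/A", "NA", and "null" into real null values in string columns. Running this step before fillna or na.drop lets those entries be treated as missing too.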

  4. Displaying Cleaned Data: df_filled.show(), df_cleaned.show(), and df_replaced.show() present the results after handling missing data.

Example 3: Using SQL Queries in Spark

This example shows how to register a DataFrame as a temporary view and query it with Spark SQL.


# Register the DataFrame as a temporary view
df.createOrReplaceTempView("people")

# Run SQL Query
query_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")

# Show results
query_result.show()

# Grouping and Aggregation
agg_result = spark.sql("SELECT City, COUNT(*) as count FROM people GROUP BY City")

# Display Aggregated Data
agg_result.show()


  1. Registering a Table: df.createOrReplaceTempView("people") registers the DataFrame as a temporary view named "people", so it can be queried with Spark SQL for the lifetime of the session.

  2. Filtering with SQL: query_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25") extracts only people older than 25.

  3. Grouping and Aggregation: agg_result = spark.sql("SELECT City, COUNT(*) as count FROM people GROUP BY City") counts how many people belong to each city.

  4. Displaying Results: query_result.show() and agg_result.show() display filtered and aggregated results, useful for data analysis.

Example 4: Joining Two DataFrames

This example demonstrates how to perform inner joins and left joins using Spark.

from pyspark.sql import Row

# Create two DataFrames
df1 = spark.createDataFrame([Row(ID=1, Name="Alice"), Row(ID=2, Name="Bob")])
df2 = spark.createDataFrame([Row(ID=1, City="NY"), Row(ID=3, City="LA")])

# Perform Inner Join
inner_join = df1.join(df2, "ID", "inner")

# Perform Left Join
left_join = df1.join(df2, "ID", "left")

# Show results
inner_join.show()
left_join.show()


  1. Creating DataFrames: df1 contains IDs and names, while df2 has IDs and cities, simulating two related datasets.

  2. Inner Join: inner_join = df1.join(df2, "ID", "inner") keeps only rows whose ID appears in both DataFrames. Here that is ID 1 (Alice, NY).

  3. Left Join: left_join = df1.join(df2, "ID", "left") keeps all records from df1 and adds City where a match exists. Bob (ID 2) has no match in df2, so his City is null.

  4. Displaying Joined Data: inner_join.show() and left_join.show() present merged results, demonstrating different join operations.

This tutorial walks through four core Spark DataFrame operations: loading, cleaning, querying, and joining data. 🚀