Lesson 2: Spark Tutorial


Example 1: Loading Data from CSV and Exploring It

This example loads a CSV file into a DataFrame, displays its schema, and shows some basic operations like counting rows and previewing data.

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson2CSV").getOrCreate()

# Load CSV into DataFrame, inferring column types
df = spark.read.option("header", True).option("inferSchema", True).csv("data.csv")

# Show first few rows
df.show(10)

# Print schema
df.printSchema()

# Count rows
print(f"Total rows: {df.count()}")

# Select specific columns
df.select("Name", "Age").show()

Explanation:

  1. Initializing SparkSession: spark = SparkSession.builder.appName("Lesson2CSV").getOrCreate() creates (or reuses) the Spark session, the entry point for all DataFrame operations.

  2. Loading CSV into DataFrame: df = spark.read.option("header", True).option("inferSchema", True).csv("data.csv") loads a CSV file, treating the first row as column headers and inferring each column's type. Without inferSchema, every column is read as a string.

  3. Previewing Data: df.show(10) displays the first 10 rows of the loaded DataFrame to check the structure.

  4. Schema and Column Selection: df.printSchema() prints the column names and data types, and df.select("Name", "Age").show() displays only the selected columns. For large files, schema inference can be avoided by declaring the schema explicitly, as in the sketch below.

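Schema inference requires an extra pass over the file, so on large datasets it is common to declare the schema up front instead. A minimal sketch, assuming the same hypothetical data.csv with Name, Age, and City columns:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Explicit schema for the hypothetical data.csv (name, type, nullable)
schema = StructType([
    StructField("Name", StringType(), True),
    StructField("Age", IntegerType(), True),
    StructField("City", StringType(), True),
])

# No inference pass: the declared types are applied directly
df_typed = spark.read.option("header", True).schema(schema).csv("data.csv")
df_typed.printSchema()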

Example 2: Data Cleaning and Handling Missing Values

This example demonstrates how to handle missing values using Spark’s DataFrame API.

# Fill missing values with default
df_filled = df.fillna({"Age": 0, "City": "Unknown"})

# Drop rows with any missing values
df_cleaned = df.na.drop()

# Replace specific values
df_replaced = df.replace(["N/A", "NA", "null"], None)

# Show results
df_filled.show()
df_cleaned.show()
df_replaced.show()

Explanation:

  1. Filling Missing Values: df_filled = df.fillna({"Age": 0, "City": "Unknown"}) replaces null values in "Age" with 0 and in "City" with "Unknown".

  2. Dropping Null Rows: df_cleaned = df.na.drop() removes every row where at least one column contains a null value; finer-grained variants are shown in the sketch after this list.

  3. Replacing Specific Values: df_replaced = df.replace(["N/A", "NA", "null"], None) turns placeholder strings ("N/A", "NA", "null") into real nulls, so that fillna and na.drop can then act on them.

  4. Displaying Cleaned Data: df_filled.show(), df_cleaned.show(), and df_replaced.show() present the results after handling missing data.

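Dropping every row that has any null can be too aggressive. na.drop also accepts how, thresh, and subset arguments for finer control; a short sketch reusing the df loaded in Example 1:

# Drop rows only when "Age" is null
df_age_present = df.na.drop(subset=["Age"])

# Keep rows that have at least 2 non-null values
df_thresh = df.na.drop(thresh=2)

# Drop rows only when every column is null
df_not_empty = df.na.drop(how="all")

df_age_present.show()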

Example 3: Using SQL Queries in Spark

This example shows how to register a DataFrame as a SQL table and run queries using Spark SQL.

df.createOrReplaceTempView("people")

# Run SQL Query
query_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")

# Show results
query_result.show()

# Grouping and Aggregation
agg_result = spark.sql("SELECT City, COUNT(*) as count FROM people GROUP BY City")

# Display Aggregated Data
agg_result.show()

Explanation:

  1. Registering a Table: df.createOrReplaceTempView("people") registers the DataFrame as a temporary view named people, so it can be queried with standard SQL through spark.sql.

  2. Filtering with SQL: query_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25") extracts only people older than 25.

  3. Grouping and Aggregation: agg_result = spark.sql("SELECT City, COUNT(*) as count FROM people GROUP BY City") counts how many people belong to each city.

  4. Displaying Results: query_result.show() and agg_result.show() display the filtered and aggregated results. The equivalent DataFrame API calls are sketched below.

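Every Spark SQL query has a DataFrame API equivalent, and both compile to the same optimized execution plan. The two queries above, rewritten with DataFrame methods:

from pyspark.sql.functions import col

# Equivalent to: SELECT Name, Age FROM people WHERE Age > 25
filtered_df = df.filter(col("Age") > 25).select("Name", "Age")
filtered_df.show()

# Equivalent to: SELECT City, COUNT(*) AS count FROM people GROUP BY City
grouped_df = df.groupBy("City").count()
grouped_df.show()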

Example 4: Joining Two DataFrames

This example demonstrates how to perform inner joins and left joins using Spark.

from pyspark.sql import Row

# Create two DataFrames
df1 = spark.createDataFrame([Row(ID=1, Name="Alice"), Row(ID=2, Name="Bob")])
df2 = spark.createDataFrame([Row(ID=1, City="NY"), Row(ID=3, City="LA")])

# Perform Inner Join
inner_join = df1.join(df2, "ID", "inner")

# Perform Left Join
left_join = df1.join(df2, "ID", "left")

# Show results
inner_join.show()
left_join.show()

Explanation:

  1. Creating DataFrames: df1 contains IDs and names, while df2 has IDs and cities, simulating two related datasets.

  2. Inner Join: inner_join = df1.join(df2, "ID", "inner") merges only matching ID values, keeping common records from both tables.

  3. Left Join: left_join = df1.join(df2, "ID", "left") keeps all records from df1, adding City data where a match exists and null where it does not.

  4. Displaying Joined Data: inner_join.show() and left_join.show() present the merged results; the remaining join types follow the same pattern, as shown in the sketch after this list.

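Spark supports several other join types through the same method signature, including full outer, left semi, and left anti joins. A short sketch using df1 and df2 from above:

# Full outer join: every ID from both sides, with nulls where there is no match
full_join = df1.join(df2, "ID", "full")
full_join.show()

# Left anti join: rows in df1 whose ID has no match in df2
anti_join = df1.join(df2, "ID", "left_anti")
anti_join.show()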

This lesson covered the essentials of Spark’s DataFrame API: loading, cleaning, querying, and joining data. 🚀