Lesson 2: Spark Tutorial
Example 1: Loading Data from CSV and Exploring It
This example loads a CSV file into a DataFrame, displays its schema, and shows some basic operations like counting rows and previewing data.
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson2CSV").getOrCreate()
# Load CSV into DataFrame
df = spark.read.option("header", True).csv("data.csv")
# Show first few rows
df.show(10)
# Print schema
df.printSchema()
# Count rows
print(f"Total rows: {df.count()}")
# Select specific columns
df.select("Name", "Age").show()
Explanation:
Initializing SparkSession: spark = SparkSession.builder.appName("Lesson2CSV").getOrCreate() creates a Spark session for handling CSV data.
Loading CSV into DataFrame: df = spark.read.option("header", True).csv("data.csv") loads the CSV file while treating the first row as column headers.
Previewing Data: df.show(10) displays the first 10 rows of the loaded DataFrame so you can check its structure.
Schema and Column Selection: df.printSchema() prints the data type of each column, and df.select("Name", "Age").show() selects specific columns to display.
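Note that by default Spark reads every CSV column as a string, so printSchema() may report "string" even for numeric fields such as Age. One option is to enable schema inference when reading; a minimal sketch, assuming the same data.csv file:
# Re-read the CSV and let Spark infer column types (requires an extra pass over the file)
df_typed = spark.read.option("header", True).option("inferSchema", True).csv("data.csv")
# Age should now appear as a numeric type instead of string
df_typed.printSchema()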
Example 2: Data Cleaning and Handling Missing Values
This example demonstrates how to handle missing values using Spark’s DataFrame API.
# Fill missing values with default
df_filled = df.fillna({"Age": 0, "City": "Unknown"})
# Drop rows with any missing values
df_cleaned = df.na.drop()
# Replace specific values
df_replaced = df.replace(["N/A", "NA", "null"], None)
# Show results
df_filled.show()
df_cleaned.show()
df_replaced.show()
Explanation:
Filling Missing Values: df_filled = df.fillna({"Age": 0, "City": "Unknown"}) replaces null values in "Age" with 0 and in "City" with "Unknown".
Dropping Null Rows: df_cleaned = df.na.drop() removes every row in which at least one column contains a null value.
Replacing Specific Values: df_replaced = df.replace(["N/A", "NA", "null"], None) converts the placeholder strings "N/A", "NA", and "null" into proper null values.
Displaying Cleaned Data: df_filled.show(), df_cleaned.show(), and df_replaced.show() present the results after handling missing data.
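If you need finer control over which rows are dropped, na.drop() also accepts subset and thresh arguments. A minimal sketch, reusing the same df and the column names assumed throughout this lesson:
# Drop rows only when "Age" or "City" is null
df_subset = df.na.drop(subset=["Age", "City"])
# Keep only rows that contain at least two non-null values
df_thresh = df.na.drop(thresh=2)
df_subset.show()
df_thresh.show()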
Example 3: Using SQL Queries in Spark
This example shows how to register a DataFrame as a SQL table and run queries using Spark SQL.
df.createOrReplaceTempView("people")
# Run SQL Query
query_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25")
# Show results
query_result.show()
# Grouping and Aggregation
agg_result = spark.sql("SELECT City, COUNT(*) as count FROM people GROUP BY City")
# Display Aggregated Data
agg_result.show()
Explanation:
Registering a Table: df.createOrReplaceTempView("people") allows the DataFrame to be queried with SQL syntax.
Filtering with SQL: query_result = spark.sql("SELECT Name, Age FROM people WHERE Age > 25") extracts only people older than 25.
Grouping and Aggregation: agg_result = spark.sql("SELECT City, COUNT(*) as count FROM people GROUP BY City") counts how many people belong to each city.
Displaying Results: query_result.show() and agg_result.show() display the filtered and aggregated results, which is useful for data analysis.
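The same two queries can also be written with the DataFrame API instead of SQL. A rough equivalent, assuming Age is (or can be compared as) a numeric column:
from pyspark.sql.functions import col
# Equivalent of: SELECT Name, Age FROM people WHERE Age > 25
filtered_df = df.select("Name", "Age").filter(col("Age") > 25)
filtered_df.show()
# Equivalent of: SELECT City, COUNT(*) as count FROM people GROUP BY City
grouped_df = df.groupBy("City").count()
grouped_df.show()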
Example 4: Joining Two DataFrames
This example demonstrates how to perform inner joins and left joins using Spark.
from pyspark.sql import Row
# Create two DataFrames
df1 = spark.createDataFrame([Row(ID=1, Name="Alice"), Row(ID=2, Name="Bob")])
df2 = spark.createDataFrame([Row(ID=1, City="NY"), Row(ID=3, City="LA")])
# Perform Inner Join
inner_join = df1.join(df2, "ID", "inner")
# Perform Left Join
left_join = df1.join(df2, "ID", "left")
# Show results
inner_join.show()
left_join.show()
Explanation:
Creating DataFrames: df1 contains IDs and names, while df2 has IDs and cities, simulating two related datasets.
Inner Join: inner_join = df1.join(df2, "ID", "inner") merges only matching ID values, keeping the records common to both tables.
Left Join: left_join = df1.join(df2, "ID", "left") keeps all records from df1, adding City data where available.
Displaying Joined Data: inner_join.show() and left_join.show() present the merged results, demonstrating the two join operations.
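The third argument to join() accepts other join types as well. A short sketch using the same df1 and df2, showing a full outer join and a left anti join:
# Full outer join keeps all IDs from both DataFrames, filling the gaps with nulls
outer_join = df1.join(df2, "ID", "outer")
# Left anti join keeps only the rows of df1 that have no match in df2
anti_join = df1.join(df2, "ID", "left_anti")
outer_join.show()
anti_join.show()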
This tutorial provides a hands-on introduction to Spark's DataFrame operations, covering loading, cleaning, querying, and joining data efficiently. 🚀