Lesson 1: Spark
Here's a structured tutorial for Lesson 1: Spark, including 4 code examples with explanations.
Example 1: Creating a SparkSession
from pyspark.sql import SparkSession
# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate()
# Print Spark version
print(spark.version)
Explanation:
Importing SparkSession:
from pyspark.sql import SparkSession
imports the SparkSession class, required for working with Spark's DataFrame API in PySpark.
Initializing SparkSession:
spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate()
creates a Spark session, which is the entry point for Spark applications; getOrCreate() returns an existing session if one is already running, and builds a new one otherwise.
Naming the Spark App:
appName("Lesson1Tutorial")
sets the name of the Spark application, making it easier to identify in the Spark UI.
Printing Spark Version:
print(spark.version)
retrieves and prints the current version of Spark installed in the environment.
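For local experiments, the builder can also set a master URL and configuration values. Below is a minimal sketch, assuming a local (non-cluster) environment; the master URL and shuffle-partition count are illustrative settings, not requirements of the lesson.
from pyspark.sql import SparkSession
# "local[*]" runs Spark on all local cores; the small shuffle-partition
# count is an illustrative choice for tiny local datasets.
spark = (
    SparkSession.builder
    .appName("Lesson1Tutorial")
    .master("local[*]")
    .config("spark.sql.shuffle.partitions", "4")
    .getOrCreate()
)
print(spark.version)
When you are finished with a session, spark.stop() shuts it down and releases its resources.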
Example 2: Creating a DataFrame
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]
df = spark.createDataFrame(data, columns)
df.show()
Explanation:
Defining Data:
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
represents a list of tuples, where each tuple contains a name and an age.
Column Names:
columns = ["Name", "Age"]
defines column names for the DataFrame, making it structured like a table.
Creating DataFrame:
df = spark.createDataFrame(data, columns)
converts the list of tuples into a Spark DataFrame, which is optimized for distributed processing.
Displaying DataFrame:
df.show()
prints the contents of the DataFrame in tabular form, showing the "Name" and "Age" columns.
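Spark infers the column types here (Name as a string, Age as a long). If you want explicit types and nullability instead, you can pass a schema rather than the plain column-name list; here is a short sketch using the standard StructType API, reusing the data list from above.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
# Declare each column's name, type, and nullability explicitly
schema = StructType([
    StructField("Name", StringType(), nullable=False),
    StructField("Age", IntegerType(), nullable=False),
])
df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()  # confirms the declared types and nullability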
Example 3: Filtering Data in a DataFrame
filtered_df = df.filter(df.Age > 28)
filtered_df.show()
Explanation:
Applying Filter:
filtered_df = df.filter(df.Age > 28)
selects only rows where the "Age" column value is greater than 28.
Condition Evaluation:
df.Age > 28
acts as a boolean filter, returning only rows meeting the condition.
Creating a New DataFrame:
filtered_df = df.filter(...)
returns a new DataFrame containing only the filtered rows, without modifying the original DataFrame.
Displaying Filtered Data:
filtered_df.show()
prints the new DataFrame, showing only people older than 28.
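filter() also accepts combined predicates; here is a small sketch using the col() helper and the & operator. Each condition needs its own parentheses, because & binds more tightly than the comparison operators in Python.
from pyspark.sql.functions import col
# Both conditions must hold; with the sample data, only Bob (age 30) matches
older_b_names = df.filter((col("Age") > 28) & (col("Name").startswith("B")))
older_b_names.show()
Note that where() is an alias for filter(), so df.where(df.Age > 28) behaves identically.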
Example 4: Aggregating Data with GroupBy
from pyspark.sql.functions import avg
df_grouped = df.groupBy("Age").count()
df_grouped.show()
Explanation:
Importing Functions:
from pyspark.sql.functions import avg
imports the avg() aggregation function. This example uses count(), which is available directly on grouped data, so the import is only needed for the avg() sketch after this explanation.
Grouping Data:
df.groupBy("Age")
groups rows with the same "Age" together, preparing for aggregation.
Applying Count Aggregation:
df_grouped = df.groupBy("Age").count()
calculates the number of times each age appears in the DataFrame (with the sample data, each age appears once, so every count is 1).
Displaying Grouped Data:
df_grouped.show()
prints the result, showing the count of each unique age value in the dataset.
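Since avg() was imported but not used above, here is a minimal sketch applying it; agg() and alias() are standard PySpark DataFrame methods, and the expected value follows from the three rows created in Example 2.
from pyspark.sql.functions import avg
# agg() applies column aggregations; alias() names the output column
df.agg(avg("Age").alias("avg_age")).show()  # (25 + 30 + 35) / 3 = 30.0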
This tutorial gives a solid foundation for working with Spark and PySpark, covering session creation, DataFrame operations, filtering, and aggregation.