Lesson 1: Spark

Here's a structured tutorial for Lesson 1: Spark, including 4 code examples with explanations.


Example 1: Creating a SparkSession

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate()

# Print Spark version
print(spark.version)

Explanation:

  1. Importing SparkSession: from pyspark.sql import SparkSession imports the SparkSession class, required for working with Spark's DataFrame API in PySpark.

  2. Initializing SparkSession: spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate() builds a SparkSession, the entry point for Spark applications. getOrCreate() returns an existing session if one is already running rather than creating a second one.

  3. Naming the Spark App: appName("Lesson1Tutorial") sets the name of the Spark application, making it easier to identify in the Spark UI. The builder also accepts further settings before getOrCreate(), as sketched after this list.

  4. Printing Spark Version: print(spark.version) retrieves and prints the current version of Spark installed in the environment.
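If you are running this outside a managed environment (for example as a plain local Python script), the builder can also set the master URL and configuration values before getOrCreate(). The sketch below is a minimal example of that; the local[*] master and the shuffle-partition value are assumptions for a small single-machine setup, not requirements of the lesson.

from pyspark.sql import SparkSession

# Assumed local setup: run Spark on all local cores with a small shuffle partition count
spark = (
    SparkSession.builder
    .appName("Lesson1Tutorial")
    .master("local[*]")                           # use all available local cores
    .config("spark.sql.shuffle.partitions", "4")  # fewer partitions suit tiny local datasets
    .getOrCreate()
)

print(spark.version)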


Example 2: Creating a DataFrame

# Sample data: a list of (name, age) tuples
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]

# Build a DataFrame from the local data with the given column names
df = spark.createDataFrame(data, columns)

# Display the DataFrame as a table
df.show()

Explanation:

  1. Defining Data: data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)] represents a list of tuples, where each tuple contains a name and an age.

  2. Column Names: columns = ["Name", "Age"] defines column names for the DataFrame, making it structured like a table.

  3. Creating DataFrame: df = spark.createDataFrame(data, columns) converts the list of tuples into a Spark DataFrame, which is optimized for distributed processing.

  4. Displaying DataFrame: df.show() prints the contents of the DataFrame in tabular form, showing the "Name" and "Age" columns.
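If you want explicit control over the column types instead of letting Spark infer them from the Python tuples, createDataFrame also accepts a schema. A minimal sketch, reusing the data list defined above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the column types explicitly: Name is a string, Age is an integer
schema = StructType([
    StructField("Name", StringType(), nullable=False),
    StructField("Age", IntegerType(), nullable=False),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()  # shows the declared types rather than inferred ones
df_typed.show()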


Example 3: Filtering Data in a DataFrame

# Keep only the rows where Age is greater than 28
filtered_df = df.filter(df.Age > 28)

# Display the filtered rows
filtered_df.show()

Explanation:

  1. Applying Filter: filtered_df = df.filter(df.Age > 28) selects only rows where the "Age" column value is greater than 28.

  2. Condition Evaluation: df.Age > 28 is a Column expression that evaluates to true or false for each row; filter() keeps only the rows where it is true. Equivalent ways to write the same filter are sketched after this list.

  3. Creating a New DataFrame: filtered_df = df.filter(...) results in a new DataFrame containing only the filtered results, without modifying the original DataFrame.

  4. Displaying Filtered Data: filtered_df.show() prints the new DataFrame, showing only people older than 28.
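The same filter can be written in a few equivalent styles; choosing between them is mostly a matter of readability. A short sketch (the combined condition on Age and Name is only an illustration):

from pyspark.sql.functions import col

# Same as df.Age > 28, using the col() helper
df.filter(col("Age") > 28).show()

# Same filter expressed as a SQL-style string
df.filter("Age > 28").show()

# Conditions combine with & (and) and | (or); the parentheses are required
df.filter((col("Age") > 28) & (col("Name") != "Charlie")).show()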


Example 4: Aggregating Data with GroupBy

from pyspark.sql.functions import avg  # avg() is used in the follow-up sketch below

# Count how many rows share each Age value
df_grouped = df.groupBy("Age").count()

# Display the per-age counts
df_grouped.show()

Explanation:

  1. Importing Functions: from pyspark.sql.functions import avg imports one of Spark's built-in aggregation functions. The snippet itself only needs count(), which is available directly on the grouped data; avg() is put to work in the sketch after this list.

  2. Grouping Data: df.groupBy("Age") groups rows with the same "Age" together, preparing for aggregation.

  3. Applying Count Aggregation: df_grouped = df.groupBy("Age").count() calculates the number of times each age appears in the DataFrame.

  4. Displaying Grouped Data: df_grouped.show() prints the result, showing the count of each unique age value in the dataset. With this sample data every count is 1, since no age repeats.
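groupBy() becomes more useful when it feeds several aggregations at once through agg(). Because every age in the lesson's sample data is unique, the sketch below uses a small hypothetical department/salary dataset (invented purely for illustration) so the groups actually contain more than one row:

from pyspark.sql.functions import avg, max, count

# Hypothetical data where the grouping column repeats
emp = spark.createDataFrame(
    [("Engineering", 75000), ("Engineering", 82000), ("Sales", 60000)],
    ["Dept", "Salary"],
)

# Several aggregations per group, with readable output column names
emp.groupBy("Dept").agg(
    count("*").alias("employees"),
    avg("Salary").alias("avg_salary"),
    max("Salary").alias("max_salary"),
).show()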


This tutorial gives a solid foundation for working with Spark and PySpark, covering session creation, DataFrame operations, filtering, and aggregation.