Lesson 1: Spark

Here's a structured tutorial for Lesson 1: Spark, including 4 code examples with explanations.


Example 1: Creating a SparkSession

from pyspark.sql import SparkSession

# Initialize SparkSession
spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate()

# Print Spark version
print(spark.version)

Explanation:

  1. Importing SparkSession: from pyspark.sql import SparkSession imports the SparkSession class, required for working with Spark's DataFrame API in PySpark.

  2. Initializing SparkSession: spark = SparkSession.builder.appName("Lesson1Tutorial").getOrCreate() builds a SparkSession, the entry point for Spark applications. getOrCreate() returns an existing session if one is already running rather than creating a second one.

  3. Naming the Spark App: appName("Lesson1Tutorial") sets the name of the Spark application, making it easier to identify in the Spark UI. The builder also accepts further settings before getOrCreate(), as sketched after this list.

  4. Printing Spark Version: print(spark.version) retrieves and prints the current version of Spark installed in the environment.
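If you are running this outside a managed environment (for example as a plain local Python script), the builder can also set the master URL and configuration values before getOrCreate(). The sketch below is a minimal example of that; the local[*] master and the shuffle-partition value are assumptions for a small single-machine setup, not requirements of the lesson.

from pyspark.sql import SparkSession

# Assumed local setup: run Spark on all local cores with a small shuffle partition count
spark = (
    SparkSession.builder
    .appName("Lesson1Tutorial")
    .master("local[*]")                           # use all available local cores
    .config("spark.sql.shuffle.partitions", "4")  # fewer partitions suit tiny local datasets
    .getOrCreate()
)

print(spark.version)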


Example 2: Creating a DataFrame

# Sample data: a list of (name, age) tuples
data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)]
columns = ["Name", "Age"]

# Build a DataFrame from the local data with the given column names
df = spark.createDataFrame(data, columns)

# Display the DataFrame as a table
df.show()

Explanation:

  1. Defining Data: data = [("Alice", 25), ("Bob", 30), ("Charlie", 35)] represents a list of tuples, where each tuple contains a name and an age.

  2. Column Names: columns = ["Name", "Age"] defines column names for the DataFrame, making it structured like a table.

  3. Creating DataFrame: df = spark.createDataFrame(data, columns) converts the list of tuples into a Spark DataFrame, which is optimized for distributed processing.

  4. Displaying DataFrame: df.show() prints the contents of the DataFrame in tabular form, showing the "Name" and "Age" columns.
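If you want explicit control over the column types instead of letting Spark infer them from the Python tuples, createDataFrame also accepts a schema. A minimal sketch, reusing the data list defined above:

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Declare the column types explicitly: Name is a string, Age is an integer
schema = StructType([
    StructField("Name", StringType(), nullable=False),
    StructField("Age", IntegerType(), nullable=False),
])

df_typed = spark.createDataFrame(data, schema)
df_typed.printSchema()  # shows the declared types rather than inferred ones
df_typed.show()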


Example 3: Filtering Data in a DataFrame

# Keep only the rows where Age is greater than 28
filtered_df = df.filter(df.Age > 28)

# Display the filtered rows
filtered_df.show()

Explanation:

  1. Applying Filter: filtered_df = df.filter(df.Age > 28) selects only rows where the "Age" column value is greater than 28.

  2. Condition Evaluation: df.Age > 28 is a Column expression that evaluates to true or false for each row; filter() keeps only the rows where it is true. Equivalent ways to write the same filter are sketched after this list.

  3. Creating a New DataFrame: filtered_df = df.filter(...) results in a new DataFrame containing only the filtered results, without modifying the original DataFrame.

  4. Displaying Filtered Data: filtered_df.show() prints the new DataFrame, showing only people older than 28.
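The same filter can be written in a few equivalent styles; choosing between them is mostly a matter of readability. A short sketch (the combined condition on Age and Name is only an illustration):

from pyspark.sql.functions import col

# Same as df.Age > 28, using the col() helper
df.filter(col("Age") > 28).show()

# Same filter expressed as a SQL-style string
df.filter("Age > 28").show()

# Conditions combine with & (and) and | (or); the parentheses are required
df.filter((col("Age") > 28) & (col("Name") != "Charlie")).show()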


Example 4: Aggregating Data with GroupBy

from pyspark.sql.functions import avg  # avg() is used in the follow-up sketch below

# Count how many rows share each Age value
df_grouped = df.groupBy("Age").count()

# Display the per-age counts
df_grouped.show()

Explanation:

  1. Importing Functions: from pyspark.sql.functions import avg imports one of Spark's built-in aggregation functions. The snippet itself only needs count(), which is available directly on the grouped data; avg() is put to work in the sketch after this list.

  2. Grouping Data: df.groupBy("Age") groups rows with the same "Age" together, preparing for aggregation.

  3. Applying Count Aggregation: df_grouped = df.groupBy("Age").count() calculates the number of times each age appears in the DataFrame.

  4. Displaying Grouped Data: df_grouped.show() prints the result, showing the count of each unique age value in the dataset. With this sample data every count is 1, since no age repeats.
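groupBy() becomes more useful when it feeds several aggregations at once through agg(). Because every age in the lesson's sample data is unique, the sketch below uses a small hypothetical department/salary dataset (invented purely for illustration) so the groups actually contain more than one row:

from pyspark.sql.functions import avg, max, count

# Hypothetical data where the grouping column repeats
emp = spark.createDataFrame(
    [("Engineering", 75000), ("Engineering", 82000), ("Sales", 60000)],
    ["Dept", "Salary"],
)

# Several aggregations per group, with readable output column names
emp.groupBy("Dept").agg(
    count("*").alias("employees"),
    avg("Salary").alias("avg_salary"),
    max("Salary").alias("max_salary"),
).show()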


This tutorial gives a solid foundation for working with Spark and PySpark, covering session creation, DataFrame operations, filtering, and aggregation.