How to Add Columns to DataFrame and Check Schema in PySpark
In this tutorial, we’ll cover how to add columns to a DataFrame and how to check a DataFrame’s schema using PySpark.
1. Creating a DataFrame
First, start a SparkSession so the snippets below are runnable on their own.

from pyspark.sql import SparkSession

# getOrCreate() reuses an existing session (e.g. in a notebook) or starts a new one
spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "David", 40)
]
df = spark.createDataFrame(data, ["id", "name", "age"])
df.show()
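For the data above, df.show() should print something like this:

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
|  4|  David| 40|
+---+-------+---+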
2. Adding New Columns
We can add new columns with the withColumn() method, which takes the new column’s name and a Column expression. lit() wraps a constant value as a column.
from pyspark.sql.functions import lit
df_new = df.withColumn("country", lit("USA"))
df_new.show()
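Note that withColumn() overwrites a column if the name already exists, which makes it equally useful for modifying columns in place. A minimal sketch, here casting the existing age column (the cast target is just an example):

from pyspark.sql.functions import col

# Replace "age" with a double-typed version of itself
df_cast = df.withColumn("age", col("age").cast("double"))
df_cast.printSchema()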
3. Adding Columns Using Expressions
A new column can also be derived from an existing one: col() references a column, and arithmetic on it produces a new Column expression.
from pyspark.sql.functions import col
df_exp = df.withColumn("age_double", col("age") * 2)
df_exp.show()
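The same logic can be written as a SQL expression string using expr(); this sketch is an equivalent alternative, not a different result:

from pyspark.sql.functions import expr

# "age * 2" is parsed as a SQL expression against df's columns
df_exp_sql = df.withColumn("age_double", expr("age * 2"))
df_exp_sql.show()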
4. Adding Multiple Columns
Chaining withColumn() calls adds several columns in one statement (lit and col were imported in the earlier snippets); a dict-based alternative follows the example.
df_multi = (
    df.withColumn("country", lit("USA"))
      .withColumn("age_plus_ten", col("age") + 10)
)
df_multi.show()
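If you are on Spark 3.3 or newer, withColumns() accepts a dict of column names to expressions and adds them in a single call; a sketch assuming that version:

# Spark 3.3+: one call instead of a withColumn() chain
df_multi2 = df.withColumns({
    "country": lit("USA"),
    "age_plus_ten": col("age") + 10,
})
df_multi2.show()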
5. Checking the Schema of a DataFrame
df.printSchema()
printSchema() prints the DataFrame’s schema as a tree, showing each column’s name, data type, and whether it is nullable.
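For the df created in step 1, the output should look like this (Python integers are inferred as long, strings as string):

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

If you need the schema programmatically rather than printed, df.dtypes returns a list of (name, type) tuples and df.schema returns the full StructType.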
Conclusion
Adding columns in PySpark is simple and flexible. The withColumn() method is the most common way to add or modify columns, and printSchema() provides a quick view of the DataFrame’s structure.