How to Add Columns to DataFrame and Check Schema in PySpark
In this tutorial, we’ll cover how to add columns to a DataFrame and how to check a DataFrame’s schema using PySpark.
1. Creating a DataFrame
First, start a SparkSession so the snippets below are runnable on their own.

from pyspark.sql import SparkSession

# getOrCreate() reuses an existing session (e.g. in a notebook) or starts a new one
spark = SparkSession.builder.getOrCreate()

data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "David", 40)
]
df = spark.createDataFrame(data, ["id", "name", "age"])
df.show()
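For the data above, df.show() should print something like this:

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
|  4|  David| 40|
+---+-------+---+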
2. Adding New Columns
We can add new columns with the withColumn() method, which takes the new column’s name and a Column expression. lit() wraps a constant value as a column.
from pyspark.sql.functions import lit
df_new = df.withColumn("country", lit("USA"))
df_new.show()
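Note that withColumn() overwrites a column if the name already exists, which makes it equally useful for modifying columns in place. A minimal sketch, here casting the existing age column (the cast target is just an example):

from pyspark.sql.functions import col

# Replace "age" with a double-typed version of itself
df_cast = df.withColumn("age", col("age").cast("double"))
df_cast.printSchema()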
3. Adding Columns Using Expressions
A new column can also be derived from an existing one: col() references a column, and arithmetic on it produces a new Column expression.
from pyspark.sql.functions import col
df_exp = df.withColumn("age_double", col("age") * 2)
df_exp.show()
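The same logic can be written as a SQL expression string using expr(); this sketch is an equivalent alternative, not a different result:

from pyspark.sql.functions import expr

# "age * 2" is parsed as a SQL expression against df's columns
df_exp_sql = df.withColumn("age_double", expr("age * 2"))
df_exp_sql.show()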
4. Adding Multiple Columns
Chaining withColumn() calls adds several columns in one statement (lit and col were imported in the earlier snippets); a dict-based alternative follows the example.
df_multi = (
    df.withColumn("country", lit("USA"))
      .withColumn("age_plus_ten", col("age") + 10)
)
df_multi.show()
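If you are on Spark 3.3 or newer, withColumns() accepts a dict of column names to expressions and adds them in a single call; a sketch assuming that version:

# Spark 3.3+: one call instead of a withColumn() chain
df_multi2 = df.withColumns({
    "country": lit("USA"),
    "age_plus_ten": col("age") + 10,
})
df_multi2.show()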
5. Checking the Schema of a DataFrame
df.printSchema()
printSchema() prints the DataFrame’s schema as a tree, showing each column’s name, data type, and whether it is nullable.
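For the df created in step 1, the output should look like this (Python integers are inferred as long, strings as string):

root
 |-- id: long (nullable = true)
 |-- name: string (nullable = true)
 |-- age: long (nullable = true)

If you need the schema programmatically rather than printed, df.dtypes returns a list of (name, type) tuples and df.schema returns the full StructType.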
Conclusion
Adding columns in PySpark is simple and flexible. The withColumn() method is the most common way to add or modify columns, and printSchema() provides a quick view of the DataFrame’s structure.