How to Use the columns Attribute in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
📘 Introduction
The columns attribute in PySpark is a quick and effective way to retrieve the list of column names from a DataFrame. This is useful when you are working dynamically with column names in big data pipelines.
🔍 What is the columns Attribute in PySpark?
The columns attribute returns a Python list of column names from the DataFrame. You can use this list to inspect, iterate over, or programmatically manipulate columns in your PySpark applications.
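Because the result is a plain Python list, standard list operations apply to it. The snippet below is a minimal sketch of such checks; it assumes a DataFrame named df like the one created in the next section.

# Check whether a column exists before using it
if "Salary" in df.columns:
    print("Salary column is present")

# Count the number of columns
print(f"Number of columns: {len(df.columns)}")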
🧾 Sample Dataset
Name           Department  Salary
Aamir Shahzad  IT          5000
Ali Raza       HR          4000
Bob            Finance     4500
Lisa           HR          4000
🔧 Create DataFrame in PySpark
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("columnsFunctionExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", "IT", 5000),
("Ali Raza", "HR", 4000),
("Bob", "Finance", 4500),
("Lisa", "HR", 4000)
]
# Create DataFrame
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
📌 Using the columns Attribute
# Get list of column names
print("List of columns in the DataFrame:")
print(df.columns)
# Loop through column names
print("\nLooping through columns:")
for col_name in df.columns:
    print(col_name)
✅ Expected Output
+-------------+-----------+------+
| Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad| IT| 5000|
| Ali Raza| HR| 4000|
| Bob| Finance| 4500|
| Lisa| HR| 4000|
+-------------+-----------+------+
List of columns in the DataFrame:
['Name', 'Department', 'Salary']
Looping through columns:
Name
Department
Salary
📝 Explanation
- df.columns returns a list of all column names in the DataFrame.
- It is useful for dynamic column operations like renaming, filtering, or applying transformations; see the sketch below.
- You can loop over df.columns to apply custom logic to each column.
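As an illustration of such a dynamic operation, the minimal sketch below lowercases every column name using the list from df.columns. It reuses the df built earlier; df_lower is just an illustrative name.

# Rename all columns to lowercase in one pass
df_lower = df.toDF(*[c.lower() for c in df.columns])
df_lower.show()

Note that toDF takes the new names positionally, so the list comprehension preserves the original column order.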