How to Use the colRegex() Function in PySpark | Step-by-Step Guide
Author: Aamir Shahzad
Published: March 2025
Introduction

The colRegex() function in PySpark allows you to select multiple columns using a regular expression. It is especially useful when dealing with DataFrames that have dynamic or similar column naming patterns.
What is colRegex() in PySpark?

The colRegex() method is part of the DataFrame API. It enables you to use regular expressions to match and select column names based on patterns, such as a prefix, suffix, or substring, instead of specifying each column manually.
Sample Dataset

Name           Department  Salary  Country
Aamir Shahzad  IT          5000    US
Ali Raza       HR          4000    CA
Bob            Finance     4500    UK
Lisa           HR          4000    CA
Create DataFrame in PySpark
from pyspark.sql import SparkSession
# Create Spark session
spark = SparkSession.builder.appName("colRegexExample").getOrCreate()
# Sample data
data = [
("Aamir Shahzad", "IT", 5000, "US"),
("Ali Raza", "HR", 4000, "CA"),
("Bob", "Finance", 4500, "UK"),
("Lisa", "HR", 4000, "CA")
]
# Create DataFrame
columns = ["Name", "Department", "Salary", "Country"]
df = spark.createDataFrame(data, columns)
# Show DataFrame
df.show()
Using colRegex() for Column Selection
# Example 1: Select columns starting with 'Dep'
df.select(df.colRegex("`^Dep.*`")).show()
# Example 2: Select columns ending with 'Name'
df.select(df.colRegex("`.*Name$`")).show()
# Example 3: Select columns containing 'try'
df.select(df.colRegex("`.*try.*`")).show()
✅ Expected Output - Example 1
+----------+
|Department|
+----------+
|        IT|
|        HR|
|   Finance|
|        HR|
+----------+
✅ Expected Output - Example 2
+-------------+
| Name|
+-------------+
|Aamir Shahzad|
| Ali Raza|
| Bob|
| Lisa|
+-------------+
✅ Expected Output - Example 3
+-------+
|Country|
+-------+
|     US|
|     CA|
|     UK|
|     CA|
+-------+
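A note on the pattern strings used above: the backticks inside each string are part of the argument that colRegex() expects; the regular expression is passed quoted in backticks, as in Spark's own documentation examples. A tiny hypothetical helper (not part of PySpark) makes this explicit:

```python
def quoted_regex(pattern: str) -> str:
    """Wrap a regex in the backticks that colRegex() expects.

    Illustrative only: it builds strings like "`^Dep.*`" for use as
    df.select(df.colRegex(quoted_regex("^Dep.*"))).
    """
    return f"`{pattern}`"

print(quoted_regex("^Dep.*"))   # prints `^Dep.*`
```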
Explanation

- ^Dep.* matches any column name starting with 'Dep'.
- .*Name$ matches any column name ending with 'Name'.
- .*try.* matches any column name that contains 'try' anywhere.
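As a quick sanity check, you can try the same patterns against the column names with Python's built-in re module before running them on Spark. This only approximates colRegex() (Spark's exact matching semantics, e.g. case sensitivity, can differ):

```python
import re

columns = ["Name", "Department", "Salary", "Country"]

def matching_columns(pattern: str, names: list[str]) -> list[str]:
    # Approximate what select(colRegex(...)) expands to: keep every
    # column whose full name matches the regular expression.
    return [n for n in names if re.fullmatch(pattern, n)]

print(matching_columns(r"^Dep.*", columns))   # ['Department']
print(matching_columns(r".*Name$", columns))  # ['Name']
print(matching_columns(r".*try.*", columns))  # ['Country']
```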