PySpark Tutorial: How to Use colRegex() for Column Selection

How to Use colRegex() Function in PySpark | Step-by-Step Guide

Author: Aamir Shahzad

Published: March 2025

📘 Introduction

The colRegex() function in PySpark allows you to select multiple columns using regular expressions. It is especially useful when dealing with DataFrames that have dynamic or similar column naming patterns.

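📌 What is colRegex() in PySpark?

The colRegex() method is part of the DataFrame API. It takes a regular expression and returns a Column that stands for every column whose name matches it, so you can select by prefix, suffix, or substring instead of listing each column manually. The pattern must be wrapped in backticks inside the string, for example "`^Dep.*`"; without the backticks, the argument is treated as an ordinary column name.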

🧾 Sample Dataset

Name           Department    Salary     Country
Aamir Shahzad   IT            5000       US
Ali Raza        HR            4000       CA
Bob             Finance       4500       UK
Lisa            HR            4000       CA

🔧 Create DataFrame in PySpark

from pyspark.sql import SparkSession

# Create Spark session
spark = SparkSession.builder.appName("colRegexExample").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", "IT", 5000, "US"),
    ("Ali Raza", "HR", 4000, "CA"),
    ("Bob", "Finance", 4500, "UK"),
    ("Lisa", "HR", 4000, "CA")
]

# Create DataFrame
columns = ["Name", "Department", "Salary", "Country"]
df = spark.createDataFrame(data, columns)

# Show DataFrame
df.show()

📊 Using colRegex() for Column Selection

# Example 1: Select columns starting with 'Dep'
df.select(df.colRegex("`^Dep.*`")).show()

# Example 2: Select columns ending with 'Name'
df.select(df.colRegex("`.*Name$`")).show()

# Example 3: Select columns containing 'try'
df.select(df.colRegex("`.*try.*`")).show()

✅ Expected Output - Example 1

+----------+
|Department|
+----------+
|        IT|
|        HR|
|   Finance|
|        HR|
+----------+

✅ Expected Output - Example 2

+-------------+
|         Name|
+-------------+
|Aamir Shahzad|
|     Ali Raza|
|          Bob|
|         Lisa|
+-------------+

✅ Expected Output - Example 3

+-------+
|Country|
+-------+
|     US|
|     CA|
|     UK|
|     CA|
+-------+

📌 Explanation

  • ^Dep.* matches any column name starting with 'Dep'.
  • .*Name$ matches any column name ending with 'Name'.
  • .*try.* matches any column name that contains 'try' anywhere.
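To check a pattern before running a Spark job, the matching can be approximated in plain Python. The sketch below is not part of the PySpark API: preview_colregex is a hypothetical helper. It assumes that Spark tests the pattern against the whole column name and ignores case under the default spark.sql.caseSensitive=false setting. Spark uses Java regular expressions, so results may differ from Python's re module for more advanced syntax.

```python
import re


def preview_colregex(columns, pattern, case_sensitive=False):
    """Return the names in `columns` that `pattern` matches in full.

    Hypothetical helper for illustration only. It approximates colRegex()
    under the assumption that the pattern must match the entire column name.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    compiled = re.compile(pattern, flags)
    return [name for name in columns if compiled.fullmatch(name)]


# Column names from the sample DataFrame above
columns = ["Name", "Department", "Salary", "Country"]

# The three patterns from the tutorial, written without the backticks
starts_with_dep = preview_colregex(columns, r"^Dep.*")
ends_with_name = preview_colregex(columns, r".*Name$")
contains_try = preview_colregex(columns, r".*try.*")

# An extra pattern that matches more than one column
ends_with_y = preview_colregex(columns, r".*y")

print(starts_with_dep)  # ['Department']
print(ends_with_name)   # ['Name']
print(contains_try)     # ['Country']
print(ends_with_y)      # ['Salary', 'Country']
```

Under the whole-name assumption, the ^ and $ anchors are redundant, while the leading and trailing .* in the substring pattern are what allow 'try' to appear anywhere in the name.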

© 2025 Aamir Shahzad. All rights reserved.

Visit TechBrothersIT for more tutorials.