Extracting Substrings in PySpark
In this tutorial, you'll learn how to use PySpark string functions like substr()
, substring()
, overlay()
, left()
, and right()
to manipulate string columns in DataFrames.
1️⃣ Initialize Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("SubstringTutorial").getOrCreate()
2️⃣ Sample DataFrame
data = [("EMP-0001-NC", "maria.fernandez@healthcare.org"),
("EMP-0002-TX", "john.davis@fintech.com"),
("EMP-0003-CA", "lucy.liu@retailmart.net"),
("EMP-0004-FL", "peter.parker@dailybugle.com"),
("EMP-0005-NY", "bruce.wayne@gotham.org")]
columns = ["employee_code", "email"]
df = spark.createDataFrame(data, columns)
df.show(truncate=False)
Output:
+-------------+-----------------------------+
|employee_code|email |
+-------------+-----------------------------+
|EMP-0001-NC |maria.fernandez@healthcare.org|
|EMP-0002-TX |john.davis@fintech.com |
|EMP-0003-CA |lucy.liu@retailmart.net |
|EMP-0004-FL |peter.parker@dailybugle.com |
|EMP-0005-NY |bruce.wayne@gotham.org |
+-------------+-----------------------------+
3️⃣ substr() Function
Extract substring starting at a specific position with defined length.
from pyspark.sql.functions import col, substr
df_substr = df.withColumn("emp_id", substr("employee_code", 5, 4))
df_substr.show(truncate=False)
Output:
+-------------+-----------------------------+------+
|employee_code|email |emp_id|
+-------------+-----------------------------+------+
|EMP-0001-NC |maria.fernandez@healthcare.org|0001 |
|EMP-0002-TX |john.davis@fintech.com |0002 |
|EMP-0003-CA |lucy.liu@retailmart.net |0003 |
|EMP-0004-FL |peter.parker@dailybugle.com |0004 |
|EMP-0005-NY |bruce.wayne@gotham.org |0005 |
+-------------+-----------------------------+------+
4️⃣ substring() Function
Alias for substr()
- works the same way.
from pyspark.sql.functions import substring
df_substring = df.withColumn("emp_id", substring("employee_code", 5, 4))
df_substring.show(truncate=False)
5️⃣ overlay() Function
Replaces part of the string with another string.
from pyspark.sql.functions import overlay
df_overlay = df.withColumn("masked_email", overlay("email", "XXXXXX", 1, 5))
df_overlay.show(truncate=False)
6️⃣ left() and right() Functions
Extract leftmost or rightmost characters from string.
from pyspark.sql.functions import left, right
df_left = df.withColumn("left_code", left("employee_code", 3))
df_right = df.withColumn("right_code", right("employee_code", 2))
df_left.show(truncate=False)
df_right.show(truncate=False)