String Search in PySpark
Learn how to perform string filtering and matching in PySpark using functions like contains(), startswith(), endswith(), like(), rlike(), and locate().
🔍 Sample DataFrame
+---------------+----------------------+----------+
| customer_name | email                | status   |
+---------------+----------------------+----------+
| Aamir Khan    | aamir@example.com    | Active   |
| Bob Smith     | bob.smith@mail.com   | Inactive |
| Lisa Jones    | lisa.j@data.org      | Active   |
| Charlie Brown | charlie@peanuts.net  | Pending  |
| Alice Wonder  | alice.w@wonder.org   | Active   |
| Eve H         | eve.h@allabout.eve   | Inactive |
| John Doe      | john.doe@unknown.com | Active   |
+---------------+----------------------+----------+
✅ contains()
Find customers whose email column contains 'example'. (The sample DataFrame has no address column, so we search the email column here.)
df.filter(col("email").contains("example")).show(truncate=False)
✅ startswith() and endswith()
Filter emails starting with 'a' and ending with '.org'.
df.filter(col("email").startswith("a")).show(truncate=False)
df.filter(col("email").endswith(".org")).show(truncate=False)
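startswith() and endswith() mirror Python's built-in str methods of the same name; as a quick sanity check (plain Python, not Spark), applying both conditions to the sample emails selects a single row:

```python
emails = [
    "aamir@example.com", "bob.smith@mail.com", "lisa.j@data.org",
    "charlie@peanuts.net", "alice.w@wonder.org", "eve.h@allabout.eve",
    "john.doe@unknown.com",
]
# Same semantics as Python's str.startswith / str.endswith
both = [e for e in emails if e.startswith("a") and e.endswith(".org")]
# -> ["alice.w@wonder.org"]
```

In PySpark itself, the two column conditions combine with `&`, e.g. `df.filter(col("email").startswith("a") & col("email").endswith(".org"))`.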
✅ like()
Use a SQL-style wildcard match: % matches any sequence of characters, _ matches exactly one.
df.filter(col("customer_name").like("%is%")).show(truncate=False)
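As a mental model for the wildcards, `%is%` means "anything, then 'is', then anything". A plain-Python sketch that translates a LIKE pattern to a regex (the `sql_like` helper is illustrative, not a Spark API; it assumes Python 3.7+, where re.escape leaves % and _ unescaped):

```python
import re

def sql_like(pattern: str, value: str) -> bool:
    # SQL LIKE: '%' -> any sequence (regex '.*'), '_' -> any single char ('.')
    regex = "^" + re.escape(pattern).replace("%", ".*").replace("_", ".") + "$"
    return re.match(regex, value) is not None

names = ["Aamir Khan", "Bob Smith", "Lisa Jones", "Charlie Brown",
         "Alice Wonder", "Eve H", "John Doe"]
matches = [n for n in names if sql_like("%is%", n)]
# Only "Lisa Jones" contains the contiguous substring "is"
```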
✅ rlike()
Use a regex pattern match (e.g., find emails whose domain starts with m or d). Note that ^[md] would anchor to the start of the whole email, so we match on the character after '@' instead.
df.filter(col("email").rlike("@[md]")).show(truncate=False)
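rlike() performs an unanchored regex search, much like Python's re.search: `^[md]` would require the email itself to start with m or d (no sample email does), while `@[md]` finds domains starting with m or d. A plain-Python sketch over the sample emails:

```python
import re

emails = ["aamir@example.com", "bob.smith@mail.com", "lisa.j@data.org",
          "charlie@peanuts.net", "alice.w@wonder.org", "eve.h@allabout.eve",
          "john.doe@unknown.com"]

# re.search (like rlike) matches anywhere in the string, so '@[md]'
# targets the first character of the domain
domains_md = [e for e in emails if re.search(r"@[md]", e)]
# -> ["bob.smith@mail.com", "lisa.j@data.org"]
```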
✅ locate()
Find the 1-based position of a substring in the email (e.g., the '@' position). locate() is imported from pyspark.sql.functions and takes the substring first.
df.select("email", locate("@", col("email")).alias("at_pos")).show(truncate=False)
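Both locate() and the related instr() return a 1-based position, with 0 when the substring is absent. A plain-Python model of that behavior (the `locate_py` helper is illustrative, not a Spark API):

```python
def locate_py(substr: str, s: str, pos: int = 1) -> int:
    # Mirrors Spark's locate(): 1-based index, 0 if not found,
    # searching from the 1-based start position `pos`
    idx = s.find(substr, pos - 1)
    return idx + 1

at_pos = locate_py("@", "aamir@example.com")  # -> 6 ('@' is the 6th character)
missing = locate_py("@", "no-at-sign")        # -> 0
```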