PySpark Tutorial: String Search in PySpark - How to Use contains, startswith, endswith, like, rlike, locate

String Search in PySpark

Learn how to filter and match strings in PySpark with the Column methods contains(), startswith(), endswith(), like(), and rlike(), and with the locate() function.

🔍 Sample DataFrame

+---------------+----------------------+----------+
| customer_name | email                | status   |
+---------------+----------------------+----------+
| Aamir Khan    | aamir@example.com    | Active   |
| Bob Smith     | bob.smith@mail.com   | Inactive |
| Lisa Jones    | lisa.j@data.org      | Active   |
| Charlie Brown | charlie@peanuts.net  | Pending  |
| Alice Wonder  | alice.w@wonder.org   | Active   |
| Eve H         | eve.h@allabout.eve   | Inactive |
| John Doe      | john.doe@unknown.com | Active   |
+---------------+----------------------+----------+

✅ contains()

Find customers whose email contains the substring 'mail'. The match is literal (no wildcards) and case-sensitive.

from pyspark.sql.functions import col
df.filter(col("email").contains("mail")).show(truncate=False)

✅ startswith() and endswith()

Filter emails that start with 'a', and, as a separate query, emails that end with '.org'.

df.filter(col("email").startswith("a")).show(truncate=False)
df.filter(col("email").endswith(".org")).show(truncate=False)

✅ like()

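Use a SQL-style wildcard match: % matches any sequence of characters and _ matches exactly one character. The example finds names that contain 'is'.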

df.filter(col("customer_name").like("%is%")).show(truncate=False)

✅ rlike()

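Use a regex pattern match (e.g., find emails whose domain starts with m or d). rlike() matches when the pattern is found anywhere in the string, so the pattern anchors on the '@' rather than on the start of the string.

df.filter(col("email").rlike("@[md]")).show(truncate=False)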

✅ locate()

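Find the 1-based position of a substring in the email (e.g., the position of '@'). locate() returns 0 when the substring is not found. instr(col("email"), "@") is equivalent, with the arguments in the opposite order.

from pyspark.sql.functions import col, locate
df.select("email", locate("@", col("email")).alias("at_pos")).show(truncate=False)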

Some of the contents in this website were created with assistance from ChatGPT and Gemini.