PySpark replace() Function Tutorial | Replace Values in DataFrames Easily
Learn how to use the replace() function in PySpark to replace values in one or more columns of a DataFrame. This guide walks through each case step by step with examples.
What is replace() in PySpark?
The replace() function in PySpark returns a new DataFrame in which specific values in one or more columns have been swapped for other values. It is helpful for:
- Data cleaning and preparation
- Handling inconsistent data
- Replacing multiple values at once
Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark replace() Example") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    (1, "Aamir Shahzad", "Pakistan", 5000),
    (2, "Ali Raza", "Pakistan", 6000),
    (3, "Bob", "USA", 5500),
    (4, "Lisa", "Canada", 7000),
    (5, "Unknown", "Unknown", None)
]
columns = ["ID", "Name", "Country", "Salary"]
df = spark.createDataFrame(data, columns)
df.show()
Step 3: Replace a Single Value in One Column
# Replace 'Unknown' with 'Not Provided' in the Name column
df_replaced_name = df.replace("Unknown", "Not Provided", subset=["Name"])
df_replaced_name.show()
Step 4: Replace Multiple Values in a Single Column
# Replace country names: Pakistan -> PK, USA -> US, Canada -> CA
df_replaced_country = df.replace({
    "Pakistan": "PK",
    "USA": "US",
    "Canada": "CA"
}, subset=["Country"])
df_replaced_country.show()
Step 5: Replace a Single Value Across Multiple Columns
# Replace 'Unknown' with 'Not Provided' in Name and Country columns
df_replaced_multiple = df.replace("Unknown", "Not Provided", subset=["Name", "Country"])
df_replaced_multiple.show()