PySpark replace() Function Tutorial | Replace Values in DataFrames Easily | PySpark Tutorial

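Learn how to use the replace() function in PySpark to replace values in one or more columns of a DataFrame. This guide walks through replacing a single value, replacing several values at once, and replacing one value across multiple columns, each with an example and its expected output.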

What is replace() in PySpark?

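The replace() function in PySpark (a DataFrame method, also available as df.na.replace()) returns a new DataFrame in which specific values in one or more columns are replaced with other values. It matches whole cell values rather than substrings, and the original DataFrame is left unchanged. It is helpful for: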

  • Data cleaning and preparation
  • Handling inconsistent data
  • Replacing multiple values at once

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark replace() Example") \
    .getOrCreate()

Step 2: Create a Sample DataFrame

data = [
    (1, "Aamir Shahzad", "Pakistan", 5000),
    (2, "Ali Raza", "Pakistan", 6000),
    (3, "Bob", "USA", 5500),
    (4, "Lisa", "Canada", 7000),
    (5, "Unknown", "Unknown", None)
]

columns = ["ID", "Name", "Country", "Salary"]

df = spark.createDataFrame(data, columns)
df.show()
Original DataFrame:
+---+--------------+--------+------+
| ID|          Name| Country|Salary|
+---+--------------+--------+------+
|  1| Aamir Shahzad|Pakistan|  5000|
|  2|      Ali Raza|Pakistan|  6000|
|  3|           Bob|     USA|  5500|
|  4|          Lisa|  Canada|  7000|
|  5|       Unknown| Unknown|  null|
+---+--------------+--------+------+

Step 3: Replace a Single Value in One Column

# Replace 'Unknown' with 'Not Provided' in the Name column
df_replaced_name = df.replace("Unknown", "Not Provided", subset=["Name"])
df_replaced_name.show()
Expected Output:
+---+--------------+--------+------+
| ID|          Name| Country|Salary|
+---+--------------+--------+------+
|  1| Aamir Shahzad|Pakistan|  5000|
|  2|      Ali Raza|Pakistan|  6000|
|  3|           Bob|     USA|  5500|
|  4|          Lisa|  Canada|  7000|
|  5|  Not Provided| Unknown|  null|
+---+--------------+--------+------+

Step 4: Replace Multiple Values in a Single Column

# Replace country names: Pakistan -> PK, USA -> US, Canada -> CA
df_replaced_country = df.replace({
    "Pakistan": "PK",
    "USA": "US",
    "Canada": "CA"
}, subset=["Country"])
df_replaced_country.show()
Expected Output:
+---+--------------+--------+------+
| ID|          Name| Country|Salary|
+---+--------------+--------+------+
|  1| Aamir Shahzad|      PK|  5000|
|  2|      Ali Raza|      PK|  6000|
|  3|           Bob|      US|  5500|
|  4|          Lisa|      CA|  7000|
|  5|       Unknown| Unknown|  null|
+---+--------------+--------+------+

Step 5: Replace a Single Value Across Multiple Columns

# Replace 'Unknown' with 'Not Provided' in Name and Country columns
df_replaced_multiple = df.replace("Unknown", "Not Provided", subset=["Name", "Country"])
df_replaced_multiple.show()
Expected Output:
+---+--------------+-------------+------+
| ID|          Name|      Country|Salary|
+---+--------------+-------------+------+
|  1| Aamir Shahzad|     Pakistan|  5000|
|  2|      Ali Raza|     Pakistan|  6000|
|  3|           Bob|          USA|  5500|
|  4|          Lisa|       Canada|  7000|
|  5|  Not Provided| Not Provided|  null|
+---+--------------+-------------+------+

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.