PySpark corr() Function Tutorial: Finding Correlation Between Columns

In this tutorial, you'll learn how to use PySpark’s corr() function to find the Pearson correlation coefficient between two columns of a DataFrame.

📌 What is corr() in PySpark?

The corr() function computes the Pearson Correlation Coefficient between two numeric columns of a DataFrame. Currently, only the Pearson method is supported.
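
As a minimal sketch (assuming a DataFrame named df with two numeric columns, col1 and col2), the call looks like this; the optional method argument accepts only "pearson":

# Sketch only: df, col1, and col2 are placeholders for your own DataFrame and columns.
r = df.corr("col1", "col2")                    # Pearson correlation by default
r = df.corr("col1", "col2", method="pearson")  # explicit method; only "pearson" is supported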

🧪 Step-by-Step Example

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark corr() Example") \
    .getOrCreate()

Step 2: Sample Data

data = [
    ("Aamir Shahzad", 25, 150000.0),
    ("Ali Raza", 30, 160000.0),
    ("Bob", 45, 120000.0),
    ("Lisa", 35, 75000.0),
    ("Aamir Shahzad", 50, 110000.0)
]

df = spark.createDataFrame(data, ["Name", "Age", "Salary"])
df.show()
+--------------+---+--------+
|          Name|Age|  Salary|
+--------------+---+--------+
| Aamir Shahzad| 25|150000.0|
|      Ali Raza| 30|160000.0|
|           Bob| 45|120000.0|
|          Lisa| 35| 75000.0|
| Aamir Shahzad| 50|110000.0|
+--------------+---+--------+

Step 3: Use corr() Function

result = df.corr("Age", "Salary")

Step 4: Print the Result

print("Correlation between Age and Salary:", result)
Correlation between Age and Salary: -0.4845537354461136
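
For reference, the same value can also be obtained with the corr() aggregate function from pyspark.sql.functions. This is a sketch run against the df created in Step 2:

from pyspark.sql import functions as F

# Aggregate form of the same computation on the df from Step 2
df.select(F.corr("Age", "Salary").alias("corr_age_salary")).show()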

✅ Why Use corr()?

  • Quickly identify relationships between two numeric variables.
  • Helps in feature selection and exploratory data analysis (EDA); see the sketch after this list.
  • Returns a single float value: the Pearson correlation coefficient.
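
As a rough illustration of the EDA use case (a sketch only, using the numeric columns from the example above), you can loop over column pairs to print a small correlation table:

# Sketch: pairwise Pearson correlations for the numeric columns of df
numeric_cols = ["Age", "Salary"]  # assumed numeric columns from the example above
for i, c1 in enumerate(numeric_cols):
    for c2 in numeric_cols[i + 1:]:
        print(f"corr({c1}, {c2}) = {df.corr(c1, c2)}")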

📺 Watch the Full Tutorial Video

▶️ Watch on YouTube

Author: Aamir Shahzad

© 2025 PySpark Tutorials. All rights reserved.