PySpark corr() Function Tutorial
Finding Correlation Between Columns
In this tutorial, you'll learn how to use PySpark’s corr() function to find the Pearson correlation coefficient between two columns of a DataFrame.
📌 What is corr() in PySpark?
The corr() function computes the Pearson correlation coefficient between two numeric columns of a DataFrame. It takes the two column names and an optional method argument; currently, only the Pearson method ("pearson") is supported.
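For reference, the Pearson coefficient is conventionally defined as shown below. The symbols are notation assumed here for illustration, not names from the PySpark API: x_i and y_i are the values of the two columns in row i, x̄ and ȳ are the column means, and n is the number of rows.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The value of r always lies between -1 and 1. A positive value means the two columns tend to rise together, a negative value means one tends to fall as the other rises, and a value near 0 means there is little linear association.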
🧪 Step-by-Step Example
Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark corr() Example") \
    .getOrCreate()
Step 2: Sample Data
data = [
    ("Aamir Shahzad", 25, 150000.0),
    ("Ali Raza", 30, 160000.0),
    ("Bob", 45, 120000.0),
    ("Lisa", 35, 75000.0),
    ("Aamir Shahzad", 50, 110000.0)
]
df = spark.createDataFrame(data, ["Name", "Age", "Salary"])
df.show()
+-------------+---+--------+
|         Name|Age|  Salary|
+-------------+---+--------+
|Aamir Shahzad| 25|150000.0|
|     Ali Raza| 30|160000.0|
|          Bob| 45|120000.0|
|         Lisa| 35| 75000.0|
|Aamir Shahzad| 50|110000.0|
+-------------+---+--------+
Step 3: Use corr() Function
result = df.corr("Age", "Salary")
Step 4: Print the Result
print("Correlation between Age and Salary:", result)
Correlation between Age and Salary: -0.4845537354461136
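As a sanity check, the same number can be reproduced without Spark. The sketch below is plain Python, not part of the PySpark API; the helper name pearson is a hypothetical one chosen for illustration, and the two lists are the Age and Salary values copied from the sample data in Step 2.

```python
import math

# Age and Salary values copied from the sample data in Step 2.
ages = [25, 30, 45, 35, 50]
salaries = [150000.0, 160000.0, 120000.0, 75000.0, 110000.0]


def pearson(xs, ys):
    """Hypothetical helper: Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Numerator: sum of the products of deviations from each mean.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Denominator terms: sums of squared deviations for each sequence.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


r = pearson(ages, salaries)
print(round(r, 4))  # -0.4846
```

The negative sign means that, in this small sample, Salary tends to be lower where Age is higher; a magnitude of about 0.48 indicates a moderate rather than strong linear association.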
✅ Why Use corr()?
- Quickly identifies linear relationships between two numeric columns.
- Helps with feature selection and exploratory data analysis (EDA).
- Returns a single float, the Pearson correlation coefficient, in the range -1 to 1.
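When reading the coefficient during EDA, keep in mind that Pearson correlation measures only linear association. The plain-Python sketch below illustrates this with made-up values that are not from the tutorial's DataFrame; pearson is again a hypothetical helper, not a PySpark function.

```python
import math


def pearson(xs, ys):
    """Hypothetical helper: Pearson correlation of two equal-length lists."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    numerator = sum(a * b for a, b in zip(dx, dy))
    denominator = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    return numerator / denominator


# Made-up illustration values, not from the tutorial's DataFrame.
x = [-2, -1, 0, 1, 2]
linear = [2 * v + 1 for v in x]  # y depends linearly on x
quadratic = [v * v for v in x]   # y depends on x, but not linearly

r_linear = pearson(x, linear)
r_quadratic = pearson(x, quadratic)
print(r_linear, r_quadratic)
```

Here the linear column correlates perfectly with x, while the quadratic column yields a coefficient of zero even though it is fully determined by x. A coefficient near zero therefore rules out a linear relationship, not every relationship.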