PySpark Tutorial: How to Use the cov() Function
This tutorial explains how to use the cov() function in PySpark to calculate the sample covariance between two numeric columns of a DataFrame.
📌 What is cov() in PySpark?
The cov() function computes the sample covariance between two numeric columns of a PySpark DataFrame, a measure of how the two columns vary together.
- Positive covariance: the two variables tend to increase together.
- Negative covariance: as one variable increases, the other tends to decrease.
- Covariance near zero: little or no linear relationship.
The pure-Python sketch below makes the formula concrete.
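To ground the definition, here is a minimal pure-Python sketch of the sample covariance formula (dividing by n - 1, as cov() does). It is illustrative only, not Spark's implementation; the numbers match the sample data used in Step 2 below.

def sample_cov(xs, ys):
    # Sample covariance: average co-deviation from the means, divided by n - 1.
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)

ages = [25, 30, 45, 35, 50]
salaries = [150000.0, 160000.0, 120000.0, 75000.0, 110000.0]
print(sample_cov(ages, salaries))  # -170000.0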
Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySpark cov() Example") \
    .getOrCreate()
Step 2: Create Sample Data
data = [
    ("Aamir Shahzad", 25, 150000.0),
    ("Ali Raza", 30, 160000.0),
    ("Bob", 45, 120000.0),
    ("Lisa", 35, 75000.0),
    ("Aamir Shahzad", 50, 110000.0)
]
df = spark.createDataFrame(data, ["Name", "Age", "Salary"])
df.show()
+--------------+---+--------+
| Name|Age| Salary|
+--------------+---+--------+
| Aamir Shahzad| 25|150000.0|
| Ali Raza| 30|160000.0|
| Bob| 45|120000.0|
| Lisa| 35| 75000.0|
| Aamir Shahzad| 50|110000.0|
+--------------+---+--------+
Step 3: Calculate Covariance
covariance_result = df.cov("Age", "Salary")
print("Covariance between Age and Salary:", covariance_result)
Covariance between Age and Salary: -170000.0
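As a cross-check, the same value can be computed with Spark SQL's covar_samp aggregate (covar_pop gives the population variant). This is a sketch that assumes the standard pyspark.sql.functions module; expr() lets us call the SQL aggregate directly on the df from Step 2.

from pyspark.sql import functions as F

# covar_samp is the SQL counterpart of DataFrame.cov(); both divide by n - 1.
df.agg(F.expr("covar_samp(Age, Salary)").alias("cov_age_salary")).show()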
✅ Why Use cov()?
- Quickly identify the direction of the linear relationship between two variables.
- Useful for exploratory data analysis and feature selection.
Note that covariance is unit-dependent: its magnitude reflects the scales of the columns, so it tells you the direction of the relationship, not its strength. For a scale-free measure, see the corr() sketch below.
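Pearson correlation rescales covariance to the range [-1, 1], making it comparable across differently scaled columns. A minimal sketch using PySpark's built-in DataFrame.corr(), reusing the df from Step 2:

# Correlation = covariance normalized by both standard deviations.
correlation_result = df.corr("Age", "Salary")
print("Correlation between Age and Salary:", correlation_result)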