PySpark Tutorial: How to Use cov() Function | Covariance Analysis

This tutorial explains how to use the cov() function in PySpark to calculate the sample covariance between two numeric columns of a DataFrame.

📌 What is cov() in PySpark?

The cov() function computes the sample covariance between two numeric columns of a PySpark DataFrame and returns a single float. The sign of the result tells you the direction of the relationship:

  • Positive covariance: both variables tend to increase together.
  • Negative covariance: one tends to increase while the other decreases.
  • Covariance near zero: little to no linear relationship.
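
Under the hood, the sample covariance is the sum of the products of each column's deviations from its mean, divided by n - 1. For illustration only, here is a minimal plain-Python sketch of that formula (sample_cov is a hypothetical helper, not a PySpark API):

def sample_cov(xs, ys):
    # cov(x, y) = sum((x_i - mean_x) * (y_i - mean_y)) / (n - 1)
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)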

Step 1: Create Spark Session

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark cov() Example") \
    .getOrCreate()

Step 2: Create Sample Data

data = [
    ("Aamir Shahzad", 25, 150000.0),
    ("Ali Raza", 30, 160000.0),
    ("Bob", 45, 120000.0),
    ("Lisa", 35, 75000.0),
    ("Aamir Shahzad", 50, 110000.0)
]

df = spark.createDataFrame(data, ["Name", "Age", "Salary"])

df.show()
+-------------+---+--------+
|         Name|Age|  Salary|
+-------------+---+--------+
|Aamir Shahzad| 25|150000.0|
|     Ali Raza| 30|160000.0|
|          Bob| 45|120000.0|
|         Lisa| 35| 75000.0|
|Aamir Shahzad| 50|110000.0|
+-------------+---+--------+

Step 3: Calculate Covariance

covariance_result = df.cov("Age", "Salary")
print("Covariance between Age and Salary:", covariance_result)
Covariance between Age and Salary: -170000.0
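
As a cross-check, the covar_samp aggregate from pyspark.sql.functions computes the same sample covariance, so you can verify the result directly (a quick verification sketch):

from pyspark.sql import functions as F

# covar_samp is the sample-covariance aggregate, matching df.cov()
df.select(F.covar_samp("Age", "Salary").alias("cov_age_salary")).show()
+--------------+
|cov_age_salary|
+--------------+
|     -170000.0|
+--------------+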

✅ Why Use cov()?

  • Quickly identify the direction of the relationship between two variables.
  • Useful for exploratory data analysis and feature selection.
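
Keep in mind that covariance is scale-dependent: expressing Salary in thousands would shrink the result by a factor of 1,000 without changing the underlying relationship. When you need a normalized measure between -1 and 1, the companion corr() method computes the Pearson correlation on the same columns:

# corr() returns the Pearson correlation, a scale-free counterpart of cov()
correlation_result = df.corr("Age", "Salary")
print("Correlation between Age and Salary:", correlation_result)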

📺 Watch the Full Tutorial Video

▶️ Watch on YouTube

Author: Aamir Shahzad

© 2025 PySpark Tutorials. All rights reserved.