PySpark Tutorial: How to Use crosstab() to Analyze Relationships Between Columns
This tutorial will show you how to use the crosstab() function in PySpark to create frequency tables and understand the relationship between two categorical columns.
1. What is crosstab() in PySpark?
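The crosstab() function in PySpark generates a contingency table (cross-tabulation) between two columns: it counts how often each combination of values from the two columns occurs. The distinct values of the first column become the rows and the distinct values of the second column become the column headers. The first column of the result is named by joining the two column names with an underscore (Name_Country in the example below), and combinations that never occur get a count of 0.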
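To make the counting concrete, here is a minimal plain-Python sketch of what a contingency table contains. It does not use Spark at all, and the `pairs` list is a small hypothetical sample invented for illustration, not the tutorial's DataFrame.

```python
from collections import Counter

# Hypothetical (row_value, column_value) pairs, e.g. (Name, Country)
pairs = [
    ("Ali Raza", "Pakistan"),
    ("Ali Raza", "Pakistan"),
    ("Ali Raza", "USA"),
    ("Bob", "USA"),
]

# Count how many times each (row, column) combination occurs
pair_counts = Counter(pairs)

# Distinct row and column values, sorted for a stable layout
row_values = sorted({row for row, _ in pairs})
col_values = sorted({col for _, col in pairs})

# Build the table; Counter returns 0 for combinations that never occur
table = {
    row: {col: pair_counts[(row, col)] for col in col_values}
    for row in row_values
}

for row in row_values:
    print(row, table[row])
```

Every row/column combination gets a cell, including the ones that never appear in the data, which is why a crosstab contains zeros rather than gaps.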
2. Create Spark Session
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("PySpark Crosstab Example") \
    .getOrCreate()
3. Create Sample DataFrame
data = [
    (1, "Aamir Shahzad", "Pakistan"),
    (2, "Ali Raza", "Pakistan"),
    (3, "Bob", "USA"),
    (4, "Lisa", "Canada"),
    (5, "Aamir Shahzad", "Pakistan"),
    (6, "Ali Raza", "Pakistan"),
    (7, "Bob", "USA"),
    (8, "Lisa", "Canada"),
    (9, "Aamir Shahzad", "Pakistan"),
    (10, "Ali Raza", "USA"),
    (11, "Bob", "USA"),
    (12, "Lisa", "Canada")
]
columns = ["ID", "Name", "Country"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
+---+-------------+--------+
| ID|         Name| Country|
+---+-------------+--------+
|  1|Aamir Shahzad|Pakistan|
|  2|     Ali Raza|Pakistan|
|  3|          Bob|     USA|
|  4|         Lisa|  Canada|
|  5|Aamir Shahzad|Pakistan|
|  6|     Ali Raza|Pakistan|
|  7|          Bob|     USA|
|  8|         Lisa|  Canada|
|  9|Aamir Shahzad|Pakistan|
| 10|     Ali Raza|     USA|
| 11|          Bob|     USA|
| 12|         Lisa|  Canada|
+---+-------------+--------+
4. Apply crosstab() Between Name and Country
crosstab_df = df.crosstab("Name", "Country")
print("Crosstab between Name and Country:")
crosstab_df.show(truncate=False)
+-------------+------+--------+---+
|Name_Country |Canada|Pakistan|USA|
+-------------+------+--------+---+
|Aamir Shahzad|0     |3       |0  |
|Ali Raza     |0     |2       |1  |
|Bob          |0     |0       |3  |
|Lisa         |3     |0       |0  |
+-------------+------+--------+---+