PySpark freqItems() Function: Identify Frequent Items in Columns Fast | PySpark Tutorial

In this tutorial, you'll learn how to use PySpark's freqItems() function to identify frequent items in one or multiple DataFrame columns. This method is helpful when performing data analysis or finding patterns in datasets.

What is freqItems() in PySpark?

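freqItems() is a PySpark DataFrame method that returns the frequently occurring values in one or more columns. The result is a single-row DataFrame with one array column per input column, named <column>_freqItems. The computation is approximate and can include false positives, so it is best suited to: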

  • Exploratory data analysis (EDA)
  • Data quality checks
  • Understanding data distributions
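To give a feel for why the result is approximate, here is a minimal pure-Python sketch of bounded-counter frequent-element counting, the family of algorithms that the Spark documentation attributes freqItems() to (Karp, Schenker, and Papadimitriou). This is a simplified illustration and not Spark's implementation: the function name approx_frequent_items and the exact eviction rule are assumptions made for this example.

```python
def approx_frequent_items(values, support=0.01):
    """Simplified frequent-element counting with a bounded number of counters.

    Hypothetical helper for illustration only; Spark's internals differ.
    """
    # Assumption of this sketch: keep at most floor(1 / support) candidates.
    max_candidates = int(1 / support)
    counters = {}
    for value in values:
        if value in counters:
            counters[value] += 1
        elif len(counters) < max_candidates:
            counters[value] = 1
        else:
            # No free slot: decrement every counter and drop those that hit zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    # Whatever survives is reported, even if its true frequency is below support.
    return list(counters)


names = ["Aamir Shahzad", "Ali Raza", "Bob", "Lisa", "Aamir Shahzad",
         "Ali Raza", "Bob", "Lisa", "Lisa", "Aamir Shahzad"]

print(approx_frequent_items(names, support=0.4))  # -> ['Lisa']
```

In this sketch "Lisa" is reported at support=0.4 even though it appears in only 3 of 10 values (30%). That is the kind of false positive a bounded-counter approach can produce; Spark's own output for the same data may differ.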

Step 1: Create Spark Session

from pyspark.sql import SparkSession

# Use a single backslash for line continuation; a doubled backslash is a syntax error.
spark = SparkSession.builder \
    .appName("PySpark freqItems() Example") \
    .getOrCreate()

Step 2: Create Sample DataFrame

data = [
    (1, "Aamir Shahzad", "Pakistan"),
    (2, "Ali Raza", "Pakistan"),
    (3, "Bob", "USA"),
    (4, "Lisa", "Canada"),
    (5, "Aamir Shahzad", "Pakistan"),
    (6, "Ali Raza", "Pakistan"),
    (7, "Bob", "USA"),
    (8, "Lisa", "Canada"),
    (9, "Lisa", "Canada"),
    (10, "Aamir Shahzad", "Pakistan")
]

columns = ["ID", "Name", "Country"]

df = spark.createDataFrame(data, columns)

print("Original DataFrame:")
df.show()
+---+-------------+--------+
| ID|         Name| Country|
+---+-------------+--------+
|  1|Aamir Shahzad|Pakistan|
|  2|     Ali Raza|Pakistan|
|  3|          Bob|     USA|
|  4|         Lisa|  Canada|
|  5|Aamir Shahzad|Pakistan|
|  6|     Ali Raza|Pakistan|
|  7|          Bob|     USA|
|  8|         Lisa|  Canada|
|  9|         Lisa|  Canada|
| 10|Aamir Shahzad|Pakistan|
+---+-------------+--------+

Step 3: Identify Frequent Items in a Single Column

freq_name = df.freqItems(["Name"])

print("Frequent Items in 'Name' Column:")
freq_name.show(truncate=False)
+------------------------------------+
|Name_freqItems                      |
+------------------------------------+
|[Lisa, Bob, Aamir Shahzad, Ali Raza]|
+------------------------------------+

Step 4: Identify Frequent Items in Multiple Columns

freq_name_country = df.freqItems(["Name", "Country"])

print("Frequent Items in 'Name' and 'Country' Columns:")
freq_name_country.show(truncate=False)
+------------------------------------+-----------------------+
|Name_freqItems                      |Country_freqItems      |
+------------------------------------+-----------------------+
|[Lisa, Bob, Aamir Shahzad, Ali Raza]|[Pakistan, USA, Canada]|
+------------------------------------+-----------------------+

Step 5: Adjust Support Threshold (Optional)

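By default, the support threshold is 1% (support=0.01). Raising it narrows the result toward values that occur more often. Keep in mind that the result can contain false positives: in the output below, Aamir Shahzad and Lisa each appear in only 3 of the 10 rows (30%), which is below the 40% threshold, yet both are returned.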

freq_with_support = df.freqItems(["Name"], support=0.4)

print("Frequent Items in 'Name' Column with Support = 0.4 (40%):")
freq_with_support.show(truncate=False)
+---------------------+
|Name_freqItems       |
+---------------------+
|[Aamir Shahzad, Lisa]|
+---------------------+

Author: Aamir Shahzad

© 2024 PySpark Tutorials. All rights reserved.