PySpark freqItems() Function | Identify Frequent Items in Columns Fast
In this tutorial, you'll learn how to use PySpark's freqItems() function to identify frequent items in one or more DataFrame columns. This method is helpful for exploratory data analysis and for spotting patterns in datasets.
What is freqItems() in PySpark?
freqItems() is a PySpark DataFrame function that returns the frequent items (values) in one or more columns. The result is approximate and may contain false positives, so it is better suited to exploration than to exact counting. It's useful for:
- Exploratory data analysis (EDA)
- Data quality checks
- Understanding data distributions
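To give a feel for how frequent items can be found in a single pass with bounded memory, here is a simplified counter-based sketch in plain Python. It illustrates the general frequent-element counting idea (the Spark documentation attributes freqItems() to an algorithm by Karp, Schenker, and Papadimitriou), but it is not Spark's actual implementation. The function name `approx_frequent_items` and its parameters are hypothetical.

```python
def approx_frequent_items(values, support):
    """Return candidate items that may occur in more than `support` of `values`.

    Illustrative only, not Spark's code. Every item above the threshold is
    kept, but items below it can be returned too (false positives).
    """
    max_counters = int(1 / support)  # memory bound: at most this many candidates
    counters = {}
    for value in values:
        if value in counters:
            counters[value] += 1
        elif len(counters) < max_counters:
            counters[value] = 1
        else:
            # No free slot: decrement every counter and drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return list(counters)


stream = ["a", "a", "a", "b", "c", "a"]
candidates = approx_frequent_items(stream, support=0.5)
print(candidates)
```

The number of counters is capped by the support value, which is why a higher support returns fewer items and why the surviving items are candidates rather than exact answers.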
Step 1: Create Spark Session
from pyspark.sql import SparkSession
# A single trailing backslash continues the statement on the next line
spark = SparkSession.builder \
    .appName("PySpark freqItems() Example") \
    .getOrCreate()
Step 2: Create Sample DataFrame
data = [
(1, "Aamir Shahzad", "Pakistan"),
(2, "Ali Raza", "Pakistan"),
(3, "Bob", "USA"),
(4, "Lisa", "Canada"),
(5, "Aamir Shahzad", "Pakistan"),
(6, "Ali Raza", "Pakistan"),
(7, "Bob", "USA"),
(8, "Lisa", "Canada"),
(9, "Lisa", "Canada"),
(10, "Aamir Shahzad", "Pakistan")
]
columns = ["ID", "Name", "Country"]
df = spark.createDataFrame(data, columns)
print("Original DataFrame:")
df.show()
+---+--------------+--------+
| ID| Name| Country|
+---+--------------+--------+
| 1| Aamir Shahzad|Pakistan|
| 2| Ali Raza|Pakistan|
| 3| Bob| USA|
| 4| Lisa| Canada|
| 5| Aamir Shahzad|Pakistan|
| 6| Ali Raza|Pakistan|
| 7| Bob| USA|
| 8| Lisa| Canada|
| 9| Lisa| Canada|
| 10| Aamir Shahzad|Pakistan|
+---+--------------+--------+
Step 3: Identify Frequent Items in a Single Column
freq_name = df.freqItems(["Name"])
print("Frequent Items in 'Name' Column:")
freq_name.show(truncate=False)
+------------------------------------+
|Name_freqItems                      |
+------------------------------------+
|[Lisa, Bob, Aamir Shahzad, Ali Raza]|
+------------------------------------+
Step 4: Identify Frequent Items in Multiple Columns
freq_name_country = df.freqItems(["Name", "Country"])
print("Frequent Items in 'Name' and 'Country' Columns:")
freq_name_country.show(truncate=False)
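+------------------------------------+-----------------------+
|Name_freqItems                      |Country_freqItems      |
+------------------------------------+-----------------------+
|[Lisa, Bob, Aamir Shahzad, Ali Raza]|[Pakistan, USA, Canada]|
+------------------------------------+-----------------------+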
Step 5: Adjust Support Threshold (Optional)
By default, the support threshold is 1% (support=0.01). You can increase it to keep only items that occur more often:
freq_with_support = df.freqItems(["Name"], support=0.4)
print("Frequent Items in 'Name' Column with Support = 0.4 (40%):")
freq_with_support.show(truncate=False)
+---------------------+
|Name_freqItems       |
+---------------------+
|[Aamir Shahzad, Lisa]|
+---------------------+
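Because freqItems() is approximate and may return false positives, it is worth checking a result like this against exact counts. The sketch below does that in plain Python for the same ten names; the variable names are illustrative. In this sample no name actually reaches 40% (the two most common names each cover 30% of the rows), so the items returned with support=0.4 should be read as candidates rather than confirmed matches.

```python
from collections import Counter

# Same "Name" values as the sample DataFrame in Step 2.
names = [
    "Aamir Shahzad", "Ali Raza", "Bob", "Lisa", "Aamir Shahzad",
    "Ali Raza", "Bob", "Lisa", "Lisa", "Aamir Shahzad",
]

support = 0.4
counts = Counter(names)
total = len(names)

# Exact share of rows for each name.
shares = {name: count / total for name, count in counts.items()}

# Names whose exact share meets the support threshold.
truly_frequent = [name for name, share in shares.items() if share >= support]

for name, share in shares.items():
    print(f"{name}: {share:.0%}")
print("Meets support:", truly_frequent)
```

On a real DataFrame, the same exact check can be done with df.groupBy("Name").count(), at the cost of a full aggregation.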