How to Use the count() Function in PySpark
The count() function in PySpark returns the number of rows in a DataFrame. In this tutorial, you'll learn how to use count(), distinct().count(), and groupBy().count(), with examples and expected outputs.
1. Import SparkSession and Create a Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("PySparkCountFunction").getOrCreate()
2. Create Sample Data
data = [
("Amir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000),
("Raza", "Marketing", 3500),
("Amir Shahzad", "Engineering", 5000),
("Ali", "Sales", 4000)
]
3. Define the Schema (Column Names)
columns = ["Name", "Department", "Salary"]
4. Create a DataFrame
df = spark.createDataFrame(data, schema=columns)
5. Show the DataFrame
df.show()
Expected Output
+------------+-----------+------+
|        Name| Department|Salary|
+------------+-----------+------+
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
|        Raza|  Marketing|  3500|
|Amir Shahzad|Engineering|  5000|
|         Ali|      Sales|  4000|
+------------+-----------+------+
6. count() - Total Number of Rows (Including Duplicates)
total_rows = df.count()
print("Total number of rows:", total_rows)
Expected Output
Total number of rows: 5
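As a quick sanity check (this snippet is an illustrative addition, not part of the tutorial's Spark code), the same total can be computed in plain Python on the sample list, since count() includes duplicate rows:

```python
# Same sample data as the tutorial's DataFrame.
data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
]

# df.count() counts every row, duplicates included, so it matches len(data).
total_rows = len(data)
print("Total number of rows:", total_rows)  # Total number of rows: 5
```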
7. distinct().count() - Counts Unique Rows
distinct_rows = df.distinct().count()
print("Number of distinct rows:", distinct_rows)
Expected Output
Number of distinct rows: 3
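To see why the distinct count drops from 5 to 3, note that two rows are exact duplicates. A plain-Python set over the same sample list (an illustrative cross-check, not Spark code) gives the same answer:

```python
# Same sample data as the tutorial's DataFrame.
data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
]

# distinct().count() keeps only unique rows; a set of tuples does the same.
distinct_rows = len(set(data))
print("Number of distinct rows:", distinct_rows)  # Number of distinct rows: 3
```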
8. groupBy() + count() - Count Occurrences of Each Name
df.groupBy("Name").count().show()
Expected Output (row order may vary, since groupBy() does not guarantee ordering)
+------------+-----+
|        Name|count|
+------------+-----+
|        Raza|    1|
|         Ali|    2|
|Amir Shahzad|    2|
+------------+-----+
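The per-name counts can also be verified locally with collections.Counter over the same sample list (an illustrative cross-check of the groupBy result, not Spark code):

```python
from collections import Counter

# Same sample data as the tutorial's DataFrame.
data = [
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Amir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
]

# Counter tallies occurrences of each name, mirroring groupBy("Name").count().
name_counts = Counter(name for name, _dept, _salary in data)
for name, count in name_counts.items():
    print(name, count)
```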
Conclusion
In this tutorial, you have learned how to use count() to get the total number of rows in a PySpark DataFrame, distinct().count() to count unique rows, and groupBy().count() to count occurrences within each group.