How to Use the describe() Function in PySpark
The describe() function in PySpark provides summary statistics for the numerical columns of a DataFrame. It returns key metrics — count, mean, standard deviation, min, and max — for each selected column.
Sample Data
data = [
(1, "Alice", 5000, 25),
(2, "Bob", 6000, 30),
(3, "Charlie", 7000, 35),
(4, "David", 8000, 40),
(5, "Eve", 9000, 45),
(6, "Frank", 10000, 50),
(7, "Grace", 11000, 55),
(8, "Hannah", 12000, 60),
(9, "Ian", 13000, 65),
(10, "Jack", 14000, 70)
]
Create DataFrame
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(data, ["id", "name", "salary", "age"])
Show the Full DataFrame
df.show()
Example 1: Basic Usage of describe()
print("Summary statistics for numerical columns:")
df.describe().show()
Example 2: describe() for Specific Columns
print("Summary statistics for 'salary' and 'age':")
df.describe("salary", "age").show()
Sample Output
+-------+------------------+------------------+
|summary| salary| age|
+-------+------------------+------------------+
| count| 10| 10|
| mean| 9500.0| 47.5|
| stddev|3027.6503540974913|15.138251770487457|
| min| 5000| 25|
| max| 14000| 70|
+-------+------------------+------------------+