How to Use the dropna() Function in PySpark
The dropna() function in PySpark removes rows that contain NULL (missing) values from a DataFrame. This is a key step in data preprocessing and cleansing, ensuring your data is complete and ready for analysis.
Basic Syntax
DataFrame.dropna(how='any', thresh=None, subset=None)
- how: 'any' (the default) drops a row if it contains at least one null; 'all' drops a row only if every column is null.
- thresh: Minimum number of non-null values a row must have to be kept. When set, thresh overrides how (see Example 4 below).
- subset: List of column names to check for nulls; all other columns are ignored.
Sample Data
from pyspark.sql import SparkSession
from pyspark.sql import Row
# Initialize SparkSession
spark = SparkSession.builder.appName("DropNAExample").getOrCreate()
# Sample data
data = [
Row(name='John', age=30, salary=None),
Row(name='Alice', age=None, salary=5000),
Row(name='Bob', age=40, salary=6000),
Row(name=None, age=None, salary=None)
]
# Create DataFrame
df = spark.createDataFrame(data)
# Show original DataFrame
df.show()
Expected Output
+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
| null|null|  null|
+-----+----+------+
Example 1: Remove Rows with Any NULL Values
# Remove rows with any NULL values
df_clean_any = df.dropna()
df_clean_any.show()
Expected Output
+----+---+------+
|name|age|salary|
+----+---+------+
| Bob| 40|  6000|
+----+---+------+
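Note that dropna() is also available as df.na.drop(), which accepts the same arguments; the two calls are interchangeable. A minimal equivalent using the df created above:
# df.na.drop() is an alias for df.dropna() and accepts the same arguments
df_clean_alias = df.na.drop(how='any')
df_clean_alias.show()  # same output as df.dropna()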
Example 2: Remove Rows Where All Columns Are NULL
# Remove rows where all columns are NULL
df_clean_all = df.dropna(how='all')
df_clean_all.show()
Expected Output
+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+
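how='all' can also be combined with subset, in which case a row is dropped only when every column in the subset is null. A minimal sketch using the same df (with this sample data the result matches the output above, since only the final row is null in both age and salary):
# Drop rows only when both 'age' and 'salary' are NULL
df_clean_all_subset = df.dropna(how='all', subset=['age', 'salary'])
df_clean_all_subset.show()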
Example 3: Remove Rows with NULLs in a Specific Subset of Columns
# Remove rows with NULLs in 'name' or 'salary'
df_clean_subset = df.dropna(subset=['name', 'salary'])
df_clean_subset.show()
Expected Output
+-----+----+------+
| name| age|salary|
+-----+----+------+
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+
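Example 4: Keep Rows with a Minimum Number of Non-NULL Values
The thresh parameter keeps only rows that have at least the given number of non-null values, overriding how when both are set. A minimal sketch using the same df:
# Keep rows with at least 2 non-null values (thresh overrides 'how')
df_clean_thresh = df.dropna(thresh=2)
df_clean_thresh.show()
Expected Output
+-----+----+------+
| name| age|salary|
+-----+----+------+
| John|  30|  null|
|Alice|null|  5000|
|  Bob|  40|  6000|
+-----+----+------+
Only the fully null row has fewer than 2 non-null values, so with this sample data the result is the same as how='all'.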
Conclusion
The dropna() function in PySpark is a powerful tool for handling missing data efficiently. Its how, thresh, and subset parameters let you remove incomplete rows based on flexible criteria, helping to ensure the quality of your data pipelines.