How to Use sample() in PySpark | Randomly Select Data from DataFrames
In this tutorial, you will learn how to use the sample() function in PySpark to retrieve a random subset of rows from a DataFrame. This is useful for testing, performance tuning, or exploring large datasets without processing every row.
Step 1: Create Sample Data
data = [
("Aamir Shahzad", "Pakistan", 25),
("Ali Raza", "USA", 30),
("Bob", "UK", 45),
("Lisa", "Canada", 35),
("Aamir Shahzad", "Pakistan", 50),
("Aamir Shahzad", "Pakistan", 50)
]
Step 2: Create a DataFrame
from pyspark.sql import SparkSession

# Create (or reuse) a SparkSession; in the pyspark shell and most notebooks
# this object already exists as `spark`
spark = SparkSession.builder.appName("SampleTutorial").getOrCreate()

df = spark.createDataFrame(data, ["Name", "Country", "Age"])
print("Original DataFrame:")
df.show()
Original DataFrame:
+-------------+--------+---+
|         Name| Country|Age|
+-------------+--------+---+
|Aamir Shahzad|Pakistan| 25|
|     Ali Raza|     USA| 30|
|          Bob|      UK| 45|
|         Lisa|  Canada| 35|
|Aamir Shahzad|Pakistan| 50|
|Aamir Shahzad|Pakistan| 50|
+-------------+--------+---+
Step 3: Use sample() to Randomly Select Rows
# Sample roughly 50% of the rows without replacement; the fixed seed makes
# the selection reproducible. Note that fraction is a per-row probability,
# not an exact row count, so the sample size can vary around 50%.
sampled_df = df.sample(withReplacement=False, fraction=0.5, seed=70)
print("Sampled DataFrame (50% of data):")
sampled_df.show()
Output (the rows selected depend on the seed, the data, and its partitioning):
+----+-------+---+
|Name|Country|Age|
+----+-------+---+
| Bob|     UK| 45|
|Lisa| Canada| 35|
+----+-------+---+
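Why did a 50% sample of six rows return only two? Without replacement, sample() keeps each row independently with probability fraction (Bernoulli sampling), so the returned count varies around fraction * total rather than matching it exactly. The pure-Python sketch below illustrates the same idea; the bernoulli_sample helper and its seeding are illustrative assumptions, not PySpark internals:

```python
import random

def bernoulli_sample(rows, fraction, seed):
    # Keep each row independently with probability `fraction`,
    # mirroring the semantics of sample(withReplacement=False, ...)
    rng = random.Random(seed)
    return [row for row in rows if rng.random() < fraction]

rows = list(range(100))
sampled = bernoulli_sample(rows, 0.5, seed=70)
print(len(sampled))  # close to 50, but rarely exactly 50
```

Because each row is an independent coin flip, never rely on sample() to return an exact number of rows; if you need an exact count, sample a slightly larger fraction and apply limit() afterwards.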