PySpark take() Function
Get the First N Rows from a DataFrame Fast
In this tutorial, you'll learn how to use the take() function in PySpark to quickly retrieve the first N rows from a DataFrame. It's a handy method for previewing and debugging data.
Introduction
When working with PySpark DataFrames, it's often useful to retrieve a small sample of rows for inspection or validation. The take(n) method is an action: it runs a job and returns the first n rows to the driver as a Python list of Row objects. Because the result is held in driver memory, keep n small.
PySpark Code Example
from pyspark.sql import SparkSession
# Create SparkSession
spark = SparkSession.builder.appName("PySpark take() Example").getOrCreate()
# Sample data
data = [
    ("Aamir Shahzad", 85, "Math"),
    ("Ali Raza", 78, "Science"),
    ("Bob", 92, "History"),
    ("Lisa", 80, "Math"),
    ("John", 88, "Science")
]
# Define columns
columns = ["Name", "Score", "Subject"]
# Create DataFrame
df = spark.createDataFrame(data, columns)
# Show complete DataFrame
print("Complete DataFrame:")
df.show()
# Example 1: Take the first 2 rows
first_two_rows = df.take(2)
print("First 2 rows using take():")
for row in first_two_rows:
    print(row)
# Example 2: Take more rows than exist (request 10 rows, only 5 in DataFrame)
more_rows = df.take(10)
print("Taking more rows than available (requested 10 rows):")
for row in more_rows:
    print(row)
Expected Output: Complete DataFrame
+-------------+-----+-------+
|         Name|Score|Subject|
+-------------+-----+-------+
|Aamir Shahzad|   85|   Math|
|     Ali Raza|   78|Science|
|          Bob|   92|History|
|         Lisa|   80|   Math|
|         John|   88|Science|
+-------------+-----+-------+
Expected Output: First 2 Rows Using take()
Row(Name='Aamir Shahzad', Score=85, Subject='Math')
Row(Name='Ali Raza', Score=78, Subject='Science')
Expected Output: Taking More Rows Than Exist (Requested 10 Rows)
Row(Name='Aamir Shahzad', Score=85, Subject='Math')
Row(Name='Ali Raza', Score=78, Subject='Science')
Row(Name='Bob', Score=92, Subject='History')
Row(Name='Lisa', Score=80, Subject='Math')
Row(Name='John', Score=88, Subject='Science')