How to Use first(), head(), and tail() Functions in PySpark
Author: Aamir Shahzad
Date: March 2025
Introduction
In PySpark, the functions first(), head(), and tail() are used to retrieve specific rows from a DataFrame. These functions are particularly useful for inspecting data, debugging, and performing quick checks.
Why Use These Functions?
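first() returns the first row of the DataFrame as a single Row object, or None if the DataFrame is empty.
head(n) returns the first n rows of the DataFrame as a list of Row objects. Called without an argument, head() returns a single Row, like first().
tail(n) returns the last n rows of the DataFrame as a list of Row objects. It is available in Spark 3.0 and later.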
Step 1: Create SparkSession
from pyspark.sql import SparkSession
spark = SparkSession.builder \
    .appName("PySparkFirstHeadTailFunctions") \
    .getOrCreate()
Step 2: Create a Sample DataFrame
data = [
    ("Aamir Shahzad", "Engineering", 5000),
    ("Ali", "Sales", 4000),
    ("Raza", "Marketing", 3500),
    ("Bob", "Sales", 4200),
    ("Lisa", "Engineering", 6000)
]
columns = ["Name", "Department", "Salary"]
df = spark.createDataFrame(data, schema=columns)
df.show()
Expected Output
+-------------+-----------+------+
|         Name| Department|Salary|
+-------------+-----------+------+
|Aamir Shahzad|Engineering|  5000|
|          Ali|      Sales|  4000|
|         Raza|  Marketing|  3500|
|          Bob|      Sales|  4200|
|         Lisa|Engineering|  6000|
+-------------+-----------+------+
Step 3: Using head() Function
# Get the first 3 rows using head()
head_rows = df.head(3)
# Print each row
for row in head_rows:
    print(row)
Expected Output
Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Row(Name='Ali', Department='Sales', Salary=4000)
Row(Name='Raza', Department='Marketing', Salary=3500)
Step 4: Using first() Function
# Get the first row
first_row = df.first()
# Print the first row
print(first_row)
Expected Output
Row(Name='Aamir Shahzad', Department='Engineering', Salary=5000)
Step 5: Using tail() Function
# Get the last 2 rows
tail_rows = df.tail(2)
# Print each row
for row in tail_rows:
    print(row)
Expected Output
Row(Name='Bob', Department='Sales', Salary=4200)
Row(Name='Lisa', Department='Engineering', Salary=6000)
Conclusion
PySpark provides several functions to access rows in a DataFrame. first(), head(), and tail() are simple yet powerful tools for data inspection and debugging. Understanding their differences helps in retrieving data more effectively during data processing tasks.