PySpark take() Function | Get First N Rows from DataFrame Fast

In this tutorial, you'll learn how to use the take() function in PySpark to quickly retrieve the first N rows from a DataFrame. It's a handy method for previewing and debugging data.

Introduction

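When working with PySpark DataFrames, it's often useful to pull a small sample of rows for inspection or validation. take(n) returns up to n rows to the driver as a Python list of Row objects, which in effect is the same as limit(n).collect(). Because Spark only has to read enough data to produce n rows, this is usually much cheaper than calling collect() on the whole DataFrame. Keep n small, since the result is held in driver memory, and note that without an explicit orderBy() Spark does not guarantee which rows count as the "first" ones on distributed data.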

PySpark Code Example

from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder.appName("PySpark take() Example").getOrCreate()

# Sample data
data = [
    ("Aamir Shahzad", 85, "Math"),
    ("Ali Raza", 78, "Science"),
    ("Bob", 92, "History"),
    ("Lisa", 80, "Math"),
    ("John", 88, "Science")
]

# Define columns
columns = ["Name", "Score", "Subject"]

# Create DataFrame
df = spark.createDataFrame(data, columns)

# Show complete DataFrame
print("Complete DataFrame:")
df.show()

# Example 1: Take the first 2 rows
first_two_rows = df.take(2)
print("First 2 rows using take():")
for row in first_two_rows:
    print(row)

# Example 2: Take more rows than exist (request 10 rows, only 5 in DataFrame)
more_rows = df.take(10)
print("Taking more rows than available (requested 10 rows):")
for row in more_rows:
    print(row)

Expected Output: Complete DataFrame

+-------------+-----+-------+
|         Name|Score|Subject|
+-------------+-----+-------+
|Aamir Shahzad|   85|   Math|
|     Ali Raza|   78|Science|
|          Bob|   92|History|
|         Lisa|   80|   Math|
|         John|   88|Science|
+-------------+-----+-------+

Expected Output: First 2 Rows Using take()

Row(Name='Aamir Shahzad', Score=85, Subject='Math')
Row(Name='Ali Raza', Score=78, Subject='Science')

Expected Output: Taking More Rows Than Exist (Requested 10 Rows)

Row(Name='Aamir Shahzad', Score=85, Subject='Math')
Row(Name='Ali Raza', Score=78, Subject='Science')
Row(Name='Bob', Score=92, Subject='History')
Row(Name='Lisa', Score=80, Subject='Math')
Row(Name='John', Score=88, Subject='Science')
