PySpark limit() Function Explained with Examples
The limit() function in PySpark returns a new DataFrame containing at most the specified number of rows. It is useful for pulling a small subset out of a large dataset for quick inspection or analysis. Note that limit() is not random sampling, and without an explicit ordering Spark does not guarantee which rows are returned.
Sample Data
data = [
(1, "Alice", 5000),
(2, "Bob", 6000),
(3, "Charlie", 7000),
(4, "David", 8000),
(5, "Eve", 9000),
(6, "Frank", 10000),
(7, "Grace", 11000),
(8, "Hannah", 12000),
(9, "Ian", 13000),
(10, "Jack", 14000)
]
Create a DataFrame
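from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()  # reuses the active session if one already exists
df = spark.createDataFrame(data, ["id", "name", "salary"])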
Show the Full DataFrame
df.show()
Example 1: Get the First 5 Rows
df.limit(5).show()
Example 2: Get the First 3 Rows
df.limit(3).show()
Example 3: Store the Limited DataFrame
df_limited = df.limit(4)  # limit() returns a new DataFrame; df itself is unchanged
df_limited.show()