🔍 Advanced Array Manipulations in PySpark
This tutorial explores advanced array functions in PySpark, including slice(), concat(), element_at(), and sequence(), with real-world DataFrame examples.
📦 Sample DataFrame
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

data = [
    ("Aamir", [1, 2, 3, 4, 5, 6]),
    ("Sara", [7, 8, 9, 10, 11]),
    ("John", [12, 13, 14, 15])
]
df = spark.createDataFrame(data, ["name", "arr"])
df.show()
✅ Output
+-----+------------------+
| name|               arr|
+-----+------------------+
|Aamir|[1, 2, 3, 4, 5, 6]|
| Sara| [7, 8, 9, 10, 11]|
| John|  [12, 13, 14, 15]|
+-----+------------------+
1️⃣ slice()
Definition: Returns a subset of the array, starting at the given 1-based index and containing at most the given number of elements.
from pyspark.sql.functions import slice
df.select("name", "arr", slice("arr", 2, 3).alias("sliced_array")).show()
✅ Output
+-----+------------------+------------+
| name|               arr|sliced_array|
+-----+------------------+------------+
|Aamir|[1, 2, 3, 4, 5, 6]|   [2, 3, 4]|
| Sara| [7, 8, 9, 10, 11]|  [8, 9, 10]|
| John|  [12, 13, 14, 15]|[13, 14, 15]|
+-----+------------------+------------+
2️⃣ concat()
Definition: Concatenates multiple arrays into a single array. Here we append a literal array built with array() and lit().
from pyspark.sql.functions import concat, array, lit
df.select("name", "arr", concat("arr", array(lit(100), lit(200))).alias("concatenated")).show()
✅ Output
+-----+------------------+----------------------------+
| name|               arr|                concatenated|
+-----+------------------+----------------------------+
|Aamir|[1, 2, 3, 4, 5, 6]|[1, 2, 3, 4, 5, 6, 100, 200]|
| Sara| [7, 8, 9, 10, 11]| [7, 8, 9, 10, 11, 100, 200]|
| John|  [12, 13, 14, 15]|  [12, 13, 14, 15, 100, 200]|
+-----+------------------+----------------------------+
3️⃣ element_at()
Definition: Returns the element at the given index in the array. Indexing is 1-based; a negative index counts from the end.
from pyspark.sql.functions import element_at
df.select("name", "arr", element_at("arr", 4).alias("element_at_4")).show()
✅ Output
+-----+------------------+------------+
| name|               arr|element_at_4|
+-----+------------------+------------+
|Aamir|[1, 2, 3, 4, 5, 6]|           4|
| Sara| [7, 8, 9, 10, 11]|          10|
| John|  [12, 13, 14, 15]|          15|
+-----+------------------+------------+
4️⃣ sequence()
Definition: Generates an array of sequential values from start to end, inclusive; an optional third argument sets the step.
from pyspark.sql.functions import sequence
df.select("name", "arr", sequence(lit(1), lit(5)).alias("seq_1_to_5")).show()
✅ Output
+-----+------------------+---------------+
| name|               arr|     seq_1_to_5|
+-----+------------------+---------------+
|Aamir|[1, 2, 3, 4, 5, 6]|[1, 2, 3, 4, 5]|
| Sara| [7, 8, 9, 10, 11]|[1, 2, 3, 4, 5]|
| John|  [12, 13, 14, 15]|[1, 2, 3, 4, 5]|
+-----+------------------+---------------+