Advanced PySpark Array Functions | slice(), concat(), element_at(), sequence()

🔍 Advanced Array Manipulations in PySpark

This tutorial explores advanced array functions in PySpark, including slice(), concat(), element_at(), and sequence(), with real-world DataFrame examples.

📦 Sample DataFrame

from pyspark.sql import SparkSession

# create a SparkSession (skip this if one already exists, e.g. in a notebook)
spark = SparkSession.builder.appName("array-functions").getOrCreate()

data = [
    ("Aamir", [1, 2, 3, 4, 5, 6]),
    ("Sara", [7, 8, 9, 10, 11]),
    ("John", [12, 13, 14, 15])
]
df = spark.createDataFrame(data, ["name", "arr"])
df.show()

✅ Output

+-----+------------------+
| name|               arr|
+-----+------------------+
|Aamir|[1, 2, 3, 4, 5, 6]|
| Sara| [7, 8, 9, 10, 11]|
| John|  [12, 13, 14, 15]|
+-----+------------------+
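
Because the arrays were built from Python ints, Spark infers arr as an array of longs. You can confirm the inferred types with printSchema():

# confirm the inferred schema: name is string, arr is array of long
df.printSchema()

✅ Output

root
 |-- name: string (nullable = true)
 |-- arr: array (nullable = true)
 |    |-- element: long (containsNull = true)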

1️⃣ slice()

Definition: Returns a subset of the array, starting at the given 1-based index and containing up to the given number of elements.

from pyspark.sql.functions import slice
# take 3 elements starting at index 2 (indexing is 1-based)
df.select("name", "arr", slice("arr", 2, 3).alias("sliced_array")).show()

✅ Output

+-----+------------------+------------+
| name|               arr|sliced_array|
+-----+------------------+------------+
|Aamir|[1, 2, 3, 4, 5, 6]|   [2, 3, 4]|
| Sara| [7, 8, 9, 10, 11]|  [8, 9, 10]|
| John|  [12, 13, 14, 15]|[13, 14, 15]|
+-----+------------------+------------+
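
slice() also accepts a negative start index, which counts back from the end of the array. A minimal sketch (the alias last_three is just an illustrative name):

from pyspark.sql.functions import slice
# start = -3 means "third element from the end"; take 3 elements from there
df.select("name", "arr", slice("arr", -3, 3).alias("last_three")).show()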

2️⃣ concat()

Definition: Combines two or more arrays into a single array. All inputs must be arrays of the same element type.

from pyspark.sql.functions import concat, array, lit
# append the literal array [100, 200] to each row's arr
df.select("name", "arr", concat("arr", array(lit(100), lit(200))).alias("concatenated")).show()

✅ Output

+-----+------------------+----------------------------+
| name|               arr|                concatenated|
+-----+------------------+----------------------------+
|Aamir|[1, 2, 3, 4, 5, 6]|[1, 2, 3, 4, 5, 6, 100, 200]|
| Sara| [7, 8, 9, 10, 11]| [7, 8, 9, 10, 11, 100, 200]|
| John|  [12, 13, 14, 15]|  [12, 13, 14, 15, 100, 200]|
+-----+------------------+----------------------------+
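
concat() is not limited to literal arrays; it can also join two array columns. Note that concat() returns NULL if any of its inputs is NULL. A minimal sketch, using a hypothetical helper column named extra:

from pyspark.sql.functions import concat, array, lit
# add a second array column, then concatenate it with arr
df_extra = df.withColumn("extra", array(lit(0)))
df_extra.select("name", concat("arr", "extra").alias("joined")).show()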

3️⃣ element_at()

Definition: Returns the element at the specified 1-based index in the array. A negative index counts back from the end of the array.

from pyspark.sql.functions import element_at
# fetch the 4th element of each array (1-based indexing)
df.select("name", "arr", element_at("arr", 4).alias("element_at_4")).show()

✅ Output

+-----+------------------+------------+
| name|               arr|element_at_4|
+-----+------------------+------------+
|Aamir|[1, 2, 3, 4, 5, 6]|           4|
| Sara| [7, 8, 9, 10, 11]|          10|
| John|  [12, 13, 14, 15]|          15|
+-----+------------------+------------+
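
With a negative index, element_at() counts from the end of the array, so -1 addresses the last element. A quick sketch (the alias last_element is illustrative):

from pyspark.sql.functions import element_at
# -1 returns the last element of each array: 6, 11, 15
df.select("name", "arr", element_at("arr", -1).alias("last_element")).show()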

4️⃣ sequence()

Definition: Generates an array of integers from start to stop, inclusive. An optional step argument controls the increment; if omitted, Spark uses 1 (or -1 when start is greater than stop).

from pyspark.sql.functions import sequence, lit
# generate [1, 2, 3, 4, 5] for every row (both bounds are inclusive)
df.select("name", "arr", sequence(lit(1), lit(5)).alias("seq_1_to_5")).show()

✅ Output

+-----+------------------+---------------+
| name|               arr|     seq_1_to_5|
+-----+------------------+---------------+
|Aamir|[1, 2, 3, 4, 5, 6]|[1, 2, 3, 4, 5]|
| Sara| [7, 8, 9, 10, 11]|[1, 2, 3, 4, 5]|
| John|  [12, 13, 14, 15]|[1, 2, 3, 4, 5]|
+-----+------------------+---------------+
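
Passing an explicit step as the third argument changes the increment, and a negative step produces a descending range. A minimal sketch (the aliases odds and countdown are illustrative):

from pyspark.sql.functions import sequence, lit
# step of 2 yields [1, 3, 5, 7, 9]; a negative step counts down
df.select(
    sequence(lit(1), lit(9), lit(2)).alias("odds"),
    sequence(lit(5), lit(1), lit(-1)).alias("countdown")
).show()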

📺 Watch the Full Tutorial