How to Write DataFrame to JSON File in Azure Blob Storage Using PySpark
This tutorial demonstrates how to use PySpark's DataFrame.write.json()
method to export a DataFrame as JSON files to Azure Blob Storage. You'll also see how to use coalesce(1)
to produce a single output file, and how to secure the connection with a SAS (shared access signature) token.
1️⃣ Step 1: Initialize Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WriteToJSONBlob").getOrCreate()
2️⃣ Step 2: Create a Sample DataFrame
data = [
("Aamir Shahzad", "Engineer", 35),
("Ali Raza", "Data Analyst", 28),
("Bob", "Manager", 40),
("Lisa", "Developer", 25)
]
columns = ["name", "designation", "age"]
df = spark.createDataFrame(data, schema=columns)
df.show()
3️⃣ Step 3: Configure Azure Blob Storage Access
# Replace with your actual values
spark.conf.set("fs.azure.sas.<container>.<storage_account>.blob.core.windows.net", "<sas_token>")
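If you prefer not to use a SAS token, the WASB connector also accepts the storage account key directly. A minimal sketch of that alternative configuration (the placeholder names are assumptions to be replaced with your own values):

```python
# Alternative: authenticate with the storage account key instead of a SAS token.
# Replace <storage_account> and <account_key> with your actual values.
spark.conf.set(
    "fs.azure.account.key.<storage_account>.blob.core.windows.net",
    "<account_key>"
)
```

A SAS token is generally the safer choice, since it can be scoped to one container and set to expire, while the account key grants full access to the whole storage account.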
4️⃣ Step 4: Write DataFrame to JSON
output_path = "wasbs://<container>@<storage_account>.blob.core.windows.net/output-json"
# coalesce(1) merges all partitions into one, so Spark writes a single part file
df.coalesce(1).write \
    .mode("overwrite") \
    .json(output_path)
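Note that .json() writes the output in JSON Lines format: each row becomes one JSON object on its own line inside a part-* file, not a single JSON array. A quick standalone sketch of parsing that format with the standard library (the sample lines below mirror the tutorial's data and are illustrative, not actual Spark output):

```python
import json

# Spark's JSON output is JSON Lines: one object per row, one row per line.
# These sample lines mirror the tutorial's DataFrame (illustrative content).
part_file_contents = (
    '{"name":"Aamir Shahzad","designation":"Engineer","age":35}\n'
    '{"name":"Ali Raza","designation":"Data Analyst","age":28}'
)

rows = [json.loads(line) for line in part_file_contents.splitlines()]
print(rows[0]["name"])  # Aamir Shahzad
```

This is also why spark.read.json() in the next step can read the folder back directly: it parses each line of every part file as one row.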
print("✅ Data written to Azure Blob Storage in JSON format.")
5️⃣ Step 5: Read Data Back from JSON
df_read = spark.read.json(output_path)
df_read.show()
print("✅ Data read back from Azure Blob Storage successfully.")