How to Write a Single CSV File to Azure Blob Storage from a PySpark DataFrame
By default, PySpark writes a DataFrame to CSV as multiple part files, one per partition. In this tutorial, you'll learn how to use coalesce(1)
to collapse the DataFrame into a single partition and save it as one CSV file in Azure Blob Storage, including how to set the header and other write options.
1️⃣ Step 1: Create Spark Session
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("WriteSingleCSV").getOrCreate()
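Note: in a Databricks notebook, a SparkSession named spark is already available, so this step is optional there.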
2️⃣ Step 2: Create Sample DataFrame
data = [
    ("Aamir Shahzad", "Lahore", "Pakistan"),
    ("Ali Raza", "Karachi", "Pakistan"),
    ("Bob", "New York", "USA"),
    ("Lisa", "Toronto", "Canada")
]
columns = ["full_name", "city", "country"]
df = spark.createDataFrame(data, schema=columns)
df.show()
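The output should look something like this (row order may vary):
+-------------+--------+--------+
|    full_name|    city| country|
+-------------+--------+--------+
|Aamir Shahzad|  Lahore|Pakistan|
|     Ali Raza| Karachi|Pakistan|
|          Bob|New York|     USA|
|         Lisa| Toronto|  Canada|
+-------------+--------+--------+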
3️⃣ Step 3: Configure Azure Blob Storage Access
spark.conf.set("fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net", "<your-access-key>")
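Outside Databricks, spark.conf.set may not propagate to the underlying Hadoop configuration. A common alternative is to set the account key on the Hadoop configuration directly, as in the sketch below (this assumes the hadoop-azure and azure-storage jars are on the classpath, and uses Spark's internal _jsc handle):
# Plain Spark alternative: set the account key on the Hadoop configuration
spark.sparkContext._jsc.hadoopConfiguration().set(
    "fs.azure.account.key.<your-storage-account-name>.blob.core.windows.net",
    "<your-access-key>"
)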
4️⃣ Step 4: Write to Single CSV File
output_path = "wasbs://<your-container>@<your-storage-account>.blob.core.windows.net/people_data"
df.coalesce(1).write \
    .option("header", "true") \
    .mode("overwrite") \
    .csv(output_path)
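Keep in mind that even with coalesce(1), Spark writes a directory at output_path containing a single part-*.csv file (plus marker files such as _SUCCESS), not a file literally named people_data.csv. If you need one cleanly named file, a sketch on Databricks (using dbutils; the target name is just an example) is to copy the part file out:
# Find the single part file inside the output directory and copy it to a clean name
part_file = [f.path for f in dbutils.fs.ls(output_path) if f.name.startswith("part-")][0]
dbutils.fs.cp(part_file, output_path + ".csv")  # e.g. .../people_data.csv
# Optionally remove the original directory afterwards:
# dbutils.fs.rm(output_path, recurse=True)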
5️⃣ Step 5: Confirm the CSV File in Blob Storage
# On Databricks, you can list the output files with dbutils:
files = dbutils.fs.ls(output_path)
for f in files:
    print(f.name)
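Outside Databricks, dbutils is not available. A rough alternative, sketched below using Spark's internal JVM gateway and the Hadoop FileSystem API, lists the same directory:
# List the output directory via the Hadoop FileSystem API (no dbutils required)
jvm = spark._jvm
hadoop_conf = spark._jsc.hadoopConfiguration()
path = jvm.org.apache.hadoop.fs.Path(output_path)
fs = path.getFileSystem(hadoop_conf)
for status in fs.listStatus(path):
    print(status.getPath().getName())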