How to Read Multiple CSV Files from Blob Storage and Write to Azure SQL Table with Filenames
In this PySpark tutorial, you’ll learn how to load multiple CSV files from Azure Blob Storage, tag each row with the name of the file it came from, and write the combined data to an Azure SQL table so you can track which file every record originated in.
1️⃣ Step 1: Set Up the Spark Session
from pyspark.sql import SparkSession

# Create (or reuse) the Spark session that runs this job
spark = SparkSession.builder \
    .appName("ReadCSV_WriteSQL_WithFilenames") \
    .getOrCreate()
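Note: reading wasbs:// paths requires the Hadoop Azure connector on the classpath. Managed platforms such as Databricks bundle it already; on a plain Spark install, one way to pull it in is via spark.jars.packages. A minimal sketch — the artifact version below is an assumption, so match it to your cluster's Hadoop version:

# Hypothetical version — pick the hadoop-azure release matching your Hadoop build
spark = SparkSession.builder \
    .appName("ReadCSV_WriteSQL_WithFilenames") \
    .config("spark.jars.packages", "org.apache.hadoop:hadoop-azure:3.3.4") \
    .getOrCreate()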
2️⃣ Step 2: Configure Access to Azure Blob Storage
# Authenticate to the container with a SAS token
spark.conf.set(
    "fs.azure.sas.<container>.<storage_account>.blob.core.windows.net",
    "<your_sas_token>"
)

# Glob that matches every CSV file in the input folder
path = "wasbs://<container>@<storage_account>.blob.core.windows.net/input-folder/*.csv"
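If your files don't all match a single glob, spark.read.csv also accepts an explicit list of paths. A quick sketch with hypothetical filenames:

# Hypothetical explicit file list — csv() accepts a list of paths as well as a glob
paths = [
    "wasbs://<container>@<storage_account>.blob.core.windows.net/input-folder/sales_jan.csv",
    "wasbs://<container>@<storage_account>.blob.core.windows.net/input-folder/sales_feb.csv",
]
# df = spark.read.option("header", True).csv(paths)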
3️⃣ Step 3: Read All CSV Files and Add Filename Column
from pyspark.sql.functions import input_file_name, regexp_extract

# Read every matching CSV, treating the first row of each file as the header
df = spark.read.option("header", True).csv(path)
# input_file_name() returns the full blob URI; keep only the part after the last "/"
df = df.withColumn("filename", regexp_extract(input_file_name(), r"([^/]+$)", 1))
df.show()
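With the filename column in place, you can sanity-check how many rows each source file contributed before writing anything:

# Row count per source file — handy for spotting empty or truncated inputs
df.groupBy("filename").count().orderBy("filename").show(truncate=False)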
4️⃣ Step 4: JDBC Configuration for Azure SQL
jdbcUrl = "jdbc:sqlserver://<your_server>.database.windows.net:1433;database=<your_db>;encrypt=true;trustServerCertificate=false;hostNameInCertificate=*.database.windows.net;loginTimeout=30"

# Credentials and driver class for the Microsoft JDBC driver
connectionProperties = {
    "user": "<your_username>",
    "password": "<your_password>",
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}
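Hard-coding credentials is fine for a demo, but in a real job you'd pull them from a secret store. If you're running on Databricks, for example, dbutils.secrets can supply them — a sketch assuming a hypothetical secret scope and key names:

# Hypothetical secret scope/keys — replace with your own vault-backed scope
connectionProperties = {
    "user": dbutils.secrets.get(scope="azure-sql", key="sql-user"),
    "password": dbutils.secrets.get(scope="azure-sql", key="sql-password"),
    "driver": "com.microsoft.sqlserver.jdbc.SQLServerDriver"
}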
5️⃣ Step 5: Write DataFrame to Azure SQL Table
# Append the combined rows (including the filename column) to the target table
df.write \
    .mode("append") \
    .jdbc(url=jdbcUrl, table="your_table_name", properties=connectionProperties)
print("✅ All CSV files uploaded to Azure SQL with filename column.")