Read CSV, Parquet & JSON Files Using Apache Spark Pool in Azure Synapse
📘 Overview
Apache Spark Pools in Azure Synapse Analytics allow you to read and process data from various file formats such as CSV, Parquet, and JSON stored in Azure Data Lake. This is especially useful for data exploration, transformation, and analytics workflows.
🛠️ Prerequisites
- Spark Pool attached to your Synapse Workspace
- Storage account with access permissions configured
- Notebook interface using the %%pyspark magic or your language of choice
✅ Example 1: Read a CSV File
%%pyspark
df_csv = spark.read.option("header", "true") \
.option("inferSchema", "true") \
.csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/customers.csv")
df_csv.show()
✅ Example 2: Read a Parquet File
%%pyspark
df_parquet = spark.read.parquet("abfss://data@yourstorageaccount.dfs.core.windows.net/input/sales.parquet")
df_parquet.printSchema()
df_parquet.show()
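Because Parquet is a columnar format, reading only the columns you need is cheap. The snippet below is a minimal sketch of column pruning and filter pushdown; the column names region and amount are assumptions for illustration and will differ in your own sales.parquet.
%%pyspark
# Minimal sketch: column pruning and filtering on the Parquet source.
# The column names "region" and "amount" are assumptions for illustration.
from pyspark.sql import functions as F

df_sales = spark.read.parquet(
    "abfss://data@yourstorageaccount.dfs.core.windows.net/input/sales.parquet"
)

# Parquet is columnar, so Spark reads only the selected columns,
# and the filter can be pushed down to the file scan.
df_subset = df_sales.select("region", "amount").where(F.col("amount") > 0)
df_subset.show()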
✅ Example 3: Read a JSON File
%%pyspark
df_json = spark.read.option("multiline", "true") \
.json("abfss://data@yourstorageaccount.dfs.core.windows.net/input/products.json")
df_json.display()
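When multiline JSON contains nested objects or arrays, you can reach into them with dot notation and explode(). This is a minimal sketch only; the fields details.category and tags are hypothetical and may not exist in your products.json.
%%pyspark
# Minimal sketch for nested JSON: "details.category" and "tags" are
# hypothetical fields, not guaranteed to exist in products.json.
from pyspark.sql.functions import col, explode

df_nested = df_json.select(
    col("details.category").alias("category"),   # nested object field
    explode(col("tags")).alias("tag")             # flatten an array column
)
df_nested.show()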
🔍 Explanation
option("header", "true")
: Treats first row as header (CSV)inferSchema
: Automatically detects column types (CSV/JSON)multiline
: Important when dealing with nested or pretty-printed JSON
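If inferSchema guesses the wrong types, or you want to avoid the extra pass over the data it requires, you can supply a schema explicitly. A minimal sketch, assuming hypothetical customer columns:
%%pyspark
# Sketch: supply the schema up front instead of using inferSchema.
# The column names and types below are assumptions for illustration.
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

customer_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
])

df_csv_typed = spark.read.option("header", "true") \
    .schema(customer_schema) \
    .csv("abfss://data@yourstorageaccount.dfs.core.windows.net/input/customers.csv")
df_csv_typed.printSchema()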
📦 File Path Format
Use the abfss:// protocol to point to Azure Data Lake Storage Gen2:
abfss://<container>@<account>.dfs.core.windows.net/<path>/filename
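In a Synapse notebook you can build this path from variables and confirm it is reachable before reading. A minimal sketch; the account, container, and folder names are placeholders, and mssparkutils is the Synapse notebook utility library.
%%pyspark
# Sketch: build the abfss path from placeholders and list the folder to
# confirm the path and permissions before reading. Names are placeholders.
from notebookutils import mssparkutils

account = "yourstorageaccount"
container = "data"
folder = "input"
base_path = f"abfss://{container}@{account}.dfs.core.windows.net/{folder}"

# fs.ls fails fast if the path is wrong or access is not configured.
for item in mssparkutils.fs.ls(base_path):
    print(item.name, item.size)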
📌 Tips
- For large files, prefer Parquet for performance and compression
- Validate file path and permissions to avoid access errors
- Use display(df) or df.show() to preview data
- Always check and handle nulls or unexpected formats (see the sketch after this list)
🎯 Use Cases
- Data ingestion and transformation in Spark
- Reading raw data from staging zones
- Feeding analytics and ML pipelines
📚 Credit: Content created with the help of ChatGPT and Gemini.