Spark Table Formats in Azure Synapse: Parquet, CSV, Delta, JSON, Avro & ORC Explained | Azure Synapse Analytics Tutorial

📘 Overview

Apache Spark in Azure Synapse can store tables in several file formats, each suited to a different priority such as query performance, compatibility with other tools, or schema evolution. This post shows how to create Spark tables in Parquet, Delta, CSV, JSON, Avro, and ORC, and compares the six formats.

📂 Common File Formats and Their Use Cases

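  • Parquet – Columnar storage; best for analytical queries that read a subset of columns
  • Delta – Parquet data files plus a transaction log; ACID compliant, supports updates and deletes
  • CSV – Plain-text rows; simple format for interoperability
  • JSON – Text format; best for semi-structured data
  • Avro – Row-based binary format; good for schema evolution
  • ORC – Columnar storage with built-in statistics; optimized for large batch workloads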

🛠️ Creating Tables in Different Formats
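
Every statement in this section creates its table inside a database named lake. As a minimal sketch, assuming that database has not yet been created in your Spark pool, it can be set up once before running the examples:

```sql
-- Assumption: the lake database may not exist yet; IF NOT EXISTS makes this safe to rerun
CREATE DATABASE IF NOT EXISTS lake
```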

✅ 1. Parquet Table

%%sql
-- External table: the data files live at LOCATION and remain there if the table is dropped
CREATE TABLE lake.parquet_table (
  id INT, name STRING
)
USING PARQUET
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/parquet'
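
To check that the Parquet table works end to end, a couple of rows can be written and read back. The sample values are made up for illustration, and the sketch assumes the Spark pool has write access to the ABFSS path. Run the statements one per cell if your notebook does not accept several statements in one cell.

```sql
-- Write two illustrative rows; Spark stores them as Parquet files under LOCATION
INSERT INTO lake.parquet_table VALUES (1, 'Alice'), (2, 'Bob');

-- Only the columns referenced in the query are read from the Parquet files
SELECT name FROM lake.parquet_table;
```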

✅ 2. Delta Table

%%sql
-- Delta stores Parquet data files plus a _delta_log folder at LOCATION
CREATE TABLE lake.delta_table (
  id INT, product STRING
)
USING DELTA
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/delta'
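
Because a Delta table keeps a transaction log, rows can be changed in place with ordinary SQL, which is the main reason to pick it over plain Parquet. A minimal sketch with invented sample values (run the statements one per cell if your notebook does not accept several statements in one cell):

```sql
-- Illustrative rows (values are invented for this example)
INSERT INTO lake.delta_table VALUES (1, 'Laptop'), (2, 'Phone'), (3, 'Tablet');

-- Each statement below commits as one atomic transaction
UPDATE lake.delta_table SET product = 'Smartphone' WHERE id = 2;
DELETE FROM lake.delta_table WHERE id = 3;

-- List the table versions recorded in the Delta transaction log
DESCRIBE HISTORY lake.delta_table;
```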

✅ 3. CSV Table

%%sql
-- 'header'='true' treats the first line of each file as column names, not data
CREATE TABLE lake.csv_table (
  id INT, region STRING
)
USING CSV
OPTIONS ('header'='true')
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/csv'
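
CSV files carry no type information, so the column list in the statement is what gives id and region their types, and any other reader settings go in the same OPTIONS clause. The sketch below assumes a hypothetical semicolon-delimited export; the table name, folder, and option values are illustrative, while header, sep, and nullValue are standard Spark CSV options.

```sql
-- Hypothetical table over semicolon-delimited files; name and path are assumptions
CREATE TABLE lake.csv_semicolon_table (
  id INT, region STRING
)
USING CSV
OPTIONS (
  'header' = 'true',   -- first line of each file holds column names
  'sep' = ';',         -- field delimiter (the default is a comma)
  'nullValue' = 'NA'   -- read the literal string NA as NULL
)
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/csv_semicolon'
```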

✅ 4. JSON Table

%%sql
-- No column list: Spark infers the schema from the JSON files already at LOCATION
CREATE TABLE lake.json_table
USING JSON
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/json'

✅ 5. Avro Table

%%sql
-- No column list: the schema is read from the Avro files already at LOCATION
CREATE TABLE lake.avro_table
USING AVRO
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/avro'

✅ 6. ORC Table

%%sql
-- No column list: the schema is read from the ORC files already at LOCATION
CREATE TABLE lake.orc_table
USING ORC
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/orc'
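
The JSON, Avro, and ORC statements declare no columns, so Spark has to derive the schema from files that already exist at each LOCATION. A quick way to see the result, sketched here for the JSON table:

```sql
-- Show the column names and types Spark derived from the JSON files
DESCRIBE TABLE lake.json_table
```

The same statement works for lake.avro_table and lake.orc_table. If a location is empty when the table is created, Spark has nothing to infer from; declaring the columns explicitly, as in the Parquet and CSV examples, avoids that.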

📊 Format Comparison Table

Format  | Type                        | Supports Schema Evolution | ACID Transactions | Best Use Case
Parquet | Columnar                    | Partial                   | No                | Read-heavy analytics
Delta   | Columnar + Transaction Logs | Yes                       | Yes               | ETL with updates/deletes
CSV     | Row                         | No                        | No                | Simple data exchange
JSON    | Row                         | No                        | No                | Log data or API responses
Avro    | Row                         | Yes                       | No                | Data pipelines & evolution
ORC     | Columnar                    | Yes                       | No                | Big data batch processing
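
As one concrete case of the schema evolution column, a Delta table can gain a column without rewriting the data already stored. A minimal sketch, where the price column is a hypothetical addition to the delta_table created earlier:

```sql
-- Add a new nullable column; rows written before the change read back NULL for price
ALTER TABLE lake.delta_table ADD COLUMNS (price DOUBLE)
```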

📌 Best Practices

  • Use Delta when the workload needs ACID transactions, updates, or deletes
  • Use Parquet or ORC for read-heavy analytical queries
  • Use CSV only when human readability or legacy system compatibility is needed
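
One way to combine these recommendations is to accept CSV from a source system and then copy it into Delta for downstream work. A sketch using CREATE TABLE ... AS SELECT; the target table name and the formats/csv_as_delta folder are assumptions for illustration:

```sql
-- Copy the CSV-backed table into a new Delta table (name and path are hypothetical)
CREATE TABLE lake.csv_as_delta
USING DELTA
LOCATION 'abfss://data@account.dfs.core.windows.net/formats/csv_as_delta'
AS SELECT id, region FROM lake.csv_table
```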

📚 Credit: Content created with the help of ChatGPT and Gemini.