Join CSV and Parquet Files in PySpark | Filter, Aggregate & Save to CSV Using Azure Synapse Notebook
📘 Overview
In Azure Synapse Analytics, PySpark notebooks offer a great way to process and combine different file formats like CSV and Parquet. This tutorial demonstrates how to:
- Read a CSV file and a Parquet file
- Join them using a common key
- Apply filtering and aggregation
- Write the result as a single CSV file
🗂️ Sample Scenario
- CSV File: Customer details
- Parquet File: Order transactions
- Goal: Get the total order amount per country by joining customers and orders
📥 Step 1: Read CSV and Parquet Files
%%pyspark
# Read the customer CSV with a header row (columns are read as strings unless a schema is supplied)
customers = spark.read.option("header", "true").csv("abfss://data@youraccount.dfs.core.windows.net/customer.csv")
# Read the order transactions from Parquet (the schema is stored in the file format itself)
orders = spark.read.parquet("abfss://data@youraccount.dfs.core.windows.net/orders.parquet")
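If column types matter downstream, you can optionally pass an explicit schema to the CSV reader instead of treating every column as a string. This is only a sketch; the column names below are assumptions based on the sample scenario.
from pyspark.sql.types import StructType, StructField, StringType

# Hypothetical customer schema; adjust the fields to match your actual file
customer_schema = StructType([
    StructField("customer_id", StringType(), True),
    StructField("name", StringType(), True),
    StructField("country", StringType(), True),
])
customers = spark.read.option("header", "true").schema(customer_schema).csv("abfss://data@youraccount.dfs.core.windows.net/customer.csv")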
🔗 Step 2: Join Datasets
# Inner join customers and orders on the shared customer_id key
joined_df = customers.join(orders, customers.customer_id == orders.customer_id)
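Joining with the column name instead of an expression keeps only one customer_id column in the result, which avoids ambiguous-column errors later. A minimal alternative, assuming both DataFrames use the same key name:
# Equi-join on the shared key; the output contains customer_id only once
joined_df = customers.join(orders, "customer_id", "inner")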
🔍 Step 3: Filter and Aggregate
# Sum order_amount per country, then rename the generated aggregate column
result = joined_df.groupBy("country").agg({"order_amount": "sum"}).withColumnRenamed("sum(order_amount)", "total_sales")
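The line above aggregates without an explicit filter; if you only want certain orders included, a filter can be applied before the groupBy. The status column and its value here are hypothetical, and functions.sum with alias is an equivalent, slightly more readable aggregation:
from pyspark.sql import functions as F

# Hypothetical filter on an assumed "status" column, then sum per country
result = (joined_df
          .filter(F.col("status") == "COMPLETED")
          .groupBy("country")
          .agg(F.sum("order_amount").alias("total_sales")))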
💾 Step 4: Write to CSV
# Collapse to one partition so a single CSV part file is produced, then overwrite the target folder
result.repartition(1).write.mode("overwrite").option("header", "true") \
    .csv("abfss://data@youraccount.dfs.core.windows.net/output/sales_by_country")
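Note that Spark writes the output as a folder containing a single part-*.csv file rather than one named file. A quick sanity check is to read the folder back:
# Read the written folder back to verify headers and values
check = spark.read.option("header", "true").csv("abfss://data@youraccount.dfs.core.windows.net/output/sales_by_country")
check.show()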
📌 Notes
- Use repartition(1) to save a single CSV file
- Always validate column names and types after reading
- Optimize joins by caching or broadcasting the smaller table if needed for large datasets (see the sketch below)
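A quick way to validate columns and types is printSchema(), and when one side of the join is small, a broadcast hint avoids shuffling the larger table. This is a sketch under the assumption that the customer table is the smaller one:
from pyspark.sql.functions import broadcast

# Inspect column names and inferred types after reading
customers.printSchema()
orders.printSchema()

# Broadcast the (assumed) smaller customer table so the large orders table is not shuffled
joined_df = orders.join(broadcast(customers), "customer_id")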
✅ Use Cases
- Customer segmentation and sales performance
- Combining multiple formats in data lakes
- ETL processing pipelines in Synapse Spark Pools
📺 Watch the Video Tutorial
📚 Credit: Content created with the help of ChatGPT and Gemini.