Join CSV & Parquet Files in PySpark | Filter, Aggregate & Save to CSV in Synapse Notebook | Azure Synapse Analytics Tutorial

📘 Overview

In Azure Synapse Analytics, PySpark notebooks offer a great way to process and combine different file formats like CSV and Parquet. This tutorial demonstrates how to:

  • Read a CSV file and a Parquet file
  • Join them using a common key
  • Apply filtering and aggregation
  • Write the result as a single CSV file

🗂️ Sample Scenario

  • CSV File: Customer details
  • Parquet File: Order transactions
  • Goal: Compute the total order amount per country by joining customers and orders

📥 Step 1: Read CSV and Parquet Files

%%pyspark
# Read the customer CSV (the header row provides the column names)
customers = spark.read.option("header", "true").csv("abfss://data@youraccount.dfs.core.windows.net/customer.csv")

# Read the order transactions stored as Parquet (the schema is embedded in the file)
orders = spark.read.parquet("abfss://data@youraccount.dfs.core.windows.net/orders.parquet")
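As the Notes section below also recommends, it is worth confirming that the inferred schema matches expectations before joining. The sketch below prints both schemas and, optionally, reads the CSV with an explicit schema; the exact columns and types (including customer_name) are assumed for illustration, since the tutorial does not list them.

# Confirm column names and inferred types before joining
customers.printSchema()
orders.printSchema()

# Optional: read the CSV with an explicit schema instead of relying on inference
# (column names and types below are assumed examples)
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

customer_schema = StructType([
    StructField("customer_id", IntegerType(), True),
    StructField("customer_name", StringType(), True),
    StructField("country", StringType(), True),
])

customers = spark.read.option("header", "true").schema(customer_schema) \
    .csv("abfss://data@youraccount.dfs.core.windows.net/customer.csv")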

🔗 Step 2: Join Datasets

# Inner join on the shared customer_id key
joined_df = customers.join(orders, customers.customer_id == orders.customer_id)
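Joining on an expression like this keeps both customer_id columns in the result. If you prefer a single key column, joining on the column name is a common alternative (a minimal sketch, assuming both DataFrames use the same column name):

# Joining on the column name keeps only one customer_id column in the output
joined_df = customers.join(orders, on="customer_id", how="inner")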

🔍 Step 3: Filter and Aggregate

# Sum order_amount per country and rename the generated aggregate column
result = joined_df.groupBy("country").agg({"order_amount": "sum"}).withColumnRenamed("sum(order_amount)", "total_sales")
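Since this step also covers filtering, here is a hedged sketch of applying a filter before the aggregation; the condition (order_amount > 0) is an assumed example rather than something prescribed by the sample data:

from pyspark.sql import functions as F

# Assumed filter: drop zero or negative order amounts before aggregating
filtered_df = joined_df.filter(F.col("order_amount") > 0)

result = filtered_df.groupBy("country") \
    .agg(F.sum("order_amount").alias("total_sales"))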

💾 Step 4: Write to CSV

# repartition(1) collapses the result into one partition so only one part file is written
result.repartition(1).write.mode("overwrite").option("header", "true") \
    .csv("abfss://data@youraccount.dfs.core.windows.net/output/sales_by_country")

📌 Notes

  • Use repartition(1) (or coalesce(1)) so the output folder contains a single CSV part file; Spark still writes a directory, not a standalone file
  • Always validate column names and types after reading
  • For large datasets, optimize joins by caching a reused DataFrame or broadcasting the smaller side, as shown in the sketch after this list
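
A rough sketch of the last point: caching helps when a DataFrame is reused across several actions, and broadcasting the smaller side can avoid a shuffle during the join. Whether either helps depends on your data sizes, so treat this as an illustration rather than a prescription.

from pyspark.sql.functions import broadcast

# Cache the larger DataFrame if it is reused in later steps
orders.cache()

# If the customer table is small, broadcast it so the join avoids a shuffle
joined_df = orders.join(broadcast(customers), on="customer_id", how="inner")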

✅ Use Cases

  • Customer segmentation and sales performance
  • Combining multiple formats in data lakes
  • ETL processing pipelines in Synapse Spark Pools

📺 Watch the Video Tutorial

📚 Credit: Content created with the help of ChatGPT and Gemini.