How to Create a Notebook That Joins Two CSV Files and Writes to a Lake Database in Synapse Pipelines
📘 Overview
In this tutorial, we’ll show you how to create a PySpark notebook in Azure Synapse Analytics that reads two CSV files, performs a join operation, and writes the result to a Lake Database table. This notebook can then be integrated into a Synapse pipeline for automated data workflows.
🧱 Prerequisites
- Azure Synapse workspace with Spark Pool
- Lake Database created (or Spark database)
- CSV files stored in ADLS Gen2
🛠️ Step-by-Step Instructions
✅ Step 1: Read CSV Files
```python
%%pyspark
# Read the two CSV files from ADLS Gen2; the header option uses the first row as column names
df_customers = spark.read.option("header", True).csv("abfss://data@yourstorage.dfs.core.windows.net/customers.csv")
df_orders = spark.read.option("header", True).csv("abfss://data@yourstorage.dfs.core.windows.net/orders.csv")
```
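Reading with only the header option leaves every column typed as a string. If you want typed columns, you can either enable `inferSchema` or supply an explicit schema. Here is a minimal sketch for the orders file; the field names and types are assumptions for illustration and should be adjusted to match your actual CSV:

```python
%%pyspark
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

# Hypothetical schema for orders.csv -- adjust field names and types to match your file
orders_schema = StructType([
    StructField("order_id", IntegerType(), False),
    StructField("customer_id", IntegerType(), False),
    StructField("order_total", DoubleType(), True),
])

df_orders = (
    spark.read
    .option("header", True)
    .schema(orders_schema)  # an explicit schema avoids the extra scan that inferSchema performs
    .csv("abfss://data@yourstorage.dfs.core.windows.net/orders.csv")
)
```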
✅ Step 2: Join the Two DataFrames
```python
# Joining on the column name keeps a single customer_id column in the result,
# which avoids a duplicate-column error when the table is written in Step 3
df_joined = df_customers.join(df_orders, on="customer_id", how="inner")
```
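If the two files keep the key under different names (say `cust_id` on the orders side, an assumed name used purely for illustration), you can join on an explicit condition and drop the redundant column so the output still carries a single key:

```python
%%pyspark
# Hypothetical variant: the orders file names the key "cust_id" instead of "customer_id"
df_joined = (
    df_customers
    .join(df_orders, df_customers.customer_id == df_orders.cust_id, "inner")
    .drop(df_orders.cust_id)  # keep a single copy of the join key in the output
)
```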
✅ Step 3: Write the Result to a Lake Database Table
```python
# Overwrite replaces any existing table contents; switch to "append" for incremental loads
df_joined.write \
    .format("delta") \
    .mode("overwrite") \
    .saveAsTable("LakeDB.joined_customer_orders")
```
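Note that `saveAsTable` fails if the target database does not exist. If you have not already created `LakeDB` (for example through the Lake Database designer), a one-line sketch to create it as a Spark database first, assuming the database name used above:

```python
%%pyspark
# Create the target Spark database if it does not exist yet; saveAsTable fails otherwise
spark.sql("CREATE DATABASE IF NOT EXISTS LakeDB")
```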
📂 Result
The output Delta table will be created inside your Lake Database and will be queryable via Spark and Serverless SQL Pools.
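Before wiring the notebook into a pipeline, you can sanity-check the result from the same notebook. A quick example, using the table name from Step 3:

```python
%%pyspark
# Quick sanity check: confirm the table is queryable and inspect a few joined rows
spark.sql("SELECT * FROM LakeDB.joined_customer_orders LIMIT 10").show()
```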
🔁 Integrate the Notebook in a Synapse Pipeline
- Go to Synapse Studio → Integrate → Pipeline
- Add a Notebook activity
- Select the notebook you just created
- Configure the Spark pool and parameters, if any (see the parameters sketch after this list)
- Publish and trigger the pipeline
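To make the notebook parameterizable from the pipeline, Synapse lets you mark one cell as the Parameters cell; the Notebook activity's base parameters then override the defaults defined there. A minimal sketch, where the variable names and paths are illustrative assumptions:

```python
%%pyspark
# Parameters cell -- mark this cell as the Parameters cell in the notebook so the
# pipeline's Notebook activity can override these illustrative defaults via base parameters
customers_path = "abfss://data@yourstorage.dfs.core.windows.net/customers.csv"
orders_path = "abfss://data@yourstorage.dfs.core.windows.net/orders.csv"
target_table = "LakeDB.joined_customer_orders"
```

Downstream cells can then read from `customers_path` and `orders_path` and write to `target_table` instead of using hard-coded strings.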
📌 Tips
- Validate the file schema before joining (see the sketch after this list)
- Use `display()` to preview data inside the notebook
- Use `overwrite` mode for testing and `append` for incremental writes
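As a minimal example of the first tip, a small check that both DataFrames actually contain the join key before attempting the join (the variable and column names match the steps above):

```python
%%pyspark
# Minimal pre-join check: make sure the join key exists in both DataFrames
required_column = "customer_id"
for name, df in [("customers", df_customers), ("orders", df_orders)]:
    if required_column not in df.columns:
        raise ValueError(f"Column '{required_column}' is missing from the {name} CSV")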
🎯 Use Cases
- Merge transactional and customer data
- Create curated data layers in Lakehouse
- Automate data ingestion with Synapse Pipelines
📚 Credit: Content created with the help of ChatGPT and Gemini.