How to Read a JSON File into a DataFrame from Azure Blob Storage
In this PySpark tutorial, you'll learn how to read a JSON file stored in Azure Blob Storage and load it into a PySpark DataFrame.
Step 1: Prerequisites
- Azure Storage Account
- Container with a JSON file
- Access key or SAS token
- PySpark environment (Databricks or local setup)
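The steps below assume a JSON file whose records match the output shown in Step 4. A minimal sketch of creating such a file locally for testing (the file name employees.json is an assumption; note that Spark's JSON reader expects JSON Lines format by default, one object per line):

```python
import json

# Sample records matching the data shown in Step 4 of this tutorial.
records = [
    {"Name": "Aamir Shahzad", "Nationality": "Pakistan", "Age": 25},
    {"Name": "Ali Raza", "Nationality": "USA", "Age": 30},
    {"Name": "Bob", "Nationality": "UK", "Age": 45},
    {"Name": "Lisa", "Nationality": "Canada", "Age": 35},
]

# Write as JSON Lines: one complete JSON object per line.
with open("employees.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Uploading this file to your container (for example with Azure Storage Explorer or the az CLI) gives you data to follow along with.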
Step 2: Configure Spark to Access Azure Blob Storage
Configure the Spark session to authenticate to your Azure Blob Storage account. Note that the fs.azure.account.key.* property below takes the storage account access key; a SAS token is configured through a separate fs.azure.sas.* property instead.
# Authenticate with the storage account access key
spark.conf.set(
    "fs.azure.account.key.<storage_account_name>.blob.core.windows.net",
    "<access_key>"
)
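If you authenticate with a SAS token instead of the account key, the property name is different: the token is scoped to a container under fs.azure.sas.*. A configuration sketch using the same placeholder convention (not runnable as-is):

```python
# SAS token authentication: note the container-scoped property name,
# which differs from the account-key property shown above.
spark.conf.set(
    "fs.azure.sas.<container_name>.<storage_account_name>.blob.core.windows.net",
    "<sas_token>"
)
```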
Step 3: Read JSON File into DataFrame
Now read the JSON file using the spark.read.json() method and load it into a PySpark DataFrame.
# Define the file path
file_path = "wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<file_name.json>"
# Read JSON file into DataFrame
df = spark.read.json(file_path)
# Display the DataFrame
df.show(truncate=False)
# Print the schema of the DataFrame
df.printSchema()
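One common pitfall: spark.read.json() assumes JSON Lines format by default. If your blob contains a single JSON array spread across multiple lines, add .option("multiLine", "true") to the reader. The difference between the two layouts, demonstrated with plain Python (the inline strings are illustrative):

```python
import json

# JSON Lines: one complete object per line -- Spark's default expectation.
json_lines = '{"Name": "Bob", "Age": 45}\n{"Name": "Lisa", "Age": 35}\n'
records = [json.loads(line) for line in json_lines.splitlines()]

# A single JSON array spanning multiple lines is NOT valid JSON Lines;
# individual lines will not parse on their own.  For such files, Spark needs:
#   df = spark.read.option("multiLine", "true").json(file_path)
json_array = '[\n  {"Name": "Bob", "Age": 45},\n  {"Name": "Lisa", "Age": 35}\n]'
array_records = json.loads(json_array)

# Both layouts describe the same records.
assert records == array_records
```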
Step 4: Example Output
Output of df.show():
+-------------+-----------+---+
|         Name|Nationality|Age|
+-------------+-----------+---+
|Aamir Shahzad|   Pakistan| 25|
|     Ali Raza|        USA| 30|
|          Bob|         UK| 45|
|         Lisa|     Canada| 35|
+-------------+-----------+---+
Output of df.printSchema():
root
 |-- Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Age: long (nullable = true)
Summary
- Set Spark configuration to access Azure Blob Storage using your account key or SAS token.
- Use spark.read.json() to load JSON files into a PySpark DataFrame.
- Verify your data using df.show() and df.printSchema().