How to Read a JSON File into a DataFrame from Azure Blob Storage
In this PySpark tutorial, you'll learn how to read a JSON file stored in Azure Blob Storage and load it into a PySpark DataFrame.
Step 1: Prerequisites
- Azure Storage Account
- Container with a JSON file
- Access key or SAS token
- PySpark environment (Databricks or local setup)
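The steps below assume a JSON file whose records match the output shown in Step 4. A minimal sketch of creating such a file locally for testing (the file name employees.json is an assumption; note that Spark's JSON reader expects JSON Lines format by default, one object per line):

```python
import json

# Sample records matching the data shown in Step 4 of this tutorial.
records = [
    {"Name": "Aamir Shahzad", "Nationality": "Pakistan", "Age": 25},
    {"Name": "Ali Raza", "Nationality": "USA", "Age": 30},
    {"Name": "Bob", "Nationality": "UK", "Age": 45},
    {"Name": "Lisa", "Nationality": "Canada", "Age": 35},
]

# Write as JSON Lines: one complete JSON object per line.
with open("employees.json", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")
```

Uploading this file to your container (for example with Azure Storage Explorer or the az CLI) gives you data to follow along with.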
Step 2: Configure Spark to Access Azure Blob Storage
Configure the Spark session to authenticate to your Azure Blob Storage account. Note that the fs.azure.account.key.* property below takes the storage account access key; a SAS token is configured through a separate fs.azure.sas.* property instead.
# Authenticate with the storage account access key
spark.conf.set(
    "fs.azure.account.key.<storage_account_name>.blob.core.windows.net",
    "<access_key>"
)
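If you authenticate with a SAS token instead of the account key, the property name is different: the token is scoped to a container under fs.azure.sas.*. A configuration sketch using the same placeholder convention (not runnable as-is):

```python
# SAS token authentication: note the container-scoped property name,
# which differs from the account-key property shown above.
spark.conf.set(
    "fs.azure.sas.<container_name>.<storage_account_name>.blob.core.windows.net",
    "<sas_token>"
)
```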
Step 3: Read JSON File into DataFrame
Now read the JSON file using the spark.read.json() method and load it into a PySpark DataFrame.
# Define the file path
file_path = "wasbs://<container_name>@<storage_account_name>.blob.core.windows.net/<file_name.json>"
# Read JSON file into DataFrame
df = spark.read.json(file_path)
# Display the DataFrame
df.show(truncate=False)
# Print the schema of the DataFrame
df.printSchema()
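One common pitfall: spark.read.json() assumes JSON Lines format by default. If your blob contains a single JSON array spread across multiple lines, add .option("multiLine", "true") to the reader. The difference between the two layouts, demonstrated with plain Python (the inline strings are illustrative):

```python
import json

# JSON Lines: one complete object per line -- Spark's default expectation.
json_lines = '{"Name": "Bob", "Age": 45}\n{"Name": "Lisa", "Age": 35}\n'
records = [json.loads(line) for line in json_lines.splitlines()]

# A single JSON array spanning multiple lines is NOT valid JSON Lines;
# individual lines will not parse on their own.  For such files, Spark needs:
#   df = spark.read.option("multiLine", "true").json(file_path)
json_array = '[\n  {"Name": "Bob", "Age": 45},\n  {"Name": "Lisa", "Age": 35}\n]'
array_records = json.loads(json_array)

# Both layouts describe the same records.
assert records == array_records
```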
Step 4: Example Output
Output of df.show():
+-------------+-----------+---+
|         Name|Nationality|Age|
+-------------+-----------+---+
|Aamir Shahzad|   Pakistan| 25|
|     Ali Raza|        USA| 30|
|          Bob|         UK| 45|
|         Lisa|     Canada| 35|
+-------------+-----------+---+
Output of df.printSchema():
root
 |-- Name: string (nullable = true)
 |-- Nationality: string (nullable = true)
 |-- Age: long (nullable = true)
Summary
- Set Spark configuration to access Azure Blob Storage using your account key or SAS token.
- Use spark.read.json() to load JSON files into a PySpark DataFrame.
- Verify your data using df.show() and df.printSchema().