How to use createDataFrame() with Schema in PySpark
In PySpark, when creating a DataFrame with createDataFrame(), you can specify a schema to define column names and data types explicitly. This is useful when you want to control the structure and data types of your DataFrame instead of relying on PySpark's automatic inference.
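For contrast, here is a quick sketch of what automatic inference does when no schema is given (assuming an existing SparkSession named spark; note that Python ints are inferred as long rather than int):
data = [(1, "Alice")]
# Without a StructType, PySpark infers column types from the Python values
df_inferred = spark.createDataFrame(data, ["id", "name"])
df_inferred.printSchema()  # id is inferred as long, name as string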
Why define a Schema?
- Ensures consistent column names and data types
- Improves data quality and validation (see the sketch after this list)
- Provides better control over data transformations
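To make the validation point concrete: createDataFrame() checks each row against the schema by default (its verifySchema parameter defaults to True), so a bad row fails at creation time instead of silently corrupting a column. A minimal, self-contained sketch (the app name and sample values are illustrative):
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType

spark = SparkSession.builder.appName("SchemaValidation").getOrCreate()

strict = StructType([StructField("id", IntegerType(), False)])

# A row that matches the schema is accepted
spark.createDataFrame([(1,)], strict).show()

# A value of the wrong type is rejected during creation
try:
    spark.createDataFrame([("not-an-int",)], strict)
except TypeError as err:
    print(err)

# None in a non-nullable field is rejected as well
try:
    spark.createDataFrame([(None,)], strict)
except ValueError as err:
    print(err)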
Example Usage
Below is an example of how to create a DataFrame with an explicit schema in PySpark:
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create a SparkSession (the entry point for DataFrame operations)
spark = SparkSession.builder.appName("SchemaExample").getOrCreate()
# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),   # nullable=False: id must not be NULL
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])
# Sample data
data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "Amir", 40)
]
# Create DataFrame using schema
df = spark.createDataFrame(data, schema=schema)
# Show the DataFrame
df.show()
# Check the schema of the DataFrame
df.printSchema()
Output
+---+-------+---+
| id| name|age|
+---+-------+---+
| 1| Alice| 25|
| 2| Bob| 30|
| 3|Charlie| 35|
| 4| Amir| 40|
+---+-------+---+
Check the Schema
root
|-- id: integer (nullable = false)
|-- name: string (nullable = true)
|-- age: integer (nullable = true)
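A StructType is not the only way to pass a schema; createDataFrame() also accepts a DDL-format string. One caveat with this shorthand: the parsed columns come back nullable, so the NOT NULL constraint on id from the example above is not carried over:
# Same column names and types, expressed as a DDL string
df2 = spark.createDataFrame(data, schema="id INT, name STRING, age INT")
df2.printSchema()  # all three fields are nullable in this form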