How to use createDataFrame() with Schema in PySpark

In PySpark, when creating a DataFrame with createDataFrame(), you can pass a schema that defines the column names, data types, and nullability explicitly. This is useful when you want to control the structure of your DataFrame instead of relying on PySpark's automatic type inference.

Why define a Schema?

  • Ensures consistent column names and data types
  • Improves data quality and validation
  • Provides better control over data transformations

Example Usage

Below is a sample example of how to create a DataFrame using a schema in PySpark:

from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Create (or reuse) a SparkSession
spark = SparkSession.builder.appName("createDataFrame-schema").getOrCreate()

# Define schema
schema = StructType([
    StructField("id", IntegerType(), False),
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True)
])

# Sample data
data = [
    (1, "Alice", 25),
    (2, "Bob", 30),
    (3, "Charlie", 35),
    (4, "Amir", 40)
]

# Create DataFrame using schema
df = spark.createDataFrame(data, schema=schema)

# Show the DataFrame
df.show()

# Check the schema of the DataFrame
df.printSchema()

Output

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
|  4|   Amir| 40|
+---+-------+---+

Check the Schema

root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = true)

Watch the Video Tutorial

If you prefer a video explanation, check out the tutorial below: