What is a DataFrame in PySpark? | How to create DataFrame from Static Values | PySpark Tutorial

What is DataFrame in PySpark?

A DataFrame in PySpark is a distributed collection of data organized into named columns. It is similar to a table in a relational database or an Excel spreadsheet. DataFrames let you process large amounts of data efficiently by distributing the work across multiple machines in a cluster.

Key Features

  • Structured Data: Organizes data into rows and columns.
  • Fast and Scalable: Handles large datasets effectively.
  • Data Source Flexibility: Works with CSV, JSON, Parquet, databases, etc.
  • SQL Queries: Supports SQL-like queries for filtering and grouping data.

Example: Creating a DataFrame

from pyspark.sql import SparkSession
from pyspark.sql import Row

# Create a SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()

# Create data as a list of Row objects
data = [
    Row(id=1, name="Alice", age=25),
    Row(id=2, name="Bob", age=30),
    Row(id=3, name="Charlie", age=35)
]

# Create DataFrame
df = spark.createDataFrame(data)

# Show DataFrame content
df.show()

Output

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  1|  Alice| 25|
|  2|    Bob| 30|
|  3|Charlie| 35|
+---+-------+---+

Conclusion

PySpark DataFrames are an essential tool for working with structured and semi-structured data in big data processing. They provide an easy-to-use API for data manipulation and analysis.
