PySpark cast() vs astype() Explained
In this tutorial, we'll explore how to convert PySpark DataFrame columns from one type to another using cast() and astype(). You'll learn how to convert string columns to integers, floats, and doubles in a clean and efficient way.
1. Sample DataFrame
from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.appName("CastExample").getOrCreate()
data = [
("1", "Aamir", "50000.5"),
("2", "Ali", "45000.0"),
("3", "Bob", None),
("4", "Lisa", "60000.75")
]
columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()
2. Using the cast() Function
Convert id to integer and salary to float:
df_casted = df.withColumn("id", col("id").cast("int")) \
    .withColumn("salary", col("salary").cast("float"))
df_casted.printSchema()
df_casted.show()
3. Using the astype() Function
astype() is an alias for cast() and is used in the same way:
df_astype = df_casted.withColumn("salary", col("salary").astype("double"))
df_astype.printSchema()
df_astype.show()
Output:
Original DataFrame (all columns as strings):
+---+-----+--------+
| id| name|  salary|
+---+-----+--------+
|  1|Aamir| 50000.5|
|  2|  Ali| 45000.0|
|  3|  Bob|    null|
|  4| Lisa|60000.75|
+---+-----+--------+
After cast():
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- salary: float (nullable = true)
After astype():
root
|-- id: integer (nullable = true)
|-- name: string (nullable = true)
|-- salary: double (nullable = true)
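One thing to keep in mind when choosing between float and double: Spark's float is a 32-bit single-precision value, while double is 64-bit. Widening a float column to double changes the type but cannot bring back digits that were already rounded away. The plain-Python sketch below imitates that round trip with the standard struct module. It does not use Spark, and to_float32 is a hypothetical helper written only for this illustration. The value 50000.1 is an extra number added to show a case that does not survive the trip.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


# The first three values come from the sample data; 50000.1 is added for contrast.
salaries = [50000.5, 45000.0, 60000.75, 50000.1]

for salary in salaries:
    widened = to_float32(salary)
    # True means the value survived the float -> double round trip unchanged.
    print(salary, widened == salary)
```

The sample salaries happen to be exactly representable in 32 bits, so they come through unchanged. If double is the type you want in the end, casting the original string column straight to double avoids the intermediate rounding.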