PySpark cast() vs astype() Explained | Convert String to Int, Float & Double in DataFrame | PySpark Tutorial

In this tutorial, we'll convert PySpark DataFrame columns from one type to another using cast() and its alias astype(). You'll learn how to convert string columns to integers, floats, and doubles.

1. Sample DataFrame

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("CastExample").getOrCreate()

data = [
    ("1", "Aamir", "50000.5"),
    ("2", "Ali", "45000.0"),
    ("3", "Bob", None),
    ("4", "Lisa", "60000.75")
]

columns = ["id", "name", "salary"]
df = spark.createDataFrame(data, columns)
df.printSchema()
df.show()

2. Using cast() Function

Convert id to integer and salary to float:

df_casted = df.withColumn("id", col("id").cast("int")) \
              .withColumn("salary", col("salary").cast("float"))
df_casted.printSchema()
df_casted.show()

3. Using astype() Function

astype() is an alias for cast() and is used in the same way. Here it converts the float salary column from the previous step to double:

df_astype = df_casted.withColumn("salary", col("salary").astype("double"))
df_astype.printSchema()
df_astype.show()
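
One caveat about the order of conversions: a Spark float is a 32-bit value and a double is a 64-bit value, so widening a float column to double keeps whatever rounding the float step introduced. The sample salaries happen to be exactly representable in 32 bits, so they are unaffected. The sketch below emulates the two paths in plain Python (no Spark session needed) for a value that is not exactly representable; the helper to_float32 and the value 50000.1 are illustrative and are not part of the tutorial's data.

```python
import struct


def to_float32(value: float) -> float:
    """Round a 64-bit Python float to the nearest 32-bit float, then widen it back."""
    return struct.unpack("f", struct.pack("f", value))[0]


# Roughly what cast("double") applied directly to the string would hold.
direct = float("50000.1")

# Roughly what cast("float") followed by astype("double") would hold.
via_float = to_float32(float("50000.1"))

print(direct)     # 50000.1
print(via_float)  # 50000.1015625
```

If double precision is what you need in the end, casting the original string column straight to "double" avoids the intermediate 32-bit rounding.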

Output:

Original DataFrame (all columns as strings):
+---+-----+--------+
| id| name|  salary|
+---+-----+--------+
|  1|Aamir| 50000.5|
|  2|  Ali| 45000.0|
|  3|  Bob|    null|
|  4| Lisa|60000.75|
+---+-----+--------+

After cast():
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: float (nullable = true)

After astype():
root
 |-- id: integer (nullable = true)
 |-- name: string (nullable = true)
 |-- salary: double (nullable = true)
