PySpark Tutorial for Beginners and Advanced Users

The Complete PySpark Video Tutorial Series (Beginner to Advanced)

Chapter 1: Introduction to PySpark

  1. What is PySpark? What is Apache Spark? | Apache Spark vs PySpark
  2. How to Create Azure Databricks Service | PySpark Tutorial | PySpark Beginners Tutorial
  3. How to Format Notebooks | Top Markdown Formatting Tips & Tricks | PySpark Tutorial
  4. How to Use Comments in PySpark | How to Write Comments Like a Pro | PySpark Tutorial

Chapter 2: Core DataFrame Operations

  1. What is a DataFrame in PySpark? | How to create DataFrame from Static Values
  2. How to Add Columns and Check Schema in PySpark DataFrame
  3. How to Use createDataFrame Function with Schema in PySpark to create DataFrame
  4. How to Display Data in PySpark Using show() Function
  5. How to Use the display() Function in Databricks
  6. How to Use select(), selectExpr(), col(), expr(), when(), and lit() in PySpark
  7. How to Use withColumn() Function in PySpark to Add & Update Columns
  8. Use distinct() to Remove Duplicates from DataFrames | Get Unique Rows with distinct() in PySpark
  9. How to Use drop() to Remove Columns from DataFrame
  10. How to Use dropDuplicates() Function in PySpark
  11. How to Use dropna() Function in PySpark | Remove Null Values Easily
  12. How to Use PySpark count() Function | Count Rows & Records Easily
  13. filter() vs where() | Filter DataFrames
  14. How to Use subtract() to Compare and Filter DataFrames
  15. How to Use transform() for Custom DataFrame Transformations
  16. PySpark replace() Function | Replace Values in DataFrames Easily

Chapter 3: Reading and Writing DataFrames

  1. How to Read CSV File into DataFrame from Azure Blob Storage
  2. How to Read JSON File into DataFrame from Azure Blob Storage
  3. How to Use toJSON() - Convert DataFrame Rows to JSON Strings
  4. writeTo() Explained: Save, Append, Overwrite DataFrames to Tables
  5. How to Use write() to Create a Single CSV File in Blob Storage from a DataFrame
  6. How to Write DataFrame to Parquet File in Azure Blob Storage
  7. How to Write DataFrame to JSON File in Azure Blob Storage
  8. How to Write DataFrame to Azure SQL Table Using JDBC write() Function
  9. How to Read Data from Azure SQL Table and Write to JSON File in Blob Storage
  10. Read Multiple CSV Files from Blob Storage and Write to Azure SQL Table with Filenames

Chapter 4: Advanced DataFrame Operations

  1. How to Sort DataFrames Using orderBy()
  2. limit() Function to Display Limited Rows
  3. How to Use describe() for DataFrame Statistics
  4. Difference Between union() and unionAll() | union() vs unionAll()
  5. fillna() Function to Replace Null or Missing Values
  6. groupBy() Function | Group & Summarize DataFrames
  7. first(), head(), and tail() Functions | Retrieve Data Efficiently
  8. exceptAll() Function Explained | Subtract and Find Differences Between DataFrames
  9. How to Use intersect() and intersectAll() | Compare DataFrames Easily
  10. na Functions and isEmpty Explained with Examples
  11. How to Use cube() for GroupBy and Aggregations
  12. How to Aggregate Data Using agg() Function
  13. How to Use colRegex() for Column Selection
  14. How to Get Column Names Using the columns Attribute
  15. How to Use the pivot() Function | Transform and Summarize Data Easily
  16. How to Perform Unpivot | Convert Columns to Rows Easily
  17. How to Use createTempView | Run SQL Queries on DataFrames
  18. How to Use createGlobalTempView | Share Views Across Sessions
  19. melt() Function Explained | Reshape & Unpivot DataFrames Step by Step
  20. How to Use dtypes | Get DataFrame Column Names and Types
  21. How to Use randomSplit | Split Your DataFrame into Train and Test Sets
  22. DataFrame.to Function | Schema Reconciliation & Column Reordering Made Easy
  23. take Function | Get First N Rows from a DataFrame Fast
  24. DataFrame summary() for a Statistical Summary in One Command
  25. freqItems() Function | Identify Frequent Items in Columns Fast
  26. How to Use rollup() to Aggregate Data by Groups and Subtotals
  27. How to Use crosstab() to Analyze Relationships Between Columns
  28. How to Use unionByName() to Combine DataFrames by Column Names
  29. How to Use sample() to Randomly Select Data
  30. How to Use withColumnRenamed() to Rename Columns
  31. How to Use withColumnsRenamed() to Rename Multiple Columns
  32. stat Function Tutorial | Perform Statistical Analysis on DataFrames

Chapter 5: DataFrame Performance and Optimization

  1. How to Access a Column in PySpark Using Dot Notation or Square Brackets
  2. How to Use approxQuantile() in PySpark | Quick Guide to Percentiles & Median
  3. How to Use Cache in PySpark to Improve Spark Performance
  4. checkpoint(): Improve Fault Tolerance & Speed in Spark Jobs - Tutorial for Beginners
  5. coalesce() Function Tutorial - Optimize Partitioning for Faster Spark Jobs
  6. repartition() Function Tutorial: Optimize Data Partitioning for Better Performance
  7. collect() Function Tutorial: Retrieve Entire DataFrame to Driver with Examples
  8. How to Use hint() for Join Optimization – Broadcast, Shuffle, Merge
  9. localCheckpoint Explained | Improve Performance with localCheckpoint()
  10. How to Use persist() Function | Cache vs Persist Explained with Examples
  11. Optimize Your Data with repartitionByRange()
  12. unpersist Explained | How to Free Memory with unpersist Function With Examples

Chapter 6: DataFrame Analysis

  1. How to Use the corr() Function in PySpark | Finding Correlation Between Columns
  2. cov() Tutorial | Covariance Analysis for Numerical Columns

Chapter 7: DataFrame Joins

  1. How to Use crossJoin() Function for Cartesian Product
  2. Joins Explained | Inner, Left, Right Join with Examples in PySpark DataFrames

Chapter 8: RDDs in PySpark

  1. How to Convert PySpark DataFrame to RDD Using .rdd | PySpark RDD vs DataFrame
  2. What is RDD in PySpark? | A Beginner’s Guide to Apache Spark’s Core Data Structure
  3. Different Ways to Create an RDD in PySpark
  4. What Are RDD Partitions in PySpark? | How Spark Partitioning Works
  5. PySpark PairRDD Transformations | groupByKey, reduceByKey, sortByKey Explained with Real Data
  6. Understanding RDD Actions in PySpark - collect() vs count() vs reduce()
  7. RDD Persistence in PySpark Explained | MEMORY_ONLY vs MEMORY_AND_DISK with Examples
  8. Optimize Spark Shuffles: Internals of groupByKey vs reduceByKey
  9. Boost PySpark Performance with Broadcast Variables & Accumulators
  10. RDD vs DataFrame in PySpark – Key Differences with Real Examples

Chapter 9: Working with Functions

  1. PySpark foreach() Function Tutorial | Apply Custom Logic to Each Row in a DataFrame
  2. PySpark foreachPartition() Explained | Process DataFrame Partitions Efficiently with Examples
  3. How to Use call_function() in PySpark – Dynamically Apply SQL Functions in Your Code
  4. How to Use call_udf() in PySpark | Dynamically Apply UDFs in Real-Time
  5. Boost PySpark Performance with pandas_udf
  6. Create Custom Column Logic in PySpark Using udf() | Easy Guide with Real Examples
  7. How to Build User-Defined Table Functions (UDTFs) in PySpark | Split Rows

Chapter 10: SQL and Metadata

  1. PySpark inputFiles Function - How to Get Source File Paths from DataFrame
  2. PySpark isLocal Function: Check If DataFrame Operations Run Locally
  3. What is explain() in PySpark? | Spark Logical vs Physical Plan | PySpark Tutorial for Beginners
  4. PySpark offset() Function: How to Skip Rows in a Spark DataFrame
  5. Explore Databases and Tables in PySpark with spark.catalog | Guide to Metadata Management

Chapter 11: Sorting and Data Types

  1. PySpark Sorting Explained | ASC vs DESC | Handling NULLs with asc_nulls_first & desc_nulls_last
  2. PySpark cast() vs astype() Explained | Convert String to Int, Float & Double in DataFrame
  3. Core Data Types in PySpark Explained - IntegerType, FloatType, DoubleType, DecimalType, StringType
  4. PySpark Complex Data Types Explained: ArrayType, MapType, StructType & StructField for Beginners

Chapter 12: Nulls and Complex Types

  1. PySpark Null & Comparison Functions: between(), isNull(), isin(), like(), rlike(), ilike()
  2. PySpark when() and otherwise() Explained | Apply If-Else Conditions to DataFrames
  3. PySpark String Functions Explained | contains(), startswith(), substr(), endswith()
  4. Working with Structs and Nested Fields in PySpark | getField, getItem, withField, dropFields
  5. PySpark Date and Timestamp Types - DateType, TimestampType, Interval Types
  6. PySpark Array Functions: array(), array_contains(), sort_array(), array_size()
  7. PySpark Set-Like Array Functions: arrays_overlap(), array_union(), flatten(), array_distinct()
  8. Advanced Array Manipulations in PySpark: slice(), concat(), element_at(), sequence()
  9. PySpark Map Functions: create_map(), map_keys(), map_concat(), map_values()
  10. Transforming Arrays and Maps in PySpark: Advanced Functions | transform(), filter(), zip_with()
  11. Flatten Arrays & Structs with explode(), inline(), and struct()

Chapter 13: Dates and Times

  1. Top PySpark Built-in DataFrame Functions Explained | col(), lit(), when(), expr(), rand() & More
  2. Top PySpark Math Functions Explained | abs(), round(), log(), pow(), sin(), degrees() & More
  3. Getting Current Date & Time in PySpark | curdate(), now(), current_timestamp()
  4. Date Calculations in PySpark | Add, Subtract, Datediff, Months Between & More
  5. PySpark Date Formatting & Conversion Tutorial: to_date(), to_timestamp(), unix_timestamp()
  6. PySpark Date & Time Creation: make_date(), make_timestamp(), make_interval()
  7. PySpark Date and Time Extraction Tutorial: year(), hour(), dayofweek(), date_part()
  8. PySpark Date Truncation Functions: trunc(), date_trunc(), last_day()

Chapter 14: String Manipulation

  1. How to Perform String Cleaning in PySpark | lower, trim, initcap Explained with Real Data
  2. Substring Functions in PySpark: substr(), substring(), overlay(), left(), right()
  3. String Search in PySpark | contains, startswith, endswith, like, rlike, locate
  4. String Formatting in PySpark | concat_ws, format_number, printf, repeat, lpad, rpad Explained
  5. Split Strings in PySpark | split(str, pattern, limit) Function Explained with Examples
  6. split_part() in PySpark Explained | Split Strings by Delimiter and Extract a Specific Part
  7. Using find_in_set() in PySpark | Search String Position in a Delimited List
  8. Extract Data with regexp_extract() in PySpark | Regex Patterns Made Easy
  9. Clean & Transform Strings in PySpark Using regexp_replace() Replace Text with Regex
  10. Extract Substrings Easily in PySpark with regexp_substr()

Chapter 15: JSON

  1. PySpark JSON Functions Explained | How to Parse, Transform & Extract JSON Fields in PySpark

Chapter 16: Aggregations and Window Functions

  1. PySpark Aggregations: count(), count_distinct(), first(), last() Explained with Examples
  2. Statistical Aggregations in PySpark | avg(), mean(), median(), mode()
  3. Summarizing Data with Aggregate Functions in PySpark: sum(), sum_distinct(), bit_and()
  4. PySpark Window Functions | Rank, Dense_Rank, Lead, Lag, NTILE | Real-World Demo
  5. Master PySpark Sorting: sort(), asc(), desc() Explained with Examples

Chapter 17: Additional DataFrame Functions

  1. PySpark toLocalIterator Explained: Efficiently Convert DataFrame to Iterator
  2. PySpark sequence Function Explained: Generate Sequences of Numbers and Dates

© 2025 Aamir Shahzad. All rights reserved.