in this series, PySpark for Beginners: Mastering the Basics then you already understand the heart of Spark: distributed data, DataFrames, and lazy execution. You’ve installed PySpark, stood up a SparkSession, read a CSV and performed simple manipulations of the data in a Dataframe. I’ll leave a link to that story at the end of this one.

One thing worth repeating from that original article is that I often use the terms PySpark and Spark interchangeably, but strictly speaking, Spark is the overarching distributed computing framework (written in Scala), and PySpark is a dedicated Python API to Spark.

Beyond the basics

Now, something interesting happens when you get past that beginner stage. You quickly realise your second PySpark project requires a slightly different mindset:

  • You want to read/write data in a safer, faster, more predictable way.
  • You want to combine datasets without feeling uncertain about joins.
  • You want to understand why Spark is behaving the way it does — and how to nudge it gently in the right direction.

This article takes you through those next steps. It’s deliberately slow‑paced and practical. No deep internals. No cluster tuning. No complicated Spark optimisations. Just the things real beginners need to know when they move from toy examples to small, real-world work.

We’re using open‑source Spark, running locally, just like before.

1. Taking the next step: reading data properly

In my first article, we used the simplest possible CSV loader:

df = spark.read.csv("sales.csv", header=True, inferSchema=True)

It works - and it’s fine for early experiments - but it hides a subtle problem.

Spark is guessing your data types

When you use the inferSchema=True directive, Spark looks at a small sample of your file and uses that information to guess whether a column is an integer, string, boolean, or double. That means:

  • If 99 rows appear to be numeric and the 100th row is blank, Spark might interpret the column as a string.
  • If someone edits the file next week and accidentally adds £23.50 instead of 23.50, Spark might treat the entire column differently.
  • If your file is large, the sample Spark uses won’t represent the whole dataset.

This can lead to mysterious behaviour later , the kind of bugs beginners find hardest to diagnose.

A better beginner habit: define a schema for your data

Think of a schema as Spark’s version of a blueprint for reading data. Before building anything, you tell Spark things like:

The names of the columns
What data type they should be
Whether or not a column value is optional.

Here’s what it looks like for our sales data example. Recall that the data looked like this:

transaction_id,customer_name,net_amount,tax_amount, is_member
101,Alice,250.50,25.05,true
102,Bob,120.00,6.00, false
103,Charlie,450.75,25.07,true
104,David,89.99,5.73,false

To specify the types of the above fields in Spark, we define our schema using code like the following.

from pyspark.sql import types as T

schema = T.StructType([
T.StructField("transaction_id", T.IntegerType(), False),
T.StructField("customer_name", T.StringType(), False),
T.StructField("net_amount", T.DoubleType(), True),
T.StructField("tax_amount", T.DoubleType(), True),
T.StructField("is_member", T.BooleanType(), True),
])

The column names and type parameters are self-explanatory. The True[False] parameter indicates that there may [not] be NULL values in the column. Note, the True/False nullability flag is mostly schema metadata and optimisation info. It’s not always strictly enforced for every data source the way a database NOT NULL constraint is.

More useful options when reading CSV data

There’s a bunch of handy CSV read options you can combine with the schema directive that make loading CSV data even more reliable.

The more common options include:

  • mode=”PERMISSIVE”: keeps bad rows as much as possible
  • mode=”DROPMALFORMED”: drops malformed rows
  • mode=”FAILFAST”: errors immediately
  • header= True[False]: Does the file contain [or not] a header record
  • nullValue: what text should replace null values in the input
  • dateFormat / timestampFormat

Now we can load the sales_data into a Dataframe like this:

df = (
    spark.read
    .option("header", True)
    # Other modes: "PERMISSIVE" and "DROPMALFORMED".
    .option("mode", "FAILFAST")
    .option("nullValue", "N/A")
    .schema(schema)
    .csv("sales_data.csv")
)

Why is this important for beginners?

  1. You know what the data types are before you start working.
  2. If specified, Spark will reject weird rows instead of silently interpreting them.
  3. Your transformations become more predictable.
  4. If you join two datasets later, type mismatches won’t surprise you.

2. Understanding data transformations 

Recall, in my previous article, in our first steps with manipulating dataframes with PySpark, we added an extra, derived column to our Dataframe using code like this:

df2 = df.withColumn("gross_amount", df.net_amount + df.tax_amount)

I explained that this line doesn’t calculate anything yet. It simply adds a step to Spark’s internal plan:

1. Read the CSV  
2. Add a new column (gross_amount = net + tax)

Then you might add more steps like this:

df3 = df2.withColumn("tax_percentage", df2.tax_amount / df2.gross_amount * 100)

Still, no computation has happened. Only when you perform an action like …

df3.show()

… does Spark say:

“Okay, now I need to actually run all these steps.”

This is what “lazy execution” means, but the important bit for beginners isn’t the name. It’s the effect, and it means,

  • You can chain many transformations without “paying” for them until you need the result.
  • Spark can rearrange the order internally to run things efficiently.
  • You don’t waste time doing intermediate steps on data you might filter out later.

Think of it like you would an everyday task, like making a sandwich:

  • You gather all the ingredients.
  • You assemble it in your mind.
  • You only actually start cutting and preparing once you know precisely what you’re making.

3. Cleaning data before it causes problems

Real data is usually messy and often contains missing values, blank strings, duplicate records, or placeholder values like “N/A” and “unknown”.

In PySpark, the goal is to catch and deal with obvious problems early so the rest of your workflow behaves predictably. PySpark has a number of useful functions that enable you to do this.

Dropping rows with missing values

The simplest cleaning function is dropna().

df_clean = df.dropna()

This removes any row that contains a null value in any column. That can be useful, but it is often too aggressive.

More commonly, you only drop rows where important columns in that particular row are missing:

df_clean = df.dropna(subset=["net_amount", "tax_amount"])

This means:

Keep the row as long as net_amount and tax_amount are present.

Other columns may still contain nulls, and that might be fine.

Filling missing values

Sometimes you don’t want to remove rows. You just want to replace missing values with something sensible.

That is where fillna() is useful.

df_clean = df.fillna({"city": "Unknown"})

You can also fill numeric columns:

df_clean = df.fillna({"tax_amount": 0.0})

This is useful when a missing value has a clear meaning. For example, a missing discount amount might reasonably become 0.0. But be careful. Filling missing values can change the meaning of your data if you choose the wrong default.

Changing column types with cast()

Sometimes Spark reads a column as the wrong type, especially when working with CSV files. If so, you can convert a column using the cast() operator:

from pyspark.sql import functions as F 
df_clean = df.withColumn("net_amount",F.col("net_amount").cast("double") )

This is especially common when dates, numbers, or booleans have been read as strings.

Removing duplicate rows

Duplicate rows can appear when files are exported more than once, joined incorrectly, or combined from multiple sources. You can remove exact duplicates like this:

df_clean = df.dropDuplicates()

Or remove duplicates based on one or more selected columns.

df_clean = df.dropDuplicates(["transaction_id"])

That second version is often more useful because it says:

Each transaction ID should only appear once.

A small data cleaning example

Putting those ideas together:

from pyspark.sql import functions as F


df_clean = (
    df
    # Remove transactions missing required values.
    .dropna(subset=["transaction_id", "net_amount"])
    # Supply defaults for optional values.
    .fillna(
        {
            "city": "Unknown",
            "tax_amount": 0.0,
        }
    )
    # Apply the expected numeric types.
    .withColumn(
        "net_amount",
        F.col("net_amount").cast("double"),
    )
    .withColumn(
        "tax_amount",
        F.col("tax_amount").cast("double"),
    )
    # Keep one row for each transaction.
    .dropDuplicates(["transaction_id"])
)

4. Joining datasets in PySpark without getting lost

If you’ve worked with databases before, you’ve probably written SQL statements that join two or more tables together. Joins in Spark work the same way, but on Dataframes.

What is a join?

If the concept of a join is new to you, they are a way to match rows from one DataFrame with related rows from another DataFrame. In other words, it answers a question like:

“Which rows in this DataFrame correspond to rows in that DataFrame?”

That is the main idea behind every join in PySpark. Once that part is clear, the syntax and join types become much easier to understand.

If you have two Dataframes like this:

sales_data.csv

transaction_id, customer_name, net_amount, tax_amount
101, Alice,    250.50, 25.05
102, Bob,      120.00, 6.00

customers.csv

customer_name, city, loyalty_level
Alice, New York, Gold
Bob,   London,   Silver

You can join them on their common customer_name field like this:

df_sales = spark.read.csv("sales_data.csv", header=True)
df_customers = spark.read.csv("customers.csv", header=True)
df_joined = df_sales.join(df_customers, on="customer_name", how="inner")
df_joined.show()


# Output
+-------------+--------------+----------+----------+--------+-------------+
|customer_name|transaction_id|net_amount|tax_amount|city    |loyalty_level|
+-------------+--------------+----------+----------+--------+-------------+
|Alice        |101           |250.50    |25.05     |New York|Gold         |
|Bob          |102           |120.00    |6.00      |London  |Silver       |
+-------------+--------------+----------+----------+--------+-------------+

Which join should beginners use?

There are several different types of joins available in Spark. For 99% of beginner use‑cases, you’ll use one of the following:

  • inner  — show only matching rows
  • left  — show everything in the left table, plus matches
  • outer  — show all rows from both tables

And of these, the inner join will be far and away the most common type of join you’ll use in your day-to-day work

Don’t worry about “broadcast”, “sort‑merge”, “shuffle‑hash”, or any other advanced join strategy yet. As your experience of Spark grows, you can read up on these at your leisure.

Just remember:

Joins are computaionally more expensive than simple column operations, so use them when necessary — but not casually.

5. Reading & Writing data out in the “Spark way”: Parquet

Most beginners stick with CSV because it’s familiar. But CSV is slow, rigid, and lacks support for data types, and in real life, Parquet is Spark’s native data format. Parquet is a columnar, compressed data format ideally suited for data analytics, data reporting and read-heavy workloads.

When Spark reads a Parquet data set:

  • It only loads the columns you actually need.
  • It understands every data type.
  • It loads significantly faster than CSV.

You write out the Dataframe contents in Parquet format files like this:

df_joined.write.mode("overwrite").parquet("output/enriched_sales")

Then you can read it back instantly like this,

df_fast = spark.read.parquet("output/enriched_sales")
df_fast.show()

NB. Using Parquet for file input and output is the single easiest performance “upgrade” for any Spark beginner.

6. Thinking in PySpark workflows

Once you understand how to read data, clean it, transform it, join it, and write it back out, the next step is learning how to organise those actions into a simple workflow. A beginner PySpark project usually follows this sequence:

Read data
  -> check and clean it
  -> add useful columns
  -> combine with other data
  -> write the result

That may sound obvious, but it is an important shift. You are no longer just experimenting with one DataFrame at a time. You are building a repeatable process.

Keep each stage simple

A useful beginner habit is to give each stage of your workflow a clear purpose. For example:

df_raw = spark.read.schema(schema).csv("sales_data.csv", header=True)

df_clean = df_raw.dropna(subset=["net_amount", "tax_amount"])

df_enriched = df_clean.withColumn(
    "gross_amount",
    F.col("net_amount") + F.col("tax_amount")
)

df_final = df_enriched.join(df_customers, on="customer_name", how="left")

df_final.write.mode("overwrite").parquet("output/final_dataset")

This style is slightly more verbose than chaining everything into one long expression, but it is much easier to read when you are learning.

Each DataFrame name tells you where you are in the workflow:

df_raw       -> the data as it arrived
df_clean     -> the data after basic cleaning
df_enriched  -> the data after adding new meaning
df_final     -> the dataset ready to save

Why this matters

When something goes wrong, this structure makes debugging much easier.

You can inspect each stage by looking at the data:

df_raw.show()
df_clean.show()
df_enriched.show()

You can check row counts:

df_raw.count()
df_clean.count()
df_final.count()

This helps to answer useful questions like:

Did rows disappear unexpectedly during cleaning?
Did the join create more rows than expected?
Did a calculated column produce nulls?

The simple mental model of: Inputs → preparation → combination → output will take you surprisingly far in your PySpark journey.

7. A gentle introduction to the Spark UI

Spark has a nice little web UI that switches on when you run an action like .count() or .write(). When your Spark job is running locally, visit:

http://localhost:4040

You should see something like this displayed.

It looks a bit overwhelming, but you don’t need to understand every tab. At this stage, you only need to know that the UI exists and why it’s useful. And it’s useful because it helps you see which Spark jobs have run or are currently running. 

And, as your experience in Spark grows, the UI can help you understand why jobs failed or are taking longer to run than expected. But that comes much later. For now, treat the Spark UI like the dashboard in your car — you don’t need to understand the engine to notice when something looks odd.

Summary: You’re now ready for your first real PySpark project

At this point, you’ve moved beyond “I can run Spark” into “I can build a clean, simple Spark pipeline.”

You now know how to:

  • read data safely,
  • clean and prepare it,
  • enrich it with new columns,
  • combine multiple datasets,
  • save the result efficiently,
  • and observe Spark just enough to stay confident.

Nothing in this article required a cluster. Nothing required advanced tuning. This is exactly how many real PySpark projects begin.

When you’re more experienced, you may want to build on your knowledge by researching some of these topics.

  • reading execution plans
  • understanding shuffles
  • managing partitions
  • other join types
  • simple performance tuning

These are some of the topics I hope to cover in a future article, but for now, you’ve mastered your next major milestone, and you can build something meaningful and useful with PySpark.


BTW, here is that link to the first article in this series,
PySpark for Beginners: Mastering the Basics, which I mentioned at the start.



Source link