What Can We Do When Memory Becomes the New Bottleneck in Data Engineering?

, memory has undoubtedly become critical resource. As the demand for memory and storage infrastructure driven by the AI boom has reached historical highs, companies like Micron Technology and Sandisk have attracted unprecedented attention and increased product prices by wielding strong pricing power. But this is not good news to companies that build data-intensive applications, rely on high-capacity storage for AI training, implement large-scale analytics, or operate on tight margins in cloud services.

For data engineers, this isn’t just market news. Instead, it’s a daily constraint. When RAM and flash become more and more expensive, the old reflex of “add more capacity” no longer works. Budgets don’t scale with data volume and cloud bills are under scrutiny. As data engineers, what can we do when our dataset doubles but the cluster doesn’t? We must be creative.

In this article, I’ll start from a real-world ETL challenge that requires to complete the data transformation of over 6 million social media posts with mixed-types data fields within limited computing capacity. I’ll walk through some solutions — including both classic and cutting-edge ones — to keep your ETL pipeline running without a hardware or cloud upgrade.

The Problem: A 6.2 Million-Row Dataset That Wouldn’t Fit in Memory

The story starts from a new ETL pipeline that you’re going to create. The raw data are 6.2 million posts from a social media platform. The dataset is extracted from the social media API and turns to over 200 columns after JSON flattening, and the most problematic part is: a large number of the data fields are mixed data types.

Here are some examples of these columns in the original JSON format:

{
  "reaction_count": 1250
}

{
  "reaction_count": "1250"
}

{
  "reaction_count": null
}

{
  "hashtags": ["AI", "Python"]
}

{
  "hashtags": "AI"
}

{
  "hashtags": null
}

Because PySpark requires a consistent schema and the social media API schema changes from time to time, you consider using Pandas to tackle the mixed-types columns. Unlike PySpark, Pandas stores these columns as object by default and does not require every row to conform to the same schema.

import pandas as pd

def normalize_mixed_columns(df, mixed_columns):
    """
    Convert mixed-type columns to strings.
    """

    cleaned_df = df.copy()

    for column in mixed_columns:

        cleaned_df[column] = (
            cleaned_df[column]
            .where(cleaned_df[column].isna(),
                   cleaned_df[column].astype(str))
        )

    return cleaned_df

social_posts_clean = normalize_mixed_columns(
    social_posts_df,
    mixed_columns
)

Simple and straightforward. However, when you tried to run this code, the execution process was terminated because the memory usage exceeded available resources. The job failed.

A Classic Solution: Reducing Peak Memory with Chunk-Based Processing

The bottleneck is the 6.2 million rows. The dataset size is approximately 30GB, which is larger than standard cloud worker memory instance. It exceeds the available worker memory during intermediate dataframe transformations. So instead of performing the data type conversions for the entire column, which forces Pandas to materialize large temporary objects in memory, the chunking technique divides each column by chunks. For this case, it’s suitable to set the chunking size as 250,000. So Pandas only needs to proceed 250k rows at one time, then release the memory and move to the next chunk.

import gc

def normalize_mixed_columns_chunked(
    df,
    mixed_columns,
    chunk_size=250000
):

    cleaned_df = df.copy()

    for column in mixed_columns:

        col_idx = cleaned_df.columns.get_loc(column)

        for start in range(0, len(cleaned_df), chunk_size):

            end = min(start + chunk_size, len(cleaned_df))

            chunk = cleaned_df.iloc[start:end, col_idx]

            mask = chunk.notna()

            if mask.any():

                chunk = chunk.astype(object)

                chunk.loc[mask] = (
                    chunk.loc[mask]
                    .astype(str)
                    .values
                )

                cleaned_df.iloc[start:end, col_idx] = chunk.values

            del chunk
            del mask
            gc.collect()

    return cleaned_df


social_posts_clean = normalize_mixed_columns_chunked(
    social_posts_df,
    mixed_columns
)

After peak memory becomes much smaller, the data transformation completes successfully and the pipeline becomes stable. However, the running time become much longer. This is not surprising, as chunking technique is fundamentally a trade-off. It trades execution speed for pipeline reliability.

From Manual Chunking to Automated Parallel Execution

Besides manual chunking in Pandas, Dask automatically partitions a DataFrame into multiple smaller partitions and resolves the memory crash during data transformation. But its internal execution mechanics is different from chunking. When I set chunk_size in Pandas, it reads one chunk, runs my code on it, drops it from RAM, and then moves to the next. It uses one CPU core at a time, so for the cloud services which provide multiple CPU cores, it didn’t make full use of the capacities. Additionally, I have to manually write a loop to aggregate the results, which makes the code complex and lengthy.

Dask divides the dataset into chunks automatically. Dask constructs a task graph and schedules partitions across available CPU cores — it significantly reduces the running time.

However, we cannot ignore the mixed data types pitfall in Dask. Because a Dask DataFrame is composed of multiple Pandas DataFrame partitions, when reading CSV or JSON files, Dask infers data types from a sample of the data. If a column contains inconsistent values, for example, empty strings (“”), None, integers, and strings, Dask will likely raise a ValueError, TypeError or a metadata inference error. This happens because Dask infers a column’s data type from an initial sample of the data and guesses the column is an integer. But if it then encounters a string in one of the following chunks, Dask raises an error.

To resolve this issue, we must explicitly specify the expected column data types instead of relying on automatic inference. The code below is to use Dask to perform data transformation for mix-types columns.

import dask.dataframe as dd

df = dd.read_parquet(
    "social_posts.parquet",
    engine = "pyarrow"
)

mixed_columns = [
    "hashtags",
    "mentions",
    "location",
    "reaction_count",
]

for column in mixed_columns:
    df[column] = df[column].map(str, meta=(column, 'str'))

df.to_parquet("social_posts_clean/", engine="pyarrow")

Dask is not as flexible as Pandas chunking when handling dynamic columns with mixed data types because it requires to specify which columns to convert. Also, it still executes many Pandas operations within each partition, so workloads dominated by Python object columns can remain memory-intensive and not fast when processing million-rows datasets. Is there any other CPU-cache-efficient solution?

A Stronger Alternative — Polars

You may ask whether there is an approach that can balance speed and memory efficiency. The answer is Polars, a DataFrame library implemented in Rust Engine. Compared to Python, Rust produces highly optimized native machine code and offers excellent memory management. It minimizes memory allocations and eliminates garbage collection overhead. However, Rust also has its drawbacks. Its development speed is much slower than Python due to strict compiler checks and its learning curve is extremely steep. That’s the reasons why it’s far less popular than Python since it was created in 2010. Does that mean data engineers cannot use this method if they are not familiar with Rust?

Polars is a lightning-fast DataFrame library built with Rust engine and exposed through a Python API. It was launched in 2020 and designed to handle massive datasets much quicker than native Python Pandas. It keeps the strength of Rust engine but allows Python users to import from a Python library.

Polars uses the Apache Arrow in-memory columnar data format, which is designed to minimize memory copies while maximizing CPU cache efficiency. It executes the .cast(pl.String) operation directly in Rust code. These features enables Polars to run several times faster than Python and use only a fraction of the memory.

Polars builds a lazy query plan and the query optimizer determines the most efficient execution plan before reading the data. These mechanics reduce unnecessary memory usage. Therefore, when handling datasets that exceed available memory, Polars can process the data in streaming mode and prevents the entire dataset from being loaded into RAM at once.

import polars as pl

df = pl.read_parquet("social_posts.parquet")

mixed_columns = [
    "hashtags",
    "mentions",
    "location",
    "reaction_count",
]

df = df.with_columns([
    pl.col(col).cast(pl.String) for col in mixed_columns
])

df.write_parquet(
    "social_posts_clean.parquet"
)

Polars outperforms the previous two solutions in memory management because it’s designed from the ground up to use memory more efficiently while Pandas Chunking and Dask focus on reducing memory pressure during the execution.

However, Polars is not a silver bullet. It has its own disadvantages. First, Polars introduces its own DataFrame API although it uses Python. Common Pandas operations such as apply(), indexing, and groupby syntax often need to be rewritten. Second, many third-party libraries are still built around Pandas, so integration with Polars still requires conversion between DataFrame formats.

Final Thoughts

So is Pandas chunking out of date? Not necessarily.

Each of the three approaches solves a different problem. The best choice depends on the constraints rather than the latest technology.

If you have really limited computing resources and process dynamic schemas, Pandas chunking is still an excellent solution. It dramatically reduces peak memory usage. The trade-off is execution time. But in many production environment, a slower but stabler pipeline is far more valuable than a faster one that fails repeatedly.

If your workload has already outgrown a single machine and you want to make use of multiple CPU cores, Dask is a better option. It automates partitioning and parallel execution. However, you should pay attention to schema consistency and data types, especially when working with semi-structured data.

Polars is selected when the workload is performance-critical and you’ve learnt the new DataFrame API. Polar is usually considered as the strongest options because of its Rust engine, Apache Arrow memory format, and query optimizer. All these features enable Polars to process large datasets with significantly lower memory consumption and much higher performance. Similar to Dask, you need to address the issues caused by mixed data types and ensure the schema consistency.

To conclude, memory optimization is not about finding a single best solution. Instead, it’s about understanding the constraints of your project and choosing the right tool. In the AI era, the ability to optimize data pipelines under memory constraints is becoming a valuable skill for data engineers.

A robust ETL pipeline requires more than memory efficiency. It also depends on testability, maintainability, and deployment reliability.

This article is part of my practical data engineering series. If you’re interested in building production-ready ETL pipelines beyond performance optimization, you may also enjoy the article Your First Task as a Data Engineer in a New Company? Make the ETL Pipeline Testable, which covers environment setup, automated testing, and AI-assisted development.

Thank you for your reading!

Source link

What Can We Do When Memory Becomes the New Bottleneck in Data Engineering?

The Problem: A 6.2 Million-Row Dataset That Wouldn’t Fit in Memory

A Classic Solution: Reducing Peak Memory with Chunk-Based Processing

From Manual Chunking to Automated Parallel Execution

A Stronger Alternative — Polars

Final Thoughts

Like this:

Related

The Problem: A 6.2 Million-Row Dataset That Wouldn’t Fit in Memory

A Classic Solution: Reducing Peak Memory with Chunk-Based Processing

From Manual Chunking to Automated Parallel Execution

A Stronger Alternative — Polars

Final Thoughts

Share this:

Like this:

Related

Related News

Deploying retail AI to scale personalisation and customer insight

Persistent Latent Memory for Multi-Hop LLM Agents: How a 6G Handover Paper Closes the Agent Cold-Start

Build and Run Your Own AI Agent in the Cloud

Japan Bets on AI Robots to Solve a Worker Shortage