I Spent an Hour on a Data Preprocessing Task Before Asking Gemini

I Spent an Hour on a Data Preprocessing Task Before Asking Gemini


a significant amount of time on data preparation for downstream tasks. Whether it involves data cleaning, handling missing values, feature engineering, data preprocessing, or post processing, this phase requires a lot of time.

So, I was working on this post-processing task where I needed to create a new column in a Pandas DataFrame by extracting values from an existing column, based on the data from two other columns.

I could have directly asked an LLM to write the code (which I usually do) but this time I wanted to do it myself. It was early in the morning and I had a fresh mind so I was in the mood to handle some complex data operations.

Here is what I had to do. I had a DataFrame with predicted_categories, pred_category_id, and text_predicted_probs columns.

The values in the predicted_categories column are lists of five categories in “category_id” – “category_description” format.

['80814001 - Freze Uçları', 
 '13003106 - Freze', 
 '80805004 - Sanayi Makineleri', 
 '13003144 - Torna Makinesi', 
 '13003195 - Kumpas']

The text_predicted_probs column has the predicted probabilities of these five categories in order.

[0.943, 0.018, 0.008, 0.006, 0.004]

Hence, the first value in the text_predicted_probs is the probability of the first category in the predicted_categories, and so on.

The pred_category_id column shows the predicted category id from another model . What I need is the predicted probability of the category in the pred_category_id column.

I need to get the order of the pred_category_id in the predicted_categories column and then take its value from the test_predicted_probs column.

The drawing below demonstrates what I want to achieve:

In this drawing, I want to get the probability of category 13003106, which is the second item in the list and its corresponding probability value is 0.018.

If we asked Gemini, or another advanced model, we’ll probably get the answer in seconds. But, I wanted to do it on my own first and then ask Gemini.

Let’s start with reading the dataset into a Pandas DataFrame.

import pandas as pd

results = pd.read_csv("prediction_results.csv")

The values in the predicted_categories column are lists of strings with category ids and category names:

results.loc[0, "predicted_categories"]
# output: "['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']"

It’s a list but saved as a string so we first convert it to a list object using the literal_eval function in the built-in ast module of Python:

ast.literal_eval(results.loc[0, "predicted_categories"])
# output:
['80814001 - Freze Uçları',
 '13003106 - Freze',
 '80805004 - Sanayi Makineleri',
 '13003144 - Torna Makinesi',
 '13003195 - Kumpas']

To extract the category ids, we can split each string in this list at the “-” character and then select the first part after splitting. Since we have a list with five categories, we should do this operation in a list comprehension as follows:

[category.split("-")[0].strip() 
for category in ast.literal_eval(results.loc[0, "predicted_categories"])]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']

We’ve done it for a single value (i.e. one row). In order to do the same operation to the entire predicted_categories column, we can use a list comprehension. It will be a list comprehension inside another list comprehension (i.e. nested list comprehension):

results.loc[:, "predicted_category_ids"] = [
    [category.split("-")[0].strip() for category in ast.literal_eval(predicted_categories)] 
    for predicted_categories in results["predicted_categories"]
]

We now have category ids extracted from the predicted_categories column:

The next step is to check the order of the categories in the predicted category id lists. We will then use this order to extract the predicted probability of the category.

Python list object has an index method, which returns the index (i.e. order) of the item in the list.

results.loc[0, "predicted_category_ids"]
# output:
['80814001', '13003106', '80805004', '13003144', '13003195']


results.loc[0, "predicted_category_ids"].index("13003106")
# output:
2

Once I find the index of a predicted category id, I can use it to get the probability of this category id from the text_predicted_probs column:

What we need to do:

  1. Get the index of pred_category_id in the predicted_category_ids
  2. Use this index to extract the relevant value from text_predicted_probs

These steps can be done in a single operation by zipping these three columns. Let’s test it on the first row:

for i, j, k in zip(results["pred_category_id"][:1], results["predicted_category_ids"][:1], results["text_predicted_probs"][:1]):
    print(j.index(str(i))) # get the index of pred_category_id in predicted_category_ids
    print(ast.literal_eval(k)[j.index(str(i))]) # get the value at this index in text_predicted_probs

# output:
0
0.943

We can confirm the output in the screenshot above. The pred_category_id is 80814001, which is the first item (i.e. index = 0) in the predicted_category_ids and the first probability value is 0.943.

The lists in the text_predicted_probs column are saved as string as well so we apply the literal_eval function to convert them to a list object.

We can do this operation on the entire DataFrame using a list comprehension as follow:

results.loc[:, "pred_category_prob"] = [
    float(ast.literal_eval(k)[j.index(str(i))]) if str(i) in j else 0
    for i, j, k in zip(results["pred_category_id"], results["predicted_category_ids"], results["text_predicted_probs"])
]

We basically converted the for loop in the previous code block to a list comprehension. I’ve only added a check “if str(i) in j else 0” because there are some rows with a pred_category_id that does not exist in predicted_category_ids. In such cases, we set the probability value to 0.

The task is now completed as we can see below:

How did Gemini do this?

After working about an hour (including breaks) I completed the task. Then, I wanted to ask Gemini to see how it would handle the same task.

I wrote a prompt explaining my goal and included a small sample of the data.

Here is what Gemini produced on its first attempt:

import pandas as pd

# 1. Create the sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri', '13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008, 0.006, 0.004]
    ]
}

df = pd.DataFrame(data)

# 2. Define a function to extract the probability matching the category ID
def get_matching_prob(row):
    # Convert ID to string for matching
    target_id_str = str(row['pred_category_id'])
    
    # Iterate through the categories to find the matching index
    for index, category in enumerate(row['predicted_categories']):
        if category.startswith(target_id_str):
            # Return the corresponding probability from the same index
            return row['text_predicted_probs'][index]
            
    # Return None (or 0) if no match is found to prevent the code from crashing
    return None

# 3. Create the new column
df['pred_category_prob'] = df.apply(get_matching_prob, axis=1)

# Display the result
print(df[['pred_category_id', 'pred_category_prob']])

It first created a sample DataFrame using the data I provided in the prompt. Then, it wrote a function that does exactly what I explained.

It took me about an hour to write the code myself. Gemini completed the exact same task in seconds based on a prompt that took me only a few minutes to write.

However, there is a minor issue in Gemini’s code. It uses the apply function, which is not vectorized. In case of working with large datasets, using the apply function (basically a for loop) can increase the run time.

But the solution is simple. I asked Gemini to write a vectorized version of it and here is what I got:

import pandas as pd

# Sample DataFrame
data = {
    'predicted_categories': [
        ['80814001 - Freze Uçları', '13003106 - Freze', '80805004 - Sanayi Makineleri'],
        ['13003144 - Torna Makinesi', '13003195 - Kumpas']
    ],
    'pred_category_id': [80814001, 13003195],
    'text_predicted_probs': [
        [0.943, 0.018, 0.008],
        [0.6, 0.4]
    ]
}
df = pd.DataFrame(data)

# 1. Explode both list columns simultaneously to maintain alignment between category and probability
df_exploded = df.explode(['predicted_categories', 'text_predicted_probs'])

# 2. Extract the numeric ID from the category string using vectorized regex
df_exploded['extracted_id'] = df_exploded['predicted_categories'].str.extract(r'^(\d+)').astype(float)

# 3. Filter for rows where the extracted ID matches the target 'pred_category_id'
matched = df_exploded[df_exploded['extracted_id'] == df_exploded['pred_category_id']]

# 4. Dedup the index (safety net in case an ID appears twice within the same list)
matched = matched[~matched.index.duplicated(keep='first')]

# 5. Map the extracted probability column back to the original DataFrame using the index
df['pred_category_prob'] = matched['text_predicted_probs']

df

The second solution was perfectly fine and looked simpler than the code I wrote.

So, I spent about an hour on a task that an LLM could have completed in less than five minutes. However, if I didn’t know how Pandas worked, I would have accepted the first solution, which was not the optimal one. It is a perfect example of how LLMs can increase productivity, but only if you actually know what you’re doing.



Source link