Beyond the Straight Line: Choosing Between OLS, Interaction Terms, and Tweedie Regression

Beyond the Straight Line: Choosing Between OLS, Interaction Terms, and Tweedie Regression


a machine learning model to predict how much money customers will spend on an e-commerce platform over the next year. The job is data science business as usual: load the data, clean it, understand it, and model the outcome.

The big question that comes to our minds: Which algorithm do you pull out of your toolkit?

For regression models, I assume that we tend to default to the classics like OLS, or at other times, just jump straight to a complex ensemble method like XGBoost. But by experience, I have been seeing that often a generalized linear framework is exactly what you need for inference, speed, and interpretability.

The real challenge lies in choosing the right flavor.

  • Should we stick to a standard Ordinary Least Squares (OLS) regression?
  • Do we need to introduce interaction terms?
  • Or is our data weird enough to warrant a Tweedie regression?

Choosing the wrong one can lead to models that output impossible predictions, like a customer spending a negative amount of money.

Let’s break down these three approaches so you can confidently pick the right tool for your specific data landscape.

Dataset

The dataset used for this exercise will be the French Motor Third-Party Liability Claims dataset, from R package CASDatasets, under license GPL >2.

Christophe Dutang and Arthur Charpentier (2026). CASdatasets: Insurance datasets, R package version 1.2-1, DOI 10.57745/P0KHAG.

# Basics
import pandas as pd
import numpy as np

# Dataviz
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style('darkgrid')
import plotly.express as px

# Stats
import statsmodels.api as sm
import statsmodels.formula.api as smf
import scipy.stats as stats

# Preprocessing & Modeling
from sklearn.linear_model import TweedieRegressor
from feature_engine.encoding import OneHotEncoder
import warnings
from sklearn.metrics import mean_absolute_error

import warnings
warnings.filterwarnings('ignore')

To get the dataset, there’s a code snippet you can use in this sklearn tutorial [4].

# Load Data
df = load_mtpl2(n_samples=250_000)
View of the dataset. Image by the author.

After exploring the data a little, there are a couple of interesting transformations we can make, such as clipping the data at $500,000 maximum and keep the number of claims under 4, as numbers over that are way too crazy (outliers) and will definetely distort our predictions.

# Clipped
df = df.query('ClaimAmount < 500000')
df = df.query('ClaimNb < 4')

1. Regular OLS: The Reliable Highway

Let’s start with a baseline model. The simplest Ordinary Least Squares (OLS) regression is the bedrock of statistical modeling. If data science had a “default” settings button, it would be OLS.

But the “downside” is that it requires assumptions that the data will behave in a certain way.

For example, it would assume:

  • The target changes at a constant, predictable rate as your independent variables change.
  • If you increase the number of claims by 1, your Claim Amount goes up by a fixed amount, regardless of whether this is your first ever claim or not.

Mathematically, OLS looks for the line that minimizes the sum of the squared differences between your actual data points and that line.

When does it work?

OLS is incredibly powerful when your data meets a few core assumptions:

  • Linearity: The relationship between your features and your target is reasonably straight.
  • Homoscedasticity: The variance of your residuals (errors) is constant. In plain English, your model isn’t wildly more inaccurate for large values than it is for small values.
  • Normal Distribution: The errors are normally distributed.

If you are predicting something stable and continuous (e.g., the height of a tree based on its age, or the temperature based on elevation), regular OLS is often your best bet. It’s transparent, fast, and easily understood.

For this problem, though, we’ll see it’s not the best bet, given our data is inflated with zeroes claim amounts.

# Define the OLS model formula
formula = "ClaimAmount ~ Exposure + VehPower + VehAge + DrivAge + BonusMalus + Density + C(Area) + C(VehBrand) + C(VehGas) + C(Region)"

# Fit the OLS model
ols_model = smf.ols(formula=formula, data=df).fit()

# Display the model summary
print(ols_model.summary())

After running our baseline model, let’s have a look at the results.

OLS Results. Image by the author.

If we look at the model’s residuals, the OLS model’s residuals, as shown by the histogram and Q-Q plot, are clearly not normally distributed, which is a common issue when dealing with datasets that have many zero values like insurance claim amounts.

Histogram and QQ-Plot of the residuals: Model struggles with higher values. Image by the author.

2. OLS with Interaction Terms: The “It Depends” Factor

But what happens when the real world gets complicated? Let’s say you are predicting house prices based on two features: the number of bedrooms and whether the house has a swimming pool.

A regular OLS model looks at these independently. It says, “An extra bedroom adds $50,000, and a pool adds $30,000.” But is that always true? Does a pool add the same value to a 1-bedroom condo as it does to a 5-bedroom luxury estate?

Probably not. The value of the pool depends on the size of the house. This is where interaction terms come into play.

An interaction term tells your model that the effect of one variable changes depending on the level of another variable. In your data, you represent this by literally multiplying the two features together:

equation

Another way of looking at it is by using this analogy with cooking. Interaction terms are like baking.

  • Flour on its own is dry.
  • Water on its own is wet.
  • But when you mix them together, you don’t just get “dry-wet”; you get dough, a completely new property.

When should you use it?

You should upgrade your OLS model with interaction terms when you suspect synergy or antagonism between features.

  • Synergy: Two features together are more powerful than the sum of their parts (e.g., Marketing Spend $\times$ Holiday Season).
  • Antagonism: One feature dampens the effect of another (e.g., Dosage of Drug A $\times$ Dosage of Drug B).

If your regular OLS model is underperforming and you have strong domain knowledge suggesting that your variables talk to each other, don’t just throw more data at it. Try adding an interaction term.

Well, in this case, let’s see how it performs.

# Define the OLS model formula with another interaction term and without 'Region'
formula_no_region = "ClaimAmount ~ Exposure * VehBrand + Exposure * BonusMalus + DrivAge + BonusMalus + Density + C(Area) + VehPower + C(VehGas)"

# Fit the OLS model with two interaction terms and without 'Region'
ols_no_region_model = smf.ols(formula=formula_no_region, data=df).fit()

# Display the model summary
print(ols_no_region_model.summary())
OLS with interaction terms results. Image by the author.

Despite introducing interaction terms, the OLS model’s performance on the ClaimAmount data is unchanged. This is due to high concentration of zero claims and a heavily skewed distribution for non-zero claims.

Ordinary Least Squares models assume normally distributed errors, an assumption that is clearly violated in this zero-inflated dataset. Consequently, adding interaction terms can potentially capture more complex relationships, but cannot model this data’s true distribution, leading to only marginal improvements.

Histogram and QQ-Plot of the residuals: Model struggles with higher values. Image by the author.

3. Tweedie Regression: Embracing the Messy Zeroes

What if your data isn’t a neat, continuous bell curve? What if your data looks like a giant wall of zeroes followed by a long, trailing tail of positive numbers?

As we can see in our example, in any given year, the vast majority of insurance policyholders file exactly $0 in claims. They don’t get into accidents. But for the small percentage who do, their claims aren’t just $5 or $10. They can be thousands or tens of thousands of dollars.

If you try to fit a regular OLS line to this data, the model breaks down. Why? Because OLS assumes a normal distribution. To accommodate all those zeroes and a few massive outliers, OLS might start predicting that some low-risk customers will have negative claims.

This is the reason we will try the Tweedie distribution.

The Tweedie distribution is a special family of probability distributions that can handle a mixture of a discrete mass at zero and a continuous, right-skewed distribution for positive values. It acts as a hybrid between a Poisson distribution (which counts events) and a Gamma distribution (which measures continuous positive amounts).

Think of Tweedie regression like measuring rainfall in a desert. On most days, the rainfall is exactly zero. But when it does rain, it pours, and you get a continuous, highly variable amount of water. You cannot model desert rainfall using a smooth, symmetrical bell curve.

When should you use it?

Tweedie regression is the gold standard for data characterized by:

  • Strictly non-negative values (you cannot have less than $0 in sales or claims).
  • A massive spike at exactly zero (zero-inflation).
  • A highly skewed distribution for the non-zero values.

If you are working in insurance tech, credit risk, or modeling customer lifetime value where many users never convert, Tweedie is your potential friend.

# OHE categorical variables
ohe = OneHotEncoder(drop_last=True)
df_ohe = ohe.fit_transform(df.drop(columns=['OLS_pred', 'OLS_Int_Term_pred'], axis=1))

# X and Y
X = df_ohe.drop(columns=['ClaimAmount'], axis=1)
y = df_ohe['ClaimAmount']

# More granular search for tweedie_powers around 1.75-1.8
tweedie_powers = np.arange(1.7, 1.8, 0.01) # Example: From 1.7 to 1.8 with step 0.01
alpha_values = [0.1, 0.5, 1.0] # Test different regularization strengths

results = []

print("Searching for best Tweedie power and alpha...")
for power in tweedie_powers:
    for alpha in alpha_values:
        sklearn_tweedie_model = TweedieRegressor(
            power=power, 
            link='auto', 
            solver='newton-cholesky', 
            max_iter=500, 
            alpha=alpha # Add regularization
        )
        try:
            # Fit the model, using Exposure as an offset
            sklearn_tweedie_model.fit(X, y, sample_weight=df['Exposure'])
            mae = mean_absolute_error(y, sklearn_tweedie_model.predict(X))
            results.append((power, alpha, mae))
        except Exception as e:
            # Catch potential convergence issues for some parameter combinations
            results.append((power, alpha, float('inf'))) # Assign high MAE for failed fits
            

# Sort results to find the best model
best_result = min(results, key=lambda x: x[2])
best_power, best_alpha, best_mae = best_result

print(f"Best Tweedie Power: {best_power:.2f}, Best Alpha: {best_alpha:.1f}, Best MAE: {best_mae:.2f}")

# Now, fit the best model
sklearn_tweedie_model = TweedieRegressor(
    power=best_power, 
    link='auto', 
    solver='newton-cholesky', 
    max_iter=500, 
    alpha=best_alpha
)
sklearn_tweedie_model.fit(X, y, sample_weight=df['Exposure'])

print("Sklearn Tweedie Regressor fitted successfully with optimized parameters!")
print(results)
Searching for best Tweedie power and alpha...
Best Tweedie Power: 1.76, Best Alpha: 1.0, Best MAE: 110.31
Sklearn Tweedie Regressor fitted successfully with optimized parameters!

Let’s predict with this model, and next check how the QQ plot looks like.

# Make predictions using the optimized model
df['Tweedie_pred'] = sklearn_tweedie_model.predict(X)

This is a comparison of the actuals vs predictions of this last model.

Tweedie Model predictions vs actuals. Image by the author.

Notice how the errors are still very present, especially for ClaimNb == 1. While the Tweedie model’s residuals still show deviations from normality, this model follows another distribution, which is the Tweedie. So, the real improvement can be seen in the MAE metric, which brings a value about 35% lower.

Comparison of the Models

cols=['ClaimNb','ClaimAmount', 'OLS_pred', 'OLS_Int_Term_pred', 'Tweedie_pred']
df[cols].sample(10).round()
Comparing the models. Image by the author.

We can see that the zero claims are closer to zero in the Tweedie Regression, much better results than the OLS estimates, since those will bring values close to the average for all the ClaimNb values. Tweedie regressor can understand the zero inflation, however it won’t predict zero, but it gives a smaller value.

from sklearn.metrics import mean_absolute_error

# Calculate MAE for OLS
mae_ols = mean_absolute_error(df['ClaimAmount'], df['OLS_pred'])
print(f"MAE for OLS Model: {mae_ols:.2f}")

# Calculate MAE for OLS Interaction Terms
mae_ols = mean_absolute_error(df['ClaimAmount'], df['OLS_Int_Term_pred'])
print(f"MAE for OLS Model w/ Interaction Terms: {mae_ols:.2f}")

# Calculate MAE for Statsmodels Tweedie
mae_sm_tweedie = mean_absolute_error(df['ClaimAmount'], df['Tweedie_pred'])
print(f"MAE for Tweedie Model: {mae_sm_tweedie:.2f}")

# Calculate MAE for Zero Inflated
mae_zero_inflated = mean_absolute_error(df['ClaimAmount'], df['ZeroInflated_pred'])
print(f"MAE for Zero-Inflated Model: {mae_zero_inflated:.2f}")
MAE for OLS Model: 174.17
MAE for OLS Model w/ Interaction Terms: 172.24
MAE for Tweedie Model: 111.97

The Decision Matrix: Choosing Your Path

When you are staring at your dataset, trying to decide which path to take, use this table to guide your decision:

ModelTarget Variable TypeKey Assumption / StrengthBest Used For…
Regular OLSContinuous, symmetric, bell-shapedRelationships are linear and independent.Stable physical measurements, basic forecasting.
OLS with InteractionsContinuous, symmetricThe effect of feature A changes based on feature B.Marketing attribution, real estate pricing, economics.
Tweedie RegressionHighly skewed, non-negative, mostly zeroesHandles a massive spike at zero followed by a long tail.Insurance claims, customer lifetime value, rare event costs.

Bonus Model

But I am still not happy with this model. Certainly, the results are not that good in practice. We see that the Tweedie model can, in fact, drastically reduce the error, but it still does that by controling very high and very low numbers (0) and driving everything closer to the average.

The only solution left is to create a two-step model where we will model a Zero Inflated distribution:

  1. Run a classifier that will tell us if there was a claim amount or not. It returns the probability of having a claim amount.
  2. With the result of the first model, we can use the Tweedie distribution to model only those observations with a claim amount. Otherwise, the model will just return zero.

Let’s code that.

We begin by preparing our dataset with One Hot Encoding to get rid of the categories.

import lightgbm as lgb
from sklearn.metrics import classification_report

# Remove previous predictions from df
cols = ['ClaimNb', 'Exposure', 'Area', 'VehPower', 'VehAge', 'DrivAge',
       'BonusMalus', 'VehBrand', 'VehGas', 'Density', 'Region', 'ClaimAmount']
df2 = df[cols].copy()

# 1. Create the binary outcome variable for having a claim
df2['has_claim'] = (df2['ClaimAmount'] > 0).astype(int)

# Prepare the data for "zero claim" classification
df_ohe = ohe.fit_transform(df2)
X = df_ohe.drop(columns=['ClaimAmount', 'has_claim'])
y = df_ohe['has_claim']

Next, there’s a bunch of code.

  • We will split in train and validation sets for LightGBM.
  • Calculates weights to handle class imbalance.
  • Sets tuned LightGBM model parameters.
  • Trains LightGBM model with early stopping.
from sklearn.model_selection import train_test_split

print("Tuning LightGBM for P(Claim > 0) with updated parameters...")

# Split the training data into training and validation sets for early stopping
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

train_data_split = lgb.Dataset(X_train, label=y_train)
val_data_split = lgb.Dataset(X_val, label=y_val)

# Calculate scale_pos_weight for better handling of class imbalance
scale_pos_weight_value = (y_train == 0).sum() / (y_train == 1).sum()

# Define updated parameters for better control and to mitigate the warning
params_tuned = {
    'objective': 'binary',
    'metric': 'auc',
    'learning_rate': 0.05,  # Reduced learning rate
    'num_leaves': 31,       # Increased num_leaves (default is 31), allowing for more complexity
    'max_depth': 7,         # Increased tree depth, allowing for more complexity
    'min_child_samples': 15, # Slightly decreased minimum data in leaf
    'min_gain_to_split': 0.01, # Decreased minimum gain for a split
    'force_col_wise': True, # Keep for dataset compatibility
    'scale_pos_weight': scale_pos_weight_value, # Explicitly handle imbalance
    'verbose': -1           # Suppress verbose output during training
}

# Re-train LightGBM model with tuned parameters
# Increase num_boost_round given the reduced learning rate and increased complexity
model_tuned = lgb.train(
    params_tuned,
    train_data_split,
    num_boost_round=300, # Increased number of boosting rounds
    valid_sets=[val_data_split], # Provide validation set for early stopping
    callbacks=[lgb.early_stopping(20, verbose=False)] # Add early stopping and suppress verbose, increased patience
)
  • Predicts claim occurrence probabilities. Transforms it to a binary 0 or 1 for classification report calculation.
  • Evaluates model using classification report.
  • Stores predicted probabilities in dataframe.
  • Sets probabilities below 0.5 to zero.
# Predictions and evaluation with the tuned model
y_pred = (model_tuned.predict(X) > 0.5).astype(int)
print("\nClassification Report for Tuned LightGBM Model:")
print(classification_report(y, y_pred))

# Store the probabilities for the next step of the Zero-Inflated model
df2['prob_has_claim'] = model_tuned.predict(X)

# If prob_has_claim is under 0.5, make it 0
df2.loc[df2['prob_has_claim'] < 0.5, 'prob_has_claim'] = 0

Now, part 2 of the model is filtering only those claims with dollar amount associated and use the Tweedie Regression to estimate it.

# Part 2: Model the severity (ClaimAmount > 0) using Tweedie Regression
# Filter data for positive claims only
df_positive = df_ohe[df_ohe['has_claim'] == 1].copy()
X_positive = df_ohe.drop(columns=['ClaimAmount', 'has_claim'], axis=1)
y_positive = df_ohe['ClaimAmount']

# Use Exposure as offset for the severity model
print("\nFitting Tweedie Model E[ClaimAmount | Claim > 0]...")
sklearn_tweedie_model = TweedieRegressor(
    power=1.85,
    link='auto',
    solver='newton-cholesky',
    max_iter=500,
    alpha=1
)
sklearn_tweedie_model.fit(X_positive, y_positive, sample_weight=df['Exposure'])

Finally, the prediction of this model will be the probability estimated by the first model multiplied by the Tweedie Regressor estimated amount.

# Predict conditional severity for all observations (even those with 0 claims)
# This assumes the severity model learned from positive claims can generalize.
df['pred_severity_if_claim'] = sklearn_tweedie_model.predict(df_ohe.drop(columns=['ClaimAmount', 'has_claim'], axis=1))

# Calculate the final Zero-Inflated Model prediction
df['ZeroInflated_pred'] = df2['prob_has_claim'] * df['pred_severity_if_claim']

# Display head of predictions and MAE
print("\nZero-Inflated Model Predictions (first 5 rows):")
print(df[['ClaimAmount', 'ZeroInflated_pred']].sample(5))

mae_zero_inflated = mean_absolute_error(df['ClaimAmount'], df['ZeroInflated_pred'])
print(f"\nMAE for Zero-Inflated Model: {mae_zero_inflated:.2f}")

This is the result.

Fitting Tweedie Model E[ClaimAmount | Claim > 0]...

Zero-Inflated Model Predictions (first 5 rows):
         ClaimAmount  ZeroInflated_pred
IDpol                                  
1179635          0.0           0.000000
108738           0.0           0.000000
2066923          0.0           0.000000
1007186          0.0           0.000000
1117109       1128.0         729.428288

MAE for Zero-Inflated Model: 87.79

Great, we got another 21% of improvement in MAE. This is amazing!

Zero Inflated model results Actual x Predictions. Image by the author.

Conclusion

Data science is rarely about finding a single “perfect” algorithm that solves every problem. Instead, it is about matching the mathematical assumptions of your model to the real-world behavior of your data.

Start simple. Look at the distribution of your target variable. If it looks like a bell curve, start with Regular OLS. If you know your features influence one another, introduce Interaction Terms. But if you are staring at a mountain of zeroes and a long tail of large numbers, skip the headaches of OLS entirely and leverage the power of a Tweedie Regression.

Taking a few minutes to analyze your target’s distribution before modeling can save you days of tuning the wrong architecture later.

GitHub Repository

https://github.com/gurezende/Zero-Inflated-Tweedie-Regression

References

[1. Dataset] (https://www.openml.org/search?type=data&sort=runs&id=43593&status=active)

[2. Data License and citation] (https://dutangc.github.io/CASdatasets/)

Christophe Dutang and Arthur Charpentier (2026). CASdatasets: Insurance datasets, R package version 1.2-1, DOI 10.57745/P0KHAG.

[3. Tweedie Regression Documentation] (https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.TweedieRegressor.html)

[4. Tweedie Regression Tutorial] (https://scikit-learn.org/stable/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html)



Source link