Insurance Price Optimization Walkthrough
Utilizing Generalized Linear Models and Non-Linear Programming to Maximize Profits
Introduction
In the insurance industry, setting the right price is crucial for achieving profitability and competitiveness. A price set too high may drive away customers, while a price set too low can lead to underwriting losses. The optimal price maximizes profitability by carefully considering the relationship between price, customer demand, and expected claim costs.
This article provides a comprehensive, step-by-step guide on how to build an insurance price optimization model. We will demonstrate how to:
Model customer demand and price elasticity using Generalized Linear Models (GLMs), accounting for how different customer segments react to price changes.
Model expected claim costs by separately modeling claim frequency and severity, a standard and robust actuarial technique.
Define distinct customer segments to allow for targeted pricing strategies.
Use Non-Linear Programming (NLP) to find the profit-maximizing price for each segment.
Step 1: Demand Modeling and Price Elasticity
The first step is to understand how customers respond to price. The demand function estimates the probability that a customer will purchase a policy at a given price. We can model this using a logistic regression, but to make it more realistic, we will include an interaction term between price and income. This allows our model to learn that price sensitivity can differ across income levels.
First, we generate synthetic data and fit a Logit model.
import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.optimize import minimize
from scipy.stats import logistic
# Set a consistent style for plots
sns.set_style("whitegrid")
def create_demand_data(n_samples=10000, random_seed=42):
    """
    Creates a sample dataset for demand modeling.
    Crucially, it includes an interaction term between price and income
    to model varying price elasticity.
    """
    np.random.seed(random_seed)
    # Feature Generation
    income = np.random.lognormal(mean=np.log(60000), sigma=0.5, size=n_samples)
    marketing_spend = np.random.uniform(50, 500, n_samples)
    price = np.random.uniform(500, 2500, n_samples)
    # The 'true' underlying model for purchase probability (log-odds)
    # A negative price coefficient means demand drops as price rises.
    # A positive income coefficient means higher income customers are more likely to buy.
    # The interaction term (price * income) is key: the negative effect of price
    # is dampened for customers with higher incomes.
    log_odds = (
        -0.5  # Intercept
        - 4.0 * (price / 1000)  # Base price effect
        + 1.5 * (income / 100000)  # Income effect
        + 1.0 * (marketing_spend / 100)  # Marketing effect
        + 2.0 * (price / 1000) * (income / 100000)  # Interaction Term!
    )
    probability = logistic.cdf(log_odds)
    purchase = np.random.binomial(1, probability)
    df = pd.DataFrame({
        'income': income,
        'marketing_spend': marketing_spend,
        'price': price,
        'purchase': purchase
    })
    return df
def plot_demand_curves(demand_model_results, original_data):
    """
    Plots predicted demand curves for different income segments.
    """
    plt.figure(figsize=(10, 6))
    # Define income segments based on quantiles
    low_income = original_data['income'].quantile(0.25)
    high_income = original_data['income'].quantile(0.75)
    # Generate a range of prices to plot against
    price_range = np.linspace(
        original_data['price'].min(), original_data['price'].max(), 200
    )
    # Assume average marketing spend
    avg_marketing = original_data['marketing_spend'].mean()
    # Get the exact column names used by the demand model for training
    demand_exog_names = demand_model_results.model.exog_names
    for income_level, label in [(low_income, 'Low Income'), (high_income, 'High Income')]:
        # Create a raw DataFrame for prediction features
        plot_df_raw = pd.DataFrame({
            'price': price_range,
            'income': income_level,
            'marketing_spend': avg_marketing,
        })
        plot_df_raw['price_income_interaction'] = plot_df_raw['price'] * plot_df_raw['income']
        # Add the constant column explicitly; has_constant='add' forces it even
        # when other columns (income, marketing_spend) are themselves constant.
        plot_df_with_const = sm.add_constant(plot_df_raw, has_constant='add')
        # Select and reorder columns to match the trained model's exog_names
        plot_df_ordered = plot_df_with_const[demand_exog_names]
        # Predict the probability of purchase
        pred_probs = demand_model_results.predict(plot_df_ordered)
        plt.plot(price_range, pred_probs, label=f'Demand for {label} Segment')
    plt.title('Price Elasticity by Income Segment', fontsize=16)
    plt.xlabel('Price ($)', fontsize=12)
    plt.ylabel('Probability of Purchase (Demand)', fontsize=12)
    plt.legend()
    plt.grid(True)
    plt.show()
def create_cost_data(n_samples=10000, random_seed=42):
    """
    Generates synthetic data for cost modeling.
    Claim frequency and severity are explicitly linked to features.
    """
    np.random.seed(random_seed)
    # Feature Generation
    driver_age = np.random.randint(18, 70, n_samples)
    vehicle_value = np.random.lognormal(mean=np.log(25000), sigma=0.6, size=n_samples)
    # --- Frequency Model (Poisson) ---
    # Younger drivers have higher claim frequency
    # Higher vehicle value slightly increases claim frequency (positive coefficient)
    freq_log_lambda = (
        -1.5
        - 0.03 * (driver_age - 18)  # Negative correlation with age
        + 0.5 * np.log1p(vehicle_value / 10000)  # Slight positive correlation
    )
    freq_lambda = np.exp(freq_log_lambda)
    number_of_claims = np.random.poisson(freq_lambda)
    # --- Severity Model (Gamma) ---
    # Higher vehicle value leads to much higher claim severity
    sev_log_mu = (
        6.0
        + 1.2 * np.log1p(vehicle_value / 10000)  # Strong positive correlation
    )
    sev_mu = np.exp(sev_log_mu)
    # Gamma distribution shape parameter (controls variance)
    gamma_shape = 2.0
    claim_severity = np.random.gamma(shape=gamma_shape, scale=sev_mu / gamma_shape, size=n_samples)
    df = pd.DataFrame({
        'driver_age': driver_age,
        'vehicle_value': vehicle_value,
        'number_of_claims': number_of_claims,
        # Only observe severity if there was at least one claim
        'claim_severity': np.where(number_of_claims > 0, claim_severity, 0)
    })
    df['total_claim_amount'] = df['number_of_claims'] * df['claim_severity']
    return df
def plot_predicted_vs_actual_cost(freq_model_results, sev_model_results, df_cost):
    """
    Plots the predicted vs. actual total claim amount to assess cost model performance.
    """
    # Ensure consistency in feature names for prediction
    cost_features_for_prediction = ['driver_age', 'vehicle_value']
    # Create the raw prediction DataFrame
    X_test_raw = df_cost[cost_features_for_prediction].copy()
    # Add constant and then reorder columns for frequency model prediction
    X_freq_test_with_const = sm.add_constant(X_test_raw, has_constant='add')
    X_freq_test_ordered = X_freq_test_with_const[freq_model_results.model.exog_names]
    # Add constant and then reorder columns for severity model prediction
    X_sev_test_with_const = sm.add_constant(X_test_raw, has_constant='add')
    X_sev_test_ordered = X_sev_test_with_const[sev_model_results.model.exog_names]
    # Get predictions from both models
    predicted_freq = freq_model_results.predict(X_freq_test_ordered)
    predicted_sev = sev_model_results.predict(X_sev_test_ordered)
    # Note: this adds a 'predicted_cost' column to the caller's DataFrame
    df_cost['predicted_cost'] = predicted_freq * predicted_sev
    plt.figure(figsize=(10, 6))
    # Use a log scale due to the skewed nature of insurance costs
    plt.scatter(
        np.log1p(df_cost['total_claim_amount']),
        np.log1p(df_cost['predicted_cost']),
        alpha=0.3,
        label='Individual Policies'
    )
    # Add a reference line
    min_val = min(plt.xlim()[0], plt.ylim()[0])
    max_val = max(plt.xlim()[1], plt.ylim()[1])
    plt.plot([min_val, max_val], [min_val, max_val], 'r--', label='Perfect Prediction')
    plt.title('Cost Model Performance: Predicted vs. Actual Cost (Log Scale)', fontsize=16)
    plt.xlabel('Log(1 + Actual Total Claim Amount)', fontsize=12)
    plt.ylabel('Log(1 + Predicted Total Claim Amount)', fontsize=12)
    plt.legend()
    plt.show()
def optimize_price_for_segment(segment_features, demand_model_results, freq_model_results, sev_model_results):
    """
    Finds the optimal price for a customer segment using non-linear optimization.
    """
    # 1. Predict the cost for this segment
    cost_features_raw = pd.DataFrame([segment_features])
    # Prepare features for frequency model prediction
    freq_cost_features_with_const = sm.add_constant(cost_features_raw[['driver_age', 'vehicle_value']], has_constant='add')
    freq_cost_features_ordered = freq_cost_features_with_const[freq_model_results.model.exog_names]
    predicted_freq = freq_model_results.predict(freq_cost_features_ordered).iloc[0]
    # Prepare features for severity model prediction
    sev_cost_features_with_const = sm.add_constant(cost_features_raw[['driver_age', 'vehicle_value']], has_constant='add')
    sev_cost_features_ordered = sev_cost_features_with_const[sev_model_results.model.exog_names]
    predicted_sev = sev_model_results.predict(sev_cost_features_ordered).iloc[0]
    predicted_cost = predicted_freq * predicted_sev

    # 2. Define the negative profit function to be minimized
    def negative_profit(price):
        price = price[0]  # scipy optimizer passes price as an array
        # Prepare raw features for demand prediction
        demand_features_raw = pd.DataFrame([{
            'price': price,
            'income': segment_features['income'],
            'marketing_spend': segment_features['marketing_spend']
        }])
        demand_features_raw['price_income_interaction'] = demand_features_raw['price'] * demand_features_raw['income']
        # Add constant and reorder columns to match the demand model's exog_names
        demand_features_with_const = sm.add_constant(demand_features_raw, has_constant='add')
        demand_features_ordered = demand_features_with_const[demand_model_results.model.exog_names]
        # Predict demand at the given price
        demand = demand_model_results.predict(demand_features_ordered).iloc[0]
        # Calculate profit and return its negative
        profit = (price - predicted_cost) * demand
        return -profit

    # 3. Run the optimization
    # Initial guess for the price and bounds
    initial_price_guess = [1500]
    # Price must be above cost, with a reasonable upper limit.
    # Add a small epsilon to predicted_cost to ensure it's strictly greater
    bounds = [(predicted_cost + 0.01, 5000)]
    result = minimize(negative_profit, initial_price_guess, method='SLSQP', bounds=bounds)
    optimal_price = result.x[0]
    max_profit = -result.fun
    return optimal_price, max_profit, predicted_cost
def plot_profit_curves(segments, results_df, demand_model_results, freq_model_results, sev_model_results):
    """
    Visualizes the profit curve for each segment, highlighting the optimal price.
    """
    plt.figure(figsize=(12, 7))
    # Get the exact column names used by the demand model
    demand_exog_names = demand_model_results.model.exog_names
    for idx, row in results_df.iterrows():
        segment_name = row['Segment']
        segment_features = segments[segment_name]
        predicted_cost = row['Predicted Cost']
        optimal_price = row['Optimal Price']
        # Generate a range of prices around the optimum
        price_range = np.linspace(predicted_cost + 0.01, optimal_price * 2, 300)  # Ensure price > cost
        profits = []
        for price in price_range:
            # Prepare raw features for demand prediction
            demand_features_raw = pd.DataFrame([{
                'price': price, 'income': segment_features['income'],
                'marketing_spend': segment_features['marketing_spend']
            }])
            demand_features_raw['price_income_interaction'] = demand_features_raw['price'] * demand_features_raw['income']
            # Add constant and reorder columns to match the demand model's exog_names
            demand_features_with_const = sm.add_constant(demand_features_raw, has_constant='add')
            demand_features_ordered = demand_features_with_const[demand_exog_names]
            demand = demand_model_results.predict(demand_features_ordered).iloc[0]
            profit = (price - predicted_cost) * demand
            profits.append(profit)
        # Plot the curve
        plt.plot(price_range, profits, label=f'Profit Curve for: {segment_name}')
        # Mark the maximum
        plt.axvline(x=optimal_price, linestyle='--',
                    color=plt.gca().lines[-1].get_color(),
                    label=f'Optimal Price: ${optimal_price:,.2f}')
    plt.title('Profit Optimization Curves by Customer Segment', fontsize=16)
    plt.xlabel('Price ($)', fontsize=12)
    plt.ylabel('Expected Profit per Policyholder ($)', fontsize=12)
    plt.legend()
    plt.grid(True)
    plt.show()
def main():
    # Step 1: Demand Modeling and Price Elasticity
    df_demand = create_demand_data()
    X_demand_raw = df_demand[['price', 'income', 'marketing_spend']].copy()
    X_demand_raw['price_income_interaction'] = X_demand_raw['price'] * X_demand_raw['income']
    X_demand = sm.add_constant(X_demand_raw, has_constant='add')  # Add constant here for training
    # Store the model object first, then fit it
    demand_model = sm.Logit(df_demand['purchase'], X_demand)
    demand_model_results = demand_model.fit(disp=False)  # disp=False to suppress optimization messages
    print("--- Demand Model Results ---")
    print(demand_model_results.summary())
    plot_demand_curves(demand_model_results, df_demand)

    # Step 2: Cost Modeling (Frequency-Severity)
    df_cost = create_cost_data()
    cost_features = ['driver_age', 'vehicle_value']
    # Fit Frequency Model
    X_freq = sm.add_constant(df_cost[cost_features], has_constant='add')  # Add constant here for training
    freq_model = sm.GLM(df_cost['number_of_claims'], X_freq, family=sm.families.Poisson())
    freq_model_results = freq_model.fit(disp=False)
    print("\n--- Frequency Model (Poisson GLM) Results ---")
    print(freq_model_results.summary())
    # Fit Severity Model (only on policies with claims)
    df_sev = df_cost[df_cost['number_of_claims'] > 0].copy()
    X_sev = sm.add_constant(df_sev[cost_features], has_constant='add')  # Add constant here for training
    # links.Log() is the capitalized link class; the lowercase alias log() is deprecated in recent statsmodels
    sev_model = sm.GLM(df_sev['claim_severity'], X_sev, family=sm.families.Gamma(link=sm.families.links.Log()))
    sev_model_results = sev_model.fit(disp=False)
    print("\n--- Severity Model (Gamma GLM) Results ---")
    print(sev_model_results.summary())
    plot_predicted_vs_actual_cost(freq_model_results, sev_model_results, df_cost)

    # Step 3: Defining Customer Segments
    customer_segments = {
        "Young Driver, Standard Car": {
            "driver_age": 22,
            "vehicle_value": 18000,
            "income": 45000,
            "marketing_spend": 150
        },
        "Experienced Driver, Economy Car": {
            "driver_age": 45,
            "vehicle_value": 12000,
            "income": 70000,
            "marketing_spend": 250
        },
        "Family Driver, High-Value SUV": {
            "driver_age": 40,
            "vehicle_value": 45000,
            "income": 110000,
            "marketing_spend": 300
        }
    }

    # Step 4: Non-Linear Price Optimization
    results = []
    # Loop variable named segment_features so it does not shadow cost_features above
    for name, segment_features in customer_segments.items():
        opt_price, max_prof, pred_cost = optimize_price_for_segment(
            segment_features, demand_model_results, freq_model_results, sev_model_results
        )
        results.append({
            'Segment': name,
            'Predicted Cost': pred_cost,
            'Optimal Price': opt_price,
            'Max Profit per Policyholder': max_prof
        })
    results_df = pd.DataFrame(results)
    print("\n--- Price Optimization Results ---")
    print(results_df.round(2))

    # Step 5: Results and Visualization
    plot_profit_curves(customer_segments, results_df, demand_model_results, freq_model_results, sev_model_results)
if __name__ == "__main__":
    main()
The interaction between price and income is visualized in the plot below. The demand curve for the "High Income" segment is both higher and flatter than the curve for the "Low Income" segment. Higher-income customers are therefore not only more likely to purchase a policy overall, but also less sensitive to price increases; that is, they have lower price elasticity.
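To make the elasticity claim concrete, here is a minimal sketch that computes the point price elasticity of a logistic demand curve. It assumes the coefficients of the synthetic data generator in create_demand_data (the "true" model, not the fitted estimates), and the two income levels and the marketing spend are hypothetical values chosen for illustration, not the quartiles used in the plot.

```python
import math

# Coefficients copied from the synthetic generator (assumed "true" model),
# rescaled to raw dollars so they can be applied to unscaled inputs.
INTERCEPT = -0.5
B_PRICE = -4.0 / 1000                    # per dollar of price
B_INCOME = 1.5 / 100000                  # per dollar of income
B_MARKETING = 1.0 / 100                  # per dollar of marketing spend
B_INTERACTION = 2.0 / (1000 * 100000)    # per (dollar of price * dollar of income)


def purchase_probability(price, income, marketing_spend):
    """Logistic purchase probability under the generating model."""
    log_odds = (
        INTERCEPT
        + B_PRICE * price
        + B_INCOME * income
        + B_MARKETING * marketing_spend
        + B_INTERACTION * price * income
    )
    return 1.0 / (1.0 + math.exp(-log_odds))


def price_elasticity(price, income, marketing_spend):
    """Point elasticity (dP/dprice) * (price / P), which for a logit
    simplifies to slope * price * (1 - P)."""
    p = purchase_probability(price, income, marketing_spend)
    slope = B_PRICE + B_INTERACTION * income  # d(log-odds) / d(price)
    return slope * price * (1.0 - p)


# Hypothetical income levels, evaluated at the same price and marketing spend
low = price_elasticity(price=1500, income=40000, marketing_spend=275)
high = price_elasticity(price=1500, income=90000, marketing_spend=275)
print(f"Elasticity at $1,500: low income {low:.2f}, high income {high:.2f}")
```

Both elasticities are negative, and the high-income value is smaller in magnitude, which is the "flatter curve" effect described above.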
Step 2: Cost Modeling (Frequency-Severity)
Next, we model the expected cost per policy. A robust method is to model two components separately:
Claim Frequency: The number of claims a policyholder makes. Modeled with a Poisson GLM.
Claim Severity: The average cost of each claim. Modeled with a Gamma GLM.
The total expected cost is then Expected Frequency * Expected Severity.
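As a rough illustration of the frequency-severity product, the sketch below evaluates the expected cost directly from the formulas in the synthetic generator create_cost_data. It assumes frequency and severity are independent, and it uses the generating parameters rather than the fitted GLM coefficients, so its output will differ somewhat from the model predictions reported later.

```python
import math


def expected_frequency(driver_age, vehicle_value):
    """Expected number of claims under the generator's Poisson log-linear model."""
    log_lambda = (
        -1.5
        - 0.03 * (driver_age - 18)
        + 0.5 * math.log1p(vehicle_value / 10000)
    )
    return math.exp(log_lambda)


def expected_severity(vehicle_value):
    """Expected cost per claim under the generator's Gamma log-linear model."""
    log_mu = 6.0 + 1.2 * math.log1p(vehicle_value / 10000)
    return math.exp(log_mu)


def expected_cost(driver_age, vehicle_value):
    """Expected total claim cost = expected frequency * expected severity."""
    return expected_frequency(driver_age, vehicle_value) * expected_severity(vehicle_value)


# Example: the "Young Driver, Standard Car" profile used later in the article
young = expected_cost(driver_age=22, vehicle_value=18000)
print(f"Expected cost for a 22-year-old with an $18,000 vehicle: ${young:,.2f}")
```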
Frequency Model 🚗
The Poisson GLM results show the coefficient for driver_age is negative, confirming that claim frequency tends to decrease as a driver gets older.
--- Frequency Model (Poisson GLM) Results ---
Generalized Linear Model Regression Results
==============================================================================
Dep. Variable: number_of_claims No. Observations: 10000
Model: GLM Df Residuals: 9997
Model Family: Poisson Df Model: 2
Link Function: Log Scale: 1.0000
Method: IRLS Log-Likelihood: -5515.5
Date: Sun, 25 May 2025 Deviance: 6943.8
Time: 18:45:31 Pearson chi2: 9.89e+03
No. Iterations: 6 Pseudo R-squ. (CS): 0.04898
Covariance Type: nonrobust
=================================================================================
coef std err z P>|z| [0.025 0.975]
---------------------------------------------------------------------------------
const -0.6484 0.067 -9.714 0.000 -0.779 -0.518
driver_age -0.0288 0.002 -19.173 0.000 -0.032 -0.026
vehicle_value 9.587e-06 8.31e-07 11.539 0.000 7.96e-06 1.12e-05
=================================================================================
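Because the Poisson GLM uses a log link, each coefficient is easier to read after exponentiating it into a rate ratio. The short sketch below does this for the two coefficients in the summary above; the $10,000 step for vehicle value is an arbitrary unit chosen for readability.

```python
import math

# Coefficients copied from the Poisson GLM summary above
coef_driver_age = -0.0288
coef_vehicle_value = 9.587e-06

# exp(coef) is the multiplicative change in expected claim frequency
# for a one-unit increase in the feature, holding the other fixed.
rate_ratio_per_year = math.exp(coef_driver_age)
rate_ratio_per_10k = math.exp(coef_vehicle_value * 10000)

print(f"Per extra year of age: {100 * (rate_ratio_per_year - 1):+.1f}% claim frequency")
print(f"Per extra $10,000 of vehicle value: {100 * (rate_ratio_per_10k - 1):+.1f}% claim frequency")
# Per extra year of age: -2.8% claim frequency
# Per extra $10,000 of vehicle value: +10.1% claim frequency
```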
Severity Model 💰
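The severity model is a Gamma GLM with a log link, fitted only on policies with at least one claim. The performance of our combined cost model (predicted frequency * predicted severity) is visualized in the scatter plot below, which shows predicted vs. actual claim costs on a log scale. Points cluster around the 45-degree line, which suggests the model captures the overall level of cost. The wide scatter is expected for insurance data, because an individual policy's realized claims vary far more than its expected cost, and the dense vertical line at x=0 represents the many policies with no claims.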
Step 3: Defining Customer Segments
With our models built, we define specific customer segments to find a tailored price for each.
# We define three distinct customer segments
customer_segments = {
"Young Driver, Standard Car": {
"driver_age": 22,
"vehicle_value": 18000,
"income": 45000,
"marketing_spend": 150
},
"Experienced Driver, Economy Car": {
"driver_age": 45,
"vehicle_value": 12000,
"income": 70000,
"marketing_spend": 250
},
"Family Driver, High-Value SUV": {
"driver_age": 40,
"vehicle_value": 45000,
"income": 110000,
"marketing_spend": 300
}
}
Step 4: Non-Linear Price Optimization
For each segment, we define a profit function: Profit(Price) = (Price - Predicted_Cost) * Demand(Price). Because the demand component is non-linear, we use a Non-Linear Programming (NLP) solver (scipy.optimize.minimize) to find the price that maximizes this function.
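The optimization step can be sketched in isolation with a closed-form logistic demand curve. This is a simplified illustration, not the article's fitted pipeline: it assumes demand follows the synthetic generating model for the "Young Driver, Standard Car" segment, takes that segment's reported predicted cost of $485.61 as given, and swaps in scipy's bounded scalar minimizer instead of SLSQP. It then checks the first-order condition D(p) + (p - c) * D'(p) = 0 at the solution.

```python
import math
from scipy.optimize import minimize_scalar

# Assumptions for illustration only: generating-model coefficients for a
# customer with income 45,000 and marketing spend 150, and a fixed cost.
COST = 485.61
A = -0.5 + 1.5 * (45000 / 100000) + 1.0 * (150 / 100)  # log-odds at price 0
B = (-4.0 + 2.0 * (45000 / 100000)) / 1000             # log-odds slope per dollar


def demand(price):
    """Purchase probability at a given price (logistic curve)."""
    return 1.0 / (1.0 + math.exp(-(A + B * price)))


def profit(price):
    """Expected profit per prospective policyholder."""
    return (price - COST) * demand(price)


# Maximize profit by minimizing its negative over the allowed price range
result = minimize_scalar(lambda p: -profit(p), bounds=(COST, 5000), method="bounded")
best_price = result.x

# First-order condition, using D'(p) = B * D(p) * (1 - D(p)) for a logit
d = demand(best_price)
foc_residual = d + (best_price - COST) * B * d * (1.0 - d)

print(f"Optimal price: ${best_price:,.2f}, expected profit: ${profit(best_price):,.2f}")
print(f"First-order condition residual: {foc_residual:.2e}")
```

Under these assumptions the optimum should land near, though not exactly at, the fitted-model result for this segment, since the fitted coefficients differ from the generating ones.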
The results are summarized in the table below. The model has successfully identified a unique optimal price and expected profit for each distinct customer segment.
Segment                            Predicted Cost   Optimal Price   Max Profit per Policyholder
Young Driver, Standard Car                $485.61         $912.29                       $106.34
Experienced Driver, Economy Car           $199.57       $1,072.95                       $493.17
Family Driver, High-Value SUV             $708.94       $2,048.81                       $797.50
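Since expected profit is margin times demand, the reported figures also reveal the purchase probability the model assumes at each optimal price. The following sketch backs that out from the three results above; the dictionary names are illustrative.

```python
# Figures copied from the results above: (predicted cost, optimal price, max profit)
reported = {
    "Young Driver, Standard Car": (485.61, 912.29, 106.34),
    "Experienced Driver, Economy Car": (199.57, 1072.95, 493.17),
    "Family Driver, High-Value SUV": (708.94, 2048.81, 797.50),
}

implied_demand = {}
for segment, (cost, price, max_profit) in reported.items():
    margin = price - cost                          # profit per policy actually sold
    implied_demand[segment] = max_profit / margin  # profit = margin * demand
    print(f"{segment}: margin ${margin:,.2f}, "
          f"implied purchase probability {implied_demand[segment]:.1%}")
```

The young-driver segment ends up with both the thinnest margin and the lowest implied purchase probability, which explains why its expected profit per policyholder is so much lower than the other two.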
Step 5: Results and Visualization
The final plot illustrates our solution: each curve represents the expected profit for a segment across a range of prices, and the dashed line marks the peak that the optimizer found. The profit-maximizing price for the "Family Driver" segment is more than double that of the "Young Driver" segment ($2,048.81 vs. $912.29), which demonstrates why a one-price-fits-all strategy is suboptimal.
Sources:
https://scikit-learn.org/stable/auto_examples/linear_model/plot_tweedie_regression_insurance_claims.html
https://web.actuaries.ie/sites/default/files/erm-resources/Emphasis2010_1_art4_Price_Opt.pdf




