Rossmann Sales Forecasting & Causal Analysis

End-to-end data science on 1,115 German drugstores: model benchmarking, competition causal inference, and promo heterogeneous treatment effects.

Overview

This project applies forecasting and causal inference to the Rossmann Store Sales Kaggle dataset — daily sales records for 1,115 German drugstores from 2013 to 2015. It has two objectives:

  1. Forecasting — benchmark a full model matrix (OLS variants, Random Forest, XGBoost, LightGBM, CatBoost, MLP, Prophet) on RMSPE, the Kaggle competition metric.
  2. Causal inference — move beyond naive correlations to estimate the causal effect of (a) nearby competitor openings using Regression Discontinuity in Time, and (b) promotional campaigns using a Causal Forest with Double ML.
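
For reference, RMSPE can be computed as in the minimal sketch below. It assumes, following the Kaggle scoring rule, that rows with zero actual sales are skipped; the function name `rmspe` is illustrative and not taken from the project code.

```python
import math

def rmspe(actual, predicted):
    """Root mean squared percentage error over rows with non-zero actual sales."""
    # Zero-sales rows are skipped: the percentage error is undefined there.
    pairs = [(a, p) for a, p in zip(actual, predicted) if a != 0]
    if not pairs:
        raise ValueError("no non-zero actuals to score")
    return math.sqrt(sum(((a - p) / a) ** 2 for a, p in pairs) / len(pairs))

# Two scored rows, each off by 10% of the actual; the zero-sales row is ignored.
score = rmspe([100, 200, 0], [90, 220, 50])
print(round(score, 4))
```

Because each error is divided by the actual value, a 10% miss on a small store counts as much as a 10% miss on a large one.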

Results at a Glance

Forecasting — all-stores RMSPE (validation: May–Jul 2015)

| Model | RMSPE | Notes |
|---|---|---|
| Seasonal naive | 0.1767 | Same day 52 weeks prior |
| OLS structural (linear) | 0.2302 | Calendar + promo, raw sales target |
| OLS structural (mult) | 0.2095 | Calendar + promo, log(Sales) target |
| MLP | 0.1835 | 2-layer neural network |
| OLS predictive (linear) | 0.1607 | Lag features replace store FE |
| OLS predictive (mult) | 0.1525 | Log-lags + log(Sales) target |
| Random Forest | 0.1455 | 200 trees, log(Sales) target |
| CatBoost | 0.1375 | Native categorical support |
| LightGBM ★ | 0.1315 | Best: histogram boosting, full feature set |
| Prophet (20-store sample) | ~0.1498 | Per-store time-series model; no lag features |
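
The seasonal naive baseline ("same day 52 weeks prior") amounts to a 364-day lag per store. A possible sketch with pandas follows; it assumes the Kaggle column names `Store`, `Date` and `Sales`, while the helper name `add_seasonal_naive` and the toy rows are made up for illustration.

```python
import pandas as pd

def add_seasonal_naive(df, lag_days=364):
    """Attach each store's sales from `lag_days` earlier as `naive_forecast`."""
    past = df[["Store", "Date", "Sales"]].rename(columns={"Sales": "naive_forecast"})
    # Shift past observations forward so they line up with the date they predict.
    past["Date"] = past["Date"] + pd.Timedelta(days=lag_days)
    return df.merge(past, on=["Store", "Date"], how="left")

# Toy data: two Fridays exactly 52 weeks apart for one store.
sales = pd.DataFrame({
    "Store": [70, 70],
    "Date": pd.to_datetime(["2014-05-02", "2015-05-01"]),
    "Sales": [5000, 5400],
})
out = add_seasonal_naive(sales)
print(out)
```

Using 364 rather than 365 days keeps the weekday aligned, which matters for retail data with strong day-of-week patterns.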

Store 70 — single-store vs global models

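| Model | Single-store | Global (filtered) |
|---|---|---|
| OLS structural (mult) | 0.1270 | 0.1235 |
| OLS predictive (mult) | 0.1252 | 0.1267 |
| Random Forest | 0.1116 | 0.1184 |

Boosted models (one RMSPE reported each): XGBoost 0.1111, CatBoost 0.1027, LightGBM ★ 0.0982.
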
Figure: RMSPE comparison chart across all model families. RMSPE league table for all models and both scopes (lower is better).

Figure: Actual vs LightGBM predicted sales for store 70. Actual daily sales vs LightGBM forecast over the full validation period (May–Jul 2015).

Methodology

Feature engineering

Raw date and store metadata are transformed into a rich feature set before any model sees the data. Key decisions:

Model matrix — structural vs predictive

OLS models are organised along two axes:

| | Structural | Predictive |
|---|---|---|
| Multiplicative (log target) | OLS-SM | OLS-PM |
| Linear (raw target) | OLS-SL | OLS-PL |

Structural models use calendar and promo features; store fixed effects (absorbed via within-store demeaning — Frisch-Waugh theorem) leave coefficients as clean seasonal multipliers. Predictive models replace store fixed effects with lag/rolling features. The same-condition lag already encodes Promo status by construction, so Promo is excluded as a separate predictor.
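
A same-condition lag of this kind could be built roughly as follows. This is a sketch, not the project's implementation: it assumes the Kaggle columns `Store`, `Date`, `Sales` and `Promo`, derives the weekday from `Date`, and uses a hypothetical helper name `add_same_condition_lag`.

```python
import pandas as pd

def add_same_condition_lag(df):
    """Add the previous sales value with the same store, weekday and Promo status."""
    df = df.sort_values(["Store", "Date"]).copy()
    df["weekday"] = df["Date"].dt.dayofweek
    # Within each (store, weekday, promo) group, shift(1) picks the prior occurrence.
    df["sales_lag_same_cond"] = (
        df.groupby(["Store", "weekday", "Promo"])["Sales"].shift(1)
    )
    return df

# Toy data: three consecutive Mondays for one store, promo on / off / on.
demo = pd.DataFrame({
    "Store": [1, 1, 1],
    "Date": pd.to_datetime(["2015-01-05", "2015-01-12", "2015-01-19"]),
    "Promo": [1, 0, 1],
    "Sales": [10, 20, 30],
})
lagged = add_same_condition_lag(demo)
print(lagged[["Date", "Promo", "Sales", "sales_lag_same_cond"]])
```

The third Monday picks up the first Monday's sales (both promo days), while the non-promo Monday has no earlier match and stays missing. Because the lag is already conditioned on Promo, a separate Promo regressor would be largely redundant.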

LightGBM feature importance (SHAP)

Figure: LightGBM SHAP feature importance. Mean |SHAP| values for the global LightGBM model: lag features dominate, followed by seasonal calendar features. Competition and promotion features contribute measurably but are secondary to the store's own recent history.

Figure: Prediction error distributions. % prediction error distribution shown as violin plots for seasonal naive, OLS predictive, and LightGBM. LightGBM has a substantially tighter and more symmetric error distribution.

Causal Analysis I — Competitor Opening Effect

Identification: Regression Discontinuity in Time (RDiT)

CompetitionOpenSinceYear/Month records when the nearest competitor opened. For each of 163 treated stores (competitor opened mid-window, with ≥60 days either side), we estimate a sharp discontinuity model within a ±90 day bandwidth:

$$\log(\mathrm{Sales}_{it}) = \alpha + \beta\,\mathrm{POST}_{it} + \gamma\,t + \delta\,(\mathrm{POST}_{it}\cdot t) + \varepsilon_{it}$$

β captures the immediate level shift at the opening. Store fixed effects are absorbed by within-store demeaning. Standard errors are HC3-robust.
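
The estimation can be illustrated on synthetic data. The sketch below pools stores, demeans within store, and computes HC3 standard errors by hand with NumPy. The function name `rdit_fit`, the simulated panel, and the planted effect of −0.03 are all assumptions for illustration; the leverage is taken from the demeaned design, which ignores the degrees of freedom used by the store means.

```python
import numpy as np

def rdit_fit(log_sales, t, store):
    """Within-store demeaned RDiT; returns (beta, HC3 standard error of beta)."""
    post = (t >= 0).astype(float)
    X = np.column_stack([post, t, post * t])
    y = np.array(log_sales, dtype=float)
    # Within-store demeaning absorbs store fixed effects (Frisch-Waugh).
    for s in np.unique(store):
        m = store == s
        X[m] -= X[m].mean(axis=0)
        y[m] -= y[m].mean()
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y
    resid = y - X @ coef
    # HC3: squared residuals inflated by (1 - leverage)^2.
    leverage = np.einsum("ij,jk,ik->i", X, XtX_inv, X)
    omega = (resid / (1.0 - leverage)) ** 2
    cov = XtX_inv @ (X.T * omega) @ X @ XtX_inv
    return coef[0], np.sqrt(cov[0, 0])

# Synthetic panel: 20 stores, +/-90 days around the opening, true level shift -0.03.
rng = np.random.default_rng(0)
n_stores, days = 20, np.arange(-90, 91)
store = np.repeat(np.arange(n_stores), days.size)
t = np.tile(days, n_stores)
store_level = rng.normal(8.5, 0.3, n_stores)[store]
log_sales = store_level - 0.03 * (t >= 0) + 0.0002 * t + rng.normal(0, 0.02, t.size)

beta, se = rdit_fit(log_sales, t, store)
print(f"beta = {beta:.4f}, HC3 se = {se:.4f}")
```

In practice a library routine (for example an OLS fit with an HC3 covariance option) would replace the hand-rolled covariance; it is spelled out here only to show what the robust standard error does.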

Key finding: a competitor opening reduces sales by approximately 2.9% (β = −0.030, 95% CI [−0.041, −0.019]). The placebo test (a fake opening dated 6 months early) gives β = +0.004, p = 0.57, so there is no evidence of a pre-existing break in trend.

Figure: Event study around competitor opening. Monthly average (residualised) log sales around the competitor opening date. The drop at t = 0 is visible and persistent for the first few months.

Figure: RDiT scatter with fitted trend lines. Weekly-binned log sales vs days to opening. Blue trend = pre-opening, coral trend = post-opening. The level shift at t = 0 (red dashed) is the β estimate.

Heterogeneity by competition distance — continuous moderator

Rather than splitting into terciles, we add distance as a continuous moderator interacted with POST. Since CompetitionDistance is constant within store, $(\mathrm{POST}\times\mathrm{dist})^{dm} = \mathrm{post}^{dm}\times\mathrm{dist}_i$, so the specification remains valid under within-store demeaning:

$$\mathrm{log\_sales}^{dm} = \beta\,\mathrm{post}^{dm} + \gamma\,t^{dm} + \delta\,(t\cdot\mathrm{post})^{dm} + \lambda\,(\mathrm{post}^{dm}\times\mathrm{dist}_i) + \mu\,\mathrm{Promo}^{dm}$$

Figure: Competition effect vs distance (continuous moderator). Causal effect of competitor opening as a function of distance. Shaded band = 95% CI on the continuous moderator line. Dots = binned RDiT estimates (6 quantile groups). Effect is effectively zero beyond ~13 km.

| Distance group | Range | β | % change |
|---|---|---|---|
| Close | < 1,153 m | −0.045 | −4.4% |
| Medium | 1,153 – 4,767 m | −0.044 | −4.3% |
| Far | > 4,767 m | +0.001 | n.s. |

Continuous slope (λ), per km: effect → 0 at ~13 km.
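
The zero-crossing of the continuous moderator follows from the interaction term. Writing the implied level shift for a store at distance $d$ (in km) as $\tau(d)$, a notation introduced here for illustration:

```latex
\tau(d) = \beta + \lambda\, d,
\qquad
\tau(d^{*}) = 0 \;\Longrightarrow\; d^{*} = -\frac{\beta}{\lambda}
```

With β < 0 and λ > 0 the crossing point $d^{*}$ is positive; the reported value is roughly 13 km. This is a linear extrapolation, so it need not coincide with where the binned estimates first become indistinguishable from zero.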

The pattern is a threshold rather than a smooth gradient: stores within roughly 5 km (the Close and Medium groups, up to 4,767 m) lose about 4.3–4.4% of sales regardless of exact distance; beyond that the estimated effect is not statistically significant.
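
The percentage figures follow from the log-target coefficients via 100·(exp(β) − 1). A small sketch of that conversion (the function name `log_points_to_pct` is illustrative):

```python
import math

def log_points_to_pct(beta):
    """Convert a coefficient on log(Sales) into a percent change in sales."""
    return 100.0 * math.expm1(beta)

for label, beta in [("Close", -0.045), ("Medium", -0.044)]:
    print(f"{label}: {log_points_to_pct(beta):+.1f}%")
# Close: -4.4%
# Medium: -4.3%
```

For coefficients this small the percent change is close to 100·β, which is why β = −0.045 and −4.4% look nearly identical.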

Causal Analysis II — Promotion Heterogeneous Treatment Effects

Assignment mechanism

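Before estimating the promo effect we verify the assignment mechanism empirically. Promo turns out to follow a strict centralised calendar: on every day the cross-store standard deviation of Promo is 0.000, so all stores are on or off promotion together.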
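
One way to run such a check is to compute, for each date, the standard deviation of Promo across stores. The sketch below assumes the Kaggle columns `Store`, `Date` and `Promo`; the helper name `promo_sync_std` and the toy frames are hypothetical.

```python
import pandas as pd

def promo_sync_std(df):
    """Per-date standard deviation of Promo across stores (0 means fully synchronised)."""
    return df.groupby("Date")["Promo"].std(ddof=0)

dates = pd.to_datetime(["2015-06-01", "2015-06-01", "2015-06-08", "2015-06-08"])
synced = pd.DataFrame({"Store": [1, 2, 1, 2], "Date": dates, "Promo": [1, 1, 0, 0]})
unsynced = synced.assign(Promo=[1, 0, 0, 0])

print(promo_sync_std(synced).max())    # 0.0: every store agrees on every date
print(promo_sync_std(unsynced).max())  # 0.5: the stores disagree on the first date
```

A maximum of exactly zero across all dates means Promo carries no cross-store variation on any given day, so any identifying variation must come from the time dimension.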

The endogeneity is primarily temporal: promo weeks are correlated with seasonal factors that independently drive sales. DML controls for this by residualising both Y and T on the same rich feature set (including lag features that capture recent trajectory) before estimating the treatment effect.

Causal Forest with Double ML

econml.CausalForestDML with gradient boosting nuisance models:

  1. Regress log(Sales) on controls X → residuals Ỹ
  2. Regress Promo on controls X → residuals P̃
  3. Fit an honest random forest on (Ỹ, P̃) to estimate τ(x) per observation
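
The residual-on-residual logic of steps 1 and 2 can be shown with a deliberately simplified sketch. It swaps in linear nuisance models for gradient boosting and a single constant effect for the honest forest's τ(x), so it is not `CausalForestDML` itself; the helper names and the synthetic data (true effect 0.2) are assumptions for illustration.

```python
import numpy as np

def ols_predict(X_train, y_train, X_test):
    """Fit OLS with an intercept on the training fold and predict on the test fold."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def dml_constant_effect(y, treat, X, n_folds=2, seed=0):
    """Cross-fitted partialling-out estimate of a constant treatment effect."""
    rng = np.random.default_rng(seed)
    folds = rng.permutation(len(y)) % n_folds
    y_res, t_res = np.empty(len(y)), np.empty(len(y))
    for k in range(n_folds):
        test, train = folds == k, folds != k
        # Steps 1 and 2: residualise outcome and treatment on the controls,
        # always predicting on data the nuisance model has not seen.
        y_res[test] = y[test] - ols_predict(X[train], y[train], X[test])
        t_res[test] = treat[test] - ols_predict(X[train], treat[train], X[test])
    # Simplified step 3: regress outcome residuals on treatment residuals.
    return (t_res @ y_res) / (t_res @ t_res)

# Synthetic data: promo is more likely in high season, which also lifts sales.
rng = np.random.default_rng(42)
n = 4000
season = rng.uniform(-1, 1, n)
promo = (rng.uniform(0, 1, n) < 0.5 + 0.3 * season).astype(float)
log_sales = 8.0 + 0.2 * promo + 0.5 * season + rng.normal(0, 0.1, n)

naive = log_sales[promo == 1].mean() - log_sales[promo == 0].mean()
theta = dml_constant_effect(log_sales, promo, season.reshape(-1, 1))
print(f"naive = {naive:.3f}, DML = {theta:.3f} (true effect 0.2)")
```

The naive difference in means absorbs the seasonal confounder, whereas the residualised estimate should land near the planted effect. The causal forest replaces the final one-parameter regression with a forest that lets the effect vary with x.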

The naive promo lift (mean promo days vs non-promo days) is +38.8%. Results from the causal forest will be added here once the model run completes.

Key Findings

  1. Lag features dominate forecasting. sales_lag_same_cond — the previous occurrence with the same weekday and Promo status — is the single strongest predictor, ahead of all calendar and store-metadata features. A store's own recent history is far more informative than any structural feature.
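  2. Log target matters on RMSPE. Multiplicative OLS outperforms linear OLS by roughly 0.8 pp for the predictive models (0.1607 vs 0.1525) and 2.1 pp for the structural models (0.2302 vs 0.2095), because RMSPE penalises relative errors; a linear model's loss is dominated by large-volume stores.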
  3. Structural OLS underperforms globally. Without access to recent sales trajectory, the structural model (RMSPE 0.21) lags well behind predictive OLS (0.15) and tree models. Within a single store the gap narrows considerably.
  4. Gradient boosting is best in class. LightGBM (0.1315) and CatBoost (0.1375) outperform all OLS and Random Forest variants with the same feature set — the ensemble handles non-linearities and interactions that OLS cannot.
  5. Competitor opening reduces sales by ~3%. The RDiT estimate is β = −0.030 (−2.9%), concentrated within ~5 km. The placebo test confirms the result is not driven by pre-existing trends.
  6. Competition effect is a threshold, not a gradient. Stores within 5 km lose 4–4.5% regardless of exact distance; beyond that the effect is statistically indistinguishable from zero. The linear moderator estimates the effect reaches zero at ~13 km.
  7. Promo is a centralised binary calendar. Cross-store synchronisation is perfect (std = 0.000 on every day). The naive +38.8% lift is confounded by seasonal scheduling; the DML specification adjusts for this via residualisation, with the causal forest estimate still pending.

Reproducing Results

# 1. Get the data (Kaggle CLI)
kaggle competitions download -c rossmann-store-sales -p data/raw/
unzip data/raw/rossmann-store-sales.zip -d data/raw/

# 2. Run in Docker
docker compose up -d
docker exec rossmann-sales-forecast-notebook-1 python notebooks/03_feature_engineering.py
docker exec rossmann-sales-forecast-notebook-1 python notebooks/04_models.py
docker exec rossmann-sales-forecast-notebook-1 python notebooks/05_competition_opening.py
docker exec rossmann-sales-forecast-notebook-1 python notebooks/06_model_comparison.py
docker exec rossmann-sales-forecast-notebook-1 python notebooks/07_promo_causal_forest.py

Full code and data instructions at github.com/bakered/rossmann-sales-forecast.