World Cup, 104 matches, and roughly as many confident predictions as there are fans. Building a model that announces “Team X wins, probability p” is easy — an afternoon’s work with public data and a Poisson distribution. The trap is believing the number. A single model hands you a single answer and no sense of how much it hinges on the dozens of choices buried inside it: which rating system, which goal distribution, which learning algorithm. Change any one of them and the “answer” can move by double digits.
So instead of trusting one model, I built eleven — one for (almost) every chapter of a machine-learning textbook — trained or computed them all on the same real match data, ran each through the same tournament simulator, and let them argue. Three rating systems (Elo, Colley, PageRank), two goal models (Poisson, Negative Binomial), five classifiers (logistic regression, KNN, random forest, XGBoost, a neural network), and the betting market as a benchmark. Same 48 teams, same data, eleven methods.
They crown four different champions — and that disagreement, not the consensus, turns out to be the most useful thing a suite of models can give you. This article is about how to build it and how to read it. (If you just want a single clean forecast, the Elo-plus-Poisson version is its own short article; here we’re after something more honest than one number.)
The data
Everything is fit on 358 real international matches: every game from the 2010–2022 World Cups (256 matches) plus the 2020 and 2024 European Championships (102), pulled from the openfootball project — specifically its worldcup.json and euro.json datasets, which are dedicated to the public domain. The classifiers learn a mapping from match features to results on these games; the rating systems are computed directly from the results graph. The field is the real, confirmed 2026 draw — 48 teams, 12 groups.
Three features describe each match from the “home” (first-named, neutral-venue) team’s perspective: the strength gap between the teams, their combined strength, and a knockout flag. The target is the three-way result (win / draw / loss).
One interface, eleven engines
The only way to race different model families fairly is to force them through the same contract: given two teams, return P(win), P(draw), P(loss) plus an expected goal difference for group-stage tiebreakers. Everything downstream is identical across models: the 12 groups, the best-third qualification, and the 32-team knockout. The simulator is even vectorized so that all 20,000 tournaments per model run as NumPy array operations rather than Python loops.
def match_probs(model, a, b):
    """Every model implements this -> (p_win, p_draw, p_loss)."""
    ...

def simulate(model, n_sims=20_000):
    # groups -> top 2 + 8 best thirds -> 32-team knockout -> champion
    ...
What differs is how each model fills in match_probs. That is where the disagreement is born, so let’s go through the families.
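The vectorization mentioned above can be sketched for a single fixture. This is a minimal illustration, not the article's simulator: `sample_outcomes` is a hypothetical helper, and the 0.61 / 0.26 probabilities are placeholder inputs.

```python
import numpy as np

def sample_outcomes(p_win, p_draw, n_sims, rng):
    """Simulate one fixture n_sims times: 0 = win, 1 = draw, 2 = loss."""
    u = rng.random(n_sims)
    # Each threshold that u crosses bumps the outcome code by one.
    return (u >= p_win).astype(int) + (u >= p_win + p_draw).astype(int)

rng = np.random.default_rng(0)
results = sample_outcomes(0.61, 0.26, 20_000, rng)
# Group-stage points for the first-named team in every simulated tournament.
points = np.where(results == 0, 3, np.where(results == 1, 1, 0))
```

One array of uniform draws covers all 20,000 tournaments at once, so no Python loop over simulations is needed.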
The rating models: strength from results
Elo is the one most people have heard of — the chess rating, adapted for football: a self-correcting number updated after each match by R' = R + K(S − E), where S is the actual result and the win expectancy is E = 1/(1 + 10^(−Δ/400)) for a rating gap Δ. To get a match probability we run the Elo gap through that logistic curve and split off a draw probability fit separately (more on that below).
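The two Elo formulas above translate almost line for line into code. In this sketch the K-factor of 20 is an assumed placeholder (the article does not state its K), and the function names are illustrative.

```python
def elo_expectancy(delta):
    """Win expectancy E for a rating gap delta = R_a - R_b."""
    return 1.0 / (1.0 + 10.0 ** (-delta / 400.0))

def elo_update(r_a, r_b, score_a, k=20.0):
    """Apply R' = R + K(S - E) to both teams.

    score_a is 1 for a win, 0.5 for a draw, 0 for a loss (from team A's side).
    k=20 is an assumed value, not the article's.
    """
    change = k * (score_a - elo_expectancy(r_a - r_b))
    # The opponent's update is the mirror image, so rating points are conserved.
    return r_a + change, r_b - change
```

For two equally rated teams the expectancy is 0.5, so a win moves the winner up by K/2 and the loser down by the same amount.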
Colley ratings drop the temporal updating entirely and solve a single linear system. Build the Colley matrix C and vector b over all matches:
C_ii = 2 + (games played by i)
C_ij = -(games between i and j)
b_i = 1 + (wins_i - losses_i) / 2
then solve C r = b for the rating vector r. The +2 on the diagonal is a Laplace-style prior that makes the system strictly diagonally dominant and therefore always solvable: every team is implicitly seeded at 0.5 before any games. Colley is elegant precisely because it has no free parameters and no notion of “current form”: it’s a pure, closed-form summary of who beat whom.
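As a sketch of the Colley system described above, here it is with NumPy. The `(i, j, result)` match encoding is an assumption made for illustration, as is counting a draw as a game played with no net win or loss.

```python
import numpy as np

def colley_ratings(n_teams, matches):
    """Solve C r = b.

    matches: iterable of (i, j, result), where result is
    1 if team i won, 0 for a draw, -1 if team j won (assumed encoding).
    """
    C = 2.0 * np.eye(n_teams)   # the +2 prior on the diagonal
    b = np.ones(n_teams)        # the leading 1 in b_i
    for i, j, result in matches:
        C[i, i] += 1.0
        C[j, j] += 1.0
        C[i, j] -= 1.0
        C[j, i] -= 1.0
        b[i] += result / 2.0    # adds (wins - losses) / 2 for team i
        b[j] -= result / 2.0
    return np.linalg.solve(C, b)

# Toy round-robin: team 0 beats 1 and 2, team 1 beats 2.
ratings = colley_ratings(3, [(0, 1, 1), (1, 2, 1), (0, 2, 1)])
```

In this toy case the ratings come out centered on 0.5, which is the prior every team starts from.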
PageRank treats the season as a directed graph. Every match adds weight to an edge from the loser to the winner (draws split the weight both ways), so pointing at a team is an endorsement. Normalize each node’s out-edges into a stochastic transition matrix T, then find the stationary distribution under a damped random walk:
r = (1 - d)/n + d · Tᵀ r    # d = 0.85, solved by power iteration
A team scores highly if strong teams “point to” it, i.e., lost to it. It’s the same algorithm Google used to rank web pages, applied to football results.
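The damped random walk above can be sketched as a power iteration. This is an illustrative version under assumptions: `W[i, j]` holds the loser-to-winner edge weight, and a team with no out-edges (one that never lost or drew) spreads its weight uniformly, which is a common convention that the article does not specify.

```python
import numpy as np

def pagerank(W, d=0.85, tol=1e-10, max_iter=1000):
    """W[i, j] = weight of the edge i -> j, i.e. team i lost to team j."""
    n = W.shape[0]
    out = W.sum(axis=1, keepdims=True)
    # Row-normalize into a stochastic matrix; rows with no out-edges
    # become uniform (assumed handling of dangling nodes).
    T = np.where(out > 0, W / np.where(out > 0, out, 1.0), 1.0 / n)
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        r_new = (1 - d) / n + d * T.T @ r
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r

# Toy graph: team 0 beat teams 1 and 2, team 1 beat team 2.
W = np.zeros((3, 3))
W[1, 0] = W[2, 0] = W[2, 1] = 1.0
ranks = pagerank(W)
```

Because T is row-stochastic, each iteration preserves the total, so the ranks remain a probability distribution over teams.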
Colley and PageRank live on their own scales, so I z-score each and map onto an Elo-like scale before running them through the same win-expectancy curve. Teams absent from the 358-match graph fall back to the prior. What makes these two interesting is that they ignore reputation entirely and rate only what’s in the data, and in this data window the Netherlands graded out far higher than the market believes.
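The rescaling step might look like the following sketch. The center of 1500 and spread of 150 rating points are assumed placeholders, since the article does not give the constants it used.

```python
import numpy as np

def to_elo_scale(raw, center=1500.0, spread=150.0):
    """Z-score raw ratings, then place them on an Elo-like scale.

    center and spread are assumed values chosen for illustration.
    """
    z = (raw - raw.mean()) / raw.std()
    return center + spread * z

colley_like = np.array([0.7, 0.5, 0.3, 0.55])   # toy ratings
elo_like = to_elo_scale(colley_like)
```

Once on a common scale, the gap between two teams can go through the same 400-point logistic curve that Elo uses.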
The goal models: Poisson and Negative Binomial
These model scorelines, not outcomes. I fit a Poisson GLM with a log link on the real match goals, stacking each match as two observations (each team’s goals vs. its signed strength gap):
import statsmodels.api as sm
# goals ~ exp(beta0 + beta1 * strength_diff)
fit = sm.GLM(goals, sm.add_constant(sdiff), family=sm.families.Poisson()).fit()
# -> lambda = exp(0.167 + 0.00164 * strength_diff)
From λ_home, λ_away we recover P(W/D/L) by forming the outer product of the two teams’ Poisson goal distributions and summing the cells where the home team scores more, the same, or fewer goals than the away team.
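The outer-product step can be sketched as follows. The coefficients are the fitted values quoted above. The truncation at 10 goals, the independence of the two teams' goal counts, and the function names are assumptions of this illustration.

```python
import math
import numpy as np

def goal_rate(strength_diff, b0=0.167, b1=0.00164):
    """Expected goals from the fitted log-link model quoted in the text."""
    return math.exp(b0 + b1 * strength_diff)

def poisson_pmf(lam, max_goals=10):
    k = np.arange(max_goals + 1)
    fact = np.array([math.factorial(int(i)) for i in k], dtype=float)
    return np.exp(-lam) * lam ** k / fact

def wdl_from_lambdas(lam_home, lam_away, max_goals=10):
    # grid[i, j] = P(home scores i) * P(away scores j), assuming independence.
    grid = np.outer(poisson_pmf(lam_home, max_goals),
                    poisson_pmf(lam_away, max_goals))
    grid /= grid.sum()                  # renormalize the truncated tail
    p_win = np.tril(grid, -1).sum()     # cells with home goals > away goals
    p_draw = np.trace(grid)             # cells with equal goals
    p_loss = np.triu(grid, 1).sum()     # cells with home goals < away goals
    return p_win, p_draw, p_loss

gap = 200.0
probs = wdl_from_lambdas(goal_rate(gap), goal_rate(-gap))
```

Each team's rate uses its own signed gap, which matches the stacking of every match as two observations.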
The Negative Binomial variant relaxes Poisson’s most restrictive assumption, that the mean equals the variance. Goal data is often overdispersed, so NB introduces a dispersion parameter α with variance μ + αμ². Here the fitted α ≈ 0.008 is tiny (international goals are close to equidispersed), so NB barely departs from Poisson, which is itself a useful empirical finding worth a sentence in any writeup.
The classifiers: outcome from features
Five models predict P(W/D/L) directly from the three features:
- Logistic regression: a multinomial/softmax model, linear in the log-odds. Its inductive bias — that result probability is a smooth, monotone function of the strength gap — happens to match the problem almost perfectly.
- K-Nearest Neighbours: no parametric form at all; it predicts a match from the class balance of its 30 nearest historical matchups. With only three features the curse of dimensionality isn’t a threat, so KNN is surprisingly competitive.
- Random Forest: bagged decision trees, each grown on a bootstrap sample and a random feature subset, then averaged — variance reduction by ensembling.
- XGBoost: gradient-boosted trees, fit sequentially so each tree corrects the previous ensemble’s residuals.
- Neural network: a small multilayer perceptron (16→8 hidden units) that learns its own feature interactions.
All five expose predict_proba, so they drop straight into the common interface — I just batch-predict over all 48×48 ordered team pairs once and cache the probability matrices.
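The batch-predict-and-cache step could be sketched like this. It assumes that "combined strength" means the sum of the two teams' strengths and that `clf` is any object with a scikit-learn-style `predict_proba`. Both helper names are hypothetical.

```python
import numpy as np

def pair_features(strength, knockout=False):
    """One feature row per ordered pair (a, b): gap, combined strength, knockout flag."""
    n = len(strength)
    a, b = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    a, b = a.ravel(), b.ravel()
    return np.column_stack([
        strength[a] - strength[b],        # strength gap
        strength[a] + strength[b],        # combined strength (assumed to be a sum)
        np.full(n * n, float(knockout)),  # knockout flag
    ])

def cache_probs(clf, strength, knockout=False):
    """Return an (n, n, n_classes) array so that P[a, b] is the row for a vs. b."""
    n = len(strength)
    return clf.predict_proba(pair_features(strength, knockout)).reshape(n, n, -1)
```

The diagonal entries (a team against itself) are computed but never used; keeping them makes the reshape trivial.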
The draw problem
One subtlety binds the rating models together. Elo, Colley, and PageRank natively give a win expectancy, not a three-way split — so where does P(draw) come from?
I fit it from the data as a logistic function of the absolute strength gap: evenly matched teams draw far more often than mismatches. That single calibrated curve is shared by all three rating models, which keeps the comparison fair.
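One way to turn a win expectancy plus a draw curve into a three-way split is sketched below. The coefficients `a` and `b` are placeholders rather than the fitted values, and sharing out the non-draw mass in proportion to the Elo expectancy is an assumed rule; the article does not say exactly how it makes the split.

```python
import math

def draw_prob(gap, a=-1.0, b=-0.002):
    """Logistic curve in |gap|; a and b are placeholder coefficients."""
    return 1.0 / (1.0 + math.exp(-(a + b * abs(gap))))

def three_way(gap, a=-1.0, b=-0.002):
    """Return (p_win, p_draw, p_loss) for a rating gap on the Elo scale."""
    p_draw = draw_prob(gap, a, b)
    e = 1.0 / (1.0 + 10.0 ** (-gap / 400.0))   # Elo win expectancy
    # Assumed split: the non-draw mass is shared in proportion to e.
    return (1.0 - p_draw) * e, p_draw, (1.0 - p_draw) * (1.0 - e)
```

A negative `b` encodes the observation above: the larger the mismatch, the less likely the draw.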
One match, eleven opinions
Before simulating a whole tournament, it helps to see the disagreement at the level of a single game. Take Spain vs. Morocco — a strong favorite against a very good underdog. Here is the win / draw / loss probability each model assigns to Spain:
| Model | Spain win | Draw | Morocco win |
|---|---|---|---|
| PageRank (Ch 8) | 69% | 24% | 7% |
| Poisson (Ch 4) | 63% | 22% | 15% |
| Negative Binomial (Ch 4) | 62% | 22% | 15% |
| Logistic (Ch 5) | 61% | 24% | 15% |
| Elo (Ch 8) | 61% | 26% | 13% |
| Colley (Ch 8) | 57% | 26% | 17% |
| Neural net (Ch 7) | 56% | 20% | 24% |
| KNN (Ch 4/5) | 47% | 27% | 27% |
| Random Forest (Ch 6) | 40% | 39% | 22% |
| XGBoost (Ch 6) | 25% | 64% | 11% |
Table 1: Spain vs. Morocco modeled outcomes. Numbers calculated by author.
Spain’s win probability runs from 69% (PageRank) down to 25% (XGBoost) — and XGBoost actually makes the draw the most likely result, at 64%. PageRank loves Spain because strong teams have lost to them in the data; XGBoost, over-flexible on only 358 matches, smears probability onto the draw class — a calibration failure worth a whole separate post.
These aren’t rounding differences; they’re different theories of the same game. Now multiply that disagreement across 72 group matches (12 groups of six games each) and a 32-team knockout, 20,000 times over, and you get genuinely different tournaments.
The result: they don’t agree
Read across any row and a team’s fortunes swing on who’s asking. Spain ranges from a 13% afterthought to a 29% runaway. The bottom line — each model’s single most likely champion:
| Champion pick | Models backing it |
|---|---|
| Spain | Elo, Poisson, Negative Binomial, Logistic, KNN, PageRank, and the market |
| Argentina | Random Forest, XGBoost |
| France | Neural network |
| Netherlands | Colley |
Table 2: Different models, different champions. Created by author
Eleven models, four champions.
Why they disagree — the three real reasons
There are three distinct reasons for this disagreement, and none of them have much to do with soccer as such:
- Information source. Elo and the market’s implied odds encode current global form; Colley and PageRank encode only the results in the dataset. When a team’s recent results outrun its reputation (the Netherlands here), the graph methods diverge sharply from the form-based ones. Neither is wrong — they answer different questions.
- Goals vs. outcomes. The Poisson family models scorelines and derives the winner; the classifiers model the result directly. In tight matches those two routes assign different draw mass and therefore different knockout survival.
- Bias vs. variance. The tree ensembles (random forest and XGBoost) pick up subtler interactions in the training data and tilt toward Argentina; the linear models smooth those away. On only 358 matches that flexibility is as likely to be fitting noise as signal, which is exactly what the cross-validated scores show: the simplest classifier fits best and the most flexible ones fit worst.
You can see this family structure directly by correlating each model’s 48-team probability vector with every other’s:
Three blocks fall out, and the clustering is not an accident. The form-based models (Elo, logistic regression, Poisson, Negative Binomial) agree almost perfectly, because they all run off the same strength prior through different link functions.
The machine-learning classifiers (KNN, random forest, XGBoost, the neural net) form a second block. And Colley and PageRank, the only models that ignore the prior and read pure results, sit apart from everyone else (correlations down around 0.72–0.83). The chart is essentially a map of which models share information — which is the honest way to read any ensemble.
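The agreement map described above comes down to a correlation matrix over the models' title-probability vectors. Here is a sketch using Pearson correlation (an assumption, since the article does not name the coefficient) on made-up four-team vectors, not the article's numbers.

```python
import numpy as np

def model_correlations(title_probs):
    """title_probs: dict mapping model name -> array of per-team title probabilities."""
    names = list(title_probs)
    M = np.vstack([title_probs[name] for name in names])
    return names, np.corrcoef(M)   # rows are treated as variables

# Toy inputs for illustration only.
toy = {
    "model_a": np.array([0.30, 0.25, 0.25, 0.20]),
    "model_b": np.array([0.28, 0.27, 0.24, 0.21]),
    "model_c": np.array([0.20, 0.22, 0.18, 0.40]),
}
names, corr = model_correlations(toy)
```

Models that share an input will tend to show up as a block of high off-diagonal values.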
The consensus — averaging the ten non-market models — still reads sensibly:
Spain has ~20% win probability, France and Argentina ~14%, then the Netherlands and England. Averaging models is itself a method: a simple ensemble usually beats most of its members because uncorrelated errors partially cancel.
But notice the grey bars — the min-to-max range across models. They’re wide. Anyone who hands you a single number for “who wins the World Cup” is hiding those bars, and the bars are the honest part.
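The consensus and its range bars can be computed in a few lines. This is a sketch on made-up numbers; the `"market"` key and the function name are illustrative assumptions.

```python
import numpy as np

def consensus(title_probs, exclude=("market",)):
    """Mean, min, and max title probability per team across the non-excluded models."""
    M = np.vstack([p for name, p in title_probs.items() if name not in exclude])
    return M.mean(axis=0), M.min(axis=0), M.max(axis=0)

# Toy inputs for illustration only.
toy = {
    "model_a": np.array([0.30, 0.25, 0.25, 0.20]),
    "model_b": np.array([0.28, 0.27, 0.24, 0.21]),
    "model_c": np.array([0.20, 0.22, 0.18, 0.40]),
    "market":  np.array([0.35, 0.25, 0.20, 0.20]),
}
mean, lo, hi = consensus(toy)
```

Reporting `lo` and `hi` next to `mean` keeps the model-to-model spread visible instead of collapsing it into one number.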
What this can and can’t tell you
A few caveats are in order, and the first is essential for reading the agreement chart correctly. The goal models and the classifiers all take the same Elo-style strength prior as their primary input, so a large part of their agreement is mechanical, not independent corroboration. Only Colley and PageRank derive strength independently from the results graph, which is exactly why they sit apart. Treat the consensus as “where these methods, given this data, tend to land,” not as eleven independent witnesses.
Second, the training set is 358 matches, heavily weighted toward World Cups and European Championships; non-European form is under-sampled, and six 2026 qualifiers with no match history fall back to the prior. Third, matches are simulated at a neutral venue with a seeded bracket rather than FIFA’s exact Round-of-32 map. None of this sinks the exercise — but a model suite is only as independent as its inputs, and being explicit about that is the difference between an actual ensemble and an echo chamber.
Beyond the World Cup
The setup opens two threads worth pulling on your own data. First, the market’s implied odds are just another column in that heatmap. So the natural next step is to line the model consensus up against the market’s de-vigged probabilities and ask where, and why, they disagree: reading odds as probabilities, removing the margin, and asking how efficient the market really is as a forecaster.
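Reading odds as probabilities and removing the margin can be sketched as below. Proportional normalization is only one of several de-vigging methods, and the decimal odds are invented for illustration.

```python
def devig(decimal_odds):
    """Convert decimal odds to probabilities that sum to 1.

    Uses proportional normalization (one common choice among several).
    Returns (probabilities, margin), where margin is the bookmaker's overround.
    """
    implied = {k: 1.0 / o for k, o in decimal_odds.items()}
    total = sum(implied.values())
    return {k: p / total for k, p in implied.items()}, total - 1.0

# Made-up three-way odds for one match.
fair, margin = devig({"win": 1.80, "draw": 3.60, "loss": 4.75})
```

The raw implied probabilities sum to more than 1; the excess is the margin, and dividing it out gives numbers that can sit alongside the model outputs.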
Second, you might assume that the more flexible models, such as XGBoost and the neural net, fit the historical data best. The cross-validated scores say the opposite: on a small, low-dimensional dataset, that extra flexibility overfits. That’s a lesson that travels far beyond football.
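To make "cross-validated scores" concrete, here is a sketch of multiclass log loss, a standard way to score three-way probability forecasts on held-out matches. Whether the article used this metric is an assumption, and the arrays are toy data.

```python
import numpy as np

def log_loss(y_true, probs, eps=1e-12):
    """Mean negative log-probability assigned to the outcome that happened.

    y_true: integer class labels (0 = win, 1 = draw, 2 = loss).
    probs:  array of shape (n_matches, 3) with one row of probabilities per match.
    """
    p = np.clip(probs[np.arange(len(y_true)), y_true], eps, 1.0)
    return -np.mean(np.log(p))

y = np.array([0, 1, 2, 0])
hedged = np.full((4, 3), 1.0 / 3.0)                  # knows nothing, says so
overconfident = np.tile([0.90, 0.05, 0.05], (4, 1))  # always sure of a win
loss_hedged = log_loss(y, hedged)
loss_overconfident = log_loss(y, overconfident)
```

On this toy sample the overconfident forecaster scores worse than the uniform one, which is the mechanism by which an overfit model loses to a simpler one on held-out data.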
A fuller modeling suite, data, and charts are on GitHub. Every model is built and explained in the upcoming book I co-authored, titled Soccer Analytics with Machine Learning (O’Reilly, 2026). (The e-book will be available from around June 25; the printed book will be available by about mid-June, from everywhere you get your books.)
