Brier Score

Category: Decision-Making & Strategic Thinking
Source: Glenn W. Brier (1950) / Meteorology / Forecasting Science
Practitioner Score: 42/50 (Tier 1)

Overview

The Brier Score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions. It is the mean squared error between predicted probabilities and actual outcomes, giving a single numeric measure of forecast quality. Lower scores indicate better accuracy: in the binary form used below, 0 is perfect and 1 is the worst possible score. (Brier's original formulation sums over all outcome categories and ranges from 0 to 2; see Multi-Category Extension.)

Core Insight: You can't improve what you don't measure. The Brier Score converts vague notions of "good forecasting" into quantifiable performance, enabling systematic improvement through calibration feedback.

Formula: BS = (1/N) Σ(fᵢ - oᵢ)²

fᵢ = forecasted probability (0-1)
oᵢ = actual outcome (1 if event occurred, 0 if not)
N = number of forecasts
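
As a minimal sketch of the formula above, the following Python function averages the squared differences between forecasts and outcomes. The function name `brier_score`, the input checks, and the sample data are illustrative assumptions, not part of any particular library.

```python
from typing import Sequence


def brier_score(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(forecasts) != len(outcomes):
        raise ValueError("forecasts and outcomes must have the same length")
    if not forecasts:
        raise ValueError("at least one forecast is required")
    for f, o in zip(forecasts, outcomes):
        if not 0.0 <= f <= 1.0:
            raise ValueError(f"forecast {f} is not a probability in [0, 1]")
        if o not in (0, 1):
            raise ValueError(f"outcome {o} must be 0 or 1")
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# Made-up data for illustration: three forecasts and their resolved outcomes.
example_score = brier_score([0.7, 0.2, 0.9], [1, 0, 1])
print(round(example_score, 4))
```

A constant 50% forecast always scores 0.25 under this function, and a fully confident wrong forecast scores 1.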

When to Use

Calibration assessment - Are your 70% predictions actually happening 70% of the time?
Forecaster comparison - Which analyst/model produces more accurate predictions?
Training feedback - Providing objective performance scores for improvement
Decision validation - Evaluating quality of past predictions over time
Model selection - Comparing machine learning models or forecasting systems

Anti-patterns:

Single predictions (need 20+ for statistical validity)
Non-probabilistic forecasts (binary yes/no)
Events without clear resolution criteria
Immediate feedback needs (requires outcome data)

How to Execute

Step 1: Record Probabilistic Forecast

Action: Document your prediction as a probability between 0% and 100%

Precision: Use granular probabilities (65%, not "likely")
Timestamp: Record when prediction was made
Resolution criteria: Define exactly what constitutes "event occurred"
Output: Logged forecast with probability, date, and outcome definition
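
One possible shape for a logged forecast is sketched below. The class name `ForecastRecord`, its field names, and the sample question are hypothetical; the point is only that the probability, timestamp, and resolution criteria are captured and validated at the moment the forecast is made.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ForecastRecord:
    """Hypothetical record for one logged forecast (field names are assumptions)."""
    question: str
    probability: float             # stored as 0.0-1.0, so 65% is 0.65
    made_on: date                  # when the prediction was made
    resolution_criteria: str       # what exactly counts as "event occurred"
    resolves_by: date              # resolution date fixed in advance
    outcome: Optional[int] = None  # filled in later with 1 or 0

    def __post_init__(self) -> None:
        if not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")
        if not self.resolution_criteria.strip():
            raise ValueError("resolution criteria must not be empty")
        if self.resolves_by < self.made_on:
            raise ValueError("resolution date cannot precede the forecast date")


# Illustrative entry; the question and dates are invented for this sketch.
record = ForecastRecord(
    question="Will the project ship by the end of Q2?",
    probability=0.65,
    made_on=date(2024, 1, 15),
    resolution_criteria="Public release announced on or before 2024-06-30",
    resolves_by=date(2024, 6, 30),
)
```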

Step 2: Wait for Event Resolution

Action: Allow sufficient time for outcome to be determined

Clear endpoint: Specify resolution date/trigger in advance
Unambiguous outcome: 1 (occurred) or 0 (did not occur)
No gaming: Outcome determination must be independent of forecaster
Output: Resolved outcome (1 or 0)

Step 3: Calculate Individual Forecast Error

Action: Compute squared difference between forecast and outcome

If event occurred (oᵢ = 1): Error = (1 - fᵢ)²
If event did not occur (oᵢ = 0): Error = (0 - fᵢ)² = fᵢ²
Example: Predicted 70% (0.7), event happened → (1 - 0.7)² = 0.09
Output: Single forecast Brier score

Step 4: Aggregate Across Multiple Forecasts

Action: Average squared errors across N predictions

Minimum sample: 20+ forecasts for meaningful assessment
Formula: BS = (Error₁ + Error₂ + ... + Errorₙ) / N
Output: Overall Brier score for forecast set
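
The aggregation step can be sketched as follows, assuming forecasts are stored as (probability, outcome) pairs and that unresolved forecasts carry `None` as their outcome. The name `aggregate_brier` and the constant `MIN_FORECASTS` are illustrative; the threshold of 20 comes from the minimum sample stated above.

```python
from typing import Iterable, Optional, Tuple

MIN_FORECASTS = 20  # minimum sample size suggested in the text


def aggregate_brier(
    pairs: Iterable[Tuple[float, Optional[int]]]
) -> Tuple[float, int, bool]:
    """Return (score, n_resolved, enough_data) for (forecast, outcome) pairs.

    Forecasts whose outcome is None are treated as unresolved and skipped.
    """
    errors = [(f - o) ** 2 for f, o in pairs if o is not None]
    if not errors:
        raise ValueError("no resolved forecasts to score")
    n = len(errors)
    return sum(errors) / n, n, n >= MIN_FORECASTS


# Made-up data: four forecasts, one still unresolved.
score, n_resolved, enough_data = aggregate_brier(
    [(0.7, 1), (0.3, 0), (0.6, None), (0.9, 1)]
)
print(round(score, 4), n_resolved, enough_data)
```

Returning the sample-size flag alongside the score keeps a small-sample result from being read as a settled assessment.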

Step 5: Interpret Score Against Benchmarks

Action: Compare your score to reference points

Perfect accuracy: 0.00 (impossible in practice)
Excellent: < 0.10 (superforecaster level)
Good: 0.10 - 0.20 (well-calibrated forecaster)
Average: 0.20 - 0.30 (typical expert)
Poor: > 0.30 (worse than always forecasting 50%, which scores 0.25)
Output: Performance classification
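
Fixed benchmark bands are easier to read next to a baseline computed for the same questions. The sketch below, with hypothetical function names and a made-up base rate, computes the expected score of a constant forecast and the standard Brier skill score (1 minus the ratio of your score to a reference score).

```python
def constant_forecast_score(p: float, base_rate: float) -> float:
    """Expected Brier score of always forecasting p when the event
    occurs with long-run frequency base_rate."""
    return base_rate * (1.0 - p) ** 2 + (1.0 - base_rate) * p ** 2


def brier_skill_score(score: float, reference_score: float) -> float:
    """Positive values beat the reference forecast; 0 matches it; negative is worse."""
    if reference_score == 0:
        raise ValueError("reference score must be non-zero")
    return 1.0 - score / reference_score


# Assume, for illustration, that 30% of the questions resolve "yes".
coin_flip = constant_forecast_score(0.5, 0.3)           # always saying 50%
base_rate_baseline = constant_forecast_score(0.3, 0.3)  # always saying the base rate
skill = brier_skill_score(0.15, base_rate_baseline)     # a forecaster scoring 0.15
print(round(coin_flip, 4), round(base_rate_baseline, 4), round(skill, 4))
```

Always forecasting 50% scores 0.25 whatever the base rate, while always forecasting the base rate scores base_rate × (1 - base_rate), which is lower whenever the base rate differs from 50%.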

Step 6: Decompose into Calibration vs. Resolution

Action: Break Brier score into skill components

Calibration: Are X% forecasts correct X% of the time?
Resolution: Can you distinguish different probability levels?
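Formula: BS = Reliability - Resolution + Uncertainty (Reliability is the calibration term, where lower is better; Uncertainty = ō(1 - ō), with ō the overall base rate of the event)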
Output: Diagnostic breakdown identifying improvement areas
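
The decomposition can be sketched in Python by grouping forecasts that share the same stated probability. The function name `murphy_decomposition` and the sample data are illustrative assumptions. Grouping by exact forecast value makes the identity hold exactly; binning nearby values instead would make it approximate.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def murphy_decomposition(
    forecasts: Sequence[float], outcomes: Sequence[int]
) -> Tuple[float, float, float]:
    """Return (reliability, resolution, uncertainty) for binary forecasts."""
    n = len(forecasts)
    if n == 0 or n != len(outcomes):
        raise ValueError("need equal-length, non-empty inputs")
    base_rate = sum(outcomes) / n

    # Collect the outcomes observed for each distinct forecast value.
    groups: Dict[float, List[int]] = defaultdict(list)
    for f, o in zip(forecasts, outcomes):
        groups[f].append(o)

    reliability = 0.0  # gap between stated probability and observed frequency
    resolution = 0.0   # spread of observed frequencies around the base rate
    for f, group in groups.items():
        observed = sum(group) / len(group)
        reliability += len(group) * (f - observed) ** 2
        resolution += len(group) * (observed - base_rate) ** 2
    uncertainty = base_rate * (1.0 - base_rate)
    return reliability / n, resolution / n, uncertainty


# Made-up data: four forecasts at 80% (three hits), four at 20% (one hit).
fc = [0.8, 0.8, 0.8, 0.8, 0.2, 0.2, 0.2, 0.2]
oc = [1, 1, 1, 0, 0, 0, 0, 1]
rel, res, unc = murphy_decomposition(fc, oc)
print(round(rel, 4), round(res, 4), round(unc, 4), round(rel - res + unc, 4))
```

For this sample, reliability is small (0.0025) and resolution is 0.0625, so most of the 0.19 score comes from the inherent uncertainty of the questions (0.25) rather than from miscalibration.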

Step 7: Implement Calibration Improvements

Action: Use insights to adjust forecasting behavior

Overconfident (too many extremes): Pull predictions toward 50%
Underconfident (clustered near 50%): Increase differentiation
Systematic bias: Adjust all forecasts by a consistent offset
Output: Updated forecasting protocol
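
Before changing forecasting behavior, a proposed adjustment can be replayed against past forecasts to see whether it would have lowered the score. The sketch below uses two deliberately simple, hypothetical adjustments (linear shrinkage toward 50% and a clamped offset) and made-up data for an overconfident forecaster; it is one way to test an adjustment, not a recommended recalibration method.

```python
from typing import Sequence


def brier(forecasts: Sequence[float], outcomes: Sequence[int]) -> float:
    """Binary Brier score: mean squared error of forecasts against 0/1 outcomes."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


def shrink_toward_half(p: float, strength: float) -> float:
    """Move p toward 0.5; strength=0 leaves p unchanged, strength=1 returns 0.5."""
    return 0.5 + (1.0 - strength) * (p - 0.5)


def apply_offset(p: float, offset: float) -> float:
    """Shift p by a constant offset, clamped to the valid range [0, 1]."""
    return min(1.0, max(0.0, p + offset))


# Made-up history: ten forecasts at 95%, of which only seven came true.
past_forecasts = [0.95] * 10
past_outcomes = [1] * 7 + [0] * 3

before = brier(past_forecasts, past_outcomes)
adjusted = [shrink_toward_half(p, 0.5) for p in past_forecasts]
after = brier(adjusted, past_outcomes)
print(round(before, 4), round(after, 4))
```

An adjustment tuned on past data can overfit, so it is worth checking on forecasts that were not used to choose it.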

Real-World Examples

Weather Forecasting:

Meteorologists have been scored with Brier scores for 50+ years
Contributed to dramatic improvements in precipitation forecasting
Result: Today's 5-day forecast is about as accurate as a 1-day forecast was around 1980

Good Judgment Project:

Superforecasters averaged 0.15-0.18 Brier scores
Regular forecasters averaged 0.25-0.30
Intelligence analysts (with classified info) averaged 0.30+
Result: Evidence that systematic forecasting methodology can beat subject-matter expertise alone

Sports Betting Markets:

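Bookmaker odds can be converted to implied probabilities and evaluated with Brier scores
Prediction market prices (PredictIt, Polymarket) can be scored the same way; forecasting platforms such as Good Judgment Open track per-participant Brier scores
Result: Brier scores make it possible to test whether market prices behave like well-calibrated probabilities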

Integration Points

Complements:

Superforecasting: Brier score measures the effectiveness of Tetlock's "Ten Commandments for Aspiring Superforecasters"
Calibration: Primary diagnostic tool for improving calibration
Bayesian Updating: Tracks whether belief updates improve accuracy
Prediction Markets: Aggregation mechanism whose prices can be scored with the Brier score

Enables:

Performance tracking: Quantitative measure for deliberate practice
A/B testing: Compare forecasting methodologies empirically
Incentive design: Reward accurate probabilistic predictions

Common Pitfalls

Pitfall 1: Conflating Low Score with Skill

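Warning sign: 0.05 score earned only on near-certain questions (e.g., 10 predictions at 95%+ on events that almost always happen). Hedging everything at 51%/49% is the opposite failure and scores about 0.25.
Fix: Compare against a base-rate baseline and check resolution - can you distinguish probability levels?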

Pitfall 2: Ignoring Calibration Components

Warning sign: Good overall score masking systematic bias
Fix: Decompose into reliability, resolution, and uncertainty

Pitfall 3: Small Sample Sizes

Warning sign: Declaring "good forecaster" from 5 predictions
Fix: Require minimum 20-50 forecasts before drawing conclusions

Pitfall 4: Cherry-Picking

Warning sign: Only tracking predictions you feel confident about
Fix: Commit to scoring ALL forecasts in domain upfront

Pitfall 5: Resolution Ambiguity

Warning sign: Disputes about whether event "really" occurred
Fix: Define resolution criteria precisely when making forecast

Multi-Category Extension

For events with more than 2 outcomes (e.g., election with 3+ candidates):

Formula: BS = (1/N) Σ Σⱼ (fᵢⱼ - oᵢⱼ)²

Example: Three-way race forecast of A=60%, B=10%, C=30%

If A wins: (1-0.6)² + (0-0.1)² + (0-0.3)² = 0.16 + 0.01 + 0.09 = 0.26
If B wins: (0-0.6)² + (1-0.1)² + (0-0.3)² = 0.36 + 0.81 + 0.09 = 1.26
If C wins: (0-0.6)² + (0-0.1)² + (1-0.3)² = 0.36 + 0.01 + 0.49 = 0.86

Note: Multi-category scores range from 0 (perfect) to 2 (worst possible).
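
The three-way example above can be reproduced with a short function. The name `multiclass_brier` and the choice to represent each outcome as the index of the winning category are assumptions made for this sketch.

```python
from typing import Sequence


def multiclass_brier(
    forecasts: Sequence[Sequence[float]], winners: Sequence[int]
) -> float:
    """Multi-category Brier score.

    forecasts[i] lists the probabilities assigned to each category for
    question i; winners[i] is the index of the category that occurred.
    """
    if len(forecasts) != len(winners) or not forecasts:
        raise ValueError("need equal-length, non-empty inputs")
    total = 0.0
    for probs, winner in zip(forecasts, winners):
        if abs(sum(probs) - 1.0) > 1e-9:
            raise ValueError("probabilities for each question must sum to 1")
        if not 0 <= winner < len(probs):
            raise ValueError("winner index out of range")
        # Outcome vector is 1 for the winning category and 0 for the rest.
        total += sum(
            (p - (1.0 if j == winner else 0.0)) ** 2 for j, p in enumerate(probs)
        )
    return total / len(forecasts)


race = [0.6, 0.1, 0.3]  # A=60%, B=10%, C=30%
score_if_a = multiclass_brier([race], [0])
score_if_b = multiclass_brier([race], [1])
score_if_c = multiclass_brier([race], [2])
print(round(score_if_a, 2), round(score_if_b, 2), round(score_if_c, 2))
```

Applied to a two-category question, this form gives exactly twice the binary score, which is why the two conventions should not be mixed when comparing forecasters.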

Validation Checklist

[ ] All forecasts are probabilistic (0-100%), not binary
[ ] Resolution criteria defined before outcome known
[ ] Outcomes recorded honestly (no retroactive "adjustments")
[ ] Sample size sufficient (20+ forecasts minimum)
[ ] Score decomposed into calibration and resolution components
[ ] Comparison made to baseline (random guessing, other forecasters)
[ ] Trend tracked over time to measure improvement

Practical Tips

Tracking System:

Spreadsheet columns: Date, Question, Forecast %, Outcome (0/1), Error²
Running average in final column
Monthly review of calibration plots
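
The spreadsheet layout above maps directly onto a few lines of Python. The rows below are invented for illustration, and the tuple layout (date, question, forecast %, outcome) simply mirrors the suggested columns.

```python
from itertools import accumulate

# Hypothetical log rows: (date, question, forecast %, outcome 0/1).
rows = [
    ("2024-01-05", "Question A", 70, 1),
    ("2024-01-12", "Question B", 20, 0),
    ("2024-01-19", "Question C", 60, 0),
]

# Error² column: squared difference between the forecast (as 0-1) and the outcome.
squared_errors = [(pct / 100 - outcome) ** 2 for _, _, pct, outcome in rows]

# Running-average column: Brier score over all forecasts logged so far.
running_average = [
    total / (i + 1) for i, total in enumerate(accumulate(squared_errors))
]

for (day, question, pct, outcome), err, avg in zip(rows, squared_errors, running_average):
    print(day, question, pct, outcome, round(err, 4), round(avg, 4))
```

The last value of the running average is the overall Brier score for the log so far.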

Calibration Plot:

X-axis: Your forecasted probability (grouped into bins)
Y-axis: Actual frequency of occurrence
Perfect calibration = diagonal line
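
The points for a calibration plot can be computed without any plotting library. In this sketch the function name `calibration_table`, the equal-width binning, and the default of 10 bins are assumptions; each returned row holds the mean forecast in a bin (X), the observed frequency (Y), and the number of forecasts in the bin.

```python
from typing import List, Sequence, Tuple


def calibration_table(
    forecasts: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean forecast, observed frequency, count) for each non-empty bin."""
    bins: List[List[Tuple[float, int]]] = [[] for _ in range(n_bins)]
    for f, o in zip(forecasts, outcomes):
        # Equal-width bins over [0, 1]; a forecast of exactly 1.0 joins the top bin.
        index = min(int(f * n_bins), n_bins - 1)
        bins[index].append((f, o))

    table = []
    for members in bins:
        if members:
            mean_forecast = sum(f for f, _ in members) / len(members)
            observed = sum(o for _, o in members) / len(members)
            table.append((mean_forecast, observed, len(members)))
    return table


# Made-up data: four forecasts at 20% (one hit), four at 80% (three hits).
table = calibration_table(
    [0.2, 0.2, 0.2, 0.2, 0.8, 0.8, 0.8, 0.8],
    [0, 0, 0, 1, 1, 1, 1, 0],
)
for mean_forecast, observed, count in table:
    print(round(mean_forecast, 2), round(observed, 2), count)
```

Rows where the observed frequency sits close to the mean forecast lie near the diagonal; bins with few forecasts are noisy and should be read with caution.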

Improvement Signals:

Score decreasing over time (learning)
Calibration plot approaching diagonal
Resolution increasing (more differentiation)
