Brier Score
Category: Decision-Making & Strategic Thinking Source: Glenn W. Brier (1950) / Meteorology / Forecasting Science Practitioner Score: 42/50 (Tier 1)
Overview
The Brier Score is a strictly proper scoring rule that measures the accuracy of probabilistic predictions. It calculates the mean squared error between predicted probabilities and actual outcomes, providing a single numeric measure of forecast quality. Lower scores indicate better accuracy, with 0 being perfect and 2 being worst possible.
Core Insight: You can't improve what you don't measure. The Brier Score converts vague notions of "good forecasting" into quantifiable performance, enabling systematic improvement through calibration feedback.
Formula: BS = (1/N) Σ(fᵢ - oᵢ)²
- fᵢ = forecasted probability (0-1)
- oᵢ = actual outcome (1 if event occurred, 0 if not)
- N = number of forecasts
When to Use
- Calibration assessment - Are your 70% predictions actually happening 70% of the time?
- Forecaster comparison - Which analyst/model produces more accurate predictions?
- Training feedback - Providing objective performance scores for improvement
- Decision validation - Evaluating quality of past predictions over time
- Model selection - Comparing machine learning models or forecasting systems
Anti-patterns:
- Single predictions (need 20+ for statistical validity)
- Non-probabilistic forecasts (binary yes/no)
- Events without clear resolution criteria
- Immediate feedback needs (requires outcome data)
How to Execute
Step 1: Record Probabilistic Forecast
Action: Document your prediction as a probability between 0% and 100%
- Precision: Use granular probabilities (65%, not "likely")
- Timestamp: Record when prediction was made
- Resolution criteria: Define exactly what constitutes "event occurred"
- Output: Logged forecast with probability, date, and outcome definition
Step 2: Wait for Event Resolution
Action: Allow sufficient time for outcome to be determined
- Clear endpoint: Specify resolution date/trigger in advance
- Unambiguous outcome: 1 (occurred) or 0 (did not occur)
- No gaming: Outcome determination must be independent of forecaster
- Output: Resolved outcome (1 or 0)
Step 3: Calculate Individual Forecast Error
Action: Compute squared difference between forecast and outcome
- If event occurred (oᵢ = 1): Error = (1 - fᵢ)²
- If event did not occur (oᵢ = 0): Error = (0 - fᵢ)² = fᵢ²
- Example: Predicted 70% (0.7), event happened → (1 - 0.7)² = 0.09
- Output: Single forecast Brier score
Step 4: Aggregate Across Multiple Forecasts
Action: Average squared errors across N predictions
- Minimum sample: 20+ forecasts for meaningful assessment
- Formula: BS = (Error₁ + Error₂ + ... + Errorₙ) / N
- Output: Overall Brier score for forecast set
Step 5: Interpret Score Against Benchmarks
Action: Compare your score to reference points
- Perfect accuracy: 0.00 (impossible in practice)
- Excellent: < 0.10 (superforecaster level)
- Good: 0.10 - 0.20 (well-calibrated forecaster)
- Average: 0.20 - 0.30 (typical expert)
- Poor: > 0.30 (worse than random guessing)
- Output: Performance classification
Step 6: Decompose into Calibration vs. Resolution
Action: Break Brier score into skill components
- Calibration: Are X% forecasts correct X% of the time?
- Resolution: Can you distinguish different probability levels?
- Formula: BS = Reliability - Resolution + Uncertainty
- Output: Diagnostic breakdown identifying improvement areas
Step 7: Implement Calibration Improvements
Action: Use insights to adjust forecasting behavior
- Overconfident (too many extremes): Pull predictions toward 50%
- Underconfident (clustered near 50%): Increase differentiation
- Systemic bias: Adjust all forecasts by consistent offset
- Output: Updated forecasting protocol
Real-World Examples
Weather Forecasting:
- Meteorologists tracked with Brier scores for 50+ years
- Led to dramatic improvements in precipitation forecasting
- Result: Today's 5-day forecast as accurate as 1-day forecast in 1970s
Good Judgment Project:
- Superforecasters averaged 0.15-0.18 Brier scores
- Regular forecasters averaged 0.25-0.30
- Intelligence analysts (with classified info) averaged 0.30+
- Result: Validated that systematic methodology beats expertise
Sports Betting Markets:
- Bookmakers use Brier scores to evaluate odds accuracy
- Prediction markets (PredictIt, Polymarket) track participant scores
- Result: Efficient markets reflect well-calibrated probabilities
Integration Points
Complements:
- Superforecasting: Brier score measures effectiveness of 10 commandments
- Calibration: Primary diagnostic tool for improving calibration
- Bayesian Updating: Tracks whether belief updates improve accuracy
- Prediction Markets: Aggregation mechanism with Brier-optimized incentives
Enables:
- Performance tracking: Quantitative measure for deliberate practice
- A/B testing: Compare forecasting methodologies empirically
- Incentive design: Reward accurate probabilistic predictions
Common Pitfalls
Pitfall 1: Conflating Low Score with Skill
- Warning sign: 0.05 score from 10 predictions all at 51%/49%
- Fix: Check resolution - can you distinguish probability levels?
Pitfall 2: Ignoring Calibration Components
- Warning sign: Good overall score masking systemic bias
- Fix: Decompose into reliability, resolution, and uncertainty
Pitfall 3: Small Sample Sizes
- Warning sign: Declaring "good forecaster" from 5 predictions
- Fix: Require minimum 20-50 forecasts before drawing conclusions
Pitfall 4: Cherry-Picking
- Warning sign: Only tracking predictions you feel confident about
- Fix: Commit to scoring ALL forecasts in domain upfront
Pitfall 5: Resolution Ambiguity
- Warning sign: Disputes about whether event "really" occurred
- Fix: Define resolution criteria precisely when making forecast
Multi-Category Extension
For events with more than 2 outcomes (e.g., election with 3+ candidates):
Formula: BS = (1/N) Σ Σⱼ (fᵢⱼ - oᵢⱼ)²
Example: Three-way race forecast of A=60%, B=10%, C=30%
- If A wins: (1-0.6)² + (0-0.1)² + (0-0.3)² = 0.16 + 0.01 + 0.09 = 0.26
- If B wins: (0-0.6)² + (1-0.1)² + (0-0.3)² = 0.36 + 0.81 + 0.09 = 1.26
- If C wins: (0-0.6)² + (0-0.1)² + (1-0.3)² = 0.36 + 0.01 + 0.49 = 0.86
Note: Multi-category scores range from 0 (perfect) to 2 (worst possible).
Validation Checklist
- [ ] All forecasts are probabilistic (0-100%), not binary
- [ ] Resolution criteria defined before outcome known
- [ ] Outcomes recorded honestly (no retroactive "adjustments")
- [ ] Sample size sufficient (20+ forecasts minimum)
- [ ] Score decomposed into calibration and resolution components
- [ ] Comparison made to baseline (random guessing, other forecasters)
- [ ] Trend tracked over time to measure improvement
Practical Tips
Tracking System:
- Spreadsheet columns: Date, Question, Forecast %, Outcome (0/1), Error²
- Running average in final column
- Monthly review of calibration plots
Calibration Plot:
- X-axis: Your forecasted probability (grouped into bins)
- Y-axis: Actual frequency of occurrence
- Perfect calibration = diagonal line
Improvement Signals:
- Score decreasing over time (learning)
- Calibration plot approaching diagonal
- Resolution increasing (more differentiation)
Further Reading
- Original paper: "Verification of Forecasts Expressed in Terms of Probability" - Glenn W. Brier (1950)
- "Superforecasting" - Philip Tetlock (Brier score as key metric)
- Good Judgment Open: Platform with automatic Brier score calculation
- Cultivate Labs: Forecasting guide with Brier score tutorials
微信扫一扫