Reward Function HOLD Bias Fix (v2.5.0)

Experiment Overview

| Item | Details | |------|---------| | Date | 2024-12-27 | | Goal | Fix reward function that teaches model HOLD is optimal | | Environment | GPU-native PPO training, vectorized_env.py | | Status | Success |

Context

Training BTCUSD showed the model wasn't finding trading opportunities over a 2-year period. Investigation revealed the reward function had a hidden HOLD bias.

Root Cause: Asymmetric payoff structure made HOLD the rational choice:

| Action | Reward If Correct | Reward If Wrong | Expected Value | |--------|-------------------|-----------------|----------------| | HOLD | 0 | 0 | 0 (safe) | | BUY/SELL | +0.35 to +0.90 | -0.52 (painful) | Negative after costs |

The Math for BTCUSD:

Slippage cost: 0.5% (crypto) per trade
5-bar horizon (5 hours at 1H timeframe)
Bitcoin 5-hour volatility: ~0.3-0.5% average
Expected move < slippage cost = negative EV for trading

Root Cause Analysis

| Issue | Problem | Impact | |-------|---------|--------| | Slippage penalty too aggressive | 0.5% × 10 scaling = huge penalty | Model avoids trading entirely | | Exploration bonus negligible | 0.01 × uncertainty = ~0.001 | No incentive to try trading | | HOLD gets zero reward | HOLD = 0, wrong trade = -0.52 | Asymmetric payoff favors HOLD | | Direction threshold too strict | 0.1% threshold vs 0.5% slippage | Correct predictions still lose money |

Verified Workflow

Step 1: Reduce Slippage Cost

# In vectorized_env.py, GPUEnvConfig
slippage_cost_crypto: float = 0.002  # Was 0.005 (0.5% -> 0.2%)
slippage_weight: float = 0.02        # Was 0.05

Step 2: Add Trading Incentive

# In vectorized_env.py, GPUEnvConfig
trading_incentive: float = 0.02  # NEW: Small bonus for non-HOLD actions

# In _calculate_rewards() after drawdown_penalty
trade_executed = (pred_direction != 0).float()
trading_incentive_reward = trade_executed * self.config.trading_incentive

# Add to combined reward
reward = (
    ... existing components ...
    trading_incentive_reward  # v2.5.0 - fix HOLD bias
) * risk_adjustment

Step 3: Increase Exploration Bonus

# In _calculate_rewards(), COMPONENT 5
exploration_bonus = 0.05 * uncertainty  # Was 0.01

Step 4: Increase Direction Threshold

# In _calculate_rewards()
threshold = 0.003  # Was 0.001 (0.3% vs 0.1%)
# Must exceed slippage to be considered profitable

Step 5: Rebalance Reward Weights

# In GPUEnvConfig
direction_weight: float = 0.40       # Was 0.35 - primary signal
magnitude_weight: float = 0.05       # Was 0.10 - noisy component
pnl_weight: float = 0.25             # Keep
stop_tp_weight: float = 0.10         # Was 0.15
exploration_weight: float = 0.15     # Was 0.10
slippage_weight: float = 0.02        # Was 0.05
drawdown_penalty_weight: float = 0.03 # Was 0.05

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned | |---------|---------------|----------------| | Switch to 15-min timeframe | Smaller moves, same costs = worse HOLD bias | Fix reward function first, then consider timeframe | | Just reduce slippage penalty | Model still biased toward HOLD | Need positive incentive for trading | | Large trading incentive (0.1) | Caused overtrading | 0.02 is sufficient to break tie | | Remove slippage penalty entirely | Model overtrades, ignores costs | Need penalty, just not excessive |

Final Parameters

# vectorized_env.py - GPUEnvConfig (v2.5.0)

# Reward weights (8 components)
direction_weight: float = 0.40
magnitude_weight: float = 0.05
pnl_weight: float = 0.25
stop_tp_weight: float = 0.10
exploration_weight: float = 0.15
slippage_weight: float = 0.02
drawdown_penalty_weight: float = 0.03
trading_incentive: float = 0.02  # NEW

# Transaction costs
slippage_cost_crypto: float = 0.002  # Was 0.005
slippage_cost: float = 0.001         # Equity unchanged

# Direction threshold
threshold = 0.003  # Was 0.001 (in _calculate_rewards)

# Exploration bonus multiplier
exploration_bonus = 0.05 * uncertainty  # Was 0.01

Key Insights

Asymmetric payoffs create bias - If HOLD=0 and wrong trade=-X, model learns HOLD is safe. Add small positive reward for trading to balance.
Slippage must be < expected move - If cost to trade > expected profit, rational to never trade. Align slippage with actual broker costs (0.1-0.2%).
Threshold should match slippage - Direction threshold (0.1%) below slippage (0.5%) means "correct" predictions still lose money. Set threshold >= slippage.
Exploration needs real incentive - 0.01 multiplier is negligible. Increase to 0.05 for meaningful exploration bonus.
Test with volatile assets first - BTCUSD has higher volatility, so if model won't trade BTC, it definitely won't trade lower-vol assets.

Expected Behavior After Fix

With the new reward function, expect:

Trade rate: 30-60% (was ~5%)
More balanced signal distribution (BUY/SELL/HOLD)
Model takes trades when expected move > costs
Still respects risk management (drawdown penalty)

References

alpaca_trading/gpu/vectorized_env.py: Lines 388-410 (GPUEnvConfig), 1622-1712 (_calculate_rewards)
alpaca_trading/api/routes/signals.py: Lines 75, 136 (dashboard key fix: 'passed' -> 'pass')
Literature: Risk-Aware RL Reward - multi-component reward design
Skill: reward-scaling-calibration - related fix for reward_scale