Back to skills
extension
Category: Content & MediaNo API key required

network-architecture-sizing

PPO network architecture sizing for trading models. Trigger: (1) model files are unexpectedly small/large, (2) choosing hidden_dims for training, (3) balancing model capacity vs inference speed.

personAuthor: jakexiaohubgithub

Network Architecture Sizing - Research Notes

Experiment Overview

| Item | Details | |------|---------| | Date | 2025-12-18 | | Goal | Understand relationship between hidden_dims and model file size | | Environment | Google Colab A100, PyTorch 2.x, NativePPOTrainer | | Status | Documented |

Context

Training runs produced models at ~72 MB instead of expected ~148 MB. Investigation revealed the hidden_dims configuration determines model size, with the first layer dominating total parameter count due to multiplication with observation dimensions.

Architecture Comparison

Model Size vs Architecture

| Architecture | hidden_dims | Layers | Total Params | File Size | First Layer Size | |--------------|-------------|--------|--------------|-----------|------------------| | Large (v2.2) | (2048, 1024, 512, 256) | 4 | 12.6M | ~148 MB | 2048 × obs_dim | | Medium (v2.3) | (1024, 512, 256) | 3 | 6.1M | ~72 MB | 1024 × obs_dim | | Small | (512, 256, 128) | 3 | ~1.5M | ~18 MB | 512 × obs_dim | | Tiny | (256, 128, 64) | 3 | ~0.4M | ~5 MB | 256 × obs_dim |

Why First Layer Dominates

With 53 features × 100 lookback = 5,300 input dimensions:

  • Large: 2048 × 5300 = 10.9M params (86% of network)
  • Medium: 1024 × 5300 = 5.4M params (89% of network)
  • Small: 512 × 5300 = 2.7M params (90% of network)

Key insight: The first hidden layer dimension has exponentially more impact on model size than deeper layers.

Configuration Locations

Current defaults in ppo_trainer_native.py:

| Function | GPU Tier | hidden_dims | |----------|----------|-------------| | get_auto_config() | H100 | (1024, 512, 256) | | get_auto_config() | A100 | (1024, 512, 256) | | get_auto_config() | high (40GB+) | (512, 256, 128) | | get_auto_config() | medium (20-40GB) | (512, 256, 128) | | get_auto_config() | low (<20GB) | (256, 128, 64) | | get_a100_config() | A100-80GB | (1024, 512, 256) | | get_a100_config() | A100-40GB | (512, 256, 128) |

Verified Workflow

To use larger architecture (148 MB models):

from alpaca_trading.gpu.ppo_trainer_native import get_auto_config

config = get_auto_config(total_timesteps=200_000_000, training_mode='production')
config.hidden_dims = (2048, 1024, 512, 256)  # Override to 4-layer large

trainer = NativePPOTrainer(env, config)

To verify model architecture before training:

import torch

# Check expected size
obs_dim = 5300  # 53 features × 100 lookback
hidden_dims = (2048, 1024, 512, 256)

params = obs_dim * hidden_dims[0]  # First layer
for i in range(len(hidden_dims) - 1):
    params += hidden_dims[i] * hidden_dims[i+1]
params += hidden_dims[-1] * 64 * 2  # Actor + critic heads

print(f"Expected params: {params:,}")
print(f"Expected size: ~{params * 4 * 3 / 1024 / 1024:.0f} MB")  # float32 × 3 (weights + optimizer state)

To inspect existing model:

import torch

ckpt = torch.load('model.pt', map_location='cpu', weights_only=False)
print(f"hidden_dims: {ckpt['config'].hidden_dims}")
print(f"Total params: {sum(v.numel() for v in ckpt['policy_state_dict'].values()):,}")

Failed Attempts (Critical)

| Attempt | Why it Failed | Lesson Learned | |---------|---------------|----------------| | Assuming all configs use same architecture | Different GPU tiers have different defaults | Always check hidden_dims in config before training | | Only checking layer count | 3-layer (1024,512,256) vs 4-layer (2048,1024,512,256) | First layer width matters more than depth | | Not saving config with model | Couldn't reproduce training | Always save full config in checkpoint | | Using large architecture on small GPU | OOM errors | Match architecture to available VRAM | | Assuming bigger = better | Overfitting on small datasets | Larger models need more data/regularization |

Performance Considerations

Larger Architecture (2048, 1024, 512, 256)

Pros:

  • Higher model capacity for complex patterns
  • Better for symbols with rich feature interactions
  • May capture longer-term dependencies

Cons:

  • 2x file size (~148 MB vs ~72 MB)
  • Slower inference (~1.5-2x)
  • Higher VRAM usage during training
  • More prone to overfitting with limited data

Smaller Architecture (1024, 512, 256)

Pros:

  • Faster inference (important for live trading)
  • Lower VRAM requirements
  • Faster training iterations
  • Better generalization on limited data

Cons:

  • May underfit complex market dynamics
  • Less capacity for feature interactions

Recommended Architecture by Use Case

| Use Case | Recommended hidden_dims | Rationale | |----------|------------------------|-----------| | Quick iteration/testing | (512, 256, 128) | Fast training, low memory | | Standard production | (1024, 512, 256) | Good balance | | Complex symbols (crypto) | (2048, 1024, 512, 256) | Higher volatility patterns | | Limited training data (<1 year) | (512, 256, 128) | Reduce overfitting | | Extended training (500M+ steps) | (2048, 1024, 512, 256) | Capacity for more learning |

Key Insights

  • First layer width dominates model size - doubling first layer ~doubles total params
  • File size ≈ params × 12 bytes (float32 weights + Adam optimizer moments)
  • Current v2.3 defaults favor smaller models - optimized for speed over capacity
  • Architecture mismatch = inference failure - models trained with different hidden_dims are incompatible
  • Always log hidden_dims - critical for reproducibility and debugging

Diagnostic Commands

# Compare two model architectures
def compare_models(path1, path2):
    m1 = torch.load(path1, map_location='cpu', weights_only=False)
    m2 = torch.load(path2, map_location='cpu', weights_only=False)

    print(f"Model 1: {m1['config'].hidden_dims}")
    print(f"Model 2: {m2['config'].hidden_dims}")
    print(f"Params 1: {sum(v.numel() for v in m1['policy_state_dict'].values()):,}")
    print(f"Params 2: {sum(v.numel() for v in m2['policy_state_dict'].values()):,}")

References

  • alpaca_trading/gpu/ppo_trainer_native.py: Lines 1314, 1339, 1726, 1760
  • alpaca_trading/gpu/ppo_trainer_native.py: NativeActorCritic class (line 305)
  • CLAUDE.md: GPU Optimized Settings table