Rust Performance Best Practices

Expert-level performance optimization guide for Rust. Contains 45+ rules across 9 categories with real benchmarks, failure modes, and profiling workflows.

When to Apply

Reference these guidelines when:

Investigating slow Rust programs or high latency
Optimizing build times or binary size
Reviewing allocation-heavy code
Debugging lock contention or thread scaling issues
Setting up release profiles for production
Working with async runtimes (Tokio, async-std)

When NOT to Apply

Skip these optimizations when:

Code isn't in a hot path (profile first!)
Readability would suffer significantly
You haven't measured a performance problem
The optimization requires unsafe code you can't verify
Premature optimization would delay shipping

The Optimization Workflow

CRITICAL: Most Rust code doesn't need optimization. Profile first, optimize second.

┌─────────────────────────────────────────────────────────────┐
│                   OPTIMIZATION WORKFLOW                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. MEASURE FIRST                                           │
│     └── Profile before changing anything                   │
│     └── Use cargo flamegraph, perf, or heaptrack           │
│     └── Identify actual bottlenecks (don't guess!)         │
│                                                             │
│  2. CHECK BUILD SETTINGS                                    │
│     └── Release mode? (10-100x vs debug)                   │
│     └── LTO enabled? (5-20% improvement)                   │
│     └── Target CPU? (10-30% for SIMD)                      │
│                                                             │
│  3. FIX ALGORITHMIC ISSUES                                  │
│     └── O(n²) → O(n log n) matters more than micro-opts   │
│     └── Check data structure choices                       │
│     └── Avoid unnecessary work                             │
│                                                             │
│  4. REDUCE ALLOCATIONS                                      │
│     └── Pre-size collections (with_capacity)               │
│     └── Reuse buffers (clear + reuse)                      │
│     └── Avoid cloning (borrow instead)                     │
│                                                             │
│  5. OPTIMIZE HOT LOOPS                                      │
│     └── Iterators over indices                             │
│     └── Reduce lock scope                                  │
│     └── Batch I/O operations                               │
│                                                             │
│  6. MEASURE AGAIN                                           │
│     └── Verify improvement with benchmarks                 │
│     └── Check for regressions elsewhere                    │
│     └── Document the optimization                          │
│                                                             │
└─────────────────────────────────────────────────────────────┘

Quick Profiling Commands

# CPU profiling (Linux)
cargo flamegraph --bin myapp
perf record -g ./target/release/myapp && perf report

# Memory profiling
heaptrack ./target/release/myapp && heaptrack_gui heaptrack.myapp.*.gz
valgrind --tool=dhat ./target/release/myapp   # then load the dhat.out.<pid> file in Valgrind's dh_view.html

# Benchmark
cargo bench                          # All benchmarks
cargo bench hot_function             # Specific benchmark

# Check allocations (glibc only; the program must call mtrace() for a trace to be written)
MALLOC_TRACE=/tmp/mtrace.log ./target/release/myapp
mtrace ./target/release/myapp /tmp/mtrace.log

# Assembly inspection
cargo asm my_crate::hot_function --rust

# syscall count
strace -c ./target/release/myapp 2>&1 | head -20

Common Scenarios → Rules

"My Rust program is slow"

Is it running in debug mode?
├── YES → build-release-profile (10-100x speedup)
└── NO
    │
    Where does flamegraph show time?
    ├── malloc/free → alloc-* rules (with_capacity, reuse buffers)
    ├── Mutex::lock → sync-* rules (RwLock, atomics, shorter scope)
    ├── read/write syscalls → io-* rules (BufReader/BufWriter)
    ├── clone/drop → alloc-avoid-clone, use references
    └── Your code → iter-* rules, algorithm improvements

"My binary is too large"

1. Enable LTO: build-enable-lto (10-20% smaller)
2. Set opt-level = 'z': build-opt-level (optimizes for size)
3. panic = 'abort': build-panic-abort (removes unwinding code)
4. Strip symbols: strip = true in Cargo.toml
5. Remove debug info: debug = 0

"High memory usage"

1. Pre-size collections: alloc-*-with-capacity
2. Reuse allocations: alloc-reuse-buffers
3. Avoid cloning: alloc-avoid-clone
4. Use slices in APIs: alloc-use-slices-in-apis
5. Consider arena allocators: bumpalo crate

"Lock contention / thread scaling"

1. Profile: look for time spent under Mutex::lock in cargo flamegraph or perf output
2. Reduce lock scope: sync-keep-lock-scope-short
3. Read-heavy? → sync-use-rwlock
4. Simple counters? → sync-use-atomics
5. Message passing? → sync-use-channels
6. Thread-local + periodic flush for stats

"Slow file I/O"

1. Wrap in BufReader/BufWriter: io-use-bufreader, io-use-bufwriter
2. Flush before returning: io-flush-bufwriter (data loss prevention!)
3. Reuse line buffer: io-read-line-with-bufread
4. Consider mmap for random access: memmap2 crate

Rule Categories

| Priority | Category        | Typical Impact                      | Prefix  |
| -------- | --------------- | ----------------------------------- | ------- |
| 1        | Build Profiles  | 10-100x (debug→release)             | build-  |
| 2        | Benchmarking    | Enables measurement                 | bench-  |
| 3        | Allocation      | 2-50x for allocation-heavy code     | alloc-  |
| 4        | Data Structures | 2-10x for hot paths                 | data-   |
| 5        | Iteration       | 2-5x for loop-heavy code            | iter-   |
| 6        | Synchronization | 5-100x for contended code           | sync-   |
| 7        | I/O             | 10-100x for I/O-bound code          | io-     |
| 8        | Async/Await     | Prevents executor stalls            | async-  |
| 9        | Unsafe          | 5-30% in tight loops (experts only) | unsafe- |

1. Build Profiles (CRITICAL)

These apply to ALL Rust code. Check these first.

| Rule                  | Impact      | One-liner                            |
| --------------------- | ----------- | ------------------------------------ |
| build-release-profile | 10-100x     | Always ship release builds           |
| build-opt-level       | 2-5x        | opt-level=3 for speed, 'z' for size  |
| build-enable-lto      | 5-20%       | LTO enables cross-crate optimization |
| build-codegen-units   | 5-15%       | codegen-units=1 for max optimization |
| build-panic-abort     | Binary size | panic='abort' removes unwinding      |
| build-target-cpu      | 10-30%      | target-cpu=native for SIMD           |
| build-pgo             | 5-20%       | Profile-guided optimization          |
| build-incremental-off | 5-10%       | Disable for release builds           |

2. Benchmarking (REQUIRED)

You can't optimize what you don't measure.

| Rule                | Purpose                             |
| ------------------- | ----------------------------------- |
| bench-cargo-bench   | Use cargo bench with criterion      |
| bench-bench-profile | Bench profile enables optimizations |
| bench-black-box     | Prevent dead code elimination       |
| bench-avoid-io      | I/O variance destroys measurements  |
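The bench-black-box rule can be sketched with the standard library alone. This is only an illustration of where `black_box` goes, not a replacement for a criterion benchmark; `sum_of_squares` and `time_it` are hypothetical names, and a plain timing loop like this has none of criterion's statistical handling.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hypothetical function under test.
pub fn sum_of_squares(n: u64) -> u64 {
    (0..n).map(|x| x * x).sum()
}

/// Runs `f` `iters` times and returns the total elapsed time.
/// `black_box` on the input keeps the optimizer from constant-folding the call;
/// `black_box` on the result keeps it from discarding the work as dead code.
pub fn time_it(iters: u32, input: u64, f: fn(u64) -> u64) -> Duration {
    let start = Instant::now();
    for _ in 0..iters {
        black_box(f(black_box(input)));
    }
    start.elapsed()
}
```

Without the two `black_box` calls, an optimized build is free to compute the result once or drop the loop entirely, and the measured time would mean nothing.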

3. Allocation

Every allocation costs allocator work, and some also trigger a syscall. Reduce them.

| Rule                        | Impact      | Pattern                                                         |
| --------------------------- | ----------- | --------------------------------------------------------------- |
| alloc-vec-with-capacity     | 2-10x       | Vec::with_capacity(n) not Vec::new()                            |
| alloc-string-with-capacity  | 2-5x        | String::with_capacity(n)                                        |
| alloc-hashmap-with-capacity | 2-5x        | HashMap::with_capacity(n)                                       |
| alloc-reuse-buffers         | 2-10x       | .clear() and reuse, don't reallocate (up to 50x in tight loops) |
| alloc-use-slices-in-apis    | Flexibility | &[T] not Vec<T> in parameters                                   |
| alloc-avoid-clone           | 2-10x       | Borrow &T instead of clone() (benefits scale with data size)    |
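A minimal sketch of two of these patterns, buffer reuse and slice parameters. The function names and the 64-byte capacity estimate are assumptions made for illustration, not values taken from the rule files.

```rust
use std::fmt::Write;

/// Takes a slice so callers can pass a Vec, an array, or a sub-slice
/// without cloning or giving up ownership.
pub fn total_len(lines: &[&str]) -> usize {
    lines.iter().map(|l| l.len()).sum()
}

/// Formats each record into one reused String instead of allocating a new
/// String per iteration. `clear()` resets the length but keeps the capacity.
pub fn render_all(records: &[(u32, &str)]) -> Vec<usize> {
    let mut buf = String::with_capacity(64); // assumed estimate for a typical record
    let mut lengths = Vec::with_capacity(records.len()); // exact size is known
    for (id, name) in records {
        buf.clear();
        write!(buf, "{id}:{name}").expect("writing to a String cannot fail");
        lengths.push(buf.len());
    }
    lengths
}
```

If a record turns out longer than the estimate, `buf` grows once and keeps the larger capacity for later iterations.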

4. Data Structures

The right data structure beats micro-optimization.

| Rule                           | When                          |
| ------------------------------ | ----------------------------- |
| data-avoid-linkedlist          | Almost always (Vec wins)      |
| data-choose-vecdeque-for-queue | FIFO queues                   |
| data-choose-map-type           | HashMap=O(1), BTreeMap=sorted |
| data-use-entry-api             | Insert-or-update patterns     |
| data-repr-transparent          | FFI newtypes                  |
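As a rough illustration of data-use-entry-api and data-choose-vecdeque-for-queue, the sketch below uses only standard collections. `word_counts` and `drain_fifo` are hypothetical helpers.

```rust
use std::collections::{HashMap, VecDeque};

/// Counts word occurrences with one hash lookup per word via the entry API,
/// instead of a `contains_key` check followed by a separate `insert`.
pub fn word_counts(text: &str) -> HashMap<&str, usize> {
    let mut counts = HashMap::new();
    for word in text.split_whitespace() {
        *counts.entry(word).or_insert(0) += 1;
    }
    counts
}

/// Drains a FIFO queue in arrival order. `VecDeque::pop_front` is O(1),
/// whereas `Vec::remove(0)` shifts every remaining element.
pub fn drain_fifo(mut queue: VecDeque<u32>) -> Vec<u32> {
    let mut out = Vec::with_capacity(queue.len());
    while let Some(job) = queue.pop_front() {
        out.push(job);
    }
    out
}
```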

5. Iteration

Iterators typically compile to the same code as hand-written loops, and they are safer.

| Rule                         | Impact        | Pattern                               |
| ---------------------------- | ------------- | ------------------------------------- |
| iter-avoid-collect-then-loop | 2-3x          | Chain iterators, don't collect        |
| iter-use-lazy-iterators      | 2-3x          | .filter().map() not intermediate vecs |
| iter-use-any-find            | Short-circuit | .any() not .filter().count() > 0      |
| iter-use-retain              | In-place      | .retain() not .filter().collect()     |
| iter-use-binary-search       | O(log n)      | .binary_search() on sorted data       |
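The patterns in this table might look like the following in practice. The function names are made up for the example; only the standard iterator and slice methods named in the table are used.

```rust
/// Short-circuits at the first match instead of counting every match.
pub fn has_negative(values: &[i32]) -> bool {
    values.iter().any(|&v| v < 0)
}

/// Lazy chain: no intermediate Vec is built between `filter` and `map`.
pub fn sum_even_squares(values: &[i32]) -> i32 {
    values.iter().filter(|&&v| v % 2 == 0).map(|&v| v * v).sum()
}

/// Removes odd values in place rather than collecting into a new Vec.
pub fn keep_even(values: &mut Vec<i32>) {
    values.retain(|v| v % 2 == 0);
}

/// O(log n) membership test. The slice must already be sorted ascending;
/// on unsorted input the result is unspecified.
pub fn contains_sorted(sorted: &[i32], needle: i32) -> bool {
    sorted.binary_search(&needle).is_ok()
}
```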

6. Synchronization

Locks are expensive. Minimize contention.

| Rule                       | Impact         | When                                         |
| -------------------------- | -------------- | -------------------------------------------- |
| sync-share-with-arc        | Avoids copying | Share large (>64B) data across threads       |
| sync-use-rwlock            | 2-8x for reads | >80% reads, few writes; consider parking_lot |
| sync-keep-lock-scope-short | 4x             | Minimize code under lock                     |
| sync-use-channels          | 3-4x           | Message passing vs shared state              |
| sync-use-atomics           | 20x            | Simple counters, flags                       |
| sync-use-parking-lot       | 1.5-5x         | Prefer parking_lot over std sync primitives  |
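A sketch of sync-use-atomics and sync-keep-lock-scope-short using std primitives only (parking_lot is left out so the example has no dependencies). `count_with_atomics` and `push_square` are hypothetical, and the multiplication stands in for whatever expensive work a real caller would do.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

/// Counts events from several threads with an atomic instead of a Mutex<usize>.
pub fn count_with_atomics(threads: usize, per_thread: usize) -> usize {
    let counter = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let counter = Arc::clone(&counter);
            thread::spawn(move || {
                for _ in 0..per_thread {
                    // Relaxed is enough for a plain counter that guards no other data.
                    counter.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for handle in handles {
        handle.join().expect("worker thread panicked");
    }
    counter.load(Ordering::Relaxed)
}

/// Does the work before taking the lock and holds the lock only for the push.
pub fn push_square(shared: &Mutex<Vec<u64>>, n: u64) {
    let value = n * n; // stand-in for expensive work, done outside the lock
    shared.lock().expect("mutex poisoned").push(value);
    // The guard is a temporary, so it is released at the end of the statement above.
}
```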

7. I/O

Every syscall costs. Buffer them.

| Rule                      | Impact   | Pattern                            |
| ------------------------- | -------- | ---------------------------------- |
| io-use-bufreader          | 50x      | Wrap File in BufReader             |
| io-use-bufwriter          | 18x      | Wrap File in BufWriter             |
| io-flush-bufwriter        | CRITICAL | Must flush or lose data!           |
| io-read-line-with-bufread | 53x      | Reuse String buffer with read_line |
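The four I/O rules can be combined in a small sketch. The helpers below are hypothetical and generic over `Read`/`Write`, so the same code accepts a `File` or an in-memory buffer.

```rust
use std::io::{self, BufRead, BufReader, BufWriter, Read, Write};

/// Counts non-empty lines, reusing one String across `read_line` calls.
pub fn count_nonempty_lines<R: Read>(source: R) -> io::Result<usize> {
    let mut reader = BufReader::new(source);
    let mut line = String::new();
    let mut count = 0;
    loop {
        line.clear(); // read_line appends, so clear before each call
        if reader.read_line(&mut line)? == 0 {
            break; // 0 bytes read means EOF
        }
        if !line.trim().is_empty() {
            count += 1;
        }
    }
    Ok(count)
}

/// Writes numbers one per line through a BufWriter and flushes explicitly,
/// so a write error is returned here rather than ignored when the
/// BufWriter is dropped.
pub fn write_numbers<W: Write>(sink: W, upto: u32) -> io::Result<()> {
    let mut writer = BufWriter::new(sink);
    for n in 0..upto {
        writeln!(writer, "{n}")?;
    }
    writer.flush()
}
```

With a real file, the call would look like `count_nonempty_lines(File::open(path)?)`, where `path` is whatever the caller supplies.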

8. Async/Await (HIGH)

Critical for Tokio and async-std applications.

| Rule                    | Impact        | Pattern                                      |
| ----------------------- | ------------- | -------------------------------------------- |
| async-spawn-blocking    | Prevents hang | Use spawn_blocking for CPU-bound work        |
| async-cooperative       | Latency       | Yield periodically in long computations      |
| async-mutex-choice      | Correctness   | tokio::sync::Mutex across .await points      |
| async-avoid-blocking-io | Throughput    | Use async I/O, not std::fs in async contexts |
| async-bounded-channels  | Backpressure  | Prefer bounded channels for flow control     |

Key insight: The async runtime is cooperative. Blocking the executor thread starves all other tasks.

// BAD: Blocks the async runtime
async fn process(data: &[u8]) -> Result<Hash> {
    let hash = expensive_hash(data);  // CPU-bound, blocks executor!
    Ok(hash)
}

// GOOD: Offload to blocking thread pool
async fn process(data: Vec<u8>) -> Result<Hash> {
    // `?` propagates the JoinError if the blocking task panicked
    let hash = tokio::task::spawn_blocking(move || expensive_hash(&data)).await?;
    Ok(hash)
}

9. Unsafe (Expert Only)

Only after profiling proves these matter.

| Rule                    | Impact        | Risk                      |
| ----------------------- | ------------- | ------------------------- |
| unsafe-get-unchecked    | 5-30%         | UB if bounds wrong        |
| unsafe-use-maybeuninit  | 20-100x alloc | UB if read before write   |
| unsafe-avoid-transmute  | Correctness   | Prefer safe alternatives  |
| unsafe-repr-transparent | Zero-cost     | Required for FFI newtypes |

Decision Trees

When to use with_capacity?

Do you know the size?
├── YES, exact → with_capacity(exact)
├── YES, approximate → with_capacity(estimate)
└── NO
    │
    Will it grow frequently?
    ├── YES → Start bigger or use reserve()
    └── NO → Vec::new() is fine
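Two branches of this tree, sketched with hypothetical functions: the exact-size case, and the case where the size only becomes known one batch at a time.

```rust
/// Exact size known: allocate once up front.
pub fn squares(n: usize) -> Vec<usize> {
    let mut out = Vec::with_capacity(n);
    for i in 0..n {
        out.push(i * i);
    }
    out
}

/// Size known only per batch: `reserve` makes room for the whole batch
/// in one step instead of possibly growing several times while pushing.
pub fn append_doubled(out: &mut Vec<u32>, batch: &[u32]) {
    out.reserve(batch.len());
    for &x in batch {
        out.push(x * 2);
    }
}
```

`with_capacity(n)` guarantees room for at least `n` elements, so none of the pushes in `squares` reallocate.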

Mutex vs RwLock vs Atomics?

Is it a simple counter/flag?
├── YES → Atomics (20x faster)
└── NO
    │
    What's the read/write ratio?
    ├── Mostly reads (>90%) → RwLock
    ├── Mostly writes → Mutex
    └── Mixed → Mutex (simpler)

    Consider: parking_lot > std for all of these
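For the "mostly reads" branch, a minimal sketch with `std::sync::RwLock` (the tree suggests parking_lot; std is used here only to keep the example dependency-free). `read_mostly` and the "v1" config value are invented for illustration.

```rust
use std::sync::{Arc, RwLock};
use std::thread;

/// Spawns `readers` threads that each take a shared read lock, then performs
/// one exclusive write. Returns the summed lengths seen by the readers and
/// the final value.
pub fn read_mostly(readers: usize) -> (usize, String) {
    let config = Arc::new(RwLock::new(String::from("v1")));

    let handles: Vec<_> = (0..readers)
        .map(|_| {
            let config = Arc::clone(&config);
            thread::spawn(move || {
                // Read guards can be held by many threads at the same time.
                let guard = config.read().expect("lock poisoned");
                guard.len()
            })
        })
        .collect();

    let total_len: usize = handles
        .into_iter()
        .map(|h| h.join().expect("reader panicked"))
        .sum();

    // The write lock is exclusive; it is released at the end of this statement.
    config.write().expect("lock poisoned").push_str("-updated");

    let final_value = config.read().expect("lock poisoned").clone();
    (total_len, final_value)
}
```

With a `Mutex`, the readers would serialize even though none of them modifies the value.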

When is unsafe get_unchecked worth it?

Did you profile and find bounds checks are the bottleneck?
├── NO → Don't use it
└── YES
    │
    Did you check if LLVM already removed the bounds check?
    ├── NO → Check assembly first (cargo asm)
    └── YES, still there
        │
        Can you use iterators instead?
        ├── YES → Use iterators (same speed, safe)
        └── NO → get_unchecked with documented invariants
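The last two branches might look like this. `dot_safe` and `dot_unchecked` are hypothetical; the unchecked version is only a sketch of how to document the invariant, and is worth keeping only if profiling and the generated assembly both show the bounds checks surviving.

```rust
/// Safe version: `zip` stops at the shorter slice, so no index can go out
/// of bounds and there is normally no per-element bounds check to remove.
pub fn dot_safe(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Unchecked version with the invariant stated next to the unsafe block.
pub fn dot_unchecked(a: &[f32], b: &[f32]) -> f32 {
    // Invariant: every index used below is < n, and n <= a.len() and n <= b.len().
    let n = a.len().min(b.len());
    let mut acc = 0.0;
    for i in 0..n {
        // SAFETY: i < n, and n is the minimum of both slice lengths.
        acc += unsafe { a.get_unchecked(i) * b.get_unchecked(i) };
    }
    acc
}
```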

Reading Rules

Each rule file in rules/ contains:

Quantified impact with real benchmark numbers
Visual explanations of how the optimization works
Incorrect examples showing common mistakes
Correct examples with best practices
When NOT to apply - trade-offs and edge cases
Common mistakes to avoid
Profiling commands to identify the issue
References to official docs

Full Compiled Document

For all rules in a single file: AGENTS.md