Optimizing RAG Performance

Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.

When to Use This Skill

Improving retrieval accuracy and quality
Adding reranking to existing RAG pipeline
Reducing latency and improving response time
Deploying RAG to production
Implementing caching for faster re-processing
Scaling to handle more documents or queries
Optimizing costs (API usage, compute resources)

Quick Wins (Low Effort, High Impact)

1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

Benefit: Transforms any embedding into competitive performance Effort: Minimal code change

from llama_index.postprocessor.cohere_rerank import CohereRerank

# For English content
reranker = CohereRerank(
    top_n=5,
    model="rerank-english-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# For Thai/multilingual content
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Apply to query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more candidates
    node_postprocessors=[reranker]  # Rerank to top 5
)

Best Practice: Always retrieve 10x candidates, rerank to final top_k

2. Enable Parallel Loading → 13x Speedup (30 minutes)

Benefit: 391s → 31s for 32 PDF files Effort: One parameter change

from llama_index.core import SimpleDirectoryReader

# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()

# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
    num_workers=10  # Adjust based on CPU cores
)

3. Optimize Batch Size → Faster Embeddings (15 minutes)

Benefit: Reduced API calls, better throughput Effort: Configuration change

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100  # Up from default 10
)

Quick Decision Guide

Reranker Selection

Best quality → CohereRerank (API-based, best performance)
Best open-source → bge-reranker-large (local, free)
Cost-effective → SentenceTransformerRerank (local, fast)
Multi-language → CohereRerank multilingual-v3.0

Caching Strategy

Development → Local cache (pipeline.persist)
Production → Redis cache (distributed)
When to clear → Model changes, schema updates

Production Deployment

Small scale (<1M docs) → Serverless (auto-scaling)
Large scale → Container-based (consistent performance)
Multi-tenant → Collection isolation + metadata filtering

Optimization Patterns

Pattern 1: Add Best-Practice Reranking

from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
    model="BAAI/bge-reranker-large",
    top_n=5
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)

Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR

Pattern 2: Pipeline Caching (Local)

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Create pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding()
    ]
)

# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")

# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents)  # Only processes new/changed docs

Pattern 3: Redis Cache (Production)

from llama_index.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore

# Distributed caching
ingest_cache = IngestionCache(
    cache=RedisKVStore.from_host_and_port(
        host="redis-server",
        port=6379
    ),
    collection="rag_pipeline_cache"
)

pipeline = IngestionPipeline(
    transformations=[...],
    cache=ingest_cache
)

# Cache shared across instances
nodes = pipeline.run(documents=documents)

Pattern 4: Advanced Retrieval Strategies

Metadata Pre-filtering (Sub-50ms):

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Filter before vector search (90% reduction possible)
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="category", value="technical"),
        ExactMatchFilter(key="year", value="2024")
    ]
)

query_engine = index.as_query_engine(
    filters=filters,  # Narrow search space first
    similarity_top_k=5
)

Document Summary Retrieval (For 100+ docs):

from llama_index.core import DocumentSummaryIndex

# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer
)

retriever = summary_index.as_retriever(similarity_top_k=3)

Chunk Decoupling (Precision + Context):

from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Sentences before/after
    window_metadata_key="window"
)

# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor],
    similarity_top_k=6
)

Pattern 5: Multiple Postprocessors Chain

from llama_index.core.postprocessor import SimilarityPostprocessor

# Progressive refinement
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # Filter low scores
        CohereRerank(top_n=10),                          # Rerank top candidates
        MetadataReplacementPostProcessor(...)             # Expand context
    ]
)

Your Codebase Integration

For `src/` Pipeline

Add Reranking to All 7 Strategies:

src/10_basic_query_engine.py → Add CohereRerank
src/16_hybrid_search.py → Add reranker after fusion
All other strategies: Consistent reranking layer

Enable Caching:

src/09_enhanced_batch_embeddings.py → Add pipeline caching
src/02_prep_doc_for_embedding.py → Cache preprocessing

For `src-iLand/` Pipeline

Thai-Optimized Reranking:

# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0"  # Thai support
)

Fast Metadata Filtering (Already implemented):

src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
Inverted indices for: จังหวัด, อำเภอ, ประเภทโฉนด
B-tree indices for: area, coordinates

Batch Processing Optimization:

src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
src-iLand/data_processing/ → Enable parallel loading

Detailed References

Load these when you need comprehensive details:

reference-reranking.md: Complete reranking guide
- 6 reranker models with benchmarks
- Multi-language reranking
- Cost-performance trade-offs
- Node postprocessor types
reference-production.md: Production optimization patterns
- Ingestion pipeline caching (local & Redis)
- Parallel processing (13x speedup)
- Vector store integration
- Multi-tenancy patterns
- Deployment architectures
- Error handling and reliability
reference-advanced-retrieval.md: Advanced retrieval strategies
- Document summary retrieval
- Recursive retrieval
- Chunk decoupling
- Sub-question decomposition
- Fusion retrieval
- Auto retriever

Common Workflows

Workflow 1: Add Reranking to Existing Pipeline

[ ] Step 1: Choose reranker
- Load reference-reranking.md for comparison
- For Thai: CohereRerank multilingual-v3.0
- For cost: bge-reranker-large

[ ] Step 2: Install dependencies

pip install llama-index-postprocessor-cohere-rerank
# OR
pip install llama-index-postprocessor-flag-embedding-reranker

[ ] Step 3: Update query engine
- Change similarity_top_k from 5 → 10
- Add reranker with top_n=5
[ ] Step 4: Test impact
- Compare retrieval quality before/after
- Expected: 5-15% hit rate improvement
[ ] Step 5: Deploy to all strategies
- Apply consistently across retrievers

Workflow 2: Enable Production Caching

[ ] Step 1: Choose caching backend
- Development → Local cache
- Production → Redis cache

[ ] Step 2: Wrap pipeline

pipeline = IngestionPipeline(
    transformations=[splitter, embedder],
    # cache=... (local or Redis)
)

[ ] Step 3: Initial run (builds cache)

nodes = pipeline.run(documents=documents)
pipeline.persist("./cache")  # Local only

[ ] Step 4: Subsequent runs (uses cache)
- Only processes new/changed documents
- Massive speedup for re-runs
[ ] Step 5: Monitor cache size
- Clear when storage grows too large
- Clear on model/schema changes

Workflow 3: Deploy to Production

[ ] Step 1: Review production checklist
- Load reference-production.md for full guide
[ ] Step 2: Implement error handling
- Retry logic for API calls
- Fallback strategies for failures
- Circuit breaker pattern
[ ] Step 3: Add monitoring
- Track retrieval latency (p50, p95, p99)
- Monitor hit rate and MRR
- Log embedding API usage
[ ] Step 4: Set up caching
- Redis for distributed systems
- Query result caching
- Embedding caching
[ ] Step 5: Deploy with redundancy
- Load balancing
- Health checks
- Graceful degradation

Workflow 4: Optimize for Scale (100+ Documents)

[ ] Step 1: Add metadata filtering
- Tag documents with categories
- Filter before vector search (90% reduction)
[ ] Step 2: Implement document summaries
- Generate summaries for each document
- Two-stage retrieval: doc → chunk
[ ] Step 3: Enable fast metadata indexing
- Build inverted indices for categorical fields
- B-tree indices for numeric fields
- See src-iLand/retrieval/fast_metadata_index.py

[ ] Step 4: Use async operations

retriever = index.as_retriever(use_async=True)

[ ] Step 5: Monitor and tune
- Adjust top_k based on precision/recall
- Optimize chunk size for domain

Performance Benchmarks

Reranking Impact (from reference docs)

| Embedding | Without Rerank | With Cohere Rerank | Improvement | |-----------|----------------|-------------------|-------------| | OpenAI | 0.870 hit rate | 0.927 hit rate | +6.6% | | JinaAI Base | 0.880 hit rate | 0.933 hit rate | +6.0% | | bge-large | 0.820 hit rate | 0.876 hit rate | +6.8% |

Parallel Loading Impact

Sequential: 391 seconds (32 PDF files)
Parallel (10 workers): 31 seconds
Speedup: 13x faster

Caching Impact

First run: Full processing time
Cached run: Only new/changed documents
Typical speedup: 10-100x for repeat runs

Key Reminders

Reranking Best Practices:

Always retrieve 10x candidates, rerank to top-k
Use multilingual models for Thai content
Combine with hybrid search for best results

Caching Cautions:

Clear cache when changing embedding models
Clear cache when updating document schema
Monitor cache size growth

Production Essentials:

Implement retry logic and error handling
Monitor latency, hit rate, costs
Use distributed caching (Redis)
Enable async operations for parallel retrieval

Scripts

This skill includes utility scripts in the scripts/ directory:

validate_config.py

Validates RAG configuration before deployment:

python .claude/skills/optimizing-rag/scripts/validate_config.py \
    --config-file ./config.yaml

Checks:

Chunk size appropriate for domain
Embedding model consistency
Top-k values reasonable
Reranker configuration

benchmark_performance.py

Measures retrieval performance:

python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
    --index-path ./index \
    --queries-file ./test_queries.txt

Reports:

Retrieval latency (p50, p95, p99)
Throughput (queries/second)
Memory usage

Next Steps

After optimizing:

Evaluate: Use evaluating-rag skill to measure improvements with hit rate and MRR
Monitor: Set up continuous evaluation in production
Iterate: Use metrics to guide further optimizations