Optimizing RAG Performance
Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.
When to Use This Skill
- Improving retrieval accuracy and quality
- Adding reranking to existing RAG pipeline
- Reducing latency and improving response time
- Deploying RAG to production
- Implementing caching for faster re-processing
- Scaling to handle more documents or queries
- Optimizing costs (API usage, compute resources)
Quick Wins (Low Effort, High Impact)
1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)
Benefit: Transforms any embedding into competitive performance
Effort: Minimal code change
from llama_index.postprocessor.cohere_rerank import CohereRerank
# For English content
reranker = CohereRerank(
top_n=5,
model="rerank-english-v3.0",
api_key="YOUR_COHERE_API_KEY"
)
# For Thai/multilingual content
reranker = CohereRerank(
top_n=5,
model="rerank-multilingual-v3.0",
api_key="YOUR_COHERE_API_KEY"
)
# Apply to query engine
query_engine = index.as_query_engine(
similarity_top_k=10, # Retrieve more candidates
node_postprocessors=[reranker] # Rerank to top 5
)
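Best Practice: Retrieve more candidates than you need, then rerank down to the final top_n. The example above retrieves 10 and keeps 5; widen the candidate pool (up to around 10x the final top_n) when latency and cost allow.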
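The retrieve-wide-then-rerank idea can be sketched without any library. This is a minimal illustration, not how CohereRerank works internally: `word_overlap` and `jaccard` are toy stand-ins (assumptions) for embedding similarity and a reranker model, and `retrieve_then_rerank` is a hypothetical helper name.

```python
from typing import Callable, List, Tuple


def retrieve_then_rerank(
    query: str,
    corpus: List[str],
    cheap_score: Callable[[str, str], float],
    precise_score: Callable[[str, str], float],
    similarity_top_k: int = 10,
    top_n: int = 5,
) -> List[Tuple[str, float]]:
    """Stage 1 keeps a wide candidate pool; stage 2 rescores and trims it."""
    # Stage 1: fast, approximate scoring over the whole corpus.
    candidates = sorted(
        corpus, key=lambda doc: cheap_score(query, doc), reverse=True
    )[:similarity_top_k]
    # Stage 2: slower, more precise scoring over the candidates only.
    rescored = [(doc, precise_score(query, doc)) for doc in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored[:top_n]


def word_overlap(query: str, doc: str) -> float:
    """Toy stand-in for embedding similarity: number of shared words."""
    return float(len(set(query.lower().split()) & set(doc.lower().split())))


def jaccard(query: str, doc: str) -> float:
    """Toy stand-in for a reranker: shared words divided by all words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q | d) if (q | d) else 0.0
```

The reranker only ever sees `similarity_top_k` documents, so a relevant document that stage 1 drops cannot be recovered in stage 2. That is the reason for retrieving more than the final `top_n`.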
2. Enable Parallel Loading → 13x Speedup (30 minutes)
Benefit: 391s → 31s for 32 PDF files
Effort: One parameter change
from llama_index.core import SimpleDirectoryReader
# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()
# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
num_workers=10 # Adjust based on CPU cores
)
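To see why a worker pool helps, here is a minimal stdlib sketch of loading files concurrently. It is an assumption-laden illustration rather than what SimpleDirectoryReader does internally: `load_file` and `load_directory` are hypothetical helpers, plain text files stand in for PDFs, and a thread pool is used for simplicity (a process pool is usually the better fit when parsing is CPU-bound).

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
from typing import List


def load_file(path: Path) -> str:
    """Read one file; a real loader would parse a PDF here instead."""
    return path.read_text(encoding="utf-8")


def load_directory(input_dir: str, num_workers: int = 4) -> List[str]:
    """Load every file in input_dir using a pool of worker threads."""
    paths = sorted(p for p in Path(input_dir).iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # pool.map returns results in the same order as `paths`.
        return list(pool.map(load_file, paths))
```

The achievable speedup depends on file sizes, disk throughput, and core count, so the 13x figure above should be treated as one measurement rather than a guarantee.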
3. Optimize Batch Size → Faster Embeddings (15 minutes)
Benefit: Reduced API calls, better throughput
Effort: Configuration change
from llama_index.embeddings.openai import OpenAIEmbedding
embed_model = OpenAIEmbedding(
model="text-embedding-3-small",
embed_batch_size=100 # Up from default 10
)
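The effect of batch size on request count is simple arithmetic. The sketch below assumes one embedding request per batch (an assumption about the client, not a documented guarantee); `batched` is a hypothetical helper and the chunk count is made up for illustration.

```python
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Illustrative workload: 1,050 text chunks to embed.
chunks = [f"chunk {i}" for i in range(1050)]

# One request per batch: ceil(1050 / 10) vs ceil(1050 / 100).
calls_with_batch_10 = sum(1 for _ in batched(chunks, 10))
calls_with_batch_100 = sum(1 for _ in batched(chunks, 100))
```

Larger batches cut per-request overhead, but provider limits on request size still apply, so the batch size is worth checking against the embedding API's documented limits.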
Quick Decision Guide
Reranker Selection
- Best quality → CohereRerank (API-based, best performance)
- Best open-source → bge-reranker-large (local, free)
- Cost-effective → SentenceTransformerRerank (local, fast)
- Multi-language → CohereRerank with model rerank-multilingual-v3.0
Caching Strategy
- Development → Local cache (pipeline.persist)
- Production → Redis cache (distributed)
- When to clear → Model changes, schema updates
Production Deployment
- Small scale (<1M docs) → Serverless (auto-scaling)
- Large scale → Container-based (consistent performance)
- Multi-tenant → Collection isolation + metadata filtering
Optimization Patterns
Pattern 1: Add Best-Practice Reranking
from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker
# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
model="BAAI/bge-reranker-large",
top_n=5
)
query_engine = index.as_query_engine(
similarity_top_k=10,
node_postprocessors=[reranker]
)
Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR
Pattern 2: Pipeline Caching (Local)
from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding
# Create pipeline with transformations
pipeline = IngestionPipeline(
transformations=[
SentenceSplitter(chunk_size=512),
OpenAIEmbedding()
]
)
# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")
# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents) # Only processes new/changed docs
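The reason re-runs only touch new or changed documents can be illustrated with a content-hash cache. This is a simplified sketch under stated assumptions, not the IngestionPipeline implementation: `TransformCache` and `fake_embed` are hypothetical names, and the cache key is assumed to combine the input text with the transformation settings.

```python
import hashlib
import json
from typing import Callable, Dict, List


class TransformCache:
    """Caches results keyed by input text plus transformation settings."""

    def __init__(self) -> None:
        self._store: Dict[str, List[float]] = {}
        self.misses = 0  # number of times the expensive step actually ran

    @staticmethod
    def make_key(text: str, config: Dict[str, object]) -> str:
        # sort_keys makes the key independent of dict insertion order.
        payload = json.dumps({"text": text, "config": config}, sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def get_or_compute(
        self,
        text: str,
        config: Dict[str, object],
        compute: Callable[[str], List[float]],
    ) -> List[float]:
        key = self.make_key(text, config)
        if key not in self._store:
            self.misses += 1
            self._store[key] = compute(text)
        return self._store[key]


def fake_embed(text: str) -> List[float]:
    """Placeholder for an embedding call (assumption for illustration)."""
    return [float(len(text))]
```

Because the settings are part of the key, switching the embedding model or chunk size produces new keys, and old entries are never served for the new configuration. Stale entries still occupy storage, which is why clearing the cache after such changes remains worthwhile.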
Pattern 3: Redis Cache (Production)
from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore
# Distributed caching
ingest_cache = IngestionCache(
cache=RedisKVStore.from_host_and_port(
host="redis-server",
port=6379
),
collection="rag_pipeline_cache"
)
pipeline = IngestionPipeline(
transformations=[...],
cache=ingest_cache
)
# Cache shared across instances
nodes = pipeline.run(documents=documents)
Pattern 4: Advanced Retrieval Strategies
Metadata Pre-filtering (Sub-50ms):
from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter
# Filter before vector search (90% reduction possible)
filters = MetadataFilters(
filters=[
ExactMatchFilter(key="category", value="technical"),
ExactMatchFilter(key="year", value="2024")
]
)
query_engine = index.as_query_engine(
filters=filters, # Narrow search space first
similarity_top_k=5
)
Document Summary Retrieval (For 100+ docs):
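from llama_index.core import DocumentSummaryIndex, get_response_synthesizer
# Synthesizer used to generate each document's summary
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")
# Two-stage: document-level → chunk-level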
summary_index = DocumentSummaryIndex.from_documents(
documents,
response_synthesizer=response_synthesizer
)
retriever = summary_index.as_retriever(similarity_top_k=3)
Chunk Decoupling (Precision + Context):
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor
# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
window_size=3, # Sentences before/after
window_metadata_key="window"
)
# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
target_metadata_key="window"
)
query_engine = index.as_query_engine(
node_postprocessors=[postprocessor],
similarity_top_k=6
)
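What chunk decoupling stores per node can be shown with a small stdlib sketch. It is a simplified illustration, not the SentenceWindowNodeParser implementation: `build_sentence_windows` is a hypothetical helper, and the regex splitter assumes sentences end with ".", "!" or "?" (which does not hold for Thai text, where a language-aware splitter would be needed).

```python
import re
from typing import Dict, List


def build_sentence_windows(text: str, window_size: int = 3) -> List[Dict[str, str]]:
    """Return one node per sentence, each carrying its surrounding window."""
    # Naive splitter: break after sentence-ending punctuation.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # what gets embedded and matched
                "window": " ".join(sentences[lo:hi]),  # what the LLM sees
            }
        )
    return nodes
```

Embedding the single sentence keeps the match precise, while swapping in the `window` text at synthesis time gives the LLM enough surrounding context to answer.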
Pattern 5: Multiple Postprocessors Chain
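from llama_index.core.postprocessor import (
    MetadataReplacementPostProcessor,
    SimilarityPostprocessor,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank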
# Progressive refinement
query_engine = index.as_query_engine(
similarity_top_k=20,
node_postprocessors=[
SimilarityPostprocessor(similarity_cutoff=0.7), # Filter low scores
CohereRerank(top_n=10), # Rerank top candidates
MetadataReplacementPostProcessor(...) # Expand context
]
)
Your Codebase Integration
For src/ Pipeline
Add Reranking to All 7 Strategies:
- src/10_basic_query_engine.py → Add CohereRerank
- src/16_hybrid_search.py → Add reranker after fusion
- All other strategies: Consistent reranking layer
Enable Caching:
- src/09_enhanced_batch_embeddings.py → Add pipeline caching
- src/02_prep_doc_for_embedding.py → Cache preprocessing
For src-iLand/ Pipeline
Thai-Optimized Reranking:
# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
top_n=5,
model="rerank-multilingual-v3.0" # Thai support
)
Fast Metadata Filtering (Already implemented):
- src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
- Inverted indices for: จังหวัด (province), อำเภอ (district), ประเภทโฉนด (deed type)
- B-tree indices for: area, coordinates
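The two index types named above can be sketched with the standard library. This is a hypothetical illustration and not the contents of fast_metadata_index.py: `MetadataIndex` and the field names `province` and `area` are assumptions, and a sorted list with `bisect` is swapped in for a B-tree, which gives the same range-query behavior at small scale.

```python
import bisect
from collections import defaultdict
from typing import Dict, List, Set, Tuple


class MetadataIndex:
    """Inverted index for categorical fields, sorted index for numeric ones."""

    def __init__(self) -> None:
        self._inverted: Dict[Tuple[str, str], Set[int]] = defaultdict(set)
        self._numeric: Dict[str, List[Tuple[float, int]]] = defaultdict(list)

    def add(self, doc_id: int, categorical: Dict[str, str],
            numeric: Dict[str, float]) -> None:
        """Index one document; doc_id is assumed to be a non-negative int."""
        for field, value in categorical.items():
            self._inverted[(field, value)].add(doc_id)
        for field, value in numeric.items():
            # insort keeps each numeric list ordered by (value, doc_id).
            bisect.insort(self._numeric[field], (value, doc_id))

    def match(self, field: str, value: str) -> Set[int]:
        """Doc ids whose categorical field equals value."""
        return set(self._inverted.get((field, value), set()))

    def in_range(self, field: str, low: float, high: float) -> Set[int]:
        """Doc ids whose numeric field lies in [low, high]."""
        entries = self._numeric.get(field, [])
        start = bisect.bisect_left(entries, (low, -1))
        end = bisect.bisect_right(entries, (high, float("inf")))
        return {doc_id for _, doc_id in entries[start:end]}
```

Intersecting the id sets from several filters yields the candidate set, and only those candidates need to go through vector search.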
Batch Processing Optimization:
- src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
- src-iLand/data_processing/ → Enable parallel loading
Detailed References
Load these when you need comprehensive details:
- reference-reranking.md: Complete reranking guide
  - 6 reranker models with benchmarks
  - Multi-language reranking
  - Cost-performance trade-offs
  - Node postprocessor types
- reference-production.md: Production optimization patterns
  - Ingestion pipeline caching (local & Redis)
  - Parallel processing (13x speedup)
  - Vector store integration
  - Multi-tenancy patterns
  - Deployment architectures
  - Error handling and reliability
- reference-advanced-retrieval.md: Advanced retrieval strategies
  - Document summary retrieval
  - Recursive retrieval
  - Chunk decoupling
  - Sub-question decomposition
  - Fusion retrieval
  - Auto retriever
Common Workflows
Workflow 1: Add Reranking to Existing Pipeline
- [ ] Step 1: Choose reranker
  - Load reference-reranking.md for comparison
  - For Thai: CohereRerank multilingual-v3.0
  - For cost: bge-reranker-large
- [ ] Step 2: Install dependencies
    pip install llama-index-postprocessor-cohere-rerank
    # OR
    pip install llama-index-postprocessor-flag-embedding-reranker
- [ ] Step 3: Update query engine
  - Change similarity_top_k from 5 → 10
  - Add reranker with top_n=5
- [ ] Step 4: Test impact
  - Compare retrieval quality before/after
  - Expected: 5-15% hit rate improvement
- [ ] Step 5: Deploy to all strategies
  - Apply consistently across retrievers
Workflow 2: Enable Production Caching
- [ ] Step 1: Choose caching backend
  - Development → Local cache
  - Production → Redis cache
- [ ] Step 2: Wrap pipeline
    pipeline = IngestionPipeline(
        transformations=[splitter, embedder],
        # cache=... (local or Redis)
    )
- [ ] Step 3: Initial run (builds cache)
    nodes = pipeline.run(documents=documents)
    pipeline.persist("./cache")  # Local only
- [ ] Step 4: Subsequent runs (uses cache)
  - Only processes new/changed documents
  - Massive speedup for re-runs
- [ ] Step 5: Monitor cache size
  - Clear when storage grows too large
  - Clear on model/schema changes
Workflow 3: Deploy to Production
- [ ] Step 1: Review production checklist
  - Load reference-production.md for full guide
- [ ] Step 2: Implement error handling
  - Retry logic for API calls
  - Fallback strategies for failures
  - Circuit breaker pattern
- [ ] Step 3: Add monitoring
  - Track retrieval latency (p50, p95, p99)
  - Monitor hit rate and MRR
  - Log embedding API usage
- [ ] Step 4: Set up caching
  - Redis for distributed systems
  - Query result caching
  - Embedding caching
- [ ] Step 5: Deploy with redundancy
  - Load balancing
  - Health checks
  - Graceful degradation
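The retry and fallback items in Step 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not a production-ready client: `call_with_retry` is a hypothetical helper, the delays double on each attempt, and a circuit breaker (which would stop calling a failing service for a cool-down period) is deliberately left out.

```python
import time
from typing import Callable, Optional, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    fallback: Optional[Callable[[], T]] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying with exponential backoff; use fallback if all fail."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == max_attempts:
                if fallback is not None:
                    return fallback()  # e.g. serve a cached answer
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In practice `retry_on` would be narrowed to the transient errors raised by the API client in use, so that permanent failures such as invalid credentials are not retried.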
Workflow 4: Optimize for Scale (100+ Documents)
- [ ] Step 1: Add metadata filtering
  - Tag documents with categories
  - Filter before vector search (90% reduction)
- [ ] Step 2: Implement document summaries
  - Generate summaries for each document
  - Two-stage retrieval: doc → chunk
- [ ] Step 3: Enable fast metadata indexing
  - Build inverted indices for categorical fields
  - B-tree indices for numeric fields
  - See src-iLand/retrieval/fast_metadata_index.py
- [ ] Step 4: Use async operations
    retriever = index.as_retriever(use_async=True)
- [ ] Step 5: Monitor and tune
  - Adjust top_k based on precision/recall
  - Optimize chunk size for domain
Performance Benchmarks
Reranking Impact (from reference docs)
| Embedding | Without Rerank | With Cohere Rerank | Improvement |
|-----------|----------------|--------------------|-------------|
| OpenAI | 0.870 hit rate | 0.927 hit rate | +6.6% |
| JinaAI Base | 0.880 hit rate | 0.933 hit rate | +6.0% |
| bge-large | 0.820 hit rate | 0.876 hit rate | +6.8% |
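For reference, hit rate and MRR can be computed as in the sketch below. It assumes exactly one expected document id per query (an assumption about the evaluation setup), and the helper names are hypothetical. `relative_improvement` reproduces the Improvement column, which is relative to the baseline rather than a difference in percentage points.

```python
from typing import Sequence


def hit_rate(ranked_ids: Sequence[Sequence[str]], expected_ids: Sequence[str]) -> float:
    """Fraction of queries whose expected id appears in the retrieved list."""
    if not expected_ids:
        raise ValueError("need at least one query")
    hits = sum(1 for ranked, expected in zip(ranked_ids, expected_ids)
               if expected in ranked)
    return hits / len(expected_ids)


def mean_reciprocal_rank(ranked_ids: Sequence[Sequence[str]],
                         expected_ids: Sequence[str]) -> float:
    """Average of 1/rank of the expected id; 0 when it is not retrieved."""
    if not expected_ids:
        raise ValueError("need at least one query")
    total = 0.0
    for ranked, expected in zip(ranked_ids, expected_ids):
        if expected in ranked:
            total += 1.0 / (list(ranked).index(expected) + 1)
    return total / len(expected_ids)


def relative_improvement(before: float, after: float) -> float:
    """Relative change, e.g. 0.870 -> 0.927 is about 0.066 (6.6%)."""
    return (after - before) / before
```

Hit rate only asks whether the right document was retrieved at all, while MRR also rewards ranking it near the top, so a reranker can raise MRR even when hit rate stays flat.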
Parallel Loading Impact
- Sequential: 391 seconds (32 PDF files)
- Parallel (10 workers): 31 seconds
- Speedup: 13x faster
Caching Impact
- First run: Full processing time
- Cached run: Only new/changed documents
- Typical speedup: 10-100x for repeat runs
Key Reminders
Reranking Best Practices:
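- Retrieve more candidates than you need (the examples use 10 for a final top 5; up to around 10x when latency allows), then rerank to the final top-k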
- Use multilingual models for Thai content
- Combine with hybrid search for best results
Caching Cautions:
- Clear cache when changing embedding models
- Clear cache when updating document schema
- Monitor cache size growth
Production Essentials:
- Implement retry logic and error handling
- Monitor latency, hit rate, costs
- Use distributed caching (Redis)
- Enable async operations for parallel retrieval
Scripts
This skill includes utility scripts in the scripts/ directory:
validate_config.py
Validates RAG configuration before deployment:
python .claude/skills/optimizing-rag/scripts/validate_config.py \
--config-file ./config.yaml
Checks:
- Chunk size appropriate for domain
- Embedding model consistency
- Top-k values reasonable
- Reranker configuration
benchmark_performance.py
Measures retrieval performance:
python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
--index-path ./index \
--queries-file ./test_queries.txt
Reports:
- Retrieval latency (p50, p95, p99)
- Throughput (queries/second)
- Memory usage
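The latency percentiles in the report can be computed as in the sketch below. It is an illustration only and not the contents of benchmark_performance.py: `measure_latencies` and `percentile` are hypothetical helpers, and the nearest-rank percentile definition is an assumption (other tools interpolate between samples).

```python
import math
import time
from typing import Callable, List, Sequence


def measure_latencies(run_query: Callable[[str], object],
                      queries: Sequence[str]) -> List[float]:
    """Time each query in seconds using a monotonic clock."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append(time.perf_counter() - start)
    return latencies


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: smallest sample with pct% of data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Tail percentiles such as p99 need enough samples to be meaningful; with only a handful of test queries, p95 and p99 both collapse to the slowest query.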
Next Steps
After optimizing:
- Evaluate: Use evaluating-rag skill to measure improvements with hit rate and MRR
- Monitor: Set up continuous evaluation in production
- Iterate: Use metrics to guide further optimizations