optimizing-rag

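Optimizes RAG performance with reranking, caching, parallel processing, and production deployment patterns. Use when improving retrieval quality, adding a reranker, deploying to production, implementing a caching strategy, or optimizing for scale and latency.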

Author: jakexiaohub (GitHub)

Optimizing RAG Performance

Guide for improving RAG systems with reranking, caching, production optimizations, and deployment patterns. Focus on quick wins and production-grade improvements.

When to Use This Skill

  • Improving retrieval accuracy and quality
  • Adding reranking to existing RAG pipeline
  • Reducing latency and improving response time
  • Deploying RAG to production
  • Implementing caching for faster re-processing
  • Scaling to handle more documents or queries
  • Optimizing costs (API usage, compute resources)

Quick Wins (Low Effort, High Impact)

1. Add Reranking → 5-15% Hit Rate Improvement (1-2 hours)

Benefit: Makes almost any embedding model competitive on retrieval quality
Effort: Minimal code change

from llama_index.postprocessor.cohere_rerank import CohereRerank

# For English content
reranker = CohereRerank(
    top_n=5,
    model="rerank-english-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# For Thai/multilingual content
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0",
    api_key="YOUR_COHERE_API_KEY"
)

# Apply to query engine
query_engine = index.as_query_engine(
    similarity_top_k=10,  # Retrieve more candidates
    node_postprocessors=[reranker]  # Rerank to top 5
)

Best Practice: Retrieve more candidates than you intend to keep (here similarity_top_k=10 for top_n=5), then rerank down to the final top_n
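Why the size of the candidate pool matters can be shown without any API calls. The following is a minimal sketch with made-up scores: `retrieve_then_rerank` and both score tables are hypothetical stand-ins for the vector search and the reranker, not LlamaIndex APIs.

```python
from typing import Callable, Dict, List


def retrieve_then_rerank(
    first_stage_scores: Dict[str, float],
    rerank_score: Callable[[str], float],
    similarity_top_k: int,
    top_n: int,
) -> List[str]:
    """Keep the similarity_top_k best first-stage hits, reorder them, keep top_n."""
    candidates = sorted(
        first_stage_scores, key=first_stage_scores.get, reverse=True
    )[:similarity_top_k]
    return sorted(candidates, key=rerank_score, reverse=True)[:top_n]


# Toy scores (assumption): the embedding stage ranks the relevant node "d7" only 7th.
embedding_scores = {f"d{i}": 1.0 - 0.05 * i for i in range(1, 11)}
reranker_scores = {node_id: 0.1 for node_id in embedding_scores}
reranker_scores["d7"] = 0.9  # the reranker recognises d7 as the best match

narrow = retrieve_then_rerank(
    embedding_scores, reranker_scores.get, similarity_top_k=5, top_n=5
)
wide = retrieve_then_rerank(
    embedding_scores, reranker_scores.get, similarity_top_k=10, top_n=5
)

print("d7" in narrow)  # False: d7 never reached the reranker
print(wide[0])         # d7
```

A reranker can only promote nodes it is shown, so a relevant node that falls outside similarity_top_k is lost regardless of reranker quality.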

2. Enable Parallel Loading → 13x Speedup (30 minutes)

Benefit: 391s → 31s for 32 PDF files
Effort: One parameter change

from llama_index.core import SimpleDirectoryReader

# Sequential (slow)
documents = SimpleDirectoryReader(input_dir="./data").load_data()

# Parallel (13x faster)
documents = SimpleDirectoryReader(input_dir="./data").load_data(
    num_workers=10  # Adjust based on CPU cores
)

3. Optimize Batch Size → Faster Embeddings (15 minutes)

Benefit: Reduced API calls, better throughput
Effort: Configuration change

from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(
    model="text-embedding-3-small",
    embed_batch_size=100  # Texts sent per API request (base-class default is 10)
)
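The saving comes from simple arithmetic: the number of embedding requests is roughly the chunk count divided by the batch size, rounded up. The sketch below shows this in plain Python. The helper names `batched` and `request_count` and the count of 1,050 chunks are illustrative assumptions, not part of LlamaIndex.

```python
import math
from typing import Iterator, List, Sequence


def batched(items: Sequence[str], batch_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


def request_count(num_chunks: int, batch_size: int) -> int:
    """Number of embedding requests needed for num_chunks texts."""
    return math.ceil(num_chunks / batch_size)


chunks = [f"chunk {i}" for i in range(1050)]  # hypothetical corpus size
print(request_count(len(chunks), 10))   # 105
print(request_count(len(chunks), 100))  # 11
```

Larger batches reduce per-request overhead, but provider limits on request size and tokens per request still apply, so the best value depends on chunk length.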

Quick Decision Guide

Reranker Selection

  • Best quality → CohereRerank (API-based, best performance)
  • Best open-source → bge-reranker-large (local, free)
  • Cost-effective → SentenceTransformerRerank (local, fast)
  • Multi-language → CohereRerank multilingual-v3.0

Caching Strategy

  • Development → Local cache (pipeline.persist)
  • Production → Redis cache (distributed)
  • When to clear → Model changes, schema updates

Production Deployment

  • Small scale (<1M docs) → Serverless (auto-scaling)
  • Large scale → Container-based (consistent performance)
  • Multi-tenant → Collection isolation + metadata filtering

Optimization Patterns

Pattern 1: Add Best-Practice Reranking

from llama_index.postprocessor.flag_embedding_reranker import FlagEmbeddingReranker

# Open-source alternative to Cohere
reranker = FlagEmbeddingReranker(
    model="BAAI/bge-reranker-large",
    top_n=5
)

query_engine = index.as_query_engine(
    similarity_top_k=10,
    node_postprocessors=[reranker]
)

Performance: OpenAI + bge-reranker-large: 0.910 hit rate, 0.856 MRR

Pattern 2: Pipeline Caching (Local)

from llama_index.core.ingestion import IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter
from llama_index.embeddings.openai import OpenAIEmbedding

# Create pipeline with transformations
pipeline = IngestionPipeline(
    transformations=[
        SentenceSplitter(chunk_size=512),
        OpenAIEmbedding()
    ]
)

# Run and cache
nodes = pipeline.run(documents=documents)
pipeline.persist("./pipeline_cache")

# Subsequent runs reuse cached results
pipeline.load("./pipeline_cache")
nodes = pipeline.run(documents=documents)  # Only processes new/changed docs

Pattern 3: Redis Cache (Production)

from llama_index.core.ingestion import IngestionPipeline, IngestionCache
from llama_index.storage.kvstore.redis import RedisKVStore

# Distributed caching
ingest_cache = IngestionCache(
    cache=RedisKVStore.from_host_and_port(
        host="redis-server",
        port=6379
    ),
    collection="rag_pipeline_cache"
)

pipeline = IngestionPipeline(
    transformations=[...],
    cache=ingest_cache
)

# Cache shared across instances
nodes = pipeline.run(documents=documents)

Pattern 4: Advanced Retrieval Strategies

Metadata Pre-filtering (Sub-50ms):

from llama_index.core.vector_stores import MetadataFilters, ExactMatchFilter

# Filter before vector search (can shrink the search space by up to 90%)
filters = MetadataFilters(
    filters=[
        ExactMatchFilter(key="category", value="technical"),
        ExactMatchFilter(key="year", value="2024")
    ]
)

query_engine = index.as_query_engine(
    filters=filters,  # Narrow search space first
    similarity_top_k=5
)

Document Summary Retrieval (For 100+ docs):

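from llama_index.core import DocumentSummaryIndex, get_response_synthesizer

# Used at build time to generate one summary per document
response_synthesizer = get_response_synthesizer(response_mode="tree_summarize")

# Two-stage: document-level → chunk-level
summary_index = DocumentSummaryIndex.from_documents(
    documents,
    response_synthesizer=response_synthesizer
)

# Match the query against summary embeddings, then return the chunks
# of the 3 best-matching documents
retriever = summary_index.as_retriever(
    retriever_mode="embedding",
    similarity_top_k=3
)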

Chunk Decoupling (Precision + Context):

from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

# Embed sentences, retrieve with context windows
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,  # Sentences before/after
    window_metadata_key="window"
)

# Replace with window for synthesis
postprocessor = MetadataReplacementPostProcessor(
    target_metadata_key="window"
)

# `index` must be built from nodes produced by node_parser above
query_engine = index.as_query_engine(
    node_postprocessors=[postprocessor],
    similarity_top_k=6
)

Pattern 5: Multiple Postprocessors Chain

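from llama_index.core.postprocessor import (
    MetadataReplacementPostProcessor,
    SimilarityPostprocessor,
)
from llama_index.postprocessor.cohere_rerank import CohereRerank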

# Progressive refinement
query_engine = index.as_query_engine(
    similarity_top_k=20,
    node_postprocessors=[
        SimilarityPostprocessor(similarity_cutoff=0.7),  # Filter low scores
        CohereRerank(top_n=10),                          # Rerank top candidates
        MetadataReplacementPostProcessor(...)             # Expand context
    ]
)

Your Codebase Integration

For src/ Pipeline

Add Reranking to All 7 Strategies:

  • src/10_basic_query_engine.py → Add CohereRerank
  • src/16_hybrid_search.py → Add reranker after fusion
  • All other strategies: Consistent reranking layer

Enable Caching:

  • src/09_enhanced_batch_embeddings.py → Add pipeline caching
  • src/02_prep_doc_for_embedding.py → Cache preprocessing

For src-iLand/ Pipeline

Thai-Optimized Reranking:

# In src-iLand/retrieval/retrievers/
reranker = CohereRerank(
    top_n=5,
    model="rerank-multilingual-v3.0"  # Thai support
)

Fast Metadata Filtering (Already implemented):

  • src-iLand/retrieval/fast_metadata_index.py → Sub-50ms filtering
  • Inverted indices for: จังหวัด, อำเภอ, ประเภทโฉนด
  • B-tree indices for: area, coordinates
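The idea behind this kind of index can be sketched in a few lines. This is not the code in fast_metadata_index.py: the class name, field names, and sample records are hypothetical, and a sorted list searched with `bisect` stands in for a B-tree.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, Hashable, List, Set, Tuple


class MetadataIndex:
    """Toy in-memory index: inverted sets for categories, sorted lists for numbers."""

    def __init__(self) -> None:
        # (field, value) -> ids of documents carrying that value
        self._inverted: Dict[Tuple[str, Hashable], Set[str]] = defaultdict(set)
        # field -> (sorted values, doc ids in the same order)
        self._numeric: Dict[str, Tuple[List[float], List[str]]] = defaultdict(
            lambda: ([], [])
        )

    def add(
        self,
        doc_id: str,
        categorical: Dict[str, Hashable],
        numeric: Dict[str, float],
    ) -> None:
        for field, value in categorical.items():
            self._inverted[(field, value)].add(doc_id)
        for field, value in numeric.items():
            values, ids = self._numeric[field]
            pos = bisect_right(values, value)  # keep both lists sorted by value
            values.insert(pos, value)
            ids.insert(pos, doc_id)

    def match(self, field: str, value: Hashable) -> Set[str]:
        """Exact match on a categorical field."""
        return set(self._inverted.get((field, value), set()))

    def in_range(self, field: str, low: float, high: float) -> Set[str]:
        """Documents whose numeric field lies in [low, high]."""
        if field not in self._numeric:
            return set()
        values, ids = self._numeric[field]
        return set(ids[bisect_left(values, low):bisect_right(values, high)])


index = MetadataIndex()
index.add("deed-1", {"province": "Chiang Mai", "deed_type": "chanote"}, {"area": 120.0})
index.add("deed-2", {"province": "Chiang Mai", "deed_type": "nor_sor_3"}, {"area": 800.0})
index.add("deed-3", {"province": "Phuket", "deed_type": "chanote"}, {"area": 450.0})

candidates = index.match("province", "Chiang Mai") & index.in_range("area", 100.0, 500.0)
print(sorted(candidates))  # ['deed-1']
```

Set intersection over small id sets is cheap, so only the surviving candidates need to go through vector search.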

Batch Processing Optimization:

  • src-iLand/docs_embedding/batch_embedding.py → Increase batch_size
  • src-iLand/data_processing/ → Enable parallel loading

Detailed References

Load these when you need comprehensive details:

  • reference-reranking.md: Complete reranking guide

    • 6 reranker models with benchmarks
    • Multi-language reranking
    • Cost-performance trade-offs
    • Node postprocessor types
  • reference-production.md: Production optimization patterns

    • Ingestion pipeline caching (local & Redis)
    • Parallel processing (13x speedup)
    • Vector store integration
    • Multi-tenancy patterns
    • Deployment architectures
    • Error handling and reliability
  • reference-advanced-retrieval.md: Advanced retrieval strategies

    • Document summary retrieval
    • Recursive retrieval
    • Chunk decoupling
    • Sub-question decomposition
    • Fusion retrieval
    • Auto retriever

Common Workflows

Workflow 1: Add Reranking to Existing Pipeline

  • [ ] Step 1: Choose reranker

    • Load reference-reranking.md for comparison
    • For Thai: CohereRerank multilingual-v3.0
    • For cost: bge-reranker-large
  • [ ] Step 2: Install dependencies

    pip install llama-index-postprocessor-cohere-rerank
    # OR
    pip install llama-index-postprocessor-flag-embedding-reranker
    
  • [ ] Step 3: Update query engine

    • Change similarity_top_k from 5 → 10
    • Add reranker with top_n=5
  • [ ] Step 4: Test impact

    • Compare retrieval quality before/after
    • Expected: 5-15% hit rate improvement
  • [ ] Step 5: Deploy to all strategies

    • Apply consistently across retrievers
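For the before/after comparison in Step 4, hit rate and MRR can be computed from ranked result lists. The sketch below assumes one known relevant node per query; the function name and the toy results are illustrative, not output from a real pipeline.

```python
from typing import Dict, List, Tuple


def hit_rate_and_mrr(
    results: Dict[str, List[str]],
    expected: Dict[str, str],
) -> Tuple[float, float]:
    """results: query -> ranked node ids; expected: query -> relevant node id."""
    total = len(expected)
    if total == 0:
        return 0.0, 0.0
    hits = 0
    reciprocal_rank_sum = 0.0
    for query, relevant_id in expected.items():
        ranked = results.get(query, [])
        if relevant_id in ranked:
            hits += 1
            # Rank is 1-based, so the first position contributes 1.0
            reciprocal_rank_sum += 1.0 / (ranked.index(relevant_id) + 1)
    return hits / total, reciprocal_rank_sum / total


# Toy data (assumption): four queries, top-2 results before and after reranking
expected = {"q1": "n1", "q2": "n7", "q3": "n4", "q4": "n9"}
before = {"q1": ["n1", "n2"], "q2": ["n3", "n7"], "q3": ["n5", "n6"], "q4": ["n8", "n9"]}
after = {"q1": ["n1", "n2"], "q2": ["n7", "n3"], "q3": ["n6", "n4"], "q4": ["n9", "n8"]}

print(hit_rate_and_mrr(before, expected))  # (0.75, 0.5)
print(hit_rate_and_mrr(after, expected))   # (1.0, 0.875)
```

Hit rate only checks whether the relevant node appears at all, while MRR also rewards placing it higher, so reranking often moves MRR more than hit rate.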

Workflow 2: Enable Production Caching

  • [ ] Step 1: Choose caching backend

    • Development → Local cache
    • Production → Redis cache
  • [ ] Step 2: Wrap pipeline

    pipeline = IngestionPipeline(
        transformations=[splitter, embedder],
        # cache=... (local or Redis)
    )
    
  • [ ] Step 3: Initial run (builds cache)

    nodes = pipeline.run(documents=documents)
    pipeline.persist("./cache")  # Local only
    
  • [ ] Step 4: Subsequent runs (uses cache)

    • Only processes new/changed documents
    • Massive speedup for re-runs
  • [ ] Step 5: Monitor cache size

    • Clear when storage grows too large
    • Clear on model/schema changes
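One way to make "clear on model/schema changes" less error-prone is to fold those settings into the cache key, so stale entries are simply never hit. The sketch below is a generic illustration, not LlamaIndex's internal hashing scheme; the `cache_key` function and its parameters are assumptions.

```python
import hashlib
import json


def cache_key(text: str, embed_model: str, chunk_size: int, schema_version: str) -> str:
    """Hash the chunk text together with every setting that changes the output."""
    payload = json.dumps(
        {
            "text": text,
            "embed_model": embed_model,
            "chunk_size": chunk_size,
            "schema_version": schema_version,
        },
        sort_keys=True,       # stable key order gives a stable hash
        ensure_ascii=False,   # keep Thai text as-is before encoding
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


old_key = cache_key("land deed text", "text-embedding-3-small", 512, "v1")
same_key = cache_key("land deed text", "text-embedding-3-small", 512, "v1")
new_model_key = cache_key("land deed text", "text-embedding-3-large", 512, "v1")

print(old_key == same_key)       # True: unchanged input reuses the cached entry
print(old_key == new_model_key)  # False: a model change misses the old entry
```

Old entries still occupy storage after a model change, so the size monitoring in Step 5 remains necessary.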

Workflow 3: Deploy to Production

  • [ ] Step 1: Review production checklist

    • Load reference-production.md for full guide
  • [ ] Step 2: Implement error handling

    • Retry logic for API calls
    • Fallback strategies for failures
    • Circuit breaker pattern
  • [ ] Step 3: Add monitoring

    • Track retrieval latency (p50, p95, p99)
    • Monitor hit rate and MRR
    • Log embedding API usage
  • [ ] Step 4: Set up caching

    • Redis for distributed systems
    • Query result caching
    • Embedding caching
  • [ ] Step 5: Deploy with redundancy

    • Load balancing
    • Health checks
    • Graceful degradation
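Step 2's retry and fallback logic might look like the following minimal sketch. `call_with_retry`, `flaky_rerank`, and `reranker_down` are hypothetical names for illustration; in real code the `except` clause should be narrowed to the transient errors your API client actually raises.

```python
import random
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def call_with_retry(
    primary: Callable[[], T],
    fallback: Optional[Callable[[], T]] = None,
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Try primary up to max_attempts times, then use fallback if one is given."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be >= 1")
    last_error: Optional[Exception] = None
    for attempt in range(max_attempts):
        try:
            return primary()
        except Exception as error:  # narrow this to transient errors in real code
            last_error = error
            if attempt < max_attempts - 1:
                # Exponential backoff with up to 10% jitter
                jitter = 1.0 + 0.1 * random.random()
                sleep(base_delay * (2 ** attempt) * jitter)
    if fallback is not None:
        return fallback()
    raise RuntimeError(f"all {max_attempts} attempts failed") from last_error


calls = {"count": 0}


def flaky_rerank() -> str:
    """Fails twice, then succeeds (simulated transient outage)."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise TimeoutError("reranker timed out")
    return "reranked results"


def reranker_down() -> str:
    raise ConnectionError("reranker unavailable")


def no_wait(seconds: float) -> None:
    """Skip real sleeping in this demo."""


recovered = call_with_retry(flaky_rerank, sleep=no_wait)
degraded = call_with_retry(
    reranker_down, fallback=lambda: "vector-only results", sleep=no_wait
)
print(recovered)  # reranked results
print(degraded)   # vector-only results
```

Returning un-reranked vector results as the fallback is one form of graceful degradation: quality drops slightly, but the request still succeeds.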

Workflow 4: Optimize for Scale (100+ Documents)

  • [ ] Step 1: Add metadata filtering

    • Tag documents with categories
    • Filter before vector search (can shrink the search space by up to 90%)
  • [ ] Step 2: Implement document summaries

    • Generate summaries for each document
    • Two-stage retrieval: doc → chunk
  • [ ] Step 3: Enable fast metadata indexing

    • Build inverted indices for categorical fields
    • B-tree indices for numeric fields
    • See src-iLand/retrieval/fast_metadata_index.py
  • [ ] Step 4: Use async operations

    retriever = index.as_retriever(use_async=True)
    
  • [ ] Step 5: Monitor and tune

    • Adjust top_k based on precision/recall
    • Optimize chunk size for domain

Performance Benchmarks

Reranking Impact (from reference docs)

| Embedding | Without Rerank | With Cohere Rerank | Improvement |
|-----------|----------------|--------------------|-------------|
| OpenAI | 0.870 hit rate | 0.927 hit rate | +6.6% |
| JinaAI Base | 0.880 hit rate | 0.933 hit rate | +6.0% |
| bge-large | 0.820 hit rate | 0.876 hit rate | +6.8% |

Parallel Loading Impact

  • Sequential: 391 seconds (32 PDF files)
  • Parallel (10 workers): 31 seconds
  • Speedup: about 13x (391 s / 31 s ≈ 12.6)

Caching Impact

  • First run: Full processing time
  • Cached run: Only new/changed documents
  • Typical speedup: 10-100x for repeat runs

Key Reminders

Reranking Best Practices:

  • Retrieve more candidates than you keep (e.g. similarity_top_k=10 for top_n=5), then rerank to the final top-k
  • Use multilingual models for Thai content
  • Combine with hybrid search for best results

Caching Cautions:

  • Clear cache when changing embedding models
  • Clear cache when updating document schema
  • Monitor cache size growth

Production Essentials:

  • Implement retry logic and error handling
  • Monitor latency, hit rate, costs
  • Use distributed caching (Redis)
  • Enable async operations for parallel retrieval

Scripts

This skill includes utility scripts in the scripts/ directory:

validate_config.py

Validates RAG configuration before deployment:

python .claude/skills/optimizing-rag/scripts/validate_config.py \
    --config-file ./config.yaml

Checks:

  • Chunk size appropriate for domain
  • Embedding model consistency
  • Top-k values reasonable
  • Reranker configuration

benchmark_performance.py

Measures retrieval performance:

python .claude/skills/optimizing-rag/scripts/benchmark_performance.py \
    --index-path ./index \
    --queries-file ./test_queries.txt

Reports:

  • Retrieval latency (p50, p95, p99)
  • Throughput (queries/second)
  • Memory usage
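The latency figures can be reproduced by hand. The sketch below uses the nearest-rank percentile method, which is an assumption: benchmark_performance.py may compute percentiles differently. All function names here are illustrative.

```python
import time
from typing import Callable, Dict, List, Sequence


def percentile(sorted_values: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    if not sorted_values:
        raise ValueError("need at least one measurement")
    # Integer ceiling division avoids float rounding surprises
    rank = max(1, -(-pct * len(sorted_values) // 100))
    return sorted_values[rank - 1]


def summarize_latencies(latencies_ms: List[float]) -> Dict[str, float]:
    """Return p50, p95, and p99 for a list of latencies in milliseconds."""
    ordered = sorted(latencies_ms)
    return {f"p{pct}": percentile(ordered, pct) for pct in (50, 95, 99)}


def time_queries(run_query: Callable[[str], object], queries: Sequence[str]) -> List[float]:
    """Return one wall-clock latency in milliseconds per query."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)
        latencies.append((time.perf_counter() - start) * 1000.0)
    return latencies


sample = [float(ms) for ms in range(1, 101)]  # synthetic data: 1 ms .. 100 ms
print(summarize_latencies(sample))  # {'p50': 50.0, 'p95': 95.0, 'p99': 99.0}
```

In practice `run_query` would wrap something like a query engine call, and tail percentiles (p95, p99) need a few hundred samples or more before they are stable.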

Next Steps

After optimizing:

  • Evaluate: Use evaluating-rag skill to measure improvements with hit rate and MRR
  • Monitor: Set up continuous evaluation in production
  • Iterate: Use metrics to guide further optimizations