Web Search Agent Evaluations
Development assistant for running and comparing web search capabilities across CLI agents.
Overview
Evaluate 4 agents (Claude Code, Gemini, Droid, Codex) with 2 tools (builtin, You.com MCP) = 8 pairings.
Key Features:
- Headless adapters - Schema-driven CLI agent execution via @plaited/agent-eval-harness
- Flag-based architecture - Single service per agent, mode selected via environment variables
- Type-safe constants - MCP server definitions in TypeScript
- Isolated execution - Each pairing runs in its own Docker container
Architecture:
agent-schemas/- Headless adapter JSON schemasmcp-servers.ts- TypeScript MCP server constantsdocker/entrypoint- Bun shell script for runtime configscripts/- Type-safe execution and comparison CLI toolsdocker/- Container infrastructure
Quick Commands
Run Evaluations
# Full dataset (151 prompts), k=5 — all 8 agent×provider scenarios
bun run trials
# Quick smoke test (5 random prompts, single trial)
bun run trials -- --count 5 -k 1
# Specific agent or provider
bun run trials -- --agent claude-code --search-provider builtin
bun run trials -- --agent gemini --search-provider you
# Trial type presets
bun run trials -- --trial-type capability # k=10, deep exploration
bun run trials -- --trial-type regression # k=3, fast regression check
# Custom k value
bun run trials -- -k 7
# Control parallelism
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- --prompt-concurrency 4 # 4 prompts per container
# Direct Docker (manual testing)
docker compose run --rm -e SEARCH_PROVIDER=builtin claude-code
docker compose run --rm -e SEARCH_PROVIDER=you -e PROMPT_COUNT=5 gemini
Compare Results
Comparisons are written to data/comparisons/YYYY-MM-DD/.
# Latest date auto-detected
bun run compare
# Statistical analysis with bootstrap confidence intervals
bun run compare:stat
# Specific date or filter
bun run compare -- --run-date 2026-02-18
bun run compare -- --agent droid
bun run compare -- --search-provider builtin
bun run compare -- --trial-type capability
# View results
cat data/comparisons/2026-02-18/all-builtin-weighted.json | jq '.capability'
cat data/comparisons/2026-02-18/builtin-vs-you-weighted.json | jq '.headToHead.capability'
Comparison strategies:
weighted(default) - Capability, reliability, and consistency weighted scoringstatistical- Bootstrap sampling with 95% confidence intervals
Generate Report
Generate a comprehensive REPORT.md from comparison results:
# Latest date auto-detected
bun run report
# Specific date
bun run report -- --run-date 2026-02-18
# Preview without writing
bun run report -- --dry-run
Report includes:
- Executive summary with best capability, reliability, and performance
- Quality rankings with pass@k and pass^k scores
- Performance rankings (latency P50/P90/P99)
- Flakiness analysis with top flaky prompts
- MCP tool impact analysis (builtin vs MCP comparison)
- Tool call statistics (P50/P90/P99/mean per provider)
- Tool call distribution histograms
- Failing prompts list (pass@k = 0%) with query text
Output: data/comparisons/YYYY-MM-DD/REPORT.md
Calibrate Grader
Interactive wizard to sample failures and review grader accuracy. Helps distinguish between agent failures (agent got it wrong) and grader bugs (agent was correct, grader too strict).
# Interactive calibration (recommended)
bun run calibrate
Interactive prompts:
- Run date - Select from available dated runs
- Agents - Multi-select via numbers or "all"
- Search providers - Multi-select via numbers or "all"
- Sample count - Number of failures to sample (default: 5)
Output: data/calibration/{date}-{agent}-{provider}.md
What calibration reveals:
- ❌ Grader too strict - Agent gave correct answer, grader rejected valid paraphrasing
- ❌ Hint too vague - Grader can't tell good from bad answers
- ✅ Real failures - Agent genuinely gave wrong/incomplete answer
See @agent-eval-harness calibration docs for grader calibration concepts.
Parallelization
The evaluation harness supports two-level parallelization for optimal performance:
Container-Level Concurrency (-j, --concurrency)
Controls how many Docker containers (agent×provider scenarios) run simultaneously.
bun run trials # Unlimited (default, all 8 scenarios at once)
bun run trials -- -j 4 # Limit to 4 containers
bun run trials -- -j 1 # Sequential (debugging)
Use cases:
- Unlimited (default) - All scenarios at once, I/O-bound workload handles it fine
-j 4- Limit concurrency if hitting API rate limits-j 2- Conservative, for low-resource machines-j 1- Sequential execution for debugging
Prompt-Level Concurrency (--prompt-concurrency)
Controls how many prompts run in parallel within each container.
bun run trials -- --prompt-concurrency 4 # 4 prompts (moderate parallelism)
bun run trials -- --prompt-concurrency 1 # Sequential (default, safest)
bun run trials -- --prompt-concurrency 8 # 8 prompts (high memory, CI only)
How it works:
- Uses harness
-jflag with--workspace-dirfor isolation - Each prompt gets its own workspace directory
- Web searches are I/O-bound — parallel prompts maximize network bandwidth
Performance comparison:
| Config | Containers | Prompts/Container | Full (151 prompts, k=5) | |--------|-----------|-------------------|------------------------| | Default | unlimited | 1 | ~2.5 hrs | | Faster | unlimited | 4 | ~40 min | | CI (high memory) | unlimited | 8 | ~20 min |
Warning: Stream-mode agents (claude-code, droid) use ~400-500MB RSS per prompt process. With --prompt-concurrency 8 that's 3-4GB per container — OOM kills likely in Docker (see issue #45)
Prompts
Prompts live in a flat data/prompts/ directory. The format differs by search provider:
- Builtin mode: Just the query (e.g., "What are the best free icon libraries...")
- MCP mode:
"Use {server-name} and answer\n{query}"with MCP metadata
| File | Prompts | Metadata | Use With |
|------|---------|----------|----------|
| prompts.jsonl | 151 | No MCP | SEARCH_PROVIDER=builtin |
| prompts-you.jsonl | 151 | mcpServer="ydc-server", expectedTools=["you-search"] | SEARCH_PROVIDER=you |
The entrypoint automatically selects the correct prompt file based on SEARCH_PROVIDER. To run a random subset, pass PROMPT_COUNT (or --count N via CLI):
bun run trials -- --count 5 # 5 random prompts from full dataset
Results
All trial results are written to flat dated directories:
data/results/YYYY-MM-DD/
├── claude-code/
│ ├── builtin.jsonl
│ └── you.jsonl
├── gemini/
├── droid/
└── codex/
Each .jsonl line is a TrialResult:
{"id":"websearch-001","input":"...","k":5,"passRate":0.8,"passAtK":0.999,"passExpK":0.328,"trials":[...]}
Versioning:
git add data/results/ && git commit -m "feat: trial run YYYY-MM-DD"
Compare runs:
bun run compare # Latest date auto-detected
bun run compare -- --run-date 2026-02-18 # Specific date
Adding a New Agent
1. Create Headless Adapter Schema
Create agent-schemas/<agent>.json:
{
"command": ["<agent-cli>", "--flag", "{input}"],
"outputEvents": {
"match": { "path": "$.type" },
"patterns": {
"text": { "value": "text" },
"tool_call": { "value": "tool_call" },
"tool_result": { "value": "tool_result" }
}
},
"result": {
"contentPath": "$.output",
"errorPath": "$.error"
},
"mode": "stream",
"env": ["AGENT_API_KEY"]
}
Key fields:
command- CLI invocation with{input}placeholderoutputEvents.match.path- JSONPath to event type fieldpatterns- Map event types to standard namesresult.contentPath- JSONPath to extract final outputmode-"stream"(persistent) or"iterative"(new process per turn)env- Required environment variables
Test schema:
bunx @plaited/agent-eval-harness adapter:check -- \
bunx @plaited/agent-eval-harness headless --schema agent-schemas/<agent>.json
2. Create Dockerfile
Create docker/<agent>.Dockerfile:
FROM base
# Install agent CLI
RUN npm install -g <agent-cli>
# Copy entrypoint and MCP config
COPY docker/entrypoint /entrypoint
COPY mcp-servers.ts /eval/mcp-servers.ts
RUN chmod +x /entrypoint
CMD ["/entrypoint"]
Verify installation:
docker build -t test-<agent> -f docker/<agent>.Dockerfile .
docker run --rm test-<agent> <agent> --version
3. Add Docker Compose Service
Add to docker-compose.yml:
<agent>:
build:
context: .
dockerfile: docker/<agent>.Dockerfile
volumes:
- ./agent-schemas:/eval/agent-schemas:ro
- ./data:/eval/data
- ./scripts:/eval/scripts:ro
working_dir: /workspace
env_file: .env
environment:
- AGENT=<agent>
- SEARCH_PROVIDER=${SEARCH_PROVIDER:-builtin}
4. Update TypeScript Entrypoint
Edit docker/entrypoint to add agent to configureMcp() function:
const configureMcp = async (agent: string, tool: McpServerKey): Promise<void> => {
const server = MCP_SERVERS[tool]
const apiKey = server.auth ? process.env[server.auth.envVar] : undefined
switch (agent) {
// ... existing cases ...
case '<agent>': {
await $`<agent> mcp add ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
console.log('✓ Agent MCP server added')
break
}
}
}
Add timeout if needed in buildTrialsCommand():
switch (AGENT) {
case '<agent>':
cmd.push('--timeout', '120000') // 2 minutes
break
}
5. Update Scripts
Edit scripts/shared/shared.constants.ts to add agent to ALL_AGENTS:
export const ALL_AGENTS: Agent[] = ["claude-code", "gemini", "droid", "codex", "<agent>"]
Also update the Agent type in scripts/shared/shared.types.ts:
type Agent = "claude-code" | "gemini" | "droid" | "codex" | "<agent>"
6. Test
docker compose build <agent>
docker compose run --rm -e SEARCH_PROVIDER=builtin <agent>
docker compose run --rm -e SEARCH_PROVIDER=you <agent>
Adding a New MCP Tool
1. Add to mcp-servers.ts
export const MCP_SERVERS = {
you: { /* ... existing */ },
exa: {
name: 'exa-server',
type: 'http' as const,
url: 'https://api.exa.ai/mcp',
auth: {
type: 'bearer' as const,
envVar: 'EXA_API_KEY',
},
},
} as const
export type McpServerKey = keyof typeof MCP_SERVERS
2. Update docker/entrypoint
Add the new tool case to configureMcp() for each agent:
case 'claude-code': {
await $`claude mcp add --transport http ${server.name} ${server.url} --header "Authorization: Bearer ${apiKey}"`.quiet()
break
}
case 'gemini': {
await $`gemini mcp add --transport http --header "Authorization: Bearer ${apiKey}" ${server.name} ${server.url}`.quiet()
break
}
case 'droid': {
await $`droid mcp add ${server.name} ${server.url} --type http --header "Authorization: Bearer ${apiKey}"`.quiet()
break
}
case 'codex': {
const configDir = `${process.env.HOME}/.codex`
await $`mkdir -p ${configDir}`.quiet()
const config = `[mcp_servers.${server.name}]
url = "${server.url}"
bearer_token_env_var = "${server.auth?.envVar}"
`
await Bun.write(`${configDir}/config.toml`, config)
break
}
3. Update Environment Files
Add to .env and .env.example:
EXA_API_KEY=your_api_key_here
4. Generate MCP Prompt Sets
Use the generate-mcp-prompts script to create MCP variant files with proper metadata:
# Generate variants for new MCP server
bun scripts/generate-mcp-prompts.ts --mcp-key exa
# Creates:
# - data/prompts/prompts-exa.jsonl
The script prepends "Use {server-name} and answer\n" to each query and adds MCP metadata (server name and expected tools).
The entrypoint automatically handles provider-specific prompt files:
const promptFile = SEARCH_PROVIDER === "builtin"
? `/eval/data/prompts/prompts.jsonl`
: `/eval/data/prompts/prompts-${SEARCH_PROVIDER}.jsonl` // e.g., prompts-exa.jsonl
Note: scripts/run-trials.ts automatically picks up new MCP servers from mcp-servers.ts, so no manual updates needed.
5. Test
docker compose build
docker compose run --rm -e SEARCH_PROVIDER=exa claude-code
bun run trials -- --search-provider exa --count 5 -k 1
Schema Format Reference
Current agent schemas:
| Schema | Agent | Mode | Status |
|--------|-------|------|--------|
| claude-code.json | Claude Code | stream | ✅ Tested |
| gemini.json | Gemini CLI | iterative | ✅ Tested |
| droid.json | Droid CLI | stream | ✅ Tested |
| codex.json | Codex CLI | stream | ✅ Tested |
Session Modes:
stream- Process stays alive, multi-turn conversations via stdiniterative- New process per turn, history passed as context
Related Skills
- @agent-eval-harness - Capture, trials, compare commands
- @headless-adapters - Schema creation and validation
Scan to join WeChat group