Instructions

Primary Objective

Systematically research academic benchmarks, datasets, or research papers to extract and compile comparative information (e.g., into a summary table). The core workflow involves: 1) Identifying relevant sources, 2) Extracting key metadata, 3) Synthesizing findings into a structured output (like a LaTeX table).

Core Workflow

Clarify & Parse Request: Identify the specific benchmarks/datasets/papers mentioned by the user. Note any required output format (e.g., LaTeX table with specific columns) and constraints (e.g., "no commented lines").
Initial Information Gathering: For each identified entity (dataset/paper):
- Use local-web_search to find general information, official pages (GitHub, project sites), and relevant arXiv IDs.
- For arXiv papers, use arxiv_local-download_paper or fetch-fetch_markdown to obtain the paper content.
- Search for specific attributes requested by the user (e.g., "number of tasks," "training set," "difficulty levels").
Deep Dive & Verification: Read paper abstracts, introductions, and methodology sections (using arxiv_local-read_paper or parsed markdown) to confirm key details. Cross-reference information from multiple sources (official site, paper, blog posts) for accuracy.
Information Synthesis: Compile the extracted metadata into a structured format aligned with the user's request. Resolve any ambiguities (e.g., whether a "task" count refers to broad categories or individual instances) based on the most authoritative source (typically the original paper).
Output Generation: Create the final deliverable (e.g., a .tex file). Ensure it strictly adheres to the user's formatting specifications. Optionally, provide a concise textual summary of the findings.
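
The synthesis and output steps above can be sketched as a small renderer that turns extracted rows into table-only LaTeX. This is a minimal illustration, not a prescribed implementation: the column layout, the HEADER names, and the helper names (escape_latex, mark, render_table) are assumptions, and \ding{51} is assumed as the pifont check mark that pairs with the \ding{55} cross used in this document.

```python
# Hypothetical column layout; the real columns come from the user's request.
HEADER = ["Benchmark", "Tasks", "Training Set", "Difficulty Levels"]


def escape_latex(text: str) -> str:
    """Escape a few LaTeX special characters likely to appear in names."""
    for char in ("&", "%", "_", "#"):
        text = text.replace(char, "\\" + char)
    return text


def mark(flag: bool) -> str:
    """Map a yes/no attribute to a pifont check (51) or cross (55)."""
    return r"\ding{51}" if flag else r"\ding{55}"


def render_table(rows: list[tuple[str, int, bool, bool]]) -> str:
    """Render rows of (name, task count, has training set, has levels)."""
    lines = [r"\begin{tabular}{lccc}", r"\hline"]
    lines.append(" & ".join(HEADER) + r" \\")
    lines.append(r"\hline")
    for name, tasks, train, levels in rows:
        cells = [escape_latex(name), str(tasks), mark(train), mark(levels)]
        lines.append(" & ".join(cells) + r" \\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)
```

The output contains only the tabular environment, with no preamble and no comment lines, so it can be written to a .tex file as-is. The document that includes it would need to load pifont for the \ding marks.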

Key Metadata to Extract

When researching a benchmark/dataset, prioritize finding:

Full Name & Acronym
Number of Tasks/Categories: Distinguish between broad task categories and individual task instances.
Training Data Availability: Does it include a dedicated training set, or is it for evaluation only?
Difficulty Levels: Does it feature adjustable or tiered difficulty levels?
Core Purpose/Description
Primary Source (arXiv ID, GitHub repo)
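
One way to keep these attributes together while researching is a small record type, sketched below. The class name, field names, and the missing_fields helper are hypothetical. None is used here to mean "not yet found or not stated", which keeps unknown values distinct from a confirmed "no".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BenchmarkRecord:
    """Metadata collected for one benchmark or dataset (illustrative only)."""

    full_name: str
    acronym: str
    num_tasks: Optional[int] = None
    task_unit: str = "category"  # "category" or "instance"
    has_training_set: Optional[bool] = None  # None = not stated in sources
    has_difficulty_levels: Optional[bool] = None
    description: str = ""
    arxiv_id: Optional[str] = None
    repo_url: Optional[str] = None

    def missing_fields(self) -> list[str]:
        """Names of fields still set to None, in declaration order."""
        return [name for name, value in vars(self).items() if value is None]
```

Calling missing_fields() before writing the final output gives a quick list of attributes that still need a source or an explicitly stated assumption.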

Tool Usage Guidelines

local-web_search: Use for initial discovery and finding high-level descriptions. Employ specific queries combining the dataset name and target attributes (e.g., "BBH training set few-shot examples").
arxiv_local-download_paper / fetch-fetch_markdown: Use to access the canonical source for detailed information. Prefer arxiv_local-download_paper when full-text analysis is needed.
filesystem-write_file / filesystem-read_file: Use for creating and verifying final output files in the workspace.
local-claim_done: Use only after successfully delivering the requested output and providing a final summary.

Output Standards

LaTeX Tables: Ensure the output contains only the specified table content, without extra comments, document headers, or unrelated text.
Summaries: Be concise but complete, highlighting the sourced information for each dataset.
Accuracy: Base conclusions on the original paper or official project documentation where possible. Acknowledge if information is not explicitly stated.
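
The LaTeX standard above can be checked mechanically before the file is delivered. The sketch below is a rough check under simplifying assumptions, with hypothetical function names: it treats any unescaped % as the start of a comment and looks for two common preamble markers. It does not handle a % that directly follows a \\ line break, or verbatim environments.

```python
import re

# A '%' starts a LaTeX comment unless it is escaped as '\%'.
# Simplification: a '%' placed directly after a '\\' line break is missed.
_COMMENT = re.compile(r"(?<!\\)%")


def find_comment_lines(tex: str) -> list[int]:
    """Return 1-based numbers of lines that contain an unescaped '%'."""
    return [
        number
        for number, line in enumerate(tex.splitlines(), start=1)
        if _COMMENT.search(line)
    ]


def has_document_wrapper(tex: str) -> bool:
    """Detect preamble material that a table-only file should not contain."""
    return r"\documentclass" in tex or r"\begin{document}" in tex
```

An empty list from find_comment_lines and False from has_document_wrapper suggest the file holds only table content. A manual read of the file is still worthwhile.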

Common Pitfalls & Resolutions

Ambiguous Task Counts: If a paper mentions "5 task categories" (like KOR-Bench), report that as the task count unless the user specifies otherwise. Clarify in the summary if needed.
Missing Information: If a key attribute (e.g., training set) is not mentioned in primary sources, infer it from the benchmark type (e.g., many evaluation benchmarks lack training sets) and denote it with \ding{55} (the cross mark from the pifont package, which the including document must load). State the assumption in your summary.
arXiv Paper Processing: If arxiv_local-download_paper returns a "converting" status, use fetch-fetch_markdown on the arXiv abstract page as a reliable fallback to get the paper's metadata and abstract.
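
Two of the resolutions above lend themselves to small helpers, sketched here with hypothetical names. The first builds the abstract-page URL to hand to the fallback fetch, assuming new-style arXiv identifiers (YYMM.NNNNN, optionally with a version suffix); old-style identifiers such as those with a subject prefix are not handled. The second returns the table cell for the training-set column together with a note for the summary whenever the value was inferred and not stated.

```python
import re
from typing import Optional

# New-style arXiv identifiers only, e.g. four digits, a dot, four or five
# digits, and an optional version suffix such as "v2".
ARXIV_ID = re.compile(r"^\d{4}\.\d{4,5}(v\d+)?$")


def arxiv_abs_url(arxiv_id: str) -> str:
    """Build the abstract-page URL for a new-style arXiv identifier."""
    if not ARXIV_ID.match(arxiv_id):
        raise ValueError(f"unrecognized arXiv identifier: {arxiv_id!r}")
    return f"https://arxiv.org/abs/{arxiv_id}"


def training_set_cell(stated: Optional[bool]) -> tuple[str, Optional[str]]:
    """Return (table cell, summary note); the note is set only when inferred."""
    if stated is None:
        note = "Training set not stated in primary sources; assumed absent."
        return r"\ding{55}", note
    return (r"\ding{51}" if stated else r"\ding{55}"), None
```

Keeping the note next to the cell value makes it harder to emit an inferred cross mark without also recording the assumption in the summary.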