Instructions
Primary Objective
Systematically research academic benchmarks, datasets, or research papers to extract and compile comparative information (e.g., into a summary table). The core workflow involves: 1) Identifying relevant sources, 2) Extracting key metadata, 3) Synthesizing findings into a structured output (like a LaTeX table).
Core Workflow
- Clarify & Parse Request: Identify the specific benchmarks/datasets/papers mentioned by the user. Note any required output format (e.g., LaTeX table with specific columns) and constraints (e.g., "no commented lines").
- Initial Information Gathering: For each identified entity (dataset/paper):
  - Use local-web_search to find general information, official pages (GitHub, project sites), and relevant arXiv IDs.
  - For arXiv papers, use arxiv_local-download_paper or fetch-fetch_markdown to obtain the paper content.
  - Search for specific attributes requested by the user (e.g., "number of tasks," "training set," "difficulty levels").
- Deep Dive & Verification: Read paper abstracts, introductions, and methodology sections (using arxiv_local-read_paper or parsed markdown) to confirm key details. Cross-reference information from multiple sources (official site, paper, blog posts) for accuracy.
- Information Synthesis: Compile the extracted metadata into a structured format aligned with the user's request. Resolve any ambiguities (e.g., if a "task" count refers to broad categories or individual instances) based on the most authoritative source (typically the original paper).
- Output Generation: Create the final deliverable (e.g., a .tex file). Ensure it strictly adheres to the user's formatting specifications. Optionally, provide a concise textual summary of the findings.
Key Metadata to Extract
When researching a benchmark/dataset, prioritize finding:
- Full Name & Acronym
- Number of Tasks/Categories: Distinguish between broad task categories and individual task instances.
- Training Data Availability: Does it include a dedicated training set, or is it for evaluation only?
- Difficulty Levels: Does it feature adjustable or tiered difficulty levels?
- Core Purpose/Description
- Primary Source (arXiv ID, GitHub repo)
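The attributes above can be held in one record per benchmark while researching. The following is a minimal sketch only: the class name, field names, and the helper missing_fields are hypothetical and not part of any tool mentioned here. A value of None stands for "not yet found in a primary source".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BenchmarkRecord:
    """One row of extracted metadata. All field names are hypothetical."""
    full_name: str
    acronym: Optional[str]
    num_tasks: Optional[int]               # None when no count has been found yet
    task_unit: str                         # "categories" or "instances"
    has_training_set: Optional[bool]       # None when primary sources are silent
    has_difficulty_levels: Optional[bool]  # None when primary sources are silent
    description: str
    primary_source: str                    # e.g. an arXiv ID or a GitHub repo URL


def missing_fields(record: BenchmarkRecord) -> List[str]:
    """Return the attributes that still need a source or a stated assumption."""
    return [name for name, value in vars(record).items() if value is None]
```

Calling missing_fields on a record before synthesis gives a short list of attributes to search for again, or to flag as assumptions in the summary.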
Tool Usage Guidelines
- local-web_search: Use for initial discovery and finding high-level descriptions. Employ specific queries combining the dataset name and target attributes (e.g., "BBH training set few-shot examples").
- arxiv_local-download_paper / fetch-fetch_markdown: Use to access the canonical source for detailed information. Prefer arxiv_local-download_paper for full text analysis when needed.
- filesystem-write_file / filesystem-read_file: Use for creating and verifying final output files in the workspace.
- local-claim_done: Use only after successfully delivering the requested output and providing a final summary.
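The advice to combine the dataset name with target attributes can be illustrated with a small helper. This is a sketch under assumptions: build_queries is a hypothetical function, and the resulting strings would be passed to the search tool by the agent rather than by this code.

```python
from typing import List


def build_queries(dataset: str, attributes: List[str]) -> List[str]:
    """Pair the dataset name with each requested attribute to form a specific query."""
    return [f"{dataset} {attribute}".strip() for attribute in attributes]


queries = build_queries("BBH", ["training set few-shot examples", "number of tasks"])
# queries == ["BBH training set few-shot examples", "BBH number of tasks"]
```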
Output Standards
- LaTeX Tables: Ensure the output contains only the specified table content, without extra comments, document headers, or unrelated text.
- Summaries: Be concise but complete, highlighting the sourced information for each dataset.
- Accuracy: Base conclusions on the original paper or official project documentation where possible. Acknowledge if information is not explicitly stated.
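One way to enforce the LaTeX standard above is to render the table from structured rows and then check the result before writing the file. The sketch below is illustrative only: the column headings, the function names render_table and violations, and the use of \ding{51} as the "yes" mark are assumptions, and benchmark names containing LaTeX special characters would need escaping first.

```python
from typing import List, Tuple

# pifont marks; \ding{55} is the "no" mark, \ding{51} for "yes" is an assumption.
YES, NO = r"\ding{51}", r"\ding{55}"

Row = Tuple[str, int, bool, bool]  # (name, task count, has training set, has difficulty levels)


def render_table(rows: List[Row]) -> str:
    """Render only the tabular environment, with no preamble and no comments."""
    lines = [
        r"\begin{tabular}{lccc}",
        r"\hline",
        r"Benchmark & Tasks & Training Set & Difficulty Levels \\",
        r"\hline",
    ]
    for name, tasks, train, levels in rows:
        lines.append(f"{name} & {tasks} & {YES if train else NO} & {YES if levels else NO} \\\\")
    lines += [r"\hline", r"\end{tabular}"]
    return "\n".join(lines)


def violations(tex: str) -> List[str]:
    """List lines that break the 'table content only' rule."""
    header_prefixes = (r"\documentclass", r"\usepackage", r"\begin{document}", r"\end{document}")
    problems = []
    for number, line in enumerate(tex.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("%"):
            problems.append(f"line {number}: commented line")
        if stripped.startswith(header_prefixes):
            problems.append(f"line {number}: document header")
    return problems
```

Running violations on the rendered string before calling filesystem-write_file, and again on the content returned by filesystem-read_file, gives a cheap check that no comments or document headers slipped in.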
Common Pitfalls & Resolutions
- Ambiguous Task Counts: If a paper mentions "5 task categories" (like KOR-Bench), report that as the task count unless the user specifies otherwise. Clarify in the summary if needed.
- Missing Information: If a key attribute (e.g., training set) is not mentioned in primary sources, infer based on benchmark type (e.g., many evaluation benchmarks lack training sets) and denote with \ding{55}. State the assumption in your summary.
- arXiv Paper Processing: If arxiv_local-download_paper returns a "converting" status, use fetch-fetch_markdown on the arXiv abstract page as a reliable fallback to get the paper's metadata and abstract.
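The arXiv fallback can be sketched as plain control flow. Everything here is an assumption made for illustration: the two tools are represented by caller-supplied functions, the download result is modelled as a dict with "status" and "content" keys, and the real tools may return a different shape.

```python
from typing import Callable, Dict


def get_paper_text(
    arxiv_id: str,
    download_paper: Callable[[str], Dict[str, str]],  # stand-in for arxiv_local-download_paper
    fetch_markdown: Callable[[str], str],             # stand-in for fetch-fetch_markdown
) -> str:
    """Return full text when available, else the abstract page as a fallback."""
    result = download_paper(arxiv_id)
    if result.get("status") == "converting":
        # Assumed result shape; fall back to the arXiv abstract page.
        return fetch_markdown(f"https://arxiv.org/abs/{arxiv_id}")
    return result.get("content", "")
```

Passing the tools in as functions keeps the sketch independent of how they are actually invoked. Note that the fallback yields only metadata and the abstract, so details taken from it may still need confirmation against the full paper.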