Web Data Collection & Analysis Skill
Overview
This skill guides the systematic collection, cleaning, analysis, and synthesis of information from web sources into structured Markdown reports.
The workflow consists of 6 steps:
- Receive Input — URL or research request
- Fetch Content — Static (web_fetch) or dynamic (agent-browser)
- Clean Data — Remove HTML noise, navigation, ads, scripts
- Extract Structure — Parse sections, headings, tables, code blocks
- Analyze Content — Generate insights and key concepts
- Generate Report — Create structured Markdown output
When to Use This Skill
- User provides a URL and wants you to "analyze", "understand", "summarize", or "research" it
- User asks to extract data from documentation portals, wikis, or specification sites
- User wants a comprehensive report on web content (competitor analysis, tech research, etc.)
- Multi-page documents need structure extraction (sections, key points, tables)
- Documents exist in multiple formats (web, PDF, Excel) and need unified analysis
Step 1: Receive Input
Clarify what the user wants analyzed:
- URL or resource: What's the source?
- Scope: Just overview, or deep technical details?
- Output style: Quick summary or comprehensive report?
- Focus areas: Any specific aspects to prioritize?
Example inputs:
- "Analyze this documentation: https://example.com/api-docs"
- "Research this GitHub repo and summarize the architecture"
- "Extract key concepts from this PDF specification"
Step 2: Fetch Content
Choose the fetch method based on the page type:
Static HTML pages (documentation, wikis, blogs):
Use web_fetch(url) for fast retrieval
Dynamic/JavaScript-heavy pages (React SPAs, dashboards):
Use agent-browser:
1. agent-browser.open(url)
2. agent-browser.wait_for_content() [if needed]
3. agent-browser.extract_text() or extract_structured()
Special file types:
- PDF: Use pdf_parser to extract text and structure
- Excel/CSV: Use xlsx_parser to read tables and metadata
- Git repos: Clone or browse via GitHub API
Step 3: Clean Data
Remove noise from fetched content:
HTML Cleaning Pipeline:
- Remove
<script>,<style>,<link>tags - Remove navigation menus, sidebars, footers
- Remove ads, tracking pixels, comments sections
- Normalize whitespace
- Decode HTML entities
Output: Clean, readable text with preserved structure
Step 4: Extract Structure
Parse the document into a structured hierarchy:
{
"title": "Document Title",
"metadata": {
"url": "https://...",
"fetch_date": "2024-XX-XX"
},
"sections": [
{
"heading": "Section 1",
"level": 1,
"content": "Section text...",
"subsections": [
{
"heading": "Subsection 1.1",
"level": 2,
"content": "Subsection text...",
"key_points": ["Point 1", "Point 2"]
}
],
"tables": [...],
"code_blocks": [...]
}
],
"links": [...]
}
Extract: headings, paragraphs, lists, tables, code blocks, links, images
Step 5: Analyze Content
Generate insights from the structured document:
Overview
- Summary: 1-2 paragraph overview of the document's purpose and main message
- Audience: Who is this for? (developers, business users, etc.)
- Primary focus: What's the main topic?
Key Concepts
- Extract major concepts, terminology, and ideas
- Define important terms specific to the domain
- List in logical order (foundational → advanced)
System Components / Architecture
- Identify major modules, services, or components
- Describe their roles and interactions
- Create a high-level system diagram if applicable
Technical Details
- Deep dive into implementation specifics
- Algorithms, data structures, API details
- Configuration, parameters, options
- Code examples and usage patterns
Important Notes
- Warnings or prerequisites
- Common pitfalls or gotchas
- Version compatibility information
- Dependency information
Possible Applications / Use Cases
- How could this information be applied?
- Real-world scenarios
- Integration points with other systems
- Best practices
Step 6: Generate Markdown Report
Create a structured, human-readable report using this template:
# Analysis Report: [Document Title]
**Source**: [URL]
**Analyzed**: [Date]
**Document Type**: [Type — documentation, specification, blog post, etc.]
---
## 1. Overview
[Summary of document's purpose and main message]
**Audience**: [Who this is for]
**Primary Focus**: [Main topic/domain]
---
## 2. Key Concepts
- **Concept 1**: Definition and context
- **Concept 2**: Definition and context
- **Concept 3**: Definition and context
---
## 3. System Components / Architecture
| Component | Description | Key Responsibility |
|-----------|-------------|-------------------|
| Module A | Brief description | What it does |
| Module B | Brief description | What it does |
[Or use text format for prose descriptions]
---
## 4. Technical Details
### [Subsystem/Feature 1]
[Deep technical explanation, code examples, parameters]
### [Subsystem/Feature 2]
[Deep technical explanation, code examples, parameters]
---
## 5. Important Notes
- **Note 1**: [Prerequisite, warning, or gotcha]
- **Note 2**: [Version compatibility or dependency info]
- **Note 3**: [Best practice or common pitfall]
---
## 6. Possible Applications
- **Use case 1**: [Description of how this could be applied]
- **Use case 2**: [Description of how this could be applied]
- **Integration point**: [How this integrates with other systems]
---
## 7. Summary & Recommendations
[Synthesize the analysis: what are the key takeaways? What should the user do next?]
Quality Checklist
Before finalizing the report:
- ✅ All major sections of the original document are represented
- ✅ Key technical details are accurate and complete
- ✅ Terminology is consistent throughout
- ✅ Code examples are properly formatted and runnable
- ✅ Links to original sources are preserved
- ✅ The report is understandable to the target audience
- ✅ All tables and structured data are properly formatted
- ✅ Key insights are highlighted and actionable
Examples
Example 1: API Documentation
Input: https://api.example.com/docs Output: Analysis of endpoints, parameters, authentication, response formats, rate limiting, error codes, and usage examples.
Example 2: Technical Specification
Input: GitHub specification document (.md) Output: Architecture overview, key algorithms, data structures, performance considerations, and implementation guidelines.
Example 3: GitHub Repository
Input: https://github.com/user/project Output: Project purpose, architecture, key modules, setup instructions, and contribution guidelines.
Safety & Best Practices
DO:
- Respect robots.txt and rate limiting
- Attribute sources and preserve original links
- Sanitize any embedded code before analysis
- Remove sensitive information (API keys, passwords, tokens)
DON'T:
- Scrape private or authenticated pages without permission
- Execute untrusted code from analyzed documents
- Expose credentials or sensitive data in reports
- Violate copyright by reproducing large content sections
Troubleshooting
| Issue | Solution | |-------|----------| | Page requires login | Note in report that analysis is limited; request credentials if appropriate | | Content is behind paywall | Analyze preview/abstract; note that full content is restricted | | Dynamic content won't load | Use agent-browser with longer wait times; note if key content is JS-dependent | | Huge document | Focus analysis on key sections; create table of contents for reference | | Mixed formats (web + PDF + code) | Analyze each format separately, then synthesize findings in unified report |
微信扫一扫