Category: Development and Engineering. No API key required.

fetcher

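Fetch web pages, PDFs, and documents with automatic fallbacks and content extraction. Use when the user says "fetch this URL", "download this page", "crawl this website", "extract content from ...", or "get the PDF", or provides a URL that needs to be retrieved.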

Author: jakexiaohub (GitHub)

Fetcher - Web Crawling

Fetch web pages and documents with automatic fallbacks, proxy rotation, and content extraction.

Self-contained skill - auto-installs via uvx from git (no pre-installation needed).

Fully automatic - Playwright browsers are installed on first run for SPA/JS page support.

Simplest Usage

# Via wrapper (recommended - auto-installs)
.agents/skills/fetcher/run.sh get https://example.com

# Or directly if fetcher is installed
fetcher get https://example.com

Common Commands

./run.sh get https://example.com                   # Fetch single URL
./run.sh get-manifest urls.txt                     # Fetch list of URLs
./run.sh get-manifest - < urls.txt                 # Fetch from stdin

Common Patterns

Fetch a single URL

fetcher get https://www.nasa.gov --out run/nasa

Outputs to run/nasa/:

  • consumer_summary.json - structured result
  • Walkthrough.md - human-readable summary
  • downloads/ - raw content files

Fetch multiple URLs

# From file (one URL per line)
fetcher get-manifest urls.txt --out run/batch

# From stdin
printf '%s\n' "https://example.com" "https://nasa.gov" | fetcher get-manifest -
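
When the URL list is produced by a program rather than written by hand, the manifest file can be generated. The following is a minimal Python sketch that relies only on the format stated above (one URL per line). The helper name `write_manifest` is hypothetical and is not part of fetcher.

```python
from pathlib import Path


def write_manifest(urls, path):
    """Write one URL per line, skipping blanks and duplicates (order preserved).

    Hypothetical helper for illustration; not part of fetcher.
    """
    seen = set()
    lines = []
    for url in urls:
        url = url.strip()
        if url and url not in seen:
            seen.add(url)
            lines.append(url)
    # Terminate every line with a newline so the file is safe to append to
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return lines
```

The resulting file can then be passed to `fetcher get-manifest urls.txt --out run/batch`.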

ETL mode (full control)

fetcher-etl --inventory urls.jsonl --out run/etl_batch
fetcher-etl --manifest urls.txt --out run/demo

Check environment

fetcher doctor                    # Check dependencies and config
fetcher get --dry-run <url>       # Validate without fetching
fetcher-etl --help-full           # All options
fetcher-etl --find metrics        # Search options

Output Structure

run/artifacts/<run-id>/
├── results.jsonl              # Fetch results per URL
├── consumer_summary.json      # Summary stats
├── Walkthrough.md             # Human-readable summary
├── downloads/                 # Raw files (HTML, PDF, etc.)
├── text_blobs/                # Extracted text
├── markdown/                  # LLM-friendly markdown
├── fit_markdown/              # Pruned markdown for LLM input
├── junk_results.jsonl         # Failed/junk URLs
└── junk_table.md              # Quick triage table
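
For a quick overview of a run, the per-URL results can be tallied by verdict. This is a sketch under an assumption: each line of results.jsonl is taken to be a JSON object with a `content_verdict` key mirroring the FetchResult field of the same name. The on-disk schema is not documented here, so check a real file before relying on it. `count_verdicts` is a hypothetical helper name.

```python
import json
from collections import Counter
from pathlib import Path


def count_verdicts(results_path):
    """Tally content_verdict values across a JSON Lines file."""
    counts = Counter()
    with Path(results_path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            # Assumption: the key mirrors the FetchResult field name
            counts[record.get("content_verdict", "unknown")] += 1
    return dict(counts)
```

The same function could be pointed at junk_results.jsonl, assuming it shares the record shape.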

Content Extraction

Enable markdown output

export FETCHER_EMIT_MARKDOWN=1
export FETCHER_EMIT_FIT_MARKDOWN=1  # Pruned for LLM input
fetcher get https://example.com

Rolling windows (for chunking)

export FETCHER_DOWNLOAD_MODE=rolling_extract
export FETCHER_ROLLING_WINDOW_SIZE=6000
export FETCHER_ROLLING_WINDOW_STEP=3000
fetcher get https://example.com
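
To illustrate what a window size of 6000 with a step of 3000 implies, here is a generic sliding-window sketch in Python. It is not fetcher's implementation: whether fetcher counts characters or tokens, and how it handles the final partial window, are assumptions made here for illustration. `rolling_windows` is a hypothetical name.

```python
def rolling_windows(text, size, step):
    """Split text into overlapping windows of `size` units, advancing by `step`.

    Illustrative only; units here are characters, which is an assumption.
    """
    if size <= 0 or step <= 0:
        raise ValueError("size and step must be positive")
    windows = []
    start = 0
    while start < len(text):
        windows.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
        start += step
    return windows
```

Under this scheme, consecutive windows overlap by size minus step, so the settings above would give each chunk 3000 units in common with its neighbor.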

Advanced Features

HTTP caching

# Cache enabled by default
fetcher get https://example.com

# Disable cache for fresh fetch
fetcher get https://example.com --no-http-cache

PDF discovery

# Auto-fetch PDF links from HTML pages
export FETCHER_ENABLE_PDF_DISCOVERY=1
export FETCHER_PDF_DISCOVERY_MAX=3
fetcher get https://example.com

Proxy rotation (rate-limited sites)

export SPARTA_STEP06_PROXY_HOST=gw.iproyal.com
export SPARTA_STEP06_PROXY_PORT=12321
export SPARTA_STEP06_PROXY_USER=team
export SPARTA_STEP06_PROXY_PASSWORD=secret
fetcher-etl --inventory urls.jsonl

Brave/Wayback fallbacks

# Enable alternate URL resolution
export BRAVE_API_KEY=sk-your-key
fetcher-etl --use-alternates --inventory urls.jsonl

Python API

import asyncio
from fetcher.workflows.web_fetch import URLFetcher, FetchConfig, write_results
from pathlib import Path

async def main():
    config = FetchConfig(concurrency=4, per_domain=2)
    fetcher = URLFetcher(config)
    entries = [{"url": "https://www.nasa.gov"}]
    results, audit = await fetcher.fetch_many(entries)
    write_results(results, Path("artifacts/nasa.jsonl"))
    print(audit)

asyncio.run(main())

Single URL helper

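import asyncio
from fetcher.workflows.fetcher import fetch_url

async def main():
    # fetch_url is a coroutine, so it must be awaited inside an async function
    result = await fetch_url("https://example.com")
    print(result.content_verdict)  # "ok", "empty", "paywall", etc.
    print(result.text)             # Extracted text

asyncio.run(main())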

FetchResult Fields

| Field | Description |
|-------|-------------|
| url | Original URL |
| final_url | After redirects |
| content_verdict | ok, empty, paywall, error, etc. |
| text | Extracted text content |
| file_path | Path to raw download |
| markdown_path | Path to markdown (if enabled) |
| from_cache | Whether result came from cache |
| content_sha256 | Content hash for change detection |
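
The content_sha256 field makes it possible to compare two runs and find pages that changed. A minimal sketch, assuming each result has been loaded as a dict with `url` and `content_sha256` keys (for FetchResult objects, attribute access would be needed instead). `changed_urls` is a hypothetical helper, not part of fetcher.

```python
def changed_urls(previous, current):
    """Return URLs in `current` that are new or whose content hash differs.

    Both arguments are iterables of dicts with "url" and "content_sha256" keys
    (an assumed record shape).
    """
    old_hashes = {r["url"]: r.get("content_sha256") for r in previous}
    return [
        r["url"]
        for r in current
        if old_hashes.get(r["url"]) != r.get("content_sha256")
    ]
```

Results served from cache (from_cache) would be expected to carry the same hash as before, so this comparison is most useful after a fresh fetch.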

Environment Variables

| Variable | Purpose |
|----------|---------|
| BRAVE_API_KEY | Enable Brave search fallbacks |
| FETCHER_EMIT_MARKDOWN | Generate LLM-friendly markdown |
| FETCHER_EMIT_FIT_MARKDOWN | Generate pruned markdown |
| FETCHER_DOWNLOAD_MODE | text, download_only, rolling_extract |
| FETCHER_HTTP_CACHE_DISABLE | Disable HTTP caching |
| FETCHER_ENABLE_PDF_DISCOVERY | Auto-fetch embedded PDFs |
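
When fetcher is driven from a Python script, these variables can be set per invocation instead of exported globally. The sketch below only builds the environment mapping. `fetcher_env` is a hypothetical helper, and the variable names are the ones listed in this document.

```python
import os


def fetcher_env(overrides, base=None):
    """Return a copy of the environment with the given variables applied.

    A value of None removes the variable; anything else is stored as a string.
    """
    env = dict(os.environ if base is None else base)
    for name, value in overrides.items():
        if value is None:
            env.pop(name, None)
        else:
            env[name] = str(value)
    return env
```

One way to use it, assuming `fetcher` is on the PATH: `subprocess.run(["fetcher", "get", "https://example.com"], env=fetcher_env({"FETCHER_EMIT_MARKDOWN": 1}), check=True)`.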

Troubleshooting

| Problem | Solution |
|---------|----------|
| Playwright missing | uvx --from "git+https://github.com/grahama1970/fetcher.git" playwright install chromium |
| SPA page returns empty/thin | Playwright auto-fallback should trigger; check used_playwright in summary |
| Stale cached results | Set FETCHER_HTTP_CACHE_DISABLE=1 for fresh fetch |
| Rate limited | Configure proxy rotation or reduce concurrency |
| Paywall detected | Check content_verdict and use alternates |
| Empty content | Check junk_results.jsonl for diagnosis |

Run fetcher doctor to check environment and dependencies.

SPA/JavaScript Page Support

Fetcher automatically falls back to Playwright for known SPA domains. If a page returns thin/empty content:

  1. Check whether consumer_summary.json contains used_playwright: 1.
  2. If it does not, the domain may need to be added to SPA_FALLBACK_DOMAINS in the fetcher source.
  3. Force a fresh fetch by setting FETCHER_HTTP_CACHE_DISABLE=1.
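
The first check can be scripted. The exact layout of consumer_summary.json is not described here, so this sketch makes no assumption about where the key lives and searches the whole document for it. `find_key` and `playwright_was_used` are hypothetical helper names.

```python
import json
from pathlib import Path


def find_key(node, key):
    """Yield every value stored under `key` anywhere in a nested JSON structure."""
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                yield v
            yield from find_key(v, key)
    elif isinstance(node, list):
        for item in node:
            yield from find_key(item, key)


def playwright_was_used(summary_path):
    """True if any used_playwright entry in the summary is truthy (e.g. 1)."""
    summary = json.loads(Path(summary_path).read_text(encoding="utf-8"))
    return any(bool(v) for v in find_key(summary, "used_playwright"))
```

If this returns False for a page that came back thin or empty, steps 2 and 3 are the next things to try.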