返回 Skill 列表
extension
分类: 数据与分析需要 API Key

Tavily Crawl

爬取任意网站并将页面保存为本地 Markdown 文件。适用于下载文档、知识库或网页内容,以供离线访问或分析。

person作者: matthew77hubclawhub

Tavily Crawl

Crawl websites to extract content from multiple pages. Ideal for documentation, knowledge bases, and site-wide content extraction.

Authentication

Get your API key at https://tavily.com and add to your OpenClaw config:

{
  "skills": {
    "entries": {
      "tavily-crawl": {
        "enabled": true,
        "apiKey": "tvly-YOUR_API_KEY_HERE"
      }
    }
  }
}

Or set in environment variable:

export TAVILY_API_KEY="tvly-YOUR_API_KEY_HERE"

Quick Start

Using the Script

node {baseDir}/scripts/crawl.mjs "https://docs.example.com"
node {baseDir}/scripts/crawl.mjs "https://docs.example.com" --output ./docs
node {baseDir}/scripts/crawl.mjs "https://example.com" --depth 2 --limit 50

Examples

# Basic crawl
node {baseDir}/scripts/crawl.mjs "https://docs.example.com"

# Deeper crawl with limits
node {baseDir}/scripts/crawl.mjs "https://docs.example.com" --depth 2 --limit 50

# Save to files
node {baseDir}/scripts/crawl.mjs "https://docs.example.com" --depth 2 --output ./docs

# Focused crawl with path filters
node {baseDir}/scripts/crawl.mjs "https://example.com" --depth 2 \
  --select "/docs/.*" --exclude "/blog/.*"

# With semantic instructions
node {baseDir}/scripts/crawl.mjs "https://docs.example.com" \
  --instructions "Find API documentation" --chunks 3

Options

| Option | Description | Default | |--------|-------------|---------| | --depth <n> | Crawl depth (1-5) | 1 | | --breadth <n> | Links per page | 20 | | --limit <n> | Total pages cap | 50 | | --output <dir> | Save pages to directory | - | | --instructions <text> | Natural language guidance | - | | --chunks <n> | Chunks per page (1-5, requires instructions) | - | | --depth-mode <mode> | Extract depth: basic or advanced | basic | | --select <pattern> | Regex pattern to include | - | | --exclude <pattern> | Regex pattern to exclude | - | | --timeout <sec> | Max wait time (10-150 seconds) | 150 | | --json | Output raw JSON | false |

Depth vs Performance

| Depth | Typical Pages | Time | |-------|---------------|------| | 1 | 10-50 | Seconds | | 2 | 50-500 | Minutes | | 3 | 500-5000 | Many minutes |

Start with --depth 1 and increase only if needed.

Crawl for Context vs Data Collection

For agentic use (feeding results into context): Always use --instructions + --chunks. This returns only relevant chunks instead of full pages, preventing context window explosion.

For data collection (saving to files): Omit --chunks to get full page content.

Tips

  • Always use --chunks for agentic workflows - prevents context explosion when feeding results to LLMs
  • Omit --chunks only for data collection - when saving full pages to files
  • Start conservative (--depth 1, --limit 20) and scale up
  • Use path patterns to focus on relevant sections
  • Always set a --limit to prevent runaway crawls