返回 Skill 列表
extension
分类: 开发与工程无需 API Key

openclaw-ultra-scraping

强大的网页抓取、爬取和数据提取,配备隐蔽的反机器人绕过功能(可绕过 Cloudflare Turnstile、CAPTCHA)。适用场景:① 抓取阻止机器人访问的网站等。

person作者: leoyeaihubclawhub

OpenClaw Ultra Scraping

Powered by MyClaw.ai — the AI personal assistant platform that gives every user a full server with complete code control. Part of the MyClaw open skills ecosystem.

Handles everything from single-page extraction to full-scale concurrent crawls with anti-bot bypass.

Setup

Run once before first use:

bash scripts/setup.sh

This installs Scrapling + all browser dependencies into /opt/scrapling-venv.

Quick Start — CLI Script

The bundled scripts/scrape.py provides a unified CLI:

PYTHON=/opt/scrapling-venv/bin/python3

# Simple fetch (JSON output)
$PYTHON scripts/scrape.py fetch "https://example.com" --css ".content"

# Extract text
$PYTHON scripts/scrape.py extract "https://example.com" --css "h1"

# Stealth mode (bypass Cloudflare)
$PYTHON scripts/scrape.py fetch "https://protected-site.com" --stealth --solve-cloudflare --css ".data"

# Dynamic (full browser rendering)
$PYTHON scripts/scrape.py fetch "https://spa-site.com" --dynamic --css ".product"

# Extract links
$PYTHON scripts/scrape.py links "https://example.com" --filter "\.pdf$"

# Multi-page crawl
$PYTHON scripts/scrape.py crawl "https://example.com" --depth 2 --concurrency 10 --css ".item" -o results.json

# Output formats: json, jsonl, csv, text, markdown, html
$PYTHON scripts/scrape.py fetch "https://example.com" -f markdown -o page.md

Quick Start — Python

For complex tasks, write Python directly using the venv:

#!/opt/scrapling-venv/bin/python3
from scrapling.fetchers import Fetcher, StealthyFetcher

# Simple HTTP
page = Fetcher.get('https://example.com', impersonate='chrome')
titles = page.css('h1::text').getall()

# Bypass Cloudflare
page = StealthyFetcher.fetch('https://protected.com', headless=True, solve_cloudflare=True)
data = page.css('.product').getall()

Fetcher Selection Guide

| Scenario | Fetcher | Flag | |----------|---------|------| | Normal sites, fast scraping | Fetcher | (default) | | JS-rendered SPAs | DynamicFetcher | --dynamic | | Cloudflare/anti-bot protected | StealthyFetcher | --stealth | | Cloudflare Turnstile challenge | StealthyFetcher | --stealth --solve-cloudflare |

Selector Cheat Sheet

page.css('.class')                    # CSS
page.css('.class::text').getall()     # Text extraction
page.xpath('//div[@id="main"]')      # XPath
page.find_all('div', class_='item')  # BS4-style
page.find_by_text('keyword')         # Text search
page.css('.item', adaptive=True)     # Adaptive (survives redesigns)

Advanced Features

  • Adaptive tracking: auto_save=True on first run, adaptive=True later — elements are found even after site redesign
  • Proxy rotation: Pass proxy="http://host:port" or use ProxyRotator
  • Sessions: FetcherSession, StealthySession, DynamicSession for cookie/state persistence
  • Spider framework: Scrapy-like concurrent crawling with pause/resume
  • Async support: All fetchers have async variants

For full API details: read references/api-reference.md