XScrapy: AI-Native General-Purpose Crawler Engine

In One Sentence

Given any URL, XScrapy automatically identifies the page type, extracts structured data, and outputs standard JSON.

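Trigger Conditions

Trigger this Skill when the user mentions any of the following scenarios:

Web data scraping/crawling: "Crawl this website for me", "Scrape the data on page xxx", "Collect information about xxx"
Bulk content extraction: "Extract the content of this article", "Get all the product information"
Price/competitor monitoring: "Monitor price changes for this product", "Scrape a competitor's product list"
News aggregation: "Scrape today's trending news", "Aggregate the latest articles from a site"
Data export: "Export this page's data as JSON/CSV"

Keyword matching: 爬取 (crawl), 抓取 (scrape), 采集 (collect), 提取 (extract), scrape, crawl, spider, extract, xscrapy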
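
The keyword check above could look roughly like the sketch below. It is not XScrapy's actual trigger logic: the function name `should_trigger` and the case-insensitive substring matching are assumptions made for illustration; only the keyword list comes from this document.

```python
# Keywords copied from the keyword-matching line above.
TRIGGER_KEYWORDS = (
    "爬取", "抓取", "采集", "提取",
    "scrape", "crawl", "spider", "extract", "xscrapy",
)


def should_trigger(message: str) -> bool:
    """Return True if the message contains any trigger keyword.

    Hypothetical helper: plain case-insensitive substring matching,
    so "extraction" also matches the keyword "extract".
    """
    lowered = message.lower()
    return any(keyword in lowered for keyword in TRIGGER_KEYWORDS)
```

Substring matching is deliberately loose; a stricter matcher would need word-boundary handling for the English keywords.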

Usage

Option 1: CLI (recommended)

# Basic usage: simplest form, zero configuration
xscrapy run https://example.com/article/123

# Pick a scenario template and filter fields
xscrapy run https://shop.com/item/456 --template ecommerce \
  --fields title,price,rating,in_stock,brand

# Extract news articles
xscrapy run https://news.com/tech/ai-news --template news --max-pages 20

# Enable AI enhancement (more accurate extraction)
xscrapy run https://example.com --ai --ai-model gpt-4o-mini

# Write output to a specific file
xscrapy run https://example.com --output result.json --format json

# CSV output
xscrapy run https://listing.example.com/products --template listing \
  --output products.csv --format csv

# Multiple URLs in one run
xscrapy run url1 url2 url3 --template ecommerce --output batch.json

# Dry-run preview (sends no requests)
xscrapy run https://example.com --dry-run --verbose

Option 2: Python API

from xscrapy import XScrapyEngine
from xscrapy.models.data_models import (
    XScrapyConfig, RenderMode, OutputFormat, OutputConfig, AIConfig,
    AntiDetectConfig, ProxyConfig
)
# Note: OutputConfig is used below but was missing from the import list;
# it is assumed to live in the same module as the other config classes.

# Build the configuration
config = XScrapyConfig(
    urls=["https://example.com/product/123"],
    template="ecommerce",
    fields=["title", "price", "rating", "in_stock", "images"],

    render_mode=RenderMode.AUTO,        # choose the render mode automatically
    output=OutputConfig(
        format=OutputFormat.JSON,
        file_path="result.json",
        include_metadata=True,          # include metadata (source URL, time, confidence)
    ),

    anti_detect=AntiDetectConfig(
        enable_delay=True,
        min_delay=1.0,
        max_delay=3.0,
        concurrent_requests=8,
    ),

    ai=AIConfig(                     # optional: AI-enhanced extraction
        enabled=True,
        model="gpt-4o-mini",
    ),
)

# Run
engine = XScrapyEngine(config)
result = engine.run()

print(f"Succeeded: {result.success_count} items")
for item in result.items:
    print(item.data["title"], item.metadata.confidence_score)

Option 3: Quick single-page extraction

from xscrapy.cli import quick_extract

data = quick_extract("https://news.example.com/article/123")
print(data["data"]["title"])
print(data["data"]["article_body"])

Scenario Templates

| Template | Use case | Recommended fields |
|------|---------|---------|
| ecommerce | E-commerce product detail pages | title, price, currency, brand, category, rating, review_count, in_stock, images, specs, seller |
| news | News/blog articles | title, author, published_date, article_body, tags, word_count, language, images, category |
| social | Social media posts | title, author, published_date, body_text, likes, shares, comments_count, images |
| forum | Forum/BBS threads | title, author, published_date, body_text, comments, related_items, breadcrumbs |
| docs | Technical/API documentation | title, body_text, sections, code_blocks, internal_links, last_updated, version |
| listing | Listing/search result pages | items_preview, total_results, url, domain |

Standard Output Format

Every record has the following standard structure:

{
  "metadata": {
    "task_id": "xscrapy_20260407_163000_abc123",
    "spider": "universal",
    "source_url": "https://example.com/product/123",
    "crawled_at": "2026-04-07T16:30:00+08:00",
    "render_mode": "static",
    "confidence_score": 0.95,
    "version": "1.0.0",
    "extract_method": "hybrid",
    "response_time_ms": 520
  },
  "data": {
    "title": "Product name",
    "url": "https://example.com/product/123",
    "domain": "example.com",
    "description": "Product description...",
    "price": 299.00,
    "currency": "CNY",
    "brand": "Brand name",
    "category": "Electronics",
    "images": ["https://cdn.example.com/img.jpg"],
    "rating": 4.8,
    "review_count": 1234,
    "in_stock": true,
    "specs": {"Color": "Black", "Size": "15.6 in"},
    "_entities": {
      "email": ["contact@example.com"],
      "phone_cn": ["400-123-4567"],
      "price_cny": [{"amount": 299.0, "currency": "CNY"}]
    }
  },
  "_links": {
    "internal_links": [...],
    "external_links": [...]
  }
}
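
Because every record carries `metadata.confidence_score`, downstream code can separate results that are safe to use from those that need a second look. The sketch below is illustrative only: it assumes the output file holds a JSON array of records shaped like the one above, and the `0.8` threshold and the `split_by_confidence` helper are hypothetical, not part of XScrapy.

```python
import json

LOW_CONFIDENCE_THRESHOLD = 0.8  # assumed cut-off, tune per use case


def split_by_confidence(records, threshold=LOW_CONFIDENCE_THRESHOLD):
    """Split records into (accepted, needs_review) by confidence_score.

    Records without a score are treated as 0.0 and sent to review.
    """
    accepted, needs_review = [], []
    for record in records:
        score = record.get("metadata", {}).get("confidence_score", 0.0)
        (accepted if score >= threshold else needs_review).append(record)
    return accepted, needs_review


# Two trimmed-down sample records in the standard structure.
sample = json.loads("""
[
  {"metadata": {"source_url": "https://example.com/product/123", "confidence_score": 0.95},
   "data": {"title": "Product A", "price": 299.0}},
  {"metadata": {"source_url": "https://example.com/product/456", "confidence_score": 0.42},
   "data": {"title": "Product B", "price": 19.9}}
]
""")

accepted, needs_review = split_by_confidence(sample)
```

In practice the sample string would be replaced by something like `json.load(open("result.json"))`, again assuming the file contains a list of records.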

Advanced Usage

Custom extraction rules

from xscrapy.parsers.rule_parser import RuleBuilder

rules = RuleBuilder() \
    .add("title", "h1::text", required=True) \
    .add("price", ".product-price::text", type_hint="price") \
    .add("image", ".product-img img::attr(src)") \
    .add("sku", "#sku::text", description="Product SKU number") \
    .build()

Proxy configuration

# Single proxy
xscrapy run https://example.com --proxy socks5://127.0.0.1:1080

# Proxy pool file
xscrapy run https://example.com --proxy-list proxies.txt
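
The layout of `proxies.txt` is not specified in this document. The sketch below assumes one proxy URL per line, with blank lines and `#` comments ignored, and shows how a round-robin rotation over such a pool might work. `parse_proxy_list` and `round_robin` are hypothetical helpers, not XScrapy functions.

```python
import itertools


def parse_proxy_list(text):
    """Parse a proxy pool: one proxy URL per line (assumed format).

    Blank lines and lines starting with '#' are skipped.
    """
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            proxies.append(line)
    return proxies


def round_robin(proxies):
    """Return an endless iterator that cycles through the proxies in order."""
    if not proxies:
        raise ValueError("proxy pool is empty")
    return itertools.cycle(proxies)


# Stand-in for the contents of proxies.txt.
sample_file = """
# local proxies
socks5://127.0.0.1:1080
http://127.0.0.1:8080
"""

proxies = parse_proxy_list(sample_file)
rotation = round_robin(proxies)
first_three = [next(rotation) for _ in range(3)]
```

Round-robin spreads requests evenly across the pool; a sticky strategy would instead keep one proxy per target domain.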

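AI-powered extraction

When a page is too complex for rules to cover, enable AI extraction:

xscrapy run https://complex-site.com/page --ai --ai-model gpt-4o-mini

In AI mode, XScrapy:

Analyzes the page's semantic structure
Extracts the requested fields on demand
Fills in information the rules missed
Returns results with a confidence score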

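Anti-detection configuration

from xscrapy.models.data_models import XScrapyConfig, AntiDetectConfig

config = XScrapyConfig(
    urls=[...],
    anti_detect=AntiDetectConfig(
        enable_ua_rotation=True,     # rotate the User-Agent automatically
        min_delay=2.0,               # minimum interval: 2 s
        max_delay=5.0,               # maximum interval: 5 s
        concurrent_requests=3,       # low concurrency to mimic a human visitor
        respect_robots_txt=True,     # honor robots.txt
    ),
)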
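
How XScrapy chooses the actual wait between `min_delay` and `max_delay` is not documented here. A common approach, shown below purely as an assumption, is to draw a uniformly random delay from that range before each request so the traffic does not follow a fixed rhythm. `pick_delay` is a hypothetical helper.

```python
import random


def pick_delay(min_delay, max_delay, rng=random):
    """Return a random delay in seconds within [min_delay, max_delay].

    Assumed behavior: a uniform draw. The rng parameter allows a seeded
    random.Random instance for reproducible tests.
    """
    if min_delay < 0 or max_delay < min_delay:
        raise ValueError("expected 0 <= min_delay <= max_delay")
    return rng.uniform(min_delay, max_delay)


# Values taken from the configuration above: 2 s to 5 s between requests.
rng = random.Random(0)
delays = [pick_delay(2.0, 5.0, rng) for _ in range(5)]

# A crawler loop would then call time.sleep(delay) before each request.
```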

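Installation and Deployment

Quick start

# Clone or copy the xscrapy directory into your project
cd xscrapy

# Create a virtual environment
python -m venv venv
source venv/bin/activate  # Linux/Mac
# or: venv\Scripts\activate  # Windows

# Install dependencies
pip install -e .

# Install the browser driver (required for JS rendering)
playwright install chromium

# Verify the installation
xscrapy check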

Environment variables

| Variable | Description | Example |
|------|------|------|
| XSCRAPY_AI_KEY | OpenAI API key | sk-xxxxx |
| XSCRAPY_AI_URL | Base URL of a compatible API | https://api.openai.com/v1 |
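
A script that wraps XScrapy might read these two variables as sketched below. The variable names come from the table above; the `load_ai_settings` helper, the fallback to the example base URL, and the error on a missing key are assumptions for illustration, not documented XScrapy behavior.

```python
import os

# Example value from the table above; using it as a default is an assumption.
DEFAULT_AI_URL = "https://api.openai.com/v1"


def load_ai_settings(environ=os.environ):
    """Read the AI settings from environment variables (hypothetical helper)."""
    api_key = environ.get("XSCRAPY_AI_KEY")
    if not api_key:
        raise RuntimeError("XSCRAPY_AI_KEY is not set; AI mode needs an API key")
    # Strip a trailing slash so later path joins do not produce "//".
    base_url = environ.get("XSCRAPY_AI_URL", DEFAULT_AI_URL).rstrip("/")
    return {"api_key": api_key, "base_url": base_url}
```

Passing a plain dict instead of `os.environ` makes the helper easy to test without touching the real environment.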

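Architecture Overview

User input (URL/template)
    ↓
CLI / API entry point
    ↓
Task Scheduler
    ↓
Spider engine layer (Universal/List/Detail)
    ├── Anti-detection middleware (UA rotation/delay/fingerprint)
    ├── Rendering engine (Static/Splash/Playwright)
    └── Proxy rotation (RoundRobin/Random/Sticky)
    ↓
Parsing pipeline
    ├── Rule-based parsing (CSS/XPath/Regex)
    ├── AI parsing (LLM-based extraction) ← optional
    └── Entity extraction (email/phone/price/date)
    ↓
Processing pipeline
    └── Clean → Deduplicate → Validate → Enrich → Normalize
    ↓
Output pipeline → JSON / NDJSON / CSV / DB / Webhook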
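
To make the processing-pipeline stage of the diagram concrete, here is a minimal sketch of the five steps applied to plain dictionaries. None of these functions are XScrapy's own; the rules (deduplicate by URL, require `title` and `url`, derive `domain`, parse string prices) are assumptions chosen to keep the example small.

```python
from urllib.parse import urlparse


def clean(record):
    """Strip surrounding whitespace from string values."""
    return {k: v.strip() if isinstance(v, str) else v for k, v in record.items()}


def validate(record):
    """Assumed rule: a record needs a non-empty title and url."""
    return bool(record.get("title")) and bool(record.get("url"))


def enrich(record):
    """Add the domain derived from the record's url."""
    return {**record, "domain": urlparse(record["url"]).netloc}


def normalize(record):
    """Convert string prices such as '1,299.00' to floats."""
    price = record.get("price")
    if isinstance(price, str):
        record = {**record, "price": float(price.replace(",", ""))}
    return record


def run_pipeline(records):
    """Clean, deduplicate by url, validate, enrich, then normalize."""
    seen_urls = set()
    output = []
    for record in records:
        record = clean(record)
        url = record.get("url")
        if url in seen_urls:
            continue  # duplicate of an earlier record
        seen_urls.add(url)
        if not validate(record):
            continue  # drop incomplete records
        output.append(normalize(enrich(record)))
    return output


raw = [
    {"title": "  Laptop ", "url": "https://example.com/product/123", "price": "1,299.00"},
    {"title": "Laptop", "url": "https://example.com/product/123", "price": "1,299.00"},
    {"title": "", "url": "https://example.com/product/456"},
]
result = run_pipeline(raw)
```

Keeping each step a pure function makes it straightforward to reorder or swap steps without touching the others.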

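Notes

Legal compliance: follow the target site's robots.txt and terms of use
Request rate: a reasonable rate limit is applied by default to avoid putting pressure on the target server
Dynamic pages: SPA/AJAX pages require JS rendering (--render playwright)
AI cost: AI mode incurs API call fees, so enable it only when needed
Data quality: every record carries a confidence_score; low-confidence results should be reviewed manually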
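
For the robots.txt point above, Python's standard library can check a URL against a site's rules before any request is queued. The sketch below parses an inline sample file so it runs without network access. The user-agent string `xscrapy-example` and the sample rules are made up for illustration, and this is independent of how XScrapy implements `respect_robots_txt` internally.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Build a parser from robots.txt content that was already downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Made-up rules: everything is allowed except paths under /private/.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

parser = build_parser(SAMPLE_ROBOTS)
allowed = parser.can_fetch("xscrapy-example", "https://example.com/product/123")
blocked = parser.can_fetch("xscrapy-example", "https://example.com/private/report")
```

Against a live site, `parser.set_url(...)` followed by `parser.read()` would fetch the real file instead of the inline sample.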

XScrapy v1.0.0 | Built with ❤️ and Scrapy