Category: Data and Analysis. No API key required.

Crawl4AI Web Crawler

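Scrape web pages and extract content with Crawl4AI. Suitable for fetching page content, extracting structured data, converting web pages to Markdown, and similar tasks.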

Author: openlarkhubclawhub

Crawl4AI is an open-source, LLM-friendly web crawler on GitHub that converts web pages into clean Markdown or structured JSON, ideal for RAG, AI Agents, and data pipelines.

For detailed API parameters, see references/api-reference.md.

Trigger Words

"scrape," "crawl," "extract webpage," "convert webpage to markdown," "structured extraction," etc.

Installation

pip install -U crawl4ai
crawl4ai-setup          # Automatically installs the Playwright browser
crawl4ai-doctor         # Verifies the installation

If the browser installation fails, run manually:

python -m playwright install --with-deps chromium

Core Architecture

Three core classes:

| Class | Purpose |
|-------|---------|
| AsyncWebCrawler | Main async crawler class, manages the browser lifecycle |
| BrowserConfig | Browser settings (headless, UA, proxy, viewport, etc.) |
| CrawlerRunConfig | Per-crawl settings (cache, extraction strategy, JS, screenshots, etc.) |

Basic Usage

Simplest Crawl

import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)  # LLM-ready Markdown

asyncio.run(main())

Crawl with Configuration

from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

browser_cfg = BrowserConfig(headless=True, verbose=True)
run_cfg = CrawlerRunConfig(
    cache_mode=CacheMode.BYPASS,     # BYPASS skips the cache; other modes: ENABLED, WRITE_ONLY, READ_ONLY
    css_selector="main.article",     # Only extract the specified area
    word_count_threshold=10,         # Filter out short text blocks
    screenshot=True,                 # Take a screenshot
)

# Run inside an async function, as in main() from the simplest crawl above
async with AsyncWebCrawler(config=browser_cfg) as crawler:
    result = await crawler.arun(url="https://example.com", config=run_cfg)
    print(result.markdown)
    if result.screenshot:
        print(f"Screenshot: {len(result.screenshot)} bytes base64")
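Since result.screenshot is a base64 string, it has to be decoded before it can be opened as an image. A minimal sketch of that step follows; the helper name save_screenshot and the output path are illustrative and not part of Crawl4AI.

```python
import base64
from pathlib import Path


def save_screenshot(screenshot_b64: str, path: str) -> int:
    """Decode a base64 screenshot string and write the bytes to `path`.

    Returns the number of bytes written. `save_screenshot` is a
    hypothetical helper, not a Crawl4AI function.
    """
    data = base64.b64decode(screenshot_b64)
    Path(path).write_bytes(data)
    return len(data)
```

Inside the crawl, this would be called as save_screenshot(result.screenshot, "page.png") after checking that result.screenshot is set.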

Command Line Tool

# Basic crawl
crwl https://example.com -o markdown

# Deep crawl (BFS, up to 10 pages)
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# LLM extraction
crwl https://example.com/products -q "Extract all product prices"

Markdown Generation

Using Content Filters

Raw Markdown is generated by default. Use DefaultMarkdownGenerator + content filters to get cleaner output:

from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

# Method 1: PruningContentFilter — density-based pruning
md_gen = DefaultMarkdownGenerator(
    content_filter=PruningContentFilter(
        threshold=0.48,           # 0-1; blocks scoring below this are removed, so higher values prune more
        threshold_type="fixed",   # "fixed" or "dynamic"
        min_word_threshold=0
    )
)

# Method 2: BM25ContentFilter — query relevance-based filtering
md_gen = DefaultMarkdownGenerator(
    content_filter=BM25ContentFilter(
        user_query="machine learning",  # Keywords to focus on
        bm25_threshold=1.0
    )
)

run_cfg = CrawlerRunConfig(markdown_generator=md_gen)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="...", config=run_cfg)
    print(len(result.markdown.raw_markdown))   # Raw MD
    print(len(result.markdown.fit_markdown))   # Filtered MD
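When tuning a filter threshold, it can help to turn the two lengths printed above into a single number. The sketch below does that on plain strings; filter_savings is a hypothetical helper name, not part of Crawl4AI.

```python
def filter_savings(raw_markdown: str, fit_markdown: str) -> float:
    """Fraction of characters the content filter removed (0.0 to 1.0).

    Hypothetical helper: pass result.markdown.raw_markdown and
    result.markdown.fit_markdown from a finished crawl.
    """
    if not raw_markdown:
        return 0.0
    removed = len(raw_markdown) - len(fit_markdown)
    return max(0.0, removed / len(raw_markdown))
```

A value close to 1.0 suggests the filter is discarding almost everything, and a value near 0.0 suggests it is barely filtering.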

Structured Data Extraction

CSS/XPath Extraction (No LLM Required, Fast and Free)

from crawl4ai import JsonCssExtractionStrategy
import json

schema = {
    "name": "Articles",
    "baseSelector": "article.post",     # Container for repeating elements
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
        {"name": "image", "selector": "img", "type": "attribute", "attribute": "src"},
    ]
}

run_cfg = CrawlerRunConfig(
    extraction_strategy=JsonCssExtractionStrategy(schema)
)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/blog", config=run_cfg)
    data = json.loads(result.extracted_content)
    print(data)  # [{"title": "...", "url": "...", "image": "..."}, ...]
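Attribute values such as href and src often come back as relative paths, so a small post-processing pass is usually needed. The following sketch, which assumes the three field names from the schema above, resolves URLs against the page URL and drops entries with no title. normalize_items is an illustrative name only.

```python
import json
from urllib.parse import urljoin


def normalize_items(extracted_content: str, base_url: str) -> list[dict]:
    """Parse the extraction JSON, drop untitled items, and make URLs absolute.

    Assumes the "title", "url", and "image" fields from the schema above.
    """
    cleaned = []
    for item in json.loads(extracted_content):
        title = (item.get("title") or "").strip()
        if not title:
            continue  # the title selector matched nothing in this container
        cleaned.append({
            "title": title,
            "url": urljoin(base_url, item.get("url") or ""),
            "image": urljoin(base_url, item["image"]) if item.get("image") else None,
        })
    return cleaned
```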

Auto-Generate Schema (one-time LLM cost, then reuse for free):

from crawl4ai import LLMConfig

schema = JsonCssExtractionStrategy.generate_schema(
    html="<div class='product'>...",
    llm_config=LLMConfig(provider="openai/gpt-4o", api_token="your-key")
    # Or use a local model: LLMConfig(provider="ollama/llama3.3", api_token=None)
)
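To actually get the "reuse for free" benefit, the generated schema has to be stored somewhere. One possible approach is to cache it in a JSON file, as sketched below. The helper name, the file path, and the generate callback are assumptions made for illustration; in practice the callback would wrap the generate_schema call shown above.

```python
import json
from pathlib import Path
from typing import Callable


def load_or_create_schema(path: str, generate: Callable[[], dict]) -> dict:
    """Return the schema cached at `path`, calling `generate` only on a miss.

    `generate` is any zero-argument callable that returns a schema dict,
    for example a lambda wrapping the LLM-based schema generation.
    """
    schema_file = Path(path)
    if schema_file.exists():
        return json.loads(schema_file.read_text(encoding="utf-8"))
    schema = generate()
    schema_file.write_text(json.dumps(schema, indent=2), encoding="utf-8")
    return schema
```

Deleting the cached file forces regeneration, which is useful when the site's markup changes and the selectors stop matching.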

LLM Extraction (Suitable for Unstructured Content)

import json
from pydantic import BaseModel, Field
from crawl4ai import LLMExtractionStrategy, LLMConfig

class Product(BaseModel):
    name: str = Field(..., description="Product name")
    price: str = Field(..., description="Price as string")
    description: str = Field(..., description="Short description")

llm_strategy = LLMExtractionStrategy(
    llm_config=LLMConfig(
        provider="openai/gpt-4o-mini",     # Also supports ollama/llama3, anthropic/claude-3, etc.
        api_token="your-api-key"
    ),
    schema=Product.model_json_schema(),
    extraction_type="schema",              # "schema" or "block"
    instruction="Extract all product objects with name, price, and description.",
    chunk_token_threshold=1000,            # Auto-chunk when exceeding this token count
    overlap_rate=0.1,                      # 10% overlap between chunks
    apply_chunking=True,
    input_format="markdown",               # "markdown" | "html" | "fit_markdown"
    extra_args={"temperature": 0.0, "max_tokens": 800}
)

run_cfg = CrawlerRunConfig(extraction_strategy=llm_strategy)

async with AsyncWebCrawler() as crawler:
    result = await crawler.arun(url="https://example.com/products", config=run_cfg)
    data = json.loads(result.extracted_content)
    llm_strategy.show_usage()  # Print token usage statistics
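To give a feel for what chunk_token_threshold and overlap_rate control, here is a simplified illustration of overlapping chunking. It splits on whitespace and treats each word as one token, which is an assumption made for readability; it is not Crawl4AI's actual chunker or tokenizer.

```python
def chunk_words(text: str, chunk_size: int, overlap_rate: float) -> list[str]:
    """Split `text` into word chunks of at most `chunk_size` words.

    Consecutive chunks share roughly `chunk_size * overlap_rate` words.
    Illustration only: words stand in for tokens.
    """
    words = text.split()
    if len(words) <= chunk_size:
        return [" ".join(words)] if words else []
    overlap = int(chunk_size * overlap_rate)
    step = max(1, chunk_size - overlap)  # always move forward
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this chunk already reached the end of the text
    return chunks
```

With a chunk size of 10 and an overlap rate of 0.1, each chunk starts with the last word of the previous one, so an item that straddles a chunk boundary is less likely to be cut in half.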

Extraction Strategy Selection Guide

| Scenario | Strategy |
|----------|----------|
| Repeating lists (products, articles, search results) | JsonCssExtractionStrategy |
| Unstructured text requiring AI understanding | LLMExtractionStrategy |
| High-frequency crawling of the same site | Generate Schema with LLM first, then extract via CSS |

Dynamic Page Handling

run_cfg = CrawlerRunConfig(
    js_code=[                          # JS executed on the page
        "window.scrollTo(0, document.body.scrollHeight)",
        "await new Promise(r => setTimeout(r, 2000))",
    ],
    wait_for="css:.content-loaded",     # Wait for a specific element to appear
    delay_before_return_html=2.0,       # Additional wait in seconds before returning
)
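Pages with infinite scroll usually need more than one scroll-and-wait round. As a rough sketch, the js_code list can be built programmatically from the same two snippets used above. scroll_steps is a hypothetical helper, and the right number of rounds and pause length depend on the target page.

```python
def scroll_steps(times: int, pause_ms: int = 2000) -> list[str]:
    """Build a js_code list that scrolls to the bottom `times` times.

    Each round scrolls, then pauses for `pause_ms` milliseconds so that
    lazily loaded content has time to render.
    """
    steps = []
    for _ in range(times):
        steps.append("window.scrollTo(0, document.body.scrollHeight)")
        steps.append(f"await new Promise(r => setTimeout(r, {pause_ms}))")
    return steps
```

The result would be passed as js_code=scroll_steps(3) when constructing CrawlerRunConfig.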

Batch Crawling

urls = ["https://example.com/page1", "https://example.com/page2", ...]

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun_many(urls=urls, config=run_cfg)
    for result in results:
        if result.success:
            print(result.markdown[:200])

arun_many() automatically handles rate limiting, memory monitoring, and concurrency control.
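In a batch, some URLs will typically fail, and it is worth keeping the reasons rather than silently skipping them. The sketch below splits a result list using only the success, url, and error_message fields; partition_results is an illustrative name, not a Crawl4AI function.

```python
def partition_results(results):
    """Split crawl results into (succeeded, failed).

    `succeeded` is a list of result objects; `failed` maps each failed
    URL to its error message so it can be logged or retried.
    """
    succeeded, failed = [], {}
    for result in results:
        if result.success:
            succeeded.append(result)
        else:
            failed[result.url] = result.error_message
    return succeeded, failed
```

The keys of the failed mapping can be fed back into a second arun_many() call as a simple retry pass.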

Browser Management

browser_cfg = BrowserConfig(
    browser_type="chromium",       # "chromium" | "firefox" | "webkit"
    headless=True,
    viewport_width=1920,
    viewport_height=1080,
    user_agent="Mozilla/5.0 ...",
    proxy="http://user:pass@proxy:8080",
    use_managed_browser=True,      # Launch a managed browser (pairs with user_data_dir below)
    user_data_dir="/path/to/profile",  # Persistent profile (to retain login state)
)
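Proxy credentials that contain characters such as @ or : must be percent-encoded before they are embedded in a URL like the one above. A small sketch using only the standard library follows; proxy_url is a hypothetical helper and its parameters are illustrative.

```python
from urllib.parse import quote


def proxy_url(host: str, port: int, user: str = "", password: str = "",
              scheme: str = "http") -> str:
    """Build a proxy URL, percent-encoding the credentials if present."""
    if user:
        creds = f"{quote(user, safe='')}:{quote(password, safe='')}@"
    else:
        creds = ""
    return f"{scheme}://{creds}{host}:{port}"
```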

Deep Crawl (Site-Level Crawling)

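from crawl4ai import BFSDeepCrawlStrategy
from crawl4ai.deep_crawling.filters import FilterChain, URLPatternFilter

deep_crawl = BFSDeepCrawlStrategy(
    max_depth=3,                    # Maximum depth
    max_pages=50,                   # Maximum number of pages
    filter_chain=FilterChain([      # Path filtering is configured through a filter chain
        URLPatternFilter(patterns=["*/docs/*"]),                # Only crawl matching paths
        URLPatternFilter(patterns=["*/blog/*"], reverse=True),  # Exclude matching paths
    ]),
)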

run_cfg = CrawlerRunConfig(deep_crawl_strategy=deep_crawl)

async with AsyncWebCrawler() as crawler:
    results = await crawler.arun(url="https://example.com", config=run_cfg)
    for r in results:
        print(f"{r.url}: {len(r.markdown)} chars")
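To see how the depth and page limits interact, the following toy sketch runs a breadth-first traversal over an in-memory link graph. It only illustrates the general BFS idea under those two caps; it is not Crawl4AI's implementation, and the graph and function name are made up for the example.

```python
from collections import deque


def bfs_crawl_order(graph: dict[str, list[str]], start: str,
                    max_depth: int, max_pages: int) -> list[tuple[str, int]]:
    """Return (url, depth) pairs in the order a BFS crawl would visit them.

    `graph` maps each URL to the links found on that page.
    """
    visited = {start}
    queue = deque([(start, 0)])
    order = []
    while queue and len(order) < max_pages:
        url, depth = queue.popleft()
        order.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in graph.get(url, []):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return order
```

Because BFS visits every page at one depth before going deeper, a small page cap tends to yield broad, shallow coverage of the site.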

Docker Deployment

docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Dashboard: http://localhost:11235/dashboard
# Playground: http://localhost:11235/playground

Python Client:

import requests

resp = requests.post("http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10})

task_id = resp.json()["task_id"]
result = requests.get(f"http://localhost:11235/task/{task_id}")
print(result.json())
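A task may not be finished by the time of the first GET, so a client would normally poll until it is done. The sketch below shows one way to do that with a timeout. The "status" field and the values "completed" and "failed" are assumptions about the response shape and should be checked against the server version in use; wait_for_task is a hypothetical helper.

```python
import time
from typing import Callable


def wait_for_task(fetch_status: Callable[[], dict], timeout: float = 60.0,
                  interval: float = 1.0,
                  sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call `fetch_status` until the task reports a final state.

    Assumes the payload carries a "status" key whose final values are
    "completed" or "failed" (an assumption, not a documented contract).
    Raises TimeoutError if no final state is seen within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        payload = fetch_status()
        if payload.get("status") in ("completed", "failed"):
            return payload
        if time.monotonic() >= deadline:
            raise TimeoutError("task did not finish in time")
        sleep(interval)
```

With the client above, fetch_status could be lambda: requests.get(f"http://localhost:11235/task/{task_id}").json().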

CrawlResult Key Fields

result.url              # Final URL (after any redirects)
result.html             # Raw HTML
result.cleaned_html     # Cleaned HTML
result.markdown         # Markdown formatted output (contains raw_markdown and fit_markdown)
result.extracted_content # JSON string returned by the extraction strategy
result.screenshot       # Base64 screenshot
result.media            # Image/video information
result.links            # Internal and external link information
result.success          # Whether the crawl was successful
result.error_message    # Error message
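As an example of consuming these fields, the sketch below flattens result.links into plain URL lists. It assumes links is a dict with "internal" and "external" keys, each holding a list of dicts with an "href" entry; that shape is an assumption to verify against the installed version. link_hrefs is an illustrative name.

```python
def link_hrefs(links: dict) -> dict[str, list[str]]:
    """Collect href values per link kind, skipping entries without one.

    Assumed input shape: {"internal": [{"href": ...}, ...], "external": [...]}.
    """
    return {
        kind: [link["href"] for link in links.get(kind, []) if link.get("href")]
        for kind in ("internal", "external")
    }
```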

FAQ

Playwright browser not installed:

python -m playwright install --with-deps chromium

Cache issues causing stale data to be returned: Set cache_mode=CacheMode.BYPASS to skip the cache.

Dynamic content not loading: Use wait_for="css:selector" to wait for the target element, or js_code to execute scrolling.

Out of memory (batch crawling): Reduce the concurrency level; arun_many() automatically monitors memory and adapts.

Anti-bot / detection: Enable use_managed_browser=True in BrowserConfig or configure a proxy.