Firecrawl Skill
This skill extends Manus with the capability to search, scrape, crawl, and extract structured data from any website using Firecrawl [1] [2].
- Author: Simon-Pierre Boucher
- Target Audience: AI Engineers, Agent Developers, Data Engineers, Web Scraping Engineers
1. Core Workflows
1.1 Scraping a Single URL (/scrape)
Use when you need the text content, Markdown, HTML, or screenshots of a specific webpage [2].
- Initialize the Firecrawl client with your API key [1].
- Specify output formats (e.g., ["markdown"] or ["markdown", "screenshot"]) [2].
- Apply custom browser actions (e.g., click, wait, write) if the page has dynamic content or requires interaction [2].
- Optionally filter the DOM using includeTags or excludeTags [2].
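The scrape steps above can be sketched as a request body plus a thin HTTP wrapper. This is a minimal sketch, not the official SDK: the endpoint URL (including the v1 path), the Bearer-token header, and the helper names build_scrape_payload and scrape are assumptions made for illustration, so check them against the API reference before relying on them.

```python
import json
import urllib.request

# Assumption: hosted endpoint and API version; neither is stated in this skill.
SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_payload(url, formats=("markdown",), actions=None,
                         include_tags=None, exclude_tags=None):
    """Assemble the JSON body for a single-page scrape (hypothetical helper)."""
    payload = {"url": url, "formats": list(formats)}
    # Optional fields are only added when the caller supplies them.
    if actions:
        payload["actions"] = list(actions)
    if include_tags:
        payload["includeTags"] = list(include_tags)
    if exclude_tags:
        payload["excludeTags"] = list(exclude_tags)
    return payload


def scrape(api_key, payload, timeout=60):
    """POST the payload and return the decoded JSON response (hypothetical helper)."""
    request = urllib.request.Request(
        SCRAPE_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Build a body that waits for a selector and drops navigation and footer tags.
example_payload = build_scrape_payload(
    "https://example.com/pricing",
    formats=["markdown", "screenshot"],
    actions=[{"type": "wait", "selector": "#loaded-element"}],
    exclude_tags=["nav", "footer"],
)
```

Keeping payload construction separate from the network call makes the request body easy to inspect and unit-test without spending credits.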
1.2 Crawling a Domain (/crawl)
Use when you need to discover and scrape all pages under a specific domain or path recursively [1] [2].
- Start an asynchronous crawl job by specifying the starting url [2].
- Set depth limits (maxDepth) and page limits (limit) to control token and credit usage [2].
- Configure scrapeOptions to ensure each crawled page is parsed with the correct format (e.g., Markdown only) [2].
- Poll the crawl status using the jobId until completed [2].
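The start-then-poll pattern above might look like the following sketch. The status strings "completed" and "failed", the shape of the status dictionary, and the helper names are assumptions; get_status stands in for whatever call fetches the job status by its ID.

```python
import time


def build_crawl_payload(url, max_depth=2, limit=50, formats=("markdown",)):
    """Assemble a crawl request body with explicit depth and page limits."""
    return {
        "url": url,
        "maxDepth": max_depth,
        "limit": limit,
        "scrapeOptions": {"formats": list(formats)},
    }


def poll_crawl(get_status, job_id, interval_s=5.0, timeout_s=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """Call get_status(job_id) until the job completes, fails, or times out.

    get_status is a caller-supplied function returning a dict with a
    "status" key (assumed values: "completed", "failed", anything else
    means the job is still running).
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status(job_id)
        state = status.get("status")
        if state == "completed":
            return status
        if state == "failed":
            raise RuntimeError(f"crawl job {job_id} failed")
        if clock() >= deadline:
            raise TimeoutError(f"crawl job {job_id} still running after {timeout_s}s")
        sleep(interval_s)
```

Injecting sleep and clock keeps the loop testable: a stubbed get_status and a no-op sleep exercise the logic without waiting or calling the API.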
1.3 Mapping a Domain (/map)
Use when you need to quickly discover all URLs belonging to a domain without scraping page content [1] [2].
- Provide the base url [2].
- Optionally provide a search filter to only return URLs matching a specific keyword or path [2].
- Set includeSubdomains to true if you need sub-domain discovery [2].
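As a rough illustration of the mapping parameters, the sketch below builds a map request body and then narrows a list of discovered URLs locally. Both helper names are hypothetical, and the assumption that the response yields plain URL strings may not hold for every API version.

```python
from urllib.parse import urlparse


def build_map_payload(url, search=None, include_subdomains=False):
    """Assemble a map request body (hypothetical helper)."""
    payload = {"url": url, "includeSubdomains": include_subdomains}
    if search:
        payload["search"] = search
    return payload


def filter_by_path(links, path_prefix):
    """Keep only the URLs whose path starts with path_prefix."""
    return [link for link in links if urlparse(link).path.startswith(path_prefix)]


# Example: URLs as a map call might return them (illustrative data only).
discovered = [
    "https://example.com/docs/intro",
    "https://example.com/blog/launch",
    "https://docs.example.com/docs/api",
]
docs_only = filter_by_path(discovered, "/docs")
```

The filtered list can then be passed to single-page scrapes, which avoids crawling sections of the site that are not needed.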
1.4 Structured Extraction (/extract)
Use when you need to parse raw web pages and extract structured JSON data conforming to a specific schema [3].
- Provide an array of urls and a natural language extraction prompt [3].
- Define the target schema using a JSON Schema, Pydantic model (Python), or Zod schema (TypeScript) [3].
- Run the extraction to retrieve JSON that conforms to the schema [3].
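A minimal sketch of the extraction inputs, using a plain JSON Schema dictionary so that no third-party package is required. The product schema, the field names, and both helpers are invented for illustration; the local check covers only required keys and is not a full JSON Schema validator.

```python
# Illustrative schema: the fields are assumptions, not part of the skill.
PRODUCT_SCHEMA = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "price": {"type": "number"},
        "in_stock": {"type": "boolean"},
    },
    "required": ["name", "price"],
}


def build_extract_payload(urls, prompt, schema):
    """Assemble an extract request body (hypothetical helper)."""
    if not urls:
        raise ValueError("at least one URL is required")
    return {"urls": list(urls), "prompt": prompt, "schema": schema}


def missing_required(record, schema):
    """Return the required keys from the schema that are absent in record."""
    return [key for key in schema.get("required", []) if key not in record]


extract_payload = build_extract_payload(
    ["https://example.com/products/widget"],
    "Extract the product name, price, and stock status.",
    PRODUCT_SCHEMA,
)
```

Checking the returned record with missing_required before passing it downstream is a cheap guard in case a page lacks one of the expected fields.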
2. Resource Guides
For comprehensive API parameters, SDK code templates, and configuration options, read the following reference files:
- API Reference & SDK Snippets: Read references/api_reference.md for complete endpoint request/response schemas, Python SDK templates, and TypeScript/Zod snippets.
- Self-Hosting & Docker: Read references/self_hosting.md for production-ready Docker Compose configurations, environment variables, and scaling guidelines.
3. Best Practices & Anti-Patterns
3.1 Best Practices
- Always use onlyMainContent: true to strip out navigation bars, headers, and footers. This dramatically reduces downstream LLM token costs and keeps context windows clean [2].
- Leverage /map before /crawl if you only need to discover pages or filter specific URLs to scrape. Mapping is significantly faster and cheaper than full crawls [1] [2].
- Implement exponential backoff with jitter when handling rate limits (429) or transient server errors (5xx) to ensure scraping resiliency [4].
- Set explicit CPU and RAM limits on your containers if self-hosting to prevent headless Chromium from consuming all host system resources [5].
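The backoff recommendation above can be sketched as follows. This uses the "full jitter" variant (a uniform random delay between zero and an exponentially growing cap), which is one common choice rather than something the Firecrawl documentation prescribes. The send callable, the RetryableError class, and the default timings are assumptions for illustration.

```python
import random
import time


class RetryableError(Exception):
    """Raised when every attempt ended in a retryable status code."""

    def __init__(self, status_code):
        super().__init__(f"gave up after retryable HTTP status {status_code}")
        self.status_code = status_code


def is_retryable(status_code):
    """Rate limits (429) and server errors (5xx) are worth retrying."""
    return status_code == 429 or 500 <= status_code < 600


def backoff_delay(attempt, base_s=1.0, cap_s=30.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap_s, base_s * 2**attempt))."""
    return rng() * min(cap_s, base_s * (2 ** attempt))


def call_with_backoff(send, max_attempts=5, sleep=time.sleep, rng=random.random):
    """Call send() until it returns a non-retryable status or attempts run out.

    send is a caller-supplied function returning (status_code, body).
    """
    for attempt in range(max_attempts):
        status_code, body = send()
        if not is_retryable(status_code):
            return status_code, body
        if attempt == max_attempts - 1:
            raise RetryableError(status_code)
        sleep(backoff_delay(attempt, rng=rng))
```

The jitter spreads retries from concurrent workers over time, so they do not all hit the API again at the same instant after a shared rate-limit response.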
3.2 Anti-Patterns
- Do not use hard-coded waitFor delays when scraping dynamic content. Instead, use selector-based waits (e.g., {"type": "wait", "selector": "#loaded-element"}) to minimize request latency [2].
- Do not run synchronous crawls. Crawling is an inherently long-running process; always use the asynchronous /crawl endpoint and poll for results or use webhooks [2].
- Do not reuse browser sessions across unrelated scraping tasks if security isolation is required. Firecrawl relies on ephemeral containers to prevent session contamination [5].
References
[1] Firecrawl Homepage, "The API to search, scrape, and interact with the web at scale." URL: https://github.com/firecrawl/firecrawl
[2] Firecrawl Documentation, "Advanced Scraping Guide." URL: https://docs.firecrawl.dev/advanced-scraping-guide
[3] Firecrawl Documentation, "Agent Endpoint." URL: https://docs.firecrawl.dev/features/agent
[4] Firecrawl Documentation, "Rate Limits." URL: https://docs.firecrawl.dev/rate-limits
[5] Firecrawl GitHub Repository, "Self-hosting Firecrawl Guide." URL: https://raw.githubusercontent.com/firecrawl/firecrawl/main/SELF_HOST.md