Web Scraper

Overview

A web scraping skill built on agent-browser. It extracts structured data from websites, with support for JavaScript-rendered content, pagination, and complex selectors.

Use Cases

E-commerce: Product listings, prices, reviews, inventory
Real Estate: Property listings, prices, agent contacts
Job Boards: Job postings, salaries, requirements
News/Media: Articles, headlines, publication dates
Directories: Business listings, contact information
Competitor Monitoring: Prices, products, content changes

Quick Start

Scrape Single Page

python scripts/scrape_page.py \
  --url "https://example.com/products" \
  --fields "title=h2.title,price=.price,link=a.href" \
  --output products.csv

Scrape with Pagination

python scripts/scrape_paginated.py \
  --url "https://example.com/products?page={page}" \
  --pages 10 \
  --fields "title=h2.title,price=.price,description=.description" \
  --output all_products.csv

Scripts

scrape_page.py

Scrape a single page or static list.

Arguments:

--url - Target URL
--fields - Field definitions (name=selector format, comma-separated)
--output - Output file (CSV, JSON, or XLSX)
--format - Output format (csv, json, xlsx)
--wait - Wait time for dynamic content (seconds)

Field Definition Format:

fieldname=css_selector

Examples (a trailing .src or .href appears to name the attribute to extract instead of the element text):

title=h1.product-title
price=.price-tag
description=.product-description
image=img.product-image.src
link=a.product-link.href
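
To make the field format concrete, here is a minimal sketch of how a `--fields` string could be split into names, selectors, and attributes. It is not the actual code in scrape_page.py. `FieldSpec`, `parse_fields`, and `ATTRIBUTE_SUFFIXES` are hypothetical names, and treating a trailing `.href` or `.src` as an attribute is an assumption inferred from the examples above.

```python
from typing import NamedTuple, Optional

# Assumption: only the suffixes that appear in the examples are attributes.
ATTRIBUTE_SUFFIXES = ("href", "src")


class FieldSpec(NamedTuple):
    name: str
    selector: str
    attribute: Optional[str]  # None means "use the element's text"


def parse_fields(spec: str) -> list[FieldSpec]:
    """Parse 'name=selector,name=selector' into FieldSpec entries."""
    fields = []
    for item in spec.split(","):
        name, sep, selector = item.partition("=")
        name, selector = name.strip(), selector.strip()
        if not sep or not name or not selector:
            raise ValueError(f"expected name=selector, got {item!r}")
        attribute = None
        # Split off a trailing ".href" / ".src", but leave ".price" alone.
        head, dot, tail = selector.rpartition(".")
        if dot and head and tail in ATTRIBUTE_SUFFIXES:
            selector, attribute = head, tail
        fields.append(FieldSpec(name, selector, attribute))
    return fields
```

Because this sketch splits on commas, a selector that itself contains a comma (for example a selector group such as `h1, h2`) would need a different separator or quoting scheme.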

scrape_paginated.py

Scrape multiple pages with pagination.

Arguments:

--url - URL pattern (use {page} for page number)
--pages - Number of pages to scrape
--fields - Field definitions
--output - Output file
--delay - Delay between pages (seconds)
--next-selector - CSS selector for "next page" button (alternative to URL pattern)
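
As an illustration of the `{page}` pattern, the sketch below expands a URL pattern into the list of page URLs to visit. `page_urls` is a hypothetical helper, and starting the numbering at 1 is an assumption; some sites count pages from 0.

```python
def page_urls(pattern: str, pages: int, start: int = 1) -> list[str]:
    """Expand a URL pattern containing {page} into concrete page URLs."""
    if "{page}" not in pattern:
        raise ValueError("URL pattern must contain {page}")
    # str.replace is used instead of str.format so that other braces
    # in the URL are left untouched.
    return [pattern.replace("{page}", str(n)) for n in range(start, start + pages)]
```

For example, `page_urls("https://example.com/products?page={page}", 3)` yields the URLs for pages 1, 2, and 3.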

scrape_infinite_scroll.py

Scrape pages with infinite scroll loading.

Arguments:

--url - Target URL
--scrolls - Number of scroll actions
--fields - Field definitions
--output - Output file
--scroll-delay - Delay between scrolls (ms)

scrape_dynamic.py

Scrape JavaScript-heavy sites with custom interactions.

Arguments:

--url - Target URL
--actions - JSON file with interaction sequence
--fields - Field definitions
--output - Output file

Configuration

Actions JSON Format (for dynamic scraping)

{
  "actions": [
    {"type": "click", "selector": "#load-more"},
    {"type": "wait", "ms": 2000},
    {"type": "scroll", "direction": "down", "pixels": 500},
    {"type": "fill", "selector": "#search", "value": "keyword"},
    {"type": "press", "key": "Enter"}
  ]
}
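
Checking an actions file before a long run can catch typos early. The following is a minimal validation sketch, assuming the five action types and keys shown in the sample above are the full set; `REQUIRED_KEYS` and `validate_actions` are hypothetical names, not part of scrape_dynamic.py.

```python
import json

# Assumption: required keys per action type, taken from the sample above.
REQUIRED_KEYS = {
    "click": {"selector"},
    "wait": {"ms"},
    "scroll": {"direction", "pixels"},
    "fill": {"selector", "value"},
    "press": {"key"},
}


def validate_actions(text: str) -> list[dict]:
    """Parse an actions JSON document and check each action's keys."""
    data = json.loads(text)
    actions = data.get("actions") if isinstance(data, dict) else None
    if not isinstance(actions, list):
        raise ValueError('top-level "actions" must be a list')
    for index, action in enumerate(actions):
        if not isinstance(action, dict):
            raise ValueError(f"action {index}: expected an object")
        kind = action.get("type")
        if kind not in REQUIRED_KEYS:
            raise ValueError(f"action {index}: unknown type {kind!r}")
        missing = REQUIRED_KEYS[kind] - set(action)
        if missing:
            raise ValueError(f"action {index}: missing {sorted(missing)}")
    return actions
```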

Output Formats

CSV:

title,price,link,url
"Product A",29.99,https://...,https://...
"Product B",39.99,https://...,https://...

JSON:

[
  {
    "title": "Product A",
    "price": "29.99",
    "link": "https://...",
    "scraped_at": "2026-03-07T16:00:00"
  }
]

Excel (XLSX):

Same data as CSV, with formatting options
Multiple-sheet support
Auto-fit column widths
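
For reference, scraped records can be serialized to the CSV and JSON shapes above with the standard library alone. This is a sketch, not the skill's own writer; `dump_records` is a hypothetical name, and XLSX is left out because it needs a third-party library.

```python
import csv
import io
import json


def dump_records(records: list[dict], fmt: str) -> str:
    """Serialize a list of flat dicts as CSV or JSON text."""
    if fmt == "json":
        return json.dumps(records, indent=2)
    if fmt == "csv":
        # Use the union of keys, in first-seen order, as the header so
        # that records with missing fields still line up.
        fieldnames = list(dict.fromkeys(key for row in records for key in row))
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=fieldnames, lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)  # missing fields are written as ""
        return buffer.getvalue()
    raise ValueError(f"unsupported format: {fmt!r}")
```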

Examples

Example 1: Scrape E-commerce Products

python scripts/scrape_paginated.py \
  --url "https://example.com/shop?page={page}" \
  --pages 5 \
  --fields "name=.product-name,price=.price,rating=.stars,reviews=.review-count,url=a.href" \
  --output products.csv \
  --delay 3

Example 2: Scrape News Articles

python scripts/scrape_page.py \
  --url "https://news-site.com/latest" \
  --fields "headline=h2.article-title,summary=.article-summary,author=.byline,date=.publish-date,url=a.read-more.href" \
  --output articles.json \
  --format json

Example 3: Scrape Job Postings

python scripts/scrape_infinite_scroll.py \
  --url "https://jobs-site.com/search" \
  --scrolls 10 \
  --fields "title=.job-title,company=.company-name,location=.location,salary=.salary,posted=.date-posted,url=a.job-link.href" \
  --output jobs.csv \
  --scroll-delay 1500

Example 4: Scrape Real Estate Listings

python scripts/scrape_paginated.py \
  --url "https://realestate.com/listings?page={page}" \
  --pages 20 \
  --fields "address=.property-address,price=.listing-price,beds=.bedrooms,baths=.bathrooms,sqft=.square-feet,url=a.property-link.href" \
  --output listings.xlsx \
  --format xlsx \
  --delay 5

Best Practices

Respect robots.txt - Check and follow site rules
Rate limiting - Add delays between requests (2-5s recommended)
Error handling - Handle missing elements gracefully
User-Agent - Use realistic browser headers
Retry logic - Implement retries for failed requests
Data validation - Validate extracted data before saving
Storage - Save intermediate results for long scrapes
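
Two of the practices above, checking robots.txt and retrying failed requests, can be sketched with the standard library. `allowed_by_robots` and `with_retries` are hypothetical helpers, and the exponential backoff schedule is one reasonable choice rather than what the bundled scripts do.

```python
import time
from typing import Callable, TypeVar
from urllib.robotparser import RobotFileParser

T = TypeVar("T")


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against the text of an already-downloaded robots.txt."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def with_retries(
    fetch: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fetch(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 2s, 4s, 8s, ...
    raise ValueError("attempts must be at least 1")
```

Catching every `Exception` keeps the sketch short; in practice it is usually better to retry only on network and timeout errors.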

Anti-Scraping Measures

Some sites employ anti-scraping techniques:

| Measure | Countermeasure |
|---------|----------------|
| IP blocking | Use proxies, rotate IPs |
| CAPTCHA | Manual solving or CAPTCHA services |
| Rate limiting | Increase delays, randomize timing |
| JavaScript challenges | Use browser automation (agent-browser) |
| Honeypot traps | Avoid hidden fields, validate selectors |
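
Randomized timing, listed above as a response to rate limiting, can be as simple as drawing each delay from a range around a base value. `jittered_delay` is a hypothetical helper, and the 50% spread is an arbitrary default.

```python
import random


def jittered_delay(base: float, spread: float = 0.5, rng=random) -> float:
    """Return a delay drawn uniformly from base*(1-spread) to base*(1+spread)."""
    return rng.uniform(base * (1 - spread), base * (1 + spread))
```

With a base of 3 seconds and the default spread, each call returns a value between 1.5 and 4.5 seconds, which can then be passed to `time.sleep`.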

Legal Considerations

Public data: Often lawful to scrape, but this varies by jurisdiction and by how the data is used
Terms of Service: Review site ToS before scraping
Copyright: Don't republish copyrighted content
Personal data: GDPR/privacy laws may apply
Commercial use: May require permission

Disclaimer: This skill is for educational purposes. Users are responsible for compliance with applicable laws and website terms.

Troubleshooting

Elements not found: Verify CSS selectors with browser dev tools
Empty results: Check if content is JavaScript-rendered (use dynamic scraping)
Timeout errors: Increase wait time or check network
Blocked requests: Add delays, rotate user agents, or use proxies
Incomplete data: Verify pagination or scroll handling
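
One way to tell whether empty results come from JavaScript rendering is to look for the target class in the raw, unrendered HTML. The sketch below counts elements carrying a given class using only the standard library; `ClassCounter` and `count_class_in_static_html` are hypothetical names. A count of zero for a class that is visible in the browser suggests the content is rendered client-side.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute includes a given class name."""

    def __init__(self, class_name: str) -> None:
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_class_in_static_html(html: str, class_name: str) -> int:
    """Return how many elements in the raw HTML carry class_name."""
    counter = ClassCounter(class_name)
    counter.feed(html)
    counter.close()
    return counter.count
```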

References

CSS Selector Guide

See references/css-selectors.md for comprehensive selector examples.

Common Website Patterns

See references/website-patterns.md for common HTML structures and selectors.