Auto Scraping to CSV — Agent-Driven Web Scraping
Scrape any webpage using text-based DOM manipulation and export structured data to CSV. The agent handles complex page nuances — infinite scroll, pagination, popups, lazy loading — and asks clarifying questions when data is ambiguous. No external LLM required.
Philosophy: Let the Agent Figure It Out
Traditional scraping requires you to inspect HTML, write CSS selectors, handle edge cases, and debug when the site changes. This skill flips that:
You say what you want. The agent handles the how.
You: "Scrape product catalog"
Agent: "I see 50 products across 5 pages with infinite scroll.
I also see 'Price', 'Sale Price', and 'Member Price' columns.
Which price field should I extract?"
You: "Sale Price"
Agent: [handles scrolling, pagination, extracts 50 rows] → products.csv
The agent will:
- Explore the page structure via text-based DOM
- Detect complexity — scroll, pagination, tabs, filters
- Ask questions when ambiguous (multiple price fields, missing data, format choices)
- Handle edge cases — dismiss popups, wait for lazy loading, retry on errors
- Export CSV — clean, structured, ready to use
When to Use
- Product catalogs: "Scrape all laptops with prices and ratings"
- News/articles: "Get latest blog posts with titles, dates, authors"
- Directory listings: "Extract all company names, emails, and websites"
- Table data: "Get the pricing table from this SaaS page"
- Real estate: "Scrape listings with price, beds, baths, square footage"
- Job boards: "Get job titles, companies, locations, and salary ranges"
- Social feeds: "Extract posts with engagement counts (likes, comments, shares)"
- Research data: "Get citation counts, authors, publication dates from this index"
How It Works
Claude (Host Model)
↕ HTTP
Bridge Server (Node.js + Playwright)
↕ page.evaluate()
Browser (Chromium) ← Page-Agent injected
- Bridge launches a local Chromium browser via Playwright
- Page-Agent is injected into the target page as an IIFE script loaded from a CDN
- Page-Agent indexes the DOM and generates a simplified text representation:
[5]<a >Widget Pro Laptop /></a>
[12]<div >$1,299.99 /></div>
[18]<div >4.5 stars (128 reviews) /></div>
- Claude receives the text state, understands the page structure, decides actions
- Agent asks questions when data is ambiguous or complex
- Loop continues until all data is extracted or user says stop
- CSV export converts structured JSON to CSV
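To make the indexed text format more concrete, here is a minimal Python sketch that splits the three sample elements shown above into index, tag, and text. The regular expression and the names `IndexedElement` and `parse_dom_text` are assumptions made for illustration; they are inferred from the sample lines only, and the real Page-Agent output may contain forms this pattern does not cover.

```python
import re
from typing import List, NamedTuple


class IndexedElement(NamedTuple):
    index: int   # the number in square brackets, used to target the element
    tag: str     # the HTML tag name
    text: str    # the visible text of the element


# Assumed shape, taken from the sample only: [N]<tag attrs>text />
ELEMENT_RE = re.compile(r"\[(\d+)\]<(\w+)[^>]*>(.*?)\s*/>")


def parse_dom_text(state: str) -> List[IndexedElement]:
    """Extract every indexed element found in a text-based DOM state."""
    return [
        IndexedElement(int(m.group(1)), m.group(2), m.group(3).strip())
        for m in ELEMENT_RE.finditer(state)
    ]


sample = """[5]<a >Widget Pro Laptop /></a>
[12]<div >$1,299.99 /></div>
[18]<div >4.5 stars (128 reviews) /></div>"""

elements = parse_dom_text(sample)
for element in elements:
    print(element.index, element.tag, element.text)
```

The index is the useful part: it gives the host model a short, stable handle for an element without needing a CSS selector.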
First-Time Setup
1. Install Playwright
npm install -D playwright
npx playwright install chromium
2. Place the Bridge Script
# Copy from this skill folder to your agents directory
cp .claude/skills/auto-scraping-to-csv/page-agent-bridge.mjs .claude/agents/
3. Start the Bridge
node .claude/agents/page-agent-bridge.mjs
Default port: 9876.
4. Verify Health
curl http://localhost:9876/health
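The same check can be scripted, for example before starting a scraping job. The sketch below only looks at the HTTP status code, because the body of the `/health` response is not documented here; the function name `bridge_is_healthy` is hypothetical.

```python
import urllib.request


def bridge_is_healthy(base_url: str = "http://localhost:9876",
                      timeout: float = 2.0) -> bool:
    """Return True if GET /health answers with HTTP 200, False otherwise."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers connection refused, timeouts, and HTTP error statuses
        # (urllib.error.URLError and HTTPError are OSError subclasses).
        return False


if __name__ == "__main__":
    print("bridge up" if bridge_is_healthy() else "bridge not reachable")
```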
Agent-Driven Scraping Workflow
Step 1: Request What You Want
/scrape-to-csv https://example.com/products
"Extract all products with name, price, rating, and number of reviews"
Step 2: Agent Explores and Asks Questions
The agent will fetch the DOM state and may ask:
Agent: "I found 24 products on the current page. The site has infinite scroll
and I can see more products loading as I scroll.
How many products do you want total? (type 'all' for everything)"
You: "50"
Agent: "I see three price fields:
- Original Price: $1,299
- Sale Price: $999
- Member Price: $899
Which one should I extract?"
You: "Sale Price"
Agent: "Some products are out of stock and show 'N/A' for price.
Should I:
A) Include them with 'N/A'
B) Skip those rows
C) Mark as 'Out of Stock' in a separate column"
You: "C"
Step 3: Agent Handles Complexity
The agent automatically:
- Scrolls to load more content
- Clicks pagination buttons
- Waits for lazy-loaded images/data
- Dismisses cookie banners and popups
- Retries if elements are slow to appear
- Stops gracefully if the site blocks scraping
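The "retries if elements are slow to appear" behavior can be pictured as a probe that is repeated with growing pauses. This is a sketch of one common approach (exponential backoff), not a description of what the bridge does internally; `retry_until` and its parameters are hypothetical.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def retry_until(probe: Callable[[], Optional[T]],
                attempts: int = 5,
                base_delay: float = 0.5,
                sleep: Callable[[float], None] = time.sleep) -> Optional[T]:
    """Call probe() until it returns a non-None value.

    The wait doubles after every failed attempt. Returns None if all
    attempts fail, so the caller can stop gracefully or ask the user.
    """
    delay = base_delay
    for attempt in range(attempts):
        result = probe()
        if result is not None:
            return result
        if attempt < attempts - 1:  # no point sleeping after the last try
            sleep(delay)
            delay *= 2
    return None
```

Here `probe` would be any function that looks for the element in the current DOM state and returns `None` while it is still missing. Passing `sleep` as a parameter keeps the helper easy to test without real waiting.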
Step 4: Preview and Confirm
Agent: "Extracted 50 products. Here's a preview:
| name | price | rating | reviews | stock_status |
|------|-------|--------|---------|-------------|
| Widget Pro | $999 | 4.5 | 128 | In Stock |
| Gadget Mini | $499 | 4.2 | 89 | In Stock |
| ... | ... | ... | ... | ... |
Save as 'products.csv'? (yes/no/change columns)"
You: "yes"
Step 5: CSV Delivered
Agent: "Saved to ./products.csv — 50 rows, 5 columns"
Handling Complex Scenarios
Infinite Scroll
You: "Scrape all posts from this feed"
Agent: "The page loads 10 posts at a time via infinite scroll.
I've scrolled 3 times and found 30 posts so far.
Should I continue scrolling? (yes/no/scroll N more times)"
You: "scroll 5 more times"
Agent: [scrolls] → "Found 80 total posts. Saved to feed.csv"
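The scroll loop in this exchange needs two stop conditions: the requested number of items has been reached, or a scroll produced nothing new. A minimal sketch, assuming two hypothetical callables (`count_items` and `scroll_once`) that wrap whatever the bridge exposes:

```python
from typing import Callable, Optional


def scroll_until(count_items: Callable[[], int],
                 scroll_once: Callable[[], None],
                 target: Optional[int] = None,
                 max_scrolls: int = 20) -> int:
    """Scroll until `target` items are loaded or the feed stops growing.

    Returns the number of items visible when the loop ends.
    """
    seen = count_items()
    for _ in range(max_scrolls):
        if target is not None and seen >= target:
            break
        scroll_once()
        now = count_items()
        if now == seen:  # nothing new loaded: assume the end of the feed
            break
        seen = now
    return seen
```

The `max_scrolls` cap is a safety net for feeds that never end; an agent would typically check back with the user when it is hit.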
Pagination
You: "Get all products from this catalog"
Agent: "I see pagination: Page 1 of 12.
I can scrape all 12 pages (may take 2-3 minutes) or a subset.
How many pages?"
You: "all"
Agent: [clicks through 12 pages] → "360 products saved to catalog.csv"
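Pagination follows a similar pattern: extract the current page, then move on until there is no next page or the requested page count is reached. The sketch below assumes two hypothetical callables, `extract_rows` (returns the rows on the current page) and `go_to_next` (clicks the next-page control and returns `False` when there is none).

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


def scrape_pages(extract_rows: Callable[[], List[Row]],
                 go_to_next: Callable[[], bool],
                 max_pages: Optional[int] = None) -> List[Row]:
    """Collect rows page by page; max_pages=None means 'all'."""
    rows: List[Row] = []
    pages_done = 0
    while True:
        rows.extend(extract_rows())
        pages_done += 1
        if max_pages is not None and pages_done >= max_pages:
            break
        if not go_to_next():  # no further page available
            break
    return rows
```

With 12 pages of 30 products each, this loop would return the 360 rows mentioned in the exchange above.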
Login Required
You: "Scrape my order history"
Agent: "The page redirects to a login screen.
I need credentials to proceed.
Please provide:
- Username/email
- Password
Or type 'skip' to abort."
You: "skip"
Agent: "Aborted. No data extracted."
Multiple Data Formats
You: "Get prices from this page"
Agent: "I found prices in multiple formats:
- '$19.99' (US format)
- '€17.99' (EU format)
- '¥2,980' (JP format)
Which format should I standardize to?"
You: "Keep as-is, add a 'currency' column"
Agent: [extracts with currency detection] → prices.csv
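The "keep as-is, add a currency column" choice can be sketched as follows. The symbol-to-code mapping is an assumption for illustration only: `$` is also used by several non-US currencies and `¥` by the Chinese yuan, so a real run would need context from the page to decide.

```python
from typing import Dict

# Assumed mapping for the three formats in the example; ambiguous in general.
CURRENCY_SYMBOLS = {"$": "USD", "€": "EUR", "¥": "JPY"}


def split_price(raw: str) -> Dict[str, str]:
    """Keep the price text unchanged and add a detected currency code.

    Unknown or missing symbols produce an empty currency field.
    """
    text = raw.strip()
    currency = CURRENCY_SYMBOLS.get(text[:1], "")
    return {"price": text, "currency": currency}


for sample in ["$19.99", "€17.99", "¥2,980"]:
    print(split_price(sample))
```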
Missing / Inconsistent Data
You: "Extract employee data from this directory"
Agent: "I see 100 employee cards. Some are missing:
- 12 missing phone numbers
- 8 missing job titles
- 3 missing photos
How should I handle missing fields?
A) Leave blank
B) Fill with 'N/A'
C) Skip those rows entirely"
You: "B"
Agent: [extracts 100 rows with 'N/A' for missing fields] → employees.csv
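The three options offered above map onto a small post-processing step. A minimal sketch, where the policy names `"blank"`, `"na"`, and `"skip"` are hypothetical labels for options A, B, and C:

```python
from typing import Dict, List, Optional

Row = Dict[str, Optional[str]]


def apply_missing_policy(rows: List[Row], fields: List[str],
                         policy: str) -> List[Dict[str, str]]:
    """Normalize rows so every row has every field.

    policy: "blank" leaves gaps empty, "na" fills them with 'N/A',
    "skip" drops any row with at least one missing field.
    """
    if policy not in ("blank", "na", "skip"):
        raise ValueError(f"unknown policy: {policy!r}")
    fill = "N/A" if policy == "na" else ""
    cleaned = []
    for row in rows:
        missing = [f for f in fields if row.get(f) in (None, "")]
        if missing and policy == "skip":
            continue
        cleaned.append({f: (fill if f in missing else str(row[f]))
                        for f in fields})
    return cleaned
```

Normalizing every row to the same field list also keeps the CSV header consistent, which `csv.DictWriter` requires.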
Natural Language Commands
/scrape-to-csv <url> <description>
General scraping with CSV export.
/scrape-to-csv https://news.ycombinator.com
"Get top 30 stories with title, URL, points, and comment counts"
/scrape-to-csv https://www.anthropic.com/news
"Latest blog posts: title, date, category, URL"
/scrape-to-csv https://example.com/realestate
"Listings: address, price, beds, baths, sqft, listing agent"
/scrape-table <url> <selector_or_description>
Extract a specific table.
/scrape-table https://example.com/pricing
"The comparison table with Basic/Pro/Enterprise columns"
/scrape-table https://example.com/sales
"Q4 2024 revenue breakdown table"
/scrape-news <url>
Optimized for news/blog scraping.
/scrape-news https://techcrunch.com
"Latest 20 articles: title, author, date, excerpt, URL"
/scrape-news https://blog.openai.com
"All posts from 2024: title, date, tags, URL"
/scrape-products <url>
Optimized for e-commerce.
/scrape-products https://amazon.com/s?k=laptops
"Laptops: name, brand, price, rating, prime eligible, URL"
/scrape-products https://shopify-store.com/collections/all
"All products: name, price, compare-at price, availability"
Output Format
The agent produces a structured markdown report:
## Scraping Report — example.com/products
**Session:** a1b2c3d4 | **Duration:** 2m 14s | **Rows:** 50
### Task
Extract all products with name, price, rating, and number of reviews
### Agent Decisions
- **Pagination**: Detected infinite scroll, scrolled 5 times
- **Price field**: Chose "Sale Price" per user request
- **Missing data**: Marked out-of-stock products in a separate `stock_status` column per user request
- **Columns**: name, sale_price, rating, review_count, stock_status
### Sample Data
| name | sale_price | rating | review_count | stock_status |
|------|-----------|--------|-------------|-------------|
| Widget Pro | $999 | 4.5 | 128 | In Stock |
| Gadget Mini | $499 | 4.2 | 89 | In Stock |
| Super Gizmo | $1,299 | 4.8 | 256 | Out of Stock |
### File
`./products.csv` — 50 rows, 5 columns
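A sample table like the one in this report can be rendered from the extracted rows with a few lines of Python. This is only a sketch of the idea; `markdown_preview` is a hypothetical helper, not part of the bridge.

```python
from typing import Dict, List


def markdown_preview(rows: List[Dict[str, object]], limit: int = 3) -> str:
    """Render the first `limit` rows as a markdown pipe table."""
    if not rows:
        return "(no rows)"
    headers = list(rows[0].keys())

    def cell(value: object) -> str:
        # A literal pipe would break the table, so escape it.
        return str(value).replace("|", "\\|")

    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for row in rows[:limit]:
        lines.append("| " + " | ".join(cell(row.get(h, "")) for h in headers) + " |")
    return "\n".join(lines)


print(markdown_preview([
    {"name": "Widget Pro", "sale_price": "$999", "rating": 4.5},
    {"name": "Gadget Mini", "sale_price": "$499", "rating": 4.2},
]))
```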
CSV Conversion Options
Option A — Python (recommended)
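import csv
import json
import re

# Bridge returns: "✅ Executed JavaScript. Result: [{...}, {...}]"
msg = """PASTE_BRIDGE_RESPONSE_HERE"""

# DOTALL lets the match span pretty-printed (multi-line) JSON
match = re.search(r'Result: (\[.*\])', msg, re.DOTALL)
data = json.loads(match.group(1)) if match else []
if data:
    # utf-8 keeps non-ASCII values such as currency symbols intact
    with open('output.csv', 'w', newline='', encoding='utf-8') as f:
        writer = csv.DictWriter(f, fieldnames=list(data[0].keys()))
        writer.writeheader()
        writer.writerows(data)
    print(f"Wrote {len(data)} rows to output.csv")
else:
    print("No rows found in the bridge response")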
Option B — Node.js
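const fs = require('fs');

// data.json holds the array of row objects returned by the bridge
const data = JSON.parse(fs.readFileSync('data.json', 'utf8'));
const headers = Object.keys(data[0]);

// Quote every cell and double embedded quotes.
// String(value ?? '') also handles numbers, 0, null, and undefined.
const quote = value => `"${String(value ?? '').replace(/"/g, '""')}"`;

const csv = [
  headers.map(quote).join(','),
  ...data.map(row => headers.map(h => quote(row[h])).join(','))
].join('\n');
fs.writeFileSync('output.csv', csv);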
Troubleshooting
Bridge won't start
Error: Cannot find module 'playwright'
Fix: npm install -D playwright && npx playwright install chromium
Site blocks scraping
Agent detects: "The site returned 403 Forbidden. This may be bot protection."
Options:
- Try `headless: false` (looks more like a real user)
- Add delays between requests
- Use a different user agent
Page loads but no data found
Agent detects: "The page loaded but I see mostly navigation elements. Content may be behind a login or loaded dynamically."
Agent asks: "Should I wait longer, scroll down, or do you have login credentials?"
Data looks wrong
Agent detects: "Prices show as 'NaN' or empty. The site may use JavaScript to render prices."
Agent asks: "Should I try executing JavaScript to extract the real values, or skip this field?"
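A quick sanity check on an extracted column can surface this problem before the CSV is written. The sketch below flags rows whose value is empty or looks like an unrendered placeholder; the set of suspicious strings is an assumption and would need tuning per site.

```python
from typing import Dict, List

# Assumed placeholders that usually mean "value was never rendered".
SUSPECT_VALUES = {"", "nan", "null", "none", "undefined"}


def find_suspect_rows(rows: List[Dict[str, object]], field: str) -> List[int]:
    """Return the indexes of rows whose `field` is missing or a placeholder."""
    return [
        i for i, row in enumerate(rows)
        if str(row.get(field, "")).strip().lower() in SUSPECT_VALUES
    ]


rows = [{"price": "$999"}, {"price": "NaN"}, {"price": ""}, {}]
print(find_suspect_rows(rows, "price"))
```

If many rows are flagged, that is a hint the values are filled in by JavaScript after the initial render and the page needs more time or a scripted extraction.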
Comparison with Other Tools
| Tool | Setup | Selectors | Complex Pages | Agent Questions | Best For |
|------|-------|-----------|---------------|-----------------|----------|
| Auto Scraping to CSV | npm install | None needed | Handles automatically | Yes, clarifies ambiguity | One-off extraction, exploratory scraping |
| BeautifulSoup | pip install | Required | Manual handling | No | Known structure, repeated scraping |
| Scrapy | Project setup | Required | Middleware needed | No | Large-scale crawling, pipelines |
| Playwright E2E | npm install | Required | Manual handling | No | Testing, automation |
| Browser-Use | API key | None | Partial | Limited | Multi-page research |
Use this skill when:
- You want to scrape without writing selectors
- The page structure is complex or unknown
- You need the agent to handle edge cases (scroll, popups, pagination)
- You want clarifying questions when data is ambiguous
- You need quick one-off extraction to CSV
Bridge API Reference
POST /sessions
Launch a new browser session.
Body:
{ "url": "https://example.com", "headless": false }
Response: { "id": "abc123", "url": "https://example.com" }
GET /sessions/:id/state
Get text-based DOM state.
Response: { url, title, header, content, footer }
POST /sessions/:id/act
Execute an action.
Body:
{ "action": "executeJavascript", "params": { "script": "return document.title;" } }
DELETE /sessions/:id
Close session.
POST /shutdown
Stop bridge.
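The endpoints above can be driven from Python with the standard library alone. The sketch below builds one request per endpoint and keeps sending separate, so the requests can be inspected without a running bridge. All function names are hypothetical, and `send` returns the raw response text because the exact response format of `/act` is not specified in this reference.

```python
import json
import urllib.request
from typing import Optional

BASE_URL = "http://localhost:9876"  # default bridge port


def build_request(method: str, path: str,
                  body: Optional[dict] = None) -> urllib.request.Request:
    """Create a request for the bridge; JSON-encode the body if present."""
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(BASE_URL + path, data=data,
                                  headers=headers, method=method)


def create_session(url: str, headless: bool = False) -> urllib.request.Request:
    return build_request("POST", "/sessions", {"url": url, "headless": headless})


def get_state(session_id: str) -> urllib.request.Request:
    return build_request("GET", f"/sessions/{session_id}/state")


def run_script(session_id: str, script: str) -> urllib.request.Request:
    return build_request("POST", f"/sessions/{session_id}/act",
                         {"action": "executeJavascript",
                          "params": {"script": script}})


def close_session(session_id: str) -> urllib.request.Request:
    return build_request("DELETE", f"/sessions/{session_id}")


def send(request: urllib.request.Request, timeout: float = 30.0) -> str:
    """Send a prepared request and return the response body as text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return resp.read().decode("utf-8")
```

A typical sequence would be `send(create_session(...))`, read the `id` from the JSON response, then alternate `get_state` and `run_script` calls before `close_session`.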
Skill: auto-scraping-to-csv v1.0.0 | Bridge: page-agent-bridge.mjs | Powered by Alibaba Page-Agent + Playwright