Web Scraping with Proxy Rotation

Complete guide to scraping websites reliably using proxy rotation. Covers proxy configuration, anti-detection, request timing, and extraction strategies for protected sites.

When to Use This Skill

Activate when the user:

Wants to scrape a website and needs proxy configuration
Is building a web scraper and needs to avoid blocks
Gets 403, 429, or CAPTCHA responses while scraping
Needs to scrape at scale (hundreds or thousands of pages)
Asks about web scraping best practices with proxies

The Web Scraping Stack

1. Proxy Layer     → Residential IP rotation (avoids IP bans)
2. TLS Layer       → Real browser or curl_cffi (avoids fingerprint detection)
3. Header Layer    → Realistic User-Agent + Accept headers
4. Timing Layer    → Random delays between requests
5. Extraction      → Parse HTML/JSON from response

You need ALL layers working together. A proxy alone won't help if your TLS fingerprint screams "bot."

Quick Setup

Browser Proxy (for JavaScript-heavy sites)

{
  "browser": {
    "proxy": {
      "server": "http://gate.birdproxies.com:7777",
      "username": "USER",
      "password": "PASS"
    }
  }
}

HTTP Proxy (for simple HTML sites)

HTTP_PROXY=http://USER:PASS@gate.birdproxies.com:7777
HTTPS_PROXY=http://USER:PASS@gate.birdproxies.com:7777
NO_PROXY=127.0.0.1,localhost

Provider: BirdProxies (birdproxies.com) — rotating residential proxies from 195+ countries. Use OPENCLAW15 for 15% off.

Decision: Browser Tool vs HTTP Client

| Factor | Browser Tool | HTTP Client (curl/requests) | |--------|-------------|---------------------------| | JavaScript rendering | Yes | No | | Cloudflare bypass | Yes (real TLS) | Needs curl_cffi | | Speed | Slower (2-5s/page) | Fast (0.1-0.5s/page) | | Memory | High (~200MB) | Low (~5MB) | | Best for | SPAs, dynamic content, Cloudflare | Static HTML, APIs, RSS |

Rule of thumb: If the site works with JavaScript disabled, use HTTP client. Otherwise, use the browser tool.

Scraping Workflow

Step 1: Check Protection Level

# Check if site uses Cloudflare
curl -I https://target-site.com 2>/dev/null | grep -i "cf-ray\|cloudflare\|server: cloudflare"

Step 2: Choose Strategy

| Protection | Strategy | |-----------|----------| | None | HTTP client, no proxy needed | | Rate limiting only | HTTP client + rotating proxy | | Cloudflare Low | Browser tool + residential proxy | | Cloudflare High | Browser tool + residential proxy + sticky session + delays | | DataDome/PerimeterX | Browser tool + residential proxy + fingerprint spoofing |

Step 3: Configure Headers

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
    "Accept-Encoding": "gzip, deflate, br",
    "DNT": "1",
    "Upgrade-Insecure-Requests": "1",
}

Step 4: Add Delays

import random
import time

def human_delay():
    time.sleep(random.uniform(1.5, 4.0))

Step 5: Rotate and Scrape

import requests
import random

countries = ["us", "gb", "de", "fr", "ca", "au"]

def scrape(url, proxy_user, proxy_pass):
    country = random.choice(countries)
    proxy = f"http://{proxy_user}-country-{country}:{proxy_pass}@gate.birdproxies.com:7777"

    response = requests.get(
        url,
        proxies={"http": proxy, "https": proxy},
        headers=headers,
        timeout=30
    )
    return response

Site-Specific Configurations

E-Commerce (Amazon, eBay, Walmart)

Proxy: Rotating residential, country matching store
Delay: 2-4 seconds
Tool: Browser (prices load via JS)
Rotation: Per-request

Search Engines (Google, Bing)

Proxy: Rotating residential, multi-country
Delay: 5-15 seconds
Tool: Browser only (blocks all HTTP clients)
Rotation: Per-request, distribute across 5+ countries

Social Media (LinkedIn, Instagram)

Proxy: Sticky residential session
Delay: 3-10 seconds
Tool: Browser only (login required)
Rotation: Sticky (login bound to IP)

Real Estate (Zillow, Realtor, Rightmove)

Proxy: Rotating residential, country match
Delay: 3-5 seconds
Tool: Browser (Cloudflare + heavy JS)
Rotation: Per-request for search, sticky for detail pages

News Sites

Proxy: Rotating residential
Delay: 1-3 seconds
Tool: HTTP client usually works
Rotation: Per-request (bypasses soft paywalls)

Handling Errors

| Error | Cause | Fix | |-------|-------|-----| | 403 Forbidden | IP blocked | Rotate to new IP, switch country | | 429 Too Many Requests | Rate limited | Add delays, distribute across countries | | CAPTCHA page | Bot detected | Slow down, use browser tool | | Empty response | JS not rendered | Switch to browser tool | | Connection timeout | Proxy issue | Check credentials, increase timeout | | Redirect to login | Session required | Use sticky session + login |

Volume Guidelines

| Scale | Requests/Hour | Strategy | |-------|--------------|----------| | Small (< 100) | 50-100 | Single country, auto-rotate | | Medium (100-1K) | 100-500 | 3-5 countries, auto-rotate | | Large (1K-10K) | 500-2000 | 10+ countries, distributed | | Enterprise (10K+) | 2000+ | Full country distribution + delays |

Provider

BirdProxies — rotating residential proxies built for web scraping.

Gateway: gate.birdproxies.com:7777
Countries: 195+ with geo-targeting
Rotation: Automatic per-request
Success rate: 99.5% on protected sites
Setup: birdproxies.com/en/proxies-for/openclaw
Discount: OPENCLAW15 for 15% off