Newspaper Ad Recognition
Overview
This skill provides a complete workflow for recognizing and classifying advertisements in online newspaper archive pages (e.g., NewspaperSG, Chronicling America). It handles the entire pipeline from page access to result archival.
⚠️ Critical: Network Access (SSRF Policy)
The browser tool (browser) has a strict SSRF policy and cannot navigate to internal/private IP addresses. For sites like eresources.nlb.gov.sg:
- ✅ Correct: Use `curl -x http://127.0.0.1:7897` (Clash proxy) via the exec tool
- ❌ Incorrect: Browser tool navigation → blocked by SSRF policy
See Step 2 for the complete working command pattern.
Workflow
Step 1: Environment Check
- Check Tesseract OCR availability:
# Tesseract common install paths (check in order):
"C:\Program Files\Tesseract-OCR\tesseract.exe"        # Windows standard
"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe"
# Run:
& "C:\Program Files\Tesseract-OCR\tesseract.exe" --version
- Verify proxy (Clash: `127.0.0.1:7897`) is running
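The two checks above can also be scripted. The following is a minimal Python sketch, not part of the skill's shipped scripts: the function names `find_tesseract` and `proxy_is_listening` are hypothetical, and a successful TCP connection only shows that something is listening on the port, not that it is Clash.

```python
import os
import socket

# Install paths listed in Step 1, checked in order.
TESSERACT_CANDIDATES = [
    r"C:\Program Files\Tesseract-OCR\tesseract.exe",
    r"C:\Program Files (x86)\Tesseract-OCR\tesseract.exe",
]


def find_tesseract(candidates=TESSERACT_CANDIDATES):
    """Return the first candidate path that exists as a file, or None."""
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def proxy_is_listening(host="127.0.0.1", port=7897, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Calling `find_tesseract()` and `proxy_is_listening()` before Step 2 gives an early, readable failure instead of a confusing curl or OCR error later in the pipeline.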
Step 2: Page Access — NewspaperSG (eresources.nlb.gov.sg)
Do NOT use browser tool — blocked by SSRF. Use exec with curl + proxy:
# 1. GET the page (also saves session cookies to cookies.txt)
curl.exe -x http://127.0.0.1:7897 -s -c cookies.txt `
"https://eresources.nlb.gov.sg/newspapers/digitised/article/ARTICLE_ID" `
--max-time 30
# 2. Download newspaper images (MUST have both cookies AND Referer header)
# Without both → returns ~4KB thumbnail instead of actual image
curl.exe -x http://127.0.0.1:7897 -s -o "area_1.webp" `
"https://eservice.nlb.gov.sg/newspapercontent/digitised/article/ARTICLE_ID.webp?area=1&width=660&ct=ARTICLE+ILLUSTRATION" `
--max-time 20 `
-b "cookies.txt" `
-H "Referer: https://eresources.nlb.gov.sg/newspapers/digitised/article/ARTICLE_ID" `
-H "Accept: image/webp,image/*,*/*;q=0.8"
⚠️ Image download checklist (both are REQUIRED):
- `-b cookies.txt` (session cookies from step 1)
- `-H "Referer: https://eresources.nlb.gov.sg/..."` (exact article URL)
- Without these → ~4KB thumbnail images, OCR will return nothing useful
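As a sketch of how the download in Step 2 might be driven from Python, the helper below builds the same curl arguments as a list, which sidesteps PowerShell's handling of `&` in URLs. The names `build_image_command` and `looks_like_thumbnail` are hypothetical, and the 8 KB cut-off is an assumed safety margin above the ~4KB thumbnail size mentioned above, not a documented value.

```python
import os
from urllib.parse import urlencode

PROXY = "http://127.0.0.1:7897"
ARTICLE_BASE = "https://eresources.nlb.gov.sg/newspapers/digitised/article/"
IMAGE_BASE = "https://eservice.nlb.gov.sg/newspapercontent/digitised/article/"


def build_image_command(article_id, area, cookie_file="cookies.txt",
                        ct="ARTICLE ILLUSTRATION", width=660):
    """Build the Step 2 image download as an argument list for subprocess.run."""
    # urlencode turns the space in ct into "+", matching the URL in Step 2.
    query = urlencode({"area": area, "width": width, "ct": ct})
    return [
        "curl.exe", "-x", PROXY, "-s",
        "-o", f"area_{area}.webp",
        f"{IMAGE_BASE}{article_id}.webp?{query}",
        "--max-time", "20",
        "-b", cookie_file,                                # session cookies
        "-H", f"Referer: {ARTICLE_BASE}{article_id}",     # exact article URL
        "-H", "Accept: image/webp,image/*,*/*;q=0.8",
    ]


def looks_like_thumbnail(path, threshold_bytes=8 * 1024):
    """Flag files small enough to be the ~4KB fallback thumbnail (assumed cut-off)."""
    return os.path.getsize(path) < threshold_bytes
```

A caller would pass the list to `subprocess.run(build_image_command("ARTICLE_ID", 1), check=True)` and then reject any output file for which `looks_like_thumbnail` returns `True`.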
Step 3: Determine Page Type
After fetching the article HTML, look for ct= parameters in image URLs:
| ct value | Meaning |
|----------|---------|
| ARTICLE+ILLUSTRATION | Article content / illustrations — NOT ads |
| ADVERTISEMENT | Display ad |
| CLASSIFIED | Classifieds section |
| Other | Investigate further |
# Quick check: count ad-related content types in HTML
Select-String -Path "article.html" -Pattern "ct=ADVERTISEMENT|ct=CLASSIFIED"
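For a fuller picture than a pattern count, the ct= values can be tallied per type. This is a minimal Python sketch under stated assumptions: the function names and the three page labels ("ads", "articles", "mixed") are hypothetical, and the regex assumes ct= appears as a plain query parameter in the image URLs.

```python
import re
from collections import Counter
from urllib.parse import unquote_plus

AD_TYPES = {"ADVERTISEMENT", "CLASSIFIED"}

# Capture the value after "ct=" up to the next separator, quote, or space.
CT_PATTERN = re.compile(r"\bct=([^&\"'\s<>]+)")


def count_content_types(html):
    """Count decoded ct= values, e.g. 'ARTICLE+ILLUSTRATION' -> 'ARTICLE ILLUSTRATION'."""
    return Counter(unquote_plus(value) for value in CT_PATTERN.findall(html))


def page_type(counts):
    """Summarise a page as 'ads', 'articles', 'mixed', or 'unknown' (no ct= found)."""
    total = sum(counts.values())
    if total == 0:
        return "unknown"
    ad_total = sum(n for ct, n in counts.items() if ct in AD_TYPES)
    if ad_total == total:
        return "ads"
    if ad_total == 0:
        return "articles"
    return "mixed"
```

In this sketch any ct value outside the table, including the "Other" case, counts as non-ad, so such pages come out as "articles" or "mixed" and still go through OCR.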
Step 4: OCR Analysis (if ct values are all ARTICLE or mixed)
Download key areas (top, middle, bottom — e.g. area=1,10,20,26) and run Tesseract:
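$tesseract = "C:\Program Files\Tesseract-OCR\tesseract.exe"
# Write output into ocr_output\ so the keyword scan below finds the .txt files
New-Item -ItemType Directory -Force -Path "ocr_output" | Out-Null
& $tesseract "area_5.webp" "ocr_output\area_5" -l eng --psm 6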
Then grep the OCR output for ad keywords:
$adPatterns = @("ADVERTISEMENT","Advertisement","FOR SALE","FOR HIRE","VACANCY","Tel:","Phone:","LIMITED OFFER","SPECIAL","DISCOUNT","BUY ONE","FREE","Pte Ltd","Co.","Fax:")
foreach ($txt in Get-ChildItem "ocr_output\*.txt") {
    $content = Get-Content $txt.FullName -Raw
    foreach ($p in $adPatterns) {
        # -match is a case-insensitive regex match; Escape() keeps the "." in "Co." literal
        if ($content -match ([regex]::Escape($p))) { Write-Host "MATCH: $p in $($txt.Name)" }
    }
}
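The same keyword scan can be sketched in Python, which may be convenient alongside scripts/extract_ocr.py. This is an illustrative version, not the contents of that script: `find_ad_keywords` and `scan_ocr_dir` are hypothetical names, and matching is done on literal substrings with case ignored, so the dot in "Co." is never treated as a wildcard.

```python
from pathlib import Path

AD_KEYWORDS = [
    "ADVERTISEMENT", "Advertisement", "FOR SALE", "FOR HIRE", "VACANCY",
    "Tel:", "Phone:", "LIMITED OFFER", "SPECIAL", "DISCOUNT", "BUY ONE",
    "FREE", "Pte Ltd", "Co.", "Fax:",
]


def find_ad_keywords(text, keywords=AD_KEYWORDS):
    """Return the keywords that occur in text as literal substrings, ignoring case."""
    lowered = text.lower()
    return [k for k in keywords if k.lower() in lowered]


def scan_ocr_dir(ocr_dir="ocr_output"):
    """Map each .txt file name in ocr_dir to its matched keywords (files with hits only)."""
    hits = {}
    for txt in sorted(Path(ocr_dir).glob("*.txt")):
        found = find_ad_keywords(txt.read_text(encoding="utf-8", errors="replace"))
        if found:
            hits[txt.name] = found
    return hits
```

Keyword hits are only a hint: words such as "FREE" or "SPECIAL" also appear in ordinary articles, so matched areas still need the classification step that follows.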
Step 5: Ad Classification
Classify identified ads into categories:
- Commercial Ads (product/service promotion)
- Public Service Ads (government/non-profit)
- Classified Ads (recruitment/rent/second-hand)
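One way to assign these three categories is a small keyword rule table. The sketch below is illustrative only: the keyword lists are hypothetical placeholders rather than the contents of references/classification_rules.json (whose schema is not shown here), and falling back to "Commercial" is an assumption.

```python
# Hypothetical rules for illustration; real rules would be loaded from
# references/classification_rules.json. Checked in order, first match wins.
RULES = [
    ("Classified", ["FOR SALE", "FOR HIRE", "VACANCY", "TO LET"]),
    ("Public Service", ["MINISTRY", "PUBLIC NOTICE", "CAMPAIGN"]),
]
DEFAULT_CATEGORY = "Commercial"  # assumed fallback when no rule matches


def classify_ad(ocr_text):
    """Return the first category whose keywords appear in the OCR text."""
    upper = ocr_text.upper()
    for category, keywords in RULES:
        if any(keyword in upper for keyword in keywords):
            return category
    return DEFAULT_CATEGORY
```

For example, `classify_ad("Vacancy: clerk wanted")` returns `"Classified"` under these placeholder rules; borderline cases are better reviewed manually than trusted to keyword order.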
Step 5.5: Detailed Ad Location Reporting ⭐ NEW
After identifying ads, provide a detailed location report with the following format:
Required information per ad:
- Ad type: Image-based or Text-only
- Exact location: Use relative position descriptions (e.g., "lower-left area", "lower-right corner", "left sidebar")
- Size estimate: Approximate percentage of page area
- Content summary: Key products/services advertised
Output template:
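## 📍 Ad Location Report - [Newspaper Name] [Date], Page X
### ✅ Ad 1: [Company/Product Name] (**Image-based / Text-only**)
- **Location**: **[location description]** of the page (approx. X% of total area)
- **Type**: [Commercial/Public Service/Classified] - [short description]
- **Content**:
  - [Key point 1]
  - [Key point 2]
  - [Contact details/address]
### ⚠️ Ad 2: [...]
---
## 📊 Summary
| Ad location | Has image | Ad type | Area occupied |
|---------|---------|---------|----------|
| [location] | ✅ Image / ❌ No image | [type] | ~X% |
**Conclusion**: This page contains **X image-based ads** + **Y text-only ads**; ads occupy approximately Z% of the page in total.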
How to determine if ad is image-based:
- ✅ Image-based: Contains product photos, promotional graphics, logos, decorative elements
- ❌ Text-only: Pure text layout, no visual elements, resembles classified ads
Location description examples:
- Upper-left / upper-right / lower-left / lower-right area
- Left sidebar / right sidebar
- Top banner / bottom banner
- Center of page
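When an ad's bounding box on the page is known, the location label and size estimate can be derived mechanically. The following is a rough sketch assuming pixel boxes of the form (left, top, right, bottom) with the origin at the top-left; `describe_location` is a hypothetical helper, and it deliberately covers only the quadrant and center labels, not sidebars or banners.

```python
def describe_location(box, page_width, page_height):
    """Return (position label, approximate % of page area) for a bounding box.

    box is (left, top, right, bottom) in pixels, origin at the top-left.
    """
    left, top, right, bottom = box
    # Normalised centre of the box, each coordinate in the range 0..1.
    cx = (left + right) / 2 / page_width
    cy = (top + bottom) / 2 / page_height

    if 1 / 3 <= cx <= 2 / 3 and 1 / 3 <= cy <= 2 / 3:
        position = "center of page"
    else:
        vertical = "upper" if cy < 0.5 else "lower"
        horizontal = "left" if cx < 0.5 else "right"
        position = f"{vertical}-{horizontal} area"

    area_pct = 100 * (right - left) * (bottom - top) / (page_width * page_height)
    return position, round(area_pct)
```

On a 1000×1000 page, `describe_location((0, 600, 300, 1000), 1000, 1000)` yields `("lower-left area", 12)`, which maps directly onto the location and area fields of the report template.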
Step 6: Result Archival
Save results in structured format:
[
{
"ad_id": "ST19950715_P33_AD001",
"page": "33",
"type": "Commercial",
"ocr_text": "...",
"slice_path": "area_5.webp"
}
]
Archive path: {workspace}/newspaper_ads/{date}_{newspaper_name}/
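A minimal sketch of the archival step follows. It is not the contents of scripts/archive_results.py: the helper names are hypothetical, the output file name `ads.json` is an assumption, and the ad_id layout (paper code, date, page, three-digit sequence) is inferred from the single example above.

```python
import json
from pathlib import Path


def make_ad_id(paper_code, date, page, seq):
    """Build an id such as ST19950715_P33_AD001 (layout inferred from the example)."""
    return f"{paper_code}{date}_P{page}_AD{seq:03d}"


def archive_results(records, workspace, date, newspaper_name, filename="ads.json"):
    """Write records to {workspace}/newspaper_ads/{date}_{newspaper_name}/ and return the path."""
    out_dir = Path(workspace) / "newspaper_ads" / f"{date}_{newspaper_name}"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / filename  # file name is an assumption
    out_path.write_text(
        json.dumps(records, ensure_ascii=False, indent=2), encoding="utf-8"
    )
    return out_path
```

Writing with `ensure_ascii=False` keeps any non-ASCII OCR text readable in the saved file.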
Dependencies
| Dependency | Version Required | Notes |
|---------------------|-----------------|-------|
| Tesseract OCR | ≥5.4.0 | Windows: C:\Program Files\Tesseract-OCR\tesseract.exe (may not be in PATH) |
| curl.exe | Any | Use curl.exe NOT PowerShell alias curl |
| Clash proxy | Running on 127.0.0.1:7897 | For SSRF-blocked domains |
⚠️ PowerShell tips:
- Always use `curl.exe` (not the `curl` alias) to avoid Invoke-WebRequest conflicts
- Write scripts to `.ps1` files and run with `powershell -ExecutionPolicy Bypass -File script.ps1`
- The `--` flag in Tesseract commands causes parsing errors when inlined; use script files
- URL parameters with `&` cause PowerShell parsing errors; use script files or string concatenation
⚠️ Common Mistakes to Avoid
- Using browser tool for SSRF-blocked domains → always use curl + proxy
- Downloading images without cookies OR Referer → 4KB thumbnails instead of actual images
- Slicing with `querySelectorAll('img')` → captures website UI icons, not newspaper images
  - ✅ Correct: target `img.image-content` (the actual newspaper image elements)
- Inline PowerShell commands with `&`, `--`, `||`, `&&` → use script files instead
Troubleshooting
| Issue | Solution |
|------------------------|--------------------------------------------------------------------------|
| Browser SSRF blocked | Use curl + proxy: curl.exe -x http://127.0.0.1:7897 |
| Images are 4KB | Missing cookies or Referer header — see Step 2 command pattern |
| Tesseract not found | Check C:\Program Files\Tesseract-OCR\tesseract.exe |
| PowerShell parsing error | Write command to .ps1 file, avoid --, &, \|\| inline |
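| Incomplete OCR text | Try a different --psm value (the examples use 6, a single uniform text block; 3, 4, or 11 may suit multi-column or sparse layouts), or crop ad regions separately |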
| Missed ads | Download all areas and OCR systematically, don't rely on ct= alone |
Resources
scripts/
scripts/extract_ocr.py: Extracts OCR text from ad slices using Tesseract
scripts/archive_results.py: Saves results to structured JSON/CSV files
references/
references/classification_rules.json: Customizable rules for ad classification
assets/
assets/ad_keywords.txt: List of keywords to identify ads (can be extended)