Category: Other
No API key required

PDF Reader

Extract text from PDF files with automatic OCR fallback for scanned/image-based PDFs. Use when: (1) a user sends a PDF file and the framework did not auto-in...

Author: panpeter2024 (clawhub)

Extract text from any PDF — text-layer or scanned image.

How It Works

PDF received
  ├─ Has text layer? ──→ pdftotext (fast, high quality)
  │     └─ Text too sparse? ──→ Fall back to OCR
  └─ Detected as scan? ──→ Skip text, go straight to OCR
                               pdftoppm → tesseract
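
The flow above can be sketched in POSIX shell as follows. This is a minimal illustration, not the bundled script: the function names and the MIN_CHARS sparsity threshold are assumptions, and only the pdftotext, pdftoppm, and tesseract invocations are taken from those tools' standard command-line usage.

```shell
#!/bin/sh
# Sketch of the text-first, OCR-fallback flow.
# MIN_CHARS is an assumed threshold, not a value from the bundled script.
MIN_CHARS=${MIN_CHARS:-100}

# Count non-whitespace characters read from stdin.
count_visible_chars() {
  tr -d '[:space:]' | wc -c | tr -d ' '
}

# Succeed (exit 0) when the given text is too sparse to trust.
text_too_sparse() {
  n=$(printf '%s' "$1" | count_visible_chars)
  [ "$n" -lt "$MIN_CHARS" ]
}

# OCR every page: render to PNG with pdftoppm, then recognize with tesseract.
ocr_pdf() {
  pdf=$1; lang=${2:-chi_sim+eng}; dpi=${3:-300}
  tmp=$(mktemp -d) || return 1
  pdftoppm -r "$dpi" -png "$pdf" "$tmp/page" || { rm -rf "$tmp"; return 1; }
  for img in "$tmp"/page*.png; do
    tesseract "$img" stdout -l "$lang"
  done
  rm -rf "$tmp"
}

# Try the text layer first; fall back to OCR when the result is sparse.
extract_pdf() {
  text=$(pdftotext -layout "$1" - 2>/dev/null)
  if text_too_sparse "$text"; then
    ocr_pdf "$1"
  else
    printf '%s\n' "$text"
  fi
}
```

Counting non-whitespace characters is only one possible sparsity test; a per-page average would be fairer to long documents with a few blank pages.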

Quick Start

Run the bundled script via exec:

bash <skill-dir>/scripts/pdf-extract.sh /path/to/file.pdf

Save to file:

bash <skill-dir>/scripts/pdf-extract.sh /path/to/file.pdf --output /tmp/result.txt

Then read /tmp/result.txt with the read tool.

When This Skill Triggers

  1. User sends a PDF in chat but no <file> text content was injected (only file path visible)
  2. Injected content is empty, garbled, or truncated
  3. User explicitly asks to read/extract/OCR a PDF file
  4. A PDF on disk needs text extraction for downstream processing

Typical Workflow

  1. Identify the PDF file path (usually /root/.openclaw/media/inbound/...)
  2. Run the extraction script
  3. Read the output and respond to the user

Example:

# Extract and save
bash <skill-dir>/scripts/pdf-extract.sh "/root/.openclaw/media/inbound/document.pdf" -o /tmp/pdf-text.txt

# Then use read tool on /tmp/pdf-text.txt

Script Options

| Flag | Description | Default |
|------|-------------|---------|
| --lang | Tesseract languages (validated against allowlist) | chi_sim+eng |
| --dpi | Image resolution for OCR | 300 |
| --output / -o | Save to file instead of stdout | stdout |
| --ocr-only | Force OCR, skip text extraction | off |
| --text-only | Text extraction only, no OCR fallback | off |
| --auto-install | Auto-install missing tools (poppler, tesseract) | off |
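
To show how these flags and defaults fit together, here is a hypothetical argument parser. The variable names (LANGS, DPI, OUTPUT, MODE, AUTO_INSTALL, PDF) and the error messages are illustrative assumptions; the bundled script may be organized quite differently.

```shell
# Hypothetical parser for the documented flags; all names are illustrative.
parse_args() {
  LANGS="chi_sim+eng"; DPI=300; OUTPUT=""; MODE="auto"; AUTO_INSTALL=0; PDF=""
  while [ $# -gt 0 ]; do
    case $1 in
      --lang|--dpi|--output|-o)
        # These flags take a value, so a second argument must be present.
        [ $# -ge 2 ] || { echo "missing value for $1" >&2; return 2; }
        case $1 in
          --lang) LANGS=$2 ;;
          --dpi)  DPI=$2 ;;
          *)      OUTPUT=$2 ;;
        esac
        shift ;;
      --ocr-only)     MODE="ocr" ;;
      --text-only)    MODE="text" ;;
      --auto-install) AUTO_INSTALL=1 ;;
      -*)             echo "unknown option: $1" >&2; return 2 ;;
      *)              PDF=$1 ;;
    esac
    shift
  done
  [ -n "$PDF" ] || { echo "usage: pdf-extract.sh FILE [options]" >&2; return 2; }
}
```

For instance, `parse_args scan.pdf --ocr-only --dpi 400` would leave MODE set to ocr and DPI set to 400, with every other variable at its default.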

Dependencies

By default, the script does not install packages automatically. If tools are missing, it prints install instructions and exits.

To enable auto-install, pass --auto-install:

bash <skill-dir>/scripts/pdf-extract.sh file.pdf --auto-install

This installs poppler-utils and tesseract-ocr via apt-get, yum, or brew as needed.

Pre-install recommended (run once on the server):

apt-get install -y poppler-utils tesseract-ocr tesseract-ocr-chi-sim
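
The default check-and-exit behavior could look roughly like the sketch below. The function names and messages are assumptions made for illustration; only the tool names (pdftotext, pdftoppm, tesseract) and the apt-get package names come from this document.

```shell
# Print each required tool that is not on PATH, one per line.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Hypothetical gate: report what is missing and how to install it, then fail.
check_dependencies() {
  missing=$(missing_tools pdftotext pdftoppm tesseract)
  [ -z "$missing" ] && return 0
  # $missing is left unquoted on purpose so the names print on one line.
  echo "Missing tools:" $missing >&2
  echo "Install with: apt-get install -y poppler-utils tesseract-ocr" >&2
  return 1
}
```

A caller would run `check_dependencies || exit 1` before doing any extraction work.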

Language Support

Default: Chinese Simplified + English (chi_sim+eng).

The --lang parameter is validated against a strict allowlist of official tesseract language codes. Invalid or malformed values are rejected.
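
One way such an allowlist check might work is sketched below. The ALLOWED_LANGS list is a small illustrative subset rather than the script's actual allowlist, and validate_lang is a hypothetical name.

```shell
# Illustrative subset of tesseract language codes; a real allowlist is longer.
ALLOWED_LANGS="chi_sim chi_tra eng jpn kor fra deu spa"

# Succeed only when every '+'-separated code in the spec is on the allowlist.
validate_lang() {
  spec=$1
  # Reject empty specs, stray '+' signs, and any character outside [A-Za-z_+].
  case $spec in
    ''|+*|*+|*++*|*[!A-Za-z_+]*) return 1 ;;
  esac
  # Split on '+' with globbing disabled, then restore the shell state.
  old_ifs=$IFS; IFS='+'
  set -f
  set -- $spec
  set +f
  IFS=$old_ifs
  for code in "$@"; do
    case " $ALLOWED_LANGS " in
      *" $code "*) ;;
      *) return 1 ;;
    esac
  done
  return 0
}
```

Rejecting unexpected characters before splitting keeps values such as `eng;rm -rf /` from ever reaching the tesseract command line.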

Other languages:

# Japanese + English
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --lang jpn+eng

# Korean
bash <skill-dir>/scripts/pdf-extract.sh file.pdf --lang kor

When --auto-install is enabled, the Tesseract language packs matching --lang are installed as well.
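
For manual installation on Debian or Ubuntu, language packs follow the tesseract-ocr-<code> naming pattern, with underscores in the code replaced by hyphens (as in tesseract-ocr-chi-sim). The helper below is a hypothetical sketch of that mapping, not a function from the bundled script; other package managers name these packs differently.

```shell
# Map a --lang spec to Debian/Ubuntu package names, one per line.
# Underscores in tesseract codes become hyphens in package names.
lang_packages() {
  printf '%s\n' "$1" | tr '+' '\n' | while read -r code; do
    printf 'tesseract-ocr-%s\n' "$(printf '%s' "$code" | tr '_' '-')"
  done
}
```

With this helper, `apt-get install -y $(lang_packages jpn+eng)` would request tesseract-ocr-jpn and tesseract-ocr-eng.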

Limitations

  • OCR quality depends on scan quality; low-resolution or handwritten PDFs may produce errors
  • Encrypted/password-protected PDFs are not supported
  • Large PDFs (50+ pages) may take 1-2 minutes for OCR
  • Pure-image pages (photos, diagrams without text) produce noise — this is expected
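
Some of these limitations can be detected before extraction starts. The sketch below assumes pdfinfo from poppler-utils is available and relies on its "Encrypted:" and "Pages:" output fields; the function names are hypothetical, and the 50-page threshold simply mirrors the figure quoted above.

```shell
# Read `pdfinfo` output on stdin; succeed when the PDF is reported as encrypted.
pdfinfo_is_encrypted() {
  grep -Eq '^Encrypted:[[:space:]]+yes'
}

# Read `pdfinfo` output on stdin; print the page count.
pdfinfo_page_count() {
  awk '/^Pages:/ { print $2 }'
}

# Hypothetical preflight: refuse encrypted files and warn about long OCR runs.
preflight() {
  info=$(pdfinfo "$1" 2>/dev/null) || { echo "cannot read PDF: $1" >&2; return 1; }
  if printf '%s\n' "$info" | pdfinfo_is_encrypted; then
    echo "encrypted PDFs are not supported" >&2
    return 1
  fi
  pages=$(printf '%s\n' "$info" | pdfinfo_page_count)
  [ "${pages:-0}" -ge 50 ] && echo "note: $pages pages, OCR may take 1-2 minutes" >&2
  return 0
}
```

PDFs that need a password just to open usually make pdfinfo itself fail, which this sketch reports as an unreadable file rather than as encryption.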