返回 Skill 列表
extension
分类: 其它需要 API Key

HTML OCR

对包含嵌入式图片或扫描件的HTML页面进行OCR,使用MinerU从HTML文件或网页的图片中提取文字。功能:额外的OCR...

person作者: mzlzycahubclawhub

HTML OCR

Use OCR to extract text from HTML files that contain scanned images or image-embedded content using MinerU.

Install

npm install -g mineru-open-api
# or via Go (macOS/Linux):
go install github.com/opendatalab/MinerU-Ecosystem/cli/mineru-open-api@latest

Quick Start

# OCR extraction from local HTML file (requires token)
mineru-open-api extract page.html --ocr -o ./out/

# With VLM model for better accuracy
mineru-open-api extract page.html --ocr --model vlm -o ./out/

Authentication

Token required:

mineru-open-api auth             # Interactive token setup
export MINERU_TOKEN="your-token" # Or via environment variable

Create token at: https://mineru.net/apiManage/token

Capabilities

  • Supported input: local .html file
  • OCR requires extract with token — not available in flash-extract
  • Use --ocr flag to enable OCR on image-embedded content in HTML
  • Use --model vlm for complex or mixed-content pages

Notes

  • HTML is NOT supported by flash-extract; use extract with token
  • If the HTML has normal text content, OCR is not needed — use html-extract instead
  • Output goes to stdout by default; use -o <dir> to save to a file or directory
  • All progress/status messages go to stderr; document content goes to stdout
  • MinerU is open-source by OpenDataLab (Shanghai AI Lab): https://github.com/opendatalab/MinerU