Back to skills
extension
Category: OtherAPI key required

银行流水扫描件整理(律师整理银行流水神器!)

This skill provides a complete workflow for OCR-recognizing bank statement PDFs and generating structured Excel files. It handles: PDF-to-image conversion, multi-page OCR via vision API, bank name normalization, and Excel generation with per-bank sheets and page source tracking. Trigger when user asks to: 整理银行流水, OCR银行流水, 提取银行流水数据, 生成银行流水Excel, 处理银行流水PDF, 批量OCR, 银行流水数据提取, 整理流水文件, 银行流水AI整理, PDF流水转Excel

personAuthor: user_d2f40577hubcommunity

Bank Statement OCR Workflow

Complete pipeline for converting bank statement PDFs into structured Excel files with per-bank sheets.

Prerequisites

Install dependencies in managed Python venv:

pip install PyMuPDF openpyxl Pillow requests

Set environment variables for OCR API:

  • OCR_API_URL or OPENAI_BASE_URL: Vision API endpoint
  • OCR_API_KEY or OPENAI_API_KEY: API key
  • OCR_MODEL: Model name (default: gpt-4o)

Workflow

Step 0: Check API Configuration

Before starting any work, check if the user has configured the OCR API credentials.

Check environment variables:

  • OCR_API_KEY or OPENAI_API_KEY
  • OCR_API_URL or OPENAI_BASE_URL
  • OCR_MODEL (optional, defaults to gpt-4o)

If any required variable is missing (especially API_KEY), ask the user to provide the following information using ask_followup_question:

请提供OCR识别所需的API配置信息:
1. API Key(必填):用于调用vision模型的密钥
2. API URL(可选):API端点地址,默认为 https://api.openai.com/v1/chat/completions
3. Model(可选):模型名称,默认为 gpt-4o

After receiving the user's input, set the environment variables for the current session (do NOT write them to any file or persist them):

# Set for current session only
$env:OCR_API_KEY = "user_provided_key"
$env:OCR_API_URL = "user_provided_url"   # if provided
$env:OCR_MODEL = "user_provided_model"   # if provided

Do NOT proceed to Step 1 until API_KEY is confirmed.

Step 1: Prepare Workspace

Create directory structure in the workspace:

batch_v2/          # Will hold PNG images per PDF
batch_v2_results/  # Will hold OCR JSON results

Step 2: Convert PDFs to Images

Use PyMuPDF (fitz) to convert each PDF to PNG pages at 200 DPI.

For each PDF file, create a subdirectory under batch_v2/ (e.g., pdf1/, pdf2/) and save pages as page_001.png, page_002.png, etc.

import fitz, os
doc = fitz.open("input.pdf")
out_dir = "batch_v2/pdf1"
os.makedirs(out_dir, exist_ok=True)
for i in range(len(doc)):
    pix = doc[i].get_pixmap(dpi=200)
    pix.save(os.path.join(out_dir, f"page_{i+1:03d}.png"))
doc.close()

Important: If replacing an existing PDF's images, clear old PNGs first. Always verify actual page count matches PDF.

Step 3: Run OCR (Parallel Batch Worker)

Run scripts/batch_worker.py for each PDF label. The script supports auto-parallel mode — pages are distributed across multiple workers automatically.

Auto-parallel mode (recommended):

# Process all pages of a PDF (auto 3 workers)
python scripts/batch_worker.py <label> <start_page> <end_page>

# Example: process pages 1-59 with 3 workers (default)
python scripts/batch_worker.py pdf1 1 59

# Custom worker count
python scripts/batch_worker.py pdf1 1 59 --workers 5

Arguments: label start_page end_page [--workers N]

  • Pages are interleaved across workers for load balancing
  • Default worker count: 3 (configurable via OCR_WORKERS env var)
  • Progress, ETA, and error count are reported in real-time

Legacy single-worker mode (still supported):

python scripts/batch_worker.py W1 pdf1 1 2 3 4

Key behaviors:

  • Each page result is appended to batch_v2_results/{label}_ocr.json immediately after processing
  • Thread-safe file writes prevent corruption when multiple workers write simultaneously
  • Bank/name/account fields are inherited from the latest previously-processed page (may be from a different worker)
  • Monitor progress — check for type=error pages. Re-process any failed pages after completion

Step 4: Verify OCR Results

After all pages are processed, verify by checking each page's bank name, record count, and error status:

import json
with open("batch_v2_results/pdf1_ocr.json", "r", encoding="utf-8") as f:
    data = json.load(f)
for p in data:
    print(f"page {p['pdf_page']}: bank={p.get('bank','')} type={p.get('type','')} records={len(p.get('records',[]))}")

Critical verification steps:

  1. Check for any type=error pages → re-process
  2. Check bank name consistency across pages
  3. Always ask user to confirm the correct bank-per-page mapping before generating Excel

Common OCR misidentification: CMB (招商银行) pages often get misidentified as CCB (建设银行) or ICBC (工商银行). See references/troubleshooting.md for details.

Step 5: Configure Overrides

Create overrides.json in workspace root to handle:

  • Bank name corrections per page (when OCR misidentifies the bank)
  • Page count limits (when PDF has more pages than actual content)
  • PDF-to-filename mapping
{
  "pdf_config": [
    {"key": "pdf1", "name": "文件名", "owner": "默认户名"}
  ],
  "page_bank_override": {
    "pdf3_10": "招商银行"
  },
  "pdf_max_page": {
    "pdf2": 12
  }
}

owner field explanation: The owner in pdf_config is the default account holder name for the PDF. During Excel generation, the actual page-level account holder (name field from OCR) is used with forward-inheritance:

  • If a page's OCR result has a name value, it becomes the current page's owner
  • If a page has no name (empty), it inherits the owner from the previous page
  • If the first page has no name, the owner from pdf_config is used as fallback
  • A single bank sheet may contain records from multiple owners (e.g., both spouses' accounts); all distinct owners are shown in the sheet title

Key format for page_bank_override: "{pdf_label}_{page_number}": "correct_bank_name".

Important: Always ask the user to confirm the correct bank-per-page mapping and identify the account holder(s) for each PDF before generating Excel.

Step 6: Generate Excel

Run scripts/generate_excel.py:

python scripts/generate_excel.py --workspace /path/to/workspace

Output: One Excel file per PDF (named {原PDF名}_AI整理.xlsx), each containing:

  • Per-bank sheets with all transactions
  • Source page number column ("来源页码")
  • Frozen headers, auto-filter, alternating row colors
  • Summary statistics at bottom of each sheet

Output Format

Each Excel file contains sheets named by bank. Each sheet has:

  • Title row: "{归属人} - {银行名} 流水明细"
  • Source info row: "来源:{PDF名} | 共 N 条记录"
  • Column headers at row 4: 日期, 摘要, 收入, 支出, 余额, 对方户名, etc.
  • All records with 来源页码 as last column
  • Statistics section: total count, income count, expense count, totals, net flow

Scripts

  • scripts/batch_worker.py: OCR worker for processing PNG pages via vision API
  • scripts/generate_excel.py: Excel generator from OCR JSON results

References

  • references/troubleshooting.md: OCR misidentification patterns and solutions