Bank Statement OCR Workflow

Complete pipeline for converting bank statement PDFs into structured Excel files with per-bank sheets.

Prerequisites

Install dependencies in managed Python venv:

pip install PyMuPDF openpyxl Pillow requests

Set environment variables for OCR API:

OCR_API_URL or OPENAI_BASE_URL: Vision API endpoint
OCR_API_KEY or OPENAI_API_KEY: API key
OCR_MODEL: Model name (default: gpt-4o)

Workflow

Step 0: Check API Configuration

Before starting any work, check if the user has configured the OCR API credentials.

Check environment variables:

OCR_API_KEY or OPENAI_API_KEY
OCR_API_URL or OPENAI_BASE_URL
OCR_MODEL (optional, defaults to gpt-4o)

If any required variable is missing (especially API_KEY), ask the user to provide the following information using ask_followup_question:

请提供OCR识别所需的API配置信息：
1. API Key（必填）：用于调用vision模型的密钥
2. API URL（可选）：API端点地址，默认为 https://api.openai.com/v1/chat/completions
3. Model（可选）：模型名称，默认为 gpt-4o

After receiving the user's input, set the environment variables for the current session (do NOT write them to any file or persist them):

# Set for current session only
$env:OCR_API_KEY = "user_provided_key"
$env:OCR_API_URL = "user_provided_url"   # if provided
$env:OCR_MODEL = "user_provided_model"   # if provided

Do NOT proceed to Step 1 until API_KEY is confirmed.

Step 1: Prepare Workspace

Create directory structure in the workspace:

batch_v2/          # Will hold PNG images per PDF
batch_v2_results/  # Will hold OCR JSON results

Step 2: Convert PDFs to Images

Use PyMuPDF (fitz) to convert each PDF to PNG pages at 200 DPI.

For each PDF file, create a subdirectory under batch_v2/ (e.g., pdf1/, pdf2/) and save pages as page_001.png, page_002.png, etc.

import fitz, os
doc = fitz.open("input.pdf")
out_dir = "batch_v2/pdf1"
os.makedirs(out_dir, exist_ok=True)
for i in range(len(doc)):
    pix = doc[i].get_pixmap(dpi=200)
    pix.save(os.path.join(out_dir, f"page_{i+1:03d}.png"))
doc.close()

Important: If replacing an existing PDF's images, clear old PNGs first. Always verify actual page count matches PDF.

Step 3: Run OCR (Parallel Batch Worker)

Run scripts/batch_worker.py for each PDF label. The script supports auto-parallel mode — pages are distributed across multiple workers automatically.

Auto-parallel mode (recommended):

# Process all pages of a PDF (auto 3 workers)
python scripts/batch_worker.py <label> <start_page> <end_page>

# Example: process pages 1-59 with 3 workers (default)
python scripts/batch_worker.py pdf1 1 59

# Custom worker count
python scripts/batch_worker.py pdf1 1 59 --workers 5

Arguments: label start_page end_page [--workers N]

Pages are interleaved across workers for load balancing
Default worker count: 3 (configurable via OCR_WORKERS env var)
Progress, ETA, and error count are reported in real-time

Legacy single-worker mode (still supported):

python scripts/batch_worker.py W1 pdf1 1 2 3 4

Key behaviors:

Each page result is appended to batch_v2_results/{label}_ocr.json immediately after processing
Thread-safe file writes prevent corruption when multiple workers write simultaneously
Bank/name/account fields are inherited from the latest previously-processed page (may be from a different worker)
Monitor progress — check for type=error pages. Re-process any failed pages after completion

Step 4: Verify OCR Results

After all pages are processed, verify by checking each page's bank name, record count, and error status:

import json
with open("batch_v2_results/pdf1_ocr.json", "r", encoding="utf-8") as f:
    data = json.load(f)
for p in data:
    print(f"page {p['pdf_page']}: bank={p.get('bank','')} type={p.get('type','')} records={len(p.get('records',[]))}")

Critical verification steps:

Check for any type=error pages → re-process
Check bank name consistency across pages
Always ask user to confirm the correct bank-per-page mapping before generating Excel

Common OCR misidentification: CMB (招商银行) pages often get misidentified as CCB (建设银行) or ICBC (工商银行). See references/troubleshooting.md for details.

Step 5: Configure Overrides

Create overrides.json in workspace root to handle:

Bank name corrections per page (when OCR misidentifies the bank)
Page count limits (when PDF has more pages than actual content)
PDF-to-filename mapping

{
  "pdf_config": [
    {"key": "pdf1", "name": "文件名", "owner": "默认户名"}
  ],
  "page_bank_override": {
    "pdf3_10": "招商银行"
  },
  "pdf_max_page": {
    "pdf2": 12
  }
}

owner field explanation: The owner in pdf_config is the default account holder name for the PDF. During Excel generation, the actual page-level account holder (name field from OCR) is used with forward-inheritance:

If a page's OCR result has a name value, it becomes the current page's owner
If a page has no name (empty), it inherits the owner from the previous page
If the first page has no name, the owner from pdf_config is used as fallback
A single bank sheet may contain records from multiple owners (e.g., both spouses' accounts); all distinct owners are shown in the sheet title

Key format for page_bank_override: "{pdf_label}_{page_number}": "correct_bank_name".

Important: Always ask the user to confirm the correct bank-per-page mapping and identify the account holder(s) for each PDF before generating Excel.

Step 6: Generate Excel

Run scripts/generate_excel.py:

python scripts/generate_excel.py --workspace /path/to/workspace

Output: One Excel file per PDF (named {原PDF名}_AI整理.xlsx), each containing:

Per-bank sheets with all transactions
Source page number column ("来源页码")
Frozen headers, auto-filter, alternating row colors
Summary statistics at bottom of each sheet

Output Format

Each Excel file contains sheets named by bank. Each sheet has:

Title row: "{归属人} - {银行名} 流水明细"
Source info row: "来源：{PDF名} | 共 N 条记录"
Column headers at row 4: 日期, 摘要, 收入, 支出, 余额, 对方户名, etc.
All records with 来源页码 as last column
Statistics section: total count, income count, expense count, totals, net flow

Scripts

scripts/batch_worker.py: OCR worker for processing PNG pages via vision API
scripts/generate_excel.py: Excel generator from OCR JSON results

References

references/troubleshooting.md: OCR misidentification patterns and solutions