Bank Statement OCR Workflow
Complete pipeline for converting bank statement PDFs into structured Excel files with per-bank sheets.
Prerequisites
Install dependencies in managed Python venv:
pip install PyMuPDF openpyxl Pillow requests
Set environment variables for OCR API:
OCR_API_URLorOPENAI_BASE_URL: Vision API endpointOCR_API_KEYorOPENAI_API_KEY: API keyOCR_MODEL: Model name (default: gpt-4o)
Workflow
Step 0: Check API Configuration
Before starting any work, check if the user has configured the OCR API credentials.
Check environment variables:
OCR_API_KEYorOPENAI_API_KEYOCR_API_URLorOPENAI_BASE_URLOCR_MODEL(optional, defaults togpt-4o)
If any required variable is missing (especially API_KEY), ask the user to provide the following information using ask_followup_question:
请提供OCR识别所需的API配置信息:
1. API Key(必填):用于调用vision模型的密钥
2. API URL(可选):API端点地址,默认为 https://api.openai.com/v1/chat/completions
3. Model(可选):模型名称,默认为 gpt-4o
After receiving the user's input, set the environment variables for the current session (do NOT write them to any file or persist them):
# Set for current session only
$env:OCR_API_KEY = "user_provided_key"
$env:OCR_API_URL = "user_provided_url" # if provided
$env:OCR_MODEL = "user_provided_model" # if provided
Do NOT proceed to Step 1 until API_KEY is confirmed.
Step 1: Prepare Workspace
Create directory structure in the workspace:
batch_v2/ # Will hold PNG images per PDF
batch_v2_results/ # Will hold OCR JSON results
Step 2: Convert PDFs to Images
Use PyMuPDF (fitz) to convert each PDF to PNG pages at 200 DPI.
For each PDF file, create a subdirectory under batch_v2/ (e.g., pdf1/, pdf2/) and save pages as page_001.png, page_002.png, etc.
import fitz, os
doc = fitz.open("input.pdf")
out_dir = "batch_v2/pdf1"
os.makedirs(out_dir, exist_ok=True)
for i in range(len(doc)):
pix = doc[i].get_pixmap(dpi=200)
pix.save(os.path.join(out_dir, f"page_{i+1:03d}.png"))
doc.close()
Important: If replacing an existing PDF's images, clear old PNGs first. Always verify actual page count matches PDF.
Step 3: Run OCR (Parallel Batch Worker)
Run scripts/batch_worker.py for each PDF label. The script supports auto-parallel mode — pages are distributed across multiple workers automatically.
Auto-parallel mode (recommended):
# Process all pages of a PDF (auto 3 workers)
python scripts/batch_worker.py <label> <start_page> <end_page>
# Example: process pages 1-59 with 3 workers (default)
python scripts/batch_worker.py pdf1 1 59
# Custom worker count
python scripts/batch_worker.py pdf1 1 59 --workers 5
Arguments: label start_page end_page [--workers N]
- Pages are interleaved across workers for load balancing
- Default worker count: 3 (configurable via
OCR_WORKERSenv var) - Progress, ETA, and error count are reported in real-time
Legacy single-worker mode (still supported):
python scripts/batch_worker.py W1 pdf1 1 2 3 4
Key behaviors:
- Each page result is appended to
batch_v2_results/{label}_ocr.jsonimmediately after processing - Thread-safe file writes prevent corruption when multiple workers write simultaneously
- Bank/name/account fields are inherited from the latest previously-processed page (may be from a different worker)
- Monitor progress — check for
type=errorpages. Re-process any failed pages after completion
Step 4: Verify OCR Results
After all pages are processed, verify by checking each page's bank name, record count, and error status:
import json
with open("batch_v2_results/pdf1_ocr.json", "r", encoding="utf-8") as f:
data = json.load(f)
for p in data:
print(f"page {p['pdf_page']}: bank={p.get('bank','')} type={p.get('type','')} records={len(p.get('records',[]))}")
Critical verification steps:
- Check for any
type=errorpages → re-process - Check bank name consistency across pages
- Always ask user to confirm the correct bank-per-page mapping before generating Excel
Common OCR misidentification: CMB (招商银行) pages often get misidentified as CCB (建设银行) or ICBC (工商银行). See references/troubleshooting.md for details.
Step 5: Configure Overrides
Create overrides.json in workspace root to handle:
- Bank name corrections per page (when OCR misidentifies the bank)
- Page count limits (when PDF has more pages than actual content)
- PDF-to-filename mapping
{
"pdf_config": [
{"key": "pdf1", "name": "文件名", "owner": "默认户名"}
],
"page_bank_override": {
"pdf3_10": "招商银行"
},
"pdf_max_page": {
"pdf2": 12
}
}
owner field explanation: The owner in pdf_config is the default account holder name for the PDF. During Excel generation, the actual page-level account holder (name field from OCR) is used with forward-inheritance:
- If a page's OCR result has a
namevalue, it becomes the current page's owner - If a page has no
name(empty), it inherits the owner from the previous page - If the first page has no
name, theownerfrompdf_configis used as fallback - A single bank sheet may contain records from multiple owners (e.g., both spouses' accounts); all distinct owners are shown in the sheet title
Key format for page_bank_override: "{pdf_label}_{page_number}": "correct_bank_name".
Important: Always ask the user to confirm the correct bank-per-page mapping and identify the account holder(s) for each PDF before generating Excel.
Step 6: Generate Excel
Run scripts/generate_excel.py:
python scripts/generate_excel.py --workspace /path/to/workspace
Output: One Excel file per PDF (named {原PDF名}_AI整理.xlsx), each containing:
- Per-bank sheets with all transactions
- Source page number column ("来源页码")
- Frozen headers, auto-filter, alternating row colors
- Summary statistics at bottom of each sheet
Output Format
Each Excel file contains sheets named by bank. Each sheet has:
- Title row: "{归属人} - {银行名} 流水明细"
- Source info row: "来源:{PDF名} | 共 N 条记录"
- Column headers at row 4: 日期, 摘要, 收入, 支出, 余额, 对方户名, etc.
- All records with 来源页码 as last column
- Statistics section: total count, income count, expense count, totals, net flow
Scripts
scripts/batch_worker.py: OCR worker for processing PNG pages via vision APIscripts/generate_excel.py: Excel generator from OCR JSON results
References
references/troubleshooting.md: OCR misidentification patterns and solutions
Scan to contact