返回 Skill 列表
extension
分类: 其它需要 API Key

银行流水扫描件整理(律师整理银行流水神器!)

本Skill专注于银行流水扫描件的自动化处理,可以帮助律师将客户提供的大量不规范银行流水扫描件按照银行、开户人整理成可供筛选的Excel表格,经过多名律师实际使用,效率倍增! 核心功能包括: 多格式OCR识别:支持图片、PDF格式的银行流水自动读取,精准提取交易日期、金额、交易对手、摘要等核心字段 数据标准化整理:将识别后的非结构化数据转化为规范的结构化格式,自动处理大小写金额、正负标记等细节 Excel一键生成:直接输出符合财务规范的Excel表格,包含数据透视表、分类汇总等实用功能 批量处理能力:支持多文件同时处理,大幅提升大批量流水数据的整理效率 多Worker并行处理,效率大幅提升。 注意:本Skill需要使用具有图像识别能力的大模型模型API,首次运行会要求填写。 触发场景:当你需要整理银行流水、OCR识别银行流水图片/PDF、提取流水交易数据、生成标准化Excel报表,或者需要对批量流水文件进行自动化处理时,该技能会自动启动工作流。

person作者: user_d2f40577hubcommunity

Bank Statement OCR Workflow

Complete pipeline for converting bank statement PDFs into structured Excel files with per-bank sheets.

Prerequisites

Install dependencies in managed Python venv:

pip install PyMuPDF openpyxl Pillow requests

Set environment variables for OCR API:

  • OCR_API_URL or OPENAI_BASE_URL: Vision API endpoint
  • OCR_API_KEY or OPENAI_API_KEY: API key
  • OCR_MODEL: Model name (default: gpt-4o)

Workflow

Step 0: Check API Configuration

Before starting any work, check if the user has configured the OCR API credentials.

Check environment variables:

  • OCR_API_KEY or OPENAI_API_KEY
  • OCR_API_URL or OPENAI_BASE_URL
  • OCR_MODEL (optional, defaults to gpt-4o)

If any required variable is missing (especially API_KEY), ask the user to provide the following information using ask_followup_question:

请提供OCR识别所需的API配置信息:
1. API Key(必填):用于调用vision模型的密钥
2. API URL(可选):API端点地址,默认为 https://api.openai.com/v1/chat/completions
3. Model(可选):模型名称,默认为 gpt-4o

After receiving the user's input, set the environment variables for the current session (do NOT write them to any file or persist them):

# Set for current session only
$env:OCR_API_KEY = "user_provided_key"
$env:OCR_API_URL = "user_provided_url"   # if provided
$env:OCR_MODEL = "user_provided_model"   # if provided

Do NOT proceed to Step 1 until API_KEY is confirmed.

Step 1: Prepare Workspace

Create directory structure in the workspace:

batch_v2/          # Will hold PNG images per PDF
batch_v2_results/  # Will hold OCR JSON results

Step 2: Convert PDFs to Images

Use PyMuPDF (fitz) to convert each PDF to PNG pages at 200 DPI.

For each PDF file, create a subdirectory under batch_v2/ (e.g., pdf1/, pdf2/) and save pages as page_001.png, page_002.png, etc.

import fitz, os
doc = fitz.open("input.pdf")
out_dir = "batch_v2/pdf1"
os.makedirs(out_dir, exist_ok=True)
for i in range(len(doc)):
    pix = doc[i].get_pixmap(dpi=200)
    pix.save(os.path.join(out_dir, f"page_{i+1:03d}.png"))
doc.close()

Important: If replacing an existing PDF's images, clear old PNGs first. Always verify actual page count matches PDF.

Step 3: Run OCR (Parallel Batch Worker)

Run scripts/batch_worker.py for each PDF label. The script supports auto-parallel mode — pages are distributed across multiple workers automatically.

Auto-parallel mode (recommended):

# Process all pages of a PDF (auto 3 workers)
python scripts/batch_worker.py <label> <start_page> <end_page>

# Example: process pages 1-59 with 3 workers (default)
python scripts/batch_worker.py pdf1 1 59

# Custom worker count
python scripts/batch_worker.py pdf1 1 59 --workers 5

Arguments: label start_page end_page [--workers N]

  • Pages are interleaved across workers for load balancing
  • Default worker count: 3 (configurable via OCR_WORKERS env var)
  • Progress, ETA, and error count are reported in real-time

Legacy single-worker mode (still supported):

python scripts/batch_worker.py W1 pdf1 1 2 3 4

Key behaviors:

  • Each page result is appended to batch_v2_results/{label}_ocr.json immediately after processing
  • Thread-safe file writes prevent corruption when multiple workers write simultaneously
  • Bank/name/account fields are inherited from the latest previously-processed page (may be from a different worker)
  • Monitor progress — check for type=error pages. Re-process any failed pages after completion

Step 4: Verify OCR Results

After all pages are processed, verify by checking each page's bank name, record count, and error status:

import json
with open("batch_v2_results/pdf1_ocr.json", "r", encoding="utf-8") as f:
    data = json.load(f)
for p in data:
    print(f"page {p['pdf_page']}: bank={p.get('bank','')} type={p.get('type','')} records={len(p.get('records',[]))}")

Critical verification steps:

  1. Check for any type=error pages → re-process
  2. Check bank name consistency across pages
  3. Always ask user to confirm the correct bank-per-page mapping before generating Excel

Common OCR misidentification: CMB (招商银行) pages often get misidentified as CCB (建设银行) or ICBC (工商银行). See references/troubleshooting.md for details.

Step 5: Configure Overrides

Create overrides.json in workspace root to handle:

  • Bank name corrections per page (when OCR misidentifies the bank)
  • Page count limits (when PDF has more pages than actual content)
  • PDF-to-filename mapping
{
  "pdf_config": [
    {"key": "pdf1", "name": "文件名", "owner": "默认户名"}
  ],
  "page_bank_override": {
    "pdf3_10": "招商银行"
  },
  "pdf_max_page": {
    "pdf2": 12
  }
}

owner field explanation: The owner in pdf_config is the default account holder name for the PDF. During Excel generation, the actual page-level account holder (name field from OCR) is used with forward-inheritance:

  • If a page's OCR result has a name value, it becomes the current page's owner
  • If a page has no name (empty), it inherits the owner from the previous page
  • If the first page has no name, the owner from pdf_config is used as fallback
  • A single bank sheet may contain records from multiple owners (e.g., both spouses' accounts); all distinct owners are shown in the sheet title

Key format for page_bank_override: "{pdf_label}_{page_number}": "correct_bank_name".

Important: Always ask the user to confirm the correct bank-per-page mapping and identify the account holder(s) for each PDF before generating Excel.

Step 6: Generate Excel

Run scripts/generate_excel.py:

python scripts/generate_excel.py --workspace /path/to/workspace

Output: One Excel file per PDF (named {原PDF名}_AI整理.xlsx), each containing:

  • Per-bank sheets with all transactions
  • Source page number column ("来源页码")
  • Frozen headers, auto-filter, alternating row colors
  • Summary statistics at bottom of each sheet

Output Format

Each Excel file contains sheets named by bank. Each sheet has:

  • Title row: "{归属人} - {银行名} 流水明细"
  • Source info row: "来源:{PDF名} | 共 N 条记录"
  • Column headers at row 4: 日期, 摘要, 收入, 支出, 余额, 对方户名, etc.
  • All records with 来源页码 as last column
  • Statistics section: total count, income count, expense count, totals, net flow

Scripts

  • scripts/batch_worker.py: OCR worker for processing PNG pages via vision API
  • scripts/generate_excel.py: Excel generator from OCR JSON results

References

  • references/troubleshooting.md: OCR misidentification patterns and solutions