Word Document Splitter & Merger
Split large .docx files into smaller documents by article/section, then reassemble after editing.
Dependencies
pip install python-docx lxml --break-system-packages
Quick Reference
| Task | Command |
|------|---------|
| List styles | python scripts/docx_splitter.py doc.docx --list-styles |
| Split by style | python scripts/docx_splitter.py doc.docx --split-on "Heading 1" |
| Split by pattern | python scripts/docx_splitter.py doc.docx --pattern "ARTICLE" |
| Merge chunks | python scripts/docx_merger.py ./chunks/ -o merged.docx |
Splitting Documents
Step 1: Identify Split Points
Always check available styles first:
python scripts/docx_splitter.py document.docx --list-styles
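For a sense of what a style listing involves, here is a minimal standard-library sketch that counts the paragraph style IDs in a .docx body. It is not the code behind --list-styles; the function names `count_style_ids` and `count_style_ids_in_docx` are hypothetical. Note that word/document.xml stores style IDs (for example "Heading1"), while the display names such as "Heading 1" live in word/styles.xml, so the IDs printed here may differ from the names the splitter expects.

```python
import zipfile
import xml.etree.ElementTree as ET
from collections import Counter

# WordprocessingML namespace used by word/document.xml
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_style_ids(document_xml: bytes) -> Counter:
    """Count paragraphs per style ID in a document.xml payload."""
    root = ET.fromstring(document_xml)
    counts = Counter()
    for para in root.iter(f"{{{W}}}p"):
        style = para.find(f"{{{W}}}pPr/{{{W}}}pStyle")
        # Paragraphs without an explicit style use the document default.
        style_id = style.get(f"{{{W}}}val") if style is not None else "(default)"
        counts[style_id] += 1
    return counts


def count_style_ids_in_docx(path) -> Counter:
    """Hypothetical convenience wrapper: read document.xml out of a .docx."""
    with zipfile.ZipFile(path) as archive:
        return count_style_ids(archive.read("word/document.xml"))
```

Calling `count_style_ids_in_docx("document.docx").most_common()` would list the most frequent style IDs first, which is usually enough to spot the heading level worth splitting on.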
Step 2: Split
By heading style (preferred for well-formatted documents):
python scripts/docx_splitter.py document.docx --split-on "Heading 1" --output-dir ./chunks
By text pattern (for inconsistent styling):
python scripts/docx_splitter.py document.docx --pattern "ARTICLE" --output-dir ./chunks
python scripts/docx_splitter.py document.docx --pattern "^SECTION \d+" --output-dir ./chunks
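The effect of a pattern on split points can be sketched on plain paragraph text. This is an illustration only: `split_by_pattern` is a hypothetical helper, and it assumes the pattern is applied with `re.search` to each paragraph, which the script may or may not do.

```python
import re


def split_by_pattern(paragraphs, pattern):
    """Group paragraph texts into chunks, starting a new chunk at each match."""
    regex = re.compile(pattern)
    chunks = [[]]  # chunks[0] collects the preamble and may stay empty
    for text in paragraphs:
        if regex.search(text):
            chunks.append([text])  # matching paragraph opens a new chunk
        else:
            chunks[-1].append(text)
    return chunks


paragraphs = [
    "CREDIT AGREEMENT",
    "ARTICLE I DEFINITIONS",
    "Section 1.01 Defined Terms",
    "ARTICLE II THE LOANS",
    "Section 2.01 Commitments",
]
chunks = split_by_pattern(paragraphs, r"^ARTICLE\b")
for number, chunk in enumerate(chunks):
    print(number, chunk[0] if chunk else "(empty preamble)")
```

Under the same assumption, an unanchored pattern such as "ARTICLE" would also match body text like "see ARTICLE I above", so anchoring with `^` is the safer choice when headings start the paragraph.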
Split Options
| Option | Description |
|--------|-------------|
| --split-on, -s | Style name (e.g., "Heading 1", "Heading 2") |
| --pattern, -p | Regex pattern (e.g., "ARTICLE", "^SECTION \d+") |
| --output-dir, -o | Output directory (default: <input>_split/) |
| --no-header | Exclude preamble content from each chunk |
Output Structure
Files are numbered for proper ordering:
document_split/
├── 00_Preamble.docx # Content before first split point
├── 01_ARTICLE_I.docx
├── 02_ARTICLE_II.docx
└── ...
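A naming scheme like the one above can be sketched as follows. The helper `chunk_filename` and its two-word truncation are assumptions made for illustration, not the splitter's actual rule.

```python
import re


def chunk_filename(index, title, max_words=2):
    """Build a zero-padded, filesystem-safe chunk name from a heading."""
    # Keep only alphanumeric runs so punctuation and spaces never reach the name.
    words = re.findall(r"[A-Za-z0-9]+", title)[:max_words]
    label = "_".join(words) or "Untitled"
    return f"{index:02d}_{label}.docx"


print(chunk_filename(0, "Preamble"))
print(chunk_filename(1, "ARTICLE I DEFINITIONS"))
print(chunk_filename(2, "ARTICLE II. THE LOANS"))
```

Zero-padding the index keeps a plain alphabetical listing in document order; with two digits that holds for up to 100 chunks.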
Merging Documents
Reassemble edited chunks:
python scripts/docx_merger.py ./chunks/ -o final_document.docx
Merge specific files in order:
python scripts/docx_merger.py file1.docx file2.docx file3.docx -o merged.docx
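When merging a directory, the order of the chunks matters. How docx_merger.py orders them is not stated here; the sketch below shows one plausible approach, sorting by the numeric prefix so that renamed or unpadded files still land in sequence. `ordered_chunks` is a hypothetical helper.

```python
import re
from pathlib import Path


def ordered_chunks(names):
    """Sort chunk file names by their leading number; unnumbered files go last."""
    def sort_key(name):
        base = Path(name).name
        match = re.match(r"(\d+)_", base)
        number = int(match.group(1)) if match else float("inf")
        return (number, base)

    return sorted(names, key=sort_key)


files = ["10_ARTICLE_X.docx", "2_ARTICLE_II.docx", "notes.docx", "00_Preamble.docx"]
print(ordered_chunks(files))
```

Comparing the prefixes as integers avoids the usual pitfall of string sorting, where "10_" would come before "2_".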
Merge Options
| Option | Description |
|--------|-------------|
| -o, --output | Output path (required) |
| --page-breaks | Add page breaks between sections |
| --include-duplicates | Keep duplicate preamble content |
Typical Workflow
# 1. Check document structure
python scripts/docx_splitter.py BigContract.docx --list-styles
# 2. Split by articles
python scripts/docx_splitter.py BigContract.docx --split-on "Heading 1" -o ./chunks
# 3. Edit individual chunks (with Harvey, Claude, or manually)
# 4. Reassemble
python scripts/docx_merger.py ./chunks/ -o BigContract_Final.docx
Limitations
- Cross-references: Internal references preserved but not auto-updated if sections renumber
- Tables: Tables spanning split boundaries may break
- Numbering: Automatic numbering restarts per chunk
- Track changes: Preserved but may need reconciliation after merge
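Because cross-references are not updated automatically, it can help to list them for manual review after a merge. The following is a rough sketch that works on plain paragraph text; the regular expression and the helper `find_cross_references` are illustrative assumptions and will miss references phrased differently.

```python
import re

# Matches references such as "Section 2.01" or "Article IV" (illustrative only).
XREF = re.compile(
    r"\b(?:Section|SECTION|Article|ARTICLE)\s+(?:\d+(?:\.\d+)*|[IVXLC]+)\b"
)


def find_cross_references(paragraphs):
    """Return (paragraph_index, reference_text) pairs for manual review."""
    hits = []
    for index, text in enumerate(paragraphs):
        for match in XREF.finditer(text):
            hits.append((index, match.group(0)))
    return hits


sample = [
    "Subject to Section 2.01 and Article IV, the Borrower shall repay the Loans.",
    "No references in this paragraph.",
]
print(find_cross_references(sample))
```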
Tips for Legal Documents
- Split at ARTICLE level for credit agreements and loan documents
- Use --pattern when heading styles are inconsistent (common in opposing counsel docs)
- For very large definition sections, re-split that chunk by "Heading 2"
- Review merged document at section boundaries for formatting issues