tomark-skill: Batch Document-to-Markdown Conversion
Overview
Converts Word/WPS documents (.doc, .docx, .wps) to Markdown (.md).
Supported formats:
- .docx: converted via mammoth (preserves headings, bold, lists)
- .doc / .wps: converted via WPS Office COM (KWps.Application) to .docx first, then mammoth
- Batch or single file, with sub-directory structure preserved
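The format rules above amount to a lookup on the file extension. The following is a minimal sketch, assuming a hypothetical helper `conversion_route` and hypothetical route names (neither is taken from the script); it only names the route and performs no conversion.

```python
from pathlib import Path

# Hypothetical mapping for illustration; the real script may organise this differently.
ROUTES = {
    ".docx": "mammoth",
    ".doc": "wps-com-then-mammoth",
    ".wps": "wps-com-then-mammoth",
}


def conversion_route(path):
    """Return the route name for a file, or None if the extension is unsupported."""
    return ROUTES.get(Path(path).suffix.lower())


if __name__ == "__main__":
    for name in ("report.docx", "legacy.DOC", "notes.wps", "image.png"):
        print(name, "->", conversion_route(name))
```

Lower-casing the suffix means files such as `LEGACY.DOC` take the same route as `legacy.doc`.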
Prerequisites (Windows):
- Python 3.x
- mammoth library: pip install mammoth
- pywin32 library: pip install pywin32
- WPS Office installed (for .doc/.wps files)
Workflow Decision Tree
User request
│
├─ Single file?
│ └─ Call convert_file() directly
│
└─ Folder / batch?
  └─ Run scripts/convert_to_markdown.py with SRC_DIR set
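The decision tree can be expressed as a small path check. This is a sketch only: `choose_workflow` is a hypothetical name, and it merely reports which branch applies rather than running anything.

```python
from pathlib import Path


def choose_workflow(target):
    """Return "single" for a file path and "batch" for a directory."""
    target = Path(target)
    if target.is_dir():
        return "batch"   # run scripts/convert_to_markdown.py with SRC_DIR set
    if target.is_file():
        return "single"  # call convert_file() directly
    raise FileNotFoundError(f"No such file or folder: {target}")
```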
Step 1 — Check Prerequisites
To verify the environment before running:
import glob

# Check mammoth
try:
    import mammoth
    print("mammoth: ok")
except ImportError:
    print("mammoth: MISSING, run: pip install mammoth")

# Check pywin32 (for .doc/.wps)
try:
    import win32com.client
    print("pywin32: ok")
except ImportError:
    print("pywin32: MISSING, run: pip install pywin32")

# Check WPS Office (only the Program Files (x86) install path is searched)
wps = glob.glob("C:/Program Files (x86)/Kingsoft/WPS Office/*/office6/wps.exe")
print("WPS Office:", wps[0] if wps else "NOT FOUND")
If WPS Office is not installed, .doc/.wps files cannot be converted.
Only .docx files can be processed with mammoth alone.
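If importing the packages just to test for them is undesirable, the standard library can probe for them instead. A minimal sketch, assuming a hypothetical helper `missing_modules`; it uses `importlib.util.find_spec`, which locates a top-level module without importing it.

```python
import importlib.util


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    # win32com is the top-level package installed by pywin32.
    missing = missing_modules(["mammoth", "win32com"])
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All Python prerequisites found")
```

This covers only the Python packages; the WPS Office check still needs a filesystem or COM probe.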
Step 2 — Configure and Run
To run the batch conversion script:
- Open scripts/convert_to_markdown.py
- Set SRC_DIR to the source folder path
- Optionally set OUT_DIR (default: <SRC_DIR>/markdown_output/)
- Run:
python -X utf8 scripts/convert_to_markdown.py
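To illustrate how SRC_DIR and the default OUT_DIR relate, the sketch below pairs each supported source file with a mirrored .md path. It is an assumption about how such a mapping could work, not the script's actual code; `plan_outputs` and `SUPPORTED` are hypothetical names.

```python
from pathlib import Path

SUPPORTED = {".doc", ".docx", ".wps"}


def plan_outputs(src_dir, out_dir=None):
    """Map each supported source file to a mirrored .md path under out_dir."""
    src_dir = Path(src_dir)
    # Default mirrors the documented <SRC_DIR>/markdown_output/ location.
    out_dir = Path(out_dir) if out_dir else src_dir / "markdown_output"
    plan = []
    for src in sorted(src_dir.rglob("*")):
        if not src.is_file() or src.suffix.lower() not in SUPPORTED:
            continue
        if out_dir in src.parents:
            continue  # skip anything already inside the output folder
        rel = src.relative_to(src_dir)
        plan.append((src, out_dir / rel.with_suffix(".md")))
    return plan
```

In this sketch, two sources that differ only by extension (for example `a.doc` and `a.docx` in the same folder) map to the same output path; a real run would need a rule for that case.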
To convert a single file inline:
from pathlib import Path
# import the helper functions from the script
from scripts.convert_to_markdown import convert_file
src = Path(r"D:\documents\example.doc")
out = Path(r"D:\documents\example.md")
ok, msg = convert_file(src, out)
print(ok, msg)
Step 3 — Interpret the Output
After conversion completes:
- All .md files are in OUT_DIR, mirroring the original sub-directory structure
- A 转换报告.md ("conversion report") file is generated, listing successes and failures
- Typical failure causes:
  - WPS Office not installed (for .doc/.wps)
  - Password-protected documents
  - Severely corrupted files
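A report of successes and failures can be assembled from the `(ok, msg)` pairs that `convert_file()` returns. The layout below is a guess for illustration only; the actual structure of 转换报告.md is not documented here, and `build_report` is a hypothetical helper.

```python
def build_report(results):
    """Render (path, ok, message) tuples as a Markdown report string."""
    succeeded = [r for r in results if r[1]]
    failed = [r for r in results if not r[1]]
    lines = [
        "# Conversion report",
        "",
        f"- Succeeded: {len(succeeded)}",
        f"- Failed: {len(failed)}",
    ]
    if failed:
        # List each failed file with the message returned for it.
        lines += ["", "## Failures", ""]
        lines += [f"- {path}: {message}" for path, _, message in failed]
    return "\n".join(lines) + "\n"
```

Scanning the failure messages against the typical causes listed above is usually the quickest way to decide whether a file needs WPS Office, a password removed, or repair.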
Encoding Notes
- Always run with python -X utf8 on Windows to avoid GBK encoding issues
- The script forces sys.stdout to UTF-8 internally
- Output .md files are always written as UTF-8
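The two encoding measures above can be sketched with the standard library. This is an assumption about how they might be implemented, not an excerpt from the script; `force_utf8_stdout` and `write_markdown` are hypothetical names.

```python
import sys
from pathlib import Path


def force_utf8_stdout():
    """Switch stdout to UTF-8 when the stream supports reconfiguration (Python 3.7+)."""
    if hasattr(sys.stdout, "reconfigure"):
        sys.stdout.reconfigure(encoding="utf-8")


def write_markdown(path, text):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(text, encoding="utf-8")
```

Passing `encoding="utf-8"` explicitly matters on Windows, where the default text encoding may be a legacy code page such as GBK.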
Resources
scripts/
- convert_to_markdown.py: main batch conversion script (configure SRC_DIR at the top)
references/
- format_guide.md: notes on mammoth output format and post-processing tips