Category: Other
No API key required

tomark-skill

> Batch-converts .doc/.docx/.wps documents to Markdown. Pipeline: .docx → mammoth; .doc/.wps → WPS COM → .docx → mammoth. Highlights: fully resolves garbled Chinese text and supports WPS proprietary formats.

Author: user_f474679a (community)

tomark-skill: Batch Document-to-Markdown Conversion

Overview

This skill converts Word/WPS documents (.doc, .docx, .wps) to Markdown (.md) files.

Supported formats:

  • .docx — converted via mammoth (preserves headings, bold, lists)
  • .doc / .wps — converted via WPS Office COM (KWps.Application) to .docx first, then mammoth
  • Batch or single file, with sub-directory structure preserved
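
The routing described above can be pictured as a small lookup keyed on the file extension. This is only an illustrative sketch, not the script's actual code: the name `conversion_route` and the step labels are hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Hypothetical step labels; the real script may organise this differently.
ROUTES = {
    ".docx": ["mammoth"],
    ".doc": ["wps_com_to_docx", "mammoth"],
    ".wps": ["wps_com_to_docx", "mammoth"],
}


def conversion_route(path):
    """Return the ordered conversion steps for a file, or [] if unsupported."""
    suffix = Path(path).suffix.lower()  # assumption: .DOC and .doc are treated alike
    return list(ROUTES.get(suffix, []))


print(conversion_route("report.DOC"))
```

Keeping the route as data makes it easy to see that mammoth is always the final step and that WPS Office is needed only for the two legacy formats.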

Prerequisites (Windows):

  • Python 3.x
  • mammoth library: pip install mammoth
  • pywin32 library: pip install pywin32
  • WPS Office installed (for .doc/.wps files)

Workflow Decision Tree

User request
    │
    ├─ Single file?
    │       └─ Call convert_file() directly
    │
    └─ Folder / batch?
            └─ Run scripts/convert_to_markdown.py with SRC_DIR set
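
For the folder branch, the input set is every supported document under the source folder. A minimal sketch of that discovery step, assuming a recursive scan; `find_sources` is a hypothetical helper and is not confirmed to exist in scripts/convert_to_markdown.py:

```python
import tempfile
from pathlib import Path

SUPPORTED = {".doc", ".docx", ".wps"}


def find_sources(src_dir):
    """Recursively list supported documents under src_dir, sorted for stable output."""
    root = Path(src_dir)
    return sorted(
        p for p in root.rglob("*")
        if p.is_file()
        and p.suffix.lower() in SUPPORTED
        and not p.name.startswith("~$")  # skip Word owner/lock files such as ~$a.docx
    )


# Small demonstration on a throwaway folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for name in ("a.docx", "notes.txt", "sub/b.wps", "~$a.docx"):
        (root / name).touch()
    found = [p.relative_to(root).as_posix() for p in find_sources(root)]

print(found)
```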

Step 1 — Check Prerequisites

To verify the environment before running:

import glob

# Check mammoth (required for every conversion)
try:
    import mammoth
    print("mammoth: ok")
except ImportError:
    print("mammoth: MISSING (run: pip install mammoth)")

# Check pywin32 (needed only for .doc/.wps)
try:
    import win32com.client
    print("pywin32: ok")
except ImportError:
    print("pywin32: MISSING (run: pip install pywin32)")

# Check WPS Office. Only this one install location is searched,
# so an installation elsewhere is reported as NOT FOUND.
wps = glob.glob("C:/Program Files (x86)/Kingsoft/WPS Office/*/office6/wps.exe")
print("WPS Office:", wps[0] if wps else "NOT FOUND")

If WPS Office is not installed, .doc/.wps files cannot be converted. Only .docx files can be processed with mammoth alone.


Step 2 — Configure and Run

To run the batch conversion script:

  1. Open scripts/convert_to_markdown.py
  2. Set SRC_DIR to the source folder path
  3. Optionally set OUT_DIR (default: <SRC_DIR>/markdown_output/)
  4. Run: python -X utf8 scripts/convert_to_markdown.py
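
How a source file maps to its output location follows from the OUT_DIR default above and the mirrored folder layout. A minimal sketch of that mapping; `output_path_for` is a hypothetical name, and the script's real logic may differ:

```python
from pathlib import Path


def output_path_for(src, src_dir, out_dir=None):
    """Map a source document to its .md path, mirroring sub-directories."""
    src, src_dir = Path(src), Path(src_dir)
    # Default taken from the instructions above: <SRC_DIR>/markdown_output/
    out_root = Path(out_dir) if out_dir else src_dir / "markdown_output"
    return (out_root / src.relative_to(src_dir)).with_suffix(".md")


print(output_path_for("docs/2023/plan.wps", "docs").as_posix())
# docs/markdown_output/2023/plan.md
```

Under this mapping, report.doc and report.docx in the same folder would both resolve to report.md; how the real script handles such a collision is not stated.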

To convert a single file inline:

from pathlib import Path
# Import the helper from the script; run this from the skill's root folder
# so that the scripts directory is importable
from scripts.convert_to_markdown import convert_file

src = Path(r"D:\documents\example.doc")
out = Path(r"D:\documents\example.md")
ok, msg = convert_file(src, out)
print(ok, msg)
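
The internals of convert_file are not shown here. For the .docx branch, a direct mammoth call might look like the sketch below, which returns an (ok, msg) pair in the same shape as above. `docx_to_markdown` is a hypothetical name, and the real implementation may differ.

```python
from pathlib import Path


def docx_to_markdown(src, out):
    """Convert one .docx to Markdown with mammoth; return (ok, message)."""
    try:
        import mammoth  # imported lazily so a missing install is reported, not fatal

        with open(src, "rb") as f:
            result = mammoth.convert_to_markdown(f)
        out = Path(out)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(result.value, encoding="utf-8")
        # result.messages holds mammoth's conversion warnings, if any
        return True, f"{len(result.messages)} warning(s)"
    except Exception as exc:  # batch-style: report the failure instead of raising
        return False, f"{type(exc).__name__}: {exc}"
```

Returning a status pair rather than raising lets a batch run continue past a bad file and collect the message for the final report.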

Step 3 — Interpret the Output

After conversion completes:

  • All .md files are in OUT_DIR, mirroring the original sub-directory structure
  • A report file, 转换报告.md ("conversion report"), is generated listing successes and failures
  • Typical failure causes:
    • WPS Office not installed (for .doc/.wps)
    • Password-protected documents
    • Severely corrupted files
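
The exact layout of the report is not documented here. As a rough illustration of how per-file results could be summarised, the sketch below renders a list of (path, ok, msg) tuples; the `build_report` name, headings, and wording are all assumptions:

```python
def build_report(results):
    """Render (path, ok, msg) tuples as a Markdown report string."""
    succeeded = [r for r in results if r[1]]
    failed = [r for r in results if not r[1]]
    lines = [
        "# Conversion report",
        "",
        f"Succeeded: {len(succeeded)}, failed: {len(failed)}",
        "",
        "## Failures",
        "",
    ]
    # List each failure with its reason, or a placeholder when everything passed
    lines += [f"- {path}: {msg}" for path, _, msg in failed] or ["- none"]
    return "\n".join(lines) + "\n"


print(build_report([
    ("a.docx", True, "ok"),
    ("b.doc", False, "WPS Office not installed"),
]))
```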

Encoding Notes

  • Always run with python -X utf8 on Windows to avoid GBK encoding issues
  • The script forces sys.stdout to UTF-8 internally
  • Output .md files are always written as UTF-8
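
The script's own encoding code is not reproduced here. Assuming Python 3.7 or later, forcing UTF-8 for console output and for written files typically looks like the sketch below; `write_markdown` is a hypothetical helper name.

```python
import sys
import tempfile
from pathlib import Path

# Switch console output to UTF-8 so Chinese text prints correctly even when
# the Windows console defaults to GBK. Skipped if stdout was replaced by an
# object without reconfigure().
if hasattr(sys.stdout, "reconfigure"):
    sys.stdout.reconfigure(encoding="utf-8")


def write_markdown(path, text):
    """Write text as UTF-8 regardless of the platform's default encoding."""
    # newline="\n" keeps line endings identical on every platform
    with open(path, "w", encoding="utf-8", newline="\n") as f:
        f.write(text)


# Round-trip check on a throwaway file.
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "sample.md"
    write_markdown(target, "# 标题\n")
    raw = target.read_bytes()

print(raw == "# 标题\n".encode("utf-8"))
```

Passing the encoding explicitly matters because, outside UTF-8 mode, open() on Windows falls back to the locale encoding, which is where GBK-related garbling usually comes from.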

Resources

scripts/

  • convert_to_markdown.py — main batch conversion script (configure SRC_DIR at the top)

references/

  • format_guide.md — notes on mammoth output format and post-processing tips