Category: Productivity and Office · No API key required

Pandoc

Use pandoc to convert documents between formats, including HTML, Markdown, DOCX, PDF, EPUB, and LaTeX.

Author: oliver-hrkltz (clawhub)

Pandoc Document Converter

Convert documents between any formats pandoc supports, with full control over styling, templates, table of contents, metadata, and PDF engine selection.

Quick Start

For most conversions, use the helper script at scripts/convert.sh:

bash <skill-dir>/scripts/convert.sh <input-file> <output-file> [options...]

The script auto-detects formats from file extensions and applies sensible defaults (standalone output, appropriate PDF engine, default LaTeX margins for LaTeX-based PDF engines). It also checks that pandoc, the input file, the output directory, and any requested PDF engine are available. Any extra arguments are passed through to pandoc.
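The script itself is not reproduced here. As a rough sketch of the kind of preflight checks described above, something like the following would catch a missing tool or a bad path before pandoc runs. The helper names require_cmd and check_paths are hypothetical and are not taken from convert.sh.

```shell
# Hypothetical preflight helpers; NOT the contents of scripts/convert.sh.

# require_cmd NAME: succeed only if NAME can be found on PATH.
require_cmd() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "error: '$1' not found on PATH" >&2
    return 1
  fi
}

# check_paths INPUT OUTPUT: the input must be a readable regular file and
# the output's parent directory must already exist.
check_paths() {
  if [ ! -f "$1" ] || [ ! -r "$1" ]; then
    echo "error: cannot read input file '$1'" >&2
    return 1
  fi
  out_dir=$(dirname "$2")
  if [ ! -d "$out_dir" ]; then
    echo "error: output directory '$out_dir' does not exist" >&2
    return 1
  fi
}

# Usage (uncomment to run a real conversion; file names are placeholders):
# require_cmd pandoc && check_paths notes.md build/notes.pdf &&
#   pandoc notes.md -o build/notes.pdf
```

Failing early with a specific message is usually easier to act on than a pandoc or LaTeX error that surfaces halfway through a conversion.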

How Conversions Work

Pandoc reads a source format into an internal AST, then writes it out in the target format. This means you can go from nearly any supported input to any supported output. The key decision points are:

  1. Input format — usually auto-detected from the file extension
  2. Output format — auto-detected from the output file extension
  3. PDF engine — for PDF output, choose between xelatex (best Unicode/font support), lualatex (strong Unicode/fonts), tectonic (self-contained TeX), pdflatex (fastest, good for ASCII-heavy docs), or HTML/CSS engines like weasyprint, wkhtmltopdf, or prince
  4. Styling — CSS for HTML-based outputs, LaTeX templates for PDF, reference docs for DOCX/ODT
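One way to handle decision point 3 is to probe PATH for an installed engine. The sketch below is illustrative only: pick_pdf_engine is a hypothetical helper, and its default preference order is an assumption, not the order convert.sh or pandoc uses.

```shell
# pick_pdf_engine [CANDIDATE...]: print the first candidate found on PATH.
# With no arguments, fall back to an illustrative preference order
# (an assumption for this sketch).
pick_pdf_engine() {
  if [ "$#" -eq 0 ]; then
    set -- xelatex lualatex tectonic pdflatex weasyprint wkhtmltopdf prince
  fi
  for engine in "$@"; do
    if command -v "$engine" >/dev/null 2>&1; then
      printf '%s\n' "$engine"
      return 0
    fi
  done
  echo "error: no PDF engine found (tried: $*)" >&2
  return 1
}

# Usage (requires pandoc):
# engine=$(pick_pdf_engine) && pandoc input.md -o output.pdf --pdf-engine="$engine"
```

Passing an explicit candidate list, for example pick_pdf_engine weasyprint prince, restricts the search to HTML/CSS engines when the document is styled with CSS.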

Common Conversion Patterns

HTML → PDF

pandoc input.html -o output.pdf --pdf-engine=weasyprint -s

If the HTML uses external CSS, include it:

pandoc input.html -o output.pdf --pdf-engine=weasyprint -s --css=style.css

Markdown → PDF

pandoc input.md -o output.pdf --pdf-engine=xelatex -s --toc --toc-depth=3

Markdown → DOCX

pandoc input.md -o output.docx -s

To use a reference (template) document for styling:

pandoc input.md -o output.docx --reference-doc=template.docx

Markdown → HTML

pandoc input.md -o output.html -s --css=style.css --toc

DOCX → Markdown

pandoc input.docx -o output.md --extract-media=./media

Markdown → EPUB

pandoc input.md -o output.epub -s --toc --epub-cover-image=cover.jpg

LaTeX → PDF

pandoc input.tex -o output.pdf --pdf-engine=xelatex

CSV → HTML table

pandoc input.csv -o output.html -s

Styling and Appearance

CSS for HTML-based outputs

Create or use a CSS file and pass it with --css=path/to/style.css. For PDF output via weasyprint, wkhtmltopdf, or prince, CSS is respected directly. For PDF via LaTeX engines, CSS is usually ignored — use LaTeX variables or templates instead.

A sensible default stylesheet is provided at assets/default.css. Use it when the user wants a clean, readable output without specifying their own styles:

pandoc input.md -o output.html -s --css=<skill-dir>/assets/default.css

LaTeX variables for PDF styling

Control margins, fonts, and paper size without a full template:

pandoc input.md -o output.pdf --pdf-engine=xelatex \
  -V geometry:margin=1in \
  -V fontsize=12pt \
  -V mainfont="DejaVu Serif" \
  -V documentclass=article
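If the same variables are reused across documents, they can live in a pandoc defaults file (the -d/--defaults option, available since pandoc 2.8) instead of on the command line. A minimal sketch, assuming the placeholder file name pdf-defaults.yaml:

```shell
# Write a defaults file that mirrors the -V flags shown above.
# The file name pdf-defaults.yaml is a placeholder.
cat > pdf-defaults.yaml <<'EOF'
pdf-engine: xelatex
variables:
  geometry: margin=1in
  fontsize: 12pt
  mainfont: "DejaVu Serif"
  documentclass: article
EOF

# Apply it (requires pandoc and xelatex):
# pandoc input.md -o output.pdf -d pdf-defaults.yaml
```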

Reference documents for DOCX/ODT

To match a corporate style, provide a reference document:

pandoc input.md -o output.docx --reference-doc=brand-template.docx

Advanced Features

Table of Contents

Add --toc and optionally --toc-depth=N (default 3):

pandoc input.md -o output.pdf --pdf-engine=xelatex -s --toc --toc-depth=2

Metadata

Set title, author, date via YAML frontmatter in the source file or via -M:

pandoc input.md -o output.pdf --pdf-engine=xelatex -s \
  -M title="My Report" -M author="Jane Doe" -M date="2026-03-15"
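For comparison, here is the same metadata carried as YAML front matter. This is a small sketch; the file name report.md and the field values are placeholders. When both are present, a value given with -M on the command line overrides the value in the YAML block.

```shell
# Write a sample Markdown file whose YAML front matter carries the metadata.
# report.md and all field values are placeholders.
cat > report.md <<'EOF'
---
title: "My Report"
author: "Jane Doe"
date: "2026-03-15"
---

# Introduction

Body text goes here.
EOF

# Convert it (requires pandoc and xelatex); no -M flags are needed:
# pandoc report.md -o report.pdf --pdf-engine=xelatex -s
```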

Filters and Lua filters

Pandoc supports filters that transform the AST between reading and writing. Lua filters run in pandoc's built-in Lua interpreter, so they need no external runtime or libraries:

pandoc input.md -o output.pdf --lua-filter=my-filter.lua
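As an illustration of what such a filter can contain, the sketch below writes a small filter that turns bold text into small caps. This is the kind of minimal example found in pandoc's Lua filter documentation, not a filter shipped with this skill.

```shell
# Write a small Lua filter; my-filter.lua matches the name used above.
cat > my-filter.lua <<'EOF'
-- Replace every Strong (bold) inline element with small caps.
function Strong(elem)
  return pandoc.SmallCaps(elem.content)
end
EOF

# Apply it (requires pandoc):
# pandoc input.md -o output.html -s --lua-filter=my-filter.lua
```

Pandoc calls a function named after an AST element type once for each element of that type, and the returned value replaces the element.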

Multiple input files

Pandoc concatenates multiple inputs:

pandoc chapter1.md chapter2.md chapter3.md -o book.pdf --pdf-engine=xelatex -s --toc

Extracting media from DOCX/EPUB

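--extract-media writes images and other media from the source document into the given directory, creating it if necessary, and rewrites the references in the output to point at the extracted files:

pandoc input.docx -o output.md --extract-media=./media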

Troubleshooting

| Problem | Likely cause | Fix |
|---|---|---|
| PDF has missing characters | Font doesn't support the glyphs | Use --pdf-engine=xelatex with -V mainfont="DejaVu Serif" |
| PDF conversion fails | No compatible PDF engine installed | Run: which xelatex lualatex tectonic pdflatex weasyprint wkhtmltopdf prince, then install an engine that matches your output needs |
| DOCX looks unstyled | No reference doc | Create a styled DOCX template and pass --reference-doc |
| HTML images missing | Relative paths broken | Use --embed-resources -s (--self-contained on pandoc older than 2.19) to embed images as base64 data URIs |
| CSS has no effect on PDF | LaTeX PDF engine selected | Use --pdf-engine=weasyprint, --pdf-engine=wkhtmltopdf, or --pdf-engine=prince |
| Table of contents empty | No headings in source | Ensure the source uses # headings (Markdown) or <h1> to <h6> (HTML) |

Format Reference

For a full list of supported input and output formats, see references/formats.md.

Choosing the Right Approach

When a user asks to convert a document, think about:

  1. What's the source format? Check the file extension or ask. If it's ambiguous (e.g., a .txt that's actually Markdown), specify -f markdown explicitly.
  2. What's the target format? Map the user's intent to a file extension.
  3. Does it need styling? If the user wants it to "look nice" or "be professional," add CSS (for HTML) or LaTeX variables (for PDF) or a reference doc (for DOCX).
  4. Does it need structure? TOC, numbered sections, metadata — add these when the document is long or formal.
  5. Are there images or media? Use --embed-resources (--self-contained on pandoc older than 2.19) for HTML, and --extract-media when converting from DOCX/EPUB to text formats.
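The checklist above can be sketched as a small function that prints a suggested command without running it. Everything here is an assumption for illustration: build_pandoc_cmd is a hypothetical name, style.css and brand-template.docx are placeholder files, and the flag choices are one reasonable mapping, not the logic of convert.sh.

```shell
# build_pandoc_cmd INPUT OUTPUT: print a suggested pandoc command line.
# Hypothetical helper; it only prints and never invokes pandoc.
build_pandoc_cmd() {
  input=$1
  output=$2
  set -- pandoc "$input" -o "$output" -s

  # Step 1: a .txt source is ambiguous, so name the reader explicitly.
  case "$input" in
    *.txt) set -- "$@" -f markdown ;;
  esac

  # Steps 2 to 4: pick styling flags from the target extension.
  # (--embed-resources needs pandoc 2.19+; older versions use --self-contained.)
  case "$output" in
    *.pdf)  set -- "$@" --pdf-engine=xelatex -V geometry:margin=1in ;;
    *.html) set -- "$@" --css=style.css --embed-resources ;;
    *.docx) set -- "$@" --reference-doc=brand-template.docx ;;
  esac

  # Step 5: pull media out when going from DOCX/EPUB to Markdown.
  case "$input:$output" in
    *.docx:*.md|*.epub:*.md) set -- "$@" --extract-media=./media ;;
  esac

  printf '%s\n' "$*"
}

# Usage:
# build_pandoc_cmd notes.txt notes.pdf
```

Printing the command first makes it easy to review the chosen flags, then add --toc or -M options by hand for long or formal documents.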

Always use the helper script scripts/convert.sh as the starting point — it handles the most common gotchas automatically, picks a reasonable PDF engine, and prints recovery hints when PDF conversion fails. Add extra pandoc flags as needed for the specific use case.