PDF Downloader
Workflow
- Identify the best source URL.
- If the user gives a URL, use it directly.
- If the user gives only a title, search the web and prefer official or primary sources.
- If there is a direct .pdf link or the response has Content-Type: application/pdf, download it directly.
- Save final files under output/pdf/ unless the user gives another destination.
- Run the bundled script:
python3 ~/.codex/skills/pdf-downloader/scripts/download_pdf.py "$URL" --output-dir output/pdf
- Validate the result:
- Check file size and page count with pypdf.
- Render a thumbnail with qlmanage on macOS when available, or open/inspect another quick preview if available.
- Confirm the title/source URL appears in generated offline PDFs.
- For generated webpage PDFs, scan the first page visually when possible. Latin-heavy pages should use normal Latin word wrapping, not CJK character wrapping; CJK-heavy pages should still render with CJK-capable fonts.
- Reply with absolute clickable file paths and note whether the file was directly downloaded or generated from the webpage.
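The direct-download check in the workflow above could be sketched roughly as follows. This is a minimal illustration, not the logic of download_pdf.py: the helper names `is_direct_pdf` and `has_pdf_header` are hypothetical, and the sketch assumes the caller already has the URL, the Content-Type header value, and the first bytes of the response.

```python
from urllib.parse import urlparse


def is_direct_pdf(url: str, content_type: str = "") -> bool:
    """Return True if the URL path or the Content-Type header points at a PDF.

    Hypothetical helper for illustration; not taken from download_pdf.py.
    """
    # Compare only the path, so query strings such as ?download=1 are ignored.
    path = urlparse(url).path.lower()
    # Drop parameters such as "; charset=binary" before comparing the media type.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return path.endswith(".pdf") or media_type == "application/pdf"


def has_pdf_header(first_bytes: bytes) -> bool:
    """Return True if the payload starts with the %PDF- signature."""
    return first_bytes.startswith(b"%PDF-")
```

Checking the signature after the download is a cheap way to catch an HTML challenge page that was saved under a .pdf name, before the pypdf page-count check runs.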
Blocked Pages
Some sites block direct Python requests but are still readable through browser/search tooling. If the script fails with an HTTP 403, a challenge page, empty content, or missing article text:
- Use browsing/search to retrieve the article text from the official page.
- Save a temporary Markdown file with title, source URL, date/author if known, headings, paragraphs, bullets, code blocks, tables, and image captions.
- Convert it with:
python3 ~/.codex/skills/pdf-downloader/scripts/download_pdf.py --markdown tmp/article.md --source-url "$URL" --output-dir output/pdf --filename article-name.pdf
This fallback is an offline webpage capture, not an official PDF. Say that clearly.
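The fallback steps above could be prepared with a small helper like the one below. It is a sketch under assumptions: `build_markdown` and `fallback_command` are hypothetical names, and the metadata layout (a title heading followed by Source/Author/Date bullets) is only one plausible format, since the layout the script expects is not specified here. The command-line flags are the ones shown in the command above.

```python
from pathlib import Path

# Script location as given in the command above.
SCRIPT = Path("~/.codex/skills/pdf-downloader/scripts/download_pdf.py")


def build_markdown(title, source_url, body, author=None, date=None):
    """Assemble the temporary Markdown capture (the metadata layout is an assumption)."""
    header = [f"# {title}", "", f"- Source: {source_url}"]
    if author:
        header.append(f"- Author: {author}")
    if date:
        header.append(f"- Date: {date}")
    # A blank line separates the metadata from the article body.
    return "\n".join(header + ["", body.strip(), ""])


def fallback_command(markdown_path, source_url, filename, output_dir="output/pdf"):
    """Return the argument list for the Markdown fallback, ready for subprocess.run."""
    return [
        "python3", str(SCRIPT.expanduser()),
        "--markdown", str(markdown_path),
        "--source-url", source_url,
        "--output-dir", output_dir,
        "--filename", filename,
    ]
```

A caller would write the string from `build_markdown` to tmp/article.md as UTF-8 and then pass the list from `fallback_command` to `subprocess.run(..., check=True)`. Using an argument list rather than a shell string avoids quoting problems with URLs that contain `&` or `?`.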
Script Notes
- scripts/download_pdf.py writes a .pdf; for generated webpages it also writes a sibling .md.
- Use --filename name.pdf for stable names.
- Use --title "Title" when the source page lacks a clean title.
- The script chooses reader-friendly fonts automatically: English/WinAnsi pages prefer built-in Helvetica with normal LTR wrapping; mixed Unicode pages use local TTF fonts when needed; CJK-heavy pages prefer CJK-capable fonts and only then use CJK wrapping. If no local TTF is available, it falls back to built-in PDF fonts or STSong-Light for CJK.
- If a generated PDF looks cramped, regenerate it with the current script before trying manual fixes; older outputs may have used CJK wrapping for English pages.
- If the source page contains dynamic images or video, preserve captions/source links rather than trying to embed media unless the user asks for a visual archive.
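To make the Latin-versus-CJK wrapping distinction in the notes above concrete, here is one way such a decision could be made. This is an illustrative heuristic only, not the script's actual font or wrapping logic: the function names, the Unicode ranges chosen, and the 0.3 threshold are all assumptions.

```python
# Common CJK code point ranges (a deliberately small, non-exhaustive set).
CJK_RANGES = (
    (0x3040, 0x30FF),  # Hiragana and Katakana
    (0x3400, 0x4DBF),  # CJK Unified Ideographs Extension A
    (0x4E00, 0x9FFF),  # CJK Unified Ideographs
    (0xAC00, 0xD7AF),  # Hangul Syllables
    (0xF900, 0xFAFF),  # CJK Compatibility Ideographs
)


def is_cjk(char: str) -> bool:
    """Return True if the character falls in one of the ranges above."""
    code = ord(char)
    return any(low <= code <= high for low, high in CJK_RANGES)


def cjk_ratio(text: str) -> float:
    """Fraction of non-whitespace characters that are CJK."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(1 for c in chars if is_cjk(c)) / len(chars)


def choose_wrapping(text: str, threshold: float = 0.3) -> str:
    """Pick "cjk" wrapping only for CJK-heavy text; the threshold is hypothetical."""
    return "cjk" if cjk_ratio(text) >= threshold else "latin"
```

With a ratio-based rule like this, an English page that quotes a few CJK words keeps normal word wrapping, which is the behavior the notes describe for Latin-heavy pages.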