Category: Data and Analysis. No API key required.

Web Video Transcribe DOCX

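Offline-first workflow for transcribing Chinese web video or audio into text and generating a Word document. Use when Codex needs to (1) extract a playable media stream.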

Author: c-narcissus · clawhub

Overview

Use the bundled scripts to extract media, download it, transcribe it offline, and render DOCX output.

Prefer the bundled deterministic scripts over hand-rolling new code.

Use {baseDir} when constructing file paths inside this skill so the instructions stay portable across agents and marketplaces.

Environment

  • Require Python 3 and local filesystem access.
  • Require network access for first-run model download and for fetching page or media URLs.
  • Require a local Chrome or Edge browser only when extracting media from a web page.

Quick Start

  1. Run python {baseDir}/scripts/bootstrap_env.py once in the target environment.
  2. For a generic web page URL, run python {baseDir}/scripts/pipeline_web_to_docx.py <url> --output-dir <dir>.
  3. For a direct media URL, run python {baseDir}/scripts/download_url.py <url> <output> and then python {baseDir}/scripts/transcribe_sensevoice.py --input <file> --output-txt <txt> --output-docx <docx>.
  4. For a local media file, run python {baseDir}/scripts/transcribe_sensevoice.py --input <file> --output-txt <txt> --output-docx <docx>.
  5. If the user asks for a polished reading version rather than a raw transcript, read references/cleanup-guidelines.md, produce a refined .txt, and then render it with python {baseDir}/scripts/transcript_to_docx.py.
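
The commands in steps 2 to 4 can also be assembled programmatically. The following is a minimal sketch, assuming the scripts accept exactly the arguments shown above; the helper names (`script_path`, `web_page_command`, `transcribe_command`, `direct_media_commands`) are hypothetical and not part of the skill, and `sys.executable` stands in for the `python` command.

```python
import sys
from pathlib import Path


def script_path(base_dir, name):
    """Resolve a bundled script under <base_dir>/scripts."""
    return str(Path(base_dir) / "scripts" / name)


def web_page_command(base_dir, url, output_dir):
    """Step 2: generic web page URL, end to end."""
    return [sys.executable, script_path(base_dir, "pipeline_web_to_docx.py"),
            url, "--output-dir", str(output_dir)]


def transcribe_command(base_dir, media_file, txt_path, docx_path):
    """Step 4: local media file to TXT and DOCX."""
    return [sys.executable, script_path(base_dir, "transcribe_sensevoice.py"),
            "--input", str(media_file),
            "--output-txt", str(txt_path),
            "--output-docx", str(docx_path)]


def direct_media_commands(base_dir, url, media_file, txt_path, docx_path):
    """Step 3: download a direct media URL, then transcribe the saved file."""
    download = [sys.executable, script_path(base_dir, "download_url.py"),
                url, str(media_file)]
    return [download,
            transcribe_command(base_dir, media_file, txt_path, docx_path)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)`. Building argument lists rather than shell strings avoids quoting problems with URLs that contain `&` or `?`.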

Example Requests

  • "Transcribe the Chinese audio from this web video and export it as a Word document."
  • "Turn this MP4 into a transcript, then reorganize it into chaptered reading notes."
  • "This page needs a Referer header for media download. Extract the media stream and convert it to DOCX."

Workflow

1. Classify the source

  • Generic page URL: Use python {baseDir}/scripts/pipeline_web_to_docx.py first. If the generic pipeline fails on a Toutiao page, python {baseDir}/scripts/pipeline_toutiao_to_docx.py and python {baseDir}/scripts/extract_toutiao_media.py remain available as site-specific fallbacks.
  • Direct media URL: Use python {baseDir}/scripts/download_url.py, then transcribe.
  • Local file: Transcribe directly.
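
The three source classes above can be told apart with a simple heuristic. This is a sketch only: the function name `classify_source` and the suffix list `MEDIA_SUFFIXES` are assumptions for illustration, and a URL without a recognizable media extension may still point at media, so treat the result as a first guess.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

# Hypothetical list of suffixes treated as "direct media"; extend as needed.
MEDIA_SUFFIXES = {".mp4", ".m4a", ".mp3", ".wav", ".webm", ".m3u8", ".mpd"}


def classify_source(source):
    """Return 'page_url', 'direct_media_url', or 'local_file'."""
    parsed = urlparse(source)
    if parsed.scheme in ("http", "https"):
        # Only the URL path is inspected, so query strings are ignored.
        suffix = PurePosixPath(parsed.path).suffix.lower()
        if suffix in MEDIA_SUFFIXES:
            return "direct_media_url"
        return "page_url"
    # Anything that is not an http(s) URL is treated as a local path.
    return "local_file"
```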

2. Preserve raw outputs

  • Keep the raw transcript as its own .txt.
  • If you produce a cleaned or chapterized version, save it as a separate file.
  • Do not overwrite the raw transcript unless the user explicitly asks.
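
One way to follow these rules is to derive the refined file's name from the raw transcript and never reuse an existing path. A minimal sketch, assuming a naming scheme of `<stem>.refined.txt`; the function name `refined_path` and the scheme itself are illustrative and not defined by the skill.

```python
from pathlib import Path


def refined_path(raw_txt, label="refined"):
    """Pick a sibling path for the cleaned transcript that does not exist yet."""
    raw = Path(raw_txt)
    candidate = raw.with_name(f"{raw.stem}.{label}{raw.suffix}")
    counter = 2
    # Never hand back a path that already exists, so nothing is overwritten.
    while candidate.exists():
        candidate = raw.with_name(f"{raw.stem}.{label}-{counter}{raw.suffix}")
        counter += 1
    return candidate
```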

3. Prefer the audio stream

  • If a page exposes a dedicated audio stream, prefer downloading that instead of the full video stream.
  • If the page only exposes a video stream, let ffmpeg decode audio during transcription.
  • If the page exposes HLS or DASH manifests, prefer downloading them through the bundled downloader or pipeline instead of a raw HTTP GET, which would fetch only the manifest file and not the media segments it lists.
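
The stream preference above can be sketched as a small selection function. The candidate format here (dicts with `url` and `kind` keys) is a hypothetical stand-in, since the extractor's real manifest schema is not shown in this document; only the `.m3u8` (HLS) and `.mpd` (DASH) suffixes are standard.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse


def is_manifest(url):
    """True for HLS (.m3u8) or DASH (.mpd) manifest URLs, judged by suffix."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower()
    return suffix in {".m3u8", ".mpd"}


def pick_stream(candidates):
    """Prefer a dedicated audio stream; otherwise fall back to the first entry.

    `candidates` is a list of dicts with hypothetical 'url' and 'kind' keys.
    Returns None when the list is empty.
    """
    if not candidates:
        return None
    audio = [c for c in candidates if c.get("kind") == "audio"]
    return (audio or candidates)[0]
```

If `is_manifest(chosen["url"])` is true, hand the URL to the bundled downloader rather than fetching it directly.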

4. Refine conservatively

  • Preserve meaning.
  • Fix obvious ASR mistakes, punctuation, paragraph breaks, headings, and chapter boundaries.
  • Do not invent quotes or historical claims that are not supported by the transcript.
  • If a passage is too noisy to restore confidently, keep it neutral instead of fabricating detail.

5. Stay within scope

  • Only download URLs that the user supplied directly or that the extractor captured from the target page.
  • Do not request, store, or exfiltrate cookies, access tokens, or account credentials.
  • Do not attempt to bypass DRM, login walls, or geo-restriction controls.
  • If a page requires authenticated browser state that is not already available, say so plainly and stop at the supported boundary.
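
The first two scope rules lend themselves to a mechanical check before any download. A minimal sketch under stated assumptions: the function names and the `SENSITIVE_HEADERS` set are hypothetical, and the bundled scripts may enforce scope differently.

```python
# Hypothetical set of header names that carry credentials.
SENSITIVE_HEADERS = {"cookie", "authorization", "proxy-authorization"}


def is_in_scope(url, user_supplied, extractor_captured):
    """Allow only URLs the user gave or the extractor captured from the page."""
    return url in set(user_supplied) or url in set(extractor_captured)


def safe_headers(headers):
    """Drop credential-bearing headers; keep others such as Referer."""
    return {name: value for name, value in headers.items()
            if name.lower() not in SENSITIVE_HEADERS}
```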

6. Render deliverables

  • Use python {baseDir}/scripts/transcript_to_docx.py for generic TXT-to-DOCX rendering.
  • Use the raw transcript for auditability and the refined transcript for reading quality.

Scripts

  • scripts/bootstrap_env.py: Install or verify the Python packages used by this skill.
  • scripts/extract_web_media.py: Open a generic web page in a real browser, capture likely media URLs plus common download headers, and emit a JSON manifest.
  • scripts/extract_toutiao_media.py: Open a Toutiao page in a real browser, capture audio/video URLs, and emit a JSON manifest with the same schema as the generic extractor.
  • scripts/download_url.py: Download a direct media URL to disk with a stable user agent, optional headers, and HLS/DASH handling.
  • scripts/transcribe_sensevoice.py: Download the SenseVoice model on demand, segment media, run offline ASR, and emit TXT and optional DOCX.
  • scripts/transcript_to_docx.py: Render timestamped transcripts or chapterized notes into a Word document.
  • scripts/pipeline_web_to_docx.py: Run the generic end-to-end pipeline: extract, download, transcribe, and render.
  • scripts/pipeline_toutiao_to_docx.py: Run the Toutiao-specialized end-to-end pipeline for cases where the generic extractor is not preferred.

Validation

  • Run python {baseDir}/scripts/bootstrap_env.py before first use in a fresh environment.
  • Validate the skill folder with skill-creator/scripts/quick_validate.py.
  • After changing the scripts, test --help and one representative happy path.
  • If extraction fails on a page, capture a direct media URL with browser tooling and continue with the downloader + transcriber.
  • Do not promise support for DRM-protected streams, authenticated cookies, or sites that only expose encrypted EME playback.
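
The `--help` check mentioned above can be run across all bundled scripts in one pass. This is a sketch, assuming every script accepts `--help` and exits with code 0 when it does; `help_commands` and `run_help_checks` are hypothetical helper names.

```python
import subprocess
import sys
from pathlib import Path

# Script names as listed in the Scripts section.
SCRIPTS = [
    "bootstrap_env.py",
    "extract_web_media.py",
    "extract_toutiao_media.py",
    "download_url.py",
    "transcribe_sensevoice.py",
    "transcript_to_docx.py",
    "pipeline_web_to_docx.py",
    "pipeline_toutiao_to_docx.py",
]


def help_commands(base_dir):
    """Build one '--help' invocation per bundled script."""
    return [[sys.executable, str(Path(base_dir) / "scripts" / name), "--help"]
            for name in SCRIPTS]


def run_help_checks(base_dir):
    """Run every '--help' command and return {script name: exit code}."""
    results = {}
    for name, cmd in zip(SCRIPTS, help_commands(base_dir)):
        completed = subprocess.run(cmd, capture_output=True, text=True)
        results[name] = completed.returncode
    return results
```

Any non-zero exit code in the result of `run_help_checks` points at a script worth inspecting before trying a full happy-path run.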