Category: Data & Analysis (no API key required)

tra-extract-text

Uses the trafilatura CLI to extract readable text, Markdown, HTML, JSON, or XML content from web pages, with support for metadata and output format settings.

Author: googhubclawhub

Extract text from web pages using the trafilatura command-line tool.

Installation

pip install trafilatura

Usage

Basic text extraction (Markdown)

trafilatura -u URL --markdown

Extract plain text (no formatting; this is the default output)

trafilatura -u URL

Output to file

trafilatura -u URL --markdown > output.md
trafilatura -u URL > output.txt
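
Because the shell creates the redirect target before the command runs, a failed extraction can leave an empty file behind. The following is a minimal sketch of one way to avoid that, not part of the skill itself: `save_markdown` is a hypothetical helper name, and treating empty output as a failure is an assumption about how failures should be handled.

```shell
# Hypothetical helper: extract a URL to a Markdown file, failing cleanly
# if trafilatura exits non-zero or produces no output.
save_markdown() {
    url=$1
    outfile=$2
    tmp=$(mktemp) || return 1
    # Write to a temporary file first so a failed run never leaves an
    # empty or partial file at the final path.
    if trafilatura -u "$url" --markdown > "$tmp" && [ -s "$tmp" ]; then
        mv "$tmp" "$outfile"
    else
        rm -f "$tmp"
        echo "extraction failed for $url" >&2
        return 1
    fi
}
```

Usage would look like `save_markdown "https://example.com/post" post.md`; the return status can then be checked by a calling script.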

CLI Options

| Option | Description |
|--------|-------------|
| -u, --URL | Target URL to download and process |
| --markdown | Output as Markdown |
| --output-format txt | Output as plain text (default) |
| --html | Output as HTML |
| --json | Output as JSON |
| --xml | Output as XML |
| -o, --output-dir | Write results to files in a directory instead of stdout |
| --with-metadata | Include metadata (title, author, date) |
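
When the format is chosen at run time, the single `--output-format` option can stand in for the individual format flags. The sketch below assumes a recent trafilatura release that accepts `--output-format` with the values shown; `extract_as` is a hypothetical wrapper name, and the list of accepted formats is deliberately limited to the ones this skill mentions.

```shell
# Hypothetical wrapper: validate the requested format name, then delegate
# to trafilatura using --output-format.
extract_as() {
    fmt=$1
    url=$2
    case "$fmt" in
        txt|markdown|html|json|xml) ;;
        *)
            echo "unsupported format: $fmt" >&2
            return 2
            ;;
    esac
    trafilatura -u "$url" --output-format "$fmt"
}
```

For example, `extract_as json "https://example.com/post"` would run the same extraction as the `--json` flag. Rejecting unknown names up front gives a clearer error than passing them through to the CLI.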

Examples

Extract a Medium article to Markdown:

trafilatura -u "https://medium.com/example/article" --markdown

Extract and save:

trafilatura -u "https://news.example.com/article" --markdown > article.md

Extract with metadata:

trafilatura -u "https://example.com/post" --markdown --with-metadata
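
If only a single metadata value is needed, JSON output is easier to post-process than Markdown. The following is a rough sketch under stated assumptions: `metadata_field` is a hypothetical helper, `python3` is assumed to be on the PATH, and the JSON is assumed to be a single object with keys such as `title`, `author`, and `date`.

```shell
# Hypothetical helper: print one metadata field from trafilatura's JSON output.
# Assumes the output is a single JSON object (an assumption about the schema).
metadata_field() {
    url=$1
    field=$2
    # The field name is passed to Python as sys.argv[1].
    trafilatura -u "$url" --json --with-metadata |
        python3 -c '
import json, sys
record = json.load(sys.stdin)
value = record.get(sys.argv[1])
print("" if value is None else value)
' "$field"
}
```

A call such as `metadata_field "https://example.com/post" title` would print the title, or an empty line when the field is missing or null.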