Category: Data & Analytics. No API key required.

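X Tweet Auto-Scraper

X Tweet Auto-Scraper: scrapes the tweets of a specified x.com user. By default it covers the previous day, 00:00 to 23:59 Beijing time; a custom time range is also supported. The tweets are translated into a Chinese-English side-by-side layout, pushed to a Feishu document as a daily report, and the link is sent to WeChat.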

Author: user_e8dcb574 (community)

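Scrapes an x.com user's tweets, generates a Chinese-English bilingual daily report, publishes it as a Feishu cloud document, and pushes the link to WeChat.

Full Workflow:

  1. Launch Chrome CDP → 2. Scrape tweets → 3. Translate and format → 4. Push to Feishu → 5. Send link to WeChat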

Two time-range modes are supported:

  • Default mode: the previous day, 00:00 to 23:59 Beijing time (one full day)
  • Custom mode: any start and end time, given via TIME_START / TIME_END (ISO 8601 format)
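
The default window can be computed with the standard library alone. The sketch below is one way to do it, assuming Beijing time is a fixed UTC+8 offset and that the window ends at exactly 23:59:00. `default_window` is a hypothetical helper for illustration, not a function from the bundled script, whose own calculation is not shown here.

```python
from datetime import datetime, timedelta, timezone

# Beijing time is UTC+8 all year round (no daylight saving time).
BEIJING = timezone(timedelta(hours=8))


def default_window(now=None):
    """Return (start, end) covering the previous calendar day in Beijing time.

    `now` must be a timezone-aware datetime; it defaults to the current time.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    now_bj = now.astimezone(BEIJING)
    today_midnight = now_bj.replace(hour=0, minute=0, second=0, microsecond=0)
    start = today_midnight - timedelta(days=1)        # previous day 00:00
    end = today_midnight - timedelta(minutes=1)       # previous day 23:59
    return start, end
```

Calling `default_window()` and formatting both values with `.isoformat()` yields strings in the same ISO 8601 shape that TIME_START and TIME_END accept.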

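Trigger Criteria

Use this skill when the user asks to:

  • Scrape an x.com user's tweets and push them to Feishu
  • Generate a "daily report" (日报) from x.com content
  • Fetch a user's tweets from 00:00 to 24:00 of the previous day (Beijing time)
  • Send the Feishu daily-report link to WeChat
  • "Scrape tweets from x.com and push to Feishu, send link to WeChat"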

Workflow

Phase 1 — Launch Chrome with CDP

The scraping script requires a logged-in Chrome instance with DevTools Protocol enabled. The user's normal Chrome cannot be reused directly (sandbox restrictions); a temporary profile must be created.

  1. Kill existing Chrome:

    pkill -9 -f "Google Chrome"
    
  2. Copy essential session files to a temp profile (do NOT copy the full profile — it is tens of GB and will hang):

    rm -rf /tmp/chrome-debug-profile
    mkdir -p /tmp/chrome-debug-profile/Default
    for f in "Cookies" "Cookies-journal" "Login Data" "Login Data-journal" \
             "Network" "Preferences" "Web Data" "Web Data-journal"; do
      src="$HOME/Library/Application Support/Google/Chrome/Default/$f"
      [ -e "$src" ] && cp -r "$src" /tmp/chrome-debug-profile/Default/ 2>/dev/null
    done
    # Also copy top-level files
    for f in "Local State" "Last Version"; do
      src="$HOME/Library/Application Support/Google/Chrome/$f"
      [ -e "$src" ] && cp "$src" /tmp/chrome-debug-profile/ 2>/dev/null
    done
    
  3. Launch Chrome with the temp profile and debugging port:

    nohup /Applications/Google\ Chrome.app/Contents/MacOS/Google\ Chrome \
      --remote-debugging-port=9222 \
      --user-data-dir=/tmp/chrome-debug-profile \
      --no-first-run \
      --no-default-browser-check \
      > /tmp/chrome-cdp.log 2>&1 &
    sleep 6
    
  4. Verify CDP is reachable:

    curl -s --connect-timeout 5 http://127.0.0.1:9222/json/version
    

    If this returns JSON with a Browser field, Chrome is ready.

All of the above MUST use dangerouslyDisableSandbox: true — macOS sandbox blocks process management of system Chrome otherwise.
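
The readiness check in step 4 can also be scripted, which is handy when `sleep 6` turns out to be too short. This is a minimal sketch using only the standard library; `is_cdp_ready` and `check_cdp` are hypothetical helper names, and the only thing assumed about the response is what step 4 states: it is JSON with a Browser field.

```python
import json
import urllib.request


def is_cdp_ready(body: str) -> bool:
    """True if `body` is a JSON object with a non-empty Browser field."""
    try:
        info = json.loads(body)
    except json.JSONDecodeError:
        return False
    return isinstance(info, dict) and bool(info.get("Browser"))


def check_cdp(base_url: str = "http://127.0.0.1:9222", timeout: float = 5.0) -> bool:
    """Fetch /json/version from the debugging port and report readiness."""
    try:
        with urllib.request.urlopen(f"{base_url}/json/version", timeout=timeout) as resp:
            return is_cdp_ready(resp.read().decode("utf-8"))
    except OSError:
        # Covers connection refused, timeouts and other URL errors.
        return False
```

A caller could retry `check_cdp()` a few times with a short pause before giving up and re-launching Chrome.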

Phase 2 — Scrape Tweets

Run the bundled scraping script.

Default time range (previous day, 00:00 to 23:59 Beijing time):

cd <workspace> && \
  TARGET_USERNAME="DeItaone" \
  OUTPUT_DIR="<workspace>" \
  NODE_OPTIONS="" \
  NODE_PATH=<workspace_node_modules> \
  <node_path> <skill_dir>/scripts/scrape_tweets.js

Custom time range:

cd <workspace> && \
  TARGET_USERNAME="DeItaone" \
  TIME_START="2026-05-24T09:00:00+08:00" \
  TIME_END="2026-05-26T09:00:00+08:00" \
  OUTPUT_DIR="<workspace>" \
  NODE_OPTIONS="" \
  NODE_PATH=<workspace_node_modules> \
  <node_path> <skill_dir>/scripts/scrape_tweets.js

Environment variables:

| Variable | Description | Default |
|---|---|---|
| TARGET_USERNAME | x.com username (without @) | DeItaone |
| TIME_START | Start of the time window (ISO 8601), e.g. 2026-05-24T09:00:00+08:00 | Computed automatically (previous day 00:00 Beijing time) |
| TIME_END | End of the time window (ISO 8601), e.g. 2026-05-26T09:00:00+08:00 | Computed automatically (previous day 23:59 Beijing time) |
| OUTPUT_DIR | Output directory for tweets_raw.json | Current working directory |
| CDP_URL | Chrome DevTools Protocol address | http://127.0.0.1:9222 |

Time Format:

  • Timezone offsets are supported: 2026-05-24T09:00:00+08:00 (Beijing time), 2026-05-24T01:00:00Z (UTC)
  • TIME_START and TIME_END must be provided together or omitted together
  • TIME_START must be earlier than TIME_END
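
The three rules above are easy to check before launching the scraper. The sketch below is an illustration only: `parse_window` is a hypothetical helper, and rejecting timestamps without an offset is an assumption of this sketch rather than documented behaviour of the bundled script.

```python
from datetime import datetime


def parse_window(time_start, time_end):
    """Validate a TIME_START / TIME_END pair.

    Returns (start, end) as aware datetimes, or None when both are absent
    (default mode). Raises ValueError when a rule is violated.
    """
    if time_start is None and time_end is None:
        return None  # default mode: the script computes the window itself
    if time_start is None or time_end is None:
        raise ValueError("TIME_START and TIME_END must be provided together")

    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit +00:00 offset first.
    start = datetime.fromisoformat(time_start.replace("Z", "+00:00"))
    end = datetime.fromisoformat(time_end.replace("Z", "+00:00"))

    # Assumption of this sketch: an explicit offset is required, so that the
    # two values can always be compared unambiguously.
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("timestamps need an explicit UTC offset")
    if start >= end:
        raise ValueError("TIME_START must be earlier than TIME_END")
    return start, end
```

In practice the arguments would come from `os.environ.get("TIME_START")` and `os.environ.get("TIME_END")`, which return None for unset variables.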

The script writes its results to tweets_raw.json in OUTPUT_DIR.

If the script reports "Not logged in", the temp profile did not retain the session. In that case, ask the user to log into x.com in the debug Chrome window and re-run.

Phase 3 — Translate & Format

  1. Read tweets_raw.json and translate each tweet into Chinese. Keep ticker tags ($NVDA, $TSLA) in the translation. Use financial-news terminology.

  2. Write translations as a JSON mapping file translations.json:

    {
      "ENGLISH PREFIX TEXT...": "中文翻译...",
      ...
    }
    

    Use the first 60–80 characters of each English tweet as the key.

  3. Run the formatting script:

    TRANSLATIONS_PATH=<workspace>/translations.json \
      python3 <skill_dir>/scripts/format_for_feishu.py \
        <workspace>/tweets_raw.json \
        <workspace>/report.md \
        --title "Title 日报" \
        --author @username
    
  4. The script produces a markdown file with the structure defined in references/feishu_format.md. It matches translations by longest prefix match and inserts placeholders for any unmatched tweets.

  5. Scan the generated markdown for (翻译待补充) ("translation pending") placeholders. If any remain, add the missing translations by editing the markdown file by hand.
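
Steps 2, 4 and 5 rely on three small operations: cutting a key from the start of a tweet, picking the longest key that prefixes a tweet, and finding leftover placeholders. The sketch below shows one plausible reading of them. The source of format_for_feishu.py is not reproduced here, so the function names, the 80-character default and the exact matching rule (a key matches when the tweet text starts with it) are assumptions.

```python
PLACEHOLDER = "(翻译待补充)"  # placeholder text named in step 5


def make_key(tweet_text, length=80):
    """Key for translations.json: the leading characters of the English tweet."""
    return tweet_text[:length]


def match_translation(tweet_text, translations):
    """Return the translation whose key is the longest prefix of tweet_text.

    Returns None when no key matches, in which case a placeholder is needed.
    """
    best_key = None
    for key in translations:
        if key and tweet_text.startswith(key):
            if best_key is None or len(key) > len(best_key):
                best_key = key
    return translations[best_key] if best_key is not None else None


def pending_lines(markdown):
    """1-based numbers of the report lines that still contain the placeholder."""
    return [
        number
        for number, line in enumerate(markdown.splitlines(), start=1)
        if PLACEHOLDER in line
    ]
```

Preferring the longest matching key matters when two tweets open with the same words: a short key such as "FED" would otherwise capture a tweet that has its own, more specific translation.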

Phase 4 — Push to Feishu

Use lark-cli to create a Feishu cloud document from the markdown:

LARK_CLI="<path-to-lark-cli>"
NODE_OPTIONS="" "$LARK_CLI" docs +create \
  --api-version v2 \
  --as user \
  --doc-format markdown \
  --content "@<path-to-report.md>" \
  --title "Title 日报"

  • Always use --api-version v2 and --doc-format markdown.
  • The @ prefix on --content signals a file path.
  • Run from the directory containing the markdown, or use an absolute path.
  • The command returns a Feishu document URL. Save this URL — it will be used in Phase 5 to send to WeChat. Also display it to the user.
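
When the push is driven from a script rather than typed by hand, the same command can be assembled as an argument list. The sketch below uses only the flags shown above. `build_create_command` and `first_url` are hypothetical helpers, and pulling the link out with a generic URL pattern is an assumption, since the exact shape of the lark-cli output is not documented here.

```python
import os
import re
import subprocess


def build_create_command(lark_cli, report_path, title):
    """Argument list for `lark-cli docs +create`, using an absolute file path."""
    absolute = os.path.abspath(report_path)
    return [
        lark_cli, "docs", "+create",
        "--api-version", "v2",
        "--as", "user",
        "--doc-format", "markdown",
        "--content", "@" + absolute,   # the @ prefix marks a file path
        "--title", title,
    ]


def first_url(text):
    """First https URL found in `text`, or None (assumed output format)."""
    match = re.search(r"https://\S+", text)
    return match.group(0) if match else None


def create_doc(lark_cli, report_path, title):
    """Run the command with NODE_OPTIONS cleared and return the document URL."""
    env = dict(os.environ, NODE_OPTIONS="")
    result = subprocess.run(
        build_create_command(lark_cli, report_path, title),
        env=env, capture_output=True, text=True, check=True,
    )
    return first_url(result.stdout)
```

Passing the arguments as a list avoids shell quoting problems with titles that contain spaces or non-ASCII characters.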

Phase 5 — Push Link to WeChat / 推送链接到微信

After the Feishu document is created, send the link to the user's WeChat through the WorkBuddy WeChat Mini Program using deliver_attachments.

  1. Create a simple summary file containing the Feishu link:

    cat > <workspace>/feishu_link.md << 'EOF'
    # 📊 推文日报已生成
    
    **飞书文档链接 / Feishu Doc Link:**
    <FEISHU_DOC_URL>
    
    **博主 / Author:** @<TARGET_USERNAME>
    **时间范围 / Time Range:** <time_range>
    **推文数量 / Tweet Count:** <count>
    
    ---
    点击上方链接查看完整日报 / Click the link above to view the full report.
    EOF
    
  2. Use deliver_attachments to push the summary to WeChat:

    deliver_attachments({
      attachments: ["<workspace>/feishu_link.md"],
      explanation: "推送飞书日报链接到微信小程序"
    })
    

Note: This requires the user to have the "产物回传到小程序" (Deliver Artifacts to Mini Program) toggle enabled in the WorkBuddy Mini Program connection settings.
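
The placeholders in the summary file from step 1 can be filled programmatically once the Feishu URL and tweet count are known. This is a small sketch that mirrors the heredoc template; `render_summary` is a hypothetical helper, and the caller is assumed to supply the count and a human-readable time range.

```python
SUMMARY_TEMPLATE = """\
# 📊 推文日报已生成

**飞书文档链接 / Feishu Doc Link:**
{doc_url}

**博主 / Author:** @{username}
**时间范围 / Time Range:** {time_range}
**推文数量 / Tweet Count:** {count}

---
点击上方链接查看完整日报 / Click the link above to view the full report.
"""


def render_summary(doc_url, username, time_range, count):
    """Fill the feishu_link.md template; a leading @ on the username is dropped."""
    return SUMMARY_TEMPLATE.format(
        doc_url=doc_url,
        username=username.lstrip("@"),
        time_range=time_range,
        count=count,
    )
```

The returned text can be written to `<workspace>/feishu_link.md` with UTF-8 encoding before calling deliver_attachments.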

Phase 6 — Cleanup

  • Close the debug Chrome: pkill -f "chrome-debug-profile"
  • The tweets_raw.json, translations.json, report.md, and feishu_link.md files in the workspace are intermediate artifacts. Keep them for traceability.

Key Pitfalls

| Problem | Cause | Fix |
|---------|-------|-----|
| CDP connection refused | Chrome not running with --remote-debugging-port | Re-launch Chrome per Phase 1 |
| "Not logged in" in script output | Temp profile missing cookies | Ask user to log in via the debug Chrome window |
| Full profile copy hangs | Chrome profile is 10–50 GB | Only copy the files listed in Phase 1 step 2 |
| xcancel.com pagination blocked | Anti-bot verification | Never use xcancel/Nitter; always use x.com via CDP |
| lark-cli auth expired | Token TTL | Re-run; the CLI auto-refreshes |

Bundle Contents

  • scripts/scrape_tweets.js — Playwright CDP scraper for x.com timelines
  • scripts/format_for_feishu.py — Generates bilingual markdown from raw tweets
  • references/feishu_format.md — Document structure and Feishu CLI reference