WeChat Article Extractor

Extract metadata and content from WeChat Official Account (微信公众号) articles.

Capabilities

Parse WeChat article URLs (mp.weixin.qq.com)
Extract article metadata: title, author, description, publish time
Extract account info: name, avatar, alias, description
Get article content (HTML)
Get cover image URL
Support multiple article types: post, video, image, voice, text, repost
Handle various error cases: deleted content, expired links, rate limits

Usage

Basic Extraction from URL

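const { extract } = require('./scripts/extract.js');

// `await` needs an async function in a CommonJS module.
// Later snippets assume they run inside one like this.
(async () => {
  const result = await extract('https://mp.weixin.qq.com/s?__biz=...');
  // Returns: { done: true, code: 0, data: {...} }
})();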

Extraction from HTML

// sourceUrl is the article's original mp.weixin.qq.com URL.
const html = await fetch(sourceUrl).then(r => r.text());
const result = await extract(html, { url: sourceUrl });

Options

const result = await extract(url, {
  shouldReturnContent: true,      // Return HTML content (default: true)
  shouldReturnRawMeta: false,     // Return raw metadata (default: false)
  shouldFollowTransferLink: true, // Follow migrated account links (default: true)
  shouldExtractMpLinks: false,    // Extract embedded mp.weixin links (default: false)
  shouldExtractTags: false,       // Extract article tags (default: false)
  shouldExtractRepostMeta: false  // Extract repost source info (default: false)
});
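
The documented defaults can be kept in one object and merged with per-call overrides. The following is a minimal sketch: `DEFAULT_OPTIONS` and `buildOptions` are hypothetical helper names that are not part of the extractor, and the default values are simply copied from the comments above.

```javascript
// Hypothetical helper, not part of scripts/extract.js.
// Values mirror the defaults documented in the Options snippet.
const DEFAULT_OPTIONS = Object.freeze({
  shouldReturnContent: true,
  shouldReturnRawMeta: false,
  shouldFollowTransferLink: true,
  shouldExtractMpLinks: false,
  shouldExtractTags: false,
  shouldExtractRepostMeta: false,
});

// Merge caller overrides onto the defaults; later keys win.
function buildOptions(overrides = {}) {
  return { ...DEFAULT_OPTIONS, ...overrides };
}

// Metadata-only run: skip the HTML body but collect tags.
const metadataOnly = buildOptions({
  shouldReturnContent: false,
  shouldExtractTags: true,
});
console.log(metadataOnly);
```

The resulting object can then be passed as the second argument, for example `extract(url, metadataOnly)`.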

Response Format

Success Response

{
  done: true,
  code: 0,
  data: {
    // Account info
    account_name: "Account name",
    account_alias: "WeChat ID",
    account_avatar: "Avatar URL",
    account_description: "Account description",
    account_id: "Original ID",
    account_biz: "biz parameter",
    account_biz_number: 1234567890,
    account_qr_code: "QR code URL",

    // Article info
    msg_title: "Article title",
    msg_desc: "Article summary",
    msg_content: "HTML content",
    msg_cover: "Cover image URL",
    msg_author: "Author",
    msg_type: "post", // post|video|image|voice|text|repost
    msg_has_copyright: true,
    msg_publish_time: Date, // a JavaScript Date object
    msg_publish_time_str: "2024/01/15 10:30:00",

    // Link params
    msg_link: "Article link",
    msg_source_url: "'Read original' link",
    msg_sn: "sn parameter",
    msg_mid: 1234567890,
    msg_idx: 1
  }
}
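
The link fields above appear to correspond to query-string parameters on long-form article URLs. The sketch below uses Node's built-in `URL` class to pull them out. It rests on two assumptions that the source does not confirm: that the parameters are named `__biz`, `mid`, `idx`, and `sn`, and that `__biz` is the Base64 encoding of the numeric value reported as `account_biz_number`. `parseArticleLink` is a hypothetical helper, and the example URL is made up.

```javascript
// Hypothetical helper: splits a long-form article URL into link fields.
// Assumption: the query parameters are named __biz, mid, idx and sn.
function parseArticleLink(link) {
  const params = new URL(link).searchParams;
  const biz = params.get('__biz');
  // Assumption: __biz is the Base64 encoding of the numeric account ID.
  const bizNumber = biz
    ? Number(Buffer.from(biz, 'base64').toString('utf8'))
    : null;
  return {
    account_biz: biz,
    account_biz_number: Number.isFinite(bizNumber) ? bizNumber : null,
    msg_mid: params.has('mid') ? Number(params.get('mid')) : null,
    msg_idx: params.has('idx') ? Number(params.get('idx')) : null,
    msg_sn: params.get('sn'),
  };
}

// Made-up example values, for illustration only.
const parsed = parseArticleLink(
  'https://mp.weixin.qq.com/s?__biz=MTIzNDU2Nzg5MA==&mid=1234567890&idx=1&sn=abc123'
);
console.log(parsed);
```

Short links of the form `mp.weixin.qq.com/s/<id>` carry no query string, so every field would come back `null` for them under this sketch.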

Error Response

{
  done: false,
  code: 1001,
  msg: "无法获取文章信息" // "Unable to get article info"
}
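
Because both response shapes carry a `done` flag, a caller can branch on it before touching `data`. The following is one possible sketch: `unwrap` is a hypothetical helper, not part of the extractor, that turns an error response into a thrown `Error` carrying the numeric code.

```javascript
// Hypothetical helper: returns result.data on success, otherwise throws an
// Error that carries the numeric code from the error response.
function unwrap(result) {
  if (result && result.done === true) {
    return result.data;
  }
  const error = new Error((result && result.msg) || 'Unknown extraction error');
  error.code = result ? result.code : undefined;
  throw error;
}

// Sample objects shaped like the two responses documented above.
const ok = { done: true, code: 0, data: { msg_title: 'Example title' } };
const failed = { done: false, code: 1001, msg: '无法获取文章信息' };

console.log(unwrap(ok).msg_title); // prints "Example title"

try {
  unwrap(failed);
} catch (err) {
  console.log(err.code); // prints 1001
}
```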

Error Codes

| Code | Message | Description |
|------|---------|-------------|
| 1000 | 文章获取失败 | General failure |
| 1001 | 无法获取文章信息 | Missing title or publish time |
| 1002 | 请求失败 | HTTP request failed |
| 1003 | 响应为空 | Empty response |
| 1004 | 访问过于频繁 | Rate limited |
| 1005 | 脚本解析失败 | Script parsing error |
| 1006 | 公众号已迁移 | Account migrated |
| 2001 | 请提供文章内容或链接 | Missing input |
| 2002 | 链接已过期 | Link expired |
| 2003 | 内容涉嫌侵权 | Content removed (copyright) |
| 2004 | 无法获取迁移后的链接 | Migration link failed |
| 2005 | 内容已被发布者删除 | Content deleted by author |
| 2006 | 内容因违规无法查看 | Content blocked |
| 2007 | 内容发送失败 | Failed to send |
| 2008 | 系统出错 | System error |
| 2009 | 不支持的链接 | Unsupported URL |
| 2010 | 内容获取失败 | Content fetch failed |
| 2011 | 涉嫌过度营销 | Marketing/spam content |
| 2012 | 账号已被屏蔽 | Account blocked |
| 2013 | 账号已自主注销 | Account deleted |
| 2014 | 内容被投诉 | Content reported |
| 2015 | 账号处于迁移流程中 | Account migrating |
| 2016 | 冒名侵权 | Impersonation |
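
Some of the codes in the table describe conditions that may clear up on their own, while most describe permanent states. The sketch below retries only the former, with exponential backoff. The choice of 1002, 1003, and 1004 as retryable is an assumption drawn from their descriptions, not something the extractor defines, and `RETRYABLE_CODES`, `sleep`, and `extractWithRetry` are hypothetical names.

```javascript
// Assumption: these codes describe transient conditions worth retrying.
// The grouping is a guess based on the table, not defined by the extractor.
const RETRYABLE_CODES = new Set([1002, 1003, 1004]);

const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call extractFn(input) up to maxAttempts times, doubling the delay after
// each retryable failure. Any other result is returned immediately.
async function extractWithRetry(
  extractFn,
  input,
  { maxAttempts = 3, baseDelayMs = 1000 } = {}
) {
  let result;
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    result = await extractFn(input);
    if (result.done || !RETRYABLE_CODES.has(result.code)) {
      return result;
    }
    if (attempt < maxAttempts) {
      await sleep(baseDelayMs * 2 ** (attempt - 1));
    }
  }
  // Still failing after the last attempt: hand back the final error response.
  return result;
}
```

Inside an async function this could be called as `await extractWithRetry(extract, url)`. Passing the extractor in as an argument keeps the retry logic testable with a stub.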

Dependencies

Required npm packages:

cheerio - HTML parsing
dayjs - Date formatting
request-promise - HTTP requests
qs - Query string parsing
lodash.unescape - HTML entity unescaping

Notes

Handles various WeChat page structures and anti-scraping measures
Automatically detects article type from page content
Supports extracting from Sogou WeChat search results (weixin.sogou.com)
Some fields may be null depending on article type and page structure
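
Since fields may be null, code that consumes `data` is safer with explicit fallbacks. A minimal sketch follows, assuming the field names from the success response. `summarize` is a hypothetical helper, and falling back from `msg_author` to `account_name` is an illustrative choice rather than documented behavior.

```javascript
// Hypothetical helper: builds a one-line summary from a `data` object,
// substituting placeholders for fields that are null or undefined.
function summarize(data) {
  const title = data.msg_title ?? '(untitled)';
  const author = data.msg_author ?? data.account_name ?? '(unknown author)';
  const published = data.msg_publish_time_str ?? '(unknown date)';
  return `${title} | ${author} | ${published}`;
}

// A partial record with no author field.
const line = summarize({
  msg_title: 'Example title',
  msg_author: null,
  account_name: 'Example account',
  msg_publish_time_str: '2024/01/15 10:30:00',
});
console.log(line); // prints "Example title | Example account | 2024/01/15 10:30:00"
```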