Category: Other. No API key required.

Article Retrieval Crawler (文章获取爬虫)

Crawls Chinese government and news websites for articles on AI, digitalization, and informatization. Invoke when the user needs to collect articles about artificial intelligence, digital transformation, or technology policy from Chinese official sources.

Author: user_6ed3129a (community)

Web Crawler Skill

This skill provides web crawling capabilities to collect articles about artificial intelligence, digitalization, and informatization from Chinese government and news websites.

Supported Sources

Government Websites

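  • 国家能源局, National Energy Administration (www.nea.gov.cn)
  • 工业和信息化部, Ministry of Industry and Information Technology (www.miit.gov.cn)
  • 国家互联网信息办公室, Cyberspace Administration of China (www.cac.gov.cn)
  • 国家发展和改革委员会, National Development and Reform Commission (www.ndrc.gov.cn)
  • 四川省经济和信息化厅, Sichuan Provincial Department of Economy and Information Technology (jxt.sc.gov.cn)
  • 国家信息化专家咨询委员会, Advisory Committee for State Informatization (www.sic.gov.cn)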

News Websites

  • 新华网, Xinhuanet (www.xinhuanet.com)
  • 人民网, People's Daily Online (www.people.com.cn)
  • 央视新闻, CCTV News (news.cctv.com)
  • 观察者网, Guancha (www.guancha.cn)
  • 澎湃新闻, The Paper (www.thepaper.cn)

Keywords Filter

Include Keywords

  • 人工智能 (Artificial Intelligence)
  • 数字化 (Digitalization)
  • 信息化 (Informatization)
  • 智能化 (Intelligentization)
  • 智能技术 (Intelligent Technology)
  • AI
  • 算法 (Algorithm)
  • 大数据 (Big Data)
  • 云计算 (Cloud Computing)
  • 数字经济 (Digital Economy)
  • 数字转型 (Digital Transformation)

Exclude Keywords

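  • 报表 (reports), 采购 (procurement), 公示 (public notice), 公告 (announcement), 招标 (tender), 投标 (bid), 结果 (results), 租赁 (leasing), 服务 (services)
  • 通知 (notice), 办法 (measures), 目录 (catalog), 处罚 (penalty), 检查 (inspection), 认定 (accreditation), 标准 (standard)
  • 设备 (equipment), 购买 (purchase), 购置 (acquisition), 询价 (price inquiry), 竞争性 (competitive), 谈判 (negotiation)
  • 单一来源 (single source), 中标 (winning bid), 成交 (deal concluded), 合同 (contract)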
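
The two keyword lists above can be combined into a single relevance check. The following is a minimal sketch, not the code in gov_crawler.py: the function name is_relevant, matching against the title only, and letting exclude keywords take precedence are all assumptions made for illustration.

```python
# Hypothetical sketch: the real gov_crawler.py may apply its filters differently.
INCLUDE_KEYWORDS = [
    "人工智能", "数字化", "信息化", "智能化", "智能技术", "AI",
    "算法", "大数据", "云计算", "数字经济", "数字转型",
]
EXCLUDE_KEYWORDS = [
    "报表", "采购", "公示", "公告", "招标", "投标", "结果", "租赁", "服务",
    "通知", "办法", "目录", "处罚", "检查", "认定", "标准",
    "设备", "购买", "购置", "询价", "竞争性", "谈判",
    "单一来源", "中标", "成交", "合同",
]


def is_relevant(title: str) -> bool:
    """Keep a title with at least one include keyword and no exclude keyword."""
    # Assumption: an exclude keyword always wins over an include keyword.
    if any(word in title for word in EXCLUDE_KEYWORDS):
        return False
    return any(word in title for word in INCLUDE_KEYWORDS)


if __name__ == "__main__":
    for sample in ["人工智能赋能制造业", "人工智能设备采购公告"]:
        print(sample, is_relevant(sample))
```

Substring matching is case-sensitive here, so a lowercase "ai" would not match the "AI" keyword; normalizing case before the check is one way to loosen that.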

Usage

Run the Crawler

python gov_crawler.py

Output Structure

  • original_articles/ - Original article content
  • article_links/ - Article metadata and links
  • ai_summaries/ - AI-generated summaries
  • crawled_links.json - Deduplication tracking
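
Deduplication through crawled_links.json could work roughly as follows. This is a sketch under stated assumptions: the file is assumed to hold a flat JSON list of URLs, and the helper names load_crawled, save_crawled, and keep_new are hypothetical. The actual file layout used by gov_crawler.py is not documented here and may differ.

```python
import json
from pathlib import Path


def load_crawled(path: Path) -> set[str]:
    """Read previously crawled URLs; a missing file means nothing was crawled yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def save_crawled(path: Path, links: set[str]) -> None:
    """Write the URL set back as a sorted JSON list (assumed format)."""
    text = json.dumps(sorted(links), ensure_ascii=False, indent=2)
    path.write_text(text, encoding="utf-8")


def keep_new(urls: list[str], seen: set[str]) -> list[str]:
    """Return URLs not seen before, in order, and record them in `seen`."""
    fresh = []
    for url in urls:
        if url not in seen:
            seen.add(url)
            fresh.append(url)
    return fresh
```

A run would call load_crawled once at startup, pass each batch of discovered links through keep_new, and call save_crawled before exiting so the next run skips what was already fetched.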

Features

  • Automatic deduplication
  • Publication date sorting
  • Incremental file indexing
  • AI content summarization
  • Error handling and retry mechanisms
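
The retry mechanism listed above is not described in detail, so the following is only one plausible shape for it: a small helper that retries a failing call with exponential backoff. The name with_retry and its parameters are hypothetical, and the choice to retry on OSError is an assumption (requests.RequestException derives from it).

```python
import time


def with_retry(action, retries=3, base_delay=1.0, retry_on=(OSError,), sleep=time.sleep):
    """Call `action()` up to `retries` times, doubling the wait after each failure."""
    for attempt in range(retries):
        try:
            return action()
        except retry_on:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == retries - 1:
                raise
            sleep(base_delay * 2 ** attempt)


# Possible usage with requests (not executed here):
#   html = with_retry(lambda: requests.get(url, timeout=10)).text
```

Passing sleep as a parameter keeps the helper easy to exercise without real delays.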

Customization

To modify keywords or add new sources, edit the gov_crawler.py file:

  1. Add new keywords: Update the keywords list in each crawler class
  2. Add new sources: Create a new crawler class inheriting from BaseCrawler
  3. Adjust filters: Modify the exclude_keywords list
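
Step 2 could look something like the sketch below. The BaseCrawler defined here is a stand-in written for illustration only; the real class in gov_crawler.py may expose different attributes and methods. The subclass name, the select method, and the example.com URL are all placeholders.

```python
from urllib.parse import urljoin


class BaseCrawler:
    """Stand-in for the BaseCrawler in gov_crawler.py; the real interface may differ."""

    name = "base"
    base_url = ""
    keywords: list[str] = []
    exclude_keywords: list[str] = []

    def select(self, items):
        """Return (title, absolute_url) pairs whose titles pass the keyword filters."""
        selected = []
        for title, href in items:
            if any(word in title for word in self.exclude_keywords):
                continue
            if any(word in title for word in self.keywords):
                selected.append((title, urljoin(self.base_url, href)))
        return selected


class ExampleSiteCrawler(BaseCrawler):
    """Hypothetical new source: only the class attributes need to change."""

    name = "Example Site"
    base_url = "https://example.com/"  # placeholder, not a real source
    keywords = ["人工智能", "数字经济"]
    exclude_keywords = ["招标", "采购"]
```

Keeping the keyword lists as class attributes mirrors steps 1 and 3: each source can carry its own include and exclude lists without touching the shared filtering logic.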

Dependencies

requests
beautifulsoup4
lxml

Install with: pip install -r requirements.txt