Web Crawler Skill

This skill crawls Chinese government and news websites and collects articles about artificial intelligence, digitalization, and informatization.

Supported Sources

Government Websites

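国家能源局, National Energy Administration (www.nea.gov.cn)
工业和信息化部, Ministry of Industry and Information Technology (www.miit.gov.cn)
国家互联网信息办公室, Cyberspace Administration of China (www.cac.gov.cn)
国家发展和改革委员会, National Development and Reform Commission (www.ndrc.gov.cn)
四川省经济和信息化厅, Sichuan Provincial Department of Economy and Information Technology (jxt.sc.gov.cn)
国家信息化专家咨询委员会, Advisory Committee for State Informatization (www.sic.gov.cn)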

News Websites

新华网, Xinhuanet (www.xinhuanet.com)
人民网, People's Daily Online (www.people.com.cn)
央视新闻, CCTV News (news.cctv.com)
观察者网, Guancha (www.guancha.cn)
澎湃新闻, The Paper (www.thepaper.cn)

Keyword Filter

Include Keywords

人工智能 (Artificial Intelligence)
数字化 (Digitalization)
信息化 (Informatization)
智能化 (Intelligentization)
智能技术 (Intelligent Technology)
AI
算法 (Algorithm)
大数据 (Big Data)
云计算 (Cloud Computing)
数字经济 (Digital Economy)
数字转型 (Digital Transformation)

Exclude Keywords

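报表 (Report), 采购 (Procurement), 公示 (Public Notice), 公告 (Announcement), 招标 (Tender), 投标 (Bid), 结果 (Result), 租赁 (Lease), 服务 (Service)
通知 (Notice), 办法 (Measures), 目录 (Catalogue), 处罚 (Penalty), 检查 (Inspection), 认定 (Accreditation), 标准 (Standard)
设备 (Equipment), 购买 (Purchase), 购置 (Acquisition), 询价 (Price Inquiry), 竞争性 (Competitive), 谈判 (Negotiation)
单一来源 (Single Source), 中标 (Winning Bid), 成交 (Deal Concluded), 合同 (Contract)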
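
The two lists above can be combined into one check. The following is a minimal sketch, not the code in gov_crawler.py: the names INCLUDE_KEYWORDS, EXCLUDE_KEYWORDS, and is_relevant are hypothetical, and it assumes the filter runs on article titles and that an exclude keyword always overrides an include keyword.

```python
# Hypothetical names; the real lists live in gov_crawler.py and may be organized differently.
INCLUDE_KEYWORDS = [
    "人工智能", "数字化", "信息化", "智能化", "智能技术", "AI",
    "算法", "大数据", "云计算", "数字经济", "数字转型",
]
EXCLUDE_KEYWORDS = [
    "报表", "采购", "公示", "公告", "招标", "投标", "结果", "租赁", "服务",
    "通知", "办法", "目录", "处罚", "检查", "认定", "标准",
    "设备", "购买", "购置", "询价", "竞争性", "谈判",
    "单一来源", "中标", "成交", "合同",
]


def is_relevant(title: str) -> bool:
    """Return True if the title has an include keyword and no exclude keyword."""
    # Assumption: exclusion takes priority over inclusion.
    if any(word in title for word in EXCLUDE_KEYWORDS):
        return False
    return any(word in title for word in INCLUDE_KEYWORDS)


print(is_relevant("人工智能赋能制造业"))      # include keyword only
print(is_relevant("人工智能设备采购公告"))    # include keyword plus exclude keywords
```

Note that plain substring matching is case-sensitive, so "AI" would also match inside unrelated uppercase strings; a stricter match may be needed for that keyword.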

Usage

Run the Crawler

python gov_crawler.py

Output Structure

original_articles/ - Original article content
article_links/ - Article metadata and links
ai_summaries/ - AI-generated summaries
crawled_links.json - Deduplication tracking
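
As an illustration of how crawled_links.json could support deduplication, here is a minimal sketch. It assumes the file holds a flat JSON list of URLs, which may not match the actual schema; load_seen, filter_new, and save_seen are hypothetical helper names.

```python
import json
from pathlib import Path


def load_seen(path: Path) -> set[str]:
    """Load previously crawled links; an absent file means nothing was crawled yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def filter_new(links: list[str], seen: set[str]) -> list[str]:
    """Return links not seen before, in order, and record them in `seen`."""
    fresh = []
    for link in links:
        if link not in seen:
            seen.add(link)
            fresh.append(link)
    return fresh


def save_seen(path: Path, seen: set[str]) -> None:
    """Write the link set back as a sorted JSON list (assumed file layout)."""
    path.write_text(
        json.dumps(sorted(seen), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

A run would call load_seen at startup, pass each batch of discovered links through filter_new, and call save_seen before exiting, so that later runs skip what was already collected.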

Features

Automatic deduplication
Publication date sorting
Incremental file indexing
AI content summarization
Error handling and retry mechanisms
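
The retry mechanism is not described in detail here. One common approach is retrying with exponential backoff, sketched below under that assumption. The helper name with_retries and its parameters are hypothetical and are not taken from gov_crawler.py.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `action`, retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            # Out of attempts: let the last error propagate to the caller.
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, then 4x, ... before the next try.
            sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

In a crawler the action would typically wrap the HTTP call, for example `with_retries(lambda: requests.get(url, timeout=10))`. Catching a narrower exception type than Exception is usually preferable in real code.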

Customization

To modify keywords or add new sources, edit the gov_crawler.py file:

Add new keywords: Update the keywords list in each crawler class
Add new sources: Create a new crawler class inheriting from BaseCrawler
Adjust filters: Modify the exclude_keywords list
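
The shape of a new source class might look like the sketch below. The BaseCrawler shown here is only a stand-in so the example runs on its own; the real class in gov_crawler.py may define different attributes and methods. ExampleSiteCrawler, its URL, and the matches method are hypothetical.

```python
class BaseCrawler:
    """Stand-in for the real BaseCrawler; its actual interface may differ."""

    name = ""
    base_url = ""
    keywords: list[str] = []
    exclude_keywords: list[str] = []

    def matches(self, title: str) -> bool:
        """Accept a title with an include keyword and no exclude keyword."""
        if any(word in title for word in self.exclude_keywords):
            return False
        return any(word in title for word in self.keywords)


class ExampleSiteCrawler(BaseCrawler):
    """Hypothetical new source; replace the name and URL with the real site."""

    name = "Example Site"
    base_url = "https://www.example.com"
    keywords = ["人工智能", "大数据"]      # per-class include keywords
    exclude_keywords = ["招标", "采购"]    # per-class exclude keywords


crawler = ExampleSiteCrawler()
print(crawler.matches("大数据产业发展报告"))
print(crawler.matches("大数据平台采购项目"))
```

Keeping the keyword lists as class attributes lets each source override them without touching the shared filtering logic.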

Dependencies

requests
beautifulsoup4
lxml

Install with: pip install -r requirements.txt