Web Crawler Skill
This skill provides web crawling capabilities to collect articles about artificial intelligence, digitalization, and informatization from Chinese government and news websites.
Supported Sources
Government Websites
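- 国家能源局, National Energy Administration (www.nea.gov.cn)
- 工业和信息化部, Ministry of Industry and Information Technology (www.miit.gov.cn)
- 国家互联网信息办公室, Cyberspace Administration of China (www.cac.gov.cn)
- 国家发展和改革委员会, National Development and Reform Commission (www.ndrc.gov.cn)
- 四川省经济和信息化厅, Sichuan Provincial Department of Economy and Information Technology (jxt.sc.gov.cn)
- 国家信息化专家咨询委员会, National Informatization Expert Advisory Committee (www.sic.gov.cn)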
News Websites
- 新华网, Xinhuanet (www.xinhuanet.com)
- 人民网, People's Daily Online (www.people.com.cn)
- 央视新闻, CCTV News (news.cctv.com)
- 观察者网, Guancha (www.guancha.cn)
- 澎湃新闻, The Paper (www.thepaper.cn)
Keywords Filter
Include Keywords
- 人工智能 (Artificial Intelligence)
- 数字化 (Digitalization)
- 信息化 (Informatization)
- 智能化 (Intelligentization)
- 智能技术 (Intelligent Technology)
- AI
- 算法 (Algorithm)
- 大数据 (Big Data)
- 云计算 (Cloud Computing)
- 数字经济 (Digital Economy)
- 数字转型 (Digital Transformation)
Exclude Keywords
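- 报表 (report forms), 采购 (procurement), 公示 (public notice), 公告 (announcement), 招标 (tender invitation), 投标 (bid), 结果 (results), 租赁 (leasing), 服务 (services)
- 通知 (notice), 办法 (measures), 目录 (catalog), 处罚 (penalty), 检查 (inspection), 认定 (accreditation), 标准 (standards)
- 设备 (equipment), 购买 (purchase), 购置 (acquisition), 询价 (price inquiry), 竞争性 (competitive), 谈判 (negotiation)
- 单一来源 (single source), 中标 (winning bid), 成交 (deal concluded), 合同 (contract)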
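The two keyword lists can be combined into a single relevance check. The sketch below is only an illustration: the names INCLUDE_KEYWORDS, EXCLUDE_KEYWORDS, and is_relevant are hypothetical, and it assumes that matching is a plain substring test on the article title and that an exclude keyword always overrides an include keyword. The actual logic in gov_crawler.py may differ.

```python
# Hypothetical names; the keyword values are copied from the lists above.
INCLUDE_KEYWORDS = [
    "人工智能", "数字化", "信息化", "智能化", "智能技术", "AI",
    "算法", "大数据", "云计算", "数字经济", "数字转型",
]
EXCLUDE_KEYWORDS = [
    "报表", "采购", "公示", "公告", "招标", "投标", "结果", "租赁", "服务",
    "通知", "办法", "目录", "处罚", "检查", "认定", "标准",
    "设备", "购买", "购置", "询价", "竞争性", "谈判",
    "单一来源", "中标", "成交", "合同",
]


def is_relevant(title: str) -> bool:
    """Return True if the title has an include keyword and no exclude keyword.

    Assumption: exclusion takes precedence over inclusion.
    """
    if any(word in title for word in EXCLUDE_KEYWORDS):
        return False
    return any(word in title for word in INCLUDE_KEYWORDS)
```

Under these assumptions, a procurement notice that mentions 人工智能 is still dropped, because 采购 and 公告 appear in the exclude list.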
Usage
Run the Crawler
python gov_crawler.py
Output Structure
- original_articles/ - Original article content
- article_links/ - Article metadata and links
- ai_summaries/ - AI-generated summaries
- crawled_links.json - Deduplication tracking
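As a rough illustration of how crawled_links.json could drive deduplication, the sketch below assumes the file holds a flat JSON array of URL strings. That layout is an assumption, as are the function names load_crawled, filter_new, and save_crawled; the real file written by gov_crawler.py may store more than bare links.

```python
import json
from pathlib import Path


def load_crawled(path: Path) -> set[str]:
    """Load previously crawled links; an absent file means nothing was crawled yet."""
    if not path.exists():
        return set()
    return set(json.loads(path.read_text(encoding="utf-8")))


def filter_new(links: list[str], crawled: set[str]) -> list[str]:
    """Return links not seen before, in order, and record them in `crawled`."""
    new_links = []
    for link in links:
        if link not in crawled:
            crawled.add(link)
            new_links.append(link)
    return new_links


def save_crawled(path: Path, crawled: set[str]) -> None:
    """Persist the link set as a sorted JSON array (assumed file layout)."""
    path.write_text(
        json.dumps(sorted(crawled), ensure_ascii=False, indent=2),
        encoding="utf-8",
    )
```

A run would load the set once, pass each batch of discovered links through filter_new, and save the set at the end so that the next run skips everything already collected.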
Features
- Automatic deduplication
- Publication date sorting
- Incremental file indexing
- AI content summarization
- Error handling and retry mechanisms
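The source does not describe how the retry mechanism is implemented. One common approach is retry with exponential backoff, sketched below using only the standard library. The helper name with_retry and its parameters are hypothetical and are not taken from gov_crawler.py.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retry(action: Callable[[], T], attempts: int = 3, base_delay: float = 1.0) -> T:
    """Call `action` until it succeeds or `attempts` calls have failed.

    The wait doubles after each failure: base_delay, 2 * base_delay, and so on.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(base_delay * 2 ** attempt)
```

With the requests dependency installed, a page fetch could be wrapped as with_retry(lambda: requests.get(url, timeout=10)). Catching a narrower exception type, such as requests.RequestException, would avoid retrying on programming errors.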
Customization
To modify keywords or add new sources, edit the gov_crawler.py file:
- Add new keywords: Update the keywords list in each crawler class
- Add new sources: Create a new crawler class inheriting from BaseCrawler
- Adjust filters: Modify the exclude_keywords list
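The customization steps above might look roughly like the following. The BaseCrawler shown here is a hypothetical stand-in written only so the example runs on its own; the real class in gov_crawler.py almost certainly has a different interface. ExampleSiteCrawler, its URL, and the matches method are likewise illustrative assumptions. Only the attribute names keywords and exclude_keywords come from the text.

```python
class BaseCrawler:
    """Hypothetical stand-in; the real BaseCrawler in gov_crawler.py may differ."""

    name = ""
    base_url = ""
    keywords: list[str] = []
    exclude_keywords: list[str] = []

    def matches(self, title: str) -> bool:
        """Assumed rule: one include keyword present, no exclude keyword present."""
        if any(word in title for word in self.exclude_keywords):
            return False
        return any(word in title for word in self.keywords)


class ExampleSiteCrawler(BaseCrawler):
    """A new source is added by subclassing and overriding the class attributes."""

    name = "Example Site"                 # placeholder source name
    base_url = "https://www.example.com"  # placeholder URL, not a real source
    keywords = ["人工智能", "大数据"]
    exclude_keywords = ["招标", "公告"]
```

Keeping the keyword lists as class attributes lets each source tune its own filters without touching the shared matching logic.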
Dependencies
requests
beautifulsoup4
lxml
Install with: pip install -r requirements.txt