Category: Development & Engineering · No API key required

Sitemap Generator

Generates an XML sitemap by crawling a website. Useful for creating a sitemap.xml for SEO, auditing site structure, and discovering all pages under a domain.

Author: johnnywang2001 (clawhub)

Crawl any website and produce a standards-compliant XML sitemap ready for search engine submission.

Quick Start

python3 scripts/sitemap_gen.py https://example.com

Output: sitemap.xml in the current directory.
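
For orientation, the following is a minimal sketch of how a sitemaps.org 0.9 `urlset` document can be built with the Python standard library. It is not taken from `scripts/sitemap_gen.py`; the function name `build_sitemap` and its parameters are hypothetical, and the actual script may structure its output differently (for example, by adding `lastmod`).

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls, changefreq="weekly", priority=0.5):
    """Return sitemap XML as UTF-8 bytes (hypothetical helper, not the script's API)."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    # De-duplicate and sort so the output is stable between runs.
    for loc in sorted(set(urls)):
        url = ET.SubElement(urlset, "url")
        # ElementTree escapes characters such as "&" in text nodes.
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "changefreq").text = changefreq
        ET.SubElement(url, "priority").text = f"{priority:.1f}"
    return ET.tostring(urlset, encoding="utf-8", xml_declaration=True)


xml_bytes = build_sitemap(
    ["https://example.com/", "https://example.com/a?x=1&y=2"],
    changefreq="daily",
    priority=0.8,
)
print(xml_bytes.decode("utf-8"))
```

The `xml_declaration` argument to `ET.tostring` requires Python 3.8 or later.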

Commands

# Basic — crawl and write sitemap.xml
python3 scripts/sitemap_gen.py https://example.com

# Custom output path
python3 scripts/sitemap_gen.py https://example.com -o /tmp/sitemap.xml

# Limit crawl scope
python3 scripts/sitemap_gen.py https://example.com --max-pages 500 --max-depth 3

# Polite crawling with delay
python3 scripts/sitemap_gen.py https://example.com --delay 1.0

# Set SEO hints
python3 scripts/sitemap_gen.py https://example.com --changefreq daily --priority 0.8

# Verbose progress
python3 scripts/sitemap_gen.py https://example.com -v

# Pipe to stdout
python3 scripts/sitemap_gen.py https://example.com -o -

Options

| Flag | Default | Description |
|------|---------|-------------|
| --output, -o | sitemap.xml | Output file path (use - for stdout) |
| --max-pages | 200 | Maximum pages to crawl |
| --max-depth | 5 | Maximum link depth from start URL |
| --delay | 0.2 | Seconds between requests |
| --timeout | 10 | Request timeout in seconds |
| --changefreq | weekly | Sitemap changefreq hint |
| --priority | 0.5 | Sitemap priority hint (0.0–1.0) |
| --verbose, -v | off | Print crawl progress to stderr |
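
The flags and defaults in the table could be expressed with `argparse` roughly as follows. This is an illustrative sketch only: `build_parser` is a hypothetical name, the script's real argument handling is not shown in this document, and restricting `--changefreq` to the values listed by the sitemaps.org protocol is an assumption about validation the script may not perform.

```python
import argparse

# changefreq values defined by the sitemaps.org protocol.
CHANGEFREQ_VALUES = ["always", "hourly", "daily", "weekly", "monthly", "yearly", "never"]


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(
        prog="sitemap_gen.py",
        description="Crawl a website and write an XML sitemap.",
    )
    parser.add_argument("url", help="Start URL to crawl")
    parser.add_argument("-o", "--output", default="sitemap.xml",
                        help="Output file path (use - for stdout)")
    parser.add_argument("--max-pages", type=int, default=200,
                        help="Maximum pages to crawl")
    parser.add_argument("--max-depth", type=int, default=5,
                        help="Maximum link depth from start URL")
    parser.add_argument("--delay", type=float, default=0.2,
                        help="Seconds between requests")
    parser.add_argument("--timeout", type=float, default=10,
                        help="Request timeout in seconds")
    # Assumption: validation against the protocol's value list.
    parser.add_argument("--changefreq", default="weekly", choices=CHANGEFREQ_VALUES,
                        help="Sitemap changefreq hint")
    parser.add_argument("--priority", type=float, default=0.5,
                        help="Sitemap priority hint (0.0-1.0)")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="Print crawl progress to stderr")
    return parser


# Parse an explicit argument list instead of sys.argv, for demonstration.
args = build_parser().parse_args(["https://example.com", "--max-pages", "500", "-o", "-"])
print(args.max_pages, args.output, args.delay)
```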

Dependencies

pip install requests beautifulsoup4

Notes

  • Only crawls same-domain pages (no external links)
  • Skips binary files (images, CSS, JS, PDFs, fonts)
  • Respects the delay setting to avoid overwhelming servers
  • Output conforms to the sitemaps.org 0.9 protocol
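
The same-domain and binary-file rules above can be sketched as a small URL filter. The code below is an assumption about how such filtering might look, not the script's actual logic: the names `normalize` and `should_crawl`, the extension list, and the exact-host comparison (which treats `www.example.com` and `example.com` as different hosts) are all hypothetical.

```python
from urllib.parse import urldefrag, urljoin, urlparse

# Illustrative list; the script's real skip list is not documented here.
SKIP_EXTENSIONS = (
    ".jpg", ".jpeg", ".png", ".gif", ".svg", ".ico",
    ".css", ".js", ".pdf", ".woff", ".woff2", ".ttf",
)


def normalize(base_url, href):
    """Resolve href against base_url and drop any #fragment."""
    absolute, _fragment = urldefrag(urljoin(base_url, href))
    return absolute


def should_crawl(url, start_url):
    """Return True for http(s) URLs on the start host that do not look like binary assets."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https"):
        return False
    # Assumption: "same domain" means an exact host match.
    if parsed.netloc.lower() != urlparse(start_url).netloc.lower():
        return False
    return not parsed.path.lower().endswith(SKIP_EXTENSIONS)


start = "https://example.com/"
for href in ["/about#team", "https://other.org/", "/static/logo.png"]:
    url = normalize(start, href)
    print(url, should_crawl(url, start))
```

A crawler built on this filter would call `time.sleep(delay)` between requests to honor the `--delay` setting.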