Crawl your sites, keep sitemaps fresh, and push new pages to search engines automatically.
webdex discovers what pages actually exist on your sites (by following links, not trusting your sitemap), then submits new URLs to Bing/Yandex/Google so they get indexed faster. It can also write and commit updated sitemaps to your repos.
Sitemaps go stale. You add a blog post, forget to update the sitemap, and search engines don't know it exists. Or your framework generates a sitemap that misses pages. webdex fixes this by crawling your live sites and comparing what it finds against what your sitemap says.
- Crawls your sites: follows `<a>` links from the root, seeds from existing sitemaps (handles sitemap indexes, checks `robots.txt`)
- Manages sitemaps: generates and commits `sitemap.xml` to your git repos, or verifies framework-generated sitemaps and warns about mismatches
- Submits to IndexNow: pushes new URLs to Bing, Yandex, Naver, Seznam, and Amazon in one API call
- Submits to Google: pushes sitemaps via the Search Console API, optionally uses the Indexing API for individual URLs
- Discovers domains: uses the Cloudflare API to find all your zones, subdomains, Workers, and Pages projects
Only new URLs (since the last run) get submitted. State is tracked in a local JSON file so you don't spam search engines with unchanged content.
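The "only new URLs" behavior can be pictured as a set difference against the saved state. The sketch below is illustrative only: the `{"seen": [...]}` layout and the function names `load_seen` and `record_run` are assumptions, not webdex's actual state format or API.

```python
import json
from pathlib import Path


def load_seen(state_path: Path) -> set[str]:
    """Return URLs recorded by earlier runs (empty set on the first run)."""
    if not state_path.exists():
        return set()
    # Assumed layout: {"seen": ["https://...", ...]}
    return set(json.loads(state_path.read_text())["seen"])


def record_run(state_path: Path, crawled: set[str]) -> list[str]:
    """Return the URLs not seen before, then persist the merged set."""
    seen = load_seen(state_path)
    new_urls = sorted(crawled - seen)
    state_path.write_text(json.dumps({"seen": sorted(seen | crawled)}, indent=2))
    return new_urls
```

Only the list returned by `record_run` would be handed to the submission step; a run that finds nothing new returns an empty list and submits nothing.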
```bash
git clone https://github.com/xdsai/webdex.git
cd webdex
python3 -m venv .venv
.venv/bin/pip install -e .
cp config.example.yaml config.yaml
# edit config.yaml: add your domains and credentials
.venv/bin/webdex setup   # generates IndexNow key, creates scoped CF token
.venv/bin/webdex run     # crawl + submit
```

```yaml
cloudflare:
  api_token: ""        # scoped token (created by `webdex setup`)
  account_email: ""    # only needed with global_api_key
  global_api_key: ""   # used by `webdex setup` to create a scoped token

indexnow:
  key: ""              # auto-generated by `webdex setup`

google:
  service_account_file: ""   # path to GCP service account JSON key

sites:
  - domain: example.com
    repo: /path/to/repo             # local git repo
    sitemap_path: sitemap.xml       # path in repo
    sitemap_strategy: managed       # webdex writes the sitemap
  - domain: blog.example.com
    repo: /path/to/blog
    sitemap_path: null
    sitemap_strategy: verify        # webdex only checks the sitemap

crawl:
  max_depth: 3
  max_pages: 500
  timeout: 10

state_file: state.json
```

`managed`: webdex generates `sitemap.xml` from crawled URLs, writes it to the repo, then stages, commits, and pushes the change (`git add`, `git commit`, `git push`). Use this for static sites or any site where the sitemap is a plain file you control.

`verify`: webdex compares crawled URLs against the live sitemap and logs warnings about missing pages. Use this for frameworks that auto-generate sitemaps (Astro, Next.js, Hugo, etc.) where you don't want webdex writing to the repo.
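To make the two strategies concrete, here is a minimal sketch using only the standard library. It assumes the usual sitemaps.org `urlset` format; the helper names `build_sitemap`, `parse_sitemap`, and `missing_from_sitemap` are hypothetical and not taken from webdex's code.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(urls: list[str]) -> str:
    """'managed' idea: render a minimal sitemap.xml from crawled URLs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url in sorted(set(urls)):  # de-duplicate and keep output stable
        ET.SubElement(ET.SubElement(urlset, "url"), "loc").text = url
    body = ET.tostring(urlset, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


def parse_sitemap(xml_text: str) -> set[str]:
    """Extract every <loc> value from a sitemap document."""
    root = ET.fromstring(xml_text)
    return {loc.text.strip() for loc in root.iter(f"{{{SITEMAP_NS}}}loc") if loc.text}


def missing_from_sitemap(crawled: set[str], sitemap_xml: str) -> list[str]:
    """'verify' idea: pages the crawl found that the sitemap does not list."""
    return sorted(crawled - parse_sitemap(sitemap_xml))
```

In a `managed` setup the string from `build_sitemap` would be written to `sitemap_path` before the git step; in a `verify` setup each entry from `missing_from_sitemap` would become a logged warning.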
Cloudflare credentials are used for domain discovery and automated Google verification via DNS. They are not required if you list your domains in the config manually.
Put your global API key and account email in the config, then run `webdex setup` to create a scoped token with minimal permissions (Zone Read, DNS Read/Write).
IndexNow requires no account. `webdex setup` generates a random key. You host a text file containing that key at each domain's root; search engines use it to verify you own the site.
For example, if the key is `abc123`, serve a file at `https://example.com/abc123.txt` containing `abc123`.
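If your site is a directory of static files, creating the key file can be scripted. This is a sketch under assumptions: `write_key_file` and the `site_root` argument are hypothetical, and the key generated by `webdex setup` may be produced differently.

```python
import secrets
from pathlib import Path
from typing import Optional


def write_key_file(site_root: Path, key: Optional[str] = None) -> Path:
    """Write <key>.txt, containing only the key, into a site's web root."""
    # Assumption: a random 32-character hex string is an acceptable key.
    key = key or secrets.token_hex(16)
    key_file = site_root / f"{key}.txt"
    key_file.write_text(key, encoding="utf-8")
    return key_file
```

Calling `write_key_file(Path("public"), "abc123")` would create `public/abc123.txt`, which matches the example URL above once the directory is deployed at the domain root.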
Google submission requires a GCP service account:
- Create a service account in the GCP Console
- Download the JSON key file
- Enable the Search Console API, Indexing API, and Site Verification API
- Set `google.service_account_file` in your config
- Run `webdex setup-google`; this automatically verifies domain ownership via DNS TXT records (requires Cloudflare credentials) and adds your sites to Search Console
```bash
webdex run              # full cycle: crawl → update sitemaps → submit to search engines
webdex crawl <domain>   # crawl a single domain and print discovered URLs
webdex discover         # list all domains/subdomains from Cloudflare
webdex setup            # generate IndexNow key + create scoped CF token
webdex setup-google     # verify domains and add to Google Search Console
```

Use `-v` for verbose output. Use `-c path/to/config.yaml` for a custom config location.

Add a cron job to run daily:

```
0 6 * * * cd /path/to/webdex && .venv/bin/webdex run >> webdex.log 2>&1
```
| Engine | Method | Notes |
|---|---|---|
| Bing | IndexNow | Also covers Yahoo |
| Yandex | IndexNow | |
| Naver | IndexNow | |
| Seznam | IndexNow | |
| Amazon | IndexNow | Joined 2025 |
| Google | Search Console API | Sitemap submission; Indexing API for individual URLs (limited) |
IndexNow submissions go to a single endpoint (api.indexnow.org) which fans out to all participating engines.
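For reference, a bulk IndexNow submission is a single JSON POST. The sketch below builds the request without sending it. The field names (`host`, `key`, `keyLocation`, `urlList`) and the `/indexnow` path follow the public IndexNow protocol documentation; the function name `build_indexnow_request` is hypothetical, and webdex's own request code may differ.

```python
import json
import urllib.request

INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(host: str, key: str, urls: list[str]) -> urllib.request.Request:
    """Build (but do not send) a bulk IndexNow submission for one host."""
    payload = {
        "host": host,
        "key": key,
        # Assumes the key file is served at the domain root, as described above.
        "keyLocation": f"https://{host}/{key}.txt",
        "urlList": urls,
    }
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )
```

Sending is left to the caller, for example with `urllib.request.urlopen(request, timeout=10)`, so the payload can be inspected or logged first.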
Google does not support IndexNow. The Indexing API officially only supports JobPosting and BroadcastEvent schema types — it accepts other URLs but Google has warned this may stop working.
- Checks `robots.txt` for sitemap references
- Tries `/sitemap.xml`, `/sitemap-index.xml` (handles sitemap indexes)
- Follows internal `<a>` links up to `max_depth`, same-origin only
- Uses the final URL after redirects as canonical
- Skips non-HTML responses, assets, API endpoints, auth pages
- No JavaScript rendering — if your links are only in JS, seed them via sitemap
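The link-following part of the list above can be sketched as a breadth-first traversal. This is a simplified illustration, not webdex's crawler: `fetch` is a hypothetical callable that returns HTML for a URL (or `None` for non-HTML responses and failures), and sitemap seeding, `robots.txt` handling, and redirect canonicalization are left out.

```python
from collections import deque
from html.parser import HTMLParser
from typing import Callable, Optional
from urllib.parse import urldefrag, urljoin, urlsplit


class LinkParser(HTMLParser):
    """Collect href values from <a> tags in static HTML (no JS rendering)."""

    def __init__(self) -> None:
        super().__init__()
        self.hrefs: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def crawl(
    root: str,
    fetch: Callable[[str], Optional[str]],  # hypothetical fetcher, see lead-in
    max_depth: int = 3,
    max_pages: int = 500,
) -> list[str]:
    """Breadth-first crawl that only follows same-origin <a> links."""
    origin = urlsplit(root)[:2]  # (scheme, netloc)
    seen = {root}
    queue = deque([(root, 0)])
    pages: list[str] = []
    while queue and len(pages) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:  # skipped: non-HTML response or failed request
            continue
        pages.append(url)
        if depth >= max_depth:
            continue  # record the page but do not follow its links
        parser = LinkParser()
        parser.feed(html)
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # resolve, drop #fragment
            if urlsplit(link)[:2] == origin and link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return pages
```

Passing the fetcher in as a parameter keeps the traversal testable with an in-memory dict of pages, for example `crawl("https://example.com/", pages.get)`.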
Building webdex — the story behind the project.
MIT