webdex

Crawl your sites, keep sitemaps fresh, and push new pages to search engines automatically.

webdex discovers what pages actually exist on your sites (by following links, not trusting your sitemap), then submits new URLs to Bing/Yandex/Google so they get indexed faster. It can also write and commit updated sitemaps to your repos.

Why

Sitemaps go stale. You add a blog post, forget to update the sitemap, and search engines don't know it exists. Or your framework generates a sitemap that misses pages. webdex fixes this by crawling your live sites and comparing what it finds against what your sitemap says.

What it does

  1. Crawls your sites — follows <a> links from the root, seeds from existing sitemaps (handles sitemap indexes, checks robots.txt)
  2. Manages sitemaps — generates and commits sitemap.xml to your git repos, or verifies framework-generated sitemaps and warns about mismatches
  3. Submits to IndexNow — pushes new URLs to Bing, Yandex, Naver, Seznam, and Amazon in one API call
  4. Submits to Google — pushes sitemaps via Search Console API, optionally uses the Indexing API for individual URLs
  5. Discovers domains — uses the Cloudflare API to find all your zones, subdomains, Workers, and Pages projects

Only new URLs (since the last run) get submitted. State is tracked in a local JSON file so you don't spam search engines with unchanged content.
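A minimal sketch of that new-URL diff (the state-file schema here, a JSON map of domain to known URLs, is an assumption for illustration, not webdex's actual layout):

```python
import json
from pathlib import Path

def diff_new_urls(domain, crawled_urls, state_path="state.json"):
    """Return URLs not seen in the previous run, then update the state file.

    Assumed schema: {"example.com": ["url1", "url2", ...], ...}
    """
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    seen = set(state.get(domain, []))
    new = sorted(set(crawled_urls) - seen)
    # Persist the union so the next run only reports genuinely new pages.
    state[domain] = sorted(seen | set(crawled_urls))
    path.write_text(json.dumps(state, indent=2))
    return new
```

On the first run everything is "new"; subsequent runs return only the pages added since.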

Quick start

git clone https://github.com/xdsai/webdex.git
cd webdex
python3 -m venv .venv
.venv/bin/pip install -e .
cp config.example.yaml config.yaml
# edit config.yaml — add your domains and credentials
.venv/bin/webdex setup     # generates IndexNow key, creates scoped CF token
.venv/bin/webdex run       # crawl + submit

Configuration

cloudflare:
  api_token: ""          # scoped token (created by `webdex setup`)
  account_email: ""      # only needed with global_api_key
  global_api_key: ""     # used by `webdex setup` to create a scoped token

indexnow:
  key: ""                # auto-generated by `webdex setup`

google:
  service_account_file: ""  # path to GCP service account JSON key

sites:
  - domain: example.com
    repo: /path/to/repo        # local git repo
    sitemap_path: sitemap.xml  # path in repo
    sitemap_strategy: managed  # webdex writes the sitemap

  - domain: blog.example.com
    repo: /path/to/blog
    sitemap_path: null
    sitemap_strategy: verify   # webdex only checks the sitemap

crawl:
  max_depth: 3
  max_pages: 500
  timeout: 10

state_file: state.json

Sitemap strategies

managed — webdex generates sitemap.xml from crawled URLs, writes it to the repo, and runs git add && commit && push. Use this for static sites or any site where the sitemap is a plain file you control.

verify — webdex compares crawled URLs against the live sitemap and logs warnings about missing pages. Use this for frameworks that auto-generate sitemaps (Astro, Next.js, Hugo, etc.) where you don't want webdex writing to the repo.
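For the managed strategy, the sitemap itself is a simple XML document. A minimal sketch of generating one from crawled URLs using only the standard library (webdex's real output may include more fields):

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls, lastmod=None):
    """Render a minimal <urlset> sitemap from a list of URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for u in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = u
        if lastmod:  # optional per the sitemaps.org protocol
            ET.SubElement(entry, "lastmod").text = lastmod
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(urlset, encoding="unicode"))
```

The verify strategy would run the comparison in the other direction: parse the live sitemap's `<loc>` entries and warn about crawled URLs that are missing from it.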

Credentials

Cloudflare (optional)

Used for domain discovery and automated Google verification via DNS. Not required if you just list your domains in the config manually.

Put your global API key and account email in the config, then webdex setup creates a scoped token with minimal permissions (Zone Read, DNS Read/Write).

IndexNow

No account needed. webdex setup generates a random key. You host a text file containing that key at each domain's root — search engines use it to verify you own the site.

For example, if the key is abc123, serve a file at https://example.com/abc123.txt containing abc123.
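The hosting requirement can be expressed as two tiny helpers (illustrative only; per the IndexNow protocol the key file must sit at the site root and contain exactly the key):

```python
def key_file_url(domain, key):
    """Where search engines expect to find the key file."""
    return f"https://{domain}/{key}.txt"

def key_file_valid(body, key):
    """True if the served file's body proves ownership."""
    return body.strip() == key
```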

Google Search Console (optional)

Requires a GCP service account:

  1. Create a service account in the GCP Console
  2. Download the JSON key file
  3. Enable the Search Console API, Indexing API, and Site Verification API
  4. Set google.service_account_file in your config
  5. Run webdex setup-google — this automatically verifies domain ownership via DNS TXT records (requires Cloudflare credentials) and adds your sites to Search Console
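Under the hood, submitting a sitemap is a single authenticated PUT against the Search Console (webmasters v3) API. A sketch of building that endpoint URL, with service-account authentication omitted:

```python
from urllib.parse import quote

def sitemap_submit_url(site_url, sitemap_url):
    """Endpoint for the Search Console `sitemaps.submit` method.

    An authenticated PUT to this URL (OAuth bearer token from the
    service account) submits the sitemap for the given property.
    """
    return ("https://www.googleapis.com/webmasters/v3/sites/"
            f"{quote(site_url, safe='')}/sitemaps/{quote(sitemap_url, safe='')}")
```

Note that both path segments must be percent-encoded, since site URLs like `sc-domain:example.com` contain reserved characters.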

Commands

webdex run              # full cycle: crawl → update sitemaps → submit to search engines
webdex crawl <domain>   # crawl a single domain and print discovered URLs
webdex discover         # list all domains/subdomains from Cloudflare
webdex setup            # generate IndexNow key + create scoped CF token
webdex setup-google     # verify domains and add to Google Search Console

Use -v for verbose output. Use -c path/to/config.yaml for a custom config location.

Running on a schedule

Add a cron job to run daily:

0 6 * * * cd /path/to/webdex && .venv/bin/webdex run >> webdex.log 2>&1

Search engine coverage

| Engine | Method             | Notes                                                          |
| ------ | ------------------ | -------------------------------------------------------------- |
| Bing   | IndexNow           | Also covers Yahoo                                              |
| Yandex | IndexNow           |                                                                |
| Naver  | IndexNow           |                                                                |
| Seznam | IndexNow           |                                                                |
| Amazon | IndexNow           | Joined 2025                                                    |
| Google | Search Console API | Sitemap submission; Indexing API for individual URLs (limited) |

IndexNow submissions go to a single endpoint (api.indexnow.org) which fans out to all participating engines.

Google does not support IndexNow. The Indexing API officially only supports JobPosting and BroadcastEvent schema types — it accepts other URLs but Google has warned this may stop working.
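A batched IndexNow submission is one JSON POST to that endpoint. A minimal sketch of building the request body (field names follow the public IndexNow protocol; `key_location` is the URL of the hosted key file):

```python
import json

def indexnow_payload(host, key, urls, key_location=None):
    """JSON body for a batched IndexNow submission.

    POST this to https://api.indexnow.org/indexnow with
    Content-Type: application/json; the endpoint fans the
    notification out to all participating engines.
    """
    body = {"host": host, "key": key, "urlList": list(urls)}
    if key_location:
        body["keyLocation"] = key_location
    return json.dumps(body)
```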

How the crawler works

  1. Checks robots.txt for sitemap references
  2. Tries /sitemap.xml, /sitemap-index.xml (handles sitemap indexes)
  3. Follows internal <a> links up to max_depth, same-origin only
  4. Uses the final URL after redirects as canonical
  5. Skips non-HTML responses, assets, API endpoints, auth pages
  6. No JavaScript rendering — if your links are only in JS, seed them via sitemap
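The link-following steps above can be sketched as a small breadth-first crawler. Here `fetch` is an injected callable (returning HTML, or None for errors and non-HTML responses) standing in for the real HTTP layer, and the helper names are illustrative, not webdex's internals:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(root, fetch, max_depth=3, max_pages=500):
    """Breadth-first, same-origin crawl starting at `root`."""
    origin = urlparse(root).netloc
    seen, queue, found = {root}, deque([(root, 0)]), []
    while queue and len(found) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:          # error or non-HTML: skip
            continue
        found.append(url)
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            nxt = urljoin(url, href).split("#")[0]  # resolve + drop fragment
            if urlparse(nxt).netloc == origin and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return found
```

Injecting `fetch` keeps the traversal logic testable without a network; the real implementation would also apply the canonical-URL and content-type filtering described above.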

Blog post

Building webdex — the story behind the project.

License

MIT
