webdex

Crawl your sites, keep sitemaps fresh, and push new pages to search engines automatically.

webdex discovers what pages actually exist on your sites (by following links, not trusting your sitemap), then submits new URLs to Bing/Yandex/Google so they get indexed faster. It can also write and commit updated sitemaps to your repos.

Why

Sitemaps go stale. You add a blog post, forget to update the sitemap, and search engines don't know it exists. Or your framework generates a sitemap that misses pages. webdex fixes this by crawling your live sites and comparing what it finds against what your sitemap says.

What it does

  1. Crawls your sites — follows <a> links from the root, seeds from existing sitemaps (handles sitemap indexes, checks robots.txt)
  2. Manages sitemaps — generates and commits sitemap.xml to your git repos, or verifies framework-generated sitemaps and warns about mismatches
  3. Submits to IndexNow — pushes new URLs to Bing, Yandex, Naver, Seznam, and Amazon in one API call
  4. Submits to Google — pushes sitemaps via Search Console API, optionally uses the Indexing API for individual URLs
  5. Discovers domains — uses the Cloudflare API to find all your zones, subdomains, Workers, and Pages projects

Only new URLs (since the last run) get submitted. State is tracked in a local JSON file so you don't spam search engines with unchanged content.
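A minimal sketch of that new-URL diff (the state-file schema here, a JSON map of domain to known URLs, is an assumption for illustration, not webdex's actual layout):

```python
import json
from pathlib import Path

def diff_new_urls(domain, crawled_urls, state_path="state.json"):
    """Return URLs not seen in the previous run, then update the state file.

    Assumed schema: {"example.com": ["url1", "url2", ...], ...}
    """
    path = Path(state_path)
    state = json.loads(path.read_text()) if path.exists() else {}
    seen = set(state.get(domain, []))
    new = sorted(set(crawled_urls) - seen)
    # Persist the union so the next run only reports genuinely new pages.
    state[domain] = sorted(seen | set(crawled_urls))
    path.write_text(json.dumps(state, indent=2))
    return new
```

On the first run everything is "new"; subsequent runs return only the pages added since.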

Quick start

git clone https://github.com/xdsai/webdex.git
cd webdex
python3 -m venv .venv
.venv/bin/pip install -e .
cp config.example.yaml config.yaml
# edit config.yaml — add your domains and credentials
.venv/bin/webdex setup     # generates IndexNow key, creates scoped CF token
.venv/bin/webdex run       # crawl + submit

Configuration

cloudflare:
  api_token: ""          # scoped token (created by `webdex setup`)
  account_email: ""      # only needed with global_api_key
  global_api_key: ""     # used by `webdex setup` to create a scoped token

indexnow:
  key: ""                # auto-generated by `webdex setup`

google:
  service_account_file: ""  # path to GCP service account JSON key

sites:
  - domain: example.com
    repo: /path/to/repo        # local git repo
    sitemap_path: sitemap.xml  # path in repo
    sitemap_strategy: managed  # webdex writes the sitemap

  - domain: blog.example.com
    repo: /path/to/blog
    sitemap_path: null
    sitemap_strategy: verify   # webdex only checks the sitemap

crawl:
  max_depth: 3
  max_pages: 500
  timeout: 10

state_file: state.json

Sitemap strategies

managed — webdex generates sitemap.xml from crawled URLs, writes it to the repo, and runs git add && commit && push. Use this for static sites or any site where the sitemap is a plain file you control.

verify — webdex compares crawled URLs against the live sitemap and logs warnings about missing pages. Use this for frameworks that auto-generate sitemaps (Astro, Next.js, Hugo, etc.) where you don't want webdex writing to the repo.
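For the managed strategy, the sitemap itself is a simple XML document. A minimal sketch of generating one from crawled URLs using only the standard library (webdex's real output may include more fields):

```python
import xml.etree.ElementTree as ET

def build_sitemap(urls, lastmod=None):
    """Render a minimal <urlset> sitemap from a list of URLs."""
    ns = "http://www.sitemaps.org/schemas/sitemap/0.9"
    urlset = ET.Element("urlset", xmlns=ns)
    for u in sorted(set(urls)):
        entry = ET.SubElement(urlset, "url")
        ET.SubElement(entry, "loc").text = u
        if lastmod:  # optional per the sitemaps.org protocol
            ET.SubElement(entry, "lastmod").text = lastmod
    return ('<?xml version="1.0" encoding="UTF-8"?>\n'
            + ET.tostring(urlset, encoding="unicode"))
```

The verify strategy would run the comparison in the other direction: parse the live sitemap's `<loc>` entries and warn about crawled URLs that are missing from it.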

Credentials

Cloudflare (optional)

Used for domain discovery and automated Google verification via DNS. Not required if you just list your domains in the config manually.

Put your global API key and account email in the config, then webdex setup creates a scoped token with minimal permissions (Zone Read, DNS Read/Write).

IndexNow

No account needed. webdex setup generates a random key. You host a text file containing that key at each domain's root — search engines use it to verify you own the site.

For example, if the key is abc123, serve a file at https://example.com/abc123.txt containing abc123.
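The hosting requirement can be expressed as two tiny helpers (illustrative only; per the IndexNow protocol the key file must sit at the site root and contain exactly the key):

```python
def key_file_url(domain, key):
    """Where search engines expect to find the key file."""
    return f"https://{domain}/{key}.txt"

def key_file_valid(body, key):
    """True if the served file's body proves ownership."""
    return body.strip() == key
```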

Google Search Console (optional)

Requires a GCP service account:

  1. Create a service account in the GCP Console
  2. Download the JSON key file
  3. Enable the Search Console API, Indexing API, and Site Verification API
  4. Set google.service_account_file in your config
  5. Run webdex setup-google — this automatically verifies domain ownership via DNS TXT records (requires Cloudflare credentials) and adds your sites to Search Console
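Under the hood, submitting a sitemap is a single authenticated PUT against the Search Console (webmasters v3) API. A sketch of building that endpoint URL, with service-account authentication omitted:

```python
from urllib.parse import quote

def sitemap_submit_url(site_url, sitemap_url):
    """Endpoint for the Search Console `sitemaps.submit` method.

    An authenticated PUT to this URL (OAuth bearer token from the
    service account) submits the sitemap for the given property.
    """
    return ("https://www.googleapis.com/webmasters/v3/sites/"
            f"{quote(site_url, safe='')}/sitemaps/{quote(sitemap_url, safe='')}")
```

Note that both path segments must be percent-encoded, since site URLs like `sc-domain:example.com` contain reserved characters.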

Commands

webdex run              # full cycle: crawl → update sitemaps → submit to search engines
webdex crawl <domain>   # crawl a single domain and print discovered URLs
webdex discover         # list all domains/subdomains from Cloudflare
webdex setup            # generate IndexNow key + create scoped CF token
webdex setup-google     # verify domains and add to Google Search Console

Use -v for verbose output. Use -c path/to/config.yaml for a custom config location.

Running on a schedule

Add a cron job to run daily:

0 6 * * * cd /path/to/webdex && .venv/bin/webdex run >> webdex.log 2>&1

Search engine coverage

| Engine | Method             | Notes                                                          |
| ------ | ------------------ | -------------------------------------------------------------- |
| Bing   | IndexNow           | Also covers Yahoo                                              |
| Yandex | IndexNow           |                                                                |
| Naver  | IndexNow           |                                                                |
| Seznam | IndexNow           |                                                                |
| Amazon | IndexNow           | Joined 2025                                                    |
| Google | Search Console API | Sitemap submission; Indexing API for individual URLs (limited) |

IndexNow submissions go to a single endpoint (api.indexnow.org) which fans out to all participating engines.

Google does not support IndexNow. The Indexing API officially only supports JobPosting and BroadcastEvent schema types — it accepts other URLs but Google has warned this may stop working.
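A batched IndexNow submission is one JSON POST to that endpoint. A minimal sketch of building the request body (field names follow the public IndexNow protocol; `key_location` is the URL of the hosted key file):

```python
import json

def indexnow_payload(host, key, urls, key_location=None):
    """JSON body for a batched IndexNow submission.

    POST this to https://api.indexnow.org/indexnow with
    Content-Type: application/json; the endpoint fans the
    notification out to all participating engines.
    """
    body = {"host": host, "key": key, "urlList": list(urls)}
    if key_location:
        body["keyLocation"] = key_location
    return json.dumps(body)
```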

How the crawler works

  1. Checks robots.txt for sitemap references
  2. Tries /sitemap.xml, /sitemap-index.xml (handles sitemap indexes)
  3. Follows internal <a> links up to max_depth, same-origin only
  4. Uses the final URL after redirects as canonical
  5. Skips non-HTML responses, assets, API endpoints, auth pages
  6. No JavaScript rendering — if your links are only in JS, seed them via sitemap
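The link-following steps above can be sketched as a small breadth-first crawler. Here `fetch` is an injected callable (returning HTML, or None for errors and non-HTML responses) standing in for the real HTTP layer, and the helper names are illustrative, not webdex's internals:

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

def crawl(root, fetch, max_depth=3, max_pages=500):
    """Breadth-first, same-origin crawl starting at `root`."""
    origin = urlparse(root).netloc
    seen, queue, found = {root}, deque([(root, 0)]), []
    while queue and len(found) < max_pages:
        url, depth = queue.popleft()
        html = fetch(url)
        if html is None:          # error or non-HTML: skip
            continue
        found.append(url)
        if depth >= max_depth:
            continue
        parser = LinkParser()
        parser.feed(html)
        for href in parser.links:
            nxt = urljoin(url, href).split("#")[0]  # resolve + drop fragment
            if urlparse(nxt).netloc == origin and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return found
```

Injecting `fetch` keeps the traversal logic testable without a network; the real implementation would also apply the canonical-URL and content-type filtering described above.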

Blog post

Building webdex — the story behind the project.

License

MIT
