
DocsSiteFetcher returns whole-site llms.txt instead of the requested page #137

Description

@aljazerzen

Summary

For any host matching a docs prefix (e.g. docs.*), DocsSiteFetcher ignores the requested page path, probes {origin}/llms-full.txt and then {origin}/llms.txt, and returns the first file it finds in place of the page that was actually requested. Every page on such a host therefore yields the same whole-site file rather than its own content.

Reproduction

fetchkit fetch 'https://docs.tenzir.com/reference/operators/where/'
# => body is <!-- Source: llms.txt --> ... the full ~141 KB site index
fetchkit fetch 'https://docs.tenzir.com/guides/quickstart/'
# => same ~141 KB llms.txt, despite being a different page

The origin server returns the normal page (HTML) regardless of Accept:

curl -sL -H 'Accept: text/markdown' https://docs.tenzir.com/reference/operators/where/ \
  -w 'code=%{http_code} type=%{content_type} size=%{size_download}\n' -o /dev/null
# => code=200 type=text/html; charset=utf-8 size=92147

So the substitution is done entirely by fetchkit, not by the site.

Root cause

crates/fetchkit/src/fetchers/docs_site.rs:

  • DocsSiteFetcher::matches() accepts any host that starts with one of DOCS_HOST_PREFIXES = ["docs.", "wiki.", "developer.", "devdocs."] (or that matches one of the DOCS_HOSTS suffixes).

  • For a matched host that is not itself a direct llms.txt URL, fetch() discards the path, probes {origin}/llms-full.txt then {origin}/llms.txt, and on the first hit returns:

    content: Some(format!("<!-- Source: {} -->\n\n{}", source, content)),

    regardless of which page was requested. The real page is fetched only when no llms.txt exists.
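To illustrate why two different pages end up on the same code path, here is a minimal sketch of host-only prefix matching. It is an approximation written for this issue, not the code in docs_site.rs: the function name host_matches_docs_prefix is hypothetical, the URL parsing is deliberately naive, and the DOCS_HOSTS suffix list is left out because its contents are not quoted above.

```rust
// Prefix list as quoted in this issue.
const DOCS_HOST_PREFIXES: [&str; 4] = ["docs.", "wiki.", "developer.", "devdocs."];

/// Hypothetical approximation of host-only matching (not fetchkit's code).
/// Only the host is inspected, so every path on a matching host qualifies.
fn host_matches_docs_prefix(url: &str) -> bool {
    // Drop the scheme; a URL without "://" is treated as a non-match.
    let rest = match url.split_once("://") {
        Some((_, rest)) => rest,
        None => return false,
    };
    // The host is everything up to the first '/', '?' or '#'.
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()
        .unwrap_or("");
    DOCS_HOST_PREFIXES.iter().any(|prefix| host.starts_with(*prefix))
}
```

Under this sketch, both URLs from the reproduction match for the same reason (the host starts with "docs."), and nothing about the path influences the result.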

Why this is wrong

A site's llms.txt is not a substitute for an arbitrary page. llms.txt is an index/map of the documentation; it does not necessarily contain the content of every page, and frequently a specific page URL is not covered in llms.txt at all. Returning the index for a specific page request therefore:

  • drops the content the caller actually asked for,
  • returns content for a different resource than the requested URL (surprising and undocumented), and
  • can return a large index (~141 KB here) instead of a small page.

Many docs sites also expose per-page markdown directly (e.g. https://docs.tenzir.com/reference/operators/where.md, ~1 KB), so the page content is readily available without the index.
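A per-page Markdown candidate could be derived from the requested URL along these lines. This is only a sketch: markdown_variant is a hypothetical helper, the "drop the trailing slash and append .md" rule is inferred from the single docs.tenzir.com example above, and a real implementation would still need to request the candidate and fall back to the original page if it does not exist.

```rust
/// Hypothetical helper (not part of fetchkit): derive a per-page Markdown
/// URL by dropping any trailing slash and appending ".md".
/// Returns None when there is no page path or the last path segment
/// already contains a dot (e.g. llms.txt or page.html).
fn markdown_variant(url: &str) -> Option<String> {
    // Ignore query string and fragment; they are not part of the page path.
    let base = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    let (scheme, rest) = base.split_once("://")?;
    // None here means the URL has no path at all (bare origin).
    let (host, path) = rest.split_once('/')?;
    let path = path.trim_end_matches('/');
    if path.is_empty() {
        return None; // site root: there is no specific page to rewrite
    }
    let last_segment = path.rsplit('/').next().unwrap_or(path);
    if last_segment.contains('.') {
        return None; // last segment already looks like a file name
    }
    Some(format!("{}://{}/{}.md", scheme, host, path))
}
```

For the URL in the reproduction, this yields https://docs.tenzir.com/reference/operators/where.md, the ~1 KB file mentioned above.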

Requested change

Stop substituting llms.txt for specific page requests. Suggestions:

  • Only return llms.txt / llms-full.txt for a direct request to those URLs (already handled via is_llms_txt_url).
  • For a specific page URL on a docs host, fetch that page (optionally preferring a per-page .md variant when available), and never fall back to the whole-site index.
  • If index-preference is desired, make it opt-in rather than the default, and document it.
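The suggestions above could be expressed as a small decision function. The sketch below is one possible shape, not a patch against fetchkit: FetchPlan, plan_fetch, origin_of, and the prefer_index flag are all hypothetical names, and looks_like_llms_txt is a stand-in for the existing is_llms_txt_url, whose real signature is not shown in this issue.

```rust
/// What to retrieve for a requested URL (hypothetical type).
#[derive(Debug, PartialEq)]
enum FetchPlan {
    /// The caller asked for the index file itself.
    Index(String),
    /// The caller asked for a page: fetch exactly that URL.
    Page(String),
    /// Opt-in only: probe the site index first, then fall back to the page.
    IndexThenPage { index_candidates: Vec<String>, page: String },
}

/// Stand-in for fetchkit's is_llms_txt_url (actual implementation unknown).
fn looks_like_llms_txt(url: &str) -> bool {
    let path = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    path.ends_with("/llms.txt") || path.ends_with("/llms-full.txt")
}

/// Naive origin extraction: scheme plus host, no path.
fn origin_of(url: &str) -> Option<String> {
    let (scheme, rest) = url.split_once("://")?;
    let host = rest
        .split(|c: char| c == '/' || c == '?' || c == '#')
        .next()?;
    if host.is_empty() {
        return None;
    }
    Some(format!("{}://{}", scheme, host))
}

/// Decide what to fetch. The index replaces a page only when the caller
/// explicitly sets `prefer_index` (assumed opt-in flag, off by default).
fn plan_fetch(url: &str, prefer_index: bool) -> FetchPlan {
    if looks_like_llms_txt(url) {
        return FetchPlan::Index(url.to_string());
    }
    if prefer_index {
        if let Some(origin) = origin_of(url) {
            return FetchPlan::IndexThenPage {
                // Same probe order as described under "Root cause".
                index_candidates: vec![
                    format!("{}/llms-full.txt", origin),
                    format!("{}/llms.txt", origin),
                ],
                page: url.to_string(),
            };
        }
    }
    FetchPlan::Page(url.to_string())
}
```

With prefer_index set to false, a page URL always maps to FetchPlan::Page, so the whole-site index is returned only for a direct request or an explicit opt-in. A per-page .md candidate could be tried before the page URL inside the Page case without changing this structure.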

Notes

  • This behavior is not mentioned in the output of fetchkit --help, fetchkit fetch --help, or fetchkit --llmtxt.
  • Observed on fetchkit 0.3.0.
