Summary
For any host matching a docs prefix (e.g. docs.*), DocsSiteFetcher ignores the requested page path, probes {origin}/llms-full.txt and then {origin}/llms.txt, and returns the first one it finds. Fetching a specific documentation page therefore returns the whole site index instead of that page's content.
Reproduction
fetchkit fetch 'https://docs.tenzir.com/reference/operators/where/'
# => body is <!-- Source: llms.txt --> ... the full ~141 KB site index
fetchkit fetch 'https://docs.tenzir.com/guides/quickstart/'
# => same ~141 KB llms.txt, despite being a different page
The origin server returns the normal page (HTML) regardless of Accept:
curl -sL -H 'Accept: text/markdown' https://docs.tenzir.com/reference/operators/where/ \
-w 'code=%{http_code} type=%{content_type} size=%{size_download}\n' -o /dev/null
# => code=200 type=text/html; charset=utf-8 size=92147
So the substitution is done entirely by fetchkit, not by the site.
Root cause
crates/fetchkit/src/fetchers/docs_site.rs:
- DocsSiteFetcher::matches() matches any host that starts with one of DOCS_HOST_PREFIXES = ["docs.", "wiki.", "developer.", "devdocs."] (and DOCS_HOSTS suffixes).
- For a matched host that is not itself a direct llms.txt URL, fetch() discards the path, probes {origin}/llms-full.txt then {origin}/llms.txt, and on the first hit returns:
content: Some(format!("<!-- Source: {} -->\n\n{}", source, content)),
regardless of which page was requested. The real page is fetched only when no llms.txt exists.
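The path-discarding step can be illustrated with a minimal sketch. This is not the code in docs_site.rs: origin_and_host and index_probe_urls are hypothetical names, URL parsing is done by hand to keep the example dependency-free, and the DOCS_HOSTS suffix list is left out because its contents are not shown here.

```rust
// Prefixes quoted in this report; the DOCS_HOSTS suffix list is omitted.
const DOCS_HOST_PREFIXES: [&str; 4] = ["docs.", "wiki.", "developer.", "devdocs."];

/// Hypothetical helper: split an absolute URL into (origin, host).
/// The host part may still carry a port, which is fine for prefix matching.
fn origin_and_host(url: &str) -> Option<(&str, &str)> {
    let scheme_end = url.find("://")? + 3;
    let rest = &url[scheme_end..];
    let host_len = rest
        .find(|c: char| matches!(c, '/' | '?' | '#'))
        .unwrap_or(rest.len());
    if host_len == 0 {
        return None;
    }
    Some((&url[..scheme_end + host_len], &rest[..host_len]))
}

/// Hypothetical model of the probe order described above.
/// Note that the path of `url` never influences the result.
fn index_probe_urls(url: &str) -> Vec<String> {
    let Some((origin, host)) = origin_and_host(url) else {
        return Vec::new();
    };
    if !DOCS_HOST_PREFIXES.iter().any(|p| host.starts_with(*p)) {
        return Vec::new();
    }
    vec![
        format!("{origin}/llms-full.txt"),
        format!("{origin}/llms.txt"),
    ]
}
```

Under this model, the two URLs from the reproduction produce identical probe lists, which is why both requests end up with the same body.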
Why this is wrong
A site's llms.txt is not a substitute for an arbitrary page. llms.txt is an index/map of the documentation; it does not necessarily contain the content of every page, and frequently a specific page URL is not covered in llms.txt at all. Returning the index for a specific page request therefore:
- drops the content the caller actually asked for,
- returns content for a different resource than the requested URL (surprising and undocumented), and
- can return a large index (~141 KB here) instead of a small page.
Many docs sites also expose per-page markdown directly (e.g. https://docs.tenzir.com/reference/operators/where.md, ~1 KB), so the page content is readily available without the index.
Requested change
Stop substituting llms.txt for specific page requests. Suggestions:
- Only return llms.txt / llms-full.txt for a direct request to those URLs (already handled via is_llms_txt_url).
- For a specific page URL on a docs host, fetch that page (optionally preferring a per-page .md variant when available), and never fall back to the whole-site index.
- If index-preference is desired, make it opt-in rather than the default, and document it.
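One possible shape for the suggested behavior, as a sketch only: candidate_urls and markdown_variant are hypothetical names, the is_llms_txt_url below is a stand-in whose signature is assumed (the real one is not shown in this report), and the "trim the trailing slash and append .md" rule is an assumption based on the Tenzir example above that will not hold for every docs site.

```rust
/// Stand-in for the existing `is_llms_txt_url` check; signature assumed.
fn is_llms_txt_url(url: &str) -> bool {
    // Ignore any query string or fragment before looking at the path.
    let page = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    page.ends_with("/llms.txt") || page.ends_with("/llms-full.txt")
}

/// Hypothetical: derive a per-page markdown URL by trimming the trailing
/// slash and appending `.md`. Returns None for bare origins and for URLs
/// that already end in `.md`.
fn markdown_variant(url: &str) -> Option<String> {
    let page = url
        .split(|c: char| c == '?' || c == '#')
        .next()
        .unwrap_or(url);
    let trimmed = page.trim_end_matches('/');
    let after_scheme = trimmed.split_once("://")?.1;
    if !after_scheme.contains('/') || trimmed.ends_with(".md") {
        return None;
    }
    Some(format!("{trimmed}.md"))
}

/// Hypothetical: URLs to try, in order, for a request on a docs host.
/// The site index appears only when it was requested directly.
fn candidate_urls(url: &str) -> Vec<String> {
    if is_llms_txt_url(url) {
        return vec![url.to_string()];
    }
    let mut out = Vec::new();
    if let Some(md) = markdown_variant(url) {
        out.push(md);
    }
    // Always end with the page that was actually requested.
    out.push(url.to_string());
    out
}
```

With this ordering, a request for https://docs.tenzir.com/reference/operators/where/ would try the .md variant first and then the page itself, and a failed .md probe degrades to the requested page rather than to llms.txt.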
Notes
- This behavior is undocumented (fetchkit --help, fetchkit fetch --help, fetchkit --llmtxt).
- Observed on fetchkit 0.3.0.