DocsSiteFetcher returns whole-site llms.txt instead of the requested page #137

## Summary

For any host matching a docs prefix (e.g. `docs.*`), `DocsSiteFetcher` ignores the requested page path, probes `{origin}/llms-full.txt` and then `{origin}/llms.txt`, and returns the first one it finds in place of the page that was actually requested. Every page on such a host therefore resolves to the same site-wide index rather than to its own content.

## Reproduction

```bash
fetchkit fetch 'https://docs.tenzir.com/reference/operators/where/'
# => body is the full ~141 KB site index (llms.txt), not the requested page
fetchkit fetch 'https://docs.tenzir.com/guides/quickstart/'
# => same ~141 KB llms.txt, despite being a different page
```

The origin server itself returns the normal HTML page, even when `Accept: text/markdown` is sent:

```bash
curl -sL -H 'Accept: text/markdown' https://docs.tenzir.com/reference/operators/where/ \
  -w 'code=%{http_code} type=%{content_type} size=%{size_download}\n' -o /dev/null
# => code=200 type=text/html; charset=utf-8 size=92147
```

So the substitution is done entirely by fetchkit, not by the site.

## Root cause

`crates/fetchkit/src/fetchers/docs_site.rs`:

- `DocsSiteFetcher::matches()` accepts any host that starts with one of `DOCS_HOST_PREFIXES = ["docs.", "wiki.", "developer.", "devdocs."]` (or ends with one of the `DOCS_HOSTS` suffixes).
- For a URL on a matched host that is not itself a direct `llms.txt` URL, `fetch()` discards the path, probes `{origin}/llms-full.txt` then `{origin}/llms.txt`, and on the first hit returns:

  ```rust
  content: Some(format!("\n\n{}", source, content)),
  ```

  regardless of which page was requested. The real page is fetched only when no llms.txt exists.
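
To make the consequence concrete, here is a minimal model of the flow described above. It is not the code from `docs_site.rs`: `split_url` and `served_url` are hypothetical names, the `exists` callback stands in for the HTTP probe, the `DOCS_HOSTS` suffix list is left out, and the direct-`llms.txt` check is a guess at what `is_llms_txt_url` does.

```rust
/// Prefixes quoted in this issue from `docs_site.rs`; `DOCS_HOSTS` suffixes are omitted.
const DOCS_HOST_PREFIXES: [&str; 4] = ["docs.", "wiki.", "developer.", "devdocs."];

/// Hypothetical helper: splits an absolute URL into `(origin, host, rest)`.
fn split_url(url: &str) -> Option<(&str, &str, &str)> {
    let host_start = url.find("://")? + 3;
    let host_len = url[host_start..]
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .unwrap_or(url.len() - host_start);
    let path_start = host_start + host_len;
    Some((&url[..path_start], &url[host_start..path_start], &url[path_start..]))
}

/// Models which URL's body ends up in the response, per the description above.
/// `exists` is a stand-in for the network probe (assumption, not fetchkit API).
fn served_url(url: &str, exists: &dyn Fn(&str) -> bool) -> String {
    let (origin, host, path) = match split_url(url) {
        Some(parts) => parts,
        None => return url.to_string(),
    };
    let is_docs_host = DOCS_HOST_PREFIXES.iter().any(|p| host.starts_with(*p));
    // Assumed shape of the direct-index check; the real rule may differ.
    let is_index = path.ends_with("/llms.txt") || path.ends_with("/llms-full.txt");
    if is_docs_host && !is_index {
        // The requested path is never consulted from here on.
        for name in &["llms-full.txt", "llms.txt"] {
            let candidate = format!("{}/{}", origin, name);
            if exists(&candidate) {
                return candidate;
            }
        }
    }
    url.to_string()
}
```

With a stub where only `https://docs.tenzir.com/llms.txt` exists, both URLs from the reproduction map to that one file, which matches the observed output. A regression test for the fix could assert the opposite: two distinct page URLs must never resolve to the same index URL.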

## Why this is wrong

**A site's `llms.txt` is not a substitute for an arbitrary page.** `llms.txt` is an index/map of the documentation; it does not necessarily contain the content of every page, and frequently a specific page URL is not covered in `llms.txt` at all. Returning the index for a specific page request therefore:

- drops the content the caller actually asked for,
- returns content for a different resource than the requested URL (surprising and undocumented), and
- can return a large index (~141 KB here) instead of a small page.

Many docs sites also expose per-page markdown directly (e.g. `https://docs.tenzir.com/reference/operators/where.md`, ~1 KB), so the page content is readily available without the index.
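
As a rough illustration of how a per-page markdown URL could be derived, the sketch below strips the trailing slash and appends `.md`. The function name `markdown_variant` is hypothetical, and the `.md` convention is an assumption that holds for the Tenzir example above but not necessarily for every docs site, so a caller would still need to fall back to the original URL when the variant is missing.

```rust
/// Hypothetical helper: guesses the per-page markdown URL for a docs page.
/// Returns `None` when no sensible guess exists (site root, a URL that
/// already names a file, or a URL carrying a query string or fragment).
fn markdown_variant(url: &str) -> Option<String> {
    // Keep the sketch simple: do not try to rewrite URLs with ?query or #fragment.
    if url.contains('?') || url.contains('#') {
        return None;
    }
    let host_start = url.find("://")? + 3;
    // No '/' after the host means there is no path at all.
    let path_start = host_start + url[host_start..].find('/')?;
    let path = url[path_start..].trim_end_matches('/');
    if path.is_empty() {
        return None; // site root has no per-page variant
    }
    let last_segment = path.rsplit('/').next()?;
    if last_segment.contains('.') {
        return None; // already points at a file such as llms.txt or page.html
    }
    Some(format!("{}{}.md", &url[..path_start], path))
}
```

For `https://docs.tenzir.com/reference/operators/where/` this yields the `where.md` URL mentioned above.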

## Requested change

Stop substituting `llms.txt` for specific page requests. Suggestions:

- Only return `llms.txt` / `llms-full.txt` for a direct request to those URLs (already handled via `is_llms_txt_url`).
- For a specific page URL on a docs host, fetch that page (optionally preferring a per-page `.md` variant when available), and never fall back to the whole-site index.
- If index-preference is desired, make it opt-in rather than the default, and document it.
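
One possible shape for these suggestions, sketched as a pure function that returns the URLs to try in order. Everything here is illustrative: `DocsFetchOptions`, `split_origin`, and `fetch_candidates` are hypothetical names, the body of `is_llms_txt_url` is a guess at the existing check, and the `.md` rule is an assumption about site conventions.

```rust
/// Hypothetical options; nothing in this issue shows fetchkit 0.3.0 having either.
#[derive(Default)]
struct DocsFetchOptions {
    /// Try a per-page `.md` variant before the page itself.
    prefer_markdown: bool,
    /// Opt in to the index-first behavior that is the default today.
    prefer_index: bool,
}

/// Stand-in for the existing `is_llms_txt_url` check; the real rule may differ.
fn is_llms_txt_url(url: &str) -> bool {
    let path = url.split(|c: char| c == '?' || c == '#').next().unwrap_or(url);
    path.ends_with("/llms.txt") || path.ends_with("/llms-full.txt")
}

/// Hypothetical helper: splits an absolute URL into `(origin, rest)`.
fn split_origin(url: &str) -> Option<(&str, &str)> {
    let host_start = url.find("://")? + 3;
    let path_start = url[host_start..]
        .find(|c: char| c == '/' || c == '?' || c == '#')
        .map_or(url.len(), |i| host_start + i);
    Some((&url[..path_start], &url[path_start..]))
}

/// URLs to try, in order. The requested URL is always the last candidate,
/// and the site index only appears when explicitly requested or opted into.
fn fetch_candidates(url: &str, opts: &DocsFetchOptions) -> Vec<String> {
    if is_llms_txt_url(url) {
        return vec![url.to_string()];
    }
    let mut out = Vec::new();
    if let Some((origin, rest)) = split_origin(url) {
        if opts.prefer_index {
            out.push(format!("{}/llms-full.txt", origin));
            out.push(format!("{}/llms.txt", origin));
        }
        let trimmed = rest.trim_end_matches('/');
        let has_ext = trimmed.rsplit('/').next().map_or(false, |s| s.contains('.'));
        let plain = !rest.contains('?') && !rest.contains('#');
        if opts.prefer_markdown && plain && !trimmed.is_empty() && !has_ext {
            out.push(format!("{}{}.md", origin, trimmed));
        }
    }
    out.push(url.to_string());
    out
}
```

With default options a page request yields only the page URL, so the index can never replace it. Keeping the ordering in a pure function like this would also make the behavior easy to unit test without network access.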

## Notes

- This behavior is not mentioned in the output of `fetchkit --help`, `fetchkit fetch --help`, or `fetchkit --llmtxt`.
- Observed on fetchkit 0.3.0.
