Skip to content

fix: Render and index <img> as first-class inline content#744

Merged
danhdoan merged 9 commits into
mainfrom
proj/byterover-tool-mode
May 30, 2026
Merged

fix: Render and index <img> as first-class inline content#744
danhdoan merged 9 commits into
mainfrom
proj/byterover-tool-mode

Conversation

@RyanNg1403
Copy link
Copy Markdown
Collaborator

proj: merge byterover-tool-mode into main — first-class rendering + curate

--response-file flags

Body:

Summary

  • Problem: tool-mode project work has accumulated on proj/byterover-tool-mode and needs to
    land on main.
  • Why it matters: two completed features are gated behind this merge — agents and users can't
    access them on main yet.
  • What changed: integrates two independently-reviewed feature PRs from the project branch.
  • What did NOT change (scope boundary): this PR introduces no new code itself; it is a pure
    project-to-main merge.

Type of change

  • New feature

Scope (select all touched areas)

  • Server / Daemon
  • CLI Commands (oclif)
  • Shared (constants, types, transport events)

Linked issues

N/A

Root cause (bug fixes only, otherwise write N/A)

N/A — feature merge, not a bug fix.

Test plan

  • Coverage added:
    • Unit test
    • Integration test
  • Test file(s):
    • test/unit/server/infra/render/reader/html-renderer.test.ts (img rendering)
    • test/unit/server/infra/render/reader/html-reader.test.ts (img indexing extractor)
    • test/integration/scenarios/img-roundtrip.test.ts (full read + index pipeline)
    • test/unit/oclif/lib/curate-session.test.ts (response-file helpers + envelope parsing)
    • test/commands/curate/index.test.ts (Curate command orchestration)
  • Key scenario(s) covered:
    • <img> rendering inside <bv-*> elements: canonical, missing alt, missing src, special
      chars in alt, autolink fallback for src with parens, top-level img sibling
    • BM25 indexing of img alt + src URL tokens (host, path segments, filename)
    • --response-file happy path + inline equivalence + meta passthrough
    • --delete-response-file semantics: deletes on non-throw dispatch, preserves on parse-fail
      / unknown-session / invalid-uuid, surfaces cleanup-error in both JSON errors[] and
      text-mode stderr

User-visible changes

  1. <img> rendered and indexed as first-class inline content. Previously, curating a
    topic with <img src alt/> inside a <bv-*> element wrote the tag to disk but the image's
    src / alt did not show up in brv read output or BM25 search. After this merge, <img>
    is rendered as CommonMark ![alt](src) in rendered_md at the correct inline position, and
    the alt + src text is tokenized into the BM25 index. Index schema version bumped 6 → 7 so
    previously-curated img content becomes searchable retroactively without a manual brv index rebuild.

  2. brv curate --response-file <path> and --delete-response-file. Previously, an agent
    had to inline-escape its entire JSON envelope on the shell with --response '<json>'. The new
    --response-file flag reads the same {html, meta?} envelope from disk, sidestepping
    shell-escape complexity for non-trivial HTML. --delete-response-file cleans up the envelope
    file after a successful daemon dispatch; the file is preserved on dispatch failure so the
    caller can retry. New documented error kinds: invalid-response-format,
    response-file-not-regular, response-file-read-error, response-file-delete-error.

Evidence

  • Failing test/log before + passing after (per the feature branches merged into proj)

Checklist

  • Tests added or updated and passing (npm test)
  • Lint passes (npm run lint)
  • Type check passes (npm run typecheck)
  • Build succeeds (npm run build)
  • Commits follow Conventional Commits format
  • Documentation updated (docs/curate-protocol.md + skill templates for new
    --response-file flags; system prompt updated to document <img> in the inline-HTML
    allowlist)
  • No breaking changes
  • Branch is up to date with main

Risks and mitigations

  • Risk: an oversize-curate transport issue surfaces at ~1 MB envelopes — daemon disconnects
    with no parseable JSON output on stdout. Discovered during the post-merge E2E sweep; fix in
    progress on a separate branch.
    • Mitigation: documented; affects only unusually long-form topics; tracked as a follow-up.
  • Risk: a small batch of latent <img> writer-validation defects — event-handler attributes
    and non-HTTP URL schemes pass through the writer without sanitization. Verified
    non-exploitable in current consumers: WebUI sanitizes via DOMPurify + React <img> only
    forwards known props; terminal and LLM-as-text consumers don't execute the content.
    • Mitigation: documented in the project E2E sweep notes; one-patch fix planned for
      follow-up.

danhdoan and others added 8 commits May 29, 2026 11:31
Closes the silent-strip bug where curating a topic with `<img src alt/>`
embedded inside a `<bv-*>` element succeeded on disk but vanished on
`brv read` and was missed by `brv query`. Empirically reproduced
2026-05-29: writer kept the tag intact; `brv read` showed a gap where
the image was; query for the alt text returned zero matches.

Cause: `html-renderer.ts` and the BM25 indexer's `bodyText` extraction
both rely on `getInnerText()`, which walks text-node descendants only.
Void elements like `<img>` have attribute data (`src`, `alt`) but no
text children → contribute nothing. Worst-class UX (write succeeds,
content disappears).

Approach: treat `<img>` as first-class inline content.

- `html-renderer.ts` — new private `getInlineMarkdown(node)` that walks
  like `getInnerText` but translates `<img>` to CommonMark `![alt](src)`.
  `renderChild` uses it for inline content; a top-level `<img>` case is
  added for the rare bv-sibling shape. Defensive on malformed input:
    * missing `src` → empty string (no broken `![alt]()` syntax)
    * missing `alt` → `![](src)` (valid CommonMark click target)
    * `]` in alt → collapsed to space (only `]` closes the alt span)
    * `)` in src → CommonMark autolink form `<src>` (parens-tolerant)
- `html-reader.ts` — new exported `extractImageContent(elements)` that
  aggregates every `<img>`'s alt + src into a space-joined string.
  Surfaced via the new `HtmlTopicRead.imageContent` field. Does NOT
  mutate `getInnerText` — separate focused helper with no surprise
  blast-radius on shared infrastructure.
- `search-knowledge-service.ts` — concatenate `parsed.imageContent`
  into the indexed-content array alongside bodyText / summary / tags /
  keywords / related. URLs go in verbatim; the BM25 tokenizer's
  whitespace/punctuation split decomposes them into useful tokens
  (host, path segments, filename, extension).
- `INDEX_SCHEMA_VERSION` bumped 6 → 7 so cached indexes built pre-fix
  invalidate on next daemon start. Previously-curated `<img>` content
  becomes searchable retroactively without a manual `brv index rebuild`.
- `system-prompt.yml` — extend the inline-HTML allowlist note to
  document `<img>` is supported.

Tests (16 new + 6 integration cases):

- `html-renderer.test.ts` (7 cases): canonical `<img>` rendering inside
  `<bv-decision>`; `![](src)` for missing alt; silent drop for missing
  src; `]` escape in alt; `)` autolink fallback in src; top-level
  `<img>` sibling rendering; URL tokens present in rendered output for
  BM25 friendliness.
- `html-reader.test.ts` (6 cases): empty topic → empty imageContent;
  single image alt + src aggregated; multiple images preserve document
  order; empty attrs don't produce double spaces; `readHtmlTopicSync`
  surfaces `imageContent` on the parsed result.
- `test/integration/scenarios/img-roundtrip.test.ts` (6 cases): full
  read + index pipeline. Curate-shaped HTML on tmp disk → `readHtmlTopic`
  → renderer shows markdown image syntax; indexer (MiniSearch with the
  same option shape as production) finds the topic for queries on alt
  phrase, URL host token, URL path segment, surrounding prose. Plus a
  regression guard for topics with zero images (no double spaces in
  the indexed content).

42/42 affected-surface tests green; 54/54 search-knowledge regression
tests pass. Typecheck + lint clean; the `renderChild` complexity
warning was 31 pre-fix and is 32 now — +1 unavoidable for the new
top-level `<img>` branch.

This is the first task in the post-merge inline-html-support
milestone (`features/html-memory-conversion/milestones/02-...`).
The matching fix for `<a href>` (same shape, different bug surface)
is tracked as a follow-up task.
[ENG-3021] Render and index <img> as first-class inline content
…v curate

Lets agents author the curate continuation envelope as a JSON file (no shell-escape
hell) and optionally have brv unlink it after local validation succeeds.

- --response-file <path>: alternative to --response. Reads {html, meta?} from file,
  lstat-guarded against symlinks / directories / devices.
- --delete-response-file (opt-in): unlinks the file once envelope parse + session
  lookup succeed locally, BEFORE daemon dispatch. Validation failures preserve the
  file so the agent can fix and retry; unlink failures abort the curate so we never
  claim success when the requested cleanup did not happen.
- Mutex'd at runtime against --response so failures emit a structured envelope
  error agents can switch on, not oclif's generic stderr line.
- Kickoff-side guard: any continuation-only flag (--response, --response-file,
  --delete-response-file) without --session is rejected up front. Also closes a
  pre-existing silent-ignore on --response without --session.
- All local validation (envelope parse + session existence) now runs before any
  daemon connect; bad JSON no longer triggers withDaemonRetry.

Stdin (--response-file -) deliberately deferred to a follow-up; no codebase
precedent for the dash-as-stdin convention yet.

Docs + skill markdown updated. brv curate --help EXAMPLES shows both flags.
Four cleanups from PR review (one per finding):

1. Defer --response-file unlink until after daemon dispatch returns.
   Previously the unlink fired between local validation and dispatch, so
   a transport error (daemon down, malformed payload, daemon-side throw)
   destroyed the envelope file the agent had paid an LLM call to author.
   Now the unlink only fires on a non-throw dispatch (success OR
   validation-failed); a daemon-error throw skips it and the file
   survives for retry. Unlink failure post-success appends a warning to
   the dispatch envelope instead of aborting.

2. Replace `as Error & {kind?: string}` cast on the parse-error catch
   with `instanceof InvalidResponseFormatError`. Required exporting the
   class from curate-session.ts. Matches the existing pattern for
   InvalidResponseFileError and respects the no-`as`-Type convention.

3. Pre-check whitespace-only payload in the command and emit the
   transient `empty-response` envelope directly instead of letting
   JSON.parse('   ') throw and collapse it into the terminal
   `invalid-response-format` kind. Preserves the documented contract
   where `empty-response` keeps the session alive for retry.

4. Replace the one-line buildUnknownSessionEnvelope wrapper with a
   direct export of the internal unknownSessionEnvelope. Same shape,
   one fewer indirection.

Interactive verification: file preserved on whitespace-only payload
(both inline and file), happy-path file deleted post-dispatch, real
envelope still curates clean. 8877 tests pass (1 new for the
parseCurateResponse-throws-InvalidResponseFormatError contract).
… gaps

Second round of PR review fixes (four findings on the previous fix
commit):

1. Switch cleanup-failure-post-success from warnings[] (free-text) to
   errors[] (structured kind). The previous fix moved
   response-file-delete-error into warnings, which buried the machine-
   readable kind under a substring-match. Now the cleanup error is
   appended to the dispatch envelope's errors[] so consumers can switch
   on `kind === 'response-file-delete-error'` programmatically. The
   `ok`/`status` fields stay as the daemon set them, so
   `{ok: true, status: 'done', errors: [{kind: 'response-file-delete-error', …}]}`
   is the success-with-cleanup-hiccup shape.

2. Update docs/curate-protocol.md row for response-file-delete-error to
   describe the new "non-terminal companion" lifecycle. Previously said
   "the curate is aborted" which mismatched the post-fix behavior.

3. Document the warnings[] field in the wire-envelope JSON skeleton.
   Was undocumented (pre-existing gap); this PR actively uses the field
   so now's the time to surface it.

4. Add a doc note under --delete-response-file explaining the
   correct-html re-authoring tax: on a validation-failed continuation,
   the file is unlinked but the session is still live, so the calling
   agent must author a fresh envelope and write a new file before the
   next continuation. Inline --response has no equivalent overhead.

Interactive verification: forced unlink failure via chmod-protected
parent dir. Result: status=done, ok=true, errors=[{kind:response-file-
delete-error, message:EACCES…}], warnings=None. Structured kind is
where consumers can find it.
…eton

Real regression caught in third-round review: when the previous fix
moved --delete-response-file cleanup failures from `warnings[]` to
`errors[]`, the text-mode emitter's `done` branch was left iterating
`warnings[]` only. JSON consumers saw the new `errors[0].kind` cleanly,
but `--format text` users with a successful curate + failed cleanup saw
`✓ Curated to ...` and nothing else — the cleanup failure vanished.

Fix the emitter to iterate companion `errors[]` on the `done` branch
too, prefixed with `⚠` to match the warnings convention. Same
"non-fatal companion" semantics the docs already describe.

Plus two doc lines tracking the new `done`-with-errors[] shape:

- JSON-skeleton comment for `errors[]` now lists the three valid
  placements (correct-html, failed, done-with-companion).
- Status-values table's `done` row notes the companion-errors case
  with a "treat as success; surface but don't abandon" hint so a
  contract author reading top-of-doc material doesn't assume any
  non-empty `errors[]` means failure.

Interactive verification: chmod-555 parent directory → text-mode
emits "✓ Curated to ..." plus "⚠ response-file-delete-error: ..."
on the next line.
…estration

Closes the four-round-carryover finding by pinning the load-bearing
orchestration invariants. Three consecutive review rounds had each
introduced a subtle behavioral change in the emitter / orchestration
surface caught only by manual interactive verification:

  abd82ba — deferred unlink until after dispatch
  316ece0 — switched cleanup-failure from warnings[] to errors[]
  60b613d — made text-mode iterate companion errors[] on the done branch

Each would have been caught by a small command-level test with stubbed
seams. Adding both the seams and the tests.

Refactor (src/oclif/commands/curate/index.ts):
- Extract the dispatch path into a protected dispatchContinuation()
  method that wraps withDaemonRetry + continueSession. Throws on
  daemon-error so the caller can map to a daemon-error envelope AND
  skip the post-success unlink — that ordering is the load-bearing
  invariant.
- Extract loadCurateResponseFile, deleteCurateResponseFile,
  peekCurateSession, and resolveProjectRoot calls into protected
  methods so a TestableCurate subclass can inject in-memory doubles.
- Switch the removed-flag check from process.argv.slice(2) to
  this.argv so testable subclasses scope to their own argv (mocha's
  --timeout was leaking into the curate flag parser otherwise).

New test file (test/commands/curate/index.test.ts) — 13 tests pinning:

  flag combinations:
    - --response + --response-file together
    - --delete-response-file without --response-file
    - continuation-only flags on kickoff (no --session)
  pre-dispatch local validation:
    - whitespace payload → empty-response, no dispatch
    - invalid JSON → invalid-response-format, no dispatch
    - unknown session → unknown-session, no unlink
    - --response-file read error → response-file-read-error, no dispatch
  daemon-error path preserves the response file (regression abd82ba):
    - dispatch throws → daemon-error envelope, NO unlink call fired
  cleanup-failure-post-success → done with appended errors[] (316ece0):
    - done + unlink throws → ok=true, status=done, errors[0].kind=response-file-delete-error
    - failed + unlink throws → existing errors preserved, cleanup appended
    - done + unlink succeeds → clean done envelope (no companion errors)
  text-mode emitter surfaces companion errors on done (60b613d):
    - text-mode done + cleanup error → both ✓ and ⚠ lines surface
    - text-mode done + writer warnings → existing behavior unchanged

Pattern follows test/commands/status.test.ts: TestableCurate subclasses
Curate, overrides protected seams via constructor-injected behavior
callbacks, captures stdout via sinon.stub(process.stdout, 'write')
because writeJsonResponse bypasses this.log.

Full suite: 8890 passing (13 new, 1 added earlier this session).
Accept curate continuation envelope from a JSON file, with opt-in cleanup
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 30, 2026

Claude PR Review — in progress

  • Gather PR diff vs origin/main
  • Review <img> rendering + indexing changes (html-reader/renderer, search-knowledge)
  • Review --response-file / --delete-response-file flag implementation (curate command + curate-session)
  • Review docs + skill template updates
  • Review test coverage
  • Post inline comments for issues found
  • Post summary with score

View job run · Branch: proj/byterover-tool-mode

Comment thread src/oclif/lib/curate-session.ts
Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/oclif/commands/curate/index.ts
Comment thread src/agent/infra/tools/implementations/search-knowledge-service.ts
Comment thread src/server/infra/render/reader/html-reader.ts
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 30, 2026

Claude finished @danhdoan's task in 4m 39s —— View job


Claude PR Review

  • Gather PR diff vs origin/main
  • Review <img> rendering + indexing changes (html-reader/renderer, search-knowledge)
  • Review --response-file / --delete-response-file flag implementation (curate command + curate-session)
  • Review docs + skill template updates
  • Review test coverage
  • Post inline comments for issues found
  • Post summary with score

Summary

This is a project-to-main merge of two independently-reviewed features (<img> first-class rendering/indexing + curate --response-file flags). The merged diff is well-shaped: the orchestration ordering on the curate command (parse → peek → dispatch → unlink) is correct, the structured-error contract for the new flags is consistent with the rest of the envelope, and the INDEX_SCHEMA_VERSION 6→7 bump correctly invalidates pre-<img> caches so previously-curated img content becomes searchable retroactively.

Strengths

  • Test coverage is strong and intentional. The command-level tests at test/commands/curate/index.test.ts explicitly name the regression commits (abd82ba, 316ece0, 60b613d) in their describe blocks — excellent practice that pins the why of each invariant to a git-blame anchor.
  • Ordering invariants are correct. The "dispatch first, then unlink only on non-throw" ordering at src/oclif/commands/curate/index.ts:471-525 is the right shape, and the matching test at lines 288-313 locks it in.
  • Structured error contract is consistent. response-file-not-regular, response-file-read-error, response-file-delete-error map cleanly into envelope.errors[] with a sensible terminal-vs-companion distinction documented in docs/curate-protocol.md:103.
  • Schema-version bump is the right call. INDEX_SCHEMA_VERSION = 7 invalidates pre-<img> caches up front, matching the PR-body promise about retroactive searchability without brv index rebuild.

Findings (all non-blocking; details in inline comments)

  • renderImgMarkdown emits raw src verbatim — non-HTTP URL schemes (javascript:, data:text/html,...) pass through unguarded (PR body already acknowledges this; flagging at the renderer to pair with the writer-side fix).
  • The autolink fallback at html-renderer.ts:244 could be broadened to handle (, <, >, whitespace, backtick — and could preserve alt text via ${alt}: <${src}> (complements the prior reviewer's point about alt-text loss).
  • pickKickoffMissingFlag returns one flag at a time on multi-flag misuse — could emit all offending flags in one envelope error.
  • brv-instructions.md doesn't tell the agent when to prefer file-based over inline; one line pulled forward from curate.md would line them up.
  • InvalidResponseFileError could expose a RESPONSE_FILE_ERROR_KINDS enumerable for exhaustive switching at downstream consumers.

Prior-round findings I concur with

  • TOCTOU lstat→read/unlink window in curate-session.ts:304-355 — theoretical given the threat model; worth a JSDoc acknowledgement.
  • Autolink fallback information loss (alt text dropped when src contains )).
  • Whitespace-shape parity between getInlineMarkdownRaw and html-parser's getInnerText.
  • Inner if (error instanceof InvalidResponseFileError) swallowing generic Error from deleteResponseFile.
  • MINISEARCH_OPTIONS-change-requires-bump JSDoc.
  • html-reader.ts:113 parameter type could narrow to readonly ElementNode[].

Score: 4.3 / 5

Solid merge of two well-scoped features with strong test coverage and good documentation hygiene. The notable gaps are (1) the writer/renderer URL-scheme sanitization gap (acknowledged in PR body and tracked as a follow-up) and (2) the autolink fallback information loss — neither blocks the merge.
· Branch: proj/byterover-tool-mode

Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/oclif/commands/curate/index.ts
Comment thread test/commands/curate/index.test.ts
Comment thread src/server/infra/render/reader/html-renderer.ts
Comment thread src/oclif/lib/curate-session.ts
Comment thread src/server/templates/sections/brv-instructions.md
@danhdoan danhdoan changed the title Proj/byterover tool mode fix: Render and index <img> as first-class inline content May 30, 2026
@danhdoan danhdoan merged commit c498b2f into main May 30, 2026
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants