
feat: bash chunker fallback + NDJSON streaming + search merge #24

Closed
dvcdsys wants to merge 6 commits into main from feat/bash-fallback-and-ndjson-streaming

@dvcdsys dvcdsys commented Apr 27, 2026

Summary

Six-commit PR covering chunker robustness, end-to-end indexing reliability, and search-result quality. Sections 1-5 below cover the first five commits; the sixth groups search results by file.

1. Chunker — language registry + parse-budget + bash fallback (c502770)

  • defaultRegistry ships 30 languages out of the box (Tier A: Go/Python/JS/TS/TSX/Java/C/C++/Ruby/PHP/Rust; Tier B: C#/Swift/Kotlin/Scala/Haskell/Elixir/Erlang/OCaml/Lua/Bash/HTML/CSS/SQL/YAML/JSON/TOML/Markdown/Dockerfile/Make). Configurable via CIX_ENABLED_LANGUAGES.
  • parseBudget = 2s with twin guards (SetTimeoutMicros + SetCancellationFlag armed by time.AfterFunc). Tree-sitter-bash's catastrophic backtracking on real install.sh (31s on 7KB) → falls back to regex.
  • bashRegexChunks extracts name() {, function name {, and one-liners with a state machine that handles strings/comments/heredocs (<<EOF, <<-EOF, <<'EOF', <<< here-strings). 18 tests.
  • extractName extended for word/field_identifier/simple_identifier/constant so bash/Go-method/Swift/Kotlin/Ruby chunks pick up the symbol name.
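
The parse-budget guard in the second bullet can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `chunkWithBudget`, `parseFunc`, `slowParse`, `fastParse`, `regexFallback`, and `demoBudget` are hypothetical names, the parser is a stand-in that polls a flag, and only the cancellation-flag half of the twin guard is modelled (the parser-side timeout is assumed to be set separately on the real tree-sitter parser).

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// demoBudget is deliberately short; the PR uses parseBudget = 2s.
const demoBudget = 20 * time.Millisecond

// parseFunc stands in for a tree-sitter parse call. It is expected to poll
// the cancelled flag and return ok=false once the flag is set.
type parseFunc func(src []byte, cancelled *atomic.Bool) (chunks []string, ok bool)

// chunkWithBudget runs parse under a wall-clock budget. A timer arms the
// cancellation flag when the budget expires; if the parse gives up, the
// fallback chunker handles the source instead.
func chunkWithBudget(src []byte, budget time.Duration, parse parseFunc, fallback func([]byte) []string) []string {
	var cancelled atomic.Bool
	timer := time.AfterFunc(budget, func() { cancelled.Store(true) })
	defer timer.Stop() // do not leave the timer armed after a fast parse

	if chunks, ok := parse(src, &cancelled); ok {
		return chunks
	}
	return fallback(src)
}

// slowParse simulates a grammar that never finishes within the budget.
func slowParse(_ []byte, cancelled *atomic.Bool) ([]string, bool) {
	for !cancelled.Load() {
		time.Sleep(time.Millisecond)
	}
	return nil, false
}

// fastParse simulates a well-behaved grammar.
func fastParse(src []byte, _ *atomic.Bool) ([]string, bool) {
	return []string{"ast:" + string(src)}, true
}

// regexFallback stands in for the regex-based chunker.
func regexFallback(src []byte) []string {
	return []string{"regex:" + string(src)}
}

func main() {
	fmt.Println(chunkWithBudget([]byte("a"), demoBudget, fastParse, regexFallback))
	fmt.Println(chunkWithBudget([]byte("b"), demoBudget, slowParse, regexFallback))
}
```

Stopping the timer on return keeps a fast parse from leaving a stray callback behind; the flag is local to each call, so concurrent parses do not cancel each other.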

2. NDJSON streaming for /index/files + ctx-disconnect cancel (8e46c97)

  • Replaces single-JSON response with NDJSON event stream when client sends Accept: application/x-ndjson. Server-side 10s heartbeat + per-file events (file_started, file_chunked, file_embedded, file_done, file_error, batch_done).
  • Server detects client disconnect via r.Context().Done() → calls CancelIndexing() to release per-project session lock immediately. Was: lock survived until 1h TTL.
  • CLI gets a separate streamingClient with Timeout: 0 and 60s idle watchdog (6× heartbeat margin). New --idle-timeout flag.
  • Three render modes: Interactive (TTY, in-place status line), LineByLine (CI/non-TTY), Quiet (watcher).
  • SIGINT/SIGTERM in cix reindex propagates via signal.NotifyContext. Belt-and-braces deferred CancelIndex on error.
  • Backwards compat: server still serves old clients (no Accept header → single JSON). New client hard-fails with ErrLegacyServer against an old server (operator deploy is server-first).
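
A rough server-side sketch of the streaming behaviour described above, using only the standard library. The `event` struct and its field names, the `heartbeat` event type, and the helper names `wantsNDJSON`, `streamEvents`, and `demoStream` are assumptions for illustration; the actual schema is the one documented in progress.go.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
	"time"
)

// event is a hypothetical wire shape, not the PR's real schema.
type event struct {
	Type string `json:"type"`
	Path string `json:"path,omitempty"`
}

// wantsNDJSON reports whether the client opted in to the event stream.
func wantsNDJSON(r *http.Request) bool {
	return strings.Contains(r.Header.Get("Accept"), "application/x-ndjson")
}

// streamEvents writes one JSON object per line until events is closed,
// sends a heartbeat on every tick, and calls onDisconnect if the request
// context is cancelled (for example because the client went away).
func streamEvents(w http.ResponseWriter, r *http.Request, events <-chan event, heartbeat time.Duration, onDisconnect func()) {
	w.Header().Set("Content-Type", "application/x-ndjson")
	flusher, _ := w.(http.Flusher)
	enc := json.NewEncoder(w) // Encode writes a trailing newline per value
	send := func(ev event) {
		_ = enc.Encode(ev)
		if flusher != nil {
			flusher.Flush() // push the line to the client immediately
		}
	}
	ticker := time.NewTicker(heartbeat)
	defer ticker.Stop()
	for {
		select {
		case <-r.Context().Done():
			onDisconnect() // e.g. release the per-project session lock
			return
		case <-ticker.C:
			send(event{Type: "heartbeat"})
		case ev, ok := <-events:
			if !ok {
				return // batch finished
			}
			send(ev)
		}
	}
}

// demoStream drives the handler with an in-memory recorder.
func demoStream() string {
	events := make(chan event, 2)
	events <- event{Type: "file_started", Path: "a.go"}
	events <- event{Type: "file_done", Path: "a.go"}
	close(events)

	req := httptest.NewRequest(http.MethodPost, "/index/files", nil)
	req.Header.Set("Accept", "application/x-ndjson")
	if !wantsNDJSON(req) {
		return ""
	}
	rec := httptest.NewRecorder()
	streamEvents(rec, req, events, time.Hour, func() {})
	return rec.Body.String()
}

func main() {
	fmt.Print(demoStream())
}
```

Flushing after every line matters here: without it the events can sit in a buffer, and neither the per-file progress nor the heartbeat reaches the client in time to reset its idle watchdog.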

3. CLI polish — cix cancel + summary grouping + dev makefile (94ed0ff)

  • cix cancel <project> frees a stuck session lock without waiting for TTL.
  • cix summary groups "Top symbols" by language instead of mixing them.
  • server/Makefile: docker-build-cuda-dev builds + pushes :cu128-dev for prod-style smoke testing before tagging a release.

4. Chunker fixes — markdown duplication + splitChunk symbol leakage (58de363)

  • Markdown registry: drop atx_heading (sections already include it) → no more 1-2 line duplicate chunks for every ### foo heading.
  • splitChunk: when an oversized function is cut into N pieces, only the first piece keeps SymbolName/SymbolSignature. Subsequent pieces become anonymous block chunks. Was: cix def run returned N hits at different line ranges of the same function.

5. Search merge — overlapping hits + windowed retrieval (f87e7e7)

  • New mergeOverlappingHits collapses results from the same file when one's range contains another's (markdown H1 ⊋ H2, Go class ⊋ method) OR when same-symbol pieces are adjacent (splitChunk leftovers). Inner matches recorded as NestedHits; merged score = max.
  • Windowed retrieval: when merge collapses below --limit, re-fetch with limit×2, ×4, ×8, ×16. Stops early on HNSW exhaustion.
  • CLI render shows + N more match(es) inside: breadcrumbs so the user sees WHY the outer chunk ranks well.

Test plan

  • cd server && go test ./... — all green
  • cd cli && go test ./... — all green
  • make docker-build-cuda-dev → redeploy on prod RTX 3090 box
  • cix reindex --full with the install.sh in tree (was hanging, then crashing) — should now complete in <2s for that file
  • cix search "server start" against a fresh index — should NOT show 1-2 line markdown heading fragments or duplicate run() chunks
  • Ctrl+C mid-cix reindex --full → immediate session-lock release; subsequent cix reindex works without 409
  • cix watch mid-batch kill → same as above
  • Older cix against new server → old single-JSON path still works
  • New cix against old server → fails fast with ErrLegacyServer

🤖 Generated with Claude Code

dvcdsys and others added 6 commits April 27, 2026 21:50
…bash regex fallback

Three related chunker improvements:

1. Language registry expansion. defaultRegistry now ships 30 languages
   (Tier A: Go/Python/JS/TS/TSX/Java/C/C++/Ruby/PHP/Rust; Tier B: C#/Swift/
   Kotlin/Scala/Haskell/Elixir/Erlang/OCaml/Lua/Bash/HTML/CSS/SQL/YAML/JSON
   /TOML/Markdown/Dockerfile/Make). Configurable via CIX_ENABLED_LANGUAGES.
   See doc/LANGUAGES.md.

2. Tree-sitter parse-budget guard. Bash grammar exhibits catastrophic
   backtracking on real-world scripts (31s on 7KB install.sh). Added
   parseBudget=2s with twin guards: SetTimeoutMicros + SetCancellationFlag
   armed by time.AfterFunc. On budget exceeded, falls back to chunkFallback().

3. Bash regex fallback. New bashRegexChunks() recognises the three common
   bash function forms (POSIX `name() {`, keyword `function name {`, with
   or without parens) and finds each function's closing brace via a state
   machine that handles single/double strings, comments, heredocs
   (<<EOF / <<-EOF / <<'EOF' / <<"EOF"), and here-strings (<<<).
   Module-type chunks fill gaps so top-level commands stay indexed.
   18 tests cover all forms + edge cases (nested braces, strings/comments
   with braces, install.sh repro, malformed/unbalanced).
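
A simplified sketch of the regex-plus-state-machine idea from item 3. It is an assumption-laden illustration rather than the PR's bashRegexChunks: the names `funcHeader`, `closingLine`, `findBashFuncs`, and `demoScript` are hypothetical, and the scanner handles only quotes, backslash escapes, and comments. Heredocs and here-strings, which the real implementation covers, are left out to keep it short.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// funcHeader matches `name() {`, `function name {` and `function name() {`.
var funcHeader = regexp.MustCompile(
	`^\s*(?:function\s+([A-Za-z_][A-Za-z0-9_-]*)\s*(?:\(\s*\))?|([A-Za-z_][A-Za-z0-9_-]*)\s*\(\s*\))\s*\{`)

type bashFunc struct {
	Name       string
	Start, End int // 1-based, inclusive line numbers
}

// closingLine scans from the header line and returns the 0-based index of
// the line holding the matching closing brace, or -1 if unbalanced.
func closingLine(lines []string, start int) int {
	depth := 0
	var quote byte // 0 when outside a string, otherwise ' or "
	for i := start; i < len(lines); i++ {
		line := lines[i]
		for j := 0; j < len(line); j++ {
			c := line[j]
			switch {
			case quote == '\'':
				if c == '\'' {
					quote = 0
				}
			case quote == '"':
				if c == '\\' {
					j++ // skip the escaped character
				} else if c == '"' {
					quote = 0
				}
			case c == '\\':
				j++
			case c == '\'' || c == '"':
				quote = c
			case c == '#' && (j == 0 || line[j-1] == ' ' || line[j-1] == '\t'):
				j = len(line) // comment runs to end of line
			case c == '{':
				depth++
			case c == '}':
				depth--
				if depth == 0 {
					return i
				}
			}
		}
	}
	return -1
}

// findBashFuncs returns every function whose braces balance.
func findBashFuncs(src string) []bashFunc {
	lines := strings.Split(src, "\n")
	var out []bashFunc
	for i := 0; i < len(lines); i++ {
		m := funcHeader.FindStringSubmatch(lines[i])
		if m == nil {
			continue
		}
		end := closingLine(lines, i)
		if end < 0 {
			continue // malformed: leave the text to gap chunks
		}
		name := m[1]
		if name == "" {
			name = m[2]
		}
		out = append(out, bashFunc{Name: name, Start: i + 1, End: end + 1})
		i = end // resume after the function body
	}
	return out
}

const demoScript = `#!/bin/bash
greet() {
  echo "hello } world"  # not a brace }
}
function build {
  if true; then echo '{'; fi
}
one() { echo hi; }
`

func main() {
	for _, f := range findBashFuncs(demoScript) {
		fmt.Printf("%s: lines %d-%d\n", f.Name, f.Start, f.End)
	}
}
```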

extractName extended with `word`, `field_identifier`, `simple_identifier`,
`constant` so tree-sitter chunks for bash/Go-method/Swift/Kotlin/Ruby pick
up the symbol name instead of falling back to nil.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…index/files

Replaces the single-JSON response of POST /index/files with an NDJSON event
stream when the client sends Accept: application/x-ndjson. Solves three
real-world pain points seen on heavy batches (e.g. 20 files = 347 chunks =
165s on a single GPU slot):

* CLI no longer hits its 600s http.Client deadline on long batches —
  per-event keepalive plus a server-side 10s heartbeat ticker keep the
  connection alive arbitrarily long. New streamingClient on the CLI uses
  Timeout: 0 with a 60s idle-watchdog instead.
* Server now detects client disconnect mid-batch via r.Context().Done()
  and immediately calls CancelIndexing() to release the per-project session
  lock. Previously the lock survived until the 1h TTL.
* Per-file progress is visible during the batch (file_started, file_chunked,
  file_embedded, file_done events). Three render modes on the CLI:
  Interactive (TTY status line with CR), LineByLine (CI / non-TTY),
  Quiet (watcher — only summaries + file_error).

Backwards compatibility is asymmetric and intentional: server still serves
old clients (no Accept header → existing single-JSON path). New CLI hard-
fails with ErrLegacyServer if it gets back Content-Type: application/json,
because the operator's deploy workflow is server-first.
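
On the client side, the negotiation described above might look roughly like this. It is a sketch under assumptions: `readStream`, `demoTypes`, and the `event` fields are hypothetical, and the error text is invented; only the ErrLegacyServer name and the Content-Type rule come from this commit message.

```go
package main

import (
	"bufio"
	"encoding/json"
	"errors"
	"fmt"
	"io"
	"mime"
	"strings"
)

// ErrLegacyServer signals that the server answered with the old single-JSON
// response instead of an NDJSON stream. The message text is illustrative.
var ErrLegacyServer = errors.New("server does not support NDJSON streaming")

// event is a hypothetical wire shape, not the PR's real schema.
type event struct {
	Type string `json:"type"`
	Path string `json:"path,omitempty"`
}

// readStream checks the response Content-Type, then decodes one event per
// line and hands each to onEvent in arrival order.
func readStream(contentType string, body io.Reader, onEvent func(event)) error {
	mediaType, _, err := mime.ParseMediaType(contentType)
	if err != nil {
		return fmt.Errorf("bad Content-Type %q: %w", contentType, err)
	}
	switch mediaType {
	case "application/x-ndjson":
		// expected streaming response
	case "application/json":
		return ErrLegacyServer
	default:
		return fmt.Errorf("unexpected Content-Type %q", mediaType)
	}

	// Note: bufio.Scanner caps line length at 64 KiB unless Buffer is called.
	sc := bufio.NewScanner(body)
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" {
			continue // tolerate blank keepalive lines
		}
		var ev event
		if err := json.Unmarshal([]byte(line), &ev); err != nil {
			return fmt.Errorf("decode event: %w", err)
		}
		onEvent(ev)
	}
	return sc.Err()
}

// demoTypes collects the event types seen in a canned response body.
func demoTypes(contentType, body string) ([]string, error) {
	var types []string
	err := readStream(contentType, strings.NewReader(body), func(ev event) {
		types = append(types, ev.Type)
	})
	return types, err
}

func main() {
	types, err := demoTypes("application/x-ndjson",
		"{\"type\":\"file_started\"}\n{\"type\":\"file_done\"}\n")
	fmt.Println(types, err)

	_, err = demoTypes("application/json; charset=utf-8", "{}")
	fmt.Println(errors.Is(err, ErrLegacyServer))
}
```

Parsing the header with `mime.ParseMediaType` rather than comparing strings means a `charset` parameter on the legacy response still triggers the hard fail.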

Wire format and event schema documented in server/internal/indexer/progress.go
and mirrored in cli/internal/client/progress.go.

SIGINT/SIGTERM in `cix reindex` now propagates via signal.NotifyContext —
HTTP request context cancels, server frees lock automatically. Belt-and-
braces deferred CancelIndex on error paths in indexer.Run() and watcher
Stop().

Tests:
* server/internal/httpapi/indexing_streaming_test.go — streaming happy
  path, disconnect-frees-lock (direct handler invocation with custom
  flushRecorder), legacy compat negotiation
* cli/internal/client/index_streaming_test.go — NDJSON parse, callback
  ordering, ErrLegacyServer hard-fail, idle timeout, retry on 503/429,
  back-compat SendFiles wrapper
* watcher tests updated to mock NDJSON responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…efile target

* New `cix cancel` command frees a stuck per-project session lock without
  having to wait for the 1h TTL. Pairs with the streaming-handler ctx-
  disconnect path: in normal use the lock auto-clears, this is the manual
  escape hatch.

* `cix summary` now groups "Top symbols" by language and shows up to N per
  language instead of one mixed list. Earlier output mixed Go/Python/JS
  symbols with no indication of which file they came from, which made the
  summary nearly useless on multi-language repos.

* server/Makefile: docker-build-cuda-dev target builds + pushes :cu128-dev
  for manual prod testing before tagging a release. Floating tag, no
  pinned variant — rollback isn't a concern for a dev tag.

* root Makefile: small build-target plumbing.

* doc/benchmark-cix-vs-grep.md: numbers from the search-vs-grep comparison
  done while debugging the install.sh hang. Tracked locally — not user
  documentation, more of an internal reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s symbol metadata

Two related fixes that cleaned up search/cix-def output for repos with
markdown docs and long Go/Python functions:

1. Markdown registry. tree-sitter-markdown's `section` already wraps
   the heading + its body, so listing both `section` and `atx_heading`
   in the type-nodes config emitted duplicate one-line chunks for every
   `### foo` heading (visible as Type: type | 1-2 line snippets in
   `cix search` output). Drop `atx_heading`; keep only `section`.

2. splitChunk. When tree-sitter emitted a function chunk larger than
   maxChunkSize (default 4500 chars), splitChunk cut it into N pieces and
   set SymbolName/SymbolSignature/ChunkType="function" on every piece.
   Result: cix def run returned N hits at different line ranges of the
   same function. Now only the FIRST piece carries the symbol metadata;
   subsequent pieces become anonymous `block` chunks. Full content of
   the symbol stays indexed for embed/FTS search — only the symbol-index
   attribution is consolidated.
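
The splitChunk rule in item 2 can be illustrated with a small sketch. The `chunk` struct here is a cut-down, hypothetical version of the real type, and the split is by byte count for brevity; the actual splitter presumably respects line boundaries.

```go
package main

import "fmt"

// chunk is a reduced, hypothetical version of the indexer's chunk type.
type chunk struct {
	Content    string
	ChunkType  string
	SymbolName string
}

// splitChunk cuts c into pieces of at most maxSize bytes. Only the first
// piece keeps the symbol metadata; later pieces become anonymous blocks.
func splitChunk(c chunk, maxSize int) []chunk {
	if maxSize <= 0 || len(c.Content) <= maxSize {
		return []chunk{c}
	}
	var out []chunk
	for start := 0; start < len(c.Content); start += maxSize {
		end := start + maxSize
		if end > len(c.Content) {
			end = len(c.Content)
		}
		piece := chunk{Content: c.Content[start:end], ChunkType: "block"}
		if start == 0 {
			piece.ChunkType = c.ChunkType
			piece.SymbolName = c.SymbolName
		}
		out = append(out, piece)
	}
	return out
}

func main() {
	big := chunk{Content: "aaaabbbbcc", ChunkType: "function", SymbolName: "big_func"}
	for _, p := range splitChunk(big, 4) {
		fmt.Printf("%q type=%s symbol=%q\n", p.Content, p.ChunkType, p.SymbolName)
	}
}
```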

Test: TestSplitChunk_OnlyFirstKeepsSymbol — fixture is a 2000-line Python
function, asserts exactly one chunk in the result claims symbol=big_func.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…b render

Tree-sitter emits nested chunks by design — a markdown H1 wraps its H2
sub-sections, a Go class wraps its methods, a Python module wraps its
classes. A vector search that hits the inner chunk also tends to hit (a
bit weaker) the outer chunk, and the user's --limit budget gets eaten
by N near-duplicates of the same code region. Same problem with
splitChunk leftovers when a long function is cut into pieces.

This change collapses overlapping results from the same file into a
single "outer" hit with the inner matches recorded as NestedHits. Two
merge cases:

1. Strict containment — A.range ⊋ B.range and same file → absorb B.
2. Same-symbol adjacent — adjacent ranges where at least one carries a
   symbol name → absorb (catches splitChunk piece1 + piece2 leftovers).

Cross-file results are NEVER merged. Exact duplicates (same range twice)
are not merged either — those should be deduped at the vector-store
layer (already are, via dedupByLocation).
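
One possible reading of the two merge cases, as a sketch. The `hit` struct, the sweep strategy (sort by file and start line, then absorb into the previous kept hit), and the choice to extend the outer range across adjacent pieces are assumptions made for this example, not details confirmed by the commit.

```go
package main

import (
	"fmt"
	"sort"
)

// hit is a hypothetical, reduced search result.
type hit struct {
	File       string
	Start, End int
	Symbol     string
	Score      float64
	Nested     []hit
}

// strictlyContains reports whether a's range covers b's and is not equal to it.
func strictlyContains(a, b hit) bool {
	return a.Start <= b.Start && b.End <= a.End && (a.Start != b.Start || a.End != b.End)
}

func adjacent(a, b hit) bool {
	return a.End+1 == b.Start || b.End+1 == a.Start
}

// shouldMerge applies the two cases: strict containment, or adjacent ranges
// where at least one side carries a symbol name. Never across files.
func shouldMerge(outer, inner hit) bool {
	if outer.File != inner.File {
		return false
	}
	if strictlyContains(outer, inner) {
		return true
	}
	return adjacent(outer, inner) && (outer.Symbol != "" || inner.Symbol != "")
}

// mergeOverlappingHits collapses hits into outer hits with Nested entries.
// The merged score is the maximum of the absorbed scores.
func mergeOverlappingHits(hits []hit) []hit {
	sorted := append([]hit(nil), hits...)
	sort.SliceStable(sorted, func(i, j int) bool {
		a, b := sorted[i], sorted[j]
		if a.File != b.File {
			return a.File < b.File
		}
		if a.Start != b.Start {
			return a.Start < b.Start
		}
		return a.End > b.End // wider range first, so outers precede inners
	})
	var out []hit
	for _, h := range sorted {
		if n := len(out); n > 0 && shouldMerge(out[n-1], h) {
			outer := &out[n-1]
			outer.Nested = append(outer.Nested, h)
			if h.Score > outer.Score {
				outer.Score = h.Score
			}
			if h.End > outer.End {
				outer.End = h.End // adjacent piece extends the outer range
			}
			continue
		}
		out = append(out, h)
	}
	sort.SliceStable(out, func(i, j int) bool { return out[i].Score > out[j].Score })
	return out
}

func demoHits() []hit {
	return []hit{
		{File: "doc.md", Start: 1, End: 100, Score: 0.5},
		{File: "doc.md", Start: 10, End: 20, Score: 0.9},
		{File: "main.go", Start: 5, End: 40, Symbol: "run", Score: 0.7},
		{File: "main.go", Start: 41, End: 80, Score: 0.6},
		{File: "other.go", Start: 10, End: 20, Score: 0.4},
	}
}

func main() {
	for _, h := range mergeOverlappingHits(demoHits()) {
		fmt.Printf("%s %d-%d score=%.2f nested=%d\n", h.File, h.Start, h.End, h.Score, len(h.Nested))
	}
}
```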

Windowed retrieval: instead of over-fetching limit×2 once, the search
handler now retries with limit×2, ×4, ×8, ×16 if mergeOverlappingHits
collapses the result set below the user's --limit. Stops early when the
vector store returns fewer rows than asked (HNSW exhausted) or when the
factor cap is hit. In practice the first window is enough; the loop
exists for repos with deeply nested markdown or many class+method
hits inside the same files.
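
The retry loop might be shaped like the sketch below. `windowedSearch`, `demoWindowed`, and the integer-based fetch and merge stand-ins are hypothetical; the toy merge collapses five consecutive ids into one group to imitate nested hits.

```go
package main

import "fmt"

const maxFactor = 16 // cap on the over-fetch multiplier

// windowedSearch fetches limit*factor rows for factor = 2, 4, 8, 16 until
// the merged set reaches limit or the store returns fewer rows than asked.
// It also reports how many fetches were needed.
func windowedSearch(limit int, fetch func(n int) []int, merge func([]int) []int) (merged []int, fetches int) {
	for factor := 2; factor <= maxFactor; factor *= 2 {
		want := limit * factor
		raw := fetch(want)
		fetches++
		merged = merge(raw)
		if len(merged) >= limit || len(raw) < want {
			break // enough results, or the store is exhausted
		}
	}
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged, fetches
}

// demoWindowed runs the loop against a toy store holding ids 0..storeSize-1.
func demoWindowed(storeSize, limit int) ([]int, int) {
	fetch := func(n int) []int {
		if n > storeSize {
			n = storeSize
		}
		ids := make([]int, n)
		for i := range ids {
			ids[i] = i
		}
		return ids
	}
	merge := func(ids []int) []int {
		var groups []int
		for _, id := range ids {
			g := id / 5 // five raw rows collapse into one merged hit
			if len(groups) == 0 || groups[len(groups)-1] != g {
				groups = append(groups, g)
			}
		}
		return groups
	}
	return windowedSearch(limit, fetch, merge)
}

func main() {
	fmt.Println(demoWindowed(40, 3)) // needs a second window
	fmt.Println(demoWindowed(4, 3))  // store exhausted on the first window
}
```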

Server changes:
- New file search_merge.go — mergeOverlappingHits, shouldMerge.
- searchResultItem gains NestedHits []nestedHit (omitempty).
- semanticSearchHandler refactored: extract fetchVectorResults +
  filterToSearchItems helpers, wrap call in factor loop, drop the early
  break-on-limit (merge needs the full filtered set to identify
  overlaps).
- 10 unit tests for mergeOverlappingHits + 1 integration test
  (TestSemanticSearch_NestedMarkdownMerge) verifying nested H1/H2
  sections collapse to a single result with NestedHits populated.

CLI changes:
- SearchResult / NestedHit struct mirrors the server response.
- cix search render shows "+ N more match(es) inside:" with
  per-hit score/range/symbol so the user sees WHY the outer chunk
  ranks well even when the actual signal came from a sub-section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s inside

Changes the unit of search output from "chunk" to "file". Inspired by
how grep groups hits per file but with AST-aware match boundaries and
embedding-driven ranking.

Old wire shape: a flat list of chunks. A file with three matching
chunks ate three slots out of the user's --limit budget, scattered
across the result list, often with the file appearing at positions #3
and #10 simultaneously.

New wire shape:
  results: [
    {
      file_path, language, best_score,
      matches: [
        { start_line, end_line, score, content, chunk_type, symbol_name,
          nested_hits },
        ...
      ]
    }
  ]
  total: <distinct files>

Ranking:
* Files ordered by best_score (the highest match score in the group)
  descending.
* Inside each file, matches ordered by start_line ascending — natural
  reading order top-to-bottom.
* No per-file cap on matches. The only intra-file filter is min_score.
  A file with 50 matches above threshold shows all 50.

Window loop now targets distinct files, not chunks: factor 2..16,
stops when len(file_groups) >= limit, when the vector store returns
fewer rows than asked, or when the cap is hit.

mergeOverlappingHits still runs FIRST (collapses nested H1⊋H2⊋H3 etc.
into one match with nested_hits inside), then groupByFile lifts the
survivors into file-grouped output. So a markdown file with three
nested sections still produces ONE match (not one file with three),
and a Go file with class+method overlap still produces a clean class
match with the method as a nested hit.
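
The ranking rules above can be sketched as a small grouping function. The `match` and `fileGroup` structs are reduced, hypothetical versions of the wire shape, and this groupByFile is an illustration of the described ordering rather than the PR's implementation.

```go
package main

import (
	"fmt"
	"sort"
)

// match and fileGroup are cut-down, hypothetical versions of the wire shape.
type match struct {
	File               string
	StartLine, EndLine int
	Score              float64
}

type fileGroup struct {
	FilePath  string
	BestScore float64
	Matches   []match
}

// groupByFile lifts matches into per-file groups: files ordered by best
// score descending, matches inside a file by start line ascending.
func groupByFile(matches []match) []fileGroup {
	index := map[string]int{} // file path -> position in groups
	var groups []fileGroup
	for _, m := range matches {
		i, ok := index[m.File]
		if !ok {
			i = len(groups)
			index[m.File] = i
			groups = append(groups, fileGroup{FilePath: m.File, BestScore: m.Score})
		}
		g := &groups[i]
		g.Matches = append(g.Matches, m)
		if m.Score > g.BestScore {
			g.BestScore = m.Score
		}
	}
	for i := range groups {
		ms := groups[i].Matches
		sort.SliceStable(ms, func(a, b int) bool { return ms[a].StartLine < ms[b].StartLine })
	}
	sort.SliceStable(groups, func(a, b int) bool { return groups[a].BestScore > groups[b].BestScore })
	return groups
}

func demoMatches() []match {
	return []match{
		{File: "a.go", StartLine: 250, EndLine: 280, Score: 0.42},
		{File: "b.go", StartLine: 1, EndLine: 10, Score: 0.60},
		{File: "a.go", StartLine: 61, EndLine: 195, Score: 0.85},
	}
}

func main() {
	for i, g := range groupByFile(demoMatches()) {
		fmt.Printf("%d. %s [best %.2f] %d matches\n", i+1, g.FilePath, g.BestScore, len(g.Matches))
		for _, m := range g.Matches {
			fmt.Printf("   [%.2f] lines %d-%d\n", m.Score, m.StartLine, m.EndLine)
		}
	}
}
```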

CLI render redesigned around the new shape:
  1. /path/to/file.go  [best 0.85]  4 matches · go
     -- [0.85] lines 61-195  (function run)
        ```go
        ...
        ```
        + 1 more match inside:
          · [0.50] line 80  (function init)
     -- [0.42] lines 250-280  (type Server)
        ```go
        ...
        ```

Tests:
* groupByFile: sort-by-best-score, sort-matches-by-line, preserves
  nested_hits, empty input.
* TestSemanticSearch_NestedMarkdownMerge updated for the new shape —
  still asserts the H1 absorbs the two H2 sub-sections (now visible
  as group.Matches[0].NestedHits).
* CLI search_test fixture updated to new wire shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dvcdsys dvcdsys closed this Apr 28, 2026
@dvcdsys dvcdsys deleted the feat/bash-fallback-and-ndjson-streaming branch April 28, 2026 10:14