feat: bash chunker fallback + NDJSON streaming + search merge #24
Closed
…bash regex fallback
Three related chunker improvements:
1. Language registry expansion. defaultRegistry now ships 30 languages
(Tier A: Go/Python/JS/TS/TSX/Java/C/C++/Ruby/PHP/Rust; Tier B: C#/Swift/
Kotlin/Scala/Haskell/Elixir/Erlang/OCaml/Lua/Bash/HTML/CSS/SQL/YAML/JSON
/TOML/Markdown/Dockerfile/Make). Configurable via CIX_ENABLED_LANGUAGES.
See doc/LANGUAGES.md.
2. Tree-sitter parse-budget guard. Bash grammar exhibits catastrophic
backtracking on real-world scripts (31s on 7KB install.sh). Added
parseBudget=2s with twin guards: SetTimeoutMicros + SetCancellationFlag
armed by time.AfterFunc. On budget exceeded, falls back to chunkFallback().
3. Bash regex fallback. New bashRegexChunks() recognises the three common
bash function forms (POSIX `name() {`, keyword `function name {`, with
or without parens) and finds each function's closing brace via a state
machine that handles single/double strings, comments, heredocs
(<<EOF / <<-EOF / <<'EOF' / <<"EOF"), and here-strings (<<<).
Module-type chunks fill gaps so top-level commands stay indexed.
18 tests cover all forms + edge cases (nested braces, strings/comments
with braces, install.sh repro, malformed/unbalanced).
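A minimal version of the fallback might look like the following sketch. The function-form regex and the string/comment-aware brace scanner follow the description above, but heredoc and here-string handling is omitted for brevity, and all names are illustrative rather than the project's actual API.

```go
package main

import (
	"regexp"
	"strings"
)

// funcStart matches the common bash function forms:
//   name() {          (POSIX)
//   function name {   (keyword, parens optional)
var funcStart = regexp.MustCompile(`^\s*(?:function\s+)?([A-Za-z_][A-Za-z0-9_]*)\s*(?:\(\s*\))?\s*\{`)

// closingBrace returns the 0-based index of the line holding the function's
// matching close brace, or -1. It tracks brace depth while skipping
// single/double-quoted strings and comments. (The real chunker must also
// handle heredocs and here-strings.)
func closingBrace(lines []string, start int) int {
	depth := 0
	for i := start; i < len(lines); i++ {
		line := lines[i]
		inSingle, inDouble := false, false
		for j := 0; j < len(line); j++ {
			c := line[j]
			switch {
			case inSingle:
				if c == '\'' {
					inSingle = false
				}
			case inDouble:
				if c == '\\' {
					j++ // skip escaped character inside double quotes
				} else if c == '"' {
					inDouble = false
				}
			case c == '\'':
				inSingle = true
			case c == '"':
				inDouble = true
			case c == '#':
				j = len(line) // comment: ignore rest of line
			case c == '{':
				depth++
			case c == '}':
				depth--
				if depth == 0 {
					return i
				}
			}
		}
	}
	return -1
}

// bashFuncRanges maps each recognised function name to its 0-based
// [startLine, endLine] range.
func bashFuncRanges(src string) map[string][2]int {
	lines := strings.Split(src, "\n")
	out := map[string][2]int{}
	for i, line := range lines {
		if m := funcStart.FindStringSubmatch(line); m != nil {
			if end := closingBrace(lines, i); end >= 0 {
				out[m[1]] = [2]int{i, end}
			}
		}
	}
	return out
}
```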
extractName extended with `word`, `field_identifier`, `simple_identifier`,
`constant` so tree-sitter chunks for bash/Go-method/Swift/Kotlin/Ruby pick
up the symbol name instead of falling back to nil.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…index/files

Replaces the single-JSON response of POST /index/files with an NDJSON event
stream when the client sends Accept: application/x-ndjson. Solves three
real-world pain points seen on heavy batches (e.g. 20 files = 347 chunks =
165s on a single GPU slot):

* CLI no longer hits its 600s http.Client deadline on long batches —
  per-event keepalive plus a server-side 10s heartbeat ticker keep the
  connection alive arbitrarily long. New streamingClient on the CLI uses
  Timeout: 0 with a 60s idle watchdog instead.
* Server now detects client disconnect mid-batch via r.Context().Done() and
  immediately calls CancelIndexing() to release the per-project session lock.
  Previously the lock survived until the 1h TTL.
* Per-file progress is visible during the batch (file_started, file_chunked,
  file_embedded, file_done events). Three render modes on the CLI:
  Interactive (TTY status line with CR), LineByLine (CI / non-TTY), Quiet
  (watcher — only summaries + file_error).

Backwards compatibility is asymmetric and intentional: the server still
serves old clients (no Accept header → existing single-JSON path), but the
new CLI hard-fails with ErrLegacyServer if it gets back Content-Type:
application/json, because the operator's deploy workflow is server-first.

Wire format and event schema documented in
server/internal/indexer/progress.go and mirrored in
cli/internal/client/progress.go.

SIGINT/SIGTERM in `cix reindex` now propagates via signal.NotifyContext —
the HTTP request context cancels, and the server frees the lock
automatically. Belt-and-braces deferred CancelIndex on error paths in
indexer.Run() and watcher Stop().
Tests:

* server/internal/httpapi/indexing_streaming_test.go — streaming happy path,
  disconnect-frees-lock (direct handler invocation with custom
  flushRecorder), legacy compat negotiation
* cli/internal/client/index_streaming_test.go — NDJSON parse, callback
  ordering, ErrLegacyServer hard-fail, idle timeout, retry on 503/429,
  back-compat SendFiles wrapper
* watcher tests updated to mock NDJSON responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…efile target

* New `cix cancel` command frees a stuck per-project session lock without
  having to wait for the 1h TTL. Pairs with the streaming-handler
  ctx-disconnect path: in normal use the lock auto-clears; this is the
  manual escape hatch.
* `cix summary` now groups "Top symbols" by language and shows up to N per
  language instead of one mixed list. Earlier output mixed Go/Python/JS
  symbols with no indication of which file they came from, which made the
  summary nearly useless on multi-language repos.
* server/Makefile: docker-build-cuda-dev target builds + pushes :cu128-dev
  for manual prod testing before tagging a release. Floating tag, no pinned
  variant — rollback isn't a concern for a dev tag.
* root Makefile: small build-target plumbing.
* doc/benchmark-cix-vs-grep.md: numbers from the search-vs-grep comparison
  done while debugging the install.sh hang. Tracked locally — not user
  documentation, more of an internal reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
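The per-language grouping with an N-per-language cap might look like this sketch; the types and names are illustrative, not the project's actual summary code.

```go
package main

import "sort"

// sym is a simplified stand-in for one "Top symbols" row.
type sym struct {
	Name     string
	Language string
}

// topSymbolsByLanguage groups an already-ranked symbol list by language and
// caps each language at n entries, so a multi-language repo no longer
// produces one mixed list. Languages are sorted for stable rendering.
func topSymbolsByLanguage(syms []sym, n int) (langs []string, byLang map[string][]string) {
	byLang = map[string][]string{}
	for _, s := range syms {
		if _, seen := byLang[s.Language]; !seen {
			langs = append(langs, s.Language)
		}
		if len(byLang[s.Language]) < n {
			byLang[s.Language] = append(byLang[s.Language], s.Name)
		}
	}
	sort.Strings(langs)
	return langs, byLang
}
```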
…s symbol metadata

Two related fixes that cleaned up search/cix-def output for repos with
markdown docs and long Go/Python functions:

1. Markdown registry. tree-sitter-markdown's `section` already wraps the
   heading + its body, so listing both `section` and `atx_heading` in the
   type-nodes config emitted duplicate one-line chunks for every `### foo`
   heading (visible as Type: type | 1-2 line snippets in `cix search`
   output). Drop `atx_heading` — keep only `section`.
2. splitChunk. When tree-sitter emitted a function chunk larger than
   maxChunkSize (default 4500 chars), splitChunk cut it into N pieces and
   set SymbolName/SymbolSignature/ChunkType="function" on every piece.
   Result: `cix def run` returned N hits at different line ranges of the
   same function. Now only the FIRST piece carries the symbol metadata;
   subsequent pieces become anonymous `block` chunks. Full content of the
   symbol stays indexed for embed/FTS search — only the symbol-index
   attribution is consolidated.

Test: TestSplitChunk_OnlyFirstKeepsSymbol — fixture is a 2000-line Python
function, asserts exactly one chunk in the result claims symbol=big_func.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
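The first-piece-keeps-symbol rule can be sketched as follows, assuming simplified types and plain character-based splitting (the real splitChunk presumably splits on line or node boundaries).

```go
package main

// chunk is a simplified stand-in for the indexer's chunk type.
type chunk struct {
	Content    string
	ChunkType  string // "function", "block", ...
	SymbolName string
}

// splitOversized cuts c.Content into pieces of at most maxLen characters.
// Only the first piece keeps the symbol metadata; the rest become anonymous
// "block" chunks, so symbol lookup returns a single hit per function while
// the full content stays indexed for search.
func splitOversized(c chunk, maxLen int) []chunk {
	if len(c.Content) <= maxLen {
		return []chunk{c}
	}
	var out []chunk
	for i := 0; i < len(c.Content); i += maxLen {
		end := i + maxLen
		if end > len(c.Content) {
			end = len(c.Content)
		}
		p := chunk{Content: c.Content[i:end]}
		if i == 0 {
			p.ChunkType, p.SymbolName = c.ChunkType, c.SymbolName
		} else {
			p.ChunkType = "block"
		}
		out = append(out, p)
	}
	return out
}
```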
…b render

Tree-sitter emits nested chunks by design — a markdown H1 wraps its H2
sub-sections, a Go class wraps its methods, a Python module wraps its
classes. A vector search that hits the inner chunk also tends to hit (a bit
weaker) the outer chunk, and the user's --limit budget gets eaten by N
near-duplicates of the same code region. Same problem with splitChunk
leftovers when a long function is cut into pieces.

This change collapses overlapping results from the same file into a single
"outer" hit with the inner matches recorded as NestedHits. Two merge cases:

1. Strict containment — A.range ⊋ B.range and same file → absorb B.
2. Same-symbol adjacent — adjacent ranges where at least one carries a
   symbol name → absorb (catches splitChunk piece1 + piece2 leftovers).

Cross-file results are NEVER merged. Exact duplicates (same range twice) are
not merged either — those should be deduped at the vector-store layer
(already are, via dedupByLocation).

Windowed retrieval: instead of over-fetching limit×2 once, the search
handler now retries with limit×2, ×4, ×8, ×16 if mergeOverlappingHits
collapses the result set below the user's --limit. Stops early when the
vector store returns fewer rows than asked (HNSW exhausted) or when the
factor cap is hit. In practice the first window is enough; the loop exists
for repos with deeply nested markdown or many class+method hits inside the
same files.

Server changes:
- New file search_merge.go — mergeOverlappingHits, shouldMerge.
- searchResultItem gains NestedHits []nestedHit (omitempty).
- semanticSearchHandler refactored: extract fetchVectorResults +
  filterToSearchItems helpers, wrap call in factor loop, drop the early
  break-on-limit (merge needs the full filtered set to identify overlaps).
- 10 unit tests for mergeOverlappingHits + 1 integration test
  (TestSemanticSearch_NestedMarkdownMerge) verifying nested H1/H2 sections
  collapse to a single result with NestedHits populated.
CLI changes:
- SearchResult / NestedHit struct mirrors the server response.
- cix search render shows "+ N more match(es) inside:" with per-hit
  score/range/symbol so the user sees WHY the outer chunk ranks well even
  when the actual signal came from a sub-section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
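The two merge predicates might be sketched like this, with hypothetical types and names rather than the actual search_merge.go code:

```go
package main

// hit is a simplified stand-in for a search result.
type hit struct {
	File       string
	StartLine  int
	EndLine    int
	SymbolName string
}

// contains reports whether a's range strictly contains b's, same file only.
// Exact duplicates are deliberately excluded: those are deduped at the
// vector-store layer, not here.
func contains(a, b hit) bool {
	if a.File != b.File {
		return false // cross-file results are never merged
	}
	if a.StartLine == b.StartLine && a.EndLine == b.EndLine {
		return false
	}
	return a.StartLine <= b.StartLine && b.EndLine <= a.EndLine
}

// adjacentWithSymbol reports whether a and b look like consecutive pieces
// of one split symbol: adjacent line ranges in the same file where at least
// one side carries a symbol name.
func adjacentWithSymbol(a, b hit) bool {
	if a.File != b.File {
		return false
	}
	if a.SymbolName == "" && b.SymbolName == "" {
		return false
	}
	return a.EndLine+1 == b.StartLine || b.EndLine+1 == a.StartLine
}
```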
…s inside

Changes the unit of search output from "chunk" to "file". Inspired by how
grep groups hits per file, but with AST-aware match boundaries and
embedding-driven ranking.

Old wire shape: a flat list of chunks. A file with three matching chunks ate
three slots out of the user's --limit budget, scattered across the result
list, often with the file appearing at positions #3 and #10 simultaneously.

New wire shape:

results: [
  { file_path, language, best_score,
    matches: [
      { start_line, end_line, score, content, chunk_type, symbol_name,
        nested_hits },
      ...
    ] }
]
total: <distinct files>

Ranking:
* Files ordered by best_score (the highest match score in the group)
  descending.
* Inside each file, matches ordered by start_line ascending — natural
  reading order top-to-bottom.
* No per-file cap on matches. The only intra-file filter is min_score. A
  file with 50 matches above threshold shows all 50.

Window loop now targets distinct files, not chunks: factor 2..16, stops when
len(file_groups) >= limit, when the vector store returns fewer rows than
asked, or when the cap is hit.

mergeOverlappingHits still runs FIRST (collapses nested H1⊋H2⊋H3 etc. into
one match with nested_hits inside), then groupByFile lifts the survivors
into file-grouped output. So a markdown file with three nested sections
still produces ONE match (not one file with three), and a Go file with
class+method overlap still produces a clean class match with the method as a
nested hit.

CLI render redesigned around the new shape:

1. /path/to/file.go [best 0.85] 4 matches · go
   -- [0.85] lines 61-195 (function run)
      ```go ... ```
      + 1 more match inside:
      · [0.50] line 80 (function init)
   -- [0.42] lines 250-280 (type Server)
      ```go ... ```

Tests:
* groupByFile: sort-by-best-score, sort-matches-by-line, preserves
  nested_hits, empty input.
* TestSemanticSearch_NestedMarkdownMerge updated for the new shape — still
  asserts the H1 absorbs the two H2 sub-sections (now visible as
  group.Matches[0].NestedHits).
* CLI search_test fixture updated to new wire shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
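The grouping step under the ordering rules above can be sketched as follows, with illustrative types rather than the project's actual structs:

```go
package main

import "sort"

// match and fileGroup are simplified stand-ins for the wire types.
type match struct {
	File      string
	StartLine int
	Score     float64
}

type fileGroup struct {
	File      string
	BestScore float64
	Matches   []match
}

// groupByFile lifts a flat match list into per-file groups: files ordered by
// best score descending, matches inside each file by start line ascending.
func groupByFile(ms []match) []fileGroup {
	byFile := map[string]*fileGroup{}
	var order []string
	for _, m := range ms {
		g, ok := byFile[m.File]
		if !ok {
			g = &fileGroup{File: m.File}
			byFile[m.File] = g
			order = append(order, m.File)
		}
		g.Matches = append(g.Matches, m)
		if m.Score > g.BestScore {
			g.BestScore = m.Score
		}
	}
	out := make([]fileGroup, 0, len(order))
	for _, f := range order {
		g := byFile[f]
		sort.Slice(g.Matches, func(i, j int) bool {
			return g.Matches[i].StartLine < g.Matches[j].StartLine
		})
		out = append(out, *g)
	}
	sort.Slice(out, func(i, j int) bool {
		return out[i].BestScore > out[j].BestScore
	})
	return out
}
```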
Summary
Five-commit PR covering chunker robustness, end-to-end indexing reliability, and search-result quality.
1. Chunker — language registry + parse-budget + bash fallback (c502770)

   defaultRegistry ships 30 languages out of the box (Tier A: Go/Python/JS/
   TS/TSX/Java/C/C++/Ruby/PHP/Rust; Tier B: C#/Swift/Kotlin/Scala/Haskell/
   Elixir/Erlang/OCaml/Lua/Bash/HTML/CSS/SQL/YAML/JSON/TOML/Markdown/
   Dockerfile/Make). Configurable via CIX_ENABLED_LANGUAGES.
   parseBudget = 2s with twin guards (SetTimeoutMicros + SetCancellationFlag
   armed by time.AfterFunc). Tree-sitter-bash's catastrophic backtracking on
   a real install.sh (31s on 7KB) → falls back to regex. bashRegexChunks
   extracts `name() {`, `function name {`, and one-liners with a state
   machine that handles strings/comments/heredocs (<<EOF, <<-EOF, <<'EOF',
   <<< here-strings). 18 tests. extractName extended for
   word/field_identifier/simple_identifier/constant so
   bash/Go-method/Swift/Kotlin/Ruby chunks pick up the symbol name.

2. NDJSON streaming for /index/files + ctx-disconnect cancel (8e46c97)

   Accept: application/x-ndjson. Server-side 10s heartbeat + per-file events
   (file_started, file_chunked, file_embedded, file_done, file_error,
   batch_done). r.Context().Done() → calls CancelIndexing() to release the
   per-project session lock immediately. Was: lock survived until 1h TTL.
   streamingClient with Timeout: 0 and a 60s idle watchdog (6× heartbeat
   margin). New --idle-timeout flag. Render modes: Interactive (TTY,
   in-place status line), LineByLine (CI/non-TTY), Quiet (watcher).
   cix reindex propagates signals via signal.NotifyContext. Belt-and-braces
   deferred CancelIndex on error. ErrLegacyServer against an old server
   (operator deploy is server-first).

3. CLI polish — cix cancel + summary grouping + dev makefile (94ed0ff)

   cix cancel <project> frees a stuck session lock without waiting for TTL.
   cix summary groups "Top symbols" by language instead of mixing them.
   server/Makefile: docker-build-cuda-dev builds + pushes :cu128-dev for
   prod-style smoke testing before tagging a release.

4. Chunker fixes — markdown duplication + splitChunk symbol leakage (58de363)

   Drop atx_heading (sections already include it) → no more 1-2 line
   duplicate chunks for every ### foo heading. splitChunk: when an oversized
   function is cut into N pieces, only the first piece keeps
   SymbolName/SymbolSignature. Subsequent pieces become anonymous block
   chunks. Was: cix def run returned N hits at different line ranges of the
   same function.

5. Search merge — overlapping hits + windowed retrieval (f87e7e7)

   mergeOverlappingHits collapses results from the same file when one's
   range contains another's (markdown H1 ⊋ H2, Go class ⊋ method) OR when
   same-symbol pieces are adjacent (splitChunk leftovers). Inner matches
   recorded as NestedHits; merged score = max. When merge collapses the set
   below --limit, re-fetch with limit×2, ×4, ×8, ×16. Stops early on HNSW
   exhaustion. "+ N more match(es) inside:" breadcrumbs so the user sees WHY
   the outer chunk ranks well.

Test plan
* cd server && go test ./... — all green
* cd cli && go test ./... — all green
* make docker-build-cuda-dev → redeploy on prod RTX 3090 box
* cix reindex --full with the install.sh in tree (was hanging, then
  crashing) — should now complete in <2s for that file
* cix search "server start" against a fresh index — should NOT show 1-2 line
  markdown heading fragments or duplicate run() chunks
* cix reindex --full → immediate session-lock release; subsequent
  cix reindex works without 409
* cix watch mid-batch kill → same as above
* cix against new server → old single-JSON path still works
* cix against old server → fails fast with ErrLegacyServer

🤖 Generated with Claude Code