feat: search quality + NDJSON index + drop legacy/python-api #25
Merged
…bash regex fallback
Three related chunker improvements:
1. Language registry expansion. defaultRegistry now ships 30 languages
(Tier A: Go/Python/JS/TS/TSX/Java/C/C++/Ruby/PHP/Rust; Tier B: C#/Swift/
Kotlin/Scala/Haskell/Elixir/Erlang/OCaml/Lua/Bash/HTML/CSS/SQL/YAML/JSON
/TOML/Markdown/Dockerfile/Make). Configurable via CIX_ENABLED_LANGUAGES.
See doc/LANGUAGES.md.
2. Tree-sitter parse-budget guard. Bash grammar exhibits catastrophic
backtracking on real-world scripts (31s on 7KB install.sh). Added
parseBudget=2s with twin guards: SetTimeoutMicros + SetCancellationFlag
armed by time.AfterFunc. On budget exceeded, falls back to chunkFallback().
3. Bash regex fallback. New bashRegexChunks() recognises the three common
bash function forms (POSIX `name() {`, keyword `function name {`, with
or without parens) and finds each function's closing brace via a state
machine that handles single/double strings, comments, heredocs
(<<EOF / <<-EOF / <<'EOF' / <<"EOF"), and here-strings (<<<).
Module-type chunks fill gaps so top-level commands stay indexed.
18 tests cover all forms + edge cases (nested braces, strings/comments
with braces, install.sh repro, malformed/unbalanced).
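
As a rough illustration of the header recognition described in item 3, the sketch below matches the listed function forms with one Go regexp. The pattern and the `bashFuncName` helper are hypothetical and are not the PR's `bashRegexChunks()`; the brace-matching state machine (strings, comments, heredocs) is deliberately left out.

```go
package main

import (
	"fmt"
	"regexp"
)

// funcHeader is a hypothetical pattern covering:
//   name() {            (POSIX)
//   function name {     (keyword, no parens)
//   function name() {   (keyword, with parens)
var funcHeader = regexp.MustCompile(
	`^\s*(?:function\s+([A-Za-z_][A-Za-z0-9_:.-]*)\s*(?:\(\s*\))?|([A-Za-z_][A-Za-z0-9_:.-]*)\s*\(\s*\))\s*\{`)

// bashFuncName returns the function name if line opens a bash function.
func bashFuncName(line string) (string, bool) {
	m := funcHeader.FindStringSubmatch(line)
	if m == nil {
		return "", false
	}
	if m[1] != "" { // keyword form
		return m[1], true
	}
	return m[2], true // POSIX form
}

func main() {
	for _, l := range []string{
		"install_deps() {",
		"function cleanup {",
		"function main() {",
		"echo hello",
	} {
		name, ok := bashFuncName(l)
		fmt.Printf("%q -> name=%q matched=%v\n", l, name, ok)
	}
}
```

A header match only gives the start of a chunk; finding the closing brace still needs the quote- and heredoc-aware scan the commit describes.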
extractName extended with `word`, `field_identifier`, `simple_identifier`,
`constant` so tree-sitter chunks for bash/Go-method/Swift/Kotlin/Ruby pick
up the symbol name instead of falling back to nil.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…index/files

Replaces the single-JSON response of POST /index/files with an NDJSON
event stream when the client sends Accept: application/x-ndjson. Solves
three real-world pain points seen on heavy batches (e.g. 20 files = 347
chunks = 165s on a single GPU slot):

* CLI no longer hits its 600s http.Client deadline on long batches:
  per-event keepalive plus a server-side 10s heartbeat ticker keep the
  connection alive arbitrarily long. New streamingClient on the CLI uses
  Timeout: 0 with a 60s idle-watchdog instead.
* Server now detects client disconnect mid-batch via r.Context().Done()
  and immediately calls CancelIndexing() to release the per-project
  session lock. Previously the lock survived until the 1h TTL.
* Per-file progress is visible during the batch (file_started,
  file_chunked, file_embedded, file_done events). Three render modes on
  the CLI: Interactive (TTY status line with CR), LineByLine (CI /
  non-TTY), Quiet (watcher: only summaries + file_error).

Backwards compatibility is asymmetric and intentional: server still serves
old clients (no Accept header → existing single-JSON path). New CLI
hard-fails with ErrLegacyServer if it gets back Content-Type:
application/json, because the operator's deploy workflow is server-first.

Wire format and event schema documented in
server/internal/indexer/progress.go and mirrored in
cli/internal/client/progress.go.

SIGINT/SIGTERM in `cix reindex` now propagates via signal.NotifyContext:
HTTP request context cancels, server frees lock automatically.
Belt-and-braces deferred CancelIndex on error paths in indexer.Run() and
watcher Stop().

Tests:
* server/internal/httpapi/indexing_streaming_test.go: streaming happy
  path, disconnect-frees-lock (direct handler invocation with custom
  flushRecorder), legacy compat negotiation
* cli/internal/client/index_streaming_test.go: NDJSON parse, callback
  ordering, ErrLegacyServer hard-fail, idle timeout, retry on 503/429,
  back-compat SendFiles wrapper
* watcher tests updated to mock NDJSON responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…efile target

* New `cix cancel` command frees a stuck per-project session lock without
  having to wait for the 1h TTL. Pairs with the streaming-handler
  ctx-disconnect path: in normal use the lock auto-clears, this is the
  manual escape hatch.
* `cix summary` now groups "Top symbols" by language and shows up to N per
  language instead of one mixed list. Earlier output mixed Go/Python/JS
  symbols with no indication of which file they came from, which made the
  summary nearly useless on multi-language repos.
* server/Makefile: docker-build-cuda-dev target builds + pushes :cu128-dev
  for manual prod testing before tagging a release. Floating tag, no
  pinned variant; rollback isn't a concern for a dev tag.
* root Makefile: small build-target plumbing.
* doc/benchmark-cix-vs-grep.md: numbers from the search-vs-grep comparison
  done while debugging the install.sh hang. Tracked locally; not user
  documentation, more of an internal reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s symbol metadata

Two related fixes that cleaned up search/cix-def output for repos with
markdown docs and long Go/Python functions:

1. Markdown registry. tree-sitter-markdown's `section` already wraps the
   heading + its body, so listing both `section` and `atx_heading` in the
   type-nodes config emitted duplicate one-line chunks for every `### foo`
   heading (visible as Type: type | 1-2 line snippets in `cix search`
   output). Drop `atx_heading`, keep only `section`.
2. splitChunk. When tree-sitter emitted a function chunk larger than
   maxChunkSize (default 4500 chars), splitChunk cut it into N pieces and
   set SymbolName/SymbolSignature/ChunkType="function" on every piece.
   Result: cix def run returned N hits at different line ranges of the
   same function. Now only the FIRST piece carries the symbol metadata;
   subsequent pieces become anonymous `block` chunks. Full content of the
   symbol stays indexed for embed/FTS search; only the symbol-index
   attribution is consolidated.

Test: TestSplitChunk_OnlyFirstKeepsSymbol. Fixture is a 2000-line Python
function; asserts exactly one chunk in the result claims symbol=big_func.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
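
The splitChunk rule from item 2 can be illustrated with a small sketch. The `Chunk` type and `splitKeepFirstSymbol` are hypothetical simplifications: the split here is by byte count, whereas the real splitter presumably respects line boundaries and also carries signatures and line ranges.

```go
package main

import "fmt"

// Chunk is a simplified, hypothetical stand-in for the indexer's chunk type.
type Chunk struct {
	Content    string
	ChunkType  string
	SymbolName string
}

// splitKeepFirstSymbol cuts c.Content into pieces of at most maxSize bytes.
// Only the first piece keeps the symbol metadata; later pieces become
// anonymous "block" chunks so a symbol lookup returns a single hit.
func splitKeepFirstSymbol(c Chunk, maxSize int) []Chunk {
	if maxSize <= 0 || len(c.Content) <= maxSize {
		return []Chunk{c}
	}
	var out []Chunk
	for start := 0; start < len(c.Content); start += maxSize {
		end := start + maxSize
		if end > len(c.Content) {
			end = len(c.Content)
		}
		piece := Chunk{Content: c.Content[start:end], ChunkType: "block"}
		if start == 0 {
			piece.ChunkType = c.ChunkType
			piece.SymbolName = c.SymbolName
		}
		out = append(out, piece)
	}
	return out
}

func main() {
	big := Chunk{Content: "0123456789", ChunkType: "function", SymbolName: "big_func"}
	for i, p := range splitKeepFirstSymbol(big, 4) {
		fmt.Printf("%d: type=%s symbol=%q content=%q\n", i, p.ChunkType, p.SymbolName, p.Content)
	}
}
```

All pieces still carry content, so full-text and embedding search keep covering the whole function; only the symbol attribution is concentrated in the first piece.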
…b render

Tree-sitter emits nested chunks by design: a markdown H1 wraps its H2
sub-sections, a Go class wraps its methods, a Python module wraps its
classes. A vector search that hits the inner chunk also tends to hit (a
bit weaker) the outer chunk, and the user's --limit budget gets eaten by N
near-duplicates of the same code region. Same problem with splitChunk
leftovers when a long function is cut into pieces.

This change collapses overlapping results from the same file into a single
"outer" hit with the inner matches recorded as NestedHits. Two merge cases:

1. Strict containment: A.range ⊋ B.range and same file → absorb B.
2. Same-symbol adjacent: adjacent ranges where at least one carries a
   symbol name → absorb (catches splitChunk piece1 + piece2 leftovers).

Cross-file results are NEVER merged. Exact duplicates (same range twice)
are not merged either; those should be deduped at the vector-store layer
(already are, via dedupByLocation).

Windowed retrieval: instead of over-fetching limit×2 once, the search
handler now retries with limit×2, ×4, ×8, ×16 if mergeOverlappingHits
collapses the result set below the user's --limit. Stops early when the
vector store returns fewer rows than asked (HNSW exhausted) or when the
factor cap is hit. In practice the first window is enough; the loop exists
for repos with deeply nested markdown or many class+method hits inside the
same files.

Server changes:
- New file search_merge.go: mergeOverlappingHits, shouldMerge.
- searchResultItem gains NestedHits []nestedHit (omitempty).
- semanticSearchHandler refactored: extract fetchVectorResults +
  filterToSearchItems helpers, wrap call in factor loop, drop the early
  break-on-limit (merge needs the full filtered set to identify overlaps).
- 10 unit tests for mergeOverlappingHits + 1 integration test
  (TestSemanticSearch_NestedMarkdownMerge) verifying nested H1/H2 sections
  collapse to a single result with NestedHits populated.

CLI changes:
- SearchResult / NestedHit struct mirrors the server response.
- cix search render shows "+ N more match(es) inside:" with per-hit
  score/range/symbol so the user sees WHY the outer chunk ranks well even
  when the actual signal came from a sub-section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
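
A minimal sketch of merge case 1 (strict containment) follows. The `Hit` type and `mergeContained` are hypothetical and much simpler than the PR's `mergeOverlappingHits`: the same-symbol adjacency case is not handled, and the output is ordered by range width rather than by score.

```go
package main

import (
	"fmt"
	"sort"
)

// Hit is a hypothetical, reduced search hit.
type Hit struct {
	File       string
	Start, End int
	Score      float64
	Nested     []Hit
}

// strictlyContains reports whether outer's line range is a proper superset
// of inner's range within the same file. Equal ranges do not count.
func strictlyContains(outer, inner Hit) bool {
	return outer.File == inner.File &&
		outer.Start <= inner.Start && inner.End <= outer.End &&
		(outer.Start != inner.Start || outer.End != inner.End)
}

// mergeContained absorbs every hit that is strictly contained in another hit
// of the same file, recording it under the outer hit's Nested list.
func mergeContained(hits []Hit) []Hit {
	// Widest ranges first, so outer chunks are placed before their children.
	sorted := append([]Hit(nil), hits...)
	sort.SliceStable(sorted, func(i, j int) bool {
		return sorted[i].End-sorted[i].Start > sorted[j].End-sorted[j].Start
	})
	var out []Hit
	for _, h := range sorted {
		absorbed := false
		for i := range out {
			if strictlyContains(out[i], h) {
				out[i].Nested = append(out[i].Nested, h)
				absorbed = true
				break
			}
		}
		if !absorbed {
			out = append(out, h)
		}
	}
	return out
}

func main() {
	hits := []Hit{
		{File: "README.md", Start: 10, End: 40, Score: 0.80}, // H2
		{File: "README.md", Start: 1, End: 100, Score: 0.62}, // H1 wrapping both H2s
		{File: "README.md", Start: 50, End: 90, Score: 0.55}, // H2
		{File: "other.md", Start: 10, End: 40, Score: 0.50},  // different file
	}
	for _, h := range mergeContained(hits) {
		fmt.Printf("%s %d-%d nested=%d\n", h.File, h.Start, h.End, len(h.Nested))
	}
}
```

Because equal ranges are excluded and the file must match, exact duplicates and cross-file results pass through untouched, in line with the rules stated in the commit.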
…s inside

Changes the unit of search output from "chunk" to "file". Inspired by how
grep groups hits per file but with AST-aware match boundaries and
embedding-driven ranking.

Old wire shape: a flat list of chunks. A file with three matching chunks
ate three slots out of the user's --limit budget, scattered across the
result list, often with the file appearing at positions #3 and #10
simultaneously.

New wire shape:

    results: [
      {
        file_path, language, best_score,
        matches: [
          { start_line, end_line, score, content,
            chunk_type, symbol_name, nested_hits },
          ...
        ]
      }
    ]
    total: <distinct files>

Ranking:
* Files ordered by best_score (the highest match score in the group)
  descending.
* Inside each file, matches ordered by start_line ascending: natural
  reading order top-to-bottom.
* No per-file cap on matches. The only intra-file filter is min_score. A
  file with 50 matches above threshold shows all 50.

Window loop now targets distinct files, not chunks: factor 2..16, stops
when len(file_groups) >= limit, when the vector store returns fewer rows
than asked, or when the cap is hit.

mergeOverlappingHits still runs FIRST (collapses nested H1⊋H2⊋H3 etc. into
one match with nested_hits inside), then groupByFile lifts the survivors
into file-grouped output. So a markdown file with three nested sections
still produces ONE match (not one file with three), and a Go file with
class+method overlap still produces a clean class match with the method as
a nested hit.

CLI render redesigned around the new shape:

    1. /path/to/file.go  [best 0.85]  4 matches · go
       -- [0.85] lines 61-195  (function run)
          ```go
          ...
          ```
          + 1 more match inside:
            · [0.50] line 80  (function init)
       -- [0.42] lines 250-280  (type Server)
          ```go
          ...
          ```

Tests:
* groupByFile: sort-by-best-score, sort-matches-by-line, preserves
  nested_hits, empty input.
* TestSemanticSearch_NestedMarkdownMerge updated for the new shape; still
  asserts the H1 absorbs the two H2 sub-sections (now visible as
  group.Matches[0].NestedHits).
* CLI search_test fixture updated to new wire shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
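
The ranking rules above (files by best score descending, matches by start line ascending) can be sketched as follows. `Match`, `FileGroup` and this `groupByFile` are hypothetical reductions of the wire shape: fields such as `content` and `nested_hits` are omitted, and the real function's signature may differ.

```go
package main

import (
	"fmt"
	"sort"
)

// Match and FileGroup are hypothetical reductions of the new wire shape.
type Match struct {
	FilePath  string
	StartLine int
	Score     float64
}

type FileGroup struct {
	FilePath  string
	BestScore float64
	Matches   []Match
}

// groupByFile buckets matches per file, orders matches inside a file by
// start line, and orders files by their best match score, highest first.
func groupByFile(matches []Match) []FileGroup {
	index := map[string]int{} // file path -> position in groups
	var groups []FileGroup
	for _, m := range matches {
		i, ok := index[m.FilePath]
		if !ok {
			i = len(groups)
			index[m.FilePath] = i
			groups = append(groups, FileGroup{FilePath: m.FilePath, BestScore: m.Score})
		}
		g := &groups[i]
		g.Matches = append(g.Matches, m)
		if m.Score > g.BestScore {
			g.BestScore = m.Score
		}
	}
	for i := range groups {
		ms := groups[i].Matches
		sort.SliceStable(ms, func(a, b int) bool { return ms[a].StartLine < ms[b].StartLine })
	}
	sort.SliceStable(groups, func(a, b int) bool { return groups[a].BestScore > groups[b].BestScore })
	return groups
}

func main() {
	groups := groupByFile([]Match{
		{FilePath: "a.go", StartLine: 250, Score: 0.42},
		{FilePath: "b.go", StartLine: 5, Score: 0.60},
		{FilePath: "a.go", StartLine: 61, Score: 0.85},
	})
	for _, g := range groups {
		fmt.Printf("%s best=%.2f matches=%d first_line=%d\n",
			g.FilePath, g.BestScore, len(g.Matches), g.Matches[0].StartLine)
	}
}
```

With grouping in place, `--limit` counts files rather than chunks, so a file with many matches costs one slot.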
Removes the archived Python/FastAPI backend deprecated on 2026-04-24 per
the schedule in doc/MIGRATION_FROM_PYTHON.md. No Go code imports this tree;
the migration completed in server/v0.3.0.

Updated references:
* CONTRIBUTING.md: drop legacy row from repo-tree diagram
* doc/MIGRATION_FROM_PYTHON.md: past-tense, retain as historical / rollback
  recipe for the preserved :0.2-python-legacy Docker tag
* doc/DEPRECATION_POLICY.md: past-tense
* .github/workflows/codeql.yml: drop two now-stale comment fragments
* server/internal/vectorstore/store.go: rephrase docID comment to drop the
  dead path pointer; the byte-format invariant (md5[:6] → 12 hex chars +
  line range + idx) stays load-bearing for existing prod indexes
* README: .cixignore example legacy/ -> vendor/ to avoid implying the dir
  still ships

server/bench/queries.json keeps "legacy/" in anti_paths defensively; the
rules are inert with the dir gone.

Slates the next server tag at v0.4.0 per the deletion plan in
doc/MIGRATION_FROM_PYTHON.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Summary
Search-quality + CLI-UX batch on top of v0.3.0. Three threads, one PR
because they share wire contracts (`/index/files` streaming, `/search`
result shape) and the same UX surface (`cix search`, `cix reindex`).

Search result quality

* (14fc45a): `/search` now returns one entry per file with all matches
  nested inside, instead of a flat chunk list where one file could occupy
  several `--limit` slots. Wire shape: `results[].matches[]` +
  `total = distinct files`.
* (f87e7e7): nested tree-sitter chunks (markdown H1⊋H2, Go class⊋method,
  Python module⊋class) collapse into one outer hit with `nested_hits[]`
  recorded. `splitChunk` leftovers from the same symbol also merge.
* (f87e7e7): handler retries with `limit×2/4/8/16` if merging collapses
  below `--limit`. Stops early on HNSW exhaustion.
* (58de363): markdown registry no longer emits duplicate one-line
  `atx_heading` chunks; only the first `splitChunk` piece carries symbol
  metadata so `cix def` returns one hit per function instead of N.
* (f87e7e7 + 14fc45a): breadcrumb-style output:
  `path [best 0.85] 4 matches · go`, then per-match score/range/symbol with
  `+ N more match(es) inside:` for nested hits.

Indexing — streaming + cancel
* (8e46c97): `POST /index/files` with `Accept: application/x-ndjson`
  returns per-file events (`file_started` / `file_chunked` /
  `file_embedded` / `file_done` / `file_error`) plus a 10s heartbeat. Old
  clients (no Accept header) still get the single-JSON path. New CLI
  hard-fails on `application/json` (`ErrLegacyServer`) because the deploy
  is server-first.
* (8e46c97): server detects `r.Context().Done()` mid-batch and calls
  `CancelIndexing()` immediately, freeing the per-project session lock
  instead of waiting for the 1h TTL.
* `cix cancel` command (94ed0ff): manual escape hatch for stuck session
  locks.
* `cix reindex` uses `signal.NotifyContext` so Ctrl-C cancels the HTTP
  request; deferred `CancelIndex` on watcher / indexer error paths as
  belt-and-braces.
Chunker — language coverage + safety
* (c502770): default registry now ships Tier-A + Tier-B grammars;
  toggleable via `CIX_LANGUAGES`. See `doc/LANGUAGES.md`.
* (c502770): tree-sitter `SetTimeoutMicros` + `SetCancellationFlag` armed
  by `time.AfterFunc(2s)`. Catches catastrophic backtracking (bash on 7KB
  `install.sh` was 31s → bounded).
* (c502770): on parse-budget exceeded for bash, switches to a regex chunker
  that handles POSIX / `function` / parens-or-not function forms with a
  string/comment/heredoc-aware brace matcher.
CLI ergonomics
* `cix summary` Top symbols grouped by language (was a useless mixed list
  on multi-language repos).
* `cix search --exclude <path>` per-query escape hatch for noisy
  directories.
* `--min-score` raised `0.1 → 0.4`, calibrated for CodeRankEmbed-Q8_0 +
  path-aware embeddings. README has a new Tuning Search Quality section
  explaining the score landscape.
Cleanup — drop `legacy/python-api/` (f20e9c6)

* Removed per the schedule in `doc/MIGRATION_FROM_PYTHON.md`. No Go code
  imports it; the migration completed on 2026-04-24.
* Updated references: docs moved to past tense, `codeql.yml` comments, and
  the dead path reference in `server/internal/vectorstore/store.go`
  (byte-format invariant retained; just dropped the broken pointer).
* README example: `legacy/` → `vendor/` so the `.cixignore` snippet doesn't
  imply the directory still ships.
* `server/bench/queries.json`: `anti_paths` keep `legacy/` defensively
  (inert with the directory gone).
Compatibility / release implications
* `/search` JSON shape changed (chunk list → file list with nested
  matches). The CLI in this PR is updated to the new shape. Older CLI
  pinned to the prior shape will not parse new server responses; bump CLI
  on consumers when this server tag ships.
* `/index/files` streaming is opt-in via `Accept` header; legacy
  single-JSON path preserved for older clients.
* Next server tag: `server/v0.4.0` per the deletion plan in the migration
  doc.
* `cli/v0.4.0`; recommend `cli/v0.5.0` for the next CLI tag (new `cancel`
  command, breaking `--min-score` default, hard-fail on legacy server).
Test plan
* `go test ./...` green in `server/` and `cli/`
* `go vet ./...` clean
* `grep -rIn 'legacy/python-api'` returns nothing: no stale path
  references left in the tree
* `cix search "<vague query>"` on this repo: verify file-grouped output,
  breadcrumb render, nested-hit rollup
* `cix reindex --full` on a 20+ file batch: verify NDJSON progress events
  stream, no 600s timeout
* `cix reindex` then Ctrl-C mid-batch: verify `cix status` shows no stuck
  session (no 409 on next reindex)
* `cix cancel` after a deliberately stuck session: verify lock releases
* `cix summary` on a multi-language repo: verify Top symbols grouped per
  language
🤖 Generated with Claude Code