
feat: search quality + NDJSON index + drop legacy/python-api #25

Merged
dvcdsys merged 9 commits into main from feat/search-quality-and-cli-ux on Apr 28, 2026

@dvcdsys dvcdsys commented Apr 28, 2026

Summary

Search-quality + CLI-UX batch on top of v0.3.0. Three threads, one PR
because they share wire contracts (/index/files streaming, /search
result shape) and the same UX surface (cix search, cix reindex).

Search result quality

  • File-grouped results (14fc45a): /search now returns one entry
    per file with all matches nested inside, instead of a flat chunk list
    where one file could occupy --limit slots. Wire shape:
    results[].matches[] + total = distinct files.
  • Overlap merging (f87e7e7): nested tree-sitter chunks (markdown
    H1⊋H2, Go class⊋method, Python module⊋class) collapse into one outer
    hit with nested_hits[] recorded. splitChunk leftovers from the
    same symbol also merge.
  • Windowed retrieval (f87e7e7): handler retries with
    limit×2/4/8/16 if merging collapses below --limit. Stops early on
    HNSW exhaustion.
  • Chunker quality fixes (58de363): markdown registry no longer
    emits duplicate one-line atx_heading chunks; only the first
    splitChunk piece carries symbol metadata so cix def returns one
    hit per function instead of N.
  • CLI render (f87e7e7 + 14fc45a): breadcrumb-style output. Each file
    prints a header such as "path [best 0.85] 4 matches · go", then one
    entry per match with score, line range, and symbol, followed by a
    "+ N more match(es) inside:" block for nested hits.

Indexing — streaming + cancel

  • NDJSON streaming (8e46c97): POST /index/files with
    Accept: application/x-ndjson returns per-file events
    (file_started / file_chunked / file_embedded / file_done /
    file_error) plus a 10s heartbeat. Old clients (no Accept header)
    still get the single-JSON path. New CLI hard-fails on
    application/json (ErrLegacyServer) — server-first deploy.
  • Disconnect cancels (8e46c97): server detects r.Context().Done()
    mid-batch and calls CancelIndexing() immediately, freeing the
    per-project session lock instead of waiting for the 1h TTL.
  • cix cancel command (94ed0ff): manual escape hatch for stuck
    session locks.
  • SIGINT propagation: cix reindex uses signal.NotifyContext so
    Ctrl-C cancels the HTTP request; deferred CancelIndex on watcher
    / indexer error paths as belt-and-braces.

Chunker — language coverage + safety

  • 30+ languages (c502770): default registry now ships Tier-A +
    Tier-B grammars; toggleable via CIX_LANGUAGES. See
    doc/LANGUAGES.md.
  • Parse-budget guard (c502770): tree-sitter SetTimeoutMicros +
    SetCancellationFlag armed by time.AfterFunc(2s). Catches
    catastrophic backtracking (bash on 7KB install.sh was 31s →
    bounded).
  • Bash regex fallback (c502770): on parse-budget exceeded for
    bash, switches to a regex chunker that handles POSIX / function /
    parens-or-not function forms with a string/comment/heredoc-aware
    brace matcher.

CLI ergonomics

  • cix summary: Top symbols are now grouped by language (previously a
    single mixed list, which was useless on multi-language repos).
  • cix search --exclude <path>: per-query escape hatch for noisy
    directories.
  • Default --min-score raised 0.1 → 0.4, calibrated for
    CodeRankEmbed-Q8_0 + path-aware embeddings. README has a new
    Tuning Search Quality section explaining the score landscape.

Cleanup — drop legacy/python-api/ (f20e9c6)

  • Removes the archived 940KB Python/FastAPI tree per the schedule in
    doc/MIGRATION_FROM_PYTHON.md. No Go
    code imports it; the migration completed on 2026-04-24.
  • Updates CONTRIBUTING.md tree diagram, deprecation/migration docs to
    past tense, codeql.yml comments, and the dead path reference in
    server/internal/vectorstore/store.go (byte-format invariant
    retained — just dropped the broken pointer).
  • README example changed from legacy/ to vendor/ so the .cixignore
    snippet doesn't imply the directory still ships.
  • Bench-eval anti_paths keep legacy/ defensively (inert with the
    directory gone).

Compatibility / release implications

  • /search JSON shape changed (chunk list → file list with nested
    matches). The CLI in this PR is updated to the new shape. An older
    CLI pinned to the prior shape will not parse new server responses,
    so bump the CLI on consumers when this server tag ships.
  • /index/files streaming is opt-in via Accept header; legacy
    single-JSON path preserved for older clients.
  • Legacy-removal commit slates next server tag as server/v0.4.0
    per the deletion plan in the migration doc.
  • CLI is at cli/v0.4.0; recommend cli/v0.5.0 for the next CLI
    tag (new cancel command, breaking --min-score default, hard-fail
    on legacy server).

Test plan

  • go test ./... green in server/ and cli/
  • go vet ./... clean
  • grep -rIn 'legacy/python-api' returns nothing — no stale path
    references left in the tree
  • CI passes (ci-server.yml + ci-cli.yml + codeql.yml)
  • Manual: cix search "<vague query>" on this repo — verify
    file-grouped output, breadcrumb render, nested-hit rollup
  • Manual: cix reindex --full on a 20+ file batch — verify NDJSON
    progress events stream, no 600s timeout
  • Manual: cix reindex then Ctrl-C mid-batch — verify
    cix status shows no stuck session (no 409 on next reindex)
  • Manual: cix cancel after a deliberately stuck session —
    verify lock releases
  • Manual: cix summary on a multi-language repo — verify Top
    symbols grouped per language

🤖 Generated with Claude Code

dvcdsys and others added 9 commits April 27, 2026 21:50
…bash regex fallback

Three related chunker improvements:

1. Language registry expansion. defaultRegistry now ships 30 languages
   (Tier A: Go/Python/JS/TS/TSX/Java/C/C++/Ruby/PHP/Rust; Tier B: C#/Swift/
   Kotlin/Scala/Haskell/Elixir/Erlang/OCaml/Lua/Bash/HTML/CSS/SQL/YAML/JSON
   /TOML/Markdown/Dockerfile/Make). Configurable via CIX_ENABLED_LANGUAGES.
   See doc/LANGUAGES.md.

2. Tree-sitter parse-budget guard. Bash grammar exhibits catastrophic
   backtracking on real-world scripts (31s on 7KB install.sh). Added
   parseBudget=2s with twin guards: SetTimeoutMicros + SetCancellationFlag
   armed by time.AfterFunc. On budget exceeded, falls back to chunkFallback().

3. Bash regex fallback. New bashRegexChunks() recognises the three common
   bash function forms (POSIX `name() {`, keyword `function name {`, with
   or without parens) and finds each function's closing brace via a state
   machine that handles single/double strings, comments, heredocs
   (<<EOF / <<-EOF / <<'EOF' / <<"EOF"), and here-strings (<<<).
   Module-type chunks fill gaps so top-level commands stay indexed.
   18 tests cover all forms + edge cases (nested braces, strings/comments
   with braces, install.sh repro, malformed/unbalanced).
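
As a rough illustration of the header-recognition half of item 3, the sketch below matches the three function forms with one regular expression. The pattern and the `funcName` helper are illustrative assumptions, not the PR's `bashRegexChunks()` implementation; it expects the opening brace on the same line and does not attempt the string/comment/heredoc-aware brace matching described above.

```go
package main

import (
	"fmt"
	"regexp"
)

// bashFuncHeader is an illustrative pattern, not the PR's actual regex.
// Group 1: name after the `function` keyword (parens optional).
// Group 2: name in the POSIX `name() {` form.
var bashFuncHeader = regexp.MustCompile(
	`^\s*(?:function\s+([A-Za-z_][A-Za-z0-9_-]*)\s*(?:\(\s*\))?|([A-Za-z_][A-Za-z0-9_-]*)\s*\(\s*\))\s*\{`)

// funcName returns the declared function name, or "" if the line is not a
// function header in one of the recognised forms.
func funcName(line string) string {
	m := bashFuncHeader.FindStringSubmatch(line)
	if m == nil {
		return ""
	}
	if m[1] != "" {
		return m[1]
	}
	return m[2]
}

func main() {
	for _, line := range []string{
		"install() {",
		"function setup {",
		"function cleanup() {",
		"echo hello",
	} {
		fmt.Printf("%q -> %q\n", line, funcName(line))
	}
}
```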

extractName extended with `word`, `field_identifier`, `simple_identifier`,
`constant` so tree-sitter chunks for bash/Go-method/Swift/Kotlin/Ruby pick
up the symbol name instead of falling back to nil.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…index/files

Replaces the single-JSON response of POST /index/files with an NDJSON event
stream when the client sends Accept: application/x-ndjson. Solves three
real-world pain points seen on heavy batches (e.g. 20 files = 347 chunks =
165s on a single GPU slot):

* CLI no longer hits its 600s http.Client deadline on long batches —
  per-event keepalive plus a server-side 10s heartbeat ticker keep the
  connection alive arbitrarily long. New streamingClient on the CLI uses
  Timeout: 0 with a 60s idle-watchdog instead.
* Server now detects client disconnect mid-batch via r.Context().Done()
  and immediately calls CancelIndexing() to release the per-project session
  lock. Previously the lock survived until the 1h TTL.
* Per-file progress is visible during the batch (file_started, file_chunked,
  file_embedded, file_done events). Three render modes on the CLI:
  Interactive (TTY status line with CR), LineByLine (CI / non-TTY),
  Quiet (watcher — only summaries + file_error).

Backwards compatibility is asymmetric and intentional: server still serves
old clients (no Accept header → existing single-JSON path). New CLI hard-
fails with ErrLegacyServer if it gets back Content-Type: application/json,
because the operator's deploy workflow is server-first.
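
The asymmetric negotiation can be sketched as follows. This is an assumption-laden illustration, not the PR's handler: `indexFiles`, `contentTypeFor`, `requireStream`, the `files_indexed` field, and the event payloads are all hypothetical, and `errLegacyServer` merely stands in for the CLI's `ErrLegacyServer`.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"strings"
)

const ndjsonType = "application/x-ndjson"

// errLegacyServer stands in for the CLI's ErrLegacyServer.
var errLegacyServer = errors.New("server answered application/json; deploy the new server first")

// indexFiles is a stand-in handler showing only the Accept-based branch.
func indexFiles(w http.ResponseWriter, r *http.Request) {
	if strings.Contains(r.Header.Get("Accept"), ndjsonType) {
		w.Header().Set("Content-Type", ndjsonType)
		enc := json.NewEncoder(w) // Encode writes one value followed by a newline
		_ = enc.Encode(map[string]string{"event": "file_started"})
		if f, ok := w.(http.Flusher); ok {
			f.Flush() // push the event to the client immediately
		}
		_ = enc.Encode(map[string]string{"event": "file_done"})
		return
	}
	// Legacy path: a single JSON document for clients without the Accept header.
	w.Header().Set("Content-Type", "application/json")
	_ = json.NewEncoder(w).Encode(map[string]int{"files_indexed": 1})
}

// contentTypeFor runs the handler in-process and reports its Content-Type.
func contentTypeFor(accept string) string {
	req := httptest.NewRequest(http.MethodPost, "/index/files", nil)
	if accept != "" {
		req.Header.Set("Accept", accept)
	}
	rec := httptest.NewRecorder()
	indexFiles(rec, req)
	return rec.Header().Get("Content-Type")
}

// requireStream is the client-side check: anything but NDJSON is a hard failure.
func requireStream(contentType string) error {
	if strings.HasPrefix(contentType, ndjsonType) {
		return nil
	}
	return errLegacyServer
}

func main() {
	fmt.Println(requireStream(contentTypeFor(ndjsonType))) // streaming path
	fmt.Println(requireStream(contentTypeFor("")))         // legacy path
}
```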

Wire format and event schema documented in server/internal/indexer/progress.go
and mirrored in cli/internal/client/progress.go.

SIGINT/SIGTERM in `cix reindex` now propagates via signal.NotifyContext —
HTTP request context cancels, server frees lock automatically. Belt-and-
braces deferred CancelIndex on error paths in indexer.Run() and watcher
Stop().

Tests:
* server/internal/httpapi/indexing_streaming_test.go — streaming happy
  path, disconnect-frees-lock (direct handler invocation with custom
  flushRecorder), legacy compat negotiation
* cli/internal/client/index_streaming_test.go — NDJSON parse, callback
  ordering, ErrLegacyServer hard-fail, idle timeout, retry on 503/429,
  back-compat SendFiles wrapper
* watcher tests updated to mock NDJSON responses

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…efile target

* New `cix cancel` command frees a stuck per-project session lock without
  having to wait for the 1h TTL. Pairs with the streaming-handler ctx-
  disconnect path: in normal use the lock auto-clears, this is the manual
  escape hatch.

* `cix summary` now groups "Top symbols" by language and shows up to N per
  language instead of one mixed list. Earlier output mixed Go/Python/JS
  symbols with no indication of which file they came from, which made the
  summary nearly useless on multi-language repos.

* server/Makefile: docker-build-cuda-dev target builds + pushes :cu128-dev
  for manual prod testing before tagging a release. Floating tag, no
  pinned variant — rollback isn't a concern for a dev tag.

* root Makefile: small build-target plumbing.

* doc/benchmark-cix-vs-grep.md: numbers from the search-vs-grep comparison
  done while debugging the install.sh hang. Tracked locally — not user
  documentation, more of an internal reference.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s symbol metadata

Two related fixes that cleaned up search/cix-def output for repos with
markdown docs and long Go/Python functions:

1. Markdown registry. tree-sitter-markdown's `section` already wraps
   the heading + its body, so listing both `section` and `atx_heading`
   in the type-nodes config emitted duplicate one-line chunks for every
   `### foo` heading (visible as Type: type | 1-2 line snippets in
   `cix search` output). Drop `atx_heading` and keep only `section`.

2. splitChunk. When tree-sitter emitted a function chunk larger than
   maxChunkSize (default 4500 chars), splitChunk cut it into N pieces and
   set SymbolName/SymbolSignature/ChunkType="function" on every piece.
   Result: cix def run returned N hits at different line ranges of the
   same function. Now only the FIRST piece carries the symbol metadata;
   subsequent pieces become anonymous `block` chunks. Full content of
   the symbol stays indexed for embed/FTS search — only the symbol-index
   attribution is consolidated.
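
The splitChunk rule in item 2 can be sketched as below. This is a simplified stand-in: the `chunk` type and its fields are assumptions, and the split is by byte count, whereas the real splitter presumably also tracks line ranges and signatures.

```go
package main

import "fmt"

// chunk is a simplified stand-in for the PR's chunk type.
type chunk struct {
	Content    string
	ChunkType  string
	SymbolName string
}

// splitChunk cuts c into pieces of at most maxSize bytes. Only the first
// piece keeps the symbol metadata; later pieces become anonymous "block"
// chunks, so a symbol lookup returns a single hit.
func splitChunk(c chunk, maxSize int) []chunk {
	if maxSize <= 0 || len(c.Content) <= maxSize {
		return []chunk{c}
	}
	var out []chunk
	for start := 0; start < len(c.Content); start += maxSize {
		end := start + maxSize
		if end > len(c.Content) {
			end = len(c.Content)
		}
		piece := chunk{Content: c.Content[start:end], ChunkType: "block"}
		if start == 0 {
			piece.ChunkType = c.ChunkType
			piece.SymbolName = c.SymbolName
		}
		out = append(out, piece)
	}
	return out
}

func main() {
	pieces := splitChunk(chunk{Content: "abcdefghij", ChunkType: "function", SymbolName: "big_func"}, 4)
	for i, p := range pieces {
		fmt.Printf("%d: type=%s symbol=%q len=%d\n", i, p.ChunkType, p.SymbolName, len(p.Content))
	}
}
```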

Test: TestSplitChunk_OnlyFirstKeepsSymbol — fixture is a 2000-line Python
function, asserts exactly one chunk in the result claims symbol=big_func.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…b render

Tree-sitter emits nested chunks by design — a markdown H1 wraps its H2
sub-sections, a Go class wraps its methods, a Python module wraps its
classes. A vector search that hits the inner chunk also tends to hit (a
bit weaker) the outer chunk, and the user's --limit budget gets eaten
by N near-duplicates of the same code region. Same problem with
splitChunk leftovers when a long function is cut into pieces.

This change collapses overlapping results from the same file into a
single "outer" hit with the inner matches recorded as NestedHits. Two
merge cases:

1. Strict containment — A.range ⊋ B.range and same file → absorb B.
2. Same-symbol adjacent — adjacent ranges where at least one carries a
   symbol name → absorb (catches splitChunk piece1 + piece2 leftovers).

Cross-file results are NEVER merged. Exact duplicates (same range twice)
are not merged either — those should be deduped at the vector-store
layer (already are, via dedupByLocation).
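
The two merge cases can be expressed as a small predicate. The sketch below is an illustration under stated assumptions: the `hit` type is hypothetical, and "adjacent" is taken to mean that the second range starts on the line right after the first ends, which the commit message does not spell out.

```go
package main

import "fmt"

// hit is a hypothetical, trimmed-down search result.
type hit struct {
	FilePath   string
	StartLine  int
	EndLine    int
	SymbolName string
}

// strictlyContains reports whether a's range is a proper superset of b's.
func strictlyContains(a, b hit) bool {
	covers := a.StartLine <= b.StartLine && b.EndLine <= a.EndLine
	same := a.StartLine == b.StartLine && a.EndLine == b.EndLine
	return covers && !same
}

// shouldMerge sketches the two cases: strict containment, or adjacent
// ranges where at least one side carries a symbol name.
func shouldMerge(outer, inner hit) bool {
	if outer.FilePath != inner.FilePath {
		return false // cross-file results are never merged
	}
	if strictlyContains(outer, inner) {
		return true
	}
	adjacent := inner.StartLine == outer.EndLine+1 // assumed definition
	return adjacent && (outer.SymbolName != "" || inner.SymbolName != "")
}

func main() {
	h1 := hit{FilePath: "README.md", StartLine: 1, EndLine: 100}
	h2 := hit{FilePath: "README.md", StartLine: 10, EndLine: 20}
	piece1 := hit{FilePath: "run.go", StartLine: 1, EndLine: 50, SymbolName: "run"}
	piece2 := hit{FilePath: "run.go", StartLine: 51, EndLine: 90}

	fmt.Println("containment:", shouldMerge(h1, h2))
	fmt.Println("same-symbol adjacent:", shouldMerge(piece1, piece2))
	fmt.Println("exact duplicate:", shouldMerge(h1, h1))
}
```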

Windowed retrieval: instead of over-fetching limit×2 once, the search
handler now retries with limit×2, ×4, ×8, ×16 if mergeOverlappingHits
collapses the result set below the user's --limit. Stops early when the
vector store returns fewer rows than asked (HNSW exhausted) or when the
factor cap is hit. In practice the first window is enough; the loop
exists for repos with deeply nested markdown or many class+method
hits inside the same files.
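
A minimal sketch of the window loop, assuming the vector store and the merge step can be modelled as plain functions. `windowedSearch`, `dedupe`, and the toy in-memory store are hypothetical; the real handler works on search items rather than strings.

```go
package main

import "fmt"

const maxFactor = 16 // factor cap from the description above

// windowedSearch widens the fetch window (limit×2, ×4, ×8, ×16) until the
// merged set reaches limit, the store runs dry, or the cap is hit.
func windowedSearch(limit int, fetch func(n int) []string, merge func([]string) []string) []string {
	var merged []string
	for factor := 2; factor <= maxFactor; factor *= 2 {
		want := limit * factor
		rows := fetch(want)
		merged = merge(rows)
		if len(merged) >= limit || len(rows) < want {
			break // enough results, or the store returned fewer rows than asked
		}
	}
	if len(merged) > limit {
		merged = merged[:limit]
	}
	return merged
}

// dedupe is a toy merge step: it keeps the first occurrence of each row.
func dedupe(rows []string) []string {
	seen := map[string]bool{}
	var out []string
	for _, r := range rows {
		if !seen[r] {
			seen[r] = true
			out = append(out, r)
		}
	}
	return out
}

func main() {
	store := []string{"a", "a", "a", "a", "b", "b", "b", "b", "c", "d"}
	calls := 0
	fetch := func(n int) []string {
		calls++
		if n > len(store) {
			n = len(store)
		}
		return store[:n]
	}
	fmt.Println(windowedSearch(3, fetch, dedupe), "fetches:", calls)
}
```

With this toy store and a limit of 3, the first window of six rows merges down to two entries, so a second, wider window is needed before the limit is met.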

Server changes:
- New file search_merge.go — mergeOverlappingHits, shouldMerge.
- searchResultItem gains NestedHits []nestedHit (omitempty).
- semanticSearchHandler refactored: extract fetchVectorResults +
  filterToSearchItems helpers, wrap call in factor loop, drop the early
  break-on-limit (merge needs the full filtered set to identify
  overlaps).
- 10 unit tests for mergeOverlappingHits + 1 integration test
  (TestSemanticSearch_NestedMarkdownMerge) verifying nested H1/H2
  sections collapse to a single result with NestedHits populated.

CLI changes:
- SearchResult / NestedHit struct mirrors the server response.
- cix search render shows "+ N more match(es) inside:" with
  per-hit score/range/symbol so the user sees WHY the outer chunk
  ranks well even when the actual signal came from a sub-section.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
…s inside

Changes the unit of search output from "chunk" to "file". Inspired by
how grep groups hits per file but with AST-aware match boundaries and
embedding-driven ranking.

Old wire shape: a flat list of chunks. A file with three matching
chunks ate three slots out of the user's --limit budget, scattered
across the result list, often with the file appearing at positions #3
and #10 simultaneously.

New wire shape:
  results: [
    {
      file_path, language, best_score,
      matches: [
        { start_line, end_line, score, content, chunk_type, symbol_name,
          nested_hits },
        ...
      ]
    }
  ]
  total: <distinct files>

Ranking:
* Files ordered by best_score (the highest match score in the group)
  descending.
* Inside each file, matches ordered by start_line ascending — natural
  reading order top-to-bottom.
* No per-file cap on matches. The only intra-file filter is min_score.
  A file with 50 matches above threshold shows all 50.
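
The wire shape and ranking rules can be sketched together. The JSON keys below follow the shape listed above, but the Go type names are assumptions, and `content`, `chunk_type`, and `nested_hits` are omitted for brevity; this is not the PR's `groupByFile` source.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

// match and fileGroup mirror a subset of the wire shape.
type match struct {
	FilePath   string  `json:"-"` // lifted to the group level
	Language   string  `json:"-"`
	StartLine  int     `json:"start_line"`
	EndLine    int     `json:"end_line"`
	Score      float64 `json:"score"`
	SymbolName string  `json:"symbol_name,omitempty"`
}

type fileGroup struct {
	FilePath  string  `json:"file_path"`
	Language  string  `json:"language"`
	BestScore float64 `json:"best_score"`
	Matches   []match `json:"matches"`
}

// groupByFile buckets hits per file, orders matches by start line, and
// orders files by their best score, highest first.
func groupByFile(hits []match) []fileGroup {
	byPath := map[string]*fileGroup{}
	var order []string
	for _, h := range hits {
		g, ok := byPath[h.FilePath]
		if !ok {
			g = &fileGroup{FilePath: h.FilePath, Language: h.Language, BestScore: h.Score}
			byPath[h.FilePath] = g
			order = append(order, h.FilePath)
		}
		if h.Score > g.BestScore {
			g.BestScore = h.Score
		}
		g.Matches = append(g.Matches, h)
	}
	groups := make([]fileGroup, 0, len(order))
	for _, p := range order {
		g := byPath[p]
		sort.Slice(g.Matches, func(i, j int) bool { return g.Matches[i].StartLine < g.Matches[j].StartLine })
		groups = append(groups, *g)
	}
	sort.SliceStable(groups, func(i, j int) bool { return groups[i].BestScore > groups[j].BestScore })
	return groups
}

func main() {
	groups := groupByFile([]match{
		{FilePath: "a.go", Language: "go", StartLine: 250, EndLine: 280, Score: 0.42, SymbolName: "Server"},
		{FilePath: "b.go", Language: "go", StartLine: 1, EndLine: 5, Score: 0.60},
		{FilePath: "a.go", Language: "go", StartLine: 61, EndLine: 195, Score: 0.85, SymbolName: "run"},
	})
	out, _ := json.MarshalIndent(map[string]any{"results": groups, "total": len(groups)}, "", "  ")
	fmt.Println(string(out))
}
```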

Window loop now targets distinct files, not chunks: factor 2..16,
stops when len(file_groups) >= limit, when the vector store returns
fewer rows than asked, or when the cap is hit.

mergeOverlappingHits still runs FIRST (collapses nested H1⊋H2⊋H3 etc.
into one match with nested_hits inside), then groupByFile lifts the
survivors into file-grouped output. So a markdown file with three
nested sections still produces ONE match (not one file with three),
and a Go file with class+method overlap still produces a clean class
match with the method as a nested hit.

CLI render redesigned around the new shape:
  1. /path/to/file.go  [best 0.85]  4 matches · go
     -- [0.85] lines 61-195  (function run)
        ```go
        ...
        ```
        + 1 more match inside:
          · [0.50] line 80  (function init)
     -- [0.42] lines 250-280  (type Server)
        ```go
        ...
        ```

Tests:
* groupByFile: sort-by-best-score, sort-matches-by-line, preserves
  nested_hits, empty input.
* TestSemanticSearch_NestedMarkdownMerge updated for the new shape —
  still asserts the H1 absorbs the two H2 sub-sections (now visible
  as group.Matches[0].NestedHits).
* CLI search_test fixture updated to new wire shape.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Removes the archived Python/FastAPI backend deprecated on 2026-04-24
per the schedule in doc/MIGRATION_FROM_PYTHON.md. No Go code imports
this tree; the migration completed in server/v0.3.0.

Updated references:
* CONTRIBUTING.md: drop legacy row from repo-tree diagram
* doc/MIGRATION_FROM_PYTHON.md: past-tense, retain as historical /
  rollback recipe for the preserved :0.2-python-legacy Docker tag
* doc/DEPRECATION_POLICY.md: past-tense
* .github/workflows/codeql.yml: drop two now-stale comment fragments
* server/internal/vectorstore/store.go: rephrase docID comment to
  drop the dead path pointer; the byte-format invariant (md5[:6] →
  12 hex chars + line range + idx) stays load-bearing for existing
  prod indexes
* README: .cixignore example legacy/ -> vendor/ to avoid implying
  the dir still ships

server/bench/queries.json keeps "legacy/" in anti_paths defensively;
the rules are inert with the dir gone.

Slates the next server tag at v0.4.0 per the deletion plan in
doc/MIGRATION_FROM_PYTHON.md.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@dvcdsys dvcdsys merged commit 5bcf8bf into main Apr 28, 2026
9 checks passed