On-disk persistence layer #33
Merged
Per-rule implementations:
store_kuzu: Cypher path matches caller's KindFile node, follows
its EdgeImports adjacency to the imported file nodes, then
finds candidates whose file_path matches an imported file.
Unique candidate across the import set wins.
store_duckdb: SQL UPDATE...FROM with a 5-way JOIN:
edges → nodes(caller) → nodes(caller's file) → edges(imports) →
nodes(imported file) → nodes(candidate). HAVING COUNT(DISTINCT)
= 1 enforces uniqueness.
Filters skip stub-id imported files (external::*, unresolved::*)
so the rule doesn't bind through unresolved chains.
This is the highest-coverage rule for Python / JS / Rust where the
import set is the canonical visibility scope. On the storetest
fixture (caller imports lib.go which exports Target) the rule
rewrites the unresolved::Target edge in a single Cypher / SQL
statement — no Go iteration, no per-edge GetNode round-trip.
Conformance: 9/9 backend-resolver subtests pass on both backends.
The fixture-based test asserts the rewritten edge points at the
expected lib.go::Target node and survives the AllBulk chain.
Two-pass implementation per backend: one for `<stem>.py`, one for
`<stem>/__init__.py`. Either suffix that matches an existing
KindFile node rewrites the edge.
store_kuzu: per-suffix Cypher MATCH+DELETE+CREATE. Cypher's
string concat (`||` in some dialects) is `+` in Kuzu, so the
suffix is inlined as a literal in each pass.
store_duckdb: per-suffix UPDATE...FROM with a CTE that joins
the unresolved edge against the KindFile candidate via
substring(e.to_id, 20) — pyrel prefix is 19 chars
("unresolved::pyrel::"), 20 = 1-indexed start of the stem.
The 19-char prefix length: "unresolved::" (12) + "pyrel::" (7).
Future Dart support would add a third pass with a different
prefix and convention; calling code passes lang="python" (or
empty == all dialects) so the API is forward-compatible.
Conformance: 9/9 backend-resolver subtests pass. The fixture
asserts `unresolved::pyrel::app/util` rewrites to
`app/util.py` when that file node exists in the graph.
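The two-suffix lookup can be sketched in Go as follows. This is only an illustration of the prefix arithmetic and candidate order, not the backend code: `pyrelPrefix` and `pyrelCandidates` are hypothetical names, and the real passes run inside Cypher / SQL and match against KindFile nodes.

```go
package main

import (
	"fmt"
	"strings"
)

// pyrelPrefix is the stub prefix described above: "unresolved::" (12 chars)
// plus "pyrel::" (7 chars), 19 chars in total.
const pyrelPrefix = "unresolved::pyrel::"

// pyrelCandidates returns the file paths a relative-import stub could
// resolve to: "<stem>.py" first, then "<stem>/__init__.py".
// Hypothetical helper for illustration only.
func pyrelCandidates(toID string) ([]string, bool) {
	if !strings.HasPrefix(toID, pyrelPrefix) {
		return nil, false
	}
	stem := strings.TrimPrefix(toID, pyrelPrefix)
	if stem == "" {
		return nil, false
	}
	return []string{stem + ".py", stem + "/__init__.py"}, true
}

func main() {
	cands, ok := pyrelCandidates("unresolved::pyrel::app/util")
	fmt.Println(len(pyrelPrefix), ok, cands)
}
```

Whichever candidate matches an existing file node would then receive the rewritten edge.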
Per-backend implementations:
store_kuzu: two-step Cypher pass.
1. Upgrade stub Node rows that AddEdge's mergeStubNodeLocked
created with empty kind: set kind='external' and derive
name from id (strip the 'external::' prefix).
2. Promote edge origin to ast_resolved for every edge whose
to_id starts with 'external::' and lacks origin metadata.
store_duckdb: three statements because DuckDB's AddBatch does
NOT auto-stub endpoints.
1. INSERT distinct external::* rows where the node is missing
(INSERT ... ON CONFLICT DO NOTHING for idempotency).
2. UPDATE pre-existing rows whose kind is empty / wrong.
3. UPDATE edges to promote origin/tier to ast_resolved.
This pass replaces what the Go-side SynthesizeExternalCalls did
on the shadow path — for the DB-delegated cold-load it's the only
way the indexer learns about external::* targets without
materializing the edge list in Go.
Conformance: 9/9 pass on both backends. Fixture asserts the
external::npm/foo::bar node exists post-resolve when the only
input was an edge pointing at it.
Per-rule implementations:
store_kuzu: Cypher MATCH+DELETE+CREATE with the cross-repo
candidate constraint (caller.repo_prefix <> target.repo_prefix
AND both non-empty). Sets cross_repo=1 on the created edge —
Kuzu's schema declares the column INT64, not BOOL, so the
literal must be the integer form.
store_duckdb: SQL UPDATE...FROM with a CTE selecting unique
cross-repo candidates. Schema there has cross_repo BOOLEAN so
TRUE works.
Both rules fire only when caller.repo_prefix is non-empty
(no-op in single-repo mode) and require COUNT(*)=1 cross-repo
candidates to avoid mis-binding across siblings.
Conformance: 9/9 backend-resolver subtests pass on both
backends. Fixture asserts an r1 → r2 cross-repo binding when
r1/a.go::Caller has unresolved::Target and r2/x.go::Target is
the only candidate outside r1.
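The candidate constraint and the COUNT(*)=1 guard amount to the following logic, shown here as a Go sketch. The `node` struct and `uniqueCrossRepoTarget` are hypothetical stand-ins; the shipped rules express this in Cypher / SQL, not in Go.

```go
package main

import "fmt"

// node is a pared-down stand-in for a graph node (hypothetical shape).
type node struct {
	ID         string
	Name       string
	RepoPrefix string
}

// uniqueCrossRepoTarget returns the single candidate named name that lives
// in a different, non-empty repo than the caller. It reports false when the
// caller has no repo prefix (single-repo mode) or when zero or several
// candidates qualify.
func uniqueCrossRepoTarget(callerRepo, name string, nodes []node) (node, bool) {
	if callerRepo == "" {
		return node{}, false
	}
	var found node
	count := 0
	for _, n := range nodes {
		if n.Name == name && n.RepoPrefix != "" && n.RepoPrefix != callerRepo {
			found = n
			count++
		}
	}
	if count != 1 {
		return node{}, false
	}
	return found, true
}

func main() {
	nodes := []node{
		{ID: "r1/a.go::Caller", Name: "Caller", RepoPrefix: "r1"},
		{ID: "r2/x.go::Target", Name: "Target", RepoPrefix: "r2"},
	}
	target, ok := uniqueCrossRepoTarget("r1", "Target", nodes)
	fmt.Println(target.ID, ok)
}
```

Adding a second `Target` in a third repo makes the lookup return false, which is the mis-binding guard the fixture relies on.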
Phase 3 complete: 6/7 BackendResolver methods now ship per-rule
Cypher + SQL implementations on Kuzu and DuckDB. Only
ResolveUniqueNames (already in store.go from earlier work)
remains in its original location — Phase 4 will port the full
set to Cozo (Datalog) and Ladybug.
Ladybug is a Kuzu fork; its Cypher dialect is byte-compatible with
Kuzu's, so the Phase 2 + 3 implementations port verbatim. Copy of
store_kuzu/backend_resolver.go with the package name swapped.
Also refactors the existing store_ladybug ResolveUniqueNames
(originally a Kuzu copy with the targets[0] AS target pattern) into
the same two-pass form the Kuzu side adopted: OPTIONAL MATCH + count
for the uniqueness check, then a re-MATCH that keeps target typed as
Node so the CREATE binder accepts it.
Conformance: 9/9 backend-resolver subtests pass. The 38-subtest
RunConformance suite is unchanged.
Full Cozo Datalog port of the 7 BackendResolver methods plus
ResolveAllBulk. The implementation is structurally different from
Cypher/SQL because Cozo's Datalog is not a constraint solver — it
won't invert concat() to derive variables, and it has no
substring function. Three patterns make the port workable:
- Extract embedded names via regex_replace
(`name = regex_replace(to_id_old, '^unresolved::', '')`) which
binds the variable in one step rather than relying on
concat-inversion.
- Aggregation in the rule head:
`cand_counts[from_id, to_id_old, count(target_id)] := body`
groups by the non-aggregated head columns implicitly, then
`unique_edges` filters by `cnt == 1`.
- Mutations: every rule does query → :rm old logical key
→ :put new row under one writeMu hold (Cozo has no in-place
UPDATE for stored relations; the composite primary key is
part of what changes when to_id is rewritten).
Per-rule notes:
- ResolveRelativeImports: uses ends_with + regex_replace to
pull the stem from the candidate file path (.py or
/__init__.py), then concat-joins against the unresolved
pyrel:: target.
- ResolveExternalCallStubs: two-phase — (1) regex-derive the
name from external::* edge targets and :put missing nodes;
(2) :rm + :put edges to promote origin to ast_resolved.
- ResolveCrossRepo: sets cross_repo=true (Cozo's column is
Bool) on rewritten edges. Same uniqueness pattern as the
other rules.
Conformance: 9/9 backend-resolver subtests pass plus the existing
38 RunConformance subtests.
Linux-scale bench delivered the final number: Cozo indexes at
854s (comparable to DuckDB) but query latency lands at p50 4.7
seconds, p95 6.6 seconds. The cause is cozo-lib-go not exposing
prepared statements — every GetNode / FindNodesByName re-parses
its Datalog query from a string. Acceptable on the BackendResolver
bulk-pass shape (one parse, many rows) but unusable for the
read-heavy MCP / daemon query surface where the binding is hit
hundreds of times per request.
The 65 MB on-disk footprint (smallest of every backend tested)
isn't worth the roughly four-order-of-magnitude query regression vs
Kuzu (700 µs) or sqlite (479 µs at Linux scale).
Deletes:
- internal/graph/store_cozo/ (store + methods + backend resolver + tests)
- bench/store-bench/cozo_register.go (build-tag-isolated factory)
- bench/store-bench/registry.go (the cozoFactory hook — no more
Rust-backend collisions to worry about)
- skip-cozo flag + wantCozo wiring in main.go
- cozo step in run-linux.sh / run-linux-rest.sh
- github.com/cozodb/cozo-lib-go + github.com/stretchr/objx from go.mod
Conformance: 526 tests pass (the BackendResolver + storetest + indexer
+ resolver suites). The four remaining viable backends are kuzu,
ladybug, duckdb, sqlite — all already validated with the full
BackendResolver Cypher / SQL implementations.
The shadow-swap path reassigned idx.graph to an in-memory shadow
during IndexCtx so the resolver and post-resolve passes could run at
memory latency, but idx.resolver was constructed at indexer.New with
the disk Store and never updated. ResolveAll's
r.graph.EdgesWithUnresolvedTarget() queried the empty disk Store,
returned zero pending edges, and the function short-circuited on
len(pending) == 0, silently disabling every resolver pass (module
attribution, relative imports, cross-package guards, edge in-place
resolution, ...) for backends that opt into the swap.
Symptom on the gortex bench: in-memory backend produced 36 KindModule
nodes for Python pypi/stdlib imports that every disk backend was
missing, and kuzu/ladybug had to auto-stub ~70k unresolved::*
placeholders that the resolver would normally have bound.
Add Resolver.SetGraph and call it in the shadow swap (and the
deferred restore) so r.graph follows idx.graph through the swap.
SetGraph also re-binds r.mu to the new store's ResolveMutex so
concurrent resolvers on the disk store still serialise correctly
after the swap completes.
Regression test indexes the same Python project into both a *Graph
and a sqlite Store and asserts both produce the same node-ID set,
with the pypi/stdlib KindModule nodes as the canary.
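The stale-reference shape and the fix can be reduced to a few lines. The sketch below uses simplified, hypothetical types (`store`, `memStore`, `resolver`, `indexer`) that only borrow the names from this commit, and it leaves out the ResolveMutex re-binding.

```go
package main

import "fmt"

// store is a minimal stand-in for the graph store interface (hypothetical).
type store interface {
	EdgesWithUnresolvedTarget() []string
}

type memStore struct{ pending []string }

func (m *memStore) EdgesWithUnresolvedTarget() []string { return m.pending }

// resolver keeps its own reference to the store; that copy is what went stale.
type resolver struct{ graph store }

// SetGraph re-points the resolver at a new store.
func (r *resolver) SetGraph(g store) { r.graph = g }

// ResolveAll returns how many pending edges it saw; zero means it
// short-circuited without running any pass.
func (r *resolver) ResolveAll() int { return len(r.graph.EdgesWithUnresolvedTarget()) }

type indexer struct {
	graph    store
	resolver *resolver
}

// indexWithShadow swaps in the shadow for the resolve pass and restores the
// disk store afterwards, keeping the resolver in step both times.
func (idx *indexer) indexWithShadow(shadow store) int {
	disk := idx.graph
	idx.graph = shadow
	idx.resolver.SetGraph(shadow) // without this call the resolver reads the empty disk store
	defer func() {
		idx.graph = disk
		idx.resolver.SetGraph(disk)
	}()
	return idx.resolver.ResolveAll()
}

func main() {
	disk := &memStore{}
	shadow := &memStore{pending: []string{"unresolved::Target"}}
	idx := &indexer{graph: disk, resolver: &resolver{graph: disk}}
	fmt.Println("pending edges seen by resolver:", idx.indexWithShadow(shadow))
}
```

Removing the first `SetGraph` call reproduces the bug in miniature: the resolver sees zero pending edges even though the shadow holds one.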
The headline query-p50 / p95 column collapses six different access
patterns into one number, hiding that sqlite wins point lookups
(~20µs) while losing on bulk name searches (~30ms) and the Cypher
backends are the inverse. Split the workload into per-tool
measurements that map to the MCP tools agents actually invoke:
  get_symbol        -> Store.GetNode
  get_dependencies  -> Store.GetOutEdges
  find_usages       -> Store.GetInEdges + EdgeReferences filter
  get_callers       -> Store.GetInEdges + EdgeCalls filter
  search_symbols    -> Store.FindNodesByName
  get_file_summary  -> Store.GetFileNodes
The headline aggregate still rides on the result for backwards-compat
with prior bench markdown.
Also drop the stale cozo reference from run-linux-rest.sh's header
comment: cozo was removed earlier; the runner script already only
dispatches ladybug, duckdb, sqlite.
Two standalone diagnostics that index the same repo through two
backends (memory + sqlite) and report the symmetric diff of the
resulting node / edge sets. Caught the shadow-swap resolver-redirect
bug (resolver pointed at the empty disk Store, so module attribution
and edge in-place resolution silently no-op'd for every backend that
opted into the shadow swap) — 36 Python KindModule nodes were missing
on disk, every disk-backed run.
Beyond the original investigation they keep paying for themselves:
node-diff: lists which IDs one backend has that the other dropped,
with a kind / lang / empty-field histogram so the cause
is obvious at a glance.
edge-diff: same shape for edges, classifies the diff by
(Kind, FromKind, ToKind), and reports raw vs. unique-key
counts so a dedup-index bug surfaces as duplicate slots
instead of being masked by AllEdges()'s collapse.
Run periodically when changing the indexer pipeline, the resolver, or
adding a new store backend. Outputs go to bench/results/.
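The core of both diagnostics is a symmetric set difference over IDs. A minimal Go sketch, assuming node IDs are plain strings; `symmetricDiff` is a hypothetical name and the histogram / classification steps are omitted:

```go
package main

import (
	"fmt"
	"sort"
)

// symmetricDiff returns the IDs present only in a and only in b, each sorted.
func symmetricDiff(a, b []string) (onlyA, onlyB []string) {
	inA := make(map[string]bool, len(a))
	for _, id := range a {
		inA[id] = true
	}
	inB := make(map[string]bool, len(b))
	for _, id := range b {
		inB[id] = true
	}
	for id := range inA {
		if !inB[id] {
			onlyA = append(onlyA, id)
		}
	}
	for id := range inB {
		if !inA[id] {
			onlyB = append(onlyB, id)
		}
	}
	sort.Strings(onlyA)
	sort.Strings(onlyB)
	return onlyA, onlyB
}

func main() {
	memory := []string{"app/main.py::run", "module::pypi:requests"}
	disk := []string{"app/main.py::run"}
	onlyMem, onlyDisk := symmetricDiff(memory, disk)
	fmt.Println("only in memory:", onlyMem)
	fmt.Println("only on disk:", onlyDisk)
}
```

Because the inputs are collapsed into sets, this form cannot see duplicate slots; that is why edge-diff also reports raw counts alongside unique-key counts.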
The Go extractor builds EdgeMemberOf targets as
`<methodfile>::TypeName` because it parses one file at a time
(internal/parser/languages/golang.go:955). Methods declared in any
file other than the type's defining file emit edges pointing at a
phantom ID — the real type node lives in a different file with a
different `<file>::TypeName` ID.
Without this pass, every Go type whose methods span multiple files
shows up as N separate "partial types" in the graph:
- InferImplements (resolver.go:1764) keys its typeID→method-set map
on the phantom IDs, so a type with 50 methods across 10 files
appears as 10 partial types with ~5 methods each. Any interface
that needs methods from more than one file is silently NOT
inferred — find_implementations / class_hierarchy / get_callers
over interface methods all return partial results.
- kuzu / ladybug materialise an empty Node row for every phantom
target (rel-table FK), inflating their node counts; gortex bench
surfaced 139 such phantoms on the gortex codebase alone (Indexer
methods spread across crash_isolation.go, dataflow.go,
transform.go, ...; Server methods across the internal/mcp tree).
Memory / sqlite / duckdb tolerated edges-without-nodes so the bug
was invisible at the storage level — but they were silently wrong
about interface satisfaction for the same set of cross-file types.
The pass indexes every Go KindType / KindInterface node by
(filepath.Dir, name), then walks EdgeMemberOf and rewrites the
target from `<methodfile>::Type` to `<typefile>::Type` when exactly
one canonical match exists in the same package. Ambiguous matches
(two distinct types with the same name in the same package, which
shouldn't happen in valid Go) leave the edge alone rather than
guess. Non-Go method nodes are skipped — Java / Python / TS group
methods inside the class body in the same file, so the cross-file
pattern doesn't arise.
Verified on the gortex codebase: 139 suspect cross-file phantoms
collapse to 0 after the pass; total kuzu node count drops by 169
matching real-type rows (the +30 over 139 is non-determinism from
parallel resolution).
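The rebind rule can be sketched as below. The sketch assumes IDs have the `<file>::<TypeName>` shape described above with slash-separated paths (hence `path.Dir` rather than `filepath.Dir`); `pkgKey`, `indexTypes` and `rebind` are hypothetical names.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// pkgKey identifies a type by its package directory and name.
type pkgKey struct{ dir, name string }

// indexTypes groups real type node IDs ("<file>::<Type>") by (dir, name).
func indexTypes(typeIDs []string) map[pkgKey][]string {
	index := make(map[pkgKey][]string)
	for _, id := range typeIDs {
		file, name, ok := strings.Cut(id, "::")
		if !ok {
			continue
		}
		k := pkgKey{path.Dir(file), name}
		index[k] = append(index[k], id)
	}
	return index
}

// rebind maps a "<methodfile>::Type" target onto the canonical
// "<typefile>::Type" ID when exactly one such type exists in the same
// directory; otherwise it returns the target unchanged.
func rebind(target string, index map[pkgKey][]string) string {
	file, name, ok := strings.Cut(target, "::")
	if !ok {
		return target
	}
	matches := index[pkgKey{path.Dir(file), name}]
	if len(matches) != 1 {
		return target // no match or ambiguous: leave the edge alone
	}
	return matches[0]
}

func main() {
	index := indexTypes([]string{"internal/indexer/indexer.go::Indexer"})
	fmt.Println(rebind("internal/indexer/dataflow.go::Indexer", index))
}
```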
Indexes a repo through kuzu and classifies its node set into real
(kind/name/file populated) vs stub (all blank but ID), buckets stubs
by ID-prefix family, and flags "suspect" stubs whose ID shape DOESN'T
match any known synthetic prefix. Those are the candidates for
parser/resolver bugs that produce edges to non-existent nodes.
Caught the cross-file Go method-receiver bug fixed in the previous
commit: 139 Go types with methods spread across files were each
materialised as one phantom-per-method-file because the parser built
the EdgeMemberOf target from the method's own file, not the type's
defining file. The diagnostic surfaced them, the rebind pass
collapsed them; this harness is the guard against the same shape
regressing on other languages (or the same shape on Go after future
extractor changes).
Output goes to bench/results/kuzu-stubs-*.txt. Re-run when changing
the Go extractor, adding a new language, or modifying the resolver's
EdgeMemberOf machinery.
The Go dataflow walker built local-binding IDs as
`<owner>#local:<name>@<absoluteLine>` and closure IDs as
`<owner>#closure@<absoluteLine>`. Adding an unrelated line above a
function shifted every local and closure ID inside it, so the
incremental indexer had to delete + re-insert every dataflow /
closure edge in the function on every save — O(bindings-in-file)
churn per edit.
Switch the encoding to `@+<offset>` where the offset is the
binding's 1-based line minus the function's declaration line. The
leading `+` marks the value unambiguously as an offset; the IDs
stay stable under shifts of the function as a whole. Only edits
*inside* the function above a binding shift that binding's ID —
unavoidable, because the offset is the disambiguator for the same
name re-bound at different lines.
The closure Node's Name field still carries the absolute line so
search results / outlines render the human-meaningful position.
Regression tests cover three properties:
- locals stay stable when lines are added *above* the function,
- locals shift correctly when lines are added *inside* above the
binding (the intentional case — protects the re-bind
disambiguator),
- closures get the same offset treatment.
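The encoding itself is small enough to show in full. A sketch under the assumption that both line numbers are 1-based and available to the walker; `localID` and `closureID` are hypothetical helper names:

```go
package main

import "fmt"

// localID builds "<owner>#local:<name>@+<offset>", where offset is the
// binding's line minus the function's declaration line.
func localID(owner, name string, bindingLine, funcLine int) string {
	return fmt.Sprintf("%s#local:%s@+%d", owner, name, bindingLine-funcLine)
}

// closureID builds "<owner>#closure@+<offset>" with the same offset rule.
func closureID(owner string, closureLine, funcLine int) string {
	return fmt.Sprintf("%s#closure@+%d", owner, closureLine-funcLine)
}

func main() {
	// Function declared on line 10, binding on line 12.
	before := localID("pkg/a.go::Run", "err", 12, 10)
	// Five lines added above the function: both positions shift together.
	after := localID("pkg/a.go::Run", "err", 17, 15)
	fmt.Println(before, before == after)
	fmt.Println(closureID("pkg/a.go::Run", 14, 10))
}
```

An edit inside the function above the binding changes only `bindingLine`, so the offset (and the ID) moves, which is the intended disambiguation behaviour.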
The Go dataflow walker used to emit local-binding IDs only as edge
endpoints (`<owner>#local:<name>@+<offset>`) without ever calling
AddNode for them — the rationale at the time was to keep BM25
search clean of every transient `err` / `data` / `i`. The cost
showed up on storage backends that enforce rel-table foreign-key
integrity (Kuzu, Ladybug): for every dataflow edge that targeted a
local, COPY had to auto-stub an empty Node row to satisfy the FK.
On the gortex codebase alone this was ~51k phantom stubs, ~80% of
the entire FK-stub population.
The pattern was also semantically inconsistent — KindParam and
KindClosure are intra-function bindings too, and BOTH are
materialised as first-class nodes (16k params + 2k closures on
gortex). Locals were the lone holdout.
Lift them: every binding declared in declareTarget /
handleRangeClause now produces a KindLocal node (Name = identifier,
FilePath = the file the binding lives in, StartLine = its 1-based
line, Language = "go") plus an EdgeMemberOf edge back to the
enclosing function or method. The walker dedups via emittedLocals
so a binding visited through multiple walk paths still produces
exactly one node row.
Search hygiene preserved at the index boundary:
shouldIndexForSearch returns false for KindLocal so BM25 / Bleve
never see them — consumers that explicitly want locals (a
`kind: "local"` query) can still find them, but the default name
lookup is unaffected.
Bench effect on gortex (kuzu backend):
before — 193,343 nodes (129,733 real / 63,610 stubs)
after — 197,742 nodes (185,778 real / 11,964 stubs)
↳ stubs −51,646 (every intra-function binding now a real node),
real +56,045 (locals + the few non-local stubs that also
promoted), remaining stubs are the unresolved::* / external::*
population the resolver couldn't bind.
Regression tests cover three properties:
- KindLocal nodes get emitted for every short_var_decl /
var_spec / range-clause binding, with the canonical ID and an
EdgeMemberOf edge to the enclosing function,
- a binding visited multiple times produces exactly one node row,
- shouldIndexForSearch returns false for KindLocal so name
lookups don't surface intra-function bindings.
Walks every `unresolved::<bareName>` edge whose source sits inside a
function and rewrites the target onto the matching KindLocal /
KindParam node declared in that function's scope. Pre-#77 there was
nothing to bind to — locals were edge-endpoint-only — so the
worker-pool fallback ran a graph-wide FindNodesByName and gave up
on the ambiguity, falling through to `unresolved::*` for every
common identifier (err / data / src / out / ...). With #77's
KindLocal materialisation the scope is first-class and the bind
becomes an O(matching-name) walk over a per-owner index built once
per ResolveAll.
Precedence rules implemented:
- KindLocal beats KindParam (Go shadowing semantics).
- Among locals, the latest StartLine that's still <= the
reference line wins (standard "last shadow in scope" rule).
- Ambiguous cases (two candidates at the same StartLine, no
candidate before the reference, …) leave the edge untouched so
the unresolved audit still surfaces them.
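A compact Go sketch of these precedence rules, for one function's scope. The `binding` struct and `bindBareName` are hypothetical, and the sketch assumes a parameter is only considered when no local declared at or before the reference line matches.

```go
package main

import "fmt"

// binding is a simplified view of a KindLocal / KindParam node in one
// function's scope (hypothetical shape).
type binding struct {
	ID        string
	Name      string
	IsLocal   bool // false means the binding is a parameter
	StartLine int
}

// bindBareName returns the ID a bare-name reference at refLine should bind
// to, or "" when the edge has to stay unresolved.
func bindBareName(name string, refLine int, scope []binding) string {
	var bestLocal, param string
	bestLine, localTies, paramCount := -1, 0, 0
	for _, b := range scope {
		if b.Name != name {
			continue
		}
		if !b.IsLocal {
			param = b.ID
			paramCount++
			continue
		}
		if b.StartLine > refLine {
			continue // declared after the reference
		}
		switch {
		case b.StartLine > bestLine:
			bestLocal, bestLine, localTies = b.ID, b.StartLine, 1
		case b.StartLine == bestLine:
			localTies++
		}
	}
	switch {
	case localTies == 1:
		return bestLocal // latest shadow in scope wins, and beats any param
	case localTies > 1:
		return "" // same-line shadows are ambiguous
	case paramCount == 1:
		return param
	}
	return ""
}

func main() {
	scope := []binding{
		{ID: "f#param:err", Name: "err", IsLocal: false, StartLine: 10},
		{ID: "f#local:err@+2", Name: "err", IsLocal: true, StartLine: 12},
		{ID: "f#local:err@+5", Name: "err", IsLocal: true, StartLine: 15},
	}
	fmt.Println(bindBareName("err", 13, scope))
}
```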
Scope today is Go-only — TypeScript / Python don't materialise
locals yet, so their `unresolved::<name>` edges naturally degrade
to a no-op (empty owner-index for those functions). The TS / py
local-materialisation passes are a separate follow-up.
Bench effect on gortex:
before — 183,145 unresolved::* edges across 8,387 unique IDs
after — 137,533 edges across 5,155 IDs (-45.6k edges, -3.2k IDs)
bucket: bare-name 115,711 → 70,031 (the 45k absorbed local/param
references now navigate to first-class nodes; the residual 70k
is dominated by Go builtins, addressed in the next step).
Regression test matrix covers the following properties:
- local takes precedence over a same-named param,
- param falls through when no local matches,
- From IDs with #local: / #param: suffix still resolve via the
enclosing function,
- references before a binding's StartLine are NOT bound to it,
- the most recent shadow wins,
- ambiguous same-line shadows leave the edge unresolved,
- qualified shapes (*.Method, pkg.Name, pyrel::*) are untouched.
The Go extractor materialises every `[T any]` / `[T comparable, U ~int]`
declaration as a KindGenericParam node with ID `<func>#tparam:<name>`
and an EdgeMemberOf back to the owner. Until now the resolver never
consulted these when an in-body reference (`var x T`, return type `T`,
`instantiate[T]`) landed as `unresolved::T`, so they stayed as phantoms.
The pass mirrors bindBareNameScopeRefs: index every Go
KindGenericParam by enclosing-function ID up front, walk the edge
kinds that can carry tparam refs (EdgeReferences, EdgeTypedAs,
EdgeReturns, EdgeInstantiates), and rewrite To onto the matching
tparam node when the source's enclosing function is the one that
declared it. Cross-function bindings are explicitly left alone:
function B referring to `T` does NOT bind to function A's `T`.
Side benefit: `find_usages` on a generic type parameter starts
working (*"where in this generic function is T used?"*), which is a
real refactoring query for the body of any generic helper.
Bench effect on gortex: unresolved::* down only ~130 edges because
what looked like 5k `unresolved::T` references in the audit is
dominated by `testing.T` typed-param mis-classifications (the parser
stripped the `testing.` qualifier and we got `unresolved::T` for every
`func TestX(t *testing.T)`); Step 4's qualifier-preservation will
route those to `stdlib::testing::T` properly. The genuinely generic
refs (the smaller subset) do bind cleanly.
Regression tests cover: in-function bind succeeds, cross-function
bind is refused, qualified shapes (*.T, pkg.T) are untouched.
The Go extractor emitted every reference to append / len / make /
string / int / float64 / ... as `unresolved::<name>` because the
parser doesn't carry a language-intrinsic classifier. The resolver
fell through to its worker-pool fallback which gave up on the
ambiguity, leaving ~50k edges per gortex-scale Go codebase pointing
at phantoms.
These calls/typeRefs aren't unresolved; they're language primitives.
Rewrite them at the resolver layer onto canonical `builtin::go::*`
IDs and materialise one KindBuiltin node per unique builtin so the
rewritten edges land on a real graph node:
  builtin::go::append        (functions: append/len/make/...)
  builtin::go::type::string  (types: string/int/float64/...)
  builtin::go::const::iota   (constants: iota/nil/true/false)
KindBuiltin is a new NodeKind, excluded from BM25 search
(shouldIndexForSearch) for the same reason as KindLocal: surfacing
`string` / `len` / `append` from every search would drown signal.
It's a cross-repo singleton like KindModule
(`module::pypi:requests`), so the multi-repo prefix-parity tests get
an explicit allow-list update.
Pass runs after Step 1 (scope-bind) and Step 2 (generic-param) so the
bare-name bucket is consumed in the right order: locals take
precedence over builtins (a user-defined `len` shadows the builtin),
then unresolved names get the builtin treatment. Re-run from
ResolveFile so incremental reindex converges with a cold full index
(the load-bearing TestIncrementalReindex_ConvergesToFullIndex
contract).
Bench effect on gortex:
  before: 137,533 unresolved::* edges across 5,155 IDs
  after:  92,130 edges across 5,147 IDs (-45.4k edges)
  bare-name 70,031 → 24,564 (the remaining 24k are user-defined bare
  names the resolver still can't bind; Step 4 / Step 5 cover the
  *.method and external-call buckets)
Side benefit: `find_usages(builtin::go::type::float64)` becomes a
real query that answers "every variable typed as float64 in this
codebase", which unlocks the type-drift / dataflow analyses the user
called out as the load-bearing case for promoting builtins.
Regression tests cover: function call, type ref, constant ref,
non-Go cross-binding refusal, dedup of the materialised KindBuiltin
across many edges, qualified shapes left alone, unknown names left
alone. Two pre-existing multi-repo tests updated to exempt
KindBuiltin (and KindModule) from the per-repo prefix rule.
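The ID mapping can be sketched as a lookup over three small tables. The tables below are deliberately partial, and `builtinID` plus the table names are hypothetical; only the `builtin::go::*` ID shapes come from this commit.

```go
package main

import (
	"fmt"
	"strings"
)

// Partial, illustrative tables; a real pass would cover every Go builtin.
var (
	goBuiltinFuncs  = map[string]bool{"append": true, "len": true, "make": true}
	goBuiltinTypes  = map[string]bool{"string": true, "int": true, "float64": true}
	goBuiltinConsts = map[string]bool{"iota": true, "nil": true, "true": true, "false": true}
)

// builtinID maps "unresolved::<bareName>" onto its canonical builtin ID.
// Qualified shapes (*.Method, pkg.Name) and unknown names report false.
func builtinID(target string) (string, bool) {
	const prefix = "unresolved::"
	if !strings.HasPrefix(target, prefix) {
		return "", false
	}
	name := strings.TrimPrefix(target, prefix)
	switch {
	case goBuiltinFuncs[name]:
		return "builtin::go::" + name, true
	case goBuiltinTypes[name]:
		return "builtin::go::type::" + name, true
	case goBuiltinConsts[name]:
		return "builtin::go::const::" + name, true
	}
	return "", false
}

func main() {
	for _, t := range []string{"unresolved::append", "unresolved::float64", "unresolved::iota", "unresolved::*.len"} {
		id, ok := builtinID(t)
		fmt.Println(t, "->", id, ok)
	}
}
```

The sketch does not model the ordering against the scope-bind pass; it assumes locals that shadow a builtin were already bound before this lookup runs.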
The Go dataflow walker (go_dataflow.go) collapsed every
`selector_expression` to `unresolved::*.<method>` when emitting
arg_of / returns_to / value_flow edges, even when the receiver
was a package alias the file's import map already named. The
explicit comment at calleeRef line 542 acknowledged it:
> Receiver-typed targets (e.g. an import alias dispatch)
> can't be reconstructed without the file's import map.
> Fall through to the generic "*." form
— so every `fmt.Sprintf(...)`, `strings.Join(...)`,
`assert.True(t, ...)`, `os.ModePerm` reference inside a dataflow
context leaked the qualifier and landed as an `unresolved::*.*`
phantom. The call extractor's own emit path already used the
imports map correctly (`unresolved::extern::<importPath>::<method>`,
resolved downstream by resolveExtern to stdlib::/dep::/external::);
the dataflow walker just hadn't been given access to the same map.
Thread `imports map[string]string` from emitFunction / emitMethod
→ emitGoFunctionShape → emitGoDataflow → goFlowWalker. Both
selector-shaped exits in the walker now look up the operand's
identifier in the imports map first:
- calleeRef (selector_expression call): `pkg.Method(x)` →
`unresolved::extern::<importPath>::Method`
- exprSources (selector_expression value): `pkg.Name` →
`unresolved::extern::<importPath>::Name`
When the operand isn't a known package alias (it's a local
variable, struct-field chain, or some other receiver), the
fallback to `unresolved::*.Method` stays — those need
receiver-type inference, which is a separate follow-up.
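Both selector exits reduce to the same lookup. A minimal sketch, assuming the imports map is keyed by the package alias used in the file and valued by the import path; `selectorRef` is a hypothetical name, not the walker's actual function:

```go
package main

import "fmt"

// selectorRef builds the edge target for `operand.member`. When operand is a
// known package alias the import path is preserved; otherwise the generic
// "*." form is kept as a hint for receiver-type inference.
func selectorRef(imports map[string]string, operand, member string) string {
	if importPath, ok := imports[operand]; ok {
		return "unresolved::extern::" + importPath + "::" + member
	}
	return "unresolved::*." + member
}

func main() {
	imports := map[string]string{
		"fmt":    "fmt",
		"assert": "github.com/stretchr/testify/assert",
	}
	fmt.Println(selectorRef(imports, "fmt", "Sprintf"))
	fmt.Println(selectorRef(imports, "assert", "True"))
	fmt.Println(selectorRef(imports, "t", "Helper")) // t is a param, not a package
}
```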
Bench effect on gortex:
before — 92,167 unresolved::* edges across 5,147 IDs
after — 61,450 edges across 4,853 IDs (-30.7k edges)
bucket: *.method-unknown-receiver 67,461 → 36,776 (the rest
are local-receiver / chain-selector cases that need richer
type tracking).
Once Step 5 lands and materialises stdlib::/dep::/external::
targets as KindFunction nodes, every package-qualified call
that was leaking through here will navigate to a real graph
node — "who in this codebase calls fmt.Sprintf" becomes a
one-hop find_usages.
Regression tests:
- SelectorCallPreservesPackageQualifier: package-qualified
call sites land on extern:: shape, not *.method.
- NonImportedReceiverFallsBack: receiver that's NOT a package
alias (a param) still uses the `*.` fallback so receiver-
type inference downstream still has its hint.
- SelectorValuePreservesQualifier: covers exprSources (value
access, not invocation), guards both selector exits.
After resolveExtern classifies `unresolved::extern::<path>::<symbol>`
edge targets into the three external-prefix buckets
(stdlib::, dep::, external::), the targets sit in the graph as
phantom edge endpoints — they're FK stubs on Kuzu / Ladybug and
invisible nodes on memory / sqlite / duckdb. That blocks the
queries the user called out as the load-bearing case for
promoting externals:
- "every function in this codebase that calls json.Marshal"
- "what's our usage surface on testify?"
- "if we vendor X, what symbols are we depending on?"
The new attributeGoExternalCalls pass walks the same edge kinds
attributeGoBuiltins does, collects every unique
(prefix, importPath, symbol) triple, and materialises:
- One KindModule node per import path
(`module::go:fmt`, `module::go:encoding/json`,
`module::go:github.com/stretchr/testify/assert`) shared across
every repo that uses it, with Meta.role = stdlib|dep|external.
- One KindFunction node per (prefix, path, symbol) with the
original target ID preserved so existing edges keep landing
on it without rewriting. Meta.external = true and
Meta.module_path / Meta.module_role record the lineage.
- An EdgeMemberOf edge from the symbol to its parent module so
`get_callers(module::go:encoding/json)` answers "every symbol
in this codebase that comes from encoding/json".
Mirrors the existing attributeNonGoModuleImports pass for
Python / Dart pypi modules. All AddNode / AddEdge calls are
idempotent on ID so re-running the pass from ResolveFile during
incremental reindex is a no-op.
Bench effect on gortex (post Step 4 → post Step 5):
kuzu node count 193,343 → 195,769 (+2,426 = the new
stdlib/dep symbols)
kuzu stubs 11,964 → 8,281 (-3,683)
unresolved::* edges essentially unchanged — Step 5 doesn't
rewrite unresolved::*; it materialises the
already-resolved external targets.
Two pre-existing multi-repo prefix-parity tests get an explicit
exemption for `meta.external=true` KindFunction nodes (parallel
to the KindModule / KindBuiltin singletons exempted in earlier
steps): they're cross-repo by construction.
Regression test matrix covers stdlib materialisation with the
right metadata, dep materialisation with the full import path,
module-node sharing across many symbols of the same package,
idempotency on re-run, and the negative case (no extern targets =
no module nodes created).
Port of #77's Go local-materialisation work to TypeScript /
JavaScript. The TS extractor previously emitted KindParam +
KindClosure + KindGenericParam for the function-shape detail
but skipped intra-function bindings — `let` / `const` / `var`,
destructure patterns, for-in/for-of induction vars, and catch
parameters all existed only at AST traversal time, never as
graph nodes.
Lift each one as a KindLocal node anchored to its enclosing
function via EdgeMemberOf, using the same
`<owner>#local:<name>@+<offsetFromOwnerStartLine>` ID
convention the Go walker uses so the binding identity is
stable when lines move above the function (the #76 stability
property carries over). Walker dedupes per-binding via an
emitted-IDs set so a name visited through multiple walk paths
still produces one node row.
Scope covers the production binding-introduction sites:
- `let` / `const` / `var` declarations (`lexical_declaration`,
`variable_declaration`),
- object / array destructure patterns including renamed
bindings (`const { foo, bar: aliased } = obj`),
- for-in / for-of induction variables,
- catch-clause parameters.
Nested functions are deliberately NOT recursed into — their
bindings belong to the inner function's own scope, and the
extractor's per-function pass handles each inner function
separately.
TS doesn't (yet) have a dataflow walker analogous to Go's
emitGoDataflow, so no value_flow / arg_of / returns_to edges
target these locals today. The value is two-fold:
1. Semantic parity with Go — every binding is a first-class
graph node with stable identity, ready for the dataflow /
scope-resolution passes downstream.
2. The resolver's scope-aware bare-name binding (#81) can
now find TS locals when binding `unresolved::<name>` →
KindLocal for any future TS dataflow emit.
KindLocal is excluded from BM25 search via shouldIndexForSearch
(no change needed — already covers the kind) so the
materialisation doesn't pollute name lookups with per-function
`err` / `data` / `i` rows.
Regression test matrix covers five cases:
- let / const / var declarations
- object + array destructure (with renamed pair_pattern)
- for-of induction var
- nested-function scope isolation
- function-relative offset stability under edits above the
function.
KuzuDB's GitHub repo (kuzudb/kuzu) is marked Public archive — no
more releases or maintenance from upstream. Ladybug, the
maintained fork we already ship as store_ladybug, covers the same
Cypher property-graph workload with binary-compatible storage.
Removed:
- internal/graph/store_kuzu/ (4 files: store, schema, backend
resolver, conformance test)
- bench/kuzu-stubs/ diagnostic (Kuzu-specific stub auditor)
- go.mod requirement on github.com/kuzudb/go-kuzu (+ tidy)
- kuzu wiring in bench/store-bench/main.go (skip flag, only-arg
parsing, dispatch branch)
- kuzu row from bench/run-linux.sh and the stale comment in
bench/run-linux-rest.sh
Migrated bench/unresolved-audit from store_kuzu to store_ladybug
(same FK-stub stress shape; just a different backend tag).
Refreshed surrounding comments to drop joint kuzu/ladybug
references — the remaining Cypher backend is Ladybug alone. No
production code paths needed semantic changes because Ladybug's
behaviour mirrors Kuzu's (it IS the fork).
Two test fixtures had to follow:
- internal/mcp/server_test.go setupTestServer fixture dropped
its `import "fmt"` so the resolver's attributeGoExternalCalls
pass doesn't auto-add a `module::go:fmt` node and skew the
external-call analyser tests. (The fmt usage was cosmetic;
only the analyser tests cared about it.)
- internal/mcp/tools_analyze_coverage_test.go updated its
synthetic coverage profile line numbers to match the new
fixture (function bodies shifted up by 2 lines).
Build/test verification:
- go build ./... — clean
- go build -tags 'duckdb ladybug' ./... — clean
- go test ./internal/... -tags 'duckdb ladybug' — passes
(one pre-existing perf-gate flake in
TestAnalyzeImpact_FastPathSubMillisecond observed BEFORE this
change too — unrelated to the Kuzu removal)
Ladybug ships Kuzu's FTS extension compiled into liblbug. Capability
probe (fts_probe_test.go) confirms the call surface:
- INSTALL FTS + LOAD EXTENSION FTS once per database
- CALL CREATE_FTS_INDEX('table', 'name', [columns])
- CALL QUERY_FTS_INDEX('table', 'name', 'query') (3-arg, no limit)
- Auto-updates on later table writes — no drop / rebuild needed
The one rough edge surfaced by the probe: Ladybug's default
tokeniser does NOT split camelCase or snake_case. `ValidateToken`
indexes as a single token `validatetoken`, so a query `validate`
returns 0 hits — that's a recall regression vs our in-process BM25
backend which has explicit camelCase / snake_case / path-segment
splitting (internal/search.Tokenize).
This commit bridges the gap by pre-tokenising at write time and
applying the same tokeniser on the read side:
- SymbolFTS sidecar table holds (id, tokens) — the tokens column
is the camelCase-/snake-/path-split form of the symbol's
name + qual_name, joined by spaces. Stored separately from
the main Node table so the bulk-load path doesn't have to
learn the FTS schema.
- UpsertSymbolFTS(nodeID, tokens) writes to the sidecar with
MERGE so a re-parse of a file replaces the prior text in
place (no duplicates).
- BuildSymbolIndex installs + loads the extension and calls
CREATE_FTS_INDEX over SymbolFTS.tokens. Idempotent via an
atomic sentinel; lazy-builds on the first SearchSymbols if
the indexer hasn't called it yet.
- SearchSymbols runs the user query through search.Tokenize
(same splitter as the write side), joins with spaces, and
fires CALL QUERY_FTS_INDEX. Returns sorted hits with their
BM25 scores. Falls back to search.TokenizeQuery when
Tokenize drops every term (short queries like "go" / "js"
that the strict tokeniser would silently swallow).
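The splitting described above can be sketched as follows. This is a minimal illustration, assuming a simple rule set (separators plus lower-to-upper transitions); `splitIdentifier` is a hypothetical stand-in and not the actual internal/search.Tokenize.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// splitIdentifier is a hypothetical stand-in for internal/search.Tokenize.
// It splits on non-alphanumeric separators (snake_case, path segments,
// qualifiers) and on lower-to-upper transitions (camelCase), then
// lowercases every token.
func splitIdentifier(s string) []string {
	var tokens []string
	var cur []rune
	flush := func() {
		if len(cur) > 0 {
			tokens = append(tokens, strings.ToLower(string(cur)))
			cur = cur[:0]
		}
	}
	var prev rune
	for _, r := range s {
		switch {
		case !unicode.IsLetter(r) && !unicode.IsDigit(r):
			flush() // separator such as '_', '/', '.'
		case unicode.IsUpper(r) && unicode.IsLower(prev):
			flush() // camelCase boundary
			cur = append(cur, r)
		default:
			cur = append(cur, r)
		}
		prev = r
	}
	flush()
	return tokens
}

func main() {
	// The space-joined form is what the sidecar's tokens column would hold.
	fmt.Println(strings.Join(splitIdentifier("ValidateToken"), " "))
	fmt.Println(strings.Join(splitIdentifier("auth.validate_token"), " "))
}
```

Running the same function over the query string on the read side is what keeps a query such as `validate` aligned with the stored `validate token` text.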
Wires through the new graph.SymbolSearcher capability interface
(UpsertSymbolFTS / BuildSymbolIndex / SearchSymbols). The
SymbolHit shape mirrors what the daemon's search_symbols path
needs. Other backends (sqlite / duckdb) don't implement it yet;
the indexer-side integration that consumes it (skip Bleve when
SymbolSearcher is present) is the next commit.
Conformance test matrix (TestSymbolSearcher_EndToEnd, 6 sub-cases):
- exact identifier ("ValidateToken") ✓ top hit
- camelCase head ("validate") ✓ 2 hits
- camelCase tail ("token") ✓ top hit
- two-word query ("validate token") ✓ top hit
- qualifier hop ("auth" via qual_name) ✓ 2 hits
- control miss target ("pretty") ✓ top hit
Plus TestSymbolSearcher_AutoUpdate (post-create upserts findable
without rebuild) and TestSymbolSearcher_IdempotentUpsert (text
replacement, no duplicate rows).
The FTS capability landed in the previous commit but no production
path wrote to it. Wire the indexer to populate the backend FTS
from the same node stream that drives the disk-store bulk load,
plus mirror per-call updates so incremental reindex doesn't
diverge.
Three pieces:
1. graph.SymbolSearcher gains BulkUpsertSymbolFTS(items) — the
cold-load fast path. Per-call UpsertSymbolFTS is fine for
incremental updates (1 file change → tens of nodes) but pays
~1ms/MERGE × 600k nodes = 10 minutes on a Vscode cold-start.
Bulk path implemented on store_ladybug via TSV + COPY FROM,
mirroring the existing Node / Edge bulk loader: dedup by ID,
wipe-and-rewrite (no append), invalidate the indexBuilt
sentinel so the next SearchSymbols rebuilds the FTS.
2. internal/indexer.go drain wires SymbolSearcher into the
shadow-swap path: as DrainNodes yields each node, if the
disk target is a SymbolSearcher and the node passes
shouldIndexForSearch (same filter the in-process BM25
backend uses — keeps the FTS corpus and BM25 corpus
identical), append a SymbolFTSItem with the tokens computed
by ftsTokensFor. After FlushBulk, call BulkUpsertSymbolFTS +
BuildSymbolIndex. Reporter emits a `building symbol fts`
stage so the UI can show progress.
3. internal/indexer.go incremental-reindex path adds a parallel
UpsertSymbolFTS call alongside the existing idx.search.Add,
gated on idx.graph.(graph.SymbolSearcher). The two indexes
stay in sync without the daemon having to dual-write
explicitly.
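The capability gating used in pieces 2 and 3 is an ordinary Go type assertion on the store. A minimal sketch, assuming simplified interfaces: `Store`, `SymbolSearcher`, `memStore`, `ftsStore` and `mirrorToFTS` below are hypothetical reductions, not the real graph package types.

```go
package main

import "fmt"

// Store and SymbolSearcher are simplified, hypothetical versions of the
// graph.Store / graph.SymbolSearcher interfaces named in the text.
type Store interface{ Name() string }

type SymbolSearcher interface {
	UpsertSymbolFTS(nodeID, tokens string) error
}

// memStore does not implement the capability.
type memStore struct{}

func (memStore) Name() string { return "memory" }

// ftsStore opts in by implementing UpsertSymbolFTS.
type ftsStore struct{ rows map[string]string }

func (*ftsStore) Name() string { return "fts" }

func (s *ftsStore) UpsertSymbolFTS(id, tokens string) error {
	s.rows[id] = tokens // replace in place, so a re-parse leaves no duplicates
	return nil
}

// mirrorToFTS writes to the backend FTS only when the store opts in.
// It reports whether a write was attempted.
func mirrorToFTS(st Store, id, tokens string) (bool, error) {
	ss, ok := st.(SymbolSearcher)
	if !ok {
		return false, nil // capability absent: the existing path is untouched
	}
	return true, ss.UpsertSymbolFTS(id, tokens)
}

func main() {
	wrote, _ := mirrorToFTS(memStore{}, "n1", "validate token")
	fmt.Println("memory store wrote:", wrote)

	fts := &ftsStore{rows: map[string]string{}}
	wrote, _ = mirrorToFTS(fts, "n1", "validate token")
	fmt.Println("fts store wrote:", wrote, "rows:", len(fts.rows))
}
```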
ftsTokensFor folds n.QualName into the tokenised text so a query
like "auth" still matches "auth.ValidateToken" (qualifier-hop
recall the in-process BM25 backend has by handling QualName as a
separate field). Tokens go through search.Tokenize so camelCase /
snake_case / path-segment splitting matches the BM25 contract.
Bench wiring + Bleve skip ride in the next commit; with this
commit alone the backend FTS is populated but search_symbols
still reads from Bleve. Test sweep stays clean (one pre-existing
perf flake in TestAnalyzeImpact_FastPathSubMillisecond unrelated
to this change).
The capability layer + indexer-side writes landed in the previous two
commits but search_symbols still read from the in-process BM25
backend. Plug the read side: a search.Backend adapter that forwards
Search to graph.SymbolSearcher.SearchSymbols, picked up at indexer
construction when the store implements the capability.
internal/search/symbolsearcher_backend.go:
search.SymbolSearcherBackend implements search.Backend over a
graph.SymbolSearcher. Search forwards to SearchSymbols and translates
per-hit (NodeID, Score) into search.SearchResult. Add / Remove are
no-ops because the indexer drives the SymbolSearcher writes directly
(BulkUpsertSymbolFTS at drain, per-call UpsertSymbolFTS in the
incremental path), never through the search.Backend contract. Count
tracks deltas-since-construction as best-effort observability.
internal/indexer/indexer.go:
initialSearchBackend(g) picks the search backend the Swappable wraps
on construction. If g implements graph.SymbolSearcher, the adapter is
the initial backend; otherwise the existing search.NewAuto path (BM25
with Bleve auto-upgrade) is used.
Net effect today: any indexer.New on a Ladybug-backed store routes
every Engine.SearchSymbolsScoped / SearchSymbolsRanked call through
CALL QUERY_FTS_INDEX in Ladybug's vectorised engine instead of the
in-process BM25 / Bleve index.
What's still not bypassed yet, and what the next commit covers: the
Swappable's auto-upgrade goroutine still runs, builds Bleve from
AllNodes once the corpus crosses search.AutoThreshold, and swaps it
in. That defeats this commit's purpose at large repo size by
reinstating the ~100MB Bleve heap. Skipping that upgrade when the
swapped-in backend is a SymbolSearcherBackend is FTS Step 3.
The Swappable's auto-upgrade goroutine kicks in once
idx.search.Count() crosses search.AutoThreshold, builds a Bleve
index from the full node snapshot, and atomically swaps it into
idx.search. That was the right behaviour when the only options
were BM25 (small corpus) and Bleve (large corpus) — but with the
SymbolSearcher adapter now serving Search via the disk store's
native FTS, an auto-upgrade would:
1. Spawn a 30-60s background build of a parallel in-process
Bleve index covering the SAME corpus the disk FTS already
holds — wasted CPU.
2. Allocate ~100MB of heap for Bleve's tokeniser + posting
lists — the exact memory the FTS path was meant to release.
3. Silently Swap() the SymbolSearcherBackend out for Bleve once
the build completes — defeating the FTS path entirely. Every
search_symbols call after the swap would hit Bleve instead
of the disk FTS, and the user would never know.
Gate the upgrade on isSymbolSearcherBackend(idx.search): when the
active backend is the FTS adapter, don't spawn. The
upgradeOnce.Do still records the gate so a later reindex on the
same indexer instance also stays on the adapter — symmetric with
the existing "one upgrade per indexer lifetime" contract.
isSymbolSearcherBackend unwraps the Swappable to inspect the
underlying backend, since Inner is defined only on the Swappable
type, not on the search.Backend interface. It handles nil
defensively so callers in tests that pass a non-Swappable can
still call it.
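A rough sketch of the gate, assuming simplified types: `Backend`, `Swappable`, `ftsAdapter`, `isFTSAdapter` and `indexer` below are hypothetical reductions of the real search and indexer types, and the counter stands in for spawning the Bleve build.

```go
package main

import (
	"fmt"
	"sync"
)

// Backend is a hypothetical reduction of search.Backend.
type Backend interface{ Kind() string }

type bm25 struct{}

func (bm25) Kind() string { return "bm25" }

type ftsAdapter struct{}

func (ftsAdapter) Kind() string { return "fts" }

// Swappable wraps the active backend; only it exposes Inner.
type Swappable struct{ inner Backend }

func (s *Swappable) Kind() string   { return s.inner.Kind() }
func (s *Swappable) Inner() Backend { return s.inner }

// isFTSAdapter unwraps a Swappable if present and tolerates nil inputs.
func isFTSAdapter(b Backend) bool {
	if b == nil {
		return false
	}
	if sw, ok := b.(*Swappable); ok {
		if sw == nil {
			return false
		}
		b = sw.Inner()
	}
	_, ok := b.(ftsAdapter)
	return ok
}

type indexer struct {
	search      Backend
	upgradeOnce sync.Once
	upgrades    int // stands in for "spawn the Bleve build goroutine"
}

// maybeUpgrade consumes the once-per-lifetime slot either way, so a later
// reindex on the same instance makes the same decision.
func (idx *indexer) maybeUpgrade() {
	idx.upgradeOnce.Do(func() {
		if isFTSAdapter(idx.search) {
			return // native FTS is serving Search: do not build Bleve
		}
		idx.upgrades++
	})
}

func main() {
	onFTS := &indexer{search: &Swappable{inner: ftsAdapter{}}}
	onFTS.maybeUpgrade()
	onFTS.maybeUpgrade()
	fmt.Println("fts-backed upgrades:", onFTS.upgrades)

	onBM25 := &indexer{search: &Swappable{inner: bm25{}}}
	onBM25.maybeUpgrade()
	onBM25.maybeUpgrade()
	fmt.Println("bm25-backed upgrades:", onBM25.upgrades)
}
```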
This commit completes the FTS read-path migration: every search
on a Ladybug-backed daemon now goes to native FTS, no Bleve
build runs at any point of the indexer lifecycle. Bench (FTS
Step 4) measures the resulting latency + memory delta.
The store-bench's per-MCP-tool table measured `search_symbols` as
Store.FindNodesByName — a per-name Cypher lookup that doesn't
exercise the new SymbolSearcher path the daemon now routes
search_symbols through.
Add a `fts_search` column that measures the native FTS
round-trip when the store implements graph.SymbolSearcher:
- Builds the FTS index on the corpus that's just been
populated (BuildSymbolIndex is idempotent, so this is a
belt-and-suspenders guard against backends that don't
auto-build during AddBatch).
- For each sampled node name in the existing query workload,
times SearchSymbols(name, 20) — the same call shape
Engine.gatherBackendCandidates issues through the
SymbolSearcherBackend adapter.
Non-SymbolSearcher backends (memory / sqlite / duckdb today)
leave the column at 0.0µs / 0.0µs — the cell reads correctly as
"capability not implemented" rather than spuriously fast.
Gortex bench landed: Ladybug `fts_search` p50/p95 = 700µs /
827µs vs the legacy `search_symbols` (FindNodesByName) at
27.90ms / 31.50ms on the same fixture — ~40× faster. Vscode
bench runs next.
Probe (vector_probe_test.go) confirmed Ladybug ships the VECTOR
extension compiled into liblbug. Call surface:
- INSTALL VECTOR + LOAD EXTENSION VECTOR once per database
- FLOAT[N] column type (fixed dim at table declaration)
- CALL CREATE_VECTOR_INDEX('table', 'name', 'col') 3-arg form
- CALL QUERY_VECTOR_INDEX('table', 'name', $vec, $k) 4-arg
- Default metric is cosine; distance, not similarity
(lower = closer; exact match ≈ 0, orthogonal = 1)
- Auto-update on later inserts (mirrors FTS)
New graph.VectorSearcher capability interface plus matching
ladybug implementation (store_ladybug/vector.go):
- UpsertEmbedding(id, vec) for incremental: per-call MERGE,
refuses dim mismatch against the declared FLOAT[N] column.
- BulkUpsertEmbeddings(items) for cold-load: TSV + COPY FROM
(file extension MUST be .csv — `.tsv` is rejected at bind
time with "Cannot load from file type tsv"). Auto-migrates
the schema if the batch dim differs from the prior declaration
(allowed at the cold-start boundary; per-call still errors so
a stray wrong-dim upsert can't silently drop the corpus).
- BuildVectorIndex(dim) lazily creates SymbolVec(id STRING,
emb FLOAT[dim], PRIMARY KEY(id)) and CALL CREATE_VECTOR_INDEX
over emb. Idempotent via the indexBuilt sentinel; a dim
change drops and re-creates the index.
- SimilarTo(vec, k) runs CALL QUERY_VECTOR_INDEX and returns
hits ordered by ascending distance.
Lazy schema (vs static DDL) because the FLOAT[N] width is
embedder-model-specific and only known when the first vector
arrives — MiniLM-L6-v2 is 384, BGE-Code is 768, GloVe-50d is 50.
The store can't preallocate a column at Open time without knowing
which provider the daemon will run with.
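The dimension rules above (lazy declaration, strict per-call upserts, permissive bulk loads) can be modelled in a few lines. This is a sketch under stated assumptions: `vecTable` and its methods are hypothetical, and the real implementation issues DDL against the store rather than tracking an integer.

```go
package main

import "fmt"

// vecTable is a hypothetical model of the declared-dimension rule.
// dim == 0 means the table has not been declared yet.
type vecTable struct{ dim int }

// ddl renders the table shape quoted in the text for the current dim.
func (t *vecTable) ddl() string {
	return fmt.Sprintf("SymbolVec(id STRING, emb FLOAT[%d], PRIMARY KEY(id))", t.dim)
}

// upsert models the per-call path: the first vector fixes the width
// (an assumption of this sketch), later mismatches are refused.
func (t *vecTable) upsert(vec []float32) error {
	if t.dim == 0 {
		t.dim = len(vec)
		return nil
	}
	if len(vec) != t.dim {
		return fmt.Errorf("embedding dim %d does not match declared FLOAT[%d]", len(vec), t.dim)
	}
	return nil
}

// bulk models the cold-load path: a differing batch dim redeclares the
// table. It reports whether a migration happened.
func (t *vecTable) bulk(batchDim int) (migrated bool) {
	migrated = t.dim != 0 && t.dim != batchDim
	t.dim = batchDim
	return migrated
}

func main() {
	t := &vecTable{}
	fmt.Println(t.upsert(make([]float32, 384))) // first vector declares 384
	fmt.Println(t.upsert(make([]float32, 768))) // per-call mismatch errors
	fmt.Println(t.bulk(768), t.ddl())           // bulk load migrates to 768
}
```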
Conformance test matrix (4 tests):
- BulkAndQuery: 4 items in, top-2 hits cover the exact match +
near neighbour; distance ≈ 0 on the exact match.
- PerCallUpsert: incremental writes findable on next query.
- DimRejectsMismatch: second per-call upsert with wrong dim
must error (no silent corpus drop).
- BulkReplacesPriorCorpus: bulk wipe-and-rewrite semantics.
Indexer integration + adapter + bench land in Steps 2-4.
After the embedder's batch pass produces (id → vec) pairs, in
addition to populating the in-process search.VectorBackend
(coder/hnsw), the indexer now also pushes the same vectors into
the backend's native HNSW via graph.VectorSearcher when the
store implements it.
Cold-load shape:
- Accumulate (id, vec) pairs alongside the existing
vecBackend.Add loop. No extra pass; the slice is built from
the same vector slice the in-process backend consumes.
- One BulkUpsertEmbeddings + one BuildVectorIndex call after
the loop. Both errors logged at warn, non-fatal — the
in-process backend still works as the fallback path until
Vector Step 3 routes reads through.
- Skipped when the store doesn't implement VectorSearcher
(sqlite, duckdb, in-memory) so the existing path keeps
working byte-for-byte for those backends.
The in-process HNSW build stays for now. The next commit (Vector
Step 3) extends search.SymbolSearcherBackend to also implement
search.ChannelSearcher's vector channel, gating the in-process
NewVector / Add loop behind the same hasVectorSearcher check
that this commit consults. That's where the ~1GB heap saving
on Vscode-scale shows up.
This commit on its own is observably a no-op for the daemon —
both the in-process and backend HNSW are populated and the read
path still hits the in-process one. The behaviour shift comes
with Step 3.
When the underlying graph.Store implements graph.VectorSearcher
(today only store_ladybug), the in-process search.VectorBackend
now delegates Search to the engine-native HNSW and skips the
parallel hnsw.Graph build entirely.
Two pieces:
- internal/search/vector.go: VectorBackend gains a delegate
field + SetDelegate(VectorDelegate). When set, Add becomes a
no-op (bumps a delegateCount so HybridBackend's `Count() > 0`
gate still fires once the indexer has populated the corpus),
Search forwards to delegate.SimilarTo, Count returns the
delta count. The in-process hnsw.Graph is never touched —
nothing is allocated for the parallel index. SetDelegate is
safe to call once at construction; HybridBackend's
SetChunkMap and other state stays live so de-chunking and
dim reporting keep working.
search.VectorDelegate is exported with a graph.VectorHit
return so the indexer can install a delegate without writing
a per-package translation type — search already imports
graph for SymbolHit, so the type sharing is free.
- internal/indexer/indexer.go: buildSearchIndex's vector
branch now detects graph.VectorSearcher on idx.graph and
installs a vectorSearcherDelegate before the vec.Add loop.
The same loop still drives BulkUpsertEmbeddings on the
backend (Vector Step 2) — the only behavioural change here
is that the in-process hnsw.Graph never holds the vectors,
freeing roughly dim × 4 × N bytes of heap (≈ 1 GB at
384-dim × 663k symbols on a Vscode-scale repo).
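The heap figure follows directly from the float32 width. A small check of the arithmetic, assuming "663k" means 663,000 symbols and counting only raw vector payload (an HNSW graph also stores neighbour links, which this ignores):

```go
package main

import "fmt"

// rawVectorBytes returns the payload size of n float32 vectors of the
// given dimension: dim × 4 bytes × n.
func rawVectorBytes(dim, n int64) int64 {
	return dim * 4 * n
}

func main() {
	b := rawVectorBytes(384, 663_000)
	fmt.Printf("%d bytes = %.2f GB\n", b, float64(b)/1e9)
	// prints "1018368000 bytes = 1.02 GB"
}
```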
Read path on a Ladybug-backed daemon: HybridBackend.SearchChannels
→ embedder.Embed(query) → VectorBackend.Search → delegate
.SimilarTo → CALL QUERY_VECTOR_INDEX in Ladybug's vectorised
engine. Same shape the FTS path took.
Bench (Vector Step 4) measures the heap delta on a corpus
with embeddings actually populated. The Add-side test sweep
stays clean (one pre-existing perf flake unrelated).
store-bench now reports a `vector_search` column alongside
`fts_search`, exercising graph.VectorSearcher on every backend
that implements it and surfacing an in-process search.VectorBackend
baseline row so the engine-native HNSW can be compared head-to-head
with the heap-resident HNSW the daemon used to build.
Flags:
-vectors corpus size (0 = off; default off keeps the
existing latency bench fast)
-vector-dim embedding dim (default 384, MiniLM-L6-v2)
-vector-queries number of SimilarTo / Search calls to time
-vector-seed PRNG seed for deterministic cross-backend runs
The corpus is generated once with a math/rand seed and reused for
every backend + the in-process row, so the comparison is
apples-to-apples (identical vector distribution, identical query
vectors, identical k). Vectors are L2-normalised; HNSW under
cosine distance behaves best on unit-norm inputs.
Sample (gortex repo, 20k corpus, 384 dim, 500 queries):
| backend | vector_search p50 / p95 | heap (alloc / inuse) |
|--------------------|-------------------------|----------------------|
| ladybug | 987.0µs / 1.10ms | 37MB / 68MB |
| (in-process HNSW) | 101.0µs / 123.0µs | +5MB / +33MB delta |
Engine-native is ~10x slower per query at this scale (Cypher
parse/bind/transaction overhead dominates a single ANN lookup) but
keeps the vectors on disk, so the daemon avoids paying dim × 4 × N
bytes in heap. At a 663k-symbol Vscode-scale corpus the heap delta
is the load-bearing trade-off, not the per-query latency: 1ms is
well under the LLM round-trip floor either way.
Pre-empt a tempting regression: now that the ladybug (C++) backend is gone, the Windows .exe still links libstdc++-6.dll because several bundled tree-sitter grammars (e.g. go-sitter-forest norg/scanner.cc) ship C++ external scanners that the cgo build compiles with g++. Document this at the bundling site so the mingw runtime DLLs aren't mistaken for ladybug leftovers and removed.
The windows release used to ship gortex.exe plus the mingw C/C++ runtime DLLs (libstdc++-6 / libgcc_s_seh-1 / libwinpthread-1) it linked dynamically — the C++ stdlib is pulled in because some tree-sitter grammars carry C++ external scanners (e.g. go-sitter-forest norg). Link them statically via -extldflags=-static so the .exe is a single self-contained binary, and replace the DLL-staging step (the brittle find_dll scan) with an objdump guard that fails the release if any mingw runtime import leaks through. The zip stays — install.ps1, checksums.txt, cosign signing, and windows/unix artifact parity are all built around it, and it compresses the large CGo binary — but it now contains only gortex.exe. install.ps1 drops the multi-file / DLL-count install path accordingly.
Wire up the dormant Scoop channel now that the windows artifact is a single self-contained gortex.exe. The release-windows job builds the manifest with jq (version, tag-pinned zip url, sha256 from the zip, bin: gortex.exe, plus checkver + autoupdate) and pushes it to gortexhq/scoop-bucket, honouring either the repo-root or bucket/ layout. The step is non-blocking (continue-on-error) and self-skips when SCOOP_BUCKET_TOKEN is absent, so a token-less fork or a transient bucket failure never fails a release whose binary already shipped. SCOOP_BUCKET_TOKEN moves from the goreleaser job (which never consumed it — there is no scoops block there) to this step, where it's actually used; the .goreleaser.yml note is updated to point at the new step.
Gortex split its per-user state across three roots: config in
~/.config/gortex, cache in ~/.cache/gortex, and a flat ~/.gortex for
the store / models / memories. Collapse them into one tree:
~/.gortex/
├── config.yaml, servers.toml
├── cache/ (daemon socket/pid/log, snapshots, eval/token/… caches)
├── store/ (the on-disk backend + WAL/shm sidecars)
├── models/ (downloaded embedding models)
└── memories/
The XDG variables stay an explicit escape hatch: an absolute
XDG_CONFIG_HOME / XDG_DATA_HOME / XDG_CACHE_HOME (and the
GORTEX_DAEMON_* overrides) still wins and routes that category to the
standard <base>/gortex location, so XDG-strict setups, sandboxes, and
the test suite are unaffected.
internal/platform/xdg.go is the single resolver: ConfigDir/DataDir
collapse to ~/.gortex, CacheDir/OSCacheDir to ~/.gortex/cache, and new
StoreDir / ModelsDir / MemoriesDir hang the durable sub-trees off
DataDir. The scattered callers route through it; the Legacy* aliases
are gone.
MigrateToUnifiedHome folds an older split layout into the new tree on
first run: best-effort, idempotent, rename-based, and a no-op under an
XDG override. It is wired into the root command's PersistentPreRun so
it lands before any command opens the store or reads config. Models
are treated as durable data (kept out of cache so a wipe doesn't drop
large downloads); the stale daemon socket/pid are left to regenerate.
…layout
Follow-up to 431c0b2 (unify per-user state under ~/.gortex). Update
the now-stale path references in cobra flag descriptions, command
help/usage text, doc comments, agent-instruction strings, and
docs/*.md to the unified layout:
~/.config/gortex/config.yaml -> ~/.gortex/config.yaml
~/.cache/gortex/<x> -> ~/.gortex/cache/<x>
~/.cache/gortex/models/ -> ~/.gortex/models/ (models are data, not cache)
~/.gortex/<backend>.store -> ~/.gortex/store/<backend>.store
%LocalAppData%\gortex -> %USERPROFILE%\.gortex\cache
The XDG-relocation explanations (an absolute $XDG_*_HOME still wins)
are kept; they remain accurate. Per-repo .gortex.yaml / .gortex/
references are left untouched. Text-only; no logic change; go build
is green.
The install matrix's macos-13 leg never gets a runner — GitHub retired its Intel macOS images, so the job queues until it is cancelled (~88 min), making the workflow look perpetually stuck. There is no Intel-macOS replacement (macos-14 / 15 / latest are all arm64), and install.sh is arch-agnostic (only the downloaded artifact differs), so macOS coverage continues via macos-14 (arm64) with no real loss.
…file The incremental (fsnotify / edit_file) re-index path ran the whole-graph materializeDataflowParams — g.AllEdges() over the entire edge set — after every single-file edit. On the disk backend that materializes every edge per keystroke, a large per-edit cost with no benefit beyond the one file. Replace it on that path with materializeDataflowParamsForFile, which rewrites only the arg_of / returns_to edges the edited file emits. A file's dataflow From is not always a file node: returns_to's From is the caller function and a bare-identifier arg_of's From resolves to a file local (both covered by GetFileNodes), but a selector / package-qualified / global / nested-call argument keeps a synthetic unresolved:: From that never becomes a file node. So the scoped pass probes the union of the file's nodes and the synthetic From ids carried by the file's freshly-extracted edges (result.Edges), then keeps only edges whose FilePath is this file — exactly the set the whole-graph pass would touch for it. The batch path (Resolver.ResolveAll) still runs the whole-graph variant once. Adds an equivalence test: a fixture exercising all four argument shapes (bare, selector, global, nested-call) asserts the scoped per-file pass produces byte-identical arg_of+returns_to edges to the whole-graph pass, with a guard that the synthetic-From case is actually exercised so the assertion can't pass vacuously.
…tifiedIndex)
Foundation for O(edited-file) incremental clone detection. Adds the
internal/clones building blocks needed to maintain the LSH index
across single-file edits instead of rebuilding it over the whole
corpus:
- CMS.Decrement: subtract a key's count (floored at 0), so a file's
shingles can be removed from the corpus frequency sketch on evict.
- Index.Remove(id) / Index.QueryCandidates(id): maintain and query
the existing band buckets per item, using the same bandKey and the
same maxBucketSize cap as the batch EmitCandidatesTo, so a
maintained query returns exactly the batch candidate set.
- StratifiedIndex (Add/Remove/QueryPairs): a per-length-class wrapper
mirroring DetectPairsStratified's stratification, so an item is
banked and queried in the same classes the batch path uses.
A deterministic equivalence test proves the union of per-item
QueryPairs equals the batch DetectPairsStratified pair set (plus
Remove-reversal and CMS round-trip checks); an adversarial fuzz over
length classes, overlap boundaries, and the oversized-bucket cap
confirmed exact set-equality.
internal/clones only; no indexer or batch call-site changes yet.
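To illustrate the floored decrement, here is a minimal count-min sketch with Add, Decrement and Estimate. It is not the internal/clones implementation: the `cms` type, the FNV-based row hashing and the sizes are assumptions made for the example.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// cms is a hypothetical count-min sketch: depth rows of width counters.
type cms struct {
	rows  [][]uint32
	width uint64
}

func newCMS(depth, width int) *cms {
	rows := make([][]uint32, depth)
	for i := range rows {
		rows[i] = make([]uint32, width)
	}
	return &cms{rows: rows, width: uint64(width)}
}

// index hashes (row, key) to a column; the row number salts the hash.
func (c *cms) index(row int, key uint64) uint64 {
	var buf [9]byte
	buf[0] = byte(row)
	for i := 0; i < 8; i++ {
		buf[1+i] = byte(key >> (8 * i))
	}
	h := fnv.New64a()
	h.Write(buf[:])
	return h.Sum64() % c.width
}

func (c *cms) Add(key uint64) {
	for r := range c.rows {
		c.rows[r][c.index(r, key)]++
	}
}

// Decrement subtracts one from each of the key's counters, floored at 0
// so an unbalanced evict can never wrap a counter around.
func (c *cms) Decrement(key uint64) {
	for r := range c.rows {
		if cell := &c.rows[r][c.index(r, key)]; *cell > 0 {
			*cell--
		}
	}
}

// Estimate returns the minimum counter across rows.
func (c *cms) Estimate(key uint64) uint32 {
	min := c.rows[0][c.index(0, key)]
	for r := 1; r < len(c.rows); r++ {
		if v := c.rows[r][c.index(r, key)]; v < min {
			min = v
		}
	}
	return min
}

func main() {
	s := newCMS(4, 1024)
	s.Add(7)
	s.Add(7)
	s.Decrement(7)
	fmt.Println("after 2 adds, 1 decrement:", s.Estimate(7))
	s.Decrement(7)
	s.Decrement(7) // extra decrement is absorbed by the floor
	fmt.Println("after over-decrement:", s.Estimate(7))
}
```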
Persists each function/method node's MinHash shingle set ([]uint64)
in a clone_shingles(node_id, repo_prefix, shingles) table so the
maintained clone-detection CMS can be rebuilt after a warm restart
(snapshot reuse, no re-parse). This is the foundation for keeping
incremental clone detection active across the daemon's normal restart
path.
Mirrors the file_mtimes sidecar: CloneShingleWriter
(BulkSetCloneShingles / DeleteCloneShingles) + CloneShingleReader
(LoadCloneShingles), little-endian 8-bytes/elem blob encoding,
repo-prefix-scoped reads, chunked single-tx writes. Implemented on
both the SQLite Store and the in-memory Graph, and exercised by a
storetest conformance subtest that runs against both backends (exact
length/order/value round-trip, delete, repo isolation).
No internal/clones or internal/indexer changes; the consumer lands
next.
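The blob layout described above (8 bytes per element, little-endian) can be sketched with encoding/binary. `encodeShingles` and `decodeShingles` are illustrative names, not the functions the store uses.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// encodeShingles packs each uint64 as 8 little-endian bytes, preserving
// order, so the blob length is always 8 * len(s).
func encodeShingles(s []uint64) []byte {
	buf := make([]byte, 8*len(s))
	for i, v := range s {
		binary.LittleEndian.PutUint64(buf[8*i:], v)
	}
	return buf
}

// decodeShingles reverses encodeShingles and rejects truncated blobs.
func decodeShingles(b []byte) ([]uint64, error) {
	if len(b)%8 != 0 {
		return nil, errors.New("shingle blob length is not a multiple of 8")
	}
	out := make([]uint64, len(b)/8)
	for i := range out {
		out[i] = binary.LittleEndian.Uint64(b[8*i:])
	}
	return out, nil
}

func main() {
	blob := encodeShingles([]uint64{1, 0x0102030405060708})
	fmt.Println(len(blob), blob[0], blob[8]) // prints "16 1 8"
	back, err := decodeShingles(blob)
	fmt.Println(back, err)
}
```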
Wire the maintained CMS + LSH clone index into the indexer so a
single-file (re)index updates clone EdgeSimilarTo edges in O(edited
file) instead of re-running the whole-graph detectClonesAndEmitEdges
per edit. This is the second half of removing the global passes from
the edit_file hot path (dataflow was the first, 17d9531).
- incrementalCloneIndex (clone_incremental.go) owns a maintained
clones.CMS + clones.StratifiedIndex + an in-memory shingle cache.
Rebuild seeds the CMS/corpus from ALL of a repo's bodies (via the
clone_shingles sidecar) and the LSH from survivors (clone_sig), so
it matches finaliseCloneSignatures' all-bodies CMS. EvictFuncs
decrements the CMS, removes from the LSH, and deletes sidecar rows;
UpdateFuncs adds the edited file's bodies, computes signatures via
the shared computeCloneSigFromShingles kernel, and emits
EdgeSimilarTo from per-item LSH queries.
- finaliseCloneSignatures persists every body's shingles to the
sidecar before clearing Meta, so the index rebuilds after a warm
restart.
- Clone detection is now PER-REPOSITORY: finaliseCloneSignatures /
detectClonesAndEmitEdges take a repoPrefix and scope their node
walks to it; MultiIndexer runs detection once per tracked repo, each
with its own threshold. No cross-repo EdgeSimilarTo edges form,
matching the per-repo incremental maintainer and avoiding cross-repo
false-positive clones. Single-repo behavior is unchanged (repoPrefix
"" matches every node).
- indexFile uses the incremental path when the index is built,
falling back to the whole-graph pass otherwise; the batch pass
remains the re-baseline (corrects CMS drift) and still runs
diffusion.
Tests: full-index-vs-incremental equivalence (incl. the useFilter
regime via an overridable cmsMinCorpus),
warm-restart-rebuild-from-sidecar, and a multi-repo test asserting no
cross-repo edges and batch == incremental per repo. make lint clean;
-race green.
…pped events
The fsnotify watcher could leave the graph stale in ways index-health
couldn't see (it's mtime-over-tracked-files, blind to new files + lost
events). Three fixes:
1. Parse-then-swap. indexFile AND the live watcher path (patchGraph's
ChangeModified case) evicted a file's nodes BEFORE parsing, so a
transiently-unparseable save (the common mid-edit case) dropped the
file's symbols from the graph + search index until the next clean
save. Now: parse first; evict + re-add only on a successful parse; on
failure the prior nodes stay intact (stale-but-present beats empty).
patchGraph no longer pre-evicts (it relied on the same hazard);
removed/added telemetry stays gross via the file's prior node count.
2. inotify overflow recovery. The watcher read only Events(), never
the Dropped()/EventOverflow signal, so on a Linux kernel queue
overflow (bursty change: git checkout, mass edits, build churn) lost
events were silent with no re-scan until the up-to-1h reconcile
janitor. Now an overflow triggers a coalesced full-tree
IncrementalReindex, recovering dropped creates/modifies/deletes
(including new files) in seconds. (macOS FSEvents already
self-healed; Linux did not.)
3. pollGitHead retry. The poller advanced lastSHA before its git
diff, so a transient diff failure permanently skipped that SHA range.
Now lastSHA advances only after a successful diff; a failure leaves
the range to retry next cycle.
Tests (realtime_reliability_test.go) with negative controls that fail
against the pre-fix code: a failed re-index keeps prior nodes (via
both indexFile and the live patchGraph path), an overflow event
triggers a coalesced reconcile that indexes a previously-missed file,
and a failed git diff leaves the range to retry.
Honest remaining gaps (separate follow-ups, not regressions): the poller's mtime fallback still only re-stats known files (new-file discovery now leans on the overflow reconcile + the hourly janitor); the janitor's default 1h interval is unchanged; and a small Linux race remains for files created in a brand-new subdir before its inotify watch lands.
…t lost
Closes the new-subdirectory race on the Linux inotify backend. inotify
attaches a watch to a directory only AFTER its create event is read, so
any file written into the directory in that window fires no event. With
the watcher previously dropping all directory events, such a file stayed
invisible until the hourly reconcile janitor — and, if never committed,
the poller's mtime fallback (which only re-stats already-tracked files)
could not recover it either.
Fix: when a directory event carries a Create, scan the directory's
subtree on disk via IncrementalReindexPaths so pre-watch files are picked
up regardless of whether an event ever fired ("watch first, then scan":
fswatcher attaches the inotify watch before our handler runs, so files
created after it fire normal events and files created before are caught
by the scan — the overlap is at worst a redundant, idempotent re-index,
IsStale-gated to a stat for already-current files). A directory event
without a Create (a bare mtime bump) needs no scan; its entries' changes
fire their own file events. A burst of directory creates (a large
checkout) coalesces into a single in-flight drainer, escalating to one
full-tree reconcile past a cap rather than fanning out into many scoped
walks — so a checkout never storms.
macOS FSEvents has no such race (one recursive root stream, no per-dir
watch), but the scan runs there too as a cheap idempotent backstop.
Also corrects a misleading comment: the EventOverflow branch in
handleEvent is reachable only on Linux/Windows — macOS FSEvents never
emits EventOverflow; its backend absorbs UserDropped/KernelDropped by
re-scanning the subtree internally.
Tests: a file buried in a new directory (its own create event lost) is
indexed by the directory-create scan via the real handleEvent ->
enqueueDirScan -> runDirScan -> IncrementalReindexPaths path; and the
scan is gated on a Create (a bare directory modify triggers none).
…re-index
Editing or deleting a definition silently stripped its callers' edges
until a cold reindex. Two coupled bugs:
1. Un-resolve (silent edge loss). Graph eviction removes a file's
nodes AND drops every incoming reference edge from surviving callers
in other files (it deletes the edge from the caller's out-edge
bucket). Re-indexing file A, even with no change to a symbol,
therefore stripped B's call to A.Foo, and nothing recreated it:
ResolveFile(A) only re-resolves A's OUTGOING edges. So find_usages /
get_callers on an edited symbol went blank until a full reindex.
2. No reverse resolution. A symbol newly defined in A leaves callers
in other files pinned to an `unresolved::<Name>` stub, because
per-file resolution never revisits other files' pending edges.
Fix:
- restubIncomingRefs (indexer): before any incremental eviction,
rewrite each surviving caller's resolved reference edge into A back
to an `unresolved::<Name>` stub via ReindexEdges, so it detaches from
the soon-to-be-evicted node and survives as a pending edge instead of
being dropped. Only name-resolvable reference kinds (calls,
references, reads/writes, typed_as/returns,
implements/extends/composes/instantiates) are re-stubbed; structural
and enrichment edges are left to drop. Wired into every incremental
evict site (the live edit path, delete/rename, and the reconcile
deletion sweeps). Backend-agnostic: GetInEdges + ReindexEdges are the
same Store primitives the resolver uses, so it behaves identically on
the in-memory and disk stores.
- ResolveIncomingForFile (resolver): the reverse of ResolveFile.
After a definition is (re)indexed, bind the pending edges that
reference this file's symbol names. Candidates are found via
GetInEdges keyed by the `unresolved::<Name>` stub id (the stub id IS
the in-edge bucket key, so no new index is needed) and run through
the existing per-edge resolveEdge with the same reachability / import
/ cross-package gates as ResolveFile. Scoped to this file's names:
O(references to those names), not a whole-graph ResolveAll. indexFile
calls it right after ResolveFile.
Net: editing A keeps B's caller edge (re-stub + rebind); deleting A
leaves B's call as an unresolved stub (correct for a now-missing
symbol), not a dropped edge; re-creating A rebinds the pending stub.
Names the resolver can't bind uniquely/safely stay pending for the
periodic ResolveAll, so there is no whole-graph storm on a
single-file edit. Two graph kind-predicates back the filters:
IsResolvableRefEdge and IsReferenceableSymbol.
Test (incremental_resolve_test.go) drives the real IndexCtx ->
IndexFile -> EvictFile path with Go source and asserts the caller
edge survives a re-index, reverts to an unresolved stub on delete,
and rebinds on re-create. Verified with negative controls: disabling
the re-stub OR the reverse pass each makes it fail, so both halves
are load-bearing.
A fatal store error on the fsnotify path took the whole daemon down. The
observed crash: during a `daemon restart`, a still-pending debounced
watcher timer fired patchGraph against a store whose connection was
already closed; store_sqlite's panicOnFatal turned the SQL error into a
panic, and because the timer runs in its own goroutine — not through the
MCP wrapToolHandler firewall — the panic crashed the process.
Add guardWatcherPanic and defer it in every watcher background goroutine:
the debounced per-file patch, the storm drain, the overflow reconcile,
and the new-directory scan. A panic in any of them is now recovered and
logged (with stack), aborting just that unit of work — the file stays
stale until the next event or the reconcile janitor — instead of taking
the daemon down. The debounced-patch cleanup (deleting the pending entry)
still runs on the panic path. This matches the existing tool-path
firewall philosophy ("no handler can crash the daemon"), extended to the
goroutines fsnotify drives directly.
Independently relevant now that the incremental path issues more store
queries from watcher goroutines (restubIncomingRefs, ResolveIncomingForFile,
the new-directory scan) — any of which would otherwise be a fresh crash
vector against a transiently-unavailable store.
Test drives a debounced patch against a store armed to panic on read and
asserts the panic is recovered + logged rather than propagated; the
negative control (guard removed) crashes the test binary.
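A minimal sketch of the recover-in-goroutine pattern this commit describes, assuming a guard of roughly this shape. The signature of `guardWatcherPanic` here and the `runGuarded` helper are illustrative guesses, not the project's code; only the standard library behaviour (a deferred function calling `recover`, `debug.Stack`) is relied on.

```go
package main

import (
	"fmt"
	"log"
	"runtime/debug"
	"sync"
)

// guardWatcherPanic (hypothetical signature) recovers a panic raised in a
// watcher goroutine and logs it with a stack trace. It must be deferred
// directly, otherwise recover() returns nil.
func guardWatcherPanic(unit string, recovered *bool) {
	if r := recover(); r != nil {
		log.Printf("watcher: recovered panic in %s: %v\n%s", unit, r, debug.Stack())
		*recovered = true
	}
}

// runGuarded runs work in its own goroutine, the way a debounced timer
// callback would, and reports whether a panic was recovered.
func runGuarded(unit string, work, cleanup func()) bool {
	var recovered bool
	var wg sync.WaitGroup
	wg.Add(1)
	go func() {
		defer wg.Done()
		defer cleanup() // deferred earlier, so it runs after the guard, even on panic
		defer guardWatcherPanic(unit, &recovered)
		work()
	}()
	wg.Wait()
	return recovered
}

func main() {
	pending := map[string]bool{"a.go": true}
	recovered := runGuarded("debounced patch",
		func() { panic("sql: database is closed") }, // simulated fatal store error
		func() { delete(pending, "a.go") },          // pending-entry cleanup
	)
	fmt.Println("recovered:", recovered, "pending entries:", len(pending))
}
```

Deferred calls run last-in first-out, so the guard recovers first and the cleanup still executes afterwards, which matches the note that the pending entry is deleted on the panic path.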
… restart converges
The full-index mtime persist was an upsert (BulkSetFileMtimes = INSERT OR REPLACE), so a file deleted since the last index left its row in the store forever. On every warm restart HasChangesSinceMtimes hit the phantom-deletion row, flagged the repo as changed, and forced a full whole-repo re-track plus all global passes, which never converged, because the re-track re-persisted with the same upsert.
Make the full-index persist authoritative:
- Add two optional store capabilities: FileMtimeReplacer.ReplaceFileMtimes (DELETE the repo's rows then bulk insert, one tx) and FileMtimeDeleter.DeleteFileMtimes (prune specific paths). Implemented on the sqlite store; empty input is a no-op so an empty snapshot never wipes a repo.
- Full-index persist now prefers ReplaceFileMtimes (falls back to upsert).
- Both IncrementalReindex deletion loops prune the deleted files' persisted rows via DeleteFileMtimes. The per-file incremental add path stays an upsert.
After one re-track the persisted set matches disk, so the next unchanged warm restart takes the fast path instead of re-indexing everything.
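The "prefer the optional capability, fall back to upsert" shape can be sketched with a Go type assertion. The interface names come from the commit message, but their method signatures, the `MtimeStore` base interface, and the in-memory `memStore` are assumptions made for illustration; the real implementation is on the sqlite store.

```go
package main

import "fmt"

// MtimeStore is the baseline upsert-only surface (hypothetical shape).
type MtimeStore interface {
	BulkSetFileMtimes(repo string, mtimes map[string]int64)
}

// FileMtimeReplacer is the optional capability; the signature is assumed.
type FileMtimeReplacer interface {
	ReplaceFileMtimes(repo string, mtimes map[string]int64)
}

// memStore is a toy backend that implements both surfaces.
type memStore struct{ rows map[string]map[string]int64 }

// BulkSetFileMtimes upserts: rows for files missing from mtimes are kept.
func (s *memStore) BulkSetFileMtimes(repo string, mtimes map[string]int64) {
	if s.rows[repo] == nil {
		s.rows[repo] = map[string]int64{}
	}
	for path, t := range mtimes {
		s.rows[repo][path] = t
	}
}

// ReplaceFileMtimes drops the repo's rows and inserts the snapshot.
// Empty input is a no-op so an empty snapshot never wipes a repo.
func (s *memStore) ReplaceFileMtimes(repo string, mtimes map[string]int64) {
	if len(mtimes) == 0 {
		return
	}
	fresh := make(map[string]int64, len(mtimes))
	for path, t := range mtimes {
		fresh[path] = t
	}
	s.rows[repo] = fresh
}

// persistFullIndex prefers the authoritative replace and falls back to upsert.
func persistFullIndex(s MtimeStore, repo string, snapshot map[string]int64) {
	if r, ok := s.(FileMtimeReplacer); ok {
		r.ReplaceFileMtimes(repo, snapshot)
		return
	}
	s.BulkSetFileMtimes(repo, snapshot)
}

func main() {
	s := &memStore{rows: map[string]map[string]int64{}}
	persistFullIndex(s, "repo", map[string]int64{"a.go": 1, "b.go": 2})
	persistFullIndex(s, "repo", map[string]int64{"a.go": 3}) // b.go was deleted on disk
	fmt.Println("rows after second full index:", len(s.rows["repo"]))
}
```

With the upsert alone, the row for `b.go` would survive the second persist, which is the phantom-deletion row the commit removes.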
… typed sidecar
Change A (storage unification), churn domain. Enrichment used to ride in the gob-encoded nodes.meta BLOB, so every node row paid encode/decode for rarely-read data, and get_churn_rate scanned AllNodes + gob-decoded every meta blob to peek at one key (~27ms / 220x slower than memory on sqlite).
Churn now persists in a dedicated churn_enrichment table (node_id PK + typed columns + repo_prefix), mirroring the clone_shingles sidecar: a new optional Store capability (ChurnEnrichmentWriter/Reader) implemented by both the in-memory and sqlite backends, with a conformance case forcing parity.
- The enricher (internal/churn) writes the sidecar via BulkSetChurn when the backend implements the capability, and no longer stamps Node.Meta.
- get_churn_rate reads the typed rows via an index over the (small) enriched set + one batched node lookup, instead of the AllNodes scan. It falls back to the legacy Meta scan when the sidecar is empty (un-migrated DB) or the backend lacks the capability, so an existing store.sqlite still answers until the next `gortex enrich churn` (recompute-on-next-enrich migration; no destructive backfill).
Tests: storetest conformance on both backends; the enricher writes + round-trips the sidecar (and no longer leaves churn in Meta); get_churn_rate surfaces sidecar rows (sort_by / min_commits) and the legacy Meta-scan fallback still works.
First of the per-domain enrichment moves; coverage/releases/blame follow the same pattern, and the EvictFile DeleteEnrichment cascade lands with them (orphan rows are currently tolerated, as vectors are).
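The read path's "sidecar first, legacy Meta scan as fallback" behaviour might look roughly like the following. Everything here is a hypothetical model: the `AllChurn` method, the `ChurnRow` fields, and the `churn_commits` Meta key are invented for the sketch and are not taken from the project.

```go
package main

import "fmt"

// Node is a pared-down graph node with the legacy free-form Meta map.
type Node struct {
	ID   string
	Meta map[string]any
}

// ChurnRow is one typed sidecar row (field set assumed).
type ChurnRow struct {
	NodeID  string
	Commits int
}

// ChurnEnrichmentReader is the optional capability; the method is assumed.
type ChurnEnrichmentReader interface {
	AllChurn() []ChurnRow
}

// sidecarStore is a toy backend that exposes the capability.
type sidecarStore struct{ rows []ChurnRow }

func (s *sidecarStore) AllChurn() []ChurnRow { return s.rows }

// churnCommits reads typed rows when the backend has them and otherwise
// falls back to scanning node Meta (un-migrated DB or capability-less backend).
func churnCommits(store any, nodes []Node) (map[string]int, string) {
	if r, ok := store.(ChurnEnrichmentReader); ok {
		if rows := r.AllChurn(); len(rows) > 0 {
			out := make(map[string]int, len(rows))
			for _, row := range rows {
				out[row.NodeID] = row.Commits
			}
			return out, "sidecar"
		}
	}
	out := map[string]int{}
	for _, n := range nodes {
		// "churn_commits" is a made-up key standing in for the legacy stamp.
		if c, ok := n.Meta["churn_commits"].(int); ok {
			out[n.ID] = c
		}
	}
	return out, "meta"
}

func main() {
	legacy := []Node{{ID: "a.go::Foo", Meta: map[string]any{"churn_commits": 7}}}

	got, src := churnCommits(&sidecarStore{}, legacy) // sidecar still empty
	fmt.Println(src, got)

	migrated := &sidecarStore{rows: []ChurnRow{{NodeID: "a.go::Foo", Commits: 9}}}
	got, src = churnCommits(migrated, legacy)
	fmt.Println(src, got)
}
```

Treating an empty sidecar the same as a missing capability is what lets an existing database keep answering until the next enrich run repopulates the typed table.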
… a typed sidecar
Change A, coverage domain (mirrors the churn sidecar 5d19188). Coverage % + stmt counts now persist in a typed coverage_enrichment table (node_id PK + repo_prefix + coverage_pct/num_stmt/hit) via a new optional CoverageEnrichmentWriter/Reader capability on both backends, with a conformance case.
- The coverage enricher writes the sidecar (batched) and, on success, strips the Meta stamps + skips the AddBatch so the node blob stays lean; on a sidecar write failure it falls back to persisting Meta via AddBatch so coverage is never lost.
- All six coverage readers (coverage_gaps, coverage_summary, health_score, knowledge_gaps, inspections, replay_episode) now batch-load the sidecar once via coverageByID() and read through coveragePctFrom(), which falls back to Meta["coverage_pct"] for un-migrated DBs / capability-less backends.
EdgeCoveredBy edges (and their edge-level coverage_pct) are unchanged. Recompute-on-next-enrich migration; no destructive backfill.
Tests: conformance on both backends, enricher round-trips the sidecar (and leaves no coverage in node Meta), and the coverage analyzers + the enrich-coverage MCP path read it.
….meta into a typed sidecar
Change A, releases domain (mirrors churn/coverage). The "first appeared in <tag>" marker now persists in a typed release_enrichment table (node_id PK + repo_prefix + added_in) via a new optional ReleaseEnrichmentWriter/Reader capability on both backends, with a conformance case.
- The releases enricher (EnrichGraphForBranch, which EnrichGraph and EnrichGraphWithRepoPrefix delegate to) writes the sidecar and no longer stamps Node.Meta; a sidecar write error propagates (the enricher already returns error).
- The releases analyzer (enrich_releases read path) batch-loads the sidecar via releaseByID() and reads through addedInFrom(), falling back to Meta["added_in"] for un-migrated DBs / capability-less backends; the "any added_in?" existence probe checks the sidecar first.
Recompute-on-next-enrich migration. Tests: conformance on both backends, enricher round-trips the sidecar (and leaves no added_in in node Meta), and the releases analyzer reads it.
… a typed sidecar
Change A, blame domain (mirrors churn/coverage/releases). Per-symbol
last_authored {commit,email,timestamp} now persists in a typed
blame_enrichment table (node_id PK + repo_prefix + commit/email/ts) via a
new optional BlameEnrichmentWriter/Reader capability on both backends,
with a conformance case.
- The blame enricher writes the sidecar and no longer stamps Node.Meta;
a sidecar write error propagates (EnrichGraph already returns error).
Person nodes + EdgeAuthored edges are unchanged.
- All blame readers redirect through batched sidecar maps with a
Meta-fallback for un-migrated DBs: stale_code, ownership, stale_flags'
caller-recency (tools_enhancements via blameRowsByID/lastAuthoredFrom),
health_score's recency axis (lastAuthoredTSFrom), and the novelty
hotspot weight (nodeLastAuthoredTime now takes the blame map). The
pre-existing dead reads (inspections' string-typed last_authored,
directional hotspots' nodeAddedInTime) are left unchanged — their
behavior is identical pre/post migration.
Recompute-on-next-enrich migration. Tests: conformance on both backends,
enricher round-trips the sidecar (no last_authored left in Meta), and the
blame analyzers read it.
Change A completion: when a file is deleted/renamed (Indexer.EvictFile and the reconcile deletion sweeps), drop its nodes' churn/coverage/release/blame sidecar rows via evictEnrichment so a removed file leaves no orphan enrichment. Capability-gated (no-op on backends without the writers).
Not cascaded on the modify path: a re-index keeps the same node IDs, so their enrichment rows stay valid (and a renamed symbol's stale row is harmless, since readers skip a row whose node is gone). Runs alongside the existing restubIncomingRefs on the same delete sites.
Test: EvictFile drops the seeded churn/coverage/blame rows for the file's nodes.
Add the blame, coverage, and co-change enrichers to the daemon control surface alongside the existing churn / releases verbs. Each runs the in-process enricher against the daemon's warm graph, per tracked repo prefix, so the persisted metadata is immediately queryable and the on-disk store write lock stays uncontested.
- proto: ControlEnrichBlame/Coverage/Cochange verbs + Params/Result. Coverage carries pre-parsed segments on the wire so the daemon never reads the caller's filesystem.
- Controller interface + server dispatch cases for the three verbs.
- realController: per-prefix handlers mirroring EnrichChurn/Releases, factored through a shared resolveEnrichTargets helper.
- fakeController stubs (daemon + hooks test packages) and a dispatch test covering all five enrich verbs.
Every enrich subcommand (churn/blame/coverage/releases/cochange/all) now persists exclusively through the running daemon. When no daemon is reachable the command returns a single clean error instead of building a throwaway in-memory graph that nothing would read (and which, if it wrote to the on-disk store directly, would race the daemon's writer).
- Delete the standalone in-memory index+enrich fallback and the --snapshot flags that only made sense for it.
- All six subcommands resolve the path scope, check daemon.IsRunning(), then forward via the matching control RPC through shared dial / controlEnrich helpers. Coverage parses the profile CLI-side and ships the segments to the daemon.
- 'enrich all' runs each enricher via successive control calls; the per-enricher toggles (--no-churn/blame/releases/cochange, --coverage) are preserved. --branch is kept for churn and releases.
- Tests: no-daemon error path for every subcommand, plus coverage profile-parse-before-daemon-check ordering.
…-mine
The lazy mineCoChange path wrote only the in-memory caches and deliberately skipped materialising EdgeCoChange edges, so every daemon restart re-mined `git log` (5-15s) because the coChangeFromEdges fast path found nothing.
mineCoChange now persists the mined pairs via cochange.AddEdges after the mine, so a subsequent start reads them back via coChangeFromEdges and skips the mine.
The persist is bounded: mineCoChange runs once per process (sync.Once) and the fast path skips the mine once edges exist, so the edge count (and the clusters-cache token) moves at most once per graph: a single recompute, not the per-restart drift the original skip avoided. Co-change edges are partition-irrelevant (edgeWeight 0; KindFile endpoints are filtered out of community detection), so that one recompute yields the same partition.
Refreshing stale co-change after a HEAD move remains a manual `gortex enrich cochange` (or cold reindex). The lazy path does not auto-re-mine once edges exist; that's an intentional scope boundary (auto-refresh needs a per-HEAD marker coordinated with the CLI enrich path).
Test: AddEdges-persisted edges take the coChangeFromEdges fast path with the right score/count.
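The "mine once, persist, then take the fast path on later starts" behaviour can be modelled as below. This is a sketch, not the project's code: `edgeStore` stands in for the persisted EdgeCoChange edges, `process` stands in for one daemon run, and the `mine` callback replaces the real `git log` pass.

```go
package main

import (
	"fmt"
	"sync"
)

// edgeStore stands in for persisted co-change edges: file pair -> count.
type edgeStore map[string]int

// process models one daemon run; once is the per-process sync.Once gate.
type process struct {
	once  sync.Once
	store edgeStore
}

// coChange returns the co-change pairs, mining at most once per process and
// only when the store has no persisted edges yet.
func (p *process) coChange(mine func() map[string]int) edgeStore {
	p.once.Do(func() {
		if len(p.store) > 0 {
			return // fast path: persisted edges exist, skip the mine
		}
		for pair, n := range mine() {
			p.store[pair] = n // persist so the next start can skip mining
		}
	})
	return p.store
}

func main() {
	store := edgeStore{} // survives "restarts" in this model
	mines := 0
	mine := func() map[string]int {
		mines++ // the real mine is the expensive git log pass
		return map[string]int{"a.go|b.go": 3}
	}

	first := &process{store: store}
	first.coChange(mine)
	first.coChange(mine) // same process: gated by sync.Once

	second := &process{store: store} // simulated daemon restart
	pairs := second.coChange(mine)

	fmt.Println("mines:", mines, "pairs:", len(pairs))
}
```

The same fast path is why stale counts are served as-is after history advances: nothing in this flow invalidates the persisted edges.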
…ories
Replace the gob.gz flat-file persistence for session notes and cross-session development memories with a SQLite sidecar DB, separate from the graph store and independent of the graph --backend so the side-stores persist even under the in-memory backend.
- New persistence.SidecarStore opens <DataDir>/sidecar.sqlite (WAL, synchronous=NORMAL, busy_timeout=5000) with notes + memories tables (plus scopes + notebooks for the follow-up), row CRUD, and bounded-DELETE trim mirroring the prior cap/trim policy. Handles are cached per absolute path so one file backs every repo a daemon serves.
- notes/memories managers keep their public API, in-memory model, and scorers byte-for-byte; only the persistence layer swaps to per-row upserts + a bounded DELETE trim.
- One-shot legacy import: a pre-existing notes.gob.gz / memories.gob.gz is loaded into the sidecar on first open (guarded on an empty table + a migration mark) then renamed to *.bak, never deleted. Idempotent.
- Init wiring: the gortex mcp subprocess persists under its cache dir; the daemon now persists too via the data-dir sidecar with a stable partition key (it was previously no-persistence).
…idecar
Back the saved-scope registry and the repository notebook on the same SQLite sidecar DB as notes + memories, making sqlite the primary store for all four side-stores.
- Scopes: scopeStore now reads/writes the global scopes table; the in-memory byName map mirrors it for lock-cheap reads. A pre-existing scopes.json is imported once then renamed to scopes.json.bak.
- Notebooks: notebookManager persists to the notebooks table at <repo>/.gortex/sidecar.sqlite instead of per-entry markdown files. The notebookEntry shape is unchanged; the schema reserves a symbol_ids column for the future. Legacy <repo>/.gortex/notebook/<id>.md files are imported once then renamed to <id>.md.bak; the markdown marshal/unmarshal helpers stay for the importer. TTL prune becomes a bounded DELETE. A markdown mirror is deliberately left to a later step.
- Tests: notebook handler tests ported off markdown-file assertions to round-trip through the sidecar; added manager-level legacy-import + restart-persistence tests for notes / memories / scopes / notebooks.
The writer goroutine pushed every Save/Evict result into the buffered errs channel but only checked the stop channel at the top of the loop. The writer outruns the 64-slot buffer in microseconds, so it blocked on a full errs send, never re-checked stop, and wg.Wait() deadlocked forever (errs is only drained after wg.Wait()). The hang was in the test harness, not FileStore (which serialises via flock + atomic temp+rename). Honour stop while sending so the writer exits on a full buffer. The test now completes (and passes under -race): FileStore's concurrent Save/Evict/Load safety is exercised without the harness deadlock.
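The harness fix amounts to sending inside a `select` that also watches the stop channel. A minimal sketch of that pattern follows; `writer`, `runHarness`, and the simulated operation are illustrative names rather than the test's actual code.

```go
package main

import (
	"errors"
	"fmt"
	"runtime"
	"sync"
)

// writer produces one result per iteration and must stay responsive to stop
// even while the errs buffer is full.
func writer(op func() error, stop <-chan struct{}, errs chan<- error, wg *sync.WaitGroup) {
	defer wg.Done()
	for {
		select {
		case <-stop:
			return
		default:
		}
		err := op()
		select {
		case errs <- err:
		case <-stop: // honour stop while blocked on a full buffer
			return
		}
	}
}

// runHarness starts a writer, lets it fill the buffer, stops it, and returns
// how many results were buffered. errs is only drained after wg.Wait().
func runHarness(slots int) int {
	stop := make(chan struct{})
	errs := make(chan error, slots)
	var wg sync.WaitGroup
	wg.Add(1)
	go writer(func() error { return errors.New("simulated result") }, stop, errs, &wg)

	for len(errs) < cap(errs) { // let the writer outrun the buffer
		runtime.Gosched()
	}
	close(stop)
	wg.Wait() // with a plain blocking send this would never return
	close(errs)

	n := 0
	for range errs {
		n++
	}
	return n
}

func main() {
	fmt.Println("buffered results:", runHarness(64))
}
```

Replacing the second `select` with a bare `errs <- err` reproduces the deadlock: the writer parks on the full channel and never observes the closed stop channel.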
…mined)
The lazy co-change path serves persisted EdgeCoChange edges as-is once they exist and does not auto-re-mine, so the counts can be stale after git history advances. Emit an Info line on that fast path ("could be updated, but was not") pointing at `gortex enrich cochange` to refresh, rather than silently serving possibly-stale data. Fires at most once per daemon process (mineCoChange is sync.Once-gated).
- sidecar_sqlite.go: convert legacyScope→ScopeRow directly (staticcheck S1016)
- cli_progress.go: drop indexWithSpinner (orphaned when enrich stopped building an in-memory graph; it forwards to the daemon) and its now unused context/fmt/indexer imports (unused)
- tools_analyze_health_score.go: drop extractTimestamp, replaced by lastAuthoredTSFrom in the blame-sidecar read path (unused)
golangci-lint run --timeout=5m → 0 issues.
Summary
Benchmark of persistence-layer options for launches that are not in-memory-only, targeting huge repositories in a local (non-remote-server) setup.
Results
gortex scale (~2000 files, ~125k nodes, ~520k edges):
Declined due to performance issues
On the Linux scale, `cozo` confirmed the issue with extremely slow queries, so the `cozo` option was declined too. Kuzu (public archive) was declined too.