2 changes: 1 addition & 1 deletion README.md
@@ -78,7 +78,7 @@ Grep and fuzzy file search work fine for small projects. At scale they break dow
│ └── embedded Swagger UI │
│ │
│ Indexing pipeline │
- │ ├── gotreesitter (AST chunking, 200+ languages)
+ │ ├── tree-sitter/wasm (AST chunking, 31 langs) (wazero)
│ ├── llama-server sidecar (Unix socket → CodeRankEmbed Q8 GGUF) │
│ ├── chromem-go (cosine similarity vector store) │
│ ├── SQLite FTS5 chunk mirror (BM25 — powers hybrid workspace) │
21 changes: 21 additions & 0 deletions doc/openapi.yaml
@@ -3245,6 +3245,8 @@ components:
- max_embedding_concurrency
- llama_batch_size
- index_embed_batch_chunks
- chunk_max_concurrent
- llama_cache_ram_mib
- source
properties:
embedding_model:
@@ -3270,6 +3272,14 @@ components:
type: integer
minimum: 0
description: Cross-file embed-batch size for repo indexing (chunks per embed call). 0 = one call per file.
chunk_max_concurrent:
type: integer
minimum: 0
description: Chunker (tree-sitter wasm) instance-concurrency cap, decoupled from embedding concurrency. Each instance holds ~69 MiB. 0 = recommended (3).
llama_cache_ram_mib:
type: integer
minimum: -1
description: llama-server host prompt-cache cap in MiB (--cache-ram). 0 = disabled (recommended for embeddings — prompts are never reused, and llama's upstream 8 GiB default grows RSS until the container OOMs), -1 = unlimited.
source:
type: object
additionalProperties:
@@ -3301,6 +3311,8 @@ components:
- max_embedding_concurrency
- llama_batch_size
- index_embed_batch_chunks
- chunk_max_concurrent
- llama_cache_ram_mib
properties:
embedding_model: { type: string }
llama_ctx_size: { type: integer }
@@ -3309,6 +3321,8 @@ components:
max_embedding_concurrency: { type: integer }
llama_batch_size: { type: integer }
index_embed_batch_chunks: { type: integer }
chunk_max_concurrent: { type: integer }
llama_cache_ram_mib: { type: integer }

RuntimeConfigUpdate:
type: object
@@ -3339,6 +3353,13 @@ components:
index_embed_batch_chunks:
type: integer
nullable: true
chunk_max_concurrent:
type: integer
nullable: true
llama_cache_ram_mib:
type: integer
nullable: true
description: MiB; 0 clears the override (falls back to env / recommended = disabled), -1 = unlimited.

SidecarStatus:
type: object
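
Reviewer note: the sentinel values in the two new fields are easy to misread, so here is a minimal Go sketch of how they could be resolved. It follows only what the schema descriptions above state (`0` = recommended (3) for `chunk_max_concurrent`, ~69 MiB per chunker instance, and `0` = disabled / `-1` = unlimited for `llama_cache_ram_mib`, with `0` in an update clearing the override). The function names and the plain-int override model are assumptions for illustration, not the server's actual code.

```go
package main

import "fmt"

const (
	recommendedChunkConcurrency = 3  // from the schema: "0 = recommended (3)"
	chunkerInstanceMiB          = 69 // from the schema: "Each instance holds ~69 MiB"
)

// effectiveChunkConcurrency maps the sentinel 0 to the recommended value.
// The schema sets minimum: 0, so negative inputs are not expected here.
func effectiveChunkConcurrency(v int) int {
	if v == 0 {
		return recommendedChunkConcurrency
	}
	return v
}

// chunkerMemoryMiB estimates the memory held by chunker instances.
func chunkerMemoryMiB(v int) int {
	return effectiveChunkConcurrency(v) * chunkerInstanceMiB
}

// resolveCacheRAM applies a runtime override on top of the env value.
// Assumption: an override of 0 means "no override", per the description
// "0 clears the override (falls back to env / recommended = disabled)".
func resolveCacheRAM(override, envValue int) int {
	if override == 0 {
		return envValue
	}
	return override
}

// describeCacheRAM spells out what a resolved llama_cache_ram_mib value means.
func describeCacheRAM(mib int) string {
	switch {
	case mib == -1:
		return "unlimited"
	case mib == 0:
		return "disabled"
	default:
		return fmt.Sprintf("capped at %d MiB", mib)
	}
}

func main() {
	fmt.Println("chunker instances:", effectiveChunkConcurrency(0))
	fmt.Println("chunker memory (MiB):", chunkerMemoryMiB(0))
	fmt.Println("cache:", describeCacheRAM(resolveCacheRAM(0, 0)))
	fmt.Println("cache:", describeCacheRAM(resolveCacheRAM(-1, 512)))
}
```

With the recommended concurrency of 3, the chunker pool alone accounts for roughly 3 × 69 MiB, which is worth keeping in mind when sizing a container.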
3 changes: 3 additions & 0 deletions poc/wasm-treesitter/.gitignore
@@ -0,0 +1,3 @@
# build artifact — rebuilt by build.sh; the committed module lives as
# server/internal/chunker/tswasm/ts-core.wasm.br (brotli)
ts-core.wasm
70 changes: 70 additions & 0 deletions poc/wasm-treesitter/README.md
@@ -0,0 +1,70 @@
# PoC: tree-sitter via WASM/wazero (pure-Go, no cgo)

An alternative to the cgo backend on `feat/chunker-cgo-treesitter`. The **official**
tree-sitter C runtime + the official TypeScript grammar are compiled to a single
standalone `wasm32-wasi` reactor module (`build.sh`, via `zig cc`) and driven
from Go through [wazero](https://github.com/tetratelabs/wazero) — **no cgo, no
JavaScript, no third-party parser**. Only `wasmts.go` (our wazero host) is
bespoke; the parser itself is the unmodified upstream C.

Goal: give us real **speed + stability** numbers to choose between cgo and wasm.

## Results — same 852-file vscode TypeScript corpus, full-tree walk

| backend | wall | files/s | MB/s | ERROR trees | `editorOptions.ts` |
|---|---|---|---|---|---|
| gotreesitter (pure-Go GLR) | 13.83 s | 62 | 0.8 | **13** | 8.77 s → ERROR |
| **WASM (wazero, pure-Go host)** | **~2.5 s** | **~330** | **~4.1** | **0** | **49 ms** |
| cgo (native tree-sitter) | 1.26 s | 675 | 11.5 | 0 | 17 ms |

- **WASM is ~2× slower than cgo, ~5× faster than gotreesitter, and correct** (0 ERROR trees vs gotreesitter's 13).
- The WASM overhead is the **host↔guest call boundary**, not memory: each of the
2.68 M nodes costs ~3 wazero calls (`ts_node_type`, `ts_node_child_count`,
`ts_node_child`). Reusing node slots instead of `malloc`/`free` per node moved
the number only 328→357 files/s — so it's the calls. A single batched
"serialize subtree" export would close most of the remaining gap vs cgo
(future work; not done here).
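
To make the call-boundary argument concrete, here is a rough Go sketch of what the host side of a batched walk could look like: the guest writes one flat buffer of fixed-size records and the host decodes it in a single pass, instead of crossing the boundary ~3 times per node. The 12-byte little-endian record layout and the names `nodeRec`, `encodeNodes` and `decodeNodes` are assumptions for illustration only; the real export's format is not described here.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// nodeRec is a hypothetical flat record for one syntax node.
type nodeRec struct {
	Symbol    uint32
	StartByte uint32
	EndByte   uint32
}

// recSize is the assumed on-the-wire size of one record (3 x uint32).
const recSize = 12

// encodeNodes stands in for the guest: it serializes nodes into one buffer.
func encodeNodes(nodes []nodeRec) []byte {
	buf := make([]byte, len(nodes)*recSize)
	for i, n := range nodes {
		off := i * recSize
		binary.LittleEndian.PutUint32(buf[off:], n.Symbol)
		binary.LittleEndian.PutUint32(buf[off+4:], n.StartByte)
		binary.LittleEndian.PutUint32(buf[off+8:], n.EndByte)
	}
	return buf
}

// decodeNodes is the host side: one buffer in, all nodes out, no guest calls.
func decodeNodes(buf []byte) ([]nodeRec, error) {
	if len(buf)%recSize != 0 {
		return nil, errors.New("buffer length is not a multiple of the record size")
	}
	out := make([]nodeRec, 0, len(buf)/recSize)
	for off := 0; off < len(buf); off += recSize {
		out = append(out, nodeRec{
			Symbol:    binary.LittleEndian.Uint32(buf[off:]),
			StartByte: binary.LittleEndian.Uint32(buf[off+4:]),
			EndByte:   binary.LittleEndian.Uint32(buf[off+8:]),
		})
	}
	return out, nil
}

func main() {
	tree := []nodeRec{
		{Symbol: 1, StartByte: 0, EndByte: 10},
		{Symbol: 2, StartByte: 0, EndByte: 4},
		{Symbol: 3, StartByte: 5, EndByte: 10},
	}
	decoded, err := decodeNodes(encodeNodes(tree))
	if err != nil {
		panic(err)
	}
	// Figures from the benchmark above: 2.68 M nodes, ~3 calls per node.
	const nodes, callsPerNode = 2_680_000, 3
	fmt.Printf("per-node walk: ~%d boundary crossings\n", nodes*callsPerNode)
	fmt.Printf("batched walk: 1 crossing per tree, %d records decoded in Go\n", len(decoded))
}
```

The design trade-off is that the host loses lazy access to the tree: everything the chunker might need has to be in the record, so the layout would have to grow with the chunker's requirements.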

## Stability (`cmd/stability`)

- tree-sitter is **robust**: 6 adversarial inputs (100–200 k-deep nesting, 5 MB
single token, invalid UTF-8, unbalanced templates) all parsed without crashing
— this is true of cgo too, so it is **not** a bug WASM uniquely fixes.
- What WASM **adds** is containment: a guest-side fault (resource limit, and in
principle any C bug — stack overflow, OOB) surfaces as a **recoverable Go
error**; the host process stays alive. The memory-capped run demonstrates this.
- Under cgo the equivalent fault is a native **SIGSEGV/abort that kills the whole
cix-server**. So crash-isolation is **insurance against unknown C bugs in
grammars/scanners**, not a fix for an observed crash.
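
The containment property can be sketched from the host's point of view. The Go snippet below is only an illustration under stated assumptions: `fakeGuestParse` is a hypothetical stand-in for a call into the wasm guest that fails when an input exceeds a size limit, and `indexAll` shows the caller logging the failure and carrying on with the next file. It does not use wazero and is not the PoC's `cmd/stability` code.

```go
package main

import (
	"errors"
	"fmt"
)

// errGuestTrap models a guest-side fault surfacing as an ordinary Go error.
var errGuestTrap = errors.New("guest trap: memory limit exceeded")

// parseFunc is the shape of a call into the (hypothetical) guest parser.
type parseFunc func(src []byte) (nodes int, err error)

// fakeGuestParse returns a parser that "traps" on inputs larger than limit.
func fakeGuestParse(limit int) parseFunc {
	return func(src []byte) (int, error) {
		if len(src) > limit {
			return 0, errGuestTrap
		}
		return len(src), nil
	}
}

// indexAll parses every input, records the indexes that failed, and keeps going.
func indexAll(parse parseFunc, inputs [][]byte) (parsed int, failed []int) {
	for i, src := range inputs {
		if _, err := parse(src); err != nil {
			failed = append(failed, i)
			continue
		}
		parsed++
	}
	return parsed, failed
}

func main() {
	inputs := [][]byte{
		[]byte("let a = 1"),
		make([]byte, 1<<20), // oversized input that trips the limit
		[]byte("let b = 2"),
	}
	parsed, failed := indexAll(fakeGuestParse(1024), inputs)
	fmt.Printf("parsed %d inputs, %d failed (indexes %v); host still running\n",
		parsed, len(failed), failed)
}
```

Under cgo there is no equivalent loop to write: a native fault terminates the process before any Go error handling runs.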

## Trade-off summary

| | cgo (current) | WASM/wazero (this PoC) |
|---|---|---|
| Parse speed | 🟢 fastest | 🟡 ~2× slower (≈invisible end-to-end: embeddings dominate) |
| Correctness | 🟢 official | 🟢 official (identical parser) |
| Build | 🟡 needs C toolchain (musl-static solved it) | 🟢 `CGO_ENABLED=0`, trivial cross-compile; `zig` only at wasm-build time (one-off, artifact committed) |
| Crash isolation | 🔴 C fault kills process | 🟢 contained → Go error |
| Binary size | 🔴 ~78 MB (grammar tables linked natively) | 🟢 likely smaller: pure-Go host (~41 MB) + embedded `.wasm` (1.4 MB / grammar, brotli-compressible) |
| Maturity / effort | 🟢 drop-in (official binding + 31 grammar modules) | 🔴 bespoke host; must build/bundle 31 grammar `.wasm` + flesh out node API + batched walk |

## Honest read

It's close. cgo is done and fastest. WASM costs ~2× on **parsing**, but since
**embeddings dominate end-to-end indexing time**, that 2× is largely invisible in
production — while WASM's upsides (no cgo, crash-isolation, smaller binary,
toolchain-free server builds) are real. The price of WASM is **engineering
effort** to productionize: build all 31 grammars into the module, write the full
node-walk API the chunker needs (with a batched-walk export to recover speed),
and wire it behind the same `tsgrammars`-style registry.
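
The "largely invisible" claim is an Amdahl-style argument, and a small Go calculation shows its shape. The parse shares used below (1%, 5%, 20% of end-to-end time) are hypothetical placeholders, not measurements from this PoC; only the 2× parse factor comes from the results above.

```go
package main

import "fmt"

// endToEndSlowdown returns the ratio of new to old total time when the parse
// stage becomes parseFactor times slower and parsing accounts for parseShare
// (0..1) of the baseline end-to-end time.
func endToEndSlowdown(parseShare, parseFactor float64) float64 {
	return (1 - parseShare) + parseShare*parseFactor
}

func main() {
	// Hypothetical parse shares; the real split is not reported in this README.
	for _, share := range []float64{0.01, 0.05, 0.20} {
		fmt.Printf("parse share %.0f%% -> end-to-end %.2fx\n",
			share*100, endToEndSlowdown(share, 2))
	}
}
```

The smaller the parse share, the closer the end-to-end factor stays to 1, which is why measuring that share on a real indexing run would settle the question.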

## Build & run

```bash
brew install zig # provides clang + wasi-libc cross-compile
./build.sh # → ts-core.wasm by default, override with OUT (official tree-sitter v0.25.10 runtime + the grammars pinned in build.sh)
go run ./cmd/bench /path/to/vscode/src/vs/editor
go run ./cmd/stability
```

`ts-ts.wasm` is committed so the benchmarks run without zig.
108 changes: 108 additions & 0 deletions poc/wasm-treesitter/build.sh
@@ -0,0 +1,108 @@
#!/usr/bin/env bash
# Builds ts-core.wasm: the OFFICIAL tree-sitter C runtime + the base grammars +
# our host_extra.c (the batched ts_dump_tree walk), compiled to ONE standalone
# wasm32-wasi reactor module via `zig cc`. No emscripten, no JS glue.
#
# Requires: zig (clang + wasi-libc cross-compile), git, and tree-sitter CLI (only
# for grammars whose repo ships no committed parser.c — gen=1 rows).
#
# For wasm we compile each grammar IN PLACE from a full clone, so relative
# includes (e.g. typescript's ../../common/scanner.h) and src-root headers (html
# tag.h, haskell unicode.h) resolve naturally — none of the vendor.sh copy/rewrite
# dance is needed. Quirks that remain: SHA pins (dart), `tree-sitter generate`
# (sql), and a 2nd grammar from one repo (tsx). See plan §6.1.
set -euo pipefail
cd "$(dirname "$0")"

TS_VERSION="${TS_VERSION:-v0.25.10}"
OUT="${OUT:-ts-core.wasm}"
WORK="$(mktemp -d)"
trap 'rm -rf "$WORK"' EXIT

# id repo ref srcsubdir [gen]
GRAMMARS=(
"python tree-sitter/tree-sitter-python v0.25.0 src"
"typescript tree-sitter/tree-sitter-typescript v0.23.2 typescript/src"
"tsx tree-sitter/tree-sitter-typescript v0.23.2 tsx/src"
"javascript tree-sitter/tree-sitter-javascript v0.25.0 src"
"go tree-sitter/tree-sitter-go v0.25.0 src"
"rust tree-sitter/tree-sitter-rust v0.24.2 src"
"java tree-sitter/tree-sitter-java v0.23.5 src"
"c tree-sitter/tree-sitter-c v0.24.2 src"
"cpp tree-sitter/tree-sitter-cpp v0.23.4 src"
"ruby tree-sitter/tree-sitter-ruby v0.23.1 src"
"c_sharp tree-sitter/tree-sitter-c-sharp v0.23.5 src"
"php tree-sitter/tree-sitter-php v0.24.2 php/src"
"swift alex-pinkus/tree-sitter-swift 0.7.3-with-generated-files src"
"kotlin tree-sitter-grammars/tree-sitter-kotlin v1.1.0 src"
"scala tree-sitter/tree-sitter-scala v0.26.0 src"
"bash tree-sitter/tree-sitter-bash v0.25.1 src"
"lua tree-sitter-grammars/tree-sitter-lua v0.5.0 src"
"dart UserNobody14/tree-sitter-dart a9bdfa3 src"
"r r-lib/tree-sitter-r v1.2.0 src"
"objc tree-sitter-grammars/tree-sitter-objc v3.0.2 src"
"html tree-sitter/tree-sitter-html v0.23.2 src"
"css tree-sitter/tree-sitter-css v0.25.0 src"
"scss tree-sitter-grammars/tree-sitter-scss v1.0.0 src"
"sql DerekStride/tree-sitter-sql v0.3.11 src 1"
"markdown tree-sitter-grammars/tree-sitter-markdown v0.5.3 tree-sitter-markdown/src"
"zig tree-sitter-grammars/tree-sitter-zig v1.1.2 src"
"julia tree-sitter/tree-sitter-julia v0.25.0 src"
"fortran stadelmanma/tree-sitter-fortran v0.6.0 src"
"haskell tree-sitter/tree-sitter-haskell v0.23.1 src"
"ocaml tree-sitter/tree-sitter-ocaml v0.25.0 grammars/ocaml/src"
"solidity JoranHonig/tree-sitter-solidity v1.2.13 src"
)

clone() { # repo ref dest — tag/branch fast path, SHA fallback
  local repo="$1" ref="$2" dest="$3"
  git clone --depth 1 --branch "$ref" "https://github.com/$repo" "$dest" >/dev/null 2>&1 && return 0
  git clone "https://github.com/$repo" "$dest" >/dev/null 2>&1 || return 1
  git -C "$dest" checkout "$ref" >/dev/null 2>&1
}

echo "→ tree-sitter runtime $TS_VERSION"
git clone --depth 1 --branch "$TS_VERSION" https://github.com/tree-sitter/tree-sitter "$WORK/tree-sitter" 2>/dev/null

SRCS=( "$WORK/tree-sitter/lib/src/lib.c" "csrc/host_extra.c" )
INCS=( -I "$WORK/tree-sitter/lib/include" -I "$WORK/tree-sitter/lib/src" )
EXPORTS=()
BUILT=() ; FAILED=()

for row in "${GRAMMARS[@]}"; do
  read -r id repo ref sub gen <<<"$row"
  printf ' %-12s %s@%s ' "$id" "$repo" "$ref"
  if ! clone "$repo" "$ref" "$WORK/$id"; then echo "CLONE FAIL"; FAILED+=("$id"); continue; fi
  gsrc="$WORK/$id/$sub"
  # gen=1 rows: generate parser.c when the repo does not ship one.
  if [ "${gen:-0}" = "1" ] && [ ! -f "$gsrc/parser.c" ]; then
    ( cd "$WORK/$id" && tree-sitter generate >/dev/null 2>&1 ) || true
  fi
  if [ ! -f "$gsrc/parser.c" ]; then echo "NO parser.c"; FAILED+=("$id"); continue; fi
  SRCS+=( "$gsrc/parser.c" )
  [ -f "$gsrc/scanner.c" ] && SRCS+=( "$gsrc/scanner.c" )
  [ -f "$gsrc/scanner.cc" ] && SRCS+=( "$gsrc/scanner.cc" )
  INCS+=( -I "$gsrc" )
  EXPORTS+=( -Wl,--export=tree_sitter_$id )
  BUILT+=("$id")
  echo "ok"
done

echo "→ compiling ${#SRCS[@]} sources, ${#BUILT[@]} grammars → $OUT"
zig cc --target=wasm32-wasi-musl -mexec-model=reactor \
  "${INCS[@]}" "${SRCS[@]}" \
  -o "$OUT" -Oz -fPIC -Wl,--no-entry -Wl,--strip-debug \
  -Wl,--export=malloc -Wl,--export=free \
  -Wl,--export=ts_parser_new -Wl,--export=ts_parser_delete \
  -Wl,--export=ts_parser_set_language -Wl,--export=ts_parser_parse_string \
  -Wl,--export=ts_parser_reset \
  -Wl,--export=ts_tree_delete -Wl,--export=ts_tree_root_node \
  -Wl,--export=ts_node_child_count -Wl,--export=ts_node_child \
  -Wl,--export=ts_node_type -Wl,--export=ts_node_start_byte \
  -Wl,--export=ts_node_end_byte -Wl,--export=ts_node_has_error \
  -Wl,--export=ts_dump_tree -Wl,--export=ts_dump_rec_size \
  -Wl,--export=ts_language_symbol_count -Wl,--export=ts_language_symbol_name \
  "${EXPORTS[@]}"

echo "built $OUT ($(du -h "$OUT" | cut -f1)) — runtime $TS_VERSION, ${#BUILT[@]} grammars"
[ ${#FAILED[@]} -gt 0 ] && echo "FAILED: ${FAILED[*]}"
echo "grammars: ${BUILT[*]}"
82 changes: 82 additions & 0 deletions poc/wasm-treesitter/cmd/bench/main.go
@@ -0,0 +1,82 @@
// Command bench parses every .ts file under a directory with the WASM backend
// and reports throughput — apples-to-apples with the cgo/gotreesitter numbers
// in ../../README.md (same corpus, same full-tree walk).
//
// go run ./cmd/bench /path/to/vscode/src/vs/editor
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"
	"sort"
	"time"

	wasmts "github.com/dvcdsys/code-index/poc/wasm-treesitter"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Fprintln(os.Stderr, "usage: bench <dir-with-.ts-files>")
		os.Exit(2)
	}
	root := os.Args[1]
	ctx := context.Background()
	eng, err := wasmts.New(ctx, 0)
	if err != nil {
		panic(err)
	}
	defer eng.Close()

	// Collect every .ts file under root; entries that cannot be read are skipped.
	var files []string
	filepath.WalkDir(root, func(p string, d os.DirEntry, e error) error {
		if e == nil && !d.IsDir() && filepath.Ext(p) == ".ts" {
			files = append(files, p)
		}
		return nil
	})
	fmt.Printf("corpus: %d .ts files\n\n", len(files))

	type res struct {
		path  string
		dur   time.Duration
		isErr bool
	}
	var all []res
	var totalParse time.Duration
	var totalBytes, errFiles, totalNodes int
	start := time.Now()
	for _, f := range files {
		// Skip unreadable files instead of timing a parse of empty input.
		src, err := os.ReadFile(f)
		if err != nil {
			fmt.Printf(" read error on %s: %v\n", filepath.Base(f), err)
			continue
		}
		t0 := time.Now()
		r, err := eng.Parse("tree_sitter_typescript", src)
		d := time.Since(t0)
		if err != nil {
			fmt.Printf(" trap on %s: %v\n", filepath.Base(f), err)
			continue
		}
		totalParse += d
		totalBytes += len(src)
		totalNodes += r.Nodes
		if r.HasError {
			errFiles++
		}
		all = append(all, res{f, d, r.HasError})
	}
	wall := time.Since(start)
	// Slowest files first.
	sort.Slice(all, func(i, j int) bool { return all[i].dur > all[j].dur })
	mb := float64(totalBytes) / 1e6
	fmt.Println("=== WASM: official tree-sitter via wazero (pure-Go host, no cgo) ===")
	fmt.Printf(" wall: %v parse+walk: %v\n", wall.Round(time.Millisecond), totalParse.Round(time.Millisecond))
	fmt.Printf(" throughput: %.0f files/s, %.1f MB/s\n", float64(len(files))/wall.Seconds(), mb/totalParse.Seconds())
	fmt.Printf(" ERROR trees: %d / %d nodes walked: %d\n", errFiles, len(files), totalNodes)
	fmt.Printf(" slowest 5:\n")
	for i := 0; i < 5 && i < len(all); i++ {
		e := ""
		if all[i].isErr {
			e = " [ERROR]"
		}
		fmt.Printf(" %8v %s%s\n", all[i].dur.Round(time.Millisecond), filepath.Base(all[i].path), e)
	}
}