Token-aware chunking with overlap and structure-safe splits

## Context
The deterministic chunker in `packages/core/src/chunks.ts` splits sections by paragraph and then **hard-slices at `MAX_CHUNK_CHARS = 4000`** in `splitLong()` (line 4 / 126-146), cutting mid-sentence, mid-code-fence, and mid-table. It is the single chunking core consumed by BOTH pipelines: the CLI/MCP path and the production Worker queue consumer (`apps/api/src/worker.ts` calls `renderChunksNdjson(buildChunks(...))` at ~2186 and ~2790). Poor chunk boundaries degrade RAG retrieval AND the verbatim grounding gate in `facts.ts` (`validateQuote` relies on `normalizeText` exported from `chunks.ts`).

## Goal / user story
As an agent builder recalling extracted context, I want chunks that are token-budgeted, overlapping, and never split inside a sentence, code block, or table, so that retrieved snippets are self-contained and citable. This improves both grounded chat answers and fact-extraction precision.

## Acceptance criteria
- [ ] Chunk size is governed by an approximate **token budget (~512-800 tokens)** rather than a raw 4000-char cap (a cheap chars-per-token heuristic is acceptable; no heavy tokenizer dependency in the Worker bundle).
- [ ] Consecutive parts within a section carry **~10-15% overlap** so a fact spanning a boundary survives in at least one chunk.
- [ ] Splits never occur inside a fenced code block (```), never break a markdown table (lines starting with `|`), and prefer sentence/paragraph boundaries over mid-word cuts.
- [ ] `chunkId` remains **location-based and deterministic** across reruns (same `routePath` + `headingPath` + stable part index), so `planMemoryWrite` diffing and `chunkGraphDigest()` stay stable for the `*/30` freshness pipeline.
- [ ] `normalizeText` stays the shared normalizer; `facts.test.ts` (validateQuote substring gate) and `chunks.test.ts` remain green, with new cases for code-fence/table/overlap.
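
The budget and structure-safety criteria above could be approached roughly as follows. This is a minimal sketch, not the existing `chunks.ts` code: `estimateTokens`, `atomicBlocks`, and the 4-chars-per-token ratio are all hypothetical and would need tuning against real content. It groups a fenced code block or a run of table rows into one indivisible block, so any later split can only land between blocks.

```typescript
// Hypothetical helpers; none of these names come from chunks.ts.
const CHARS_PER_TOKEN = 4; // assumption: rough average for English prose
const FENCE = "`".repeat(3); // three backticks, built this way to keep the sample readable

/** Cheap token estimate with no tokenizer dependency. */
export function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

/**
 * Groups lines into blocks that must never be split:
 * a whole fenced code block, a whole run of table rows, or a single other line.
 */
export function atomicBlocks(text: string): string[] {
  const blocks: string[] = [];
  let fence: string[] | null = null; // lines of the currently open code fence
  let table: string[] = []; // lines of the current run of table rows

  const flushTable = (): void => {
    if (table.length > 0) {
      blocks.push(table.join("\n"));
      table = [];
    }
  };

  for (const line of text.split("\n")) {
    const trimmed = line.trimStart();
    if (fence !== null) {
      // Inside a fence: keep collecting until the closing marker.
      fence.push(line);
      if (trimmed.startsWith(FENCE)) {
        blocks.push(fence.join("\n"));
        fence = null;
      }
      continue;
    }
    if (trimmed.startsWith(FENCE)) {
      flushTable();
      fence = [line];
      continue;
    }
    if (trimmed.startsWith("|")) {
      table.push(line);
      continue;
    }
    flushTable();
    blocks.push(line);
  }
  flushTable();
  // An unterminated fence is kept whole rather than dropped.
  if (fence !== null) blocks.push(fence.join("\n"));
  return blocks;
}
```

Joining the returned blocks with `"\n"` reproduces the input, so the grouping itself does not alter the text. Sentence-boundary preference inside long prose lines is not handled here and would need a separate step.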

## Implementation notes
- Touch only `packages/core/src/chunks.ts`; extend `packages/core/src/chunks.test.ts`. This improves both pipelines at once because both call `buildChunks`.
- Replace `splitLong()` with token-aware accumulation that emits a part once the token budget is reached, then prepends the trailing ~10-15% of the previous part as overlap. Track code-fence open/close state and table-row runs while iterating lines so that a split is deferred until after those blocks.
- Keep the `partCounters` location-key scheme so `` chunkId = sha256Hex(`${locationKey}\n#${partIndex}`).slice(0,16) `` stays stable; overlap text must NOT change the partIndex sequence (overlap is additive content, not a new logical part).
- `contentHash` stays `sha256Hex(normalizeText(part))`. Note that overlap means a quote may appear in two chunks; that is fine, because `validateQuote` checks that the quote is a substring of the *cited* chunk.
- Gotcha: the Worker bundle is size/CPU constrained — prefer a heuristic char estimate or a tiny pure-JS token estimator over `js-tiktoken`/`gpt-tokenizer`.
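
The notes above on overlap and stable part indices might look like this in practice. It is a sketch under stated assumptions: `packParts`, the `Part` shape, the default budget of 650 tokens, and the 0.12 overlap ratio are hypothetical, the input is assumed to be already grouped into unsplittable blocks, and `sha256Hex` is re-implemented here with Node's `crypto` module only to keep the sample self-contained (a Worker build would likely use whatever `chunks.ts` already provides). Only the `chunkId` formula is taken from the issue text.

```typescript
import { createHash } from "node:crypto";

interface Part {
  partIndex: number;
  chunkId: string;
  text: string;
}

// Stand-ins for helpers assumed to exist elsewhere.
const sha256Hex = (s: string): string =>
  createHash("sha256").update(s, "utf8").digest("hex");
const estimateTokens = (s: string): number => Math.ceil(s.length / 4);

export function packParts(
  blocks: string[],
  locationKey: string,
  budgetTokens = 650,
  overlapRatio = 0.12,
): Part[] {
  // Pass 1: group blocks into logical parts under the budget, with no overlap.
  // A single block larger than the budget becomes its own oversized part.
  const groups: string[][] = [];
  let current: string[] = [];
  let used = 0;
  for (const block of blocks) {
    const cost = estimateTokens(block);
    if (current.length > 0 && used + cost > budgetTokens) {
      groups.push(current);
      current = [];
      used = 0;
    }
    current.push(block);
    used += cost;
  }
  if (current.length > 0) groups.push(current);

  // Pass 2: prepend whole trailing blocks of the previous group as overlap.
  // partIndex is the group index, so overlap never shifts the index sequence.
  const overlapBudget = Math.floor(budgetTokens * overlapRatio);
  return groups.map((group, partIndex) => {
    const carried: string[] = [];
    const prev = partIndex > 0 ? groups[partIndex - 1] ?? [] : [];
    let carriedTokens = 0;
    for (const block of [...prev].reverse()) {
      const cost = estimateTokens(block);
      if (carriedTokens + cost > overlapBudget) break; // no partial blocks
      carried.unshift(block);
      carriedTokens += cost;
    }
    const text = [...carried, ...group].join("\n");
    const chunkId = sha256Hex(`${locationKey}\n#${partIndex}`).slice(0, 16);
    return { partIndex, chunkId, text };
  });
}
```

Because overlap is added after grouping, a part's final text can exceed the budget by up to the overlap allowance. Carrying only whole blocks also means that no overlap is added when the previous part ends in a block larger than the allowance, which keeps code fences and tables intact at the cost of a weaker overlap guarantee.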

## Sui Overflow angle
Clean, stable chunk boundaries make `chunkGraphDigest()` a trustworthy "what was extracted" anchor — the exact value the on-chain `contextmem.extract` attribution receipt keys on. Better chunks also directly improve the grounded RAG chat that headlines the live demo.

## Dependencies
None (foundational; the hybrid-retrieval issue builds on these chunks).

_Part of the ContextMEM roadmap (#4) • Sui Overflow build._
