Context
The deterministic chunker in packages/core/src/chunks.ts splits sections by paragraph and then hard-slices at MAX_CHUNK_CHARS = 4000 (line 4) in splitLong() (lines 126-146), cutting mid-sentence, mid-code-fence, and mid-table. It is the single chunking core consumed by BOTH pipelines: the CLI/MCP path and the production Worker queue consumer (apps/api/src/worker.ts calls renderChunksNdjson(buildChunks(...)) at around lines 2186 and 2790). Poor chunk boundaries degrade RAG retrieval AND the verbatim grounding gate in facts.ts (validateQuote relies on normalizeText exported from chunks.ts).
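To make the failure mode concrete, here is a minimal sketch of what a fixed-width slice does to a fenced block. hardSlice and hasDanglingFence are hypothetical stand-ins written for this illustration, not the real splitLong(), and the 30-character limit is an assumption chosen to keep the sample short (the real limit is 4000).

```typescript
// Hypothetical stand-in for a fixed-width slicer; this is NOT the real splitLong().
function hardSlice(text: string, maxChars: number): string[] {
  const parts: string[] = [];
  for (let start = 0; start < text.length; start += maxChars) {
    parts.push(text.slice(start, start + maxChars));
  }
  return parts;
}

// Fence marker built at runtime so this example nests safely inside a fenced block.
const FENCE_MARK = "`".repeat(3);

/** True when a part holds an odd number of fence markers, i.e. an unbalanced code fence. */
function hasDanglingFence(part: string): boolean {
  return (part.split(FENCE_MARK).length - 1) % 2 === 1;
}

const sample = [
  "Some prose.",
  FENCE_MARK + "ts",
  "const answer = 42;",
  FENCE_MARK,
  "More prose.",
].join("\n");

const sliced = hardSlice(sample, 30); // cut lands inside "const answer = 42;"
const broken = sliced.filter(hasDanglingFence).length;
console.log(sliced.length, broken);
```

With this sample the cut falls inside the code line, so both resulting parts carry an unbalanced fence. Neither is a self-contained snippet.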
Goal / user story
As an agent builder recalling extracted context, I want chunks that are token-budgeted, overlapping, and never split inside a sentence/code block/table, so retrieved snippets are self-contained and citable — improving both the grounded chat answers and fact extraction precision.
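Acceptance criteria
- … |), and prefer sentence/paragraph boundaries over mid-word cuts.
- chunkId remains location-based and deterministic across reruns (same routePath + headingPath + stable part index), so planMemoryWrite diffing and chunkGraphDigest() stay stable for the */30 freshness pipeline.
- normalizeText stays the shared normalizer; facts.test.ts (validateQuote substring gate) and chunks.test.ts remain green, with new cases for code-fence/table/overlap.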
Implementation notes
- Touch only packages/core/src/chunks.ts; extend packages/core/src/chunks.test.ts. This lifts both pipelines at once because both call buildChunks.
- Replace splitLong(): token-aware accumulation that emits parts at the token budget, then prepends the trailing ~10-15% of the previous part as overlap. Track code-fence open/close state and table-row runs while iterating lines so a split is deferred past those blocks.
- Keep the partCounters location-key scheme so chunkId = sha256Hex(`${locationKey}\n#${partIndex}`).slice(0,16) stays stable; overlap text must NOT change the partIndex sequence (overlap is additive content, not a new logical part).
- contentHash stays sha256Hex(normalizeText(part)). Overlap means a quote may appear in two chunks; that is fine, because validateQuote checks for a substring of the cited chunk.
- Gotcha: the Worker bundle is size/CPU constrained, so prefer a heuristic char estimate or a tiny pure-JS token estimator over js-tiktoken/gpt-tokenizer.
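The notes above could be sketched roughly as follows. This is a rough illustration under stated assumptions, not the intended implementation:

- estimateTokens, splitWithOverlap, trailingProse, and the Part shape are hypothetical names.
- The 4-characters-per-token ratio is an assumed heuristic.
- Cuts happen only on line boundaries, so sentence-level splitting and oversized single lines are left out.
- Overlap is taken only from trailing prose lines of the previous part, so it never reopens a fence or repeats half a table.

```typescript
// Illustrative sketch only: these names are hypothetical, not the real chunks.ts API.
const FENCE = "`".repeat(3); // built at runtime so this example nests safely in a fenced block

/** Assumed heuristic: roughly 4 characters per token. Not a real tokenizer. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface Part {
  partIndex: number; // logical part number; overlap never shifts it
  text: string; // optional overlap lines followed by the part body
}

const isFence = (line: string): boolean => line.trim().startsWith(FENCE);
const isTableRow = (line: string): boolean => line.trim().startsWith("|");

/** Trailing prose lines of `body` that fit in `maxChars`; stops at a fence marker or table row. */
function trailingProse(body: string, maxChars: number): string {
  const lines = body.split("\n");
  const kept: string[] = [];
  let used = 0;
  for (let i = lines.length - 1; i >= 0; i--) {
    const line = lines[i];
    if (isFence(line) || isTableRow(line) || used + line.length + 1 > maxChars) break;
    kept.unshift(line);
    used += line.length + 1;
  }
  return kept.join("\n");
}

function splitWithOverlap(input: string, maxTokens: number, overlapRatio = 0.125): Part[] {
  const lines = input.split("\n");
  const bodies: string[] = [];
  let current: string[] = [];
  let inFence = false;

  for (let i = 0; i < lines.length; i++) {
    const line = lines[i];
    current.push(line);
    if (isFence(line)) inFence = !inFence;

    // A cut after this line would land between two table rows.
    const midTable = isTableRow(line) && i + 1 < lines.length && isTableRow(lines[i + 1]);
    const overBudget = estimateTokens(current.join("\n")) >= maxTokens;

    // Defer the cut until we are outside a fence and past the table run.
    if (overBudget && !inFence && !midTable) {
      bodies.push(current.join("\n"));
      current = [];
    }
  }
  if (current.length > 0) bodies.push(current.join("\n"));

  // Overlap is prepended afterwards, so the partIndex sequence matches the bodies exactly.
  return bodies.map((body, partIndex) => {
    if (partIndex === 0) return { partIndex, text: body };
    const prev = bodies[partIndex - 1];
    const overlap = trailingProse(prev, Math.floor(prev.length * overlapRatio));
    return { partIndex, text: overlap ? overlap + "\n" + body : body };
  });
}
```

Because overlap is added after the bodies are fixed, partIndex depends only on where the cuts fall, which keeps a location-keyed chunkId stable. A part can exceed the budget when a fence or table forces a deferred cut. That trade-off is deliberate in this sketch.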
Sui Overflow angle
Clean, stable chunk boundaries make chunkGraphDigest() a trustworthy "what was extracted" anchor — the exact value the on-chain contextmem.extract attribution receipt keys on. Better chunks also directly improve the grounded RAG chat that headlines the live demo.
Dependencies
None (foundational; the hybrid-retrieval issue builds on these chunks).
Part of the ContextMEM roadmap (#4) • Sui Overflow build.