Token-aware chunking with overlap and structure-safe splits #12

Description

@harrymove-ctrl

Context

The deterministic chunker in packages/core/src/chunks.ts splits sections by paragraph and then hard-slices at MAX_CHUNK_CHARS = 4000 (line 4) in splitLong() (lines 126-146), cutting mid-sentence, mid-code-fence, and mid-table. It is the single chunking core consumed by both pipelines: the CLI/MCP path and the production Worker queue consumer (apps/api/src/worker.ts calls renderChunksNdjson(buildChunks(...)) at roughly lines 2186 and 2790). Poor chunk boundaries degrade both RAG retrieval and the verbatim grounding gate in facts.ts (validateQuote relies on normalizeText, which is exported from chunks.ts).

Goal / user story

As an agent builder recalling extracted context, I want chunks that are token-budgeted, overlapping, and never split inside a sentence/code block/table, so retrieved snippets are self-contained and citable — improving both the grounded chat answers and fact extraction precision.

Acceptance criteria

  • Chunk size is governed by an approximate token budget (~512-800 tokens) rather than a raw 4000-char cap (a cheap chars-per-token heuristic is acceptable; no heavy tokenizer dependency in the Worker bundle).
  • Consecutive parts within a section carry ~10-15% overlap so a fact spanning a boundary survives in at least one chunk.
  • Splits never occur inside a fenced code block (```), never break a markdown table (lines starting with |), and prefer sentence/paragraph boundaries over mid-word cuts.
  • chunkId remains location-based and deterministic across reruns (same routePath + headingPath + stable part index), so planMemoryWrite diffing and chunkGraphDigest() stay stable for the */30 freshness pipeline.
  • normalizeText stays the shared normalizer; facts.test.ts (validateQuote substring gate) and chunks.test.ts remain green, with new cases for code-fence/table/overlap.
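
To make the first and third criteria concrete, here is a minimal sketch of a chars-per-token estimate plus structure-safe packing. Every name in it (CHARS_PER_TOKEN, estimateTokens, toAtomicBlocks, packBlocks) is hypothetical and not part of the current chunks.ts, and the 4-chars-per-token ratio is an assumed rough average for English prose. Sentence-level splitting of a single oversized paragraph is deliberately left out.

```typescript
// Hypothetical helpers for illustration; not the existing chunks.ts API.
const CHARS_PER_TOKEN = 4; // assumption: rough average for English prose

function estimateTokens(text: string): number {
  return Math.ceil(text.length / CHARS_PER_TOKEN);
}

// Group a section into atomic blocks. A fenced code block or a run of
// table rows is kept whole; other text is split at blank lines.
function toAtomicBlocks(section: string): string[] {
  const blocks: string[] = [];
  let current: string[] = [];
  let inFence = false;
  const flush = (): void => {
    if (current.length > 0) blocks.push(current.join("\n"));
    current = [];
  };
  for (const line of section.split("\n")) {
    const trimmed = line.trimStart();
    if (trimmed.startsWith("```")) {
      if (!inFence) flush(); // an opening fence starts a new block
      current.push(line);
      inFence = !inFence;
      if (!inFence) flush(); // a closing fence ends the block
      continue;
    }
    if (inFence) {
      current.push(line); // blank lines and "|" inside a fence stay put
      continue;
    }
    if (line.trim() === "") {
      flush(); // paragraph boundary
      continue;
    }
    const isTableRow = trimmed.startsWith("|");
    const prevIsTable =
      current.length > 0 &&
      (current[current.length - 1] ?? "").trimStart().startsWith("|");
    if (current.length > 0 && isTableRow !== prevIsTable) flush();
    current.push(line);
  }
  flush(); // also emits an unclosed fence as one block
  return blocks;
}

// Greedily pack whole blocks into parts under the token budget.
// A single block larger than the budget is emitted on its own, unsplit.
function packBlocks(blocks: string[], maxTokens: number): string[] {
  const parts: string[] = [];
  let acc: string[] = [];
  let accTokens = 0;
  for (const block of blocks) {
    const tokens = estimateTokens(block);
    if (acc.length > 0 && accTokens + tokens > maxTokens) {
      parts.push(acc.join("\n\n"));
      acc = [];
      accTokens = 0;
    }
    acc.push(block);
    accTokens += tokens;
  }
  if (acc.length > 0) parts.push(acc.join("\n\n"));
  return parts;
}
```

Because a part boundary can only fall between blocks, a fence or table can end up over budget but never cut in half. Whether an oversized code block should be emitted whole or split at line boundaries is left as an open design choice.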

Implementation notes

  • Touch only packages/core/src/chunks.ts; extend packages/core/src/chunks.test.ts. Lifts both pipelines at once because both call buildChunks.
  • Replace splitLong() with token-aware accumulation that emits a part once the token budget is reached, then prepends the trailing ~10-15% of the previous part as overlap. Track code-fence open/close state and table-row runs while iterating lines so that a split is deferred until after those blocks.
  • Keep the partCounters location-key scheme so chunkId = sha256Hex(`${locationKey}\n#${partIndex}`).slice(0,16) stays stable; overlap text must NOT change the partIndex sequence (overlap is additive content, not a new logical part).
  • contentHash stays sha256Hex(normalizeText(part)). Overlap means a quote may appear in two chunks; that is fine, because validateQuote checks that the quote is a substring of the cited chunk.
  • Gotcha: the Worker bundle is size/CPU constrained — prefer a heuristic char estimate or a tiny pure-JS token estimator over js-tiktoken/gpt-tokenizer.
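
The overlap and chunkId notes above could be sketched as follows. This is an illustration under stated assumptions, not the actual implementation: overlapTail, withOverlap, and the Chunk shape are hypothetical, the example location key format is made up, and sha256Hex here is a local stand-in built on node:crypto (the real helper in the codebase may differ, and a Worker build would more likely use Web Crypto). The overlap is always taken from the previous part's original text, so it never compounds, and partIndex comes straight from the array position.

```typescript
import { createHash } from "node:crypto";

// Local stand-in for the project's sha256Hex helper (assumption).
function sha256Hex(input: string): string {
  return createHash("sha256").update(input, "utf8").digest("hex");
}

// Hypothetical chunk shape for this sketch only.
interface Chunk {
  chunkId: string;
  partIndex: number;
  text: string;
}

// Return roughly the trailing `ratio` of `prev`, widened backwards to the
// nearest word start so the overlap never begins mid-word.
function overlapTail(prev: string, ratio: number): string {
  const target = Math.floor(prev.length * ratio);
  if (target <= 0) return "";
  let start = prev.length - target;
  while (start > 0 && !/\s/.test(prev.charAt(start - 1))) start--;
  return prev.slice(start).trimStart();
}

// Prepend overlap to every part after the first. The chunkId depends only
// on locationKey and partIndex, so overlap text cannot change it.
function withOverlap(locationKey: string, parts: string[], ratio = 0.12): Chunk[] {
  return parts.map((part, partIndex) => {
    const tail =
      partIndex > 0 ? overlapTail(parts[partIndex - 1] ?? "", ratio) : "";
    const text = tail ? `${tail}\n\n${part}` : part;
    const chunkId = sha256Hex(`${locationKey}\n#${partIndex}`).slice(0, 16);
    return { chunkId, partIndex, text };
  });
}
```

One limitation of this sketch: the tail is cut by character count, so it can start inside a fenced code block or table at the end of the previous part. A fuller version would likely snap the overlap to a sentence or block boundary.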

Sui Overflow angle

Clean, stable chunk boundaries make chunkGraphDigest() a trustworthy "what was extracted" anchor — the exact value the on-chain contextmem.extract attribution receipt keys on. Better chunks also directly improve the grounded RAG chat that headlines the live demo.

Dependencies

None (foundational; the hybrid-retrieval issue builds on these chunks).

Part of the ContextMEM roadmap (#4) • Sui Overflow build.

Metadata

    Labels

    P0 (Demo-blocking: required for a working Sui Overflow demo), crawling (Web/Walrus crawling and context-extraction quality), feature (User- or agent-facing capability)