Context
The existing `chunkContent` function (`src/core/indexing.ts`) is already structure-aware for markdown:
- Splits at heading boundaries (`#`, `##`, `###`)
- Prepends breadcrumb context (`Context: Parent > Child`) so embeddings capture document location
- Falls back to paragraph splitting for oversized sections
- Applies configurable inter-chunk overlap (default 10%, prefers line boundaries)
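The heading-aware behavior described above can be pictured with a short sketch. This is an illustration of the idea only, not the actual `chunkContent` implementation; the `Chunk` shape, the `chunkMarkdown` name, and the breadcrumb format are assumptions based on the description (overlap and the paragraph fallback are omitted for brevity).

```typescript
// Illustrative sketch of heading-aware chunking with breadcrumb context.
// NOT the real chunkContent -- shape and breadcrumb format are assumptions.

interface Chunk {
  breadcrumb: string; // e.g. "Context: Parent > Child"
  text: string;
}

function chunkMarkdown(markdown: string): Chunk[] {
  const lines = markdown.split("\n");
  const stack: string[] = [];   // heading titles; index = depth - 1
  const chunks: Chunk[] = [];
  let current: string[] = [];

  const flush = (): void => {
    const text = current.join("\n").trim();
    if (text) {
      chunks.push({ breadcrumb: `Context: ${stack.join(" > ")}`, text });
    }
    current = [];
  };

  for (const line of lines) {
    const m = /^(#{1,3})\s+(.*)$/.exec(line);
    if (m) {
      flush();                          // close the previous section
      stack.length = m[1].length - 1;   // pop headings deeper than this one
      stack.push(m[2].trim());
    } else {
      current.push(line);
    }
  }
  flush();
  return chunks;
}
```

Each emitted chunk carries its full heading path, so the embedding sees "where" the text lives even when the chunk body alone is ambiguous.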
This is meaningfully better than naive fixed-size chunking. The original issue overstated the gap.
Actual remaining gap
Non-markdown documents lose their structure before chunking.
When libscope parses a PDF, DOCX, or HTML file, the parsers (`src/core/parsers/`) convert it to plain text or markdown. The quality of that conversion determines whether `chunkContent` can do its heading-aware splitting. Currently:
- HTML (`parsers/html.ts`): strips tags; headings may or may not survive as `#` markdown
- DOCX (`parsers/docx.ts`): mammoth converts to markdown, so headings should be preserved, but nesting and tables may be lossy
- PDF (`parsers/pdf.ts`): pdf-parse produces raw text; heading structure is completely lost and paragraph breaks may be inconsistent
- EPUB/PPTX (`parsers/epub.ts`, `parsers/pptx.ts`): varies
The result is that for PDF in particular, `chunkContent` receives a flat wall of text and falls back to character-based paragraph splitting with no structural awareness.
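For intuition, a character-budgeted paragraph fallback for such flat text looks roughly like this. This is a hedged sketch of the general technique; the function name and the 1200-character budget are illustrative, not taken from the codebase.

```typescript
// Sketch: character-budgeted paragraph splitting, the kind of fallback
// described above for structureless text. Names/thresholds are illustrative.
function splitParagraphs(text: string, maxChars = 1200): string[] {
  // Paragraphs = runs of text separated by one or more blank lines.
  const paras = text.split(/\n{2,}/).map(p => p.trim()).filter(Boolean);
  const chunks: string[] = [];
  let buf = "";
  for (const p of paras) {
    // Start a new chunk when adding this paragraph would exceed the budget.
    if (buf && buf.length + p.length + 2 > maxChars) {
      chunks.push(buf);
      buf = p;
    } else {
      buf = buf ? `${buf}\n\n${p}` : p;
    }
  }
  if (buf) chunks.push(buf);
  return chunks;
}
```

Note what is lost here: chunk boundaries fall wherever the character budget happens to run out, with no notion of sections, so a chunk can straddle two unrelated topics.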
Proposed work
- Audit each parser's output: check whether heading structure survives as markdown headings that `chunkContent` can use
- PDF: improve structure extraction. pdf-parse has limited structure support; consider pdf2json or heuristic detection of heading lines (ALL CAPS, short lines, larger font-size metadata) to inject `##` markers
- HTML: ensure headings survive as `#` markdown; verify `h1`–`h4` tags become `#`–`####` in the parser output
- DOCX: verify mammoth heading mapping; ensure Heading 1–Heading 3 styles map to `#`–`###`
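The PDF heuristic proposed above could start as simply as the sketch below. All thresholds and function names here are illustrative assumptions, and font-size metadata is ignored because pdf-parse's plain-text output does not carry it; a pdf2json-based version could add a font-size signal.

```typescript
// Sketch: inject "## " markers before lines that look like headings in
// flat PDF text. Heuristics (illustrative thresholds): short ALL-CAPS
// lines, or short capitalized lines followed by a blank line.

function looksLikeHeading(line: string, next: string | undefined): boolean {
  const t = line.trim();
  if (t.length === 0 || t.length > 60) return false;
  if (/[.,;:]$/.test(t)) return false; // headings rarely end in punctuation
  const isAllCaps = /[A-Z]/.test(t) && t === t.toUpperCase();
  const isShortTitle =
    t.length < 40 && /^[A-Z]/.test(t) && (next ?? "").trim() === "";
  return isAllCaps || isShortTitle;
}

function injectHeadingMarkers(text: string): string {
  const lines = text.split("\n");
  return lines
    .map((line, i) =>
      looksLikeHeading(line, lines[i + 1]) ? `## ${line.trim()}` : line
    )
    .join("\n");
}
```

Even a crude version of this gives `chunkContent` real section boundaries to split on, which is the whole point: the chunker is already good, it just needs headings to exist.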
What this is NOT
- This is not a request for embedding-based semantic chunking (high complexity, marginal gain over the current heading-aware approach)
- The overlap and paragraph fallback already handle the context-boundary problem for well-structured documents
Acceptance criteria
- PDF: parser injects heuristic `##` markers for section breaks
- HTML: `h1`–`h4` → `#`–`####` in parser output