
feat: structure-preserving chunking for non-markdown documents (PDF, DOCX, HTML) #484

@RobertLD


Context

The existing chunkContent function (src/core/indexing.ts) is already structure-aware for markdown:

  • Splits at heading boundaries (#, ##, ###)
  • Prepends breadcrumb context (Context: Parent > Child) so embeddings capture document location
  • Falls back to paragraph splitting for oversized sections
  • Applies configurable inter-chunk overlap (default 10%, prefers line boundaries)

This is meaningfully better than naive fixed-size chunking. The original issue overstated the gap.
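For readers unfamiliar with the approach, the heading-aware splitting and breadcrumb prefix described above can be sketched roughly as follows. This is a simplified illustration, not the actual chunkContent code: the `Chunk` type and `splitByHeadings` name are hypothetical, and the paragraph fallback and overlap steps are left out.

```typescript
// Hypothetical sketch; NOT the real chunkContent from src/core/indexing.ts.
interface Chunk {
  breadcrumb: string; // e.g. "Guide > Install"
  text: string; // section body, prefixed with "Context: ..." when available
}

function splitByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = []; // current heading titles, indexed by level - 1
  let body: string[] = [];

  // Emit the accumulated body as one chunk under the current heading path.
  const flush = (): void => {
    const text = body.join("\n").trim();
    if (text.length > 0) {
      const breadcrumb = path.filter(Boolean).join(" > ");
      const prefix = breadcrumb ? `Context: ${breadcrumb}\n\n` : "";
      chunks.push({ breadcrumb, text: prefix + text });
    }
    body = [];
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,3})\s+(.+)$/.exec(line);
    if (match) {
      flush(); // a heading closes the previous section
      const level = match[1].length;
      path.length = level - 1; // drop sibling and deeper headings
      path[level - 1] = match[2].trim();
    } else {
      body.push(line);
    }
  }
  flush();
  return chunks;
}
```

The point of the sketch is that everything depends on `#` markers being present in the input: without them the whole document becomes a single section with an empty breadcrumb.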

Actual remaining gap

Non-markdown documents lose their structure before chunking.

When libscope parses a PDF, DOCX, or HTML file, the parsers (src/core/parsers/) convert to plain text or markdown. The quality of that conversion determines whether chunkContent can do its heading-aware splitting. Currently:

  • HTML (parsers/html.ts): strips tags, headings may or may not survive as # markdown
  • DOCX (parsers/docx.ts): mammoth converts to markdown — headings should be preserved, but nesting and tables may be lossy
  • PDF (parsers/pdf.ts): pdf-parse produces raw text — heading structure is completely lost, paragraph breaks may be inconsistent
  • EPUB/PPTX (parsers/epub.ts, parsers/pptx.ts): varies

The result is that for PDF in particular, chunkContent receives a flat wall of text and falls back to character-based paragraph splitting with no structural awareness.

Proposed work

  1. Audit each parser's output — check whether heading structure survives as markdown headings that chunkContent can use
  2. PDF: improve structure extraction. pdf-parse has limited structure support; consider pdf2json or heuristic detection of heading lines (ALL CAPS, short lines, larger font size metadata) to inject ## markers
  3. HTML: ensure headings survive as # markdown. Verify that h1–h4 tags become #–#### in the parser output
  4. DOCX: verify mammoth heading mapping. Ensure Heading 1–Heading 3 styles map to #–###
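As a starting point for the PDF item, a text-only heuristic might look like the sketch below. It assumes the parser yields one string with newline-separated lines, and it uses only the ALL CAPS and short-line signals, since font size metadata is not available in raw text. The function names and the 60-character threshold are hypothetical and untuned.

```typescript
// Hypothetical helpers; thresholds are illustrative guesses, not tuned values.
function looksLikeHeading(line: string, maxLength = 60): boolean {
  const text = line.trim();
  if (text.length === 0 || text.length > maxLength) return false;
  if (/[.,;:]$/.test(text)) return false; // trailing punctuation suggests a sentence
  const letters = text.replace(/[^A-Za-z]/g, "");
  if (letters.length < 3) return false; // skip page numbers and stray symbols
  return letters === letters.toUpperCase(); // ALL CAPS signal
}

// Prefix every heading-like line with "## " so a heading-aware chunker can split on it.
function injectHeadingMarkers(rawText: string): string {
  return rawText
    .split("\n")
    .map((line) => (looksLikeHeading(line) ? `## ${line.trim()}` : line))
    .join("\n");
}
```

A heuristic like this will produce false positives (acronym-only lines, table headers), so it would need to be validated against the test fixtures before being enabled by default.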

What this is NOT

  • This is not a request for embedding-based semantic chunking (high complexity, marginal gain over the current heading-aware approach)
  • The overlap and paragraph fallback already handle the context-boundary problem for well-structured documents

Acceptance criteria

  • Audit of each parser's markdown output — document which heading levels survive for each format
  • PDF: heuristic heading detection or improved extractor that injects ## markers for section breaks
  • HTML: confirm h1–h4 → #–#### in parser output
  • DOCX: confirm mammoth heading styles map correctly
  • Test fixtures: one PDF, one DOCX, one HTML file with known heading structure; assert chunk count and breadcrumb content after indexing
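For the audit and fixture assertions, a small helper that counts which markdown heading levels appear in a parser's output could be enough. The sketch below is one possible shape; `auditHeadingLevels` is a hypothetical name, and it does not skip `#` lines inside fenced code blocks.

```typescript
// Hypothetical audit helper: counts markdown headings per level in parser output.
// Limitation: lines starting with "#" inside fenced code blocks are counted too.
function auditHeadingLevels(markdown: string): Record<number, number> {
  const counts: Record<number, number> = {};
  for (const line of markdown.split("\n")) {
    // Heading syntax: 1-6 hashes, whitespace, then at least one character.
    const match = /^(#{1,6})\s+\S/.exec(line);
    if (match) {
      const level = match[1].length;
      counts[level] = (counts[level] ?? 0) + 1;
    }
  }
  return counts;
}
```

Running this over the output of each parser for a fixture with a known outline gives a per-format table of surviving heading levels, which is the documentation the first criterion asks for.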

Labels: enhancement