feat: structure-preserving chunking for non-markdown documents (PDF, DOCX, HTML) #484

## Context

The existing `chunkContent` function (`src/core/indexing.ts`) is already structure-aware for markdown:
- Splits at heading boundaries (`#`, `##`, `###`)
- Prepends breadcrumb context (`Context: Parent > Child`) so embeddings capture document location
- Falls back to paragraph splitting for oversized sections
- Applies configurable inter-chunk overlap (default 10%, prefers line boundaries)
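
As a rough illustration of the first two behaviours listed above (heading splits and breadcrumb prefixes), here is a minimal sketch. It is not the actual `chunkContent` implementation: the names `Chunk` and `splitByHeadings` are hypothetical, and the paragraph fallback and overlap are left out for brevity.

```typescript
// Hypothetical sketch; NOT the real chunkContent from src/core/indexing.ts.
// Omits the oversized-section fallback and inter-chunk overlap.
interface Chunk {
  breadcrumb: string; // e.g. "Parent > Child"
  text: string; // breadcrumb prefix plus section body
}

function splitByHeadings(markdown: string): Chunk[] {
  const chunks: Chunk[] = [];
  const path: string[] = []; // heading titles by level; index 0 holds the "#" title
  let buffer: string[] = [];

  // Emit the buffered section body as one chunk, prefixed with its breadcrumb.
  const flush = (): void => {
    const body = buffer.join("\n").trim();
    buffer = [];
    if (body.length === 0) return;
    const breadcrumb = path.filter(Boolean).join(" > ");
    const prefix = breadcrumb ? `Context: ${breadcrumb}\n\n` : "";
    chunks.push({ breadcrumb, text: prefix + body });
  };

  for (const line of markdown.split("\n")) {
    const match = /^(#{1,3})\s+(.*)$/.exec(line);
    if (match) {
      flush(); // close the section that precedes this heading
      const level = match[1].length;
      path.length = level - 1; // drop sibling and deeper headings
      path[level - 1] = match[2].trim();
    } else {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

With input that has no `#` lines at all, this sketch returns a single chunk with an empty breadcrumb, which is the situation the rest of this issue is about.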

This is meaningfully better than naive fixed-size chunking. The original issue overstated the gap.

## Actual remaining gap

**Non-markdown documents lose their structure before chunking.**

When libscope parses a PDF, DOCX, or HTML file, the parsers (`src/core/parsers/`) convert to plain text or markdown. The quality of that conversion determines whether `chunkContent` can do its heading-aware splitting. Currently:

- **HTML** (`parsers/html.ts`): strips tags; headings may or may not survive as `#` markdown headings
- **DOCX** (`parsers/docx.ts`): mammoth converts to markdown — headings should be preserved, but nesting and tables may be lossy
- **PDF** (`parsers/pdf.ts`): pdf-parse produces raw text — heading structure is completely lost, paragraph breaks may be inconsistent
- **EPUB/PPTX** (`parsers/epub.ts`, `parsers/pptx.ts`): structure preservation varies

The result is that for PDF in particular, `chunkContent` receives a flat wall of text and falls back to character-based paragraph splitting with no structural awareness.
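
One quick way to see whether a parser's output still carries usable structure is to count the markdown headings in it, per level. The helper below is a hypothetical sketch (the name `headingLevelCounts` is an assumption, not an existing function in the repo); it only recognises ATX-style `#` headings and skips fenced code blocks.

```typescript
// Hypothetical audit helper: counts ATX headings per level in parser output.
// An empty result for a document known to have sections suggests the parser
// flattened the structure before chunking.
function headingLevelCounts(markdown: string): Record<number, number> {
  const counts: Record<number, number> = {};
  let inFence = false;
  for (const line of markdown.split("\n")) {
    // Toggle on fence delimiters so "# comment" lines inside code are ignored.
    if (/^\s*(`{3,}|~{3,})/.test(line)) {
      inFence = !inFence;
      continue;
    }
    if (inFence) continue;
    const match = /^(#{1,6})\s+\S/.exec(line);
    if (match) {
      const level = match[1].length;
      counts[level] = (counts[level] ?? 0) + 1;
    }
  }
  return counts;
}
```

Running this over the output of each parser for a fixture with known headings would give a per-format picture of which levels survive.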

## Proposed work

1. **Audit each parser's output** — check whether heading structure survives as markdown headings that `chunkContent` can use
2. **PDF: improve structure extraction** — `pdf-parse` has limited structure support; consider `pdf2json` or heuristic detection of heading lines (ALL CAPS, short lines, larger font size metadata) to inject `##` markers
3. **HTML: ensure headings survive as `#` markdown** — verify `h1`–`h4` tags become `#`–`####` in the parser output
4. **DOCX: verify mammoth heading mapping** — ensure `Heading 1`–`Heading 3` styles map to `#`–`###`
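
For item 2, a text-only heuristic could look roughly like the sketch below. It covers just the ALL CAPS and short-line signals; font-size metadata is not used because raw text does not carry it. The function names, the 80-character limit, and the ASCII-only letter check are all assumptions that would need tuning against real PDFs.

```typescript
// Hypothetical heuristic for raw PDF text; thresholds are assumptions.
const MAX_HEADING_LENGTH = 80;

function looksLikeHeading(line: string): boolean {
  const text = line.trim();
  if (text.length === 0 || text.length > MAX_HEADING_LENGTH) return false;
  // Lines ending in sentence punctuation are more likely body text.
  if (/[.,;:!?]$/.test(text)) return false;
  // ASCII-only letter check (assumption); skips page numbers and rules.
  const letters = text.replace(/[^A-Za-z]/g, "");
  if (letters.length < 3) return false;
  return letters === letters.toUpperCase();
}

// Prefix detected heading lines with "##" so a heading-aware chunker can split on them.
function injectHeadings(rawText: string): string {
  return rawText
    .split("\n")
    .map((line) => (looksLikeHeading(line) ? `## ${line.trim()}` : line))
    .join("\n");
}
```

A heuristic like this will produce false positives (for example all-caps table labels), so it is probably best validated against the test fixtures before being enabled by default.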

## What this is NOT

- This is not a request for embedding-based semantic chunking (high complexity, marginal gain over the current heading-aware approach)
- The overlap and paragraph fallback already handle the context-boundary problem for well-structured documents

## Acceptance criteria

- [ ] Audit of each parser's markdown output — document which heading levels survive for each format
- [ ] PDF: heuristic heading detection or improved extractor that injects `##` markers for section breaks
- [ ] HTML: confirm `h1`–`h4` → `#`–`####` in parser output
- [ ] DOCX: confirm mammoth maps `Heading 1`–`Heading 3` styles to `#`–`###`
- [ ] Test fixtures: one PDF, one DOCX, one HTML file with known heading structure; assert chunk count and breadcrumb content after indexing
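
For the HTML criterion, the mapping being checked is simple enough to sketch. The function below is a hypothetical regex-based pre-pass, not the code in `parsers/html.ts`; a real implementation would more likely use an HTML parser, and the name `htmlHeadingsToMarkdown` is an assumption.

```typescript
// Hypothetical pre-pass: rewrites <h1>..<h4> into "#".."####" markdown headings
// before other tags are stripped. Regex-based, so nested or malformed markup
// may not be handled correctly.
function htmlHeadingsToMarkdown(html: string): string {
  return html.replace(
    /<h([1-4])\b[^>]*>([\s\S]*?)<\/h\1\s*>/gi,
    (_match: string, level: string, inner: string): string => {
      // Drop inline tags inside the heading and collapse whitespace.
      const title = inner.replace(/<[^>]+>/g, "").replace(/\s+/g, " ").trim();
      return `\n\n${"#".repeat(Number(level))} ${title}\n\n`;
    },
  );
}
```

A fixture test could feed a small HTML file through the parser and assert that the expected `#` lines appear in the output before checking chunk counts and breadcrumbs.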
