Preprocessor for scraped LLVM Phabricator markdown → documents for Pinecone #86

## Summary
Preprocess **already scraped** LLVM Phabricator data, stored as **markdown (MD)**, into document objects (content + metadata) ready for Pinecone upsert. This component does no scraping and makes no Phabricator API calls.

## Scope
- **Input**: Scraped Phabricator **MD** (existing scrape structure).
- **Output**: Documents with `content` (e.g. title + description) and `metadata` (e.g. `object_id`, `type`, `author`, `status`, `created_at`, `project`, `url`).
- **In scope**: Parse and validate the MD (files, front matter if any); normalize text; define the document schema; emit one document per revision/task, or per chunk, as agreed.
- **Out of scope**: Fetching from Phabricator; calling Pinecone.
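The output shape described above could be sketched as follows. This is only an illustration: the `Document` class, the `make_document` helper, the field types, and the sample values are all assumptions, not an agreed schema. Dropping `None` values is also an assumption, made because Pinecone metadata is generally expected to omit missing keys rather than store nulls.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass(frozen=True)
class Document:
    """Hypothetical document object: text to embed plus filterable metadata."""
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def make_document(title: str, description: str, **metadata: Any) -> Document:
    """Build a Document from a title, a description, and metadata fields."""
    # content = title + description, as suggested in the Scope section
    content = f"{title}\n\n{description}".strip()
    # Assumption: omit missing values instead of storing None
    clean = {key: value for key, value in metadata.items() if value is not None}
    return Document(content=content, metadata=clean)


# Made-up sample values, for illustration only
doc = make_document(
    "Example title",
    "Example description",
    object_id="D1",
    type="revision",
    author="example-author",
    status="closed",
    created_at="2020-01-01T00:00:00Z",
    project="llvm",
    url=None,
)
```

Here `doc.metadata` contains every field except `url`, which was `None` and is therefore left out.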

## Result
A library or CLI that takes MD input (a file or directory) and returns a list of `{ content, metadata }` objects, with configuration for field mapping and truncation. Deliverables: code, tests, and a README documenting the document schema.
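A minimal sketch of the MD → `{ content, metadata }` step, using only the standard library. It assumes the scraped files carry simple `key: value` front matter between `---` lines and a `title` key; the actual scrape format is not specified here, so the function names (`split_front_matter`, `normalize`, `preprocess`) and the `max_chars` truncation setting are hypothetical.

```python
import re
from typing import Dict, Tuple

# Assumption: front matter is a block of "key: value" lines between "---" lines
FRONT_MATTER = re.compile(r"\A---\s*\n(.*?)\n---\s*\n?", re.DOTALL)


def split_front_matter(text: str) -> Tuple[Dict[str, str], str]:
    """Return (metadata, body); metadata is empty if there is no front matter."""
    match = FRONT_MATTER.match(text)
    if not match:
        return {}, text
    meta: Dict[str, str] = {}
    for line in match.group(1).splitlines():
        # Split on the first colon only, so values such as URLs stay intact
        key, sep, value = line.partition(":")
        if sep and key.strip():
            meta[key.strip()] = value.strip()
    return meta, text[match.end():]


def normalize(text: str) -> str:
    """Collapse every run of whitespace into a single space."""
    return re.sub(r"\s+", " ", text).strip()


def preprocess(text: str, max_chars: int = 2000) -> Dict[str, object]:
    """Turn one scraped MD string into a {content, metadata} dict."""
    meta, body = split_front_matter(text)
    title = meta.pop("title", "")
    # Truncate after normalizing so the limit applies to the final text
    content = normalize(f"{title} {body}")[:max_chars]
    return {"content": content, "metadata": meta}


sample = "---\ntitle: Sample revision\nobject_id: D1\ntype: revision\n---\nSome   body\ntext.\n"
result = preprocess(sample)
```

Because the function is pure (same input string, same output dict), repeated runs over the same scrape should yield stable documents. A directory walker or CLI wrapper would call `preprocess` once per file.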

## Acceptance criteria
- [ ] Accepts scraped Phabricator MD in agreed format; outputs stable `content` + `metadata`.
- [ ] Metadata includes the object ID, type, and filterable fields. The schema is documented.
- [ ] No Phabricator API or scraping in this component.
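The first two criteria could be checked in tests with a small validator along these lines. It is a sketch under assumptions: the required keys (`object_id`, `type`) are taken from the metadata examples in this issue, and `validate_document` is a hypothetical helper name.

```python
from typing import Any, Dict, List

# Assumption: these two keys are mandatory; other fields are optional filters
REQUIRED_METADATA = ("object_id", "type")


def validate_document(doc: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a document; empty means valid."""
    errors: List[str] = []
    content = doc.get("content")
    if not isinstance(content, str) or not content.strip():
        errors.append("content must be a non-empty string")
    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        errors.append("metadata must be a dict")
        return errors
    for key in REQUIRED_METADATA:
        # Missing and empty values are both treated as errors
        if not metadata.get(key):
            errors.append(f"metadata.{key} is required")
    return errors


good = {"content": "Example title", "metadata": {"object_id": "D1", "type": "revision"}}
bad = {"content": "  ", "metadata": {"object_id": "D1"}}
```

Returning a list of messages rather than raising on the first failure lets a batch run report every malformed document in one pass.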
