This repository was archived by the owner on Apr 23, 2026. It is now read-only.

Preprocessor for scraped LLVM Phabricator markdown → documents for Pinecone #86

@jonathanMLDev

Summary

Preprocess already-scraped LLVM Phabricator data, stored as markdown (MD), into document objects (content + metadata) ready for Pinecone upsert. This component does no scraping and makes no Phabricator API calls.

Scope

  • Input: Scraped Phabricator MD (existing scrape structure).
  • Output: Documents with content (e.g. title + description) and metadata (e.g. object_id, type, author, status, created_at, project, url).
  • In scope: Parse/validate MD (files, front matter if any); normalize text; define document schema; one doc per revision/task or chunk as agreed.
  • Out of scope: Fetching from Phabricator; calling Pinecone.
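
A minimal sketch of the parse/normalize step described above, assuming each scraped file carries an optional `---`-delimited front matter block of simple `key: value` lines. The actual scrape structure is not specified here, so the front matter format, the `title` field, and the function names `split_front_matter`, `normalize_text`, and `to_document` are all hypothetical.

```python
import re
from typing import Any

# Metadata keys taken from the examples listed in the scope (assumed, not final).
METADATA_KEYS = ("object_id", "type", "author", "status", "created_at", "project", "url")


def split_front_matter(text: str) -> tuple[dict[str, str], str]:
    """Split an optional '---'-delimited front matter block from the body.

    Assumes simple 'key: value' lines; a real scrape may need a YAML parser.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}, text
    fields: dict[str, str] = {}
    for i, line in enumerate(lines[1:], start=1):
        if line.strip() == "---":
            return fields, "\n".join(lines[i + 1:])
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields[key.strip()] = value.strip()
    # No closing delimiter found: treat the whole text as body.
    return {}, text


def normalize_text(text: str) -> str:
    """Collapse runs of spaces/tabs and excess blank lines, then strip the ends."""
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n\s*\n+", "\n\n", text)
    return text.strip()


def to_document(md_text: str) -> dict[str, Any]:
    """Build one {content, metadata} document from one markdown file's text."""
    fields, body = split_front_matter(md_text)
    title = fields.get("title", "")
    content = normalize_text(f"{title}\n\n{body}" if title else body)
    metadata = {k: fields[k] for k in METADATA_KEYS if k in fields}
    return {"content": content, "metadata": metadata}
```

Only keys present in the front matter are copied into the metadata, so a file without front matter still yields a document with its body as content and empty metadata.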

Result

Library or CLI: MD input (file or directory) → list of { content, metadata }. Config for field mapping and truncation. Deliverables: code, tests, and a README documenting the document schema.
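
One possible shape for the library entry point, sketched under assumptions: `PreprocessConfig`, its two fields, `apply_config`, `iter_markdown_files`, and `preprocess` are hypothetical names, and the per-file parser is passed in as a callable because the agreed MD format is not fixed here.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Callable, Optional


@dataclass
class PreprocessConfig:
    # Hypothetical: maps parsed field names to output metadata names.
    field_mapping: dict[str, str] = field(default_factory=dict)
    # Hypothetical: maximum content length in characters; None disables truncation.
    max_content_chars: Optional[int] = None


def apply_config(doc: dict[str, Any], config: PreprocessConfig) -> dict[str, Any]:
    """Rename metadata keys and truncate content according to the config."""
    content = doc["content"]
    if config.max_content_chars is not None:
        content = content[: config.max_content_chars]
    metadata = {config.field_mapping.get(k, k): v for k, v in doc["metadata"].items()}
    return {"content": content, "metadata": metadata}


def iter_markdown_files(path: Path) -> list[Path]:
    """Return a single file, or all *.md files under a directory in sorted order."""
    if path.is_file():
        return [path]
    return sorted(path.rglob("*.md"))


def preprocess(
    path: Path,
    parse: Callable[[str], dict[str, Any]],
    config: PreprocessConfig,
) -> list[dict[str, Any]]:
    """MD input (file or directory) -> list of {content, metadata} documents."""
    return [
        apply_config(parse(p.read_text(encoding="utf-8")), config)
        for p in iter_markdown_files(path)
    ]
```

Sorting the file list keeps the output order the same across runs, which helps when comparing results between runs or in tests.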

Acceptance criteria

  • Accepts scraped Phabricator MD in agreed format; outputs stable content + metadata.
  • Metadata includes the object ID, type, and filterable fields. The schema is documented.
  • No Phabricator API or scraping in this component.
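
The metadata criterion could be checked with a small validator like the sketch below. It assumes `object_id` and `type` are the required keys, and it assumes that filterable values should be strings, numbers, booleans, or lists of strings, which is how Pinecone's documentation describes supported metadata value types; that should be confirmed against the current docs. `REQUIRED_KEYS` and `validate_metadata` are hypothetical names.

```python
from typing import Any

# Assumed required keys, based on "object id, type" in the acceptance criteria.
REQUIRED_KEYS = ("object_id", "type")


def validate_metadata(metadata: dict[str, Any]) -> list[str]:
    """Return a list of human-readable problems; an empty list means valid."""
    errors: list[str] = []
    for key in REQUIRED_KEYS:
        if not metadata.get(key):
            errors.append(f"missing required field: {key}")
    for key, value in metadata.items():
        # Scalars assumed to be filterable: str, int, float, bool.
        if isinstance(value, (str, int, float, bool)):
            continue
        # Lists are accepted only when every element is a string.
        if isinstance(value, list) and all(isinstance(v, str) for v in value):
            continue
        errors.append(f"unsupported value type for {key}: {type(value).__name__}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a batch run report every invalid document in one pass.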
