This repository was archived by the owner on Apr 23, 2026. It is now read-only.

Preprocessor for scraped LLVM Bugzilla JSON → documents for Pinecone #85

@jonathanMLDev


Summary

Preprocess already-scraped LLVM Bugzilla JSON into document objects (content + metadata) for Pinecone upsert. This component does no scraping and makes no Bugzilla API calls.

Scope

  • Input: Scraped Bugzilla JSON (existing scrape schema).
  • Output: Documents with content (e.g. summary + description, optional comments) and metadata (e.g. bug_id, product, component, status, priority, reporter, created_at, url).
  • In scope: Parse/validate JSON; normalize text (strip HTML); define document schema; one doc per bug or per comment as agreed.
  • Out of scope: Fetching from Bugzilla; calling Pinecone.
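
To make the scope concrete, here is a minimal sketch of the per-bug transformation, assuming one document per bug. The input keys (id, summary, description, comments[].text, creation_time) are guesses, since the scrape schema is not shown here, and the fallback URL pattern is likewise an assumption. Function names are hypothetical.

```python
import re
from html.parser import HTMLParser
from typing import Any


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags; entities are decoded by the parser."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.parts: list[str] = []

    def handle_data(self, data: str) -> None:
        self.parts.append(data)


def strip_html(text: str) -> str:
    """Remove HTML tags, decode entities, and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(text)
    parser.close()  # flush any buffered trailing text
    return re.sub(r"\s+", " ", "".join(parser.parts)).strip()


# Assumed minimum for a usable record; adjust to the agreed scrape schema.
REQUIRED_FIELDS = ("id", "summary")


def bug_to_document(bug: dict[str, Any], include_comments: bool = False) -> dict[str, Any]:
    """Turn one scraped bug record into a {content, metadata} document."""
    missing = [name for name in REQUIRED_FIELDS if name not in bug]
    if missing:
        raise ValueError(f"bug record is missing required fields: {missing}")

    parts = [strip_html(str(bug["summary"])), strip_html(str(bug.get("description", "")))]
    if include_comments:
        parts.extend(strip_html(str(c.get("text", ""))) for c in bug.get("comments", []))
    content = "\n\n".join(p for p in parts if p)

    bug_id = int(bug["id"])
    metadata = {
        "bug_id": bug_id,
        "product": bug.get("product", ""),
        "component": bug.get("component", ""),
        "status": bug.get("status", ""),
        "priority": bug.get("priority", ""),
        "reporter": bug.get("reporter", ""),
        "created_at": bug.get("creation_time", ""),  # assumed input key
        # Assumed URL pattern, used only when the scrape did not store one.
        "url": bug.get("url") or f"https://bugs.llvm.org/show_bug.cgi?id={bug_id}",
    }
    return {"content": content, "metadata": metadata}
```

One caveat with a parser-based stripper: if the scrape stored C++ text such as `vector<int>` unescaped, the `<int>` part is treated as a tag and dropped. Whether that matters depends on how the scraper encoded bug text.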

Result

Library or CLI that takes JSON input (file or stream) and returns a list of { content, metadata } documents, with config for field mapping and truncation. Deliverables: code, tests, and a README documenting the document schema.
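
The library entry point and its config could look roughly like the sketch below. It assumes the input is a single top-level JSON array of bug records. `PreprocessConfig`, `preprocess`, the input key names, and the default truncation limit are all hypothetical placeholders, not an agreed design.

```python
import io
import json
from dataclasses import dataclass, field
from typing import IO, Any, Optional


@dataclass(frozen=True)
class PreprocessConfig:
    # Maps output metadata keys to input JSON keys (input keys are assumptions).
    field_map: dict[str, str] = field(default_factory=lambda: {
        "bug_id": "id",
        "product": "product",
        "component": "component",
        "status": "status",
    })
    # Input fields concatenated, in order, to form the document content.
    content_fields: tuple[str, ...] = ("summary", "description")
    # Content is cut to this many characters (placeholder value).
    max_content_chars: int = 8000


def preprocess(stream: IO[str], config: Optional[PreprocessConfig] = None) -> list[dict[str, Any]]:
    """Read a JSON array of bug records and return {content, metadata} documents."""
    config = config or PreprocessConfig()
    bugs = json.load(stream)
    if not isinstance(bugs, list):
        raise ValueError("expected a top-level JSON array of bug records")

    docs = []
    for bug in bugs:
        text = "\n\n".join(str(bug[key]) for key in config.content_fields if bug.get(key))
        # Only map fields that are present, so absent inputs never become nulls.
        metadata = {out: bug[src] for out, src in config.field_map.items() if src in bug}
        docs.append({"content": text[: config.max_content_chars], "metadata": metadata})
    return docs


# Usage: any text stream works, for example an open file or an in-memory buffer.
sample = io.StringIO('[{"id": 1, "summary": "abcdef", "product": "clang"}]')
docs = preprocess(sample, PreprocessConfig(max_content_chars=3))
```

Truncating by character count is the simplest option. A token-based limit may suit the embedding model better, but that choice is outside what the issue specifies.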

Acceptance criteria

  • Accepts scraped Bugzilla JSON in agreed format; outputs stable content + metadata.
  • Metadata includes bug id, product/component, filterable fields. Schema documented.
  • No Bugzilla API or scraping in this component.
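
The first two criteria could be checked mechanically along these lines. This sketch reads "stable" as "the same input always yields byte-identical output" and "filterable" as "flat scalar metadata values". Both readings are assumptions, and the helper names are hypothetical.

```python
import hashlib
import json
from typing import Any

# Required keys taken from the acceptance criteria above.
REQUIRED_METADATA = ("bug_id", "product", "component")


def validate_document(doc: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    content = doc.get("content")
    if not isinstance(content, str) or not content.strip():
        problems.append("content must be a non-empty string")

    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        return problems + ["metadata must be an object"]
    for key in REQUIRED_METADATA:
        if key not in metadata:
            problems.append(f"metadata is missing {key!r}")
    for key, value in metadata.items():
        # Assumed rule: filterable fields are flat scalars, not nested objects.
        if not isinstance(value, (str, int, float, bool)):
            problems.append(f"metadata[{key!r}] is not a flat scalar")
    return problems


def fingerprint(doc: dict[str, Any]) -> str:
    """Hash a canonical JSON form so output stability can be compared across runs."""
    canonical = json.dumps(doc, sort_keys=True, ensure_ascii=False, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

A regression test can then run the preprocessor twice on a fixed fixture and compare fingerprints, which catches accidental nondeterminism such as embedded timestamps or unordered comment lists.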
