Preprocessor for scraped LLVM Bugzilla JSON → documents for Pinecone #85

## Summary
Convert **already scraped** LLVM Bugzilla data (**JSON**) into document objects (`content` + `metadata`) ready for Pinecone upsert. This component does no scraping and makes no Bugzilla API calls.

## Scope
- **Input**: Scraped Bugzilla **JSON** (existing scrape schema).
- **Output**: Documents with `content` (e.g. summary + description, optionally comments) and `metadata` (e.g. `bug_id`, `product`, `component`, `status`, `priority`, `reporter`, `created_at`, `url`).
- **In scope**: Parse and validate the JSON; normalize text (strip HTML); define the document schema; emit one document per bug or per comment, as agreed.
- **Out of scope**: Fetching from Bugzilla; calling Pinecone.
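
As a rough illustration of the in-scope steps (validate, strip HTML, build `content` + `metadata`), here is a minimal sketch for the one-document-per-bug case. The input keys (`id`, `summary`, `description`, `creation_time`, and so on) are assumptions, since the scrape schema is not specified in this issue; `strip_html` and `bug_to_document` are hypothetical names.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops tags; entities are decoded by the parser."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(text):
    """Remove HTML tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(text or "")
    parser.close()
    return " ".join("".join(parser.parts).split())


# Assumed input keys that are copied to metadata under the same name.
PASSTHROUGH_FIELDS = ("product", "component", "status", "priority", "reporter", "url")


def bug_to_document(bug, max_chars=8000):
    """Build one {content, metadata} document from one scraped bug record."""
    missing = [key for key in ("id", "summary") if key not in bug]
    if missing:
        raise ValueError(f"bug record is missing required fields: {missing}")

    parts = [strip_html(bug["summary"]), strip_html(bug.get("description", ""))]
    content = "\n\n".join(part for part in parts if part)[:max_chars]

    metadata = {"bug_id": int(bug["id"])}
    for name in PASSTHROUGH_FIELDS:
        if bug.get(name) is not None:  # omit absent fields instead of emitting nulls
            metadata[name] = str(bug[name])
    if bug.get("creation_time") is not None:  # assumed source key for created_at
        metadata["created_at"] = str(bug["creation_time"])
    return {"content": content, "metadata": metadata}
```

A per-comment variant would presumably loop over the bug's comments and carry the same `bug_id` in each document's metadata; that choice is left open above.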

## Result
A library or CLI that takes JSON input (file or stream) and returns a list of `{ content, metadata }` documents, with configuration for field mapping and truncation. Deliverables: code, tests, and a README documenting the document schema.
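
One possible shape for the library entry point and its config is sketched below. It assumes the scrape file is a top-level JSON array of bug objects; `PreprocessConfig`, `preprocess`, and the default field names are hypothetical and would need to match the agreed scrape schema.

```python
import io
import json
from dataclasses import dataclass, field


@dataclass
class PreprocessConfig:
    # Hypothetical mapping: output metadata key -> input JSON key.
    field_map: dict = field(default_factory=lambda: {
        "bug_id": "id",
        "product": "product",
        "component": "component",
        "status": "status",
    })
    # Input keys concatenated, in order, to form `content`.
    content_fields: tuple = ("summary", "description")
    # Truncation limit applied to `content`, in characters.
    max_content_chars: int = 8000


def preprocess(in_fp, config=None):
    """Read a JSON array of bug records from a file-like object; return documents."""
    config = config or PreprocessConfig()
    bugs = json.load(in_fp)
    if not isinstance(bugs, list):
        raise ValueError("expected a top-level JSON array of bug records")

    docs = []
    for bug in bugs:
        text = "\n\n".join(
            str(bug[key]).strip() for key in config.content_fields if bug.get(key)
        )
        metadata = {
            out_key: bug[src_key]
            for out_key, src_key in config.field_map.items()
            if bug.get(src_key) is not None
        }
        docs.append({"content": text[:config.max_content_chars], "metadata": metadata})
    return docs
```

Usage might look like `preprocess(open("bugs.json"))`, or `preprocess(sys.stdin)` for stream input; a CLI wrapper could then write the result out as JSON Lines.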

## Acceptance criteria
- [ ] Accepts scraped Bugzilla JSON in agreed format; outputs stable `content` + `metadata`.
- [ ] Metadata includes bug id, product/component, and the other filterable fields. Schema documented.
- [ ] No Bugzilla API or scraping in this component.
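
The first two criteria could be backed by a small output check in the test suite. The sketch below is one way to do that; the required key set (`bug_id`, `product`, `component`) is an assumption drawn from the metadata examples in Scope, and `validate_document` is a hypothetical helper.

```python
# Assumed minimum metadata; adjust to the documented schema.
REQUIRED_METADATA = ("bug_id", "product", "component")


def validate_document(doc):
    """Return a list of problems; an empty list means the document passes."""
    problems = []
    content = doc.get("content")
    if not isinstance(content, str) or not content.strip():
        problems.append("content must be a non-empty string")

    metadata = doc.get("metadata")
    if not isinstance(metadata, dict):
        return problems + ["metadata must be an object"]

    for key in REQUIRED_METADATA:
        if metadata.get(key) in (None, ""):
            problems.append(f"metadata.{key} is missing")
    return problems
```

Running this over every output document in a test, and asserting that each returns an empty list, gives a concrete check for the output criteria.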

