#85-add bugzila preprocessor for clang #89
henry0816191 wants to merge 3 commits into CppDigest:main from
Conversation
📝 Walkthrough
Adds a new BugIssuePreprocessor that recursively reads Bugzilla-like JSON files, assembles validated human-readable content with comments, extracts structured metadata (including timestamps), and returns a list of Document objects for each valid bug.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks | ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 12-29: The _to_timestamp function should parse the ISO UTC string
into a timezone-aware datetime and only handle parsing errors specifically;
update imports to include timezone from datetime, then in _to_timestamp after
datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ") attach tzinfo=timezone.utc (e.g.,
dt = dt.replace(tzinfo=timezone.utc)) before returning dt.timestamp(), and
change the broad except Exception to except ValueError to avoid masking other
errors; ensure the function still returns 0.0 on parse failure.
- Around line 87-92: The code falls back to -1 for bug_id and builds a bogus
bug_url; update the bugzila_preprocessor.py logic where bug_id and bug_url are
set so that you validate the id (e.g., ensure bug.get("id") or data.get("id") is
present and a positive integer) and skip/omit processing this record when the id
is missing or invalid instead of assigning -1; specifically change the
assignment/flow around the bug_id and bug_url variables so invalid ids cause the
function to return/continue (no metadata added and no URL constructed).
Actionable comments posted: 1
🧹 Nitpick comments (1)
pinecone_rag/preprocessor/bugzila_preprocessor.py (1)
117-127: Apply `limit` after filtering to honor requested document count. Right now `limit` slices the file list before skipping hidden/invalid entries, which can return fewer documents than requested. Consider enforcing the limit on the resulting documents.
♻️ Suggested adjustment

```diff
-        json_paths = sorted(self.data_dir.rglob("*.json"))
-        if limit is not None:
-            json_paths = json_paths[:limit]
+        json_paths = sorted(self.data_dir.rglob("*.json"))
@@
-        for json_path in json_paths:
+        for json_path in json_paths:
             if json_path.name.startswith("."):
                 continue
             doc = _load_bug_document(json_path, self.min_content_length)
             if doc is not None:
                 documents.append(doc)
+            if limit is not None and len(documents) >= limit:
+                break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/bugzila_preprocessor.py` around lines 117 - 127, The current code slices json_paths before filtering hidden/invalid files which can yield fewer results than requested; change the logic in the loop that builds documents (using json_paths, _load_bug_document, documents, min_content_length) so the limit is applied after filtering — either remove the early slice and after appending check if len(documents) == limit and break, or build the full filtered list first then take documents[:limit]; ensure skipping json_path.name.startswith(".") and None returns from _load_bug_document still respect the requested limit.
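As a standalone illustration of the same pattern, here is a runnable sketch that applies `limit` after filtering. The `loader` callable stands in for `_load_bug_document`; its signature and the function name `load_documents` are assumptions for this example, not the PR's actual code:

```python
from pathlib import Path
from typing import Callable, List, Optional


def load_documents(
    data_dir: str,
    min_content_length: int,
    loader: Callable[[Path, int], Optional[str]],
    limit: Optional[int] = None,
) -> List[str]:
    """Collect valid documents, counting `limit` against accepted docs only."""
    documents: List[str] = []
    for json_path in sorted(Path(data_dir).rglob("*.json")):
        if json_path.name.startswith("."):
            continue  # hidden files never count against the limit
        doc = loader(json_path, min_content_length)
        if doc is not None:
            documents.append(doc)
        # Check AFTER appending, so skipped/invalid files are not counted
        if limit is not None and len(documents) >= limit:
            break
    return documents
```

Because the limit check runs only after a document is accepted, a directory containing hidden or invalid files still yields the full requested count when enough valid files exist.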
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 1-8: The module docstring path is inconsistent with the
constructor default: the docstring says "data/bugs/**" but the constructor uses
"data/github/Clang/bugs"; update the top-of-file docstring in
pinecone_rag/preprocessor/bugzila_preprocessor.py to reference the same default
path string ("data/github/Clang/bugs") to match the constructor (or
alternatively change the constructor default to "data/bugs/**" if that is
preferred) so both the docstring and the constructor default are consistent.
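A sketch of the first option (docstring updated to match the constructor default). The class body shown here is a stub; only the docstring/default pairing reflects the review comment:

```python
"""Preprocess Bugzilla-like JSON bug reports under data/github/Clang/bugs."""
from pathlib import Path


class BugIssuePreprocessor:
    # Default path matches the module docstring above, keeping the two in sync
    def __init__(self, data_dir: str = "data/github/Clang/bugs") -> None:
        self.data_dir = Path(data_dir)
```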