Skip to content
This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#85-add bugzila preprocessor for clang#89

Open
henry0816191 wants to merge 3 commits intoCppDigest:mainfrom
henry0816191:dev-85
Open

#85-add bugzila preprocessor for clang#89
henry0816191 wants to merge 3 commits intoCppDigest:mainfrom
henry0816191:dev-85

Conversation

@henry0816191
Copy link
Copy Markdown
Collaborator

@henry0816191 henry0816191 commented Feb 23, 2026

Summary by CodeRabbit

  • New Features
    • Added a Bugzilla preprocessor that reads JSON bug data from a configurable directory, builds human-readable bug documents (product, component, status, resolution, severity, priority, keywords, and concatenated comments), extracts rich metadata (bug number, title, author, timestamps, URL), skips malformed or too-short entries, and logs processing results and counts.

@henry0816191 henry0816191 added the enhancement New feature or request label Feb 23, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

Adds a new BugIssuePreprocessor that recursively reads Bugzilla-like JSON files, assembles validated human-readable content with comments, extracts structured metadata (including timestamps), and returns a list of Document objects for each valid bug.

Changes

Cohort / File(s) Summary
New Bugzilla Preprocessor
pinecone_rag/preprocessor/bugzila_preprocessor.py
Introduces BugIssuePreprocessor with recursive directory scan for *.json, JSON parsing, comment concatenation, content length validation, metadata extraction (including timestamp conversion), Document creation, optional result limiting, and logging for errors, missing data, and counts.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~15 minutes

Poem

🐇 I nibble through JSON, line by line and bright,
Gathering bugs and comments by soft moonlight.
I stitch fields to stories, timestamps in tow,
Hopping them into Documents ready to go. 🍃✨

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately reflects the main change: adding a Bugzilla preprocessor module for handling Clang bug data in the Pinecone RAG system.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@henry0816191 henry0816191 self-assigned this Feb 23, 2026
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 12-29: The _to_timestamp function should parse the ISO UTC string
into a timezone-aware datetime and only handle parsing errors specifically;
update imports to include timezone from datetime, then in _to_timestamp after
datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ") attach tzinfo=timezone.utc (e.g.,
dt = dt.replace(tzinfo=timezone.utc)) before returning dt.timestamp(), and
change the broad except Exception to except ValueError to avoid masking other
errors; ensure the function still returns 0.0 on parse failure.
- Around line 87-92: The code falls back to -1 for bug_id and builds a bogus
bug_url; update the bugzila_preprocessor.py logic where bug_id and bug_url are
set so that you validate the id (e.g., ensure bug.get("id") or data.get("id") is
present and a positive integer) and skip/omit processing this record when the id
is missing or invalid instead of assigning -1; specifically change the
assignment/flow around the bug_id and bug_url variables so invalid ids cause the
function to return/continue (no metadata added and no URL constructed).

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfc953 and 128908a.

📒 Files selected for processing (2)
  • pinecone_rag/preprocessor/bugzila_preprocessor.py
  • requirements.txt

Comment thread pinecone_rag/preprocessor/bugzila_preprocessor.py Outdated
Comment thread pinecone_rag/preprocessor/bugzila_preprocessor.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
pinecone_rag/preprocessor/bugzila_preprocessor.py (1)

117-127: Apply limit after filtering to honor requested document count.

Right now limit slices the file list before skipping hidden/invalid entries, which can return fewer documents than requested. Consider enforcing the limit on the resulting documents.

♻️ Suggested adjustment
-        json_paths = sorted(self.data_dir.rglob("*.json"))
-        if limit is not None:
-            json_paths = json_paths[:limit]
+        json_paths = sorted(self.data_dir.rglob("*.json"))
@@
-        for json_path in json_paths:
+        for json_path in json_paths:
             if json_path.name.startswith("."):
                 continue
             doc = _load_bug_document(json_path, self.min_content_length)
             if doc is not None:
                 documents.append(doc)
+                if limit is not None and len(documents) >= limit:
+                    break
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/bugzila_preprocessor.py` around lines 117 - 127,
The current code slices json_paths before filtering hidden/invalid files which
can yield fewer results than requested; change the logic in the loop that builds
documents (using json_paths, _load_bug_document, documents, min_content_length)
so the limit is applied after filtering — either remove the early slice and
after appending check if len(documents) == limit and break, or build the full
filtered list first then take documents[:limit]; ensure skipping
json_path.name.startswith(".") and None returns from _load_bug_document still
respect the requested limit.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 1-8: The module docstring path is inconsistent with the
constructor default: the docstring says "data/bugs/**" but the constructor uses
"data/github/Clang/bugs"; update the top-of-file docstring in
pinecone_rag/preprocessor/bugzila_preprocessor.py to reference the same default
path string ("data/github/Clang/bugs") to match the constructor (or
alternatively change the constructor default to "data/bugs/**" if that is
preferred) so both the docstring and the constructor default are consistent.

---

Nitpick comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 117-127: The current code slices json_paths before filtering
hidden/invalid files which can yield fewer results than requested; change the
logic in the loop that builds documents (using json_paths, _load_bug_document,
documents, min_content_length) so the limit is applied after filtering — either
remove the early slice and after appending check if len(documents) == limit and
break, or build the full filtered list first then take documents[:limit]; ensure
skipping json_path.name.startswith(".") and None returns from _load_bug_document
still respect the requested limit.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a090a80 and b3fd1d4.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/bugzila_preprocessor.py

Comment thread pinecone_rag/preprocessor/bugzila_preprocessor.py
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant