#85-add bugzila preprocessor for clang #89
henry0816191 wants to merge 3 commits into CppDigest:main from
Conversation
📝 Walkthrough
Adds a new BugIssuePreprocessor that recursively reads Bugzilla-like JSON files, assembles validated human-readable content with comments, extracts structured metadata (including timestamps), and returns a list of Document objects for each valid bug.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~15 minutes
🚥 Pre-merge checks | ✅ 3 passed
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 12-29: The _to_timestamp function should parse the ISO UTC string
into a timezone-aware datetime and only handle parsing errors specifically;
update imports to include timezone from datetime, then in _to_timestamp after
datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ") attach tzinfo=timezone.utc (e.g.,
dt = dt.replace(tzinfo=timezone.utc)) before returning dt.timestamp(), and
change the broad except Exception to except ValueError to avoid masking other
errors; ensure the function still returns 0.0 on parse failure.
- Around line 87-92: The code falls back to -1 for bug_id and builds a bogus
bug_url; update the bugzila_preprocessor.py logic where bug_id and bug_url are
set so that you validate the id (e.g., ensure bug.get("id") or data.get("id") is
present and a positive integer) and skip/omit processing this record when the id
is missing or invalid instead of assigning -1; specifically change the
assignment/flow around the bug_id and bug_url variables so invalid ids cause the
function to return/continue (no metadata added and no URL constructed).
Actionable comments posted: 1
🧹 Nitpick comments (1)
pinecone_rag/preprocessor/bugzila_preprocessor.py (1)
117-127: Apply `limit` after filtering to honor requested document count. Right now `limit` slices the file list before skipping hidden/invalid entries, which can return fewer documents than requested. Consider enforcing the limit on the resulting documents.
♻️ Suggested adjustment

```diff
-        json_paths = sorted(self.data_dir.rglob("*.json"))
-        if limit is not None:
-            json_paths = json_paths[:limit]
+        json_paths = sorted(self.data_dir.rglob("*.json"))
@@
-        for json_path in json_paths:
+        for json_path in json_paths:
             if json_path.name.startswith("."):
                 continue
             doc = _load_bug_document(json_path, self.min_content_length)
             if doc is not None:
                 documents.append(doc)
+            if limit is not None and len(documents) >= limit:
+                break
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/bugzila_preprocessor.py` around lines 117 - 127, The current code slices json_paths before filtering hidden/invalid files which can yield fewer results than requested; change the logic in the loop that builds documents (using json_paths, _load_bug_document, documents, min_content_length) so the limit is applied after filtering — either remove the early slice and after appending check if len(documents) == limit and break, or build the full filtered list first then take documents[:limit]; ensure skipping json_path.name.startswith(".") and None returns from _load_bug_document still respect the requested limit.
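As a standalone illustration of the same pattern, here is a runnable sketch that applies `limit` after filtering. The `loader` callable stands in for `_load_bug_document`; its signature and the function name `load_documents` are assumptions for this example, not the PR's actual code:

```python
from pathlib import Path
from typing import Callable, List, Optional


def load_documents(
    data_dir: str,
    min_content_length: int,
    loader: Callable[[Path, int], Optional[str]],
    limit: Optional[int] = None,
) -> List[str]:
    """Collect valid documents, counting `limit` against accepted docs only."""
    documents: List[str] = []
    for json_path in sorted(Path(data_dir).rglob("*.json")):
        if json_path.name.startswith("."):
            continue  # hidden files never count against the limit
        doc = loader(json_path, min_content_length)
        if doc is not None:
            documents.append(doc)
        # Check AFTER appending, so skipped/invalid files are not counted
        if limit is not None and len(documents) >= limit:
            break
    return documents
```

Because the limit check runs only after a document is accepted, a directory containing hidden or invalid files still yields the full requested count when enough valid files exist.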
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/bugzila_preprocessor.py`:
- Around line 1-8: The module docstring path is inconsistent with the
constructor default: the docstring says "data/bugs/**" but the constructor uses
"data/github/Clang/bugs"; update the top-of-file docstring in
pinecone_rag/preprocessor/bugzila_preprocessor.py to reference the same default
path string ("data/github/Clang/bugs") to match the constructor (or
alternatively change the constructor default to "data/bugs/**" if that is
preferred) so both the docstring and the constructor default are consistent.
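A sketch of the first option (docstring updated to match the constructor default). The class body shown here is a stub; only the docstring/default pairing reflects the review comment:

```python
"""Preprocess Bugzilla-like JSON bug reports under data/github/Clang/bugs."""
from pathlib import Path


class BugIssuePreprocessor:
    # Default path matches the module docstring above, keeping the two in sync
    def __init__(self, data_dir: str = "data/github/Clang/bugs") -> None:
        self.data_dir = Path(data_dir)
```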