#87-add llvm github issue preprocessor #94
jonathanMLDev wants to merge 3 commits into CppDigest:main from
Conversation
📝 Walkthrough

Adds a new GitIssuePreprocessor that loads GitHub issue JSON files from data/github/**/issue/*.json, filters LLVM-related issues, extracts metadata and comments, builds content, enforces minimum content length, and returns LangChain Document instances.
Sequence Diagram

sequenceDiagram
actor User
participant GitIssuePreprocessor as GitIssuePreprocessor
participant FileSystem as FileSystem
participant JSONParser as JSONParser
participant LangChainDoc as LangChainDoc
User->>GitIssuePreprocessor: load_documents(limit)
GitIssuePreprocessor->>FileSystem: scan data/github/**/issue/*.json
FileSystem-->>GitIssuePreprocessor: list of files
loop per file
GitIssuePreprocessor->>FileSystem: read file
FileSystem-->>GitIssuePreprocessor: raw JSON
GitIssuePreprocessor->>JSONParser: parse issue and comments
JSONParser-->>GitIssuePreprocessor: structured data
GitIssuePreprocessor->>GitIssuePreprocessor: extract metadata & build content
alt content >= min_length
GitIssuePreprocessor->>LangChainDoc: create Document(page_content, metadata)
LangChainDoc-->>GitIssuePreprocessor: Document
else content too short or invalid
GitIssuePreprocessor-->>GitIssuePreprocessor: log and skip
end
end
GitIssuePreprocessor-->>User: return List[Document]
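The flow in the diagram above can be sketched roughly as follows. This is a minimal illustration based on the walkthrough, not the actual implementation; the class shape, the `min_length` parameter, and the `Document` stand-in are assumptions.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


class Document:
    # Stand-in for langchain_core.documents.Document
    def __init__(self, page_content: str, metadata: Dict[str, Any]):
        self.page_content = page_content
        self.metadata = metadata


class GitIssuePreprocessor:
    def __init__(self, root: str = "data/github", min_length: int = 50):
        self.root = Path(root)
        self.min_length = min_length

    def load_documents(self, limit: Optional[int] = None) -> List[Document]:
        docs: List[Document] = []
        # Scan data/github/**/issue/*.json as in the diagram
        for json_path in sorted(self.root.glob("**/issue/*.json")):
            if limit is not None and len(docs) >= limit:
                break
            try:
                info = json.loads(json_path.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                continue  # log and skip unreadable files
            # Build content from title + body (real code also appends comments)
            content = (info.get("title") or "") + "\n" + (info.get("body") or "")
            if len(content.strip()) < self.min_length:
                continue  # enforce minimum content length
            docs.append(Document(content, {"url": info.get("html_url", "")}))
        return docs
```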
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 5
🧹 Nitpick comments (1)
pinecone_rag/preprocessor/git_issue_preprocessor.py (1)
88-99: Redundant `or` fallbacks in metadata dictionary
`number` is already initialised to `-1` via `info.get("number", -1)` (line 81), and `created_at` / `updated_at` / `closed_at` are already `0.0` by default from `get_timestamp_from_date` (its `default` parameter is `0.0`). The `or -1` / `or 0.0` guards in the metadata dict are therefore no-ops in normal flow and can mislead readers into thinking the values could still be falsy at this point.

♻️ Proposed cleanup
     meta: Dict[str, Any] = {
-        "author": author or "",
-        "title": title or "",
-        "number": number or -1,
-        "url": html_url or "",
-        "created_at": created_at or 0.0,
-        "updated_at": updated_at or 0.0,
-        "closed_at": closed_at or 0.0,
+        "author": author,
+        "title": title,
+        "number": number,
+        "url": html_url,
+        "created_at": created_at,
+        "updated_at": updated_at,
+        "closed_at": closed_at,
         "type": "issue",
-        "state": state or "",
-        "state_reason": state_reason or "",
+        "state": state,
+        "state_reason": state_reason,
     }

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/git_issue_preprocessor.py` around lines 88 - 99, The meta dict in git_issue_preprocessor.py contains redundant `or` fallbacks for fields already given defaults (`number`, `created_at`, `updated_at`, `closed_at`); update the construction of the `meta` dict in the function that defines `meta` (the dict literal using keys "number", "created_at", "updated_at", "closed_at") to remove the `or -1` and `or 0.0` fallbacks and rely on the earlier defaults returned by `info.get("number", -1)` and `get_timestamp_from_date(..., default=0.0)`, leaving other fallbacks (e.g., `author or ""`, `title or ""`) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 54-59: The function _info_to_document currently lacks the doc_type
parameter but callers (_pr_json_to_document) pass a fifth argument "github-pr",
causing a TypeError; add a new parameter doc_type: str (accepting "github-issue"
or "github-pr") to _info_to_document's signature, use that value instead of the
hardcoded "type": "issue" in the returned Document payload, and update
call-sites (notably _pr_json_to_document and _issue_json_to_document) to pass
the appropriate doc_type string ("github-pr" and "github-issue" respectively) so
the Document.type reflects the incoming doc_type.
- Around line 44-47: In _build_content_parts remove the redundant check by
deleting the "or com_body == \"\"" portion in the loop that assigns com_body
(the for com in comments loop); keep the existing defensive fallback
(com.get("body") or "").strip() and the subsequent if not com_body: continue so
empty strings and None are already handled by not com_body and no additional
equality check is needed.
- Around line 139-146: Remove the dead variable assignment to path_str or
actually use it: either delete the two lines that create and lowercase path_str
(the unused path_str variable in the try block) to avoid dead code, or if you
intended path-based detection, use path_str in the subsequent logic that checks
JSON keys; update the try block around json_path.read_text/json.loads in the
function (the code that logs on exception via logger.debug and returns None) so
it no longer creates an unused path_str variable and keep behavior otherwise
identical.
- Around line 74-76: The code sets closed_at using
get_timestamp_from_date(info.get("closed_at", info.get("pushed_at", ""))) which
wrongly falls back to the PR's pushed_at (last push) and can mark open PRs as
closed; change the fallback to an empty string (or a proper close-related field
like "merged_at" if you intend to treat merges as closes) so use
info.get("closed_at", "") (or info.get("closed_at", info.get("merged_at", "")))
when calling get_timestamp_from_date to avoid using pushed_at; update the
expression around closed_at and get_timestamp_from_date accordingly.
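A hedged sketch of the corrected `closed_at` handling: fall back to `merged_at` (or nothing) rather than `pushed_at`, so open PRs are not marked closed. The `get_timestamp_from_date` helper shown here is an assumed reimplementation of the project's utility (ISO-8601 string to unix timestamp, with a `default` of `0.0`).

```python
from datetime import datetime
from typing import Dict


def get_timestamp_from_date(date_str: str, default: float = 0.0) -> float:
    # Assumed helper: "2024-01-02T03:04:05Z" -> unix timestamp, default on failure
    if not date_str:
        return default
    try:
        return datetime.fromisoformat(date_str.replace("Z", "+00:00")).timestamp()
    except ValueError:
        return default


def closed_timestamp(info: Dict[str, str]) -> float:
    # Do NOT fall back to pushed_at; treat a merge as a close if desired
    return get_timestamp_from_date(info.get("closed_at") or info.get("merged_at") or "")
```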
- Around line 65-67: The current fallback uses info.get("url") which is the API
endpoint, not a browser link; change the logic around the html_url variable so
we do not store API endpoints as browser URLs: require html_url =
info.get("html_url") and if it's missing log/warn and return None (or otherwise
skip the record) instead of falling back to info.get("url"); alternatively, if
you must recover, translate the API endpoint to a browser URL by converting the
API prefix (e.g., replace "https://api.github.com/repos/" with
"https://github.com/" and remove the "api" segments) before returning—update
references around html_url and the code that reads info.get("url") in
git_issue_preprocessor.py accordingly.
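The recovery path described above could be sketched like this: prefer `html_url`, translate a known API endpoint into a browser URL as a fallback, and otherwise signal that the record should be skipped. The function name is hypothetical.

```python
from typing import Optional

API_PREFIX = "https://api.github.com/repos/"


def browser_url(info: dict) -> Optional[str]:
    html_url = info.get("html_url")
    if html_url:
        return html_url
    api_url = info.get("url") or ""
    if api_url.startswith(API_PREFIX):
        # e.g. api.github.com/repos/llvm/llvm-project/issues/42
        #   -> github.com/llvm/llvm-project/issues/42
        return "https://github.com/" + api_url[len(API_PREFIX):]
    return None  # caller logs and skips rather than storing an API endpoint
```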
---
Nitpick comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 88-99: The meta dict in git_issue_preprocessor.py contains
redundant `or` fallbacks for fields already given defaults (`number`,
`created_at`, `updated_at`, `closed_at`); update the construction of the `meta`
dict in the function that defines `meta` (the dict literal using keys "number",
"created_at", "updated_at", "closed_at") to remove the `or -1` and `or 0.0`
fallbacks and rely on the earlier defaults returned by `info.get("number", -1)`
and `get_timestamp_from_date(..., default=0.0)`, leaving other fallbacks (e.g.,
`author or ""`, `title or ""`) intact.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 45-57: The _extract_labels function can return None entries
because lb.get("name") may be None, which will cause ", ".join(...) to fail
later; update _extract_labels to coerce and filter label names: for each lb in
labels_raw compute name = lb.get("name") if isinstance(lb, dict) else str(lb),
then include only truthy/non-empty names (e.g., if name) in the returned list so
you never return None or empty strings.
- Around line 27-42: The _get_nested function can raise when an intermediate
becomes the default and is not a dict; update _get_nested so inside the for loop
you check if the current data is a dict before calling .get and immediately
return default if it is not (i.e., in function _get_nested, on each iteration
validate isinstance(data, dict) and if false return default; only then do data =
data.get(key, default)), preserving the existing signature and return behavior.
- Around line 85-141: The function _info_to_document currently ingests all
issues; add an LLVM-only guard after extracting html_url (and before building
content) to return None for non-LLVM repos. Specifically, check html_url (and
optionally info.get("repository_url") if present) for an LLVM indicator such as
containing "github.com/llvm" or "/llvm/" (case-insensitive) and return None when
the check fails so only LLVM issues are processed; place this check right after
html_url is computed and before content assembly in _info_to_document.
Summary by CodeRabbit