This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#87 - add LLVM GitHub issue preprocessor #94

Open
jonathanMLDev wants to merge 3 commits into CppDigest:main from jonathanMLDev:dev-87
Conversation

@jonathanMLDev
Collaborator

@jonathanMLDev jonathanMLDev commented Feb 24, 2026

Summary by CodeRabbit

  • New Features
    • Added GitHub issue preprocessing that loads issue JSON, extracts metadata (author, title, number, URL, timestamps, state, labels, type), builds consolidated content from labels/body/comments, supports an optional limit, enforces configurable minimum content length, validates/skips malformed or incomplete entries, and logs granular debug information.

@jonathanMLDev jonathanMLDev self-assigned this Feb 24, 2026
@coderabbitai

coderabbitai Bot commented Feb 24, 2026

Warning

Rate limit exceeded

@jonathanMLDev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 37 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 204397e and c1d4b90.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_issue_preprocessor.py
📝 Walkthrough

Walkthrough

Adds a new GitIssuePreprocessor that loads GitHub issue JSON files from data/github/**/issue/*.json, filters LLVM-related issues, extracts metadata and comments, builds content, enforces minimum content length, and returns LangChain Document instances.

Changes

Cohort / File(s) Summary
New GitIssuePreprocessor
pinecone_rag/preprocessor/git_issue_preprocessor.py
Introduces GitIssuePreprocessor with __init__(config: Optional[GitConfig]) and load_documents(limit: Optional[int]) -> List[Document]. Scans data/github/**/issue/*.json, parses issues and comments, extracts metadata (author, title, number, url, timestamps, state, state_reason, type, labels), assembles page content, validates minimum content length, skips invalid/malformed entries, and emits Documents. Includes granular debug logging and helper functions for nested key access and label/content assembly.
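Based on this walkthrough, the loading flow can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Document` dataclass stands in for LangChain's `Document`, the `min_content_length` default and the helper structure are assumptions, and the real class takes a `GitConfig`.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional

@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)

class GitIssuePreprocessor:
    """Sketch: loads GitHub issue JSON files and converts them to Documents."""

    def __init__(self, root: Path = Path("data/github"), min_content_length: int = 50):
        self.root = root
        self.min_content_length = min_content_length

    def load_documents(self, limit: Optional[int] = None) -> List[Document]:
        docs: List[Document] = []
        for json_path in sorted(self.root.glob("**/issue/*.json")):
            if limit is not None and len(docs) >= limit:
                break  # optional limit on the number of documents
            try:
                info = json.loads(json_path.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                continue  # skip malformed or unreadable files
            doc = self._info_to_document(info)
            if doc is not None:
                docs.append(doc)
        return docs

    def _info_to_document(self, info: Dict[str, Any]) -> Optional[Document]:
        title = info.get("title") or ""
        body = info.get("body") or ""
        labels = [lb.get("name") for lb in info.get("labels", [])
                  if isinstance(lb, dict) and lb.get("name")]
        comments = [(c.get("body") or "").strip() for c in info.get("comments", [])]
        # Consolidated content: title, labels, body, then non-empty comments.
        parts = [title, ", ".join(labels), body, *[c for c in comments if c]]
        content = "\n\n".join(p for p in parts if p)
        if len(content) < self.min_content_length:
            return None  # enforce configurable minimum content length
        meta = {
            "author": (info.get("user") or {}).get("login", ""),
            "title": title,
            "number": info.get("number", -1),
            "url": info.get("html_url", ""),
            "state": info.get("state", ""),
            "type": "issue",
        }
        return Document(page_content=content, metadata=meta)
```

A caller would instantiate the class and invoke `load_documents(limit=...)`; anything malformed or below the length threshold is skipped rather than raised.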

Sequence Diagram

sequenceDiagram
    actor User
    participant GitIssuePreprocessor as GitIssuePreprocessor
    participant FileSystem as FileSystem
    participant JSONParser as JSONParser
    participant LangChainDoc as LangChainDoc

    User->>GitIssuePreprocessor: load_documents(limit)
    GitIssuePreprocessor->>FileSystem: scan data/github/**/issue/*.json
    FileSystem-->>GitIssuePreprocessor: list of files
    loop per file
        GitIssuePreprocessor->>FileSystem: read file
        FileSystem-->>GitIssuePreprocessor: raw JSON
        GitIssuePreprocessor->>JSONParser: parse issue and comments
        JSONParser-->>GitIssuePreprocessor: structured data
        GitIssuePreprocessor->>GitIssuePreprocessor: extract metadata & build content
        alt content >= min_length
            GitIssuePreprocessor->>LangChainDoc: create Document(page_content, metadata)
            LangChainDoc-->>GitIssuePreprocessor: Document
        else content too short or invalid
            GitIssuePreprocessor-->>GitIssuePreprocessor: log and skip
        end
    end
    GitIssuePreprocessor-->>User: return List[Document]

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble through JSON files by night,
extracting labels, comments, and light.
Pages stitched from issues and thread,
Documents hop where data led.
Hooray—LLVM stories now softly read!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The title clearly and specifically describes the main change: adding a new LLVM GitHub issue preprocessor class with appropriate functionality.
Docstring Coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
pinecone_rag/preprocessor/git_issue_preprocessor.py (1)

88-99: Redundant `or` fallbacks in metadata dictionary

number is already initialised to -1 via info.get("number", -1) (line 81), and created_at / updated_at / closed_at are already 0.0 by default from get_timestamp_from_date (its default parameter is 0.0). The `or -1` / `or 0.0` guards in the metadata dict are therefore no-ops in normal flow and can mislead readers into thinking the values could still be falsy at this point.

♻️ Proposed cleanup
     meta: Dict[str, Any] = {
-        "author": author or "",
-        "title": title or "",
-        "number": number or -1,
-        "url": html_url or "",
-        "created_at": created_at or 0.0,
-        "updated_at": updated_at or 0.0,
-        "closed_at": closed_at or 0.0,
+        "author": author,
+        "title": title,
+        "number": number,
+        "url": html_url,
+        "created_at": created_at,
+        "updated_at": updated_at,
+        "closed_at": closed_at,
         "type": "issue",
-        "state": state or "",
-        "state_reason": state_reason or "",
+        "state": state,
+        "state_reason": state_reason,
     }
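The underlying pitfall generalizes: an `x or default` guard replaces every falsy value, not just missing ones, which is exactly why leaving redundant guards in place misleads readers. A quick illustration (the `closed_at` name here echoes the review; the values are made up):

```python
# `or` rewrites any falsy value, so a legitimate zero timestamp
# (e.g. the 0.0 default from a helper like get_timestamp_from_date)
# would be silently replaced:
closed_at = 0.0
guarded = closed_at or -1                          # becomes -1, not 0.0
explicit = closed_at if closed_at is not None else -1  # stays 0.0
print(guarded, explicit)
```

This is why the review prefers relying on the defaults set earlier rather than re-guarding with `or` at dict-construction time.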
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/git_issue_preprocessor.py` around lines 88 - 99,
The meta dict in git_issue_preprocessor.py contains redundant `or` fallbacks for
fields already given defaults (`number`, `created_at`, `updated_at`,
`closed_at`); update the construction of the `meta` dict in the function that
defines `meta` (the dict literal using keys "number", "created_at",
"updated_at", "closed_at") to remove the `or -1` and `or 0.0` fallbacks and rely
on the earlier defaults returned by `info.get("number", -1)` and
`get_timestamp_from_date(..., default=0.0)`, leaving other fallbacks (e.g.,
`author or ""`, `title or ""`) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 54-59: The function _info_to_document currently lacks the doc_type
parameter but callers (_pr_json_to_document) pass a fifth argument "github-pr",
causing a TypeError; add a new parameter doc_type: str (accepting "github-issue"
or "github-pr") to _info_to_document's signature, use that value instead of the
hardcoded "type": "issue" in the returned Document payload, and update
call-sites (notably _pr_json_to_document and _issue_json_to_document) to pass
the appropriate doc_type string ("github-pr" and "github-issue" respectively) so
the Document.type reflects the incoming doc_type.
- Around line 44-47: In _build_content_parts remove the redundant check by
deleting the "or com_body == \"\"" portion in the loop that assigns com_body
(the for com in comments loop); keep the existing defensive fallback
(com.get("body") or "").strip() and the subsequent if not com_body: continue so
empty strings and None are already handled by not com_body and no additional
equality check is needed.
- Around line 139-146: Remove the dead variable assignment to path_str or
actually use it: either delete the two lines that create and lowercase path_str
(the unused path_str variable in the try block) to avoid dead code, or if you
intended path-based detection, use path_str in the subsequent logic that checks
JSON keys; update the try block around json_path.read_text/json.loads in the
function (the code that logs on exception via logger.debug and returns None) so
it no longer creates an unused path_str variable and keep behavior otherwise
identical.
- Around line 74-76: The code sets closed_at using
get_timestamp_from_date(info.get("closed_at", info.get("pushed_at", ""))) which
wrongly falls back to the PR's pushed_at (last push) and can mark open PRs as
closed; change the fallback to an empty string (or a proper close-related field
like "merged_at" if you intend to treat merges as closes) so use
info.get("closed_at", "") (or info.get("closed_at", info.get("merged_at", "")))
when calling get_timestamp_from_date to avoid using pushed_at; update the
expression around closed_at and get_timestamp_from_date accordingly.
- Around line 65-67: The current fallback uses info.get("url") which is the API
endpoint, not a browser link; change the logic around the html_url variable so
we do not store API endpoints as browser URLs: require html_url =
info.get("html_url") and if it's missing log/warn and return None (or otherwise
skip the record) instead of falling back to info.get("url"); alternatively, if
you must recover, translate the API endpoint to a browser URL by converting the
API prefix (e.g., replace "https://api.github.com/repos/" with
"https://github.com/" and remove the "api" segments) before returning—update
references around html_url and the code that reads info.get("url") in
git_issue_preprocessor.py accordingly.
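On the last point, translating a GitHub API endpoint into a browser link is mechanical if recovery is preferred over skipping. A hedged sketch (the function name is hypothetical; the path shapes reflect GitHub's public API URL layout, where PR endpoints use `/pulls/` but browser URLs use `/pull/`):

```python
def api_url_to_html_url(api_url: str) -> str:
    """Best-effort translation of a GitHub API issue/PR URL to a browser URL.

    e.g. https://api.github.com/repos/llvm/llvm-project/issues/123
      -> https://github.com/llvm/llvm-project/issues/123
    """
    prefix = "https://api.github.com/repos/"
    if api_url.startswith(prefix):
        # Drop the API prefix and fix the PR path segment.
        return "https://github.com/" + api_url[len(prefix):].replace("/pulls/", "/pull/")
    return api_url  # already a browser URL (or unrecognized): pass through
```

Skipping the record with a warning, as the review suggests, is the safer default; this conversion is only for pipelines that cannot afford to drop documents.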

---

Nitpick comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 88-99: The meta dict in git_issue_preprocessor.py contains
redundant `or` fallbacks for fields already given defaults (`number`,
`created_at`, `updated_at`, `closed_at`); update the construction of the `meta`
dict in the function that defines `meta` (the dict literal using keys "number",
"created_at", "updated_at", "closed_at") to remove the `or -1` and `or 0.0`
fallbacks and rely on the earlier defaults returned by `info.get("number", -1)`
and `get_timestamp_from_date(..., default=0.0)`, leaving other fallbacks (e.g.,
`author or ""`, `title or ""`) intact.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfc953 and fefa0bc.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_issue_preprocessor.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 45-57: The _extract_labels function can return None entries
because lb.get("name") may be None, which will cause ", ".join(...) to fail
later; update _extract_labels to coerce and filter label names: for each lb in
labels_raw compute name = lb.get("name") if isinstance(lb, dict) else str(lb),
then include only truthy/non-empty names (e.g., if name) in the returned list so
you never return None or empty strings.
- Around line 27-42: The _get_nested function can raise when an intermediate
becomes the default and is not a dict; update _get_nested so inside the for loop
you check if the current data is a dict before calling .get and immediately
return default if it is not (i.e., in function _get_nested, on each iteration
validate isinstance(data, dict) and if false return default; only then do data =
data.get(key, default)), preserving the existing signature and return behavior.
- Around line 85-141: The function _info_to_document currently ingests all
issues; add an LLVM-only guard after extracting html_url (and before building
content) to return None for non-LLVM repos. Specifically, check html_url (and
optionally info.get("repository_url") if present) for an LLVM indicator such as
containing "github.com/llvm" or "/llvm/" (case-insensitive) and return None when
the check fails so only LLVM issues are processed; place this check right after
html_url is computed and before content assembly in _info_to_document.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fefa0bc and 204397e.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_issue_preprocessor.py

