This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#87 - add LLVM GitHub issue preprocessor #94

Open
jonathanMLDev wants to merge 3 commits into CppDigest:main from jonathanMLDev:dev-87
Conversation

@jonathanMLDev
Collaborator

@jonathanMLDev jonathanMLDev commented Feb 24, 2026

Summary by CodeRabbit

  • New Features
    • Added GitHub issue preprocessing that loads issue JSON, extracts metadata (author, title, number, URL, timestamps, state, labels, type), builds consolidated content from labels/body/comments, supports an optional limit, enforces configurable minimum content length, validates/skips malformed or incomplete entries, and logs granular debug information.

@jonathanMLDev jonathanMLDev self-assigned this Feb 24, 2026
@coderabbitai

coderabbitai Bot commented Feb 24, 2026

Warning

Rate limit exceeded

@jonathanMLDev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 2 minutes and 37 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 204397e and c1d4b90.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_issue_preprocessor.py
📝 Walkthrough

Walkthrough

Adds a new GitIssuePreprocessor that loads GitHub issue JSON files from data/github/**/issue/*.json, filters LLVM-related issues, extracts metadata and comments, builds content, enforces minimum content length, and returns LangChain Document instances.

Changes

Cohort / File(s) Summary
New GitIssuePreprocessor
pinecone_rag/preprocessor/git_issue_preprocessor.py
Introduces GitIssuePreprocessor with __init__(config: Optional[GitConfig]) and load_documents(limit: Optional[int]) -> List[Document]. Scans data/github/**/issue/*.json, parses issues and comments, extracts metadata (author, title, number, url, timestamps, state, state_reason, type, labels), assembles page content, validates minimum content length, skips invalid/malformed entries, and emits Documents. Includes granular debug logging and helper functions for nested key access and label/content assembly.
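Based on this walkthrough, the loading flow can be sketched roughly as follows. This is a minimal illustration, not the actual implementation: the `Document` dataclass stands in for LangChain's `Document`, the `min_content_length` default and the helper structure are assumptions, and the real class takes a `GitConfig`.

```python
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional

@dataclass
class Document:
    """Stand-in for langchain_core.documents.Document."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)

class GitIssuePreprocessor:
    """Sketch: loads GitHub issue JSON files and converts them to Documents."""

    def __init__(self, root: Path = Path("data/github"), min_content_length: int = 50):
        self.root = root
        self.min_content_length = min_content_length

    def load_documents(self, limit: Optional[int] = None) -> List[Document]:
        docs: List[Document] = []
        for json_path in sorted(self.root.glob("**/issue/*.json")):
            if limit is not None and len(docs) >= limit:
                break  # optional limit on the number of documents
            try:
                info = json.loads(json_path.read_text(encoding="utf-8"))
            except (OSError, json.JSONDecodeError):
                continue  # skip malformed or unreadable files
            doc = self._info_to_document(info)
            if doc is not None:
                docs.append(doc)
        return docs

    def _info_to_document(self, info: Dict[str, Any]) -> Optional[Document]:
        title = info.get("title") or ""
        body = info.get("body") or ""
        labels = [lb.get("name") for lb in info.get("labels", [])
                  if isinstance(lb, dict) and lb.get("name")]
        comments = [(c.get("body") or "").strip() for c in info.get("comments", [])]
        # Consolidated content: title, labels, body, then non-empty comments.
        parts = [title, ", ".join(labels), body, *[c for c in comments if c]]
        content = "\n\n".join(p for p in parts if p)
        if len(content) < self.min_content_length:
            return None  # enforce configurable minimum content length
        meta = {
            "author": (info.get("user") or {}).get("login", ""),
            "title": title,
            "number": info.get("number", -1),
            "url": info.get("html_url", ""),
            "state": info.get("state", ""),
            "type": "issue",
        }
        return Document(page_content=content, metadata=meta)
```

A caller would instantiate the class and invoke `load_documents(limit=...)`; anything malformed or below the length threshold is skipped rather than raised.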

Sequence Diagram

sequenceDiagram
    actor User
    participant GitIssuePreprocessor as GitIssuePreprocessor
    participant FileSystem as FileSystem
    participant JSONParser as JSONParser
    participant LangChainDoc as LangChainDoc

    User->>GitIssuePreprocessor: load_documents(limit)
    GitIssuePreprocessor->>FileSystem: scan data/github/**/issue/*.json
    FileSystem-->>GitIssuePreprocessor: list of files
    loop per file
        GitIssuePreprocessor->>FileSystem: read file
        FileSystem-->>GitIssuePreprocessor: raw JSON
        GitIssuePreprocessor->>JSONParser: parse issue and comments
        JSONParser-->>GitIssuePreprocessor: structured data
        GitIssuePreprocessor->>GitIssuePreprocessor: extract metadata & build content
        alt content >= min_length
            GitIssuePreprocessor->>LangChainDoc: create Document(page_content, metadata)
            LangChainDoc-->>GitIssuePreprocessor: Document
        else content too short or invalid
            GitIssuePreprocessor-->>GitIssuePreprocessor: log and skip
        end
    end
    GitIssuePreprocessor-->>User: return List[Document]

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble through JSON files by night,
extracting labels, comments, and light.
Pages stitched from issues and thread,
Documents hop where data led.
Hooray—LLVM stories now softly read!

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name | Status | Explanation
Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check | ✅ Passed | The title clearly and specifically describes the main change: adding a new LLVM GitHub issue preprocessor class with appropriate functionality.
Docstring Coverage | ✅ Passed | Docstring coverage is 88.89%, which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 5

🧹 Nitpick comments (1)
pinecone_rag/preprocessor/git_issue_preprocessor.py (1)

88-99: Redundant `or` fallbacks in metadata dictionary

number is already initialised to -1 via info.get("number", -1) (line 81), and created_at / updated_at / closed_at are already 0.0 by default from get_timestamp_from_date (its default parameter is 0.0). The `or -1` / `or 0.0` guards in the metadata dict are therefore no-ops in normal flow and can mislead readers into thinking the values could still be falsy at this point.

♻️ Proposed cleanup
     meta: Dict[str, Any] = {
-        "author": author or "",
-        "title": title or "",
-        "number": number or -1,
-        "url": html_url or "",
-        "created_at": created_at or 0.0,
-        "updated_at": updated_at or 0.0,
-        "closed_at": closed_at or 0.0,
+        "author": author,
+        "title": title,
+        "number": number,
+        "url": html_url,
+        "created_at": created_at,
+        "updated_at": updated_at,
+        "closed_at": closed_at,
         "type": "issue",
-        "state": state or "",
-        "state_reason": state_reason or "",
+        "state": state,
+        "state_reason": state_reason,
     }
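The underlying pitfall generalizes: an `x or default` guard replaces every falsy value, not just missing ones, which is exactly why leaving redundant guards in place misleads readers. A quick illustration (the `closed_at` name here echoes the review; the values are made up):

```python
# `or` rewrites any falsy value, so a legitimate zero timestamp
# (e.g. the 0.0 default from a helper like get_timestamp_from_date)
# would be silently replaced:
closed_at = 0.0
guarded = closed_at or -1                          # becomes -1, not 0.0
explicit = closed_at if closed_at is not None else -1  # stays 0.0
print(guarded, explicit)
```

This is why the review prefers relying on the defaults set earlier rather than re-guarding with `or` at dict-construction time.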
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/git_issue_preprocessor.py` around lines 88 - 99,
The meta dict in git_issue_preprocessor.py contains redundant `or` fallbacks for
fields already given defaults (`number`, `created_at`, `updated_at`,
`closed_at`); update the construction of the `meta` dict in the function that
defines `meta` (the dict literal using keys "number", "created_at",
"updated_at", "closed_at") to remove the `or -1` and `or 0.0` fallbacks and rely
on the earlier defaults returned by `info.get("number", -1)` and
`get_timestamp_from_date(..., default=0.0)`, leaving other fallbacks (e.g.,
`author or ""`, `title or ""`) intact.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 54-59: The function _info_to_document currently lacks the doc_type
parameter but callers (_pr_json_to_document) pass a fifth argument "github-pr",
causing a TypeError; add a new parameter doc_type: str (accepting "github-issue"
or "github-pr") to _info_to_document's signature, use that value instead of the
hardcoded "type": "issue" in the returned Document payload, and update
call-sites (notably _pr_json_to_document and _issue_json_to_document) to pass
the appropriate doc_type string ("github-pr" and "github-issue" respectively) so
the Document.type reflects the incoming doc_type.
- Around line 44-47: In _build_content_parts remove the redundant check by
deleting the "or com_body == \"\"" portion in the loop that assigns com_body
(the for com in comments loop); keep the existing defensive fallback
(com.get("body") or "").strip() and the subsequent if not com_body: continue so
empty strings and None are already handled by not com_body and no additional
equality check is needed.
- Around line 139-146: Remove the dead variable assignment to path_str or
actually use it: either delete the two lines that create and lowercase path_str
(the unused path_str variable in the try block) to avoid dead code, or if you
intended path-based detection, use path_str in the subsequent logic that checks
JSON keys; update the try block around json_path.read_text/json.loads in the
function (the code that logs on exception via logger.debug and returns None) so
it no longer creates an unused path_str variable and keep behavior otherwise
identical.
- Around line 74-76: The code sets closed_at using
get_timestamp_from_date(info.get("closed_at", info.get("pushed_at", ""))) which
wrongly falls back to the PR's pushed_at (last push) and can mark open PRs as
closed; change the fallback to an empty string (or a proper close-related field
like "merged_at" if you intend to treat merges as closes) so use
info.get("closed_at", "") (or info.get("closed_at", info.get("merged_at", "")))
when calling get_timestamp_from_date to avoid using pushed_at; update the
expression around closed_at and get_timestamp_from_date accordingly.
- Around line 65-67: The current fallback uses info.get("url") which is the API
endpoint, not a browser link; change the logic around the html_url variable so
we do not store API endpoints as browser URLs: require html_url =
info.get("html_url") and if it's missing log/warn and return None (or otherwise
skip the record) instead of falling back to info.get("url"); alternatively, if
you must recover, translate the API endpoint to a browser URL by converting the
API prefix (e.g., replace "https://api.github.com/repos/" with
"https://github.com/" and remove the "api" segments) before returning—update
references around html_url and the code that reads info.get("url") in
git_issue_preprocessor.py accordingly.
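On the last point, translating a GitHub API endpoint into a browser link is mechanical if recovery is preferred over skipping. A hedged sketch (the function name is hypothetical; the path shapes reflect GitHub's public API URL layout, where PR endpoints use `/pulls/` but browser URLs use `/pull/`):

```python
def api_url_to_html_url(api_url: str) -> str:
    """Best-effort translation of a GitHub API issue/PR URL to a browser URL.

    e.g. https://api.github.com/repos/llvm/llvm-project/issues/123
      -> https://github.com/llvm/llvm-project/issues/123
    """
    prefix = "https://api.github.com/repos/"
    if api_url.startswith(prefix):
        # Drop the API prefix and fix the PR path segment.
        return "https://github.com/" + api_url[len(prefix):].replace("/pulls/", "/pull/")
    return api_url  # already a browser URL (or unrecognized): pass through
```

Skipping the record with a warning, as the review suggests, is the safer default; this conversion is only for pipelines that cannot afford to drop documents.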

---

Nitpick comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 88-99: The meta dict in git_issue_preprocessor.py contains
redundant `or` fallbacks for fields already given defaults (`number`,
`created_at`, `updated_at`, `closed_at`); update the construction of the `meta`
dict in the function that defines `meta` (the dict literal using keys "number",
"created_at", "updated_at", "closed_at") to remove the `or -1` and `or 0.0`
fallbacks and rely on the earlier defaults returned by `info.get("number", -1)`
and `get_timestamp_from_date(..., default=0.0)`, leaving other fallbacks (e.g.,
`author or ""`, `title or ""`) intact.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfc953 and fefa0bc.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_issue_preprocessor.py


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/git_issue_preprocessor.py`:
- Around line 45-57: The _extract_labels function can return None entries
because lb.get("name") may be None, which will cause ", ".join(...) to fail
later; update _extract_labels to coerce and filter label names: for each lb in
labels_raw compute name = lb.get("name") if isinstance(lb, dict) else str(lb),
then include only truthy/non-empty names (e.g., if name) in the returned list so
you never return None or empty strings.
- Around line 27-42: The _get_nested function can raise when an intermediate
becomes the default and is not a dict; update _get_nested so inside the for loop
you check if the current data is a dict before calling .get and immediately
return default if it is not (i.e., in function _get_nested, on each iteration
validate isinstance(data, dict) and if false return default; only then do data =
data.get(key, default)), preserving the existing signature and return behavior.
- Around line 85-141: The function _info_to_document currently ingests all
issues; add an LLVM-only guard after extracting html_url (and before building
content) to return None for non-LLVM repos. Specifically, check html_url (and
optionally info.get("repository_url") if present) for an LLVM indicator such as
containing "github.com/llvm" or "/llvm/" (case-insensitive) and return None when
the check fails so only LLVM issues are processed; place this check right after
html_url is computed and before content assembly in _info_to_document.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fefa0bc and 204397e.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_issue_preprocessor.py

