This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#88-add llvm git pr preprocessor #93

Open

jonathanMLDev wants to merge 2 commits into CppDigest:main from jonathanMLDev:dev-88

Conversation

@jonathanMLDev
Collaborator

@jonathanMLDev jonathanMLDev commented Feb 24, 2026

Summary by CodeRabbit

Release Notes

  • New Features
    • Added GitHub pull request ingestion capability to automatically load PR data from JSON files and convert them into indexed documents with extracted metadata including title, author, timestamps, and status.

@jonathanMLDev jonathanMLDev self-assigned this Feb 24, 2026
@coderabbitai

coderabbitai Bot commented Feb 24, 2026

Warning

Rate limit exceeded

@jonathanMLDev has exceeded the limit for the number of commits that can be reviewed per hour. Please wait 7 minutes and 41 seconds before requesting another review.

⌛ How to resolve this issue?

Once the wait time has elapsed, trigger a review by posting the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between 5079edf and 2b99fd1.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_pr_preprocessor.py
📝 Walkthrough

Walkthrough

Introduces GitPrPreprocessor, a new Python module that loads GitHub PR JSON files from the filesystem, converts them to markdown content via existing utilities, validates content length, constructs metadata with timestamps, and returns a list of Document objects.

Changes

Cohort / File(s) Summary
GitHub PR Preprocessor
pinecone_rag/preprocessor/git_pr_preprocessor.py
New module implementing GitPrPreprocessor class for loading PR JSON files from data_dir/pr. Includes load_documents() method with optional limit parameter, internal helper _get_nested() for safe dict access, and _load_pr_document() for single-file processing with validation and metadata enrichment. Integrates with existing convert_pr_to_markdown() utility and timestamp normalization.
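The walkthrough above can be sketched as a minimal, self-contained version of the loader. This is a hypothetical illustration, not the PR's actual code: `Document` and `convert_pr_to_markdown` are stand-ins for the project's real utilities, and all signatures are assumed.

```python
# Hypothetical sketch of the GitPrPreprocessor flow described above.
# Document and convert_pr_to_markdown are stand-ins for the project's
# real utilities; their exact signatures are assumptions.
import json
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, List, Optional


@dataclass
class Document:  # stand-in for the project's real Document type
    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


def convert_pr_to_markdown(data: Dict[str, Any]) -> str:  # stand-in
    pr_info = data.get("pr_info") or {}
    return f"# {pr_info.get('title', '')}\n\n{pr_info.get('body', '')}"


class GitPrPreprocessor:
    def __init__(self, data_dir: str, min_content_length: int = 20) -> None:
        # PR JSON files are expected under data_dir/pr
        self.data_dir = Path(data_dir) / "pr"
        self.min_content_length = min_content_length

    def load_documents(self, limit: Optional[int] = None) -> List[Document]:
        documents: List[Document] = []
        for json_path in sorted(self.data_dir.rglob("*.json")):
            doc = self._load_pr_document(json_path)
            if doc is not None:
                documents.append(doc)
                # Cap the output documents, not the files scanned
                if limit is not None and len(documents) >= limit:
                    break
        return documents

    def _load_pr_document(self, json_path: Path) -> Optional[Document]:
        data = json.loads(json_path.read_text())
        if not isinstance(data, dict):
            return None  # skip non-object JSON roots
        pr_info = data.get("pr_info")
        if not isinstance(pr_info, dict):
            return None  # skip files with missing or malformed pr_info
        content = convert_pr_to_markdown(data)
        if len(content) < self.min_content_length:
            return None  # skip PRs with too little content
        return Document(content=content, metadata={
            "title": pr_info.get("title"),
            "author": pr_info.get("author"),
        })
```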

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 Hop, hop! New PR loaders appear,
JSON files sorted with care and cheer,
Converting PRs to markdown delight,
Documents crafted, metadata bright!
The warren grows richer, pull by pull—
Our data pipelines now twice as full!

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

  • Title check ⚠️ Warning — The title mentions 'llvm git pr preprocessor', but the implementation is a generic GitPrPreprocessor for loading GitHub PR JSON files, not one specific to LLVM. Resolution: revise the title to reflect the generic nature of the implementation, such as 'Add GitPrPreprocessor for loading GitHub PR documents', removing the LLVM-specific reference.
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 75.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (1 passed)

  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai Bot left a comment


Actionable comments posted: 3

🧹 Nitpick comments (2)
pinecone_rag/preprocessor/git_pr_preprocessor.py (2)

49-55: Redundant or [] before isinstance guard.

data.get("comments") or [] already returns [] for falsy values (None, "", 0, []). The subsequent isinstance(comments, list) check then re-assigns to [] for any truthy non-list value. The or [] is unnecessary — the isinstance guard alone is sufficient.

♻️ Proposed simplification
-    comments = data.get("comments") or []
-    if not isinstance(comments, list):
-        comments = []
-
-    reviews = data.get("reviews") or []
-    if not isinstance(reviews, list):
-        reviews = []
+    raw_comments = data.get("comments")
+    comments = raw_comments if isinstance(raw_comments, list) else []
+
+    raw_reviews = data.get("reviews")
+    reviews = raw_reviews if isinstance(raw_reviews, list) else []
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/git_pr_preprocessor.py` around lines 49 - 55, The
code redundantly uses `or []` before an `isinstance` guard for `comments` and
`reviews`; change each block to fetch the raw value (e.g., comments =
data.get("comments")) and then only use the type check to normalize non-list
values (if not isinstance(comments, list): comments = []), and do the same for
`reviews`, removing the unnecessary `or []` parts.

100-108: limit caps files scanned, not documents returned.

The limit is sliced on json_paths (line 102) before loading, so when files are skipped due to short content or missing pr_info, the caller receives fewer than limit documents with no indication. If the intent is to guarantee at most limit documents out, the limit should be applied to the accumulated documents list instead (or both behaviors documented clearly).

♻️ Proposed fix to apply limit to output documents
         json_paths = sorted(self.data_dir.rglob("*.json"))
-        if limit is not None:
-            json_paths = json_paths[:limit]
 
         documents: List[Document] = []
         for json_path in json_paths:
             doc = _load_pr_document(json_path, self.min_content_length)
             if doc is not None:
                 documents.append(doc)
+                if limit is not None and len(documents) >= limit:
+                    break
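The behavioral difference described above can be shown with a small self-contained illustration. `load` here is a hypothetical stand-in for `_load_pr_document` that returns `None` for invalid files; slicing the input paths can under-deliver, while capping the accumulated output returns the full `limit` whenever enough valid files exist.

```python
# Hypothetical stand-in for _load_pr_document: None means "skipped file".
def load(path):
    return None if path.startswith("bad") else f"doc:{path}"

paths = ["bad1.json", "ok1.json", "bad2.json", "ok2.json", "ok3.json"]

# Slicing inputs first: limit=2 scans only the first two files,
# so a skipped file silently shrinks the result.
sliced = [d for p in paths[:2] if (d := load(p)) is not None]

# Capping outputs: keep scanning until 2 valid documents are collected.
capped = []
for p in paths:
    if (d := load(p)) is not None:
        capped.append(d)
        if len(capped) >= 2:
            break

assert sliced == ["doc:ok1.json"]                  # only 1 document
assert capped == ["doc:ok1.json", "doc:ok2.json"]  # full limit reached
```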
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/git_pr_preprocessor.py` around lines 100 - 108, The
current code slices json_paths before loading, so files skipped by
_load_pr_document reduce the returned documents count; change the loop to
collect documents until we reach limit (if provided) instead of slicing
json_paths: iterate over the full sorted json_paths, call
_load_pr_document(json_path, self.min_content_length), append non-None results
to documents, and break once len(documents) == limit; keep limit None behavior
as "no cap". Ensure you reference json_paths, _load_pr_document, documents, and
limit when making the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/git_pr_preprocessor.py`:
- Around line 38-47: The code in _load_pr_document assumes json.loads returns a
dict and calls data.get("pr_info"), which raises AttributeError for JSON roots
that are lists/strings; modify _load_pr_document to validate that `data` is a
dict before accessing keys (e.g., check isinstance(data, dict) after
json.loads), log and return None for non-dict JSON roots (use logger.debug("Skip
%s: non-object JSON root", json_path.name) or include exc/context), and keep the
existing exception handling around json.loads/json_path.read_text to avoid
breaking load_documents when non-object JSON files are encountered.
- Around line 1-10: The module docstring's glob pattern is inconsistent with the
implementation: update the top-level docstring string (module header) to use the
same "pr/*.json" pattern used in GitPrPreprocessor.__init__ and the
load_documents docstring; specifically change `data/github/**/prs/*.json` to
`data/github/**/pr/*.json` so the documentation matches the actual glob used by
GitPrPreprocessor.__init__ and load_documents.
- Around line 78-80: The metadata currently sets "closed_at" by always calling
get_timestamp_from_date(pr_info.get("closed_at")), which returns 0.0 for None
and makes open PRs indistinguishable from the 1970 epoch; change the logic
around the "closed_at" key so you first check pr_info.get("closed_at") and only
call get_timestamp_from_date when it is not None (or set "closed_at" to
None/omit the key for open PRs) so open PRs don't get a 0.0 timestamp; update
the code that builds the PR metadata (reference: get_timestamp_from_date and the
"closed_at" entry using pr_info.get("closed_at")) accordingly.


ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfc953 and 5079edf.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/git_pr_preprocessor.py

Comment thread pinecone_rag/preprocessor/git_pr_preprocessor.py
Comment thread pinecone_rag/preprocessor/git_pr_preprocessor.py
Comment thread pinecone_rag/preprocessor/git_pr_preprocessor.py


1 participant