#86-add phabricator preprocessor for clang by henry0816191 · Pull Request #90 · CppDigest/cppa-brain-backend

henry0816191 · 2026-02-23T20:12:26Z

Summary by CodeRabbit

New Features
- Added a Phabricator PR preprocessor that ingests markdown-based PR data, extracts and normalizes PR metadata (title, author, state, timestamps, URL), and converts each PR into searchable documents. Handles recursive loading from the data directory, enforces minimum content, and logs progress for downstream indexing.

coderabbitai · 2026-02-23T20:12:44Z

📝 Walkthrough

Walkthrough

Adds a Phabricator PR preprocessor that loads markdown files from a data directory, parses PR headers and body to extract structured metadata (titles, numbers, authors, states, timestamps, URLs), validates content length, and returns Document objects with page_content and metadata.

Changes

Cohort / File(s)	Summary
Phabricator PR Preprocessor `pinecone_rag/preprocessor/phabricator_preprocessor.py`	New module adding `PhabricatorPrPreprocessor` and helpers. Recursively discovers `*.md` in the data dir, parses headers via regex (`_HEADER_RE`, `_USERNAME_RE`, `_CREATED_AT_RE`, `_URL_RE`), validates content length (`_is_valid_content`), converts dates to UTC timestamps (`_parse_created_at_to_timestamp`), extracts metadata (`_extract_metadata`), and loads Documents (`_load_pr_document`, `load_documents`). Logs progress and handles missing dirs. Pay attention to date-format handling, state normalization, and derived URL logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I nibble through markdown rows,
Find titles, timestamps, author glows,
I stitch each PR into a doc,
Hop, parse, and gently tuck in clock,
Tiny rabbit, big metadata shows. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and specifically describes the main change: adding a Phabricator preprocessor for Clang PR processing, which matches the core functionality introduced in the changeset.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings (stacked PR)
📝 Generate docstrings (commit on current branch)

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessort.py`:
- Around line 1-10: Rename the module file from phabricator_preprocessort.py to
phabricator_preprocessor.py to fix the typo and update any imports that
reference the old name; inside the module update the module docstring to match
the actual default data directory (replace references to data/phabricator/**
with the concrete default used in the code) and change the default data_dir
string value (currently set to "data/github/Clang/phabricator") if it was
intended to be something else so the docstring and the default data_dir in the
module (where the default is defined) are consistent.
- Around line 76-86: The current logic uses the first successfully parsed
timestamp from _CREATED_AT_RE/_parse_created_at_to_timestamp to set created_at,
which can mistakenly pick a comment's date if the PR header fails to parse;
change created_at to use the earliest parsed timestamp (min(valid_timestamps))
and keep last_activity as max(valid_timestamps) so created_at reflects the PR
creation heuristic; ensure the fallback for empty valid_timestamps remains 0.0
and update any variable created_at/last_activity/updated_at assignments
accordingly.
- Around line 45-50: Change the error handling and timezone handling in the
date-parsing loop: instead of catching a bare Exception around
datetime.strptime(value, pattern).timestamp(), catch ValueError only (the
expected parse failure) and, after a successful datetime.strptime call, make the
datetime explicit UTC before calling .timestamp() (e.g., call
.replace(tzinfo=timezone.utc) or otherwise convert to UTC) so the epoch is
consistent across TZs; update the block surrounding datetime.strptime, patterns,
and the return value accordingly.
- Line 28: Remove the unused regex constant _COMMENT_RE from
phabricator_preprocessort.py by deleting its definition line (the assignment of
re.compile(r"^##\s*Comment\s+\d+", re.MULTILINE)); also check and remove the
now-unreferenced import of re if it becomes unused elsewhere in the file to
avoid unused-import warnings.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfc953 and 636c9f4.

📒 Files selected for processing (1)

pinecone_rag/preprocessor/phabricator_preprocessort.py

coderabbitai

Actionable comments posted: 1

♻️ Duplicate comments (1)

pinecone_rag/preprocessor/phabricator_preprocessor.py (1)

1-10: ⚠️ Potential issue | 🟡 Minor

Align docstrings with the actual default data directory.

The module/class docstrings still reference data/phabricator, but the default path is data/github/Clang/phabricator. This mismatch can mislead users.

✏️ Suggested docstring update

-Reads markdown files under data/phabricator/** and builds one Document per file.
+Reads markdown files under data/github/Clang/phabricator/** and builds one Document per file.
@@
-class PhabricatorPrPreprocessor:
-    """Load Phabricator markdown files from data/phabricator and produce Documents."""
+class PhabricatorPrPreprocessor:
+    """Load Phabricator markdown files from data/github/Clang/phabricator and produce Documents."""

Also applies to: 119-131

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 1 - 10,
Update the misleading docstrings that reference the old default data directory:
replace occurrences of "data/phabricator" with the actual default
"data/github/Clang/phabricator" in the module-level docstring at the top of the
file and in the class/docstring blocks later in the file (the preprocessor class
docstring around the other header comment region). Ensure all descriptive
examples and "Expected markdown header" notes remain accurate but point to the
correct default path.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 77-86: The comment above the timestamp extraction is misleading:
update the inline comment near _CREATED_AT_RE, _parse_created_at_to_timestamp,
and the all_timestamps/valid_timestamps logic to say that we collect all
"Created at:" timestamps and use the earliest (min) as created_at and the latest
(max) as last_activity (i.e., “earliest timestamp is treated as the PR’s
creation time” instead of “first match”); leave the code using
min(valid_timestamps) and max(valid_timestamps) unchanged.

---

Duplicate comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 1-10: Update the misleading docstrings that reference the old
default data directory: replace occurrences of "data/phabricator" with the
actual default "data/github/Clang/phabricator" in the module-level docstring at
the top of the file and in the class/docstring blocks later in the file (the
preprocessor class docstring around the other header comment region). Ensure all
descriptive examples and "Expected markdown header" notes remain accurate but
point to the correct default path.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 636c9f4 and 443a7a4.

📒 Files selected for processing (1)

pinecone_rag/preprocessor/phabricator_preprocessor.py

coderabbitai

♻️ Duplicate comments (2)

pinecone_rag/preprocessor/phabricator_preprocessor.py (2)

80-82: ⚠️ Potential issue | 🟡 Minor

Update the timestamp comment to match min/max logic.

The comment still says “first match,” but the code uses min(valid_timestamps).

✏️ Suggested comment tweak

-    # Collect all "Created at:" timestamps from PR header + all comments/reviews.
-    # The first match is the PR's own creation time; the maximum is the last activity.
+    # Collect all "Created at:" timestamps from PR header + all comments/reviews.
+    # The earliest timestamp is treated as PR creation; the maximum is last activity.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 80 - 82,
The comment above the timestamp collection is inaccurate: it says "The first
match is the PR's own creation time" but the code uses min(valid_timestamps) and
max(valid_timestamps); update the comment near the
all_timestamps/valid_timestamps logic in phabricator_preprocessor.py to state
that the minimum timestamp is treated as the PR creation time and the maximum as
the last activity (reflecting use of min(valid_timestamps) and
max(valid_timestamps)).

4-9: ⚠️ Potential issue | 🟡 Minor

Align module docstring with the actual default data directory.

The module docstring references data/phabricator/**, but the default data_dir is data/github/Clang/phabricator, which can mislead users.

✅ Suggested docstring fix

-Reads markdown files under data/phabricator/** and builds one Document per file.
+Reads markdown files under data/github/Clang/phabricator/** and builds one Document per file.

🤖 Prompt for AI Agents

Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 4 - 9,
The module docstring currently states files are under data/phabricator/** but
the code's default data_dir is data/github/Clang/phabricator; update the
top-of-file docstring in pinecone_rag/preprocessor/phabricator_preprocessor.py
to reference the actual default path (or describe both the general pattern and
the concrete default) so it matches the default data_dir used by the
preprocessor (e.g., mention data/github/Clang/phabricator) and keep the expected
markdown header description unchanged.

🧹 Nitpick comments (1)

pinecone_rag/preprocessor/phabricator_preprocessor.py (1)
37-53: Prefer the shared date parsing utility to avoid duplicate logic.

Consider using pinecone_rag.preprocessor.utility.get_timestamp_from_date here instead of custom parsing so date handling stays consistent across preprocessors. Based on learnings: “use the existing get_timestamp_from_date utility function from pinecone_rag.preprocessor.utility instead of implementing custom date-to-timestamp conversion logic.”
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 37 - 53,
The _parse_created_at_to_timestamp function duplicates date parsing logic;
replace its custom patterns and loop with a call to the shared utility
get_timestamp_from_date from pinecone_rag.preprocessor.utility: import
get_timestamp_from_date, call it with the incoming value, and return the float
timestamp from that function (ensuring you preserve the previous behavior of
returning 0.0 for empty or unparseable values); remove the manual
datetime.strptime loop and related debug logging so all date parsing is
consolidated in get_timestamp_from_date.

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 80-82: The comment above the timestamp collection is inaccurate:
it says "The first match is the PR's own creation time" but the code uses
min(valid_timestamps) and max(valid_timestamps); update the comment near the
all_timestamps/valid_timestamps logic in phabricator_preprocessor.py to state
that the minimum timestamp is treated as the PR creation time and the maximum as
the last activity (reflecting use of min(valid_timestamps) and
max(valid_timestamps)).
- Around line 4-9: The module docstring currently states files are under
data/phabricator/** but the code's default data_dir is
data/github/Clang/phabricator; update the top-of-file docstring in
pinecone_rag/preprocessor/phabricator_preprocessor.py to reference the actual
default path (or describe both the general pattern and the concrete default) so
it matches the default data_dir used by the preprocessor (e.g., mention
data/github/Clang/phabricator) and keep the expected markdown header description
unchanged.

---

Nitpick comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 37-53: The _parse_created_at_to_timestamp function duplicates date
parsing logic; replace its custom patterns and loop with a call to the shared
utility get_timestamp_from_date from pinecone_rag.preprocessor.utility: import
get_timestamp_from_date, call it with the incoming value, and return the float
timestamp from that function (ensuring you preserve the previous behavior of
returning 0.0 for empty or unparseable values); remove the manual
datetime.strptime loop and related debug logging so all date parsing is
consolidated in get_timestamp_from_date.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 443a7a4 and 8fd3c6b.

📒 Files selected for processing (1)

pinecone_rag/preprocessor/phabricator_preprocessor.py

CppDigest#86-add phabricator preprocessor for clang

636c9f4

henry0816191 self-assigned this Feb 23, 2026

coderabbitai Bot reviewed Feb 23, 2026

View reviewed changes

Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py

Comment thread pinecone_rag/preprocessor/phabricator_preprocessort.py Outdated

Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py

Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py

CppDigest#86-rename and fix some errors

443a7a4

coderabbitai Bot reviewed Feb 24, 2026

View reviewed changes

Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py

henry0816191 requested review from leostar0412 February 24, 2026 17:35

CppDigest#86-add docstring

8fd3c6b

coderabbitai Bot reviewed Feb 24, 2026

View reviewed changes

jonathanMLDev mentioned this pull request Feb 26, 2026

Preprocessor for scraped LLVM Phabricator markdown → documents for Pinecone #86

Open

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

#86-add phabricator preprocessor for clang#90

#86-add phabricator preprocessor for clang#90
henry0816191 wants to merge 3 commits intoCppDigest:mainfrom
henry0816191:dev-86

henry0816191 commented Feb 23, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented Feb 23, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

henry0816191 commented Feb 23, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented Feb 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

henry0816191 commented Feb 23, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented Feb 23, 2026 •

edited

Loading