Skip to content
This repository was archived by the owner on Apr 23, 2026. It is now read-only.

#86-add phabricator preprocessor for clang#90

Open
henry0816191 wants to merge 3 commits intoCppDigest:mainfrom
henry0816191:dev-86
Open

#86-add phabricator preprocessor for clang#90
henry0816191 wants to merge 3 commits intoCppDigest:mainfrom
henry0816191:dev-86

Conversation

@henry0816191
Copy link
Copy Markdown
Collaborator

@henry0816191 henry0816191 commented Feb 23, 2026

Summary by CodeRabbit

  • New Features
    • Added a Phabricator PR preprocessor that ingests markdown-based PR data, extracts and normalizes PR metadata (title, author, state, timestamps, URL), and converts each PR into searchable documents. Handles recursive loading from the data directory, enforces minimum content, and logs progress for downstream indexing.

@henry0816191 henry0816191 self-assigned this Feb 23, 2026
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented Feb 23, 2026

📝 Walkthrough

Walkthrough

Adds a Phabricator PR preprocessor that loads markdown files from a data directory, parses PR headers and body to extract structured metadata (titles, numbers, authors, states, timestamps, URLs), validates content length, and returns Document objects with page_content and metadata.

Changes

Cohort / File(s) Summary
Phabricator PR Preprocessor
pinecone_rag/preprocessor/phabricator_preprocessor.py
New module adding PhabricatorPrPreprocessor and helpers. Recursively discovers *.md in the data dir, parses headers via regex (_HEADER_RE, _USERNAME_RE, _CREATED_AT_RE, _URL_RE), validates content length (_is_valid_content), converts dates to UTC timestamps (_parse_created_at_to_timestamp), extracts metadata (_extract_metadata), and loads Documents (_load_pr_document, load_documents). Logs progress and handles missing dirs. Pay attention to date-format handling, state normalization, and derived URL logic.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰
I nibble through markdown rows,
Find titles, timestamps, author glows,
I stitch each PR into a doc,
Hop, parse, and gently tuck in clock,
Tiny rabbit, big metadata shows. 🥕

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically describes the main change: adding a Phabricator preprocessor for Clang PR processing, which matches the core functionality introduced in the changeset.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessort.py`:
- Around line 1-10: Rename the module file from phabricator_preprocessort.py to
phabricator_preprocessor.py to fix the typo and update any imports that
reference the old name; inside the module update the module docstring to match
the actual default data directory (replace references to data/phabricator/**
with the concrete default used in the code) and change the default data_dir
string value (currently set to "data/github/Clang/phabricator") if it was
intended to be something else so the docstring and the default data_dir in the
module (where the default is defined) are consistent.
- Around line 76-86: The current logic uses the first successfully parsed
timestamp from _CREATED_AT_RE/_parse_created_at_to_timestamp to set created_at,
which can mistakenly pick a comment's date if the PR header fails to parse;
change created_at to use the earliest parsed timestamp (min(valid_timestamps))
and keep last_activity as max(valid_timestamps) so created_at reflects the PR
creation heuristic; ensure the fallback for empty valid_timestamps remains 0.0
and update any variable created_at/last_activity/updated_at assignments
accordingly.
- Around line 45-50: Change the error handling and timezone handling in the
date-parsing loop: instead of catching a bare Exception around
datetime.strptime(value, pattern).timestamp(), catch ValueError only (the
expected parse failure) and, after a successful datetime.strptime call, make the
datetime explicit UTC before calling .timestamp() (e.g., call
.replace(tzinfo=timezone.utc) or otherwise convert to UTC) so the epoch is
consistent across TZs; update the block surrounding datetime.strptime, patterns,
and the return value accordingly.
- Line 28: Remove the unused regex constant _COMMENT_RE from
phabricator_preprocessort.py by deleting its definition line (the assignment of
re.compile(r"^##\s*Comment\s+\d+", re.MULTILINE)); also check and remove the
now-unreferenced import of re if it becomes unused elsewhere in the file to
avoid unused-import warnings.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 7dfc953 and 636c9f4.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/phabricator_preprocessort.py

Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py
Comment thread pinecone_rag/preprocessor/phabricator_preprocessort.py Outdated
Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py
Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (1)
pinecone_rag/preprocessor/phabricator_preprocessor.py (1)

1-10: ⚠️ Potential issue | 🟡 Minor

Align docstrings with the actual default data directory.

The module/class docstrings still reference data/phabricator, but the default path is data/github/Clang/phabricator. This mismatch can mislead users.

✏️ Suggested docstring update
-Reads markdown files under data/phabricator/** and builds one Document per file.
+Reads markdown files under data/github/Clang/phabricator/** and builds one Document per file.
@@
-class PhabricatorPrPreprocessor:
-    """Load Phabricator markdown files from data/phabricator and produce Documents."""
+class PhabricatorPrPreprocessor:
+    """Load Phabricator markdown files from data/github/Clang/phabricator and produce Documents."""

Also applies to: 119-131

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 1 - 10,
Update the misleading docstrings that reference the old default data directory:
replace occurrences of "data/phabricator" with the actual default
"data/github/Clang/phabricator" in the module-level docstring at the top of the
file and in the class/docstring blocks later in the file (the preprocessor class
docstring around the other header comment region). Ensure all descriptive
examples and "Expected markdown header" notes remain accurate but point to the
correct default path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 77-86: The comment above the timestamp extraction is misleading:
update the inline comment near _CREATED_AT_RE, _parse_created_at_to_timestamp,
and the all_timestamps/valid_timestamps logic to say that we collect all
"Created at:" timestamps and use the earliest (min) as created_at and the latest
(max) as last_activity (i.e., “earliest timestamp is treated as the PR’s
creation time” instead of “first match”); leave the code using
min(valid_timestamps) and max(valid_timestamps) unchanged.

---

Duplicate comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 1-10: Update the misleading docstrings that reference the old
default data directory: replace occurrences of "data/phabricator" with the
actual default "data/github/Clang/phabricator" in the module-level docstring at
the top of the file and in the class/docstring blocks later in the file (the
preprocessor class docstring around the other header comment region). Ensure all
descriptive examples and "Expected markdown header" notes remain accurate but
point to the correct default path.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 636c9f4 and 443a7a4.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/phabricator_preprocessor.py

Comment thread pinecone_rag/preprocessor/phabricator_preprocessor.py
Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

♻️ Duplicate comments (2)
pinecone_rag/preprocessor/phabricator_preprocessor.py (2)

80-82: ⚠️ Potential issue | 🟡 Minor

Update the timestamp comment to match min/max logic.

The comment still says “first match,” but the code uses min(valid_timestamps).

✏️ Suggested comment tweak
-    # Collect all "Created at:" timestamps from PR header + all comments/reviews.
-    # The first match is the PR's own creation time; the maximum is the last activity.
+    # Collect all "Created at:" timestamps from PR header + all comments/reviews.
+    # The earliest timestamp is treated as PR creation; the maximum is last activity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 80 - 82,
The comment above the timestamp collection is inaccurate: it says "The first
match is the PR's own creation time" but the code uses min(valid_timestamps) and
max(valid_timestamps); update the comment near the
all_timestamps/valid_timestamps logic in phabricator_preprocessor.py to state
that the minimum timestamp is treated as the PR creation time and the maximum as
the last activity (reflecting use of min(valid_timestamps) and
max(valid_timestamps)).

4-9: ⚠️ Potential issue | 🟡 Minor

Align module docstring with the actual default data directory.

The module docstring references data/phabricator/**, but the default data_dir is data/github/Clang/phabricator, which can mislead users.

✅ Suggested docstring fix
-Reads markdown files under data/phabricator/** and builds one Document per file.
+Reads markdown files under data/github/Clang/phabricator/** and builds one Document per file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 4 - 9,
The module docstring currently states files are under data/phabricator/** but
the code's default data_dir is data/github/Clang/phabricator; update the
top-of-file docstring in pinecone_rag/preprocessor/phabricator_preprocessor.py
to reference the actual default path (or describe both the general pattern and
the concrete default) so it matches the default data_dir used by the
preprocessor (e.g., mention data/github/Clang/phabricator) and keep the expected
markdown header description unchanged.
🧹 Nitpick comments (1)
pinecone_rag/preprocessor/phabricator_preprocessor.py (1)

37-53: Prefer the shared date parsing utility to avoid duplicate logic.

Consider using pinecone_rag.preprocessor.utility.get_timestamp_from_date here instead of custom parsing so date handling stays consistent across preprocessors. Based on learnings: “use the existing get_timestamp_from_date utility function from pinecone_rag.preprocessor.utility instead of implementing custom date-to-timestamp conversion logic.”

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 37 - 53,
The _parse_created_at_to_timestamp function duplicates date parsing logic;
replace its custom patterns and loop with a call to the shared utility
get_timestamp_from_date from pinecone_rag.preprocessor.utility: import
get_timestamp_from_date, call it with the incoming value, and return the float
timestamp from that function (ensuring you preserve the previous behavior of
returning 0.0 for empty or unparseable values); remove the manual
datetime.strptime loop and related debug logging so all date parsing is
consolidated in get_timestamp_from_date.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 80-82: The comment above the timestamp collection is inaccurate:
it says "The first match is the PR's own creation time" but the code uses
min(valid_timestamps) and max(valid_timestamps); update the comment near the
all_timestamps/valid_timestamps logic in phabricator_preprocessor.py to state
that the minimum timestamp is treated as the PR creation time and the maximum as
the last activity (reflecting use of min(valid_timestamps) and
max(valid_timestamps)).
- Around line 4-9: The module docstring currently states files are under
data/phabricator/** but the code's default data_dir is
data/github/Clang/phabricator; update the top-of-file docstring in
pinecone_rag/preprocessor/phabricator_preprocessor.py to reference the actual
default path (or describe both the general pattern and the concrete default) so
it matches the default data_dir used by the preprocessor (e.g., mention
data/github/Clang/phabricator) and keep the expected markdown header description
unchanged.

---

Nitpick comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 37-53: The _parse_created_at_to_timestamp function duplicates date
parsing logic; replace its custom patterns and loop with a call to the shared
utility get_timestamp_from_date from pinecone_rag.preprocessor.utility: import
get_timestamp_from_date, call it with the incoming value, and return the float
timestamp from that function (ensuring you preserve the previous behavior of
returning 0.0 for empty or unparseable values); remove the manual
datetime.strptime loop and related debug logging so all date parsing is
consolidated in get_timestamp_from_date.

ℹ️ Review info

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 443a7a4 and 8fd3c6b.

📒 Files selected for processing (1)
  • pinecone_rag/preprocessor/phabricator_preprocessor.py

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant