#86-add phabricator preprocessor for clang#90
#86-add phabricator preprocessor for clang#90henry0816191 wants to merge 3 commits intoCppDigest:mainfrom
Conversation
📝 WalkthroughWalkthroughAdds a Phabricator PR preprocessor that loads markdown files from a data directory, parses PR headers and body to extract structured metadata (titles, numbers, authors, states, timestamps, URLs), validates content length, and returns Document objects with page_content and metadata. Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessort.py`:
- Around line 1-10: Rename the module file from phabricator_preprocessort.py to
phabricator_preprocessor.py to fix the typo and update any imports that
reference the old name; inside the module update the module docstring to match
the actual default data directory (replace references to data/phabricator/**
with the concrete default used in the code) and change the default data_dir
string value (currently set to "data/github/Clang/phabricator") if it was
intended to be something else so the docstring and the default data_dir in the
module (where the default is defined) are consistent.
- Around line 76-86: The current logic uses the first successfully parsed
timestamp from _CREATED_AT_RE/_parse_created_at_to_timestamp to set created_at,
which can mistakenly pick a comment's date if the PR header fails to parse;
change created_at to use the earliest parsed timestamp (min(valid_timestamps))
and keep last_activity as max(valid_timestamps) so created_at reflects the PR
creation heuristic; ensure the fallback for empty valid_timestamps remains 0.0
and update any variable created_at/last_activity/updated_at assignments
accordingly.
- Around line 45-50: Change the error handling and timezone handling in the
date-parsing loop: instead of catching a bare Exception around
datetime.strptime(value, pattern).timestamp(), catch ValueError only (the
expected parse failure) and, after a successful datetime.strptime call, make the
datetime explicit UTC before calling .timestamp() (e.g., call
.replace(tzinfo=timezone.utc) or otherwise convert to UTC) so the epoch is
consistent across TZs; update the block surrounding datetime.strptime, patterns,
and the return value accordingly.
- Line 28: Remove the unused regex constant _COMMENT_RE from
phabricator_preprocessort.py by deleting its definition line (the assignment of
re.compile(r"^##\s*Comment\s+\d+", re.MULTILINE)); also check and remove the
now-unreferenced import of re if it becomes unused elsewhere in the file to
avoid unused-import warnings.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (1)
pinecone_rag/preprocessor/phabricator_preprocessor.py (1)
1-10:⚠️ Potential issue | 🟡 MinorAlign docstrings with the actual default data directory.
The module/class docstrings still reference
data/phabricator, but the default path isdata/github/Clang/phabricator. This mismatch can mislead users.✏️ Suggested docstring update
-Reads markdown files under data/phabricator/** and builds one Document per file. +Reads markdown files under data/github/Clang/phabricator/** and builds one Document per file. @@ -class PhabricatorPrPreprocessor: - """Load Phabricator markdown files from data/phabricator and produce Documents.""" +class PhabricatorPrPreprocessor: + """Load Phabricator markdown files from data/github/Clang/phabricator and produce Documents."""Also applies to: 119-131
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 1 - 10, Update the misleading docstrings that reference the old default data directory: replace occurrences of "data/phabricator" with the actual default "data/github/Clang/phabricator" in the module-level docstring at the top of the file and in the class/docstring blocks later in the file (the preprocessor class docstring around the other header comment region). Ensure all descriptive examples and "Expected markdown header" notes remain accurate but point to the correct default path.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 77-86: The comment above the timestamp extraction is misleading:
update the inline comment near _CREATED_AT_RE, _parse_created_at_to_timestamp,
and the all_timestamps/valid_timestamps logic to say that we collect all
"Created at:" timestamps and use the earliest (min) as created_at and the latest
(max) as last_activity (i.e., “earliest timestamp is treated as the PR’s
creation time” instead of “first match”); leave the code using
min(valid_timestamps) and max(valid_timestamps) unchanged.
---
Duplicate comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 1-10: Update the misleading docstrings that reference the old
default data directory: replace occurrences of "data/phabricator" with the
actual default "data/github/Clang/phabricator" in the module-level docstring at
the top of the file and in the class/docstring blocks later in the file (the
preprocessor class docstring around the other header comment region). Ensure all
descriptive examples and "Expected markdown header" notes remain accurate but
point to the correct default path.
There was a problem hiding this comment.
♻️ Duplicate comments (2)
pinecone_rag/preprocessor/phabricator_preprocessor.py (2)
80-82:⚠️ Potential issue | 🟡 MinorUpdate the timestamp comment to match min/max logic.
The comment still says “first match,” but the code uses
min(valid_timestamps).✏️ Suggested comment tweak
- # Collect all "Created at:" timestamps from PR header + all comments/reviews. - # The first match is the PR's own creation time; the maximum is the last activity. + # Collect all "Created at:" timestamps from PR header + all comments/reviews. + # The earliest timestamp is treated as PR creation; the maximum is last activity.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 80 - 82, The comment above the timestamp collection is inaccurate: it says "The first match is the PR's own creation time" but the code uses min(valid_timestamps) and max(valid_timestamps); update the comment near the all_timestamps/valid_timestamps logic in phabricator_preprocessor.py to state that the minimum timestamp is treated as the PR creation time and the maximum as the last activity (reflecting use of min(valid_timestamps) and max(valid_timestamps)).
4-9:⚠️ Potential issue | 🟡 MinorAlign module docstring with the actual default data directory.
The module docstring references
data/phabricator/**, but the defaultdata_dirisdata/github/Clang/phabricator, which can mislead users.✅ Suggested docstring fix
-Reads markdown files under data/phabricator/** and builds one Document per file. +Reads markdown files under data/github/Clang/phabricator/** and builds one Document per file.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 4 - 9, The module docstring currently states files are under data/phabricator/** but the code's default data_dir is data/github/Clang/phabricator; update the top-of-file docstring in pinecone_rag/preprocessor/phabricator_preprocessor.py to reference the actual default path (or describe both the general pattern and the concrete default) so it matches the default data_dir used by the preprocessor (e.g., mention data/github/Clang/phabricator) and keep the expected markdown header description unchanged.
🧹 Nitpick comments (1)
pinecone_rag/preprocessor/phabricator_preprocessor.py (1)
37-53: Prefer the shared date parsing utility to avoid duplicate logic.Consider using
pinecone_rag.preprocessor.utility.get_timestamp_from_datehere instead of custom parsing so date handling stays consistent across preprocessors. Based on learnings: “use the existing get_timestamp_from_date utility function from pinecone_rag.preprocessor.utility instead of implementing custom date-to-timestamp conversion logic.”🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pinecone_rag/preprocessor/phabricator_preprocessor.py` around lines 37 - 53, The _parse_created_at_to_timestamp function duplicates date parsing logic; replace its custom patterns and loop with a call to the shared utility get_timestamp_from_date from pinecone_rag.preprocessor.utility: import get_timestamp_from_date, call it with the incoming value, and return the float timestamp from that function (ensuring you preserve the previous behavior of returning 0.0 for empty or unparseable values); remove the manual datetime.strptime loop and related debug logging so all date parsing is consolidated in get_timestamp_from_date.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 80-82: The comment above the timestamp collection is inaccurate:
it says "The first match is the PR's own creation time" but the code uses
min(valid_timestamps) and max(valid_timestamps); update the comment near the
all_timestamps/valid_timestamps logic in phabricator_preprocessor.py to state
that the minimum timestamp is treated as the PR creation time and the maximum as
the last activity (reflecting use of min(valid_timestamps) and
max(valid_timestamps)).
- Around line 4-9: The module docstring currently states files are under
data/phabricator/** but the code's default data_dir is
data/github/Clang/phabricator; update the top-of-file docstring in
pinecone_rag/preprocessor/phabricator_preprocessor.py to reference the actual
default path (or describe both the general pattern and the concrete default) so
it matches the default data_dir used by the preprocessor (e.g., mention
data/github/Clang/phabricator) and keep the expected markdown header description
unchanged.
---
Nitpick comments:
In `@pinecone_rag/preprocessor/phabricator_preprocessor.py`:
- Around line 37-53: The _parse_created_at_to_timestamp function duplicates date
parsing logic; replace its custom patterns and loop with a call to the shared
utility get_timestamp_from_date from pinecone_rag.preprocessor.utility: import
get_timestamp_from_date, call it with the incoming value, and return the float
timestamp from that function (ensuring you preserve the previous behavior of
returning 0.0 for empty or unparseable values); remove the manual
datetime.strptime loop and related debug logging so all date parsing is
consolidated in get_timestamp_from_date.
Summary by CodeRabbit