feat: add arxiv tool and claim tool to support an article fact check scenario #341
seancoding-day wants to merge 1 commit into MigoXLab:dev
Conversation
Summary of Changes: Hello @seancoding-day, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the agent's capabilities by introducing two specialized tools: ArxivSearch and ClaimsExtractor.
Code Review
This pull request introduces two new tools, ArxivSearch and ClaimsExtractor, to support fact-checking scenarios. The ArxivSearch tool provides a comprehensive interface to the arXiv API, while ClaimsExtractor leverages an LLM to extract verifiable claims from text. The overall implementation is solid, with good documentation and clear separation of concerns. However, I've identified a few areas for improvement. In ArxivSearch, there's a high-severity thread-safety issue in the rate-limiting logic and some hardcoded test data that should be removed. In ClaimsExtractor, an unused import should be cleaned up, and the claim deduplication logic could be made more robust. Addressing these points will significantly improve the reliability and maintainability of these new tools.
```python
def _apply_rate_limiting(cls):
    """
    Apply rate limiting to respect arXiv guidelines.

    arXiv recommends at least 3 seconds between requests.
    This method enforces the configured rate_limit_delay.
    """
    current_time = time.time()
    time_since_last_request = current_time - cls._last_request_time

    if time_since_last_request < cls.config.rate_limit_delay:
        sleep_time = cls.config.rate_limit_delay - time_since_last_request
        log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
        time.sleep(sleep_time)

    cls._last_request_time = time.time()
```
The use of a class attribute _last_request_time to manage rate limiting is not thread-safe. If multiple threads call execute concurrently, they will share and modify this attribute, leading to a race condition. This could result in multiple requests being sent in quick succession, violating arXiv's rate-limiting guidelines.
To ensure thread safety, access to _last_request_time should be protected with a lock. You'll need to add import threading at the top of the file and _rate_limit_lock = threading.Lock() as a class attribute to ArxivSearch.
Suggested change:

```diff
 def _apply_rate_limiting(cls):
     """
     Apply rate limiting to respect arXiv guidelines.
     arXiv recommends at least 3 seconds between requests.
     This method enforces the configured rate_limit_delay.
     """
-    current_time = time.time()
-    time_since_last_request = current_time - cls._last_request_time
-    if time_since_last_request < cls.config.rate_limit_delay:
-        sleep_time = cls.config.rate_limit_delay - time_since_last_request
-        log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
-        time.sleep(sleep_time)
-    cls._last_request_time = time.time()
+    with cls._rate_limit_lock:
+        current_time = time.time()
+        time_since_last_request = current_time - cls._last_request_time
+        if time_since_last_request < cls.config.rate_limit_delay:
+            sleep_time = cls.config.rate_limit_delay - time_since_last_request
+            log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
+            time.sleep(sleep_time)
+        cls._last_request_time = time.time()
```
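For completeness, a minimal sketch of the class-level additions this suggestion assumes; the lock name `_rate_limit_lock` follows the comment above, while the initial value of `_last_request_time` and the class's real base classes are placeholders:

```python
import threading


class ArxivSearch:  # real base classes, config, and other members omitted
    # One lock shared by every thread entering _apply_rate_limiting,
    # so reads and writes of _last_request_time are serialized.
    _rate_limit_lock = threading.Lock()
    _last_request_time = 0.0  # placeholder; keep whatever initial value the tool already uses
```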
```python
# Special handling for known OmniDocBench paper (arXiv:2412.07626)
# This is a fallback for testing - actual implementation should use
# Semantic Scholar API or PDF parsing for reliable affiliation data
paper_id = paper.get('arxiv_id', '')
if '2412.07626' in paper_id:
    # Known institutions for OmniDocBench paper
    # Source: https://arxiv.org/abs/2412.07626
    institutions.update([
        'Shanghai AI Laboratory',
        'Shanghai Artificial Intelligence Laboratory',
        'Abaka AI',
        '2077AI'
    ])
```
The method _extract_institutions_from_paper contains hardcoded logic specifically for the paper arXiv:2412.07626. This is a poor practice as it makes the tool's behavior inconsistent and brittle. Test-specific data should be handled in the test suite using mocks or fixtures, not embedded in the tool's implementation, even if the parent method is deprecated.
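To illustrate that point, here is one way the paper-specific data could live in the test suite instead. Every name below (the import path, the `fetch_metadata` seam, the dict shape) is a hypothetical stand-in for the tool's real internals, not its actual API:

```python
from unittest.mock import patch

# Hypothetical import path; substitute wherever ArxivSearch actually lives.
from tools.arxiv_search import ArxivSearch

# Fixture data owned by the test, not baked into the tool.
OMNIDOCBENCH_METADATA = {
    'arxiv_id': '2412.07626',
    'affiliations': ['Shanghai AI Laboratory', 'Abaka AI', '2077AI'],
}


def test_institutions_come_from_metadata():
    # Patch a (hypothetical) metadata lookup so the test supplies the affiliation
    # data, rather than the tool special-casing arXiv:2412.07626 internally.
    with patch.object(ArxivSearch, 'fetch_metadata',
                      return_value=OMNIDOCBENCH_METADATA, create=True):
        institutions = ArxivSearch._extract_institutions_from_paper(
            {'arxiv_id': '2412.07626'})
    assert 'Shanghai AI Laboratory' in institutions
```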
```python
# New format: YYMM.NNNNN(vN)?
new_pattern = r'^\d{4}\.\d{4,5}(v\d+)?$'
if re.match(new_pattern, text):
    return True

# Old format: archive/NNNNNNN(vN)?
old_pattern = r'^[a-z\-]+/\d{7}(v\d+)?$'
if re.match(old_pattern, text):
    return True

return False
```
The regular expressions in _is_arxiv_id are recompiled on every function call. For better performance, especially if this function is called in a loop, these patterns should be pre-compiled once as class-level attributes. This performance improvement also applies to the regex patterns in _is_doi and detect_paper_references.
For example, you could refactor this method as follows:
```python
# At class level
_NEW_ARXIV_PATTERN = re.compile(r'^\d{4}\.\d{4,5}(v\d+)?$')
_OLD_ARXIV_PATTERN = re.compile(r'^[a-z\-]+/\d{7}(v\d+)?$')

@classmethod
def _is_arxiv_id(cls, text: str) -> bool:
    """... docstring ..."""
    text = text.strip().replace("arXiv:", "").replace("arxiv:", "")
    if cls._NEW_ARXIV_PATTERN.match(text):
        return True
    if cls._OLD_ARXIV_PATTERN.match(text):
        return True
    return False
```

A separate hunk showing the module imports referenced by the unused-import note in the summary:

```python
"""

import json
import re
```
```python
def _deduplicate_claims(cls, claims: List[Dict]) -> List[Dict]:
    """
    Remove duplicate or highly similar claims.

    Args:
        claims: List of claims

    Returns:
        Deduplicated claims
    """
    if len(claims) <= 1:
        return claims

    unique_claims = []
    seen_texts = set()

    for claim in claims:
        claim_text = claim.get('claim', '').strip().lower()

        # Skip if empty
        if not claim_text:
            continue

        # Skip if exact duplicate
        if claim_text in seen_texts:
            continue

        # Check for very similar claims (simple substring check)
        is_duplicate = False
        for seen_text in seen_texts:
            # If one is substring of other and length difference < 20%
            if claim_text in seen_text or seen_text in claim_text:
                len_diff = abs(len(claim_text) - len(seen_text))
                if len_diff < 0.2 * max(len(claim_text), len(seen_text)):
                    is_duplicate = True
                    break

        if not is_duplicate:
            unique_claims.append(claim)
            seen_texts.add(claim_text)

    return unique_claims
```
The current deduplication logic in _deduplicate_claims relies on a simple substring check, which is unreliable in both directions: it misses near-duplicates that are merely reworded (e.g. "The model is faster than the baseline" vs. "Compared to the baseline, the model is faster", where neither string contains the other), and it can collapse claims that differ only by a short but meaningful qualifier when one happens to be a substring of the other. A more robust approach is to measure word-set overlap with Jaccard similarity, which is less prone to such errors.
```python
def _deduplicate_claims(cls, claims: List[Dict]) -> List[Dict]:
    """
    Remove duplicate or highly similar claims using Jaccard similarity.

    Args:
        claims: List of claims

    Returns:
        Deduplicated claims
    """
    if len(claims) <= 1:
        return claims

    unique_claims = []
    seen_claim_texts = []

    for claim_data in claims:
        claim_text = claim_data.get('claim', '').strip()
        if not claim_text:
            continue

        # Tokenize and create a set of words for the current claim
        current_words = set(claim_text.lower().split())
        if not current_words:
            continue

        is_duplicate = False
        for seen_text in seen_claim_texts:
            # Tokenize and create a set for the seen claim
            seen_words = set(seen_text.lower().split())
            # Calculate Jaccard similarity
            intersection = len(current_words.intersection(seen_words))
            union = len(current_words.union(seen_words))
            similarity = intersection / union if union > 0 else 0
            # If similarity is high (e.g., > 80%), consider it a duplicate
            if similarity > 0.8:
                is_duplicate = True
                break

        if not is_duplicate:
            unique_claims.append(claim_data)
            seen_claim_texts.append(claim_text)

    return unique_claims
```
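As a quick sanity check on the 0.8 cutoff, here is the same word-set similarity computed standalone on two claim pairs; the claim texts are invented purely for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity, mirroring the check in the suggestion above."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0


# Rewording with identical words scores 1.0 and would be deduplicated.
print(jaccard("The model reaches 95% accuracy on the test set",
              "On the test set the model reaches 95% accuracy"))  # 1.0

# Two extra words drop the score to 0.75, below the cutoff, so both claims are kept.
print(jaccard("The benchmark covers 9 document types",
              "The benchmark covers 9 document types in total"))  # 0.75
```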
No description provided.