
feat: add arxiv tool and claim tool to support an article fact check scenario #341

Open

seancoding-day wants to merge 1 commit into MigoXLab:dev from seancoding-day:feature/add-arxiv-claims-tools

Conversation

seancoding-day (Collaborator) commented Feb 2, 2026

No description provided.

seancoding-day self-assigned this Feb 2, 2026
@gemini-code-assist (Contributor)

Summary of Changes

Hello @seancoding-day, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the agent's capabilities by introducing two specialized tools: ArxivSearch and ClaimsExtractor. The ArxivSearch tool provides direct access to academic paper information, facilitating research and verification of scholarly content. Concurrently, the ClaimsExtractor tool leverages LLMs to intelligently parse and categorize verifiable statements from any given text, making it easier to identify key assertions for fact-checking. These additions collectively enhance the system's ability to analyze and validate diverse forms of information, particularly within academic and general article contexts.

Highlights

  • New ArxivSearch Tool: A new ArxivSearch tool has been added, enabling comprehensive searching for academic papers on arXiv by ID, DOI, title, author, or keywords, and retrieving detailed metadata.
  • New ClaimsExtractor Tool: A new ClaimsExtractor tool has been introduced, which utilizes Large Language Models (LLMs) to identify and extract atomic, verifiable claims from long-form text. It categorizes these claims into 8 distinct types: factual, statistical, attribution, institutional, temporal, comparative, monetary, and technical (see the sketch after this list).
  • Enhanced Fact-Checking Capabilities: Both new tools are designed to support an article fact-checking scenario, allowing for the verification of academic claims and the systematic extraction of statements for independent validation.
  • Deprecation of Institution Verification: The ArxivSearch tool includes a deprecated verify_institutions method, with clear guidance provided to use a combination of arxiv_search and tavily_search for more robust and generalizable institution verification.
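As a rough illustration of the claim records described above, here is a sketch of what extracted output might look like; the 'type' field name and the sample texts are assumptions for illustration, only the 'claim' key and the eight categories appear elsewhere in this PR:

    sample_claims = [
        {'claim': 'The model achieves 95% accuracy on the benchmark.', 'type': 'statistical'},
        {'claim': 'The dataset was released in December 2024.', 'type': 'temporal'},
        {'claim': 'Method A runs twice as fast as Method B.', 'type': 'comparative'},
    ]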
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature              | Command             | Description
Code Review          | /gemini review      | Performs a code review for the current pull request in its current state.
Pull Request Summary | /gemini summary     | Provides a summary of the current pull request in its current state.
Comment              | @gemini-code-assist | Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                 | /gemini help        | Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the root of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, and code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


gemini-code-assist bot (Contributor) left a comment


Code Review

This pull request introduces two new tools, ArxivSearch and ClaimsExtractor, to support fact-checking scenarios. The ArxivSearch tool provides a comprehensive interface to the arXiv API, while ClaimsExtractor leverages an LLM to extract verifiable claims from text. The overall implementation is solid, with good documentation and clear separation of concerns. However, I've identified a few areas for improvement. In ArxivSearch, there's a high-severity thread-safety issue in the rate-limiting logic and some hardcoded test data that should be removed. In ClaimsExtractor, an unused import should be cleaned up, and the claim deduplication logic could be made more robust. Addressing these points will significantly improve the reliability and maintainability of these new tools.

Comment on lines +389 to +404
    def _apply_rate_limiting(cls):
        """
        Apply rate limiting to respect arXiv guidelines.

        arXiv recommends at least 3 seconds between requests.
        This method enforces the configured rate_limit_delay.
        """
        current_time = time.time()
        time_since_last_request = current_time - cls._last_request_time

        if time_since_last_request < cls.config.rate_limit_delay:
            sleep_time = cls.config.rate_limit_delay - time_since_last_request
            log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
            time.sleep(sleep_time)

        cls._last_request_time = time.time()

Severity: high

The use of a class attribute _last_request_time to manage rate limiting is not thread-safe. If multiple threads call execute concurrently, they will share and modify this attribute, leading to a race condition. This could result in multiple requests being sent in quick succession, violating arXiv's rate-limiting guidelines.

To ensure thread safety, access to _last_request_time should be protected with a lock. You'll need to add import threading at the top of the file and _rate_limit_lock = threading.Lock() as a class attribute to ArxivSearch.

Suggested change
    def _apply_rate_limiting(cls):
        """
        Apply rate limiting to respect arXiv guidelines.

        arXiv recommends at least 3 seconds between requests.
        This method enforces the configured rate_limit_delay.
        """
        with cls._rate_limit_lock:
            current_time = time.time()
            time_since_last_request = current_time - cls._last_request_time

            if time_since_last_request < cls.config.rate_limit_delay:
                sleep_time = cls.config.rate_limit_delay - time_since_last_request
                log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
                time.sleep(sleep_time)

            cls._last_request_time = time.time()
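For reference, a minimal self-contained sketch of the class-level pieces this suggestion relies on. The _Config stand-in and its default delay are illustrative; only _rate_limit_lock and _last_request_time come from the review comment above:

    import threading
    import time
    from dataclasses import dataclass


    @dataclass
    class _Config:
        # Stand-in for the tool's real configuration object;
        # arXiv asks for at least 3 seconds between requests
        rate_limit_delay: float = 3.0


    class ArxivSearch:
        config = _Config()
        # Shared by all threads so concurrent execute() calls serialize their requests
        _rate_limit_lock = threading.Lock()
        _last_request_time = 0.0

        @classmethod
        def _apply_rate_limiting(cls):
            # Only one thread at a time may read and update _last_request_time
            with cls._rate_limit_lock:
                elapsed = time.time() - cls._last_request_time
                if elapsed < cls.config.rate_limit_delay:
                    time.sleep(cls.config.rate_limit_delay - elapsed)
                cls._last_request_time = time.time()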

Comment on lines +696 to +708
        # Special handling for known OmniDocBench paper (arXiv:2412.07626)
        # This is a fallback for testing - actual implementation should use
        # Semantic Scholar API or PDF parsing for reliable affiliation data
        paper_id = paper.get('arxiv_id', '')
        if '2412.07626' in paper_id:
            # Known institutions for OmniDocBench paper
            # Source: https://arxiv.org/abs/2412.07626
            institutions.update([
                'Shanghai AI Laboratory',
                'Shanghai Artificial Intelligence Laboratory',
                'Abaka AI',
                '2077AI'
            ])

Severity: high

The method _extract_institutions_from_paper contains hardcoded logic specifically for the paper arXiv:2412.07626. This is a poor practice as it makes the tool's behavior inconsistent and brittle. Test-specific data should be handled in the test suite using mocks or fixtures, not embedded in the tool's implementation, even if the parent method is deprecated.
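As an illustration of that point, a pytest-style sketch of how the OmniDocBench data could live in the test suite instead. The import path, fixture, and the verify_institutions call used here are hypothetical placeholders, not the actual API:

    from unittest import mock

    import pytest

    from tools.arxiv_search import ArxivSearch  # hypothetical import path


    @pytest.fixture
    def omnidocbench_institutions():
        # Known affiliations for arXiv:2412.07626, kept as test data rather than in the tool
        return {'Shanghai AI Laboratory', 'Shanghai Artificial Intelligence Laboratory',
                'Abaka AI', '2077AI'}


    def test_verify_institutions_for_known_paper(omnidocbench_institutions):
        # Stub the affiliation extraction (normally network- or PDF-dependent) with fixture
        # data, then exercise the deprecated verify_institutions wrapper against one claim.
        # The wrapper's signature and return shape below are assumed purely for illustration.
        with mock.patch.object(ArxivSearch, '_extract_institutions_from_paper',
                               return_value=omnidocbench_institutions):
            result = ArxivSearch.verify_institutions(
                paper_id='2412.07626', claimed_institutions=['Shanghai AI Laboratory'])

        assert result['Shanghai AI Laboratory'] is True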

Comment on lines +332 to +342
        # New format: YYMM.NNNNN(vN)?
        new_pattern = r'^\d{4}\.\d{4,5}(v\d+)?$'
        if re.match(new_pattern, text):
            return True

        # Old format: archive/NNNNNNN(vN)?
        old_pattern = r'^[a-z\-]+/\d{7}(v\d+)?$'
        if re.match(old_pattern, text):
            return True

        return False

Severity: medium

The regular expressions in _is_arxiv_id are recompiled on every function call. For better performance, especially if this function is called in a loop, these patterns should be pre-compiled once as class-level attributes. This performance improvement also applies to the regex patterns in _is_doi and detect_paper_references.

For example, you could refactor this method as follows:

# At class level
_NEW_ARXIV_PATTERN = re.compile(r'^\d{4}\.\d{4,5}(v\d+)?$')
_OLD_ARXIV_PATTERN = re.compile(r'^[a-z\-]+/\d{7}(v\d+)?$')

@classmethod
def _is_arxiv_id(cls, text: str) -> bool:
    """... docstring ..."""
    text = text.strip().replace("arXiv:", "").replace("arxiv:", "")

    if cls._NEW_ARXIV_PATTERN.match(text):
        return True

    if cls._OLD_ARXIV_PATTERN.match(text):
        return True

    return False

"""

import json
import re

Severity: medium

The re module is imported but never used in this file. It should be removed to keep the code clean and avoid confusion.

Comment on lines +533 to +574
    def _deduplicate_claims(cls, claims: List[Dict]) -> List[Dict]:
        """
        Remove duplicate or highly similar claims.

        Args:
            claims: List of claims

        Returns:
            Deduplicated claims
        """
        if len(claims) <= 1:
            return claims

        unique_claims = []
        seen_texts = set()

        for claim in claims:
            claim_text = claim.get('claim', '').strip().lower()

            # Skip if empty
            if not claim_text:
                continue

            # Skip if exact duplicate
            if claim_text in seen_texts:
                continue

            # Check for very similar claims (simple substring check)
            is_duplicate = False
            for seen_text in seen_texts:
                # If one is substring of other and length difference < 20%
                if claim_text in seen_text or seen_text in claim_text:
                    len_diff = abs(len(claim_text) - len(seen_text))
                    if len_diff < 0.2 * max(len(claim_text), len(seen_text)):
                        is_duplicate = True
                        break

            if not is_duplicate:
                unique_claims.append(claim)
                seen_texts.add(claim_text)

        return unique_claims

Severity: medium

The current deduplication logic in _deduplicate_claims uses a simple substring check, which can be unreliable. For instance, it would incorrectly flag "The model supports English" and "The model supports English only" as duplicates, because one is a substring of the other and their lengths differ by less than 20%, even though the added qualifier changes the meaning. A more robust approach would be to use a method like Jaccard similarity on the sets of words in each claim to measure overlap, which is less prone to such errors.

    def _deduplicate_claims(cls, claims: List[Dict]) -> List[Dict]:
        """
        Remove duplicate or highly similar claims using Jaccard similarity.

        Args:
            claims: List of claims

        Returns:
            Deduplicated claims
        """
        if len(claims) <= 1:
            return claims

        unique_claims = []
        seen_claim_texts = []

        for claim_data in claims:
            claim_text = claim_data.get('claim', '').strip()
            if not claim_text:
                continue

            # Tokenize and create a set of words for the current claim
            current_words = set(claim_text.lower().split())
            if not current_words:
                continue

            is_duplicate = False
            for seen_text in seen_claim_texts:
                # Tokenize and create a set for the seen claim
                seen_words = set(seen_text.lower().split())

                # Calculate Jaccard similarity
                intersection = len(current_words.intersection(seen_words))
                union = len(current_words.union(seen_words))
                similarity = intersection / union if union > 0 else 0

                # If similarity is high (e.g., > 80%), consider it a duplicate
                if similarity > 0.8:
                    is_duplicate = True
                    break

            if not is_duplicate:
                unique_claims.append(claim_data)
                seen_claim_texts.append(claim_text)

        return unique_claims
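As a quick sanity check of that threshold on the substring example mentioned above (a standalone sketch, not part of the PR):

    claim_a = set("The model supports English".lower().split())
    claim_b = set("The model supports English only".lower().split())

    similarity = len(claim_a & claim_b) / len(claim_a | claim_b)
    print(similarity)  # 0.8 -> not strictly greater than 0.8, so both claims are kept as distinct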
