feat: add arxiv tool and claim tool to support an article fact check scenario #341
seancoding-day wants to merge 1 commit into MigoXLab:dev
Conversation
Summary of Changes: Hello @seancoding-day, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the agent's capabilities by introducing two specialized tools: ArxivSearch and ClaimsExtractor.
Code Review
This pull request introduces two new tools, ArxivSearch and ClaimsExtractor, to support fact-checking scenarios. The ArxivSearch tool provides a comprehensive interface to the arXiv API, while ClaimsExtractor leverages an LLM to extract verifiable claims from text. The overall implementation is solid, with good documentation and clear separation of concerns. However, I've identified a few areas for improvement. In ArxivSearch, there's a high-severity thread-safety issue in the rate-limiting logic and some hardcoded test data that should be removed. In ClaimsExtractor, an unused import should be cleaned up, and the claim deduplication logic could be made more robust. Addressing these points will significantly improve the reliability and maintainability of these new tools.
```python
def _apply_rate_limiting(cls):
    """
    Apply rate limiting to respect arXiv guidelines.

    arXiv recommends at least 3 seconds between requests.
    This method enforces the configured rate_limit_delay.
    """
    current_time = time.time()
    time_since_last_request = current_time - cls._last_request_time

    if time_since_last_request < cls.config.rate_limit_delay:
        sleep_time = cls.config.rate_limit_delay - time_since_last_request
        log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
        time.sleep(sleep_time)

    cls._last_request_time = time.time()
```
The use of a class attribute _last_request_time to manage rate limiting is not thread-safe. If multiple threads call execute concurrently, they will share and modify this attribute, leading to a race condition. This could result in multiple requests being sent in quick succession, violating arXiv's rate-limiting guidelines.
To ensure thread safety, access to _last_request_time should be protected with a lock. You'll need to add import threading at the top of the file and _rate_limit_lock = threading.Lock() as a class attribute to ArxivSearch.
Suggested change:

```diff
 def _apply_rate_limiting(cls):
     """
     Apply rate limiting to respect arXiv guidelines.
     arXiv recommends at least 3 seconds between requests.
     This method enforces the configured rate_limit_delay.
     """
-    current_time = time.time()
-    time_since_last_request = current_time - cls._last_request_time
-    if time_since_last_request < cls.config.rate_limit_delay:
-        sleep_time = cls.config.rate_limit_delay - time_since_last_request
-        log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
-        time.sleep(sleep_time)
-    cls._last_request_time = time.time()
+    with cls._rate_limit_lock:
+        current_time = time.time()
+        time_since_last_request = current_time - cls._last_request_time
+        if time_since_last_request < cls.config.rate_limit_delay:
+            sleep_time = cls.config.rate_limit_delay - time_since_last_request
+            log.debug(f"Rate limiting: sleeping for {sleep_time:.2f} seconds")
+            time.sleep(sleep_time)
+        cls._last_request_time = time.time()
```
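For completeness, a minimal sketch of the class-level additions this suggestion assumes; the lock name `_rate_limit_lock` follows the comment above, while the initial value of `_last_request_time` and the class's real base classes are placeholders:

```python
import threading


class ArxivSearch:  # real base classes, config, and other members omitted
    # One lock shared by every thread entering _apply_rate_limiting,
    # so reads and writes of _last_request_time are serialized.
    _rate_limit_lock = threading.Lock()
    _last_request_time = 0.0  # placeholder; keep whatever initial value the tool already uses
```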
```python
# Special handling for known OmniDocBench paper (arXiv:2412.07626)
# This is a fallback for testing - actual implementation should use
# Semantic Scholar API or PDF parsing for reliable affiliation data
paper_id = paper.get('arxiv_id', '')
if '2412.07626' in paper_id:
    # Known institutions for OmniDocBench paper
    # Source: https://arxiv.org/abs/2412.07626
    institutions.update([
        'Shanghai AI Laboratory',
        'Shanghai Artificial Intelligence Laboratory',
        'Abaka AI',
        '2077AI'
    ])
```
The method _extract_institutions_from_paper contains hardcoded logic specifically for the paper arXiv:2412.07626. This is a poor practice as it makes the tool's behavior inconsistent and brittle. Test-specific data should be handled in the test suite using mocks or fixtures, not embedded in the tool's implementation, even if the parent method is deprecated.
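To illustrate that point, here is one way the paper-specific data could live in the test suite instead. Every name below (the import path, the `fetch_metadata` seam, the dict shape) is a hypothetical stand-in for the tool's real internals, not its actual API:

```python
from unittest.mock import patch

# Hypothetical import path; substitute wherever ArxivSearch actually lives.
from tools.arxiv_search import ArxivSearch

# Fixture data owned by the test, not baked into the tool.
OMNIDOCBENCH_METADATA = {
    'arxiv_id': '2412.07626',
    'affiliations': ['Shanghai AI Laboratory', 'Abaka AI', '2077AI'],
}


def test_institutions_come_from_metadata():
    # Patch a (hypothetical) metadata lookup so the test supplies the affiliation
    # data, rather than the tool special-casing arXiv:2412.07626 internally.
    with patch.object(ArxivSearch, 'fetch_metadata',
                      return_value=OMNIDOCBENCH_METADATA, create=True):
        institutions = ArxivSearch._extract_institutions_from_paper(
            {'arxiv_id': '2412.07626'})
    assert 'Shanghai AI Laboratory' in institutions
```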
```python
# New format: YYMM.NNNNN(vN)?
new_pattern = r'^\d{4}\.\d{4,5}(v\d+)?$'
if re.match(new_pattern, text):
    return True

# Old format: archive/NNNNNNN(vN)?
old_pattern = r'^[a-z\-]+/\d{7}(v\d+)?$'
if re.match(old_pattern, text):
    return True

return False
```
The regular expressions in _is_arxiv_id are recompiled on every function call. For better performance, especially if this function is called in a loop, these patterns should be pre-compiled once as class-level attributes. This performance improvement also applies to the regex patterns in _is_doi and detect_paper_references.
For example, you could refactor this method as follows:
```python
# At class level
_NEW_ARXIV_PATTERN = re.compile(r'^\d{4}\.\d{4,5}(v\d+)?$')
_OLD_ARXIV_PATTERN = re.compile(r'^[a-z\-]+/\d{7}(v\d+)?$')

@classmethod
def _is_arxiv_id(cls, text: str) -> bool:
    """... docstring ..."""
    text = text.strip().replace("arXiv:", "").replace("arxiv:", "")
    if cls._NEW_ARXIV_PATTERN.match(text):
        return True
    if cls._OLD_ARXIV_PATTERN.match(text):
        return True
    return False
```

A separate hunk showing the module imports referenced by the unused-import note in the summary:

```python
"""

import json
import re
```
```python
def _deduplicate_claims(cls, claims: List[Dict]) -> List[Dict]:
    """
    Remove duplicate or highly similar claims.

    Args:
        claims: List of claims

    Returns:
        Deduplicated claims
    """
    if len(claims) <= 1:
        return claims

    unique_claims = []
    seen_texts = set()

    for claim in claims:
        claim_text = claim.get('claim', '').strip().lower()

        # Skip if empty
        if not claim_text:
            continue

        # Skip if exact duplicate
        if claim_text in seen_texts:
            continue

        # Check for very similar claims (simple substring check)
        is_duplicate = False
        for seen_text in seen_texts:
            # If one is substring of other and length difference < 20%
            if claim_text in seen_text or seen_text in claim_text:
                len_diff = abs(len(claim_text) - len(seen_text))
                if len_diff < 0.2 * max(len(claim_text), len(seen_text)):
                    is_duplicate = True
                    break

        if not is_duplicate:
            unique_claims.append(claim)
            seen_texts.add(claim_text)

    return unique_claims
```
The current deduplication logic in _deduplicate_claims relies on a simple substring check, which is unreliable in both directions: it misses near-duplicates that are merely reworded (e.g. "The model is faster than the baseline" vs. "Compared to the baseline, the model is faster", where neither string contains the other), and it can collapse claims that differ only by a short but meaningful qualifier when one happens to be a substring of the other. A more robust approach is to measure word-set overlap with Jaccard similarity, which is less prone to such errors.
```python
def _deduplicate_claims(cls, claims: List[Dict]) -> List[Dict]:
    """
    Remove duplicate or highly similar claims using Jaccard similarity.

    Args:
        claims: List of claims

    Returns:
        Deduplicated claims
    """
    if len(claims) <= 1:
        return claims

    unique_claims = []
    seen_claim_texts = []

    for claim_data in claims:
        claim_text = claim_data.get('claim', '').strip()
        if not claim_text:
            continue

        # Tokenize and create a set of words for the current claim
        current_words = set(claim_text.lower().split())
        if not current_words:
            continue

        is_duplicate = False
        for seen_text in seen_claim_texts:
            # Tokenize and create a set for the seen claim
            seen_words = set(seen_text.lower().split())
            # Calculate Jaccard similarity
            intersection = len(current_words.intersection(seen_words))
            union = len(current_words.union(seen_words))
            similarity = intersection / union if union > 0 else 0
            # If similarity is high (e.g., > 80%), consider it a duplicate
            if similarity > 0.8:
                is_duplicate = True
                break

        if not is_duplicate:
            unique_claims.append(claim_data)
            seen_claim_texts.append(claim_text)

    return unique_claims
```
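As a quick sanity check on the 0.8 cutoff, here is the same word-set similarity computed standalone on two claim pairs; the claim texts are invented purely for illustration:

```python
def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity, mirroring the check in the suggestion above."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0


# Rewording with identical words scores 1.0 and would be deduplicated.
print(jaccard("The model reaches 95% accuracy on the test set",
              "On the test set the model reaches 95% accuracy"))  # 1.0

# Two extra words drop the score to 0.75, below the cutoff, so both claims are kept.
print(jaccard("The benchmark covers 9 document types",
              "The benchmark covers 9 document types in total"))  # 0.75
```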
No description provided.