Skip to content

Fix/rss atom domain#113

Merged
lpi-tn merged 2 commits intomainfrom
Fix/rss-atom-domain
Mar 3, 2026
Merged

Fix/rss atom domain#113
lpi-tn merged 2 commits intomainfrom
Fix/rss-atom-domain

Conversation

@lpi-tn
Copy link
Collaborator

@lpi-tn lpi-tn commented Mar 3, 2026

This pull request updates the logic for determining the domain in both the Atom and RSS URL collectors to use the main_url from the Corpus object, rather than the feed URL. It also expands the test coverage to ensure correct behavior when the corpus domain differs from the feed domain.

Domain resolution changes:

  • Updated the collect methods in both AtomURLCollector (welearn_datastack/collectors/atom_collector.py) and RssURLCollector (welearn_datastack/collectors/rss_collector.py) to derive the domain from corpus.main_url instead of feed_url. This ensures that collected document URLs are always based on the intended corpus domain. [1] [2]

Test improvements:

  • Modified test setup in test_atom_collector.py and test_rss_collector.py to initialize the Corpus with a main_url field, aligning with the new domain resolution logic. [1] [2]
  • Added new tests in both test_atom_collector.py and test_rss_collector.py to verify that URL collection works correctly when the corpus domain is different from the feed domain. These tests check that the resulting document URLs use the main_url from the corpus. [1] [2]

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR changes RSS/Atom URL collection so the “accepted domain” is derived from Corpus.main_url rather than from the feed URL, and adds tests to cover cases where the corpus domain differs from the feed domain.

Changes:

  • Update RssURLCollector.collect() and AtomURLCollector.collect() to compute domain from corpus.main_url.
  • Update existing RSS/Atom collector tests to build Corpus with main_url.
  • Add new tests ensuring collected document URLs follow the corpus domain even when the feed URL domain differs.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
welearn_datastack/collectors/rss_collector.py Switches domain derivation to corpus.main_url for RSS link filtering.
welearn_datastack/collectors/atom_collector.py Switches domain derivation to corpus.main_url for Atom link filtering.
tests/url_collector/test_rss_collector.py Extends setup to include main_url and adds a “different domain” test.
tests/url_collector/test_atom_collector.py Extends setup to include main_url and adds a “different domain” test.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@lpi-tn lpi-tn merged commit ed5e081 into main Mar 3, 2026
10 of 11 checks passed
@lpi-tn lpi-tn deleted the Fix/rss-atom-domain branch March 3, 2026 15:07
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants