Fix/rss atom domain by lpi-tn · Pull Request #113 · CyberCRI/welearn-datastack

lpi-tn · 2026-03-03T14:53:43Z

This pull request updates the logic for determining the domain in both the Atom and RSS URL collectors to use the main_url from the Corpus object, rather than the feed URL. It also expands the test coverage to ensure correct behavior when the corpus domain differs from the feed domain.

Domain resolution changes:

Updated the collect methods in both AtomURLCollector (welearn_datastack/collectors/atom_collector.py) and RssURLCollector (welearn_datastack/collectors/rss_collector.py) to derive the domain from corpus.main_url instead of feed_url. This ensures that collected document URLs are always based on the intended corpus domain.
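
To illustrate the difference between the two sources of the domain, here is a minimal sketch. It assumes the domain is the network location of the URL as returned by `urllib.parse.urlparse`; the `Corpus` dataclass, both helper functions, and the example URLs are hypothetical stand-ins, not the project's actual code.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Corpus:
    """Hypothetical stand-in for the project's Corpus model."""

    source_name: str
    main_url: str


def domain_from_feed(feed_url: str) -> str:
    """Behavior before this PR, as described above: domain taken from the feed URL."""
    return urlparse(feed_url).netloc


def domain_from_corpus(corpus: Corpus) -> str:
    """Behavior after this PR, as described above: domain taken from corpus.main_url."""
    return urlparse(corpus.main_url).netloc


# Assumed example values: the feed is served from a different host than the corpus site.
corpus = Corpus(source_name="example", main_url="https://www.example.org")
feed_url = "https://feeds.example.net/rss.xml"

print(domain_from_feed(feed_url))    # host of the feed
print(domain_from_corpus(corpus))    # host of the corpus main_url
```

When the feed is hosted on a third-party service, the two values diverge, which is the case this change is meant to handle.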

Test improvements:

Modified test setup in test_atom_collector.py and test_rss_collector.py to initialize the Corpus with a main_url field, aligning with the new domain resolution logic.
Added new tests in both test_atom_collector.py and test_rss_collector.py to verify that URL collection works correctly when the corpus domain is different from the feed domain. These tests check that the resulting document URLs use the main_url from the corpus.
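
A "different domain" test could look roughly like the sketch below. It is not the test added in this PR: `StubCollector`, its constructor arguments, and the pre-parsed `entry_links` list are hypothetical, and the stub assumes the collector keeps only links hosted on the corpus domain.

```python
import unittest
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass
class Corpus:
    # Hypothetical stand-in; the real Corpus model likely has more fields.
    source_name: str
    main_url: str


class StubCollector:
    """Hypothetical collector that keeps entry links hosted on the corpus domain."""

    def __init__(self, feed_url: str, corpus: Corpus, entry_links: list[str]):
        self.feed_url = feed_url
        self.corpus = corpus
        self.entry_links = entry_links  # stands in for entries parsed from the feed

    def collect(self) -> list[str]:
        # The domain comes from corpus.main_url, never from self.feed_url.
        domain = urlparse(self.corpus.main_url).netloc
        return [u for u in self.entry_links if urlparse(u).netloc == domain]


class TestDifferentDomain(unittest.TestCase):
    def test_urls_follow_corpus_domain(self):
        corpus = Corpus(source_name="example", main_url="https://www.example.org")
        collector = StubCollector(
            feed_url="https://feeds.example.net/rss.xml",  # differs from main_url
            corpus=corpus,
            entry_links=[
                "https://www.example.org/articles/1",
                "https://feeds.example.net/redirect/2",
            ],
        )

        urls = collector.collect()

        self.assertEqual(urls, ["https://www.example.org/articles/1"])
        expected_domain = urlparse(corpus.main_url).netloc
        for url in urls:
            self.assertEqual(urlparse(url).netloc, expected_domain)
```

Saved as a test module, this can be run with `python -m unittest`.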

Copilot

Pull request overview

This PR changes RSS/Atom URL collection so the “accepted domain” is derived from Corpus.main_url rather than from the feed URL, and adds tests to cover cases where the corpus domain differs from the feed domain.

Changes:

Update RssURLCollector.collect() and AtomURLCollector.collect() to compute domain from corpus.main_url.
Update existing RSS/Atom collector tests to build Corpus with main_url.
Add new tests ensuring collected document URLs follow the corpus domain even when the feed URL domain differs.
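
As a rough end-to-end illustration of the "accepted domain" idea on an actual feed document, the sketch below parses a small RSS 2.0 string with the standard library and keeps only item links hosted on the corpus domain. The sample feed, the `links_on_domain` helper, and the exact-host comparison are assumptions made for the example; the real collectors may parse and match differently.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlparse

# Assumed sample feed: one item on the corpus host, one on a different host.
RSS_SAMPLE = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example feed</title>
    <link>https://www.example.org</link>
    <item><link>https://www.example.org/post/a</link></item>
    <item><link>https://tracker.example.net/post/b</link></item>
  </channel>
</rss>"""


def links_on_domain(rss_xml: str, main_url: str) -> list[str]:
    """Return item links whose host equals the host of main_url (hypothetical helper)."""
    accepted = urlparse(main_url).netloc
    root = ET.fromstring(rss_xml)
    # Only <item> links are considered; the channel-level <link> is ignored.
    links = [item.findtext("link", default="").strip() for item in root.iter("item")]
    return [link for link in links if urlparse(link).netloc == accepted]


print(links_on_domain(RSS_SAMPLE, "https://www.example.org"))
```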

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File	Description
welearn_datastack/collectors/rss_collector.py	Switches domain derivation to `corpus.main_url` for RSS link filtering.
welearn_datastack/collectors/atom_collector.py	Switches domain derivation to `corpus.main_url` for Atom link filtering.
tests/url_collector/test_rss_collector.py	Extends setup to include `main_url` and adds a “different domain” test.
tests/url_collector/test_atom_collector.py	Extends setup to include `main_url` and adds a “different domain” test.

lpi-tn added 2 commits March 3, 2026 15:32

fix(collectors): update domain extraction to use main_url from corpus

8a73df3

test(atom_collectors): add tests for AtomURLCollector and RSSURLColle…

3a96d43

…ctor with different domains

lpi-tn requested review from Copilot, jmsevin and sandragjacinto March 3, 2026 14:53

Copilot started reviewing on behalf of lpi-tn March 3, 2026 14:54

sandragjacinto approved these changes Mar 3, 2026

Copilot AI reviewed Mar 3, 2026

lpi-tn merged commit ed5e081 into main Mar 3, 2026
10 of 11 checks passed

lpi-tn deleted the Fix/rss-atom-domain branch March 3, 2026 15:07

lpi-tn merged 2 commits into main from Fix/rss-atom-domain

3 participants
