Skip to content

Move datasets to delete first in line#14

Merged
jbrown-xentity merged 1 commit into
datagovfrom
bugfix/dataset-renaming
Oct 25, 2021
Merged

Move datasets to delete first in line#14
jbrown-xentity merged 1 commit into
datagovfrom
bugfix/dataset-renaming

Conversation

@jbrown-xentity
Copy link
Copy Markdown

We have reports of datasets that get re-harvested with an extra 1 in the URL. We have confirmed these reports.
It seems the harvest is doing the best it can to diagnose if this is a new dataset or not; but still failing in some circumstances.
This probably won't fix the bug; however it will mitigate it. By hopefully running through the datasets removal first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, then the name of the new dataset won't collide with the one that is marked for deleted but still in the system.

We have reports of datasets that get re-harvested with an extra `1` in the URL. We have confirmed these reports.
It seems the harvest is doing the best it can to diagnose if this is a new dataset or not; but still failing in some circumstances.
This probably won't fix the bug; however it will mitigate it. By hopefully running through the datasets removal first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, then the name of the new dataset won't collide with the one that is marked for deleted but still in the system.
@jbrown-xentity jbrown-xentity requested a review from a team October 22, 2021 19:54
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Oct 22, 2021

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.16%. Comparing base (9354d52) to head (77a8b0f).
⚠️ Report is 7 commits behind head on datagov.

Files with missing lines Patch % Lines
ckanext/spatial/harvesters/waf.py 0.00% 5 Missing ⚠️
❗ Your organization needs to install the Codecov GitHub app to enable full functionality.
Additional details and impacted files
@@           Coverage Diff            @@
##           datagov      #14   +/-   ##
========================================
  Coverage    42.16%   42.16%           
========================================
  Files           46       46           
  Lines         3166     3166           
========================================
  Hits          1335     1335           
  Misses        1831     1831           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@nickumia-reisys
Copy link
Copy Markdown

Do we want to write a test to make sure that the dataset url doesn't have a '1' in it after this sequence of operations? Or do we suspect this is a short term fix with upstream fixing the real problem of replacing a dataset?

@jbrown-xentity
Copy link
Copy Markdown
Author

Do we want to write a test to make sure that the dataset url doesn't have a '1' in it after this sequence of operations? Or do we suspect this is a short term fix with upstream fixing the real problem of replacing a dataset?

The issue is with the data provided itself, especially CSDGM/FGDC or ISO without a unique identifier. These WAF harvest sources can't assume the unique identifier is there because it's not required, so it uses a combination of things to test if it's "seen" this harvest object before. If the URL of the source/WAF changes, or if the title changes, then the harvester can't track it and assumes it's a "new" dataset, and doesn't find the "old" dataset and removes it. The problem is the order; since it currently removes data last then the "new" dataset has a name collision with the old one and we get a URL change for downstream users of data.gov. If we fully remove the dataset before adding the new one, the downtime is minimal and the URL should stay the same.
Upstream removed all harvest tests, and replicating those would be complex (to say the least). We'll validate that the harvesters still work in catalog.data.gov, but replicating this issue as a test would require writing custom timing code to edit the nginx harvest endpoint after first harvesting it, and then validating: a lot of work for a small issue.

@jbrown-xentity jbrown-xentity merged commit 3828c6e into datagov Oct 25, 2021
@jbrown-xentity jbrown-xentity deleted the bugfix/dataset-renaming branch October 25, 2021 16:55
@nickumia-reisys
Copy link
Copy Markdown

@FuhuXia @jbrown-xentity Did this fix the name changing issue?

@jbrown-xentity
Copy link
Copy Markdown
Author

Unknown. We could do an analysis to see how many of these exist out there, using the api (checking if last character is a 1, and scanning for how many entries, when they were created, etc). Or we could test this manually by harvesting CSDGM locally, changing the file name, and then re-harvesting to see what happens.

@jbrown-xentity
Copy link
Copy Markdown
Author

See upstream PR and comments here: ckan#261

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants