Move datasets to delete first in line by jbrown-xentity · Pull Request #14 · GSA/ckanext-spatial

jbrown-xentity · 2021-10-22T19:54:07Z

We have reports of datasets that get re-harvested with an extra `1` in the URL, and we have confirmed them. The harvester does its best to determine whether a dataset is new, but it still fails in some circumstances.
This change probably won't fix the bug, but it should mitigate it. If the spatial harvester is effectively doing a "delete and add" when it should be replacing, then running the dataset removals first means the name of the new dataset won't collide with the one that is marked for deletion but still in the system.

codecov-commenter · 2021-10-22T19:58:46Z


Codecov Report

❌ Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 42.16%. Comparing base (9354d52) to head (77a8b0f).
⚠️ Report is 7 commits behind head on datagov.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ckanext/spatial/harvesters/waf.py | 0.00% | 5 Missing ⚠️ |

Additional details and impacted files

@@           Coverage Diff            @@
##           datagov      #14   +/-   ##
========================================
  Coverage    42.16%   42.16%           
========================================
  Files           46       46           
  Lines         3166     3166           
========================================
  Hits          1335     1335           
  Misses        1831     1831


nickumia-reisys · 2021-10-25T14:37:03Z

Do we want to write a test to make sure that the dataset url doesn't have a '1' in it after this sequence of operations? Or do we suspect this is a short term fix with upstream fixing the real problem of replacing a dataset?

jbrown-xentity · 2021-10-25T16:55:07Z

> Do we want to write a test to make sure that the dataset url doesn't have a '1' in it after this sequence of operations? Or do we suspect this is a short term fix with upstream fixing the real problem of replacing a dataset?

The issue is with the provided data itself, especially CSDGM/FGDC or ISO records without a unique identifier. WAF harvest sources can't assume the unique identifier is present because it isn't required, so the harvester uses a combination of other attributes to test whether it has "seen" a harvest object before. If the URL of the source/WAF changes, or if the title changes, the harvester can't track the record: it treats it as a "new" dataset, no longer finds the "old" dataset in the source, and removes it. The problem is the order. Because removals currently run last, the "new" dataset has a name collision with the old one, and downstream users of data.gov get a URL change. If we fully remove the old dataset before adding the new one, the downtime is minimal and the URL should stay the same.
Upstream removed all harvest tests, and replicating those would be complex (to say the least). We'll validate that the harvesters still work in catalog.data.gov, but replicating this issue as a test would require writing custom timing code to edit the nginx harvest endpoint after first harvesting it, and then validating: a lot of work for a small issue.
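The ordering problem can be sketched with a toy in-memory catalog. This is not the CKAN or harvester code: `ToyCatalog`, `slugify`, `run`, and `deletes_first` are hypothetical names, and both the suffix-on-collision rule and the idea that a removal frees the name immediately are assumptions modeled on the behavior described above.

```python
import re


def slugify(title):
    """Turn a title into a URL-style name (toy version)."""
    return re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")


class ToyCatalog:
    """Hypothetical stand-in for the catalog; not the CKAN API."""

    def __init__(self):
        self.names = set()

    def add(self, title):
        base = slugify(title)
        candidate = base
        suffix = 0
        # Assumption: a name collision is resolved with a numeric suffix.
        while candidate in self.names:
            suffix += 1
            candidate = f"{base}{suffix}"
        self.names.add(candidate)
        return candidate

    def remove(self, name):
        # Assumption: removal frees the name right away.
        self.names.discard(name)


def run(actions):
    """Apply harvest actions in order; return the names that were created."""
    catalog = ToyCatalog()
    catalog.add("Soil Survey")  # dataset left over from the previous harvest
    created = []
    for kind, value in actions:
        if kind == "delete":
            catalog.remove(value)
        else:
            created.append(catalog.add(value))
    return created


def deletes_first(actions):
    """Stable sort that moves every delete ahead of every add."""
    return sorted(actions, key=lambda action: action[0] != "delete")


# The harvester lost track of the record, so it queues an add and a delete.
actions = [("add", "Soil Survey"), ("delete", "soil-survey")]
print(run(actions))                 # ['soil-survey1']  (add ran first)
print(run(deletes_first(actions)))  # ['soil-survey']   (delete ran first)
```

In this model only the order changes: the same two actions keep the original name once the delete runs before the add.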

nickumia-reisys · 2022-10-24T12:13:43Z

@FuhuXia @jbrown-xentity Did this fix the name changing issue?

jbrown-xentity · 2022-10-24T15:15:35Z

Unknown. We could analyze how many of these exist using the API (checking whether the last character of the name is a `1`, counting the matching entries, checking when they were created, etc.). Or we could test this manually by harvesting CSDGM locally, changing the file name, and then re-harvesting to see what happens.
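A rough sketch of what that API analysis might look like, assuming the catalog exposes the standard CKAN `package_search` action and that each result carries `name` and `metadata_created`. The helper names are hypothetical, and a trailing `1` is only a heuristic: names that legitimately end in `1` will show up too and would need manual review.

```python
import json
import urllib.parse
import urllib.request
from collections import Counter


def fetch_packages(base_url, rows=100, start=0):
    """Fetch one page of datasets through CKAN's package_search action."""
    query = urllib.parse.urlencode({"rows": rows, "start": start})
    url = f"{base_url}/api/3/action/package_search?{query}"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["result"]["results"]


def suspect_renames(packages):
    """Keep datasets whose name ends in '1' (heuristic, expect false positives)."""
    return [pkg for pkg in packages if pkg["name"].endswith("1")]


def created_by_month(packages):
    """Count datasets per creation month, using the YYYY-MM timestamp prefix."""
    return Counter(pkg["metadata_created"][:7] for pkg in packages)


# Offline example with made-up records shaped like package_search results.
sample = [
    {"name": "soil-survey1", "metadata_created": "2021-09-03T10:00:00"},
    {"name": "soil-survey", "metadata_created": "2020-01-15T08:30:00"},
    {"name": "wetlands-map1", "metadata_created": "2021-09-20T12:00:00"},
]
suspects = suspect_renames(sample)
print([pkg["name"] for pkg in suspects])  # ['soil-survey1', 'wetlands-map1']
print(created_by_month(suspects))         # Counter({'2021-09': 2})
```

Against a live catalog, `fetch_packages` would be called in a loop with increasing `start` values until a page comes back empty; it is left uncalled here so the example runs without network access.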

jbrown-xentity · 2022-10-24T15:17:12Z

See upstream PR and comments here: ckan#261

jbrown-xentity requested a review from a team October 22, 2021 19:54

jbrown-xentity merged commit 3828c6e into datagov Oct 25, 2021

jbrown-xentity deleted the bugfix/dataset-renaming branch October 25, 2021 16:55

nickumia-reisys mentioned this pull request Oct 24, 2022

Reintegrate ckanext-spatial upstream into catalog.data.gov GSA/data.gov#3938

Closed

3 tasks
