Move datasets to delete first in line#14
Conversation
We have reports of datasets that get re-harvested with an extra `1` in the URL. We have confirmed these reports. It seems the harvest is doing the best it can to diagnose if this is a new dataset or not; but still failing in some circumstances. This probably won't fix the bug; however it will mitigate it. By hopefully running through the datasets removal first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, then the name of the new dataset won't collide with the one that is marked for deleted but still in the system.
|
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## datagov #14 +/- ##
========================================
Coverage 42.16% 42.16%
========================================
Files 46 46
Lines 3166 3166
========================================
Hits 1335 1335
Misses 1831 1831 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Do we want to write a test to make sure that the dataset url doesn't have a '1' in it after this sequence of operations? Or do we suspect this is a short term fix with upstream fixing the real problem of replacing a dataset? |
The issue is with the data provided itself, especially CSDGM/FGDC or ISO without a unique identifier. These WAF harvest sources can't assume the unique identifier is there because it's not required, so it uses a combination of things to test if it's "seen" this harvest object before. If the URL of the source/WAF changes, or if the title changes, then the harvester can't track it and assumes it's a "new" dataset, and doesn't find the "old" dataset and removes it. The problem is the order; since it currently removes data last then the "new" dataset has a name collision with the old one and we get a URL change for downstream users of data.gov. If we fully remove the dataset before adding the new one, the downtime is minimal and the URL should stay the same. |
|
@FuhuXia @jbrown-xentity Did this fix the name changing issue? |
|
Unknown. We could do an analysis to see how many of these exist out there, using the api (checking if last character is a 1, and scanning for how many entries, when they were created, etc). Or we could test this manually by harvesting CSDGM locally, changing the file name, and then re-harvesting to see what happens. |
|
See upstream PR and comments here: ckan#261 |
We have reports of datasets that get re-harvested with an extra
1in the URL. We have confirmed these reports.It seems the harvest is doing the best it can to diagnose if this is a new dataset or not; but still failing in some circumstances.
This probably won't fix the bug; however it will mitigate it. By hopefully running through the datasets removal first, if the spatial harvester is essentially doing a "delete and add" when it should be replacing, then the name of the new dataset won't collide with the one that is marked for deleted but still in the system.