Fixing #894 and correcting tests#901
Fixing #894 and correcting tests#901gavishpoddar wants to merge 28 commits intoscrapinghub:masterfrom gavishpoddar:master
Conversation
It seems removing this line resolves the issue #894 : search_dates() wrong result for 12:xx am
Codecov Report
@@ Coverage Diff @@
## master #901 +/- ##
=======================================
Coverage 98.26% 98.26%
=======================================
Files 231 231
Lines 2597 2599 +2
=======================================
+ Hits 2552 2554 +2
Misses 45 45
Continue to review full report at Codecov.
|
noviluni
left a comment
There was a problem hiding this comment.
I added some doubts I have. I'm not also sure if we can just modify the parse_item and the parse_found_objects signature directly...
|
Put together The German language date parsing needs some correction but this PR fixes |
Co-authored-by: Marc Hernández <noviluni@gmail.com>
|
Hi, @noviluni I have made the changes, and thanks for the review |
dateparser/search/search.py
Outdated
| if language == "de": | ||
| item = item.replace('am', '') |
There was a problem hiding this comment.
| if language == "de": | |
| item = item.replace('am', '') | |
| if language == "de": | |
| # Replace "am" because "am 8" means "on the 8th" but "8 am" actually | |
| # means "8 am" | |
| item = item.replace('am', '') |
There was a problem hiding this comment.
Hmmmm... Maybe we can use regex (re.sub) to remove it only when preceding a number? 🤔
There was a problem hiding this comment.
Hi @noviluni, I have made some can you please check and recommend
| ('June 5 am utc', datetime.datetime(2023, 6, datetime.datetime.utcnow().day, 5, 0, tzinfo=pytz.utc)), | ||
| ('June 23th 5 pm EST', datetime.datetime(2023, 6, 23, 17, 0, tzinfo=pytz.timezone("EST"))), | ||
| ('May 31', datetime.datetime(2023, 5, 31, 0, 0)), | ||
| ('8am UTC', datetime.datetime(2023, 8, 31, 0, 0, tzinfo=pytz.utc))]), | ||
| ('8am UTC', datetime.datetime(2023, 5, 31, 8, 0, tzinfo=pytz.utc))]), |
There was a problem hiding this comment.
I think the parsing of 8am UTC is a change for the better, but I’m not sure about June 5 am utc. The previous interpretation seems OK to me, “June 5th in the morning UTC”. In fact, I think it may even be better, because I suspect leaving a unit missing in the middle (i.e. having month and hour, but missing day) may be less likely than the other interpretation (month and day, am meaning sometime in the morning).
I checked and that test string was added as part of a performance improvement, #881, and I imagine it does not come from an actual case found in the internet, which makes it hard to argue either way.
There was a problem hiding this comment.
There was a problem hiding this comment.
Well, I guess you are right that fixing it is out of the scope of this change and should be a separate issue. It was just bad luck that your change broke it, or rather it was just dumb luck that it was working in the first place.
However, what I would do here is move that scenario into a separate test, with the previous value, and mark the test as an expected failure (xfail), to make it clear that the expected outcome is still the old one, only that Dateparser does not support it yet.
There was a problem hiding this comment.
Great after this PR is merged I will create an issue and create a supporting PR
There was a problem hiding this comment.
Don’t forget the xfail part, though, I do think that needs to be addressed in this pull request, instead of changing the test expectations.
There was a problem hiding this comment.
I am working on it now and by the time you create separate test, with the previous value, and mark the test as an expected failure (xfail) it should be done
There was a problem hiding this comment.
by the time you create separate test
I was hoping for you to do that, as it needs to be done as part of this pull request (which is the one that causes the test to fail).
There was a problem hiding this comment.
It seems it is broken throughout search_dates and parse function as with other cases like
June 5th pm EST ---> [('June 5th pm EST', datetime.datetime(2021, 6, 24, 17, 0, tzinfo=<StaticTzInfo 'EST'>))]
or
June 5th pm EST ---> ('June 9 pm', datetime.datetime(2021, 6, 24, 21, 0))]
Which is incorrect. I am working on the patch but it will require some time and significant change in other parts of the library and many tests will be broken because they are also incorrect. Or the current interpretation is correct.
|
Thanks, for your support and suggestions these improvements are now made on PR #945. |
Hi, @Gallaecio ,
I have fixed #894 and submitted PR #900 but it failed because the tests didn't supported
amso tests failed.In this PR the tests and #894 is fixed.