Skip to content

CLDR-18980 Tweak constructed interval formats#5773

Open
macchiati wants to merge 13 commits into
mainfrom
CLDR-18980-Tweak-constructed-interval-formats
Open

CLDR-18980 Tweak constructed interval formats#5773
macchiati wants to merge 13 commits into
mainfrom
CLDR-18980-Tweak-constructed-interval-formats

Conversation

@macchiati
Copy link
Copy Markdown
Member

@macchiati macchiati commented May 27, 2026

CLDR-18980

See #5775 for the Info Hub changes (easier to do that in Github Web).

The main changes are:

tools/cldr-code/src/main/java/org/unicode/cldr/test/CheckDates.java

  • significant fixes to checkNumericSeparators, checkForNumericSeparator, mostly improvements to messages but also cleanup
  • Now "// Check against constructed interval"

tools/cldr-code/src/main/java/org/unicode/cldr/test/ConsoleCheckCLDR.java

  • just added an environment variable to allow getting rid of the file location messages.

tools/cldr-code/src/main/java/org/unicode/cldr/util/CldrIntervalFormat.java

  • major restructuring to catch edge cases, and make the code clearer.
  • now splits the separator to handle better literals adjacent to separators

tools/cldr-code/src/main/java/org/unicode/cldr/util/DatetimeUtilities.java

  • just moved some stuff to CldrIntervalFormat

tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestDateOrder.java

  • added some tests
  • cleanup

Major refactoring. Also fixed the check for datetime numeric separators, and a bunch of the error messages.

Note

The constructed intervals can't always be right, because we can't necessarily determine whether …M#y… (where # is some punctuation or letter) is a separator or a terminator. If a separator, we want to remove it if it ends up next to the interval separator (eg '–'); if it is an adjunct like 月, we want to keep it.

However, the comparison is very useful, because it points up many gratuitous inconsistencies between the interval formats and available formats. Here is a spreadsheet of differences, with a Key.

Comparison

https://docs.google.com/spreadsheets/d/12eGl72fX4cVgOgIKk2FUPWWUtBX31WG4yLFa8juxIGs/edit

There are warning samples in the tab warning samples.

https://docs.google.com/spreadsheets/d/12eGl72fX4cVgOgIKk2FUPWWUtBX31WG4yLFa8juxIGs/edit?gid=789063264#gid=789063264

  • This PR completes the ticket.

ALLOW_MANY_COMMITS=true

@macchiati
Copy link
Copy Markdown
Member Author

Making progress, but have to iron out the tests.

@macchiati
Copy link
Copy Markdown
Member Author

@srl295
cc: @btangmu

Steven, could you look at this? I don't understand the build failures here.

When I run Maven test locally, it builds and all the tests complete with no errors, except for one that is completely independent of my changes, a problem with mul.

CLDR-18296 <https://unicode-org.atlassian.net/browse/CLDR-18296>
  - CLDR/TestSubdivisions/TestSubdivisionLocales (common/subdivisions has subdivision not in common/main)
 (Use -allKnownIssues to show all known issue sites) 

<< ALL 517 TESTS PASSED >>
...
  XmlDataSourceTest {
    TestBadElementNesting (0.010s) Passed
    TestDoubleQuotesDisallowedAsAttributeValue (0.001s) Passed
    TestNoDtdVersionPath (0.001s) Passed
    TestSimple (0.003s) Passed
  } (0.015s) Passed
} (0.952s) Passed

<< ALL 34 TESTS PASSED >>
...
  TestXPathTable {
    TestNonDistinguishing (0.000s) Passed
    TestPutGet (0.000s) Passed
    TestRemoveDraftAltProposed (0.003s) Passed
  } (0.003s) Passed
} (0.118s) Passed

<< ALL 34 TESTS PASSED >>
...
[INFO] �[1;32mTests run: �[0;1;32m3�[m, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.040 s -- in �[1morg.unicode.cldr.web.TestKeepLoggedInManager�[m
[INFO] Running org.unicode.cldr.web.�[1mTestClaGithubList�[m
May 28, 2026 11:12:59 AM org.unicode.cldr.web.ClaGithubList readSigners
INFO: Read 3 signatures
[INFO] �[1;32mTests run: �[0;1;32m3�[m, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.023 s -- in �[1morg.unicode.cldr.web.TestClaGithubList�[m...

But then

[INFO] Running org.unicode.cldr.web.�[1mTestSearchManager�[m
SurveyThread: in UNITTEST, spinning up a new DefaultThreadFactory
May 28, 2026 11:12:58 AM org.unicode.cldr.web.SearchManager getSearch
WARNING: Search uizk6r_bmtheQdAxyy5Bvb1a8P8 had exception Unable to determine the source directory for locale mul
java.util.concurrent.ExecutionException: org.unicode.cldr.util.SimpleFactory$NoSourceDirectoryException: Unable to determine the source directory for locale mul
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
...
Caused by: org.unicode.cldr.util.SimpleFactory$NoSourceDirectoryException: Unable to determine the source directory for locale mul
	at org.unicode.cldr.util.SimpleFactory.handleMake(SimpleFactory.java:561)

@srl295
Copy link
Copy Markdown
Member

srl295 commented May 28, 2026

The failure i see is missing "flen" in

Error: /home/runner/work/cldr/cldr/tools/cldr-code/src/main/java/org/unicode/cldr/util/CldrIntervalFormat.java:[266,52] cannot find symbol

Mul unrelated

@macchiati
Copy link
Copy Markdown
Member Author

Thanks that is probably it; fixed.

Very strange that Maven test didn't complain on my local system.

srl295
srl295 previously approved these changes May 30, 2026
Copy link
Copy Markdown
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approving but comments

  • we've been keeping URLs in CLDRURLS.java. Should cleanup and move the new items there.
  • why did you need to turn off locations in check, and why via a environment variable and not an option?

srl295
srl295 previously approved these changes May 30, 2026
@macchiati
Copy link
Copy Markdown
Member Author

Approving but comments

  • we've been keeping URLs in CLDRURLS.java. Should cleanup and move the new items there.

There is definitely more cleanup of the code that would be good. I tried to confine myself to what was directly involved. But if I hit this again, I can fix those if you point them out.

  • why did you need to turn off locations in check, and why via a environment variable and not an option?

It is just too painful to wade through the redundant red lines. I didn't want to turn them off by default, and wanted to talk a bit about the best way to do it.

@srl295
Copy link
Copy Markdown
Member

srl295 commented May 30, 2026

Approving but comments

  • we've been keeping URLs in CLDRURLS.java. Should cleanup and move the new items there.

There is definitely more cleanup of the code that would be good. I tried to confine myself to what was directly involved. But if I hit this again, I can fix those if you point them out.

Thinking about maintanability and finding those URLs in the same place.

  • why did you need to turn off locations in check, and why via a environment variable and not an option?

It is just too painful to wade through the redundant red lines. I didn't want to turn them off by default, and wanted to talk a bit about the best way to do it.

Makes sense. Maybe an option to aggregate by type would be better, i.e. only show each subtype once and then give a count.

@macchiati
Copy link
Copy Markdown
Member Author

Good ideas for the future.

@macchiati macchiati marked this pull request as ready for review May 31, 2026 04:08
@macchiati
Copy link
Copy Markdown
Member Author

I think this is read to merge (may need to tweak for tests).

@AEApple
Copy link
Copy Markdown
Contributor

AEApple commented Jun 1, 2026

Happy to finish reviewing the docs update for #5775 after this PR is merged and available to check on Smoke test and/or on Staging so that I can compare the docs against the final version.

@macchiati
Copy link
Copy Markdown
Member Author

I need a review on this today

Copy link
Copy Markdown
Member

@srl295 srl295 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK for now. Lots of concern about freezing "what locales do now" into a test, that seems ripe for confusion later.

public static final UnicodeSet SEPARATOR_SPACINGS =
new UnicodeSet("[\\u0020\\u00A0\\u202F\\u2009]").freeze();

public static final UnicodeSet SEPARATORS_WITH_SPACES =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this used?

Comment on lines +163 to +164
static final Pattern normal =
Pattern.compile(SEPARATOR_SPACINGS + "?" + SEPARATORS + SEPARATOR_SPACINGS + "?");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should have more explanation

return firstPattern + separatorString + secondPattern;
}

/** These may need to be adjusted if new locales do something radically different. */
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just what I was going to say. It's not great having yet more static lists buried in tests.

@srl295
Copy link
Copy Markdown
Member

srl295 commented Jun 1, 2026

I need a review on this today

done

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants