Skip to content

CLDR-19466 Generate Solid Blueprint for Decimal and Compact Decimal Format Testing Based on ICU4J#5709

Open
younies wants to merge 1 commit into
unicode-org:mainfrom
younies:decimalformatter-ai
Open

CLDR-19466 Generate Solid Blueprint for Decimal and Compact Decimal Format Testing Based on ICU4J#5709
younies wants to merge 1 commit into
unicode-org:mainfrom
younies:decimalformatter-ai

Conversation

@younies
Copy link
Copy Markdown
Member

@younies younies commented May 11, 2026

CLDR-19466

  • This PR completes the ticket.

Summary

Implementint a solid Blueprint data generation and verification tool for decimal and compact decimal formats in CLDR, based on ICU4J.

Key Changes

  • GenerateDecimalFormatTestData.java: Added a new tool to generate expected formatted numbers across various locales and dimensions (input decimal, compactness ... etc.).

Next Steps

  • Add the needed dimensions to test both decimal and compact decimal numbers.

ALLOW_MANY_COMMITS=true

@younies younies requested a review from sffc May 11, 2026 09:30
Comment thread common/testData/decimal/all_locales.tsv Outdated
@younies younies force-pushed the decimalformatter-ai branch from 08b1a58 to 3cf434d Compare May 11, 2026 09:46
@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateDecimalFormatTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestDecimalFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request May 11, 2026
@younies younies force-pushed the decimalformatter-ai branch from 3cf434d to bc739b3 Compare May 11, 2026 09:48
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request May 11, 2026
@younies younies force-pushed the decimalformatter-ai branch from bc739b3 to eb1f344 Compare May 11, 2026 10:06
@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • tools/pom.xml is no longer changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

younies added a commit to younies/cldr that referenced this pull request May 11, 2026
@younies younies force-pushed the decimalformatter-ai branch from eb1f344 to 4c6dc7c Compare May 11, 2026 12:10
@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • common/testData/decimal/decimal_all_locales.tsv is different
  • common/testData/decimal/decimal_all_numbers.tsv is different
  • common/testData/decimal/decimal_random_5percent.tsv is different
  • common/testData/decimal/decimal.tsv is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateDecimalFormatTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestDecimalFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies changed the title CLDR-19466 Generate and verify test data for decimal and compact decimal formats CLDR-19466 Generate and verify test data for decimal and compact decimal formats based on ICU4J May 11, 2026
younies added a commit to younies/cldr that referenced this pull request May 11, 2026
@younies younies force-pushed the decimalformatter-ai branch from 4c6dc7c to bc95f39 Compare May 11, 2026 12:25
@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateDecimalFormatTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestDecimalFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies requested a review from sffc May 11, 2026 15:15
@younies younies changed the title CLDR-19466 Generate and verify test data for decimal and compact decimal formats based on ICU4J CLDR-19466 Generate Solid Blueprint for Decimal and Compact Decimal Format Testing Based on ICU4J May 13, 2026
younies added a commit to younies/cldr that referenced this pull request May 13, 2026
@younies younies force-pushed the decimalformatter-ai branch from bc95f39 to 6a534e5 Compare May 13, 2026 08:36
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

Comment thread common/testData/decimal/all_locales.tsv Outdated
Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is still not including a thoroughly explored dimension for number formatting options.

younies added a commit to younies/cldr that referenced this pull request May 22, 2026
@younies younies force-pushed the decimalformatter-ai branch from ac9f9d8 to a52fcc5 Compare May 22, 2026 10:15
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies requested a review from sffc May 22, 2026 10:15
Comment on lines +459 to +461
// Write to JSON
writeJson(allLocalesCases, "all_locales.json");
writeJson(allNumbersCases, "all_numbers.json");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: please include all 5 files (the blue cube, the three red rectangular prisms, and 2% of the white cube)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(this is not done yet)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as we have discussed online, that the other dimension are too small to have a core subset, and I am postponing the 2% for now until completing all the test cases

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, you are not understanding my comment. I am expecting this PR to contain 3 nonoverlapping TSV files.

Suggestion: call them dimensionless.tsv, dimensionless_all_locales.tsv, and dimensionless_all_numbers.tsv. Then when we add currency and unit, we can create files specific to those features.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you mean by dimensionleass.tsv the core data ?

This means, we will have three files:

  1. core.tsv
  2. extended_all_locales.tsv or exteneded_locales.tsv
  3. extended_all_numbers or extended_numbers.tsv

right ?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggested dimensionless because then when we add currency and unit, we add them as different files. This file contains all the unitless / dimensionless data. (except for percent, which is dimensionless but not unitless)

Copy link
Copy Markdown
Member Author

@younies younies May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

numbers is still a dimension,

we can say in the future, currency_core.tsv, currency_extended.tsv and so on

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems nice to have

Core All Locales All Numbers
dimensionless.tsv dimensionless_all_locales.tsv dimensionless_all_numbers.tsv
currency.tsv currency_all_locales.tsv currency_all_numbers.tsv
units.tsv units_all_locales.tsv units_all_numbers.tsv

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you insist on the name dimensionless_core.tsv instead of just core.tsv, I will change the names

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main worry is that you don't have a good scheme for how we extend the file names when we support the other feature sets like currency and units.

Comment on lines +442 to +444
// Write Core to TSV (for showing)
writeTsv(coreCases, "all_locales.tsv");
writeTsv(coreCases, "all_numbers.tsv");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue: I don't like duplicating the testdata in both JSON and TSV. I would rather just come up with a good syntax that is both readable and expressive.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a small amount of duplication between the files. The .json file contains the full dataset, while the .tsv file is a subset used to display a sample of the output in a table.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. I think I still don't like having both formats. I think we are pretty close to having a good TSV format so I think we should stick with that. It is also less data size on disk.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree; the tsv format is simpler to read and more compact also.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay, I have removed the .json files and kept all the generated tests in the .tsv files

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the TSV makes your browser crash, then we could split it to more files.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like the TSV is too big for the preview to work, either. To remediate this:

  1. The core set of locales is too big. I think it should be 4-8 diverse locales.
  2. The core set of numbers is too big. I think it should be 3-6 diverse numbers.
  3. If the TSV is still too big to view, then split it into multiple partitions, such as locales_aa_ez for locales from aa to ez, etc.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the core set is very reasonable because it cover variety of numbers and locales that you can review, now you can open core.tsv and also in view mode, you can see the full table

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The TSV files are still too big to view in preview mode.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ please verify that the TSVs render as tables in preview mode. If they don't then they are too big and we need to make them smaller.

@younies younies force-pushed the decimalformatter-ai branch from a52fcc5 to 4455541 Compare May 26, 2026 13:54
@jira-pull-request-webhook
Copy link
Copy Markdown

Notice: the branch changed across the force-push!

  • common/testData/decimal/all_locales.json is different
  • common/testData/decimal/all_locales.tsv is different
  • common/testData/decimal/all_numbers.json is different
  • common/testData/decimal/all_numbers.tsv is different
  • tools/cldr-code/src/main/java/org/unicode/cldr/tool/GenerateDecimalFormatTestData.java is different
  • tools/cldr-code/src/test/java/org/unicode/cldr/unittest/TestDecimalFormat.java is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies requested a review from sffc May 26, 2026 16:26
return label;
}

public static NumberFormat fromLabel(String label) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you doing this? All enums have .valueOf(String source)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because we will need to use .valueOf(source. toUpperCase()) and also handle empty,

but it is okay, I have removed it.

com.ibm.icu.number.LocalizedNumberFormatter lnf;
switch (format) {
case DECIMAL:
if (length == Dimensions.FormatLength.SHORT) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use embedded switch instead of if statements

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@@ -0,0 +1,1201 @@
locale number_format format_length input expected
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm. If all lines have the same number of tabs, github should display this as a table.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It displays as a table in preview mode, which appears to not be available currently in the PR view.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also see it by clicking the three dots and choosing View file.
Screenshot 2026-05-27 at 08 02 06

Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

.

Copy link
Copy Markdown
Member Author

@younies younies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

okay

@younies younies requested review from macchiati and sffc May 28, 2026 10:46
@younies younies requested a review from sffc May 28, 2026 12:18
* Locales used for focused tests: major languages, several scripts, RTL locales, and
* regional shapes.
*/
private static final ImmutableSet<String> CORE_LOCALES =
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry but this is simply not a good set of locales for decimal formatting.

For example, you don't include any that use a non-latin numbering system by default.

I think a list should contain at least:

  • bn-BG because it uses non-latin by default
  • de-CH because it has interesting separators
  • Some Arabic locale, maybe ar-EG, because it uses RTL control characters
  • pr-PT because it does interesting things with currencies (not necessarily needed for decimal)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

de_CH and ar_EG are already there

I have added bn and pr-PT

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add comments justifying why each locale is included in the core set.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

^ not done yet. If you cannot justify a locale, then please delete it from the core set

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

* dimensions. For instance, decimal formatting does not need a currency dimension; currency
* formatting does.
*/
public static final class Dimensions {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Don't declare the inner classes public unless you actually need them across files.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dimensions: This class must remain public. It is referenced in TestDecimalFormat.java (which is in a different package, org.unicode.cldr.unittest).

The test uses Dimensions.NumberFormat and Dimensions.FormatLength to parse the TSV test data and call the formatting method.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok. I question whether that test class is necessary. You are generating the test data, so obviously the test data is tested.

Copy link
Copy Markdown
Member Author

@younies younies May 29, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes and it is just extra layer that the test data in the right order , in the right place, for example, if there is a slight shuffle happen in any column, it should be caught here .

@younies younies requested a review from sffc May 28, 2026 13:31
Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GitHub isn't showing all of the comments

// Fifth dimension: numbers
/** Core subset of test numbers for focused, high-signal testing. */
public static final ImmutableSet<Double> CORE_NUMBERS =
ImmutableSet.of(0.0, 1.2, 1234567.0, -1234.56);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please include at least one number that is only a fraction, like 0.00831765

Copy link
Copy Markdown
Member

@sffc sffc May 28, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, please make the last digit of the other two numbers be 5, since that is more interesting for rounding. For example, 1234565.0 and -1230.05 (interior zeros are interesting)

@macchiati
Copy link
Copy Markdown
Member

For the bigger test files, we just have 1 file per locale. That makes it easier to manage, and you should do that here

@younies younies requested a review from sffc May 28, 2026 15:34
@younies
Copy link
Copy Markdown
Member Author

younies commented May 28, 2026

For the bigger test files, we just have 1 file per locale. That makes it easier to manage, and you should do that here

@macchiati done

Copy link
Copy Markdown
Member

@sffc sffc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went over this on a call with @younies.

I want the testdata to be streamlined enough that we don't need to fall back into splitting by locale. What we do in datetime is to include only the modern locales; I think we can do that here, too. It will make a single TSV file with ~5000 lines, which is well within the limits of good performance on GitHub.

@younies
Copy link
Copy Markdown
Member Author

younies commented May 31, 2026

Now, as we are using just the modern locales and the files size are manageable, the extended test data is going to be in one file.

@younies younies force-pushed the decimalformatter-ai branch from 919b58d to 2894672 Compare May 31, 2026 11:40
@jira-pull-request-webhook
Copy link
Copy Markdown

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@younies younies requested a review from sffc May 31, 2026 11:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants