CLDR-19466 Generate Solid Blueprint for Decimal and Compact Decimal Format Testing Based on ICU4J#5709
CLDR-19466 Generate Solid Blueprint for Decimal and Compact Decimal Format Testing Based on ICU4J#5709younies wants to merge 1 commit into
Conversation
08b1a58 to
3cf434d
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
3cf434d to
bc739b3
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
bc739b3 to
eb1f344
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
eb1f344 to
4c6dc7c
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
4c6dc7c to
bc95f39
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
…ormat Testing Based on ICU4J See unicode-org#5709
bc95f39 to
6a534e5
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
sffc
left a comment
There was a problem hiding this comment.
This is still not including a thoroughly explored dimension for number formatting options.
…ormat Testing Based on ICU4J See unicode-org#5709
ac9f9d8 to
a52fcc5
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
| // Write to JSON | ||
| writeJson(allLocalesCases, "all_locales.json"); | ||
| writeJson(allNumbersCases, "all_numbers.json"); |
There was a problem hiding this comment.
Issue: please include all 5 files (the blue cube, the three red rectangular prisms, and 2% of the white cube)
There was a problem hiding this comment.
as we have discussed online, that the other dimension are too small to have a core subset, and I am postponing the 2% for now until completing all the test cases
There was a problem hiding this comment.
Sorry, you are not understanding my comment. I am expecting this PR to contain 3 nonoverlapping TSV files.
Suggestion: call them dimensionless.tsv, dimensionless_all_locales.tsv, and dimensionless_all_numbers.tsv. Then when we add currency and unit, we can create files specific to those features.
There was a problem hiding this comment.
do you mean by dimensionleass.tsv the core data ?
This means, we will have three files:
core.tsvextended_all_locales.tsvorexteneded_locales.tsvextended_all_numbersorextended_numbers.tsv
right ?
There was a problem hiding this comment.
I suggested dimensionless because then when we add currency and unit, we add them as different files. This file contains all the unitless / dimensionless data. (except for percent, which is dimensionless but not unitless)
There was a problem hiding this comment.
numbers is still a dimension,
we can say in the future, currency_core.tsv, currency_extended.tsv and so on
There was a problem hiding this comment.
It seems nice to have
| Core | All Locales | All Numbers |
|---|---|---|
dimensionless.tsv |
dimensionless_all_locales.tsv |
dimensionless_all_numbers.tsv |
currency.tsv |
currency_all_locales.tsv |
currency_all_numbers.tsv |
units.tsv |
units_all_locales.tsv |
units_all_numbers.tsv |
There was a problem hiding this comment.
if you insist on the name dimensionless_core.tsv instead of just core.tsv, I will change the names
There was a problem hiding this comment.
My main worry is that you don't have a good scheme for how we extend the file names when we support the other feature sets like currency and units.
| // Write Core to TSV (for showing) | ||
| writeTsv(coreCases, "all_locales.tsv"); | ||
| writeTsv(coreCases, "all_numbers.tsv"); |
There was a problem hiding this comment.
Issue: I don't like duplicating the testdata in both JSON and TSV. I would rather just come up with a good syntax that is both readable and expressive.
There was a problem hiding this comment.
There is a small amount of duplication between the files. The .json file contains the full dataset, while the .tsv file is a subset used to display a sample of the output in a table.
There was a problem hiding this comment.
I see. I think I still don't like having both formats. I think we are pretty close to having a good TSV format so I think we should stick with that. It is also less data size on disk.
There was a problem hiding this comment.
I agree; the tsv format is simpler to read and more compact also.
There was a problem hiding this comment.
okay, I have removed the .json files and kept all the generated tests in the .tsv files
There was a problem hiding this comment.
If the TSV makes your browser crash, then we could split it to more files.
There was a problem hiding this comment.
It looks like the TSV is too big for the preview to work, either. To remediate this:
- The core set of locales is too big. I think it should be 4-8 diverse locales.
- The core set of numbers is too big. I think it should be 3-6 diverse numbers.
- If the TSV is still too big to view, then split it into multiple partitions, such as
locales_aa_ezfor locales from aa to ez, etc.
There was a problem hiding this comment.
I think the core set is very reasonable because it cover variety of numbers and locales that you can review, now you can open core.tsv and also in view mode, you can see the full table
There was a problem hiding this comment.
The TSV files are still too big to view in preview mode.
There was a problem hiding this comment.
^ please verify that the TSVs render as tables in preview mode. If they don't then they are too big and we need to make them smaller.
a52fcc5 to
4455541
Compare
|
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
| return label; | ||
| } | ||
|
|
||
| public static NumberFormat fromLabel(String label) { |
There was a problem hiding this comment.
Why are you doing this? All enums have .valueOf(String source)
There was a problem hiding this comment.
because we will need to use .valueOf(source. toUpperCase()) and also handle empty,
but it is okay, I have removed it.
| com.ibm.icu.number.LocalizedNumberFormatter lnf; | ||
| switch (format) { | ||
| case DECIMAL: | ||
| if (length == Dimensions.FormatLength.SHORT) { |
There was a problem hiding this comment.
Use embedded switch instead of if statements
| @@ -0,0 +1,1201 @@ | |||
| locale number_format format_length input expected | |||
There was a problem hiding this comment.
Hmmm. If all lines have the same number of tabs, github should display this as a table.
There was a problem hiding this comment.
It displays as a table in preview mode, which appears to not be available currently in the PR view.
| * Locales used for focused tests: major languages, several scripts, RTL locales, and | ||
| * regional shapes. | ||
| */ | ||
| private static final ImmutableSet<String> CORE_LOCALES = |
There was a problem hiding this comment.
Sorry but this is simply not a good set of locales for decimal formatting.
For example, you don't include any that use a non-latin numbering system by default.
I think a list should contain at least:
- bn-BG because it uses non-latin by default
- de-CH because it has interesting separators
- Some Arabic locale, maybe ar-EG, because it uses RTL control characters
- pr-PT because it does interesting things with currencies (not necessarily needed for decimal)
There was a problem hiding this comment.
de_CH and ar_EG are already there
I have added bn and pr-PT
There was a problem hiding this comment.
Please add comments justifying why each locale is included in the core set.
There was a problem hiding this comment.
^ not done yet. If you cannot justify a locale, then please delete it from the core set
| * dimensions. For instance, decimal formatting does not need a currency dimension; currency | ||
| * formatting does. | ||
| */ | ||
| public static final class Dimensions { |
There was a problem hiding this comment.
Nit: Don't declare the inner classes public unless you actually need them across files.
There was a problem hiding this comment.
Dimensions: This class must remain public. It is referenced in TestDecimalFormat.java (which is in a different package, org.unicode.cldr.unittest).
The test uses Dimensions.NumberFormat and Dimensions.FormatLength to parse the TSV test data and call the formatting method.
There was a problem hiding this comment.
ok. I question whether that test class is necessary. You are generating the test data, so obviously the test data is tested.
There was a problem hiding this comment.
Yes and it is just extra layer that the test data in the right order , in the right place, for example, if there is a slight shuffle happen in any column, it should be caught here .
sffc
left a comment
There was a problem hiding this comment.
GitHub isn't showing all of the comments
| // Fifth dimension: numbers | ||
| /** Core subset of test numbers for focused, high-signal testing. */ | ||
| public static final ImmutableSet<Double> CORE_NUMBERS = | ||
| ImmutableSet.of(0.0, 1.2, 1234567.0, -1234.56); |
There was a problem hiding this comment.
Please include at least one number that is only a fraction, like 0.00831765
There was a problem hiding this comment.
Also, please make the last digit of the other two numbers be 5, since that is more interesting for rounding. For example, 1234565.0 and -1230.05 (interior zeros are interesting)
|
For the bigger test files, we just have 1 file per locale. That makes it easier to manage, and you should do that here |
@macchiati done |
sffc
left a comment
There was a problem hiding this comment.
I went over this on a call with @younies.
I want the testdata to be streamlined enough that we don't need to fall back into splitting by locale. What we do in datetime is to include only the modern locales; I think we can do that here, too. It will make a single TSV file with ~5000 lines, which is well within the limits of good performance on GitHub.
|
Now, as we are using just the modern locales and the files size are manageable, the extended test data is going to be in one file. |
…ormat Testing Based on ICU4J See unicode-org#5709
919b58d to
2894672
Compare
|
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |

CLDR-19466
Summary
Implementint a solid Blueprint data generation and verification tool for decimal and compact decimal formats in CLDR, based on ICU4J.
Key Changes
Next Steps
ALLOW_MANY_COMMITS=true