gh-74902: Add Unicode Grapheme Cluster Break algorithm #143076
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -184,6 +184,28 @@ following functions: | |||||
| '0041 0303' | ||||||
|
|
||||||
|
|
||||||
| .. function:: grapheme_cluster_break(chr, /) | ||||||
|
|
||||||
| Returns the Grapheme_Cluster_Break property assigned to the character. | ||||||
|
|
||||||
| .. versionadded:: next | ||||||
|
|
||||||
|
|
||||||
| .. function:: indic_conjunct_break(chr, /) | ||||||
|
|
||||||
| Returns the Indic_Conjunct_Break property assigned to the character. | ||||||
|
|
||||||
| .. versionadded:: next | ||||||
|
|
||||||
|
|
||||||
| .. function:: extended_pictographic(chr, /) | ||||||
|
|
||||||
| Returns ``True`` if the character has the Extended_Pictographic property, | ||||||
| ``False`` otherwise. | ||||||
|
|
||||||
| .. versionadded:: next | ||||||
|
|
||||||
|
|
||||||
| .. function:: normalize(form, unistr, /) | ||||||
|
|
||||||
| Return the normal form *form* for the Unicode string *unistr*. Valid values for | ||||||
|
|
@@ -225,6 +247,24 @@ following functions: | |||||
| .. versionadded:: 3.8 | ||||||
|
|
||||||
|
|
||||||
| .. function:: iter_graphemes(unistr, start=0, end=sys.maxsize, /) | ||||||
|
|
||||||
| Returns an iterator to iterate over grapheme clusters. | ||||||
| With optional *start*, iteration beginning at that position. | ||||||
|
Member
Suggested change
|
||||||
| With optional *end*, iteration stops at that position. | ||||||
|
|
||||||
| Converting an emitted item to string returns a substring corresponding to | ||||||
|
Member
Suggested change
|
||||||
| the grapheme cluster. | ||||||
| Its ``start`` and ``end`` attributes denote the start and the end of | ||||||
|
Member
Suggested change
It is unnecessary here. |
||||||
| the grapheme cluster. | ||||||
|
|
||||||
| It uses extended grapheme cluster rules defined by Unicode | ||||||
| Standard Annex #29, `"Unicode Text Segmentation" | ||||||
| <https://www.unicode.org/reports/tr29/>`_. | ||||||
|
|
||||||
| .. versionadded:: next | ||||||
|
|
||||||
|
|
||||||
| In addition, the module exposes the following constant: | ||||||
|
|
||||||
| .. data:: unidata_version | ||||||
|
|
@@ -234,7 +274,7 @@ In addition, the module exposes the following constant: | |||||
|
|
||||||
| .. data:: ucd_3_2_0 | ||||||
|
|
||||||
| This is an object that has the same methods as the entire module, but uses the | ||||||
| This is an object that has most of the methods of the entire module, but uses the | ||||||
|
Member
This sentence is not fully right, but I can’t find the right suggestion with both «most of» and «same as». |
||||||
| Unicode database version 3.2 instead, for applications that require this | ||||||
| specific version of the Unicode database (such as IDNA). | ||||||
|
|
||||||
|
|
||||||
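
A minimal usage sketch of the API documented in this diff, assuming the names shown above (`iter_graphemes`, and the `start` and `end` attributes on each emitted item). The helper name `grapheme_spans` is hypothetical, and the `getattr` guard is only there because interpreters that predate this change do not have the function.

```python
import unicodedata


def grapheme_spans(text, *bounds):
    """Return (start, end, cluster) triples for each grapheme cluster in text.

    `bounds` is forwarded as the optional start/end positions.
    Returns None when this interpreter lacks unicodedata.iter_graphemes,
    the function added by this pull request.
    """
    iter_graphemes = getattr(unicodedata, "iter_graphemes", None)
    if iter_graphemes is None:
        return None
    # str(item) yields the substring for the cluster; item.start and
    # item.end are its boundaries, as described in the documentation diff.
    return [(g.start, g.end, str(g)) for g in iter_graphemes(text, *bounds)]


spans = grapheme_spans("a\u0300\r\nb")
if spans is None:
    print("unicodedata.iter_graphemes is not available in this build")
else:
    for start, end, cluster in spans:
        print(start, end, ascii(cluster))
```

The sample string combines two cases exercised by the tests in this PR: a base letter followed by a combining mark (GB9) and a CR LF pair (GB3), each of which should come back as a single cluster.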
| Original file line number | Diff line number | Diff line change | ||||
|---|---|---|---|---|---|---|
|
|
@@ -616,6 +616,216 @@ def test_isxidcontinue(self): | |||||
| self.assertRaises(TypeError, self.db.isxidcontinue) | ||||||
| self.assertRaises(TypeError, self.db.isxidcontinue, 'xx') | ||||||
|
|
||||||
| def test_grapheme_cluster_break(self): | ||||||
| gcb = self.db.grapheme_cluster_break | ||||||
| self.assertEqual(gcb(' '), 'Other') | ||||||
| self.assertEqual(gcb('x'), 'Other') | ||||||
| self.assertEqual(gcb('\U0010FFFF'), 'Other') | ||||||
| self.assertEqual(gcb('\r'), 'CR') | ||||||
| self.assertEqual(gcb('\n'), 'LF') | ||||||
| self.assertEqual(gcb('\0'), 'Control') | ||||||
| self.assertEqual(gcb('\t'), 'Control') | ||||||
| self.assertEqual(gcb('\x1F'), 'Control') | ||||||
| self.assertEqual(gcb('\x7F'), 'Control') | ||||||
| self.assertEqual(gcb('\x9F'), 'Control') | ||||||
| self.assertEqual(gcb('\U000E0001'), 'Control') | ||||||
| self.assertEqual(gcb('\u0300'), 'Extend') | ||||||
| self.assertEqual(gcb('\u200C'), 'Extend') | ||||||
| self.assertEqual(gcb('\U000E01EF'), 'Extend') | ||||||
| self.assertEqual(gcb('\u1159'), 'L') | ||||||
| self.assertEqual(gcb('\u11F9'), 'T') | ||||||
| self.assertEqual(gcb('\uD788'), 'LV') | ||||||
| self.assertEqual(gcb('\uD7A3'), 'LVT') | ||||||
| # New in 5.0.0 | ||||||
| self.assertEqual(gcb('\u05BA'), 'Extend') | ||||||
| self.assertEqual(gcb('\u20EF'), 'Extend') | ||||||
| # New in 5.1.0 | ||||||
| self.assertEqual(gcb('\u2064'), 'Control') | ||||||
| self.assertEqual(gcb('\uAA4D'), 'SpacingMark') | ||||||
| # New in 5.2.0 | ||||||
| self.assertEqual(gcb('\u0816'), 'Extend') | ||||||
| self.assertEqual(gcb('\uA97C'), 'L') | ||||||
| self.assertEqual(gcb('\uD7C6'), 'V') | ||||||
| self.assertEqual(gcb('\uD7FB'), 'T') | ||||||
| # New in 6.0.0 | ||||||
| self.assertEqual(gcb('\u093A'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011002'), 'SpacingMark') | ||||||
| # New in 6.1.0 | ||||||
| self.assertEqual(gcb('\U000E0FFF'), 'Control') | ||||||
| self.assertEqual(gcb('\U00016F7E'), 'SpacingMark') | ||||||
| # New in 6.2.0 | ||||||
| self.assertEqual(gcb('\U0001F1E6'), 'Regional_Indicator') | ||||||
| self.assertEqual(gcb('\U0001F1FF'), 'Regional_Indicator') | ||||||
| # New in 6.3.0 | ||||||
| self.assertEqual(gcb('\u180E'), 'Control') | ||||||
| self.assertEqual(gcb('\u1A1B'), 'Extend') | ||||||
| # New in 7.0.0 | ||||||
| self.assertEqual(gcb('\u0E33'), 'SpacingMark') | ||||||
| self.assertEqual(gcb('\u0EB3'), 'SpacingMark') | ||||||
| self.assertEqual(gcb('\U0001BCA3'), 'Control') | ||||||
| self.assertEqual(gcb('\U0001E8D6'), 'Extend') | ||||||
| self.assertEqual(gcb('\U0001163E'), 'SpacingMark') | ||||||
| # New in 8.0.0 | ||||||
| self.assertEqual(gcb('\u08E3'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011726'), 'SpacingMark') | ||||||
| # New in 9.0.0 | ||||||
| self.assertEqual(gcb('\u0600'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U000E007F'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011CB4'), 'SpacingMark') | ||||||
| self.assertEqual(gcb('\u200D'), 'ZWJ') | ||||||
| # New in 10.0.0 | ||||||
| self.assertEqual(gcb('\U00011D46'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U00011D47'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011A97'), 'SpacingMark') | ||||||
| # New in 11.0.0 | ||||||
| self.assertEqual(gcb('\U000110CD'), 'Prepend') | ||||||
| self.assertEqual(gcb('\u07FD'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011EF6'), 'SpacingMark') | ||||||
| # New in 12.0.0 | ||||||
| self.assertEqual(gcb('\U00011A84'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U00013438'), 'Control') | ||||||
| self.assertEqual(gcb('\U0001E2EF'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00016F87'), 'SpacingMark') | ||||||
| # New in 13.0.0 | ||||||
| self.assertEqual(gcb('\U00011941'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U00016FE4'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011942'), 'SpacingMark') | ||||||
| # New in 14.0.0 | ||||||
| self.assertEqual(gcb('\u0891'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U0001E2AE'), 'Extend') | ||||||
| # New in 15.0.0 | ||||||
| self.assertEqual(gcb('\U00011F02'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U0001343F'), 'Control') | ||||||
| self.assertEqual(gcb('\U0001E4EF'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011F3F'), 'SpacingMark') | ||||||
| # New in 16.0.0 | ||||||
| self.assertEqual(gcb('\U000113D1'), 'Prepend') | ||||||
| self.assertEqual(gcb('\U0001E5EF'), 'Extend') | ||||||
| self.assertEqual(gcb('\U0001612C'), 'SpacingMark') | ||||||
| self.assertEqual(gcb('\U00016D63'), 'V') | ||||||
| # New in 17.0.0 | ||||||
| self.assertEqual(gcb('\u1AEB'), 'Extend') | ||||||
| self.assertEqual(gcb('\U00011B67'), 'SpacingMark') | ||||||
|
|
||||||
| self.assertRaises(TypeError, gcb) | ||||||
| self.assertRaises(TypeError, gcb, b'x') | ||||||
| self.assertRaises(TypeError, gcb, 120) | ||||||
| self.assertRaises(TypeError, gcb, '') | ||||||
| self.assertRaises(TypeError, gcb, 'xx') | ||||||
|
|
||||||
| def test_indic_conjunct_break(self): | ||||||
| incb = self.db.indic_conjunct_break | ||||||
| self.assertEqual(incb(' '), 'None') | ||||||
| self.assertEqual(incb('x'), 'None') | ||||||
| self.assertEqual(incb('\U0010FFFF'), 'None') | ||||||
| # New in 15.1.0 | ||||||
| self.assertEqual(incb('\u094D'), 'Linker') | ||||||
| self.assertEqual(incb('\u0D4D'), 'Linker') | ||||||
| self.assertEqual(incb('\u0915'), 'Consonant') | ||||||
| self.assertEqual(incb('\u0D3A'), 'Consonant') | ||||||
| self.assertEqual(incb('\u0300'), 'Extend') | ||||||
| self.assertEqual(incb('\U0001E94A'), 'Extend') | ||||||
| # New in 16.0.0 | ||||||
| self.assertEqual(incb('\u034F'), 'Extend') | ||||||
| self.assertEqual(incb('\U000E01EF'), 'Extend') | ||||||
| # New in 17.0.0 | ||||||
| self.assertEqual(incb('\u1039'), 'Linker') | ||||||
| self.assertEqual(incb('\U00011F42'), 'Linker') | ||||||
| self.assertEqual(incb('\u1000'), 'Consonant') | ||||||
| self.assertEqual(incb('\U00011F33'), 'Consonant') | ||||||
| self.assertEqual(incb('\U0001E6F5'), 'Extend') | ||||||
|
|
||||||
| self.assertRaises(TypeError, incb) | ||||||
| self.assertRaises(TypeError, incb, b'x') | ||||||
| self.assertRaises(TypeError, incb, 120) | ||||||
| self.assertRaises(TypeError, incb, '') | ||||||
| self.assertRaises(TypeError, incb, 'xx') | ||||||
|
|
||||||
| def test_extended_pictographic(self): | ||||||
| ext_pict = self.db.extended_pictographic | ||||||
| self.assertIs(ext_pict(' '), False) | ||||||
| self.assertIs(ext_pict('x'), False) | ||||||
| self.assertIs(ext_pict('\U0010FFFF'), False) | ||||||
| # New in 13.0.0 | ||||||
| self.assertIs(ext_pict('\xA9'), True) | ||||||
| self.assertIs(ext_pict('\u203C'), True) | ||||||
| self.assertIs(ext_pict('\U0001FAD6'), True) | ||||||
| self.assertIs(ext_pict('\U0001FFFD'), True) | ||||||
| # New in 17.0.0 | ||||||
| self.assertIs(ext_pict('\u2388'), False) | ||||||
| self.assertIs(ext_pict('\U0001FA6D'), False) | ||||||
|
|
||||||
| self.assertRaises(TypeError, ext_pict) | ||||||
| self.assertRaises(TypeError, ext_pict, b'x') | ||||||
| self.assertRaises(TypeError, ext_pict, 120) | ||||||
| self.assertRaises(TypeError, ext_pict, '') | ||||||
| self.assertRaises(TypeError, ext_pict, 'xx') | ||||||
|
|
||||||
| def test_grapheme_break(self): | ||||||
| def graphemes(*args): | ||||||
| return list(map(str, self.db.iter_graphemes(*args))) | ||||||
|
|
||||||
| self.assertRaises(TypeError, self.db.iter_graphemes) | ||||||
| self.assertRaises(TypeError, self.db.iter_graphemes, b'x') | ||||||
| self.assertRaises(TypeError, self.db.iter_graphemes, 'x', 0, 0, 0) | ||||||
|
|
||||||
| self.assertEqual(graphemes(''), []) | ||||||
| self.assertEqual(graphemes('abcd'), ['a', 'b', 'c', 'd']) | ||||||
| self.assertEqual(graphemes('abcd', 1), ['b', 'c', 'd']) | ||||||
| self.assertEqual(graphemes('abcd', 1, 3), ['b', 'c']) | ||||||
| self.assertEqual(graphemes('abcd', -3), ['b', 'c', 'd']) | ||||||
| self.assertEqual(graphemes('abcd', 1, -1), ['b', 'c']) | ||||||
| self.assertEqual(graphemes('abcd', 3, 1), []) | ||||||
| self.assertEqual(graphemes('abcd', 5), []) | ||||||
| self.assertEqual(graphemes('abcd', 0, 5), ['a', 'b', 'c', 'd']) | ||||||
| self.assertEqual(graphemes('abcd', -5), ['a', 'b', 'c', 'd']) | ||||||
| self.assertEqual(graphemes('abcd', 0, -5), []) | ||||||
| # GB3 | ||||||
| self.assertEqual(graphemes('\r\n'), ['\r\n']) | ||||||
| # GB4 | ||||||
| self.assertEqual(graphemes('\r\u0308'), ['\r', '\u0308']) | ||||||
| self.assertEqual(graphemes('\n\u0308'), ['\n', '\u0308']) | ||||||
| self.assertEqual(graphemes('\0\u0308'), ['\0', '\u0308']) | ||||||
| # GB5 | ||||||
| self.assertEqual(graphemes('\u06dd\r'), ['\u06dd', '\r']) | ||||||
| self.assertEqual(graphemes('\u06dd\n'), ['\u06dd', '\n']) | ||||||
| self.assertEqual(graphemes('\u06dd\0'), ['\u06dd', '\0']) | ||||||
| # GB6 | ||||||
| self.assertEqual(graphemes('\u1100\u1160'), ['\u1100\u1160']) | ||||||
| self.assertEqual(graphemes('\u1100\uAC00'), ['\u1100\uAC00']) | ||||||
| self.assertEqual(graphemes('\u1100\uAC01'), ['\u1100\uAC01']) | ||||||
| # GB7 | ||||||
| self.assertEqual(graphemes('\uAC00\u1160'), ['\uAC00\u1160']) | ||||||
| self.assertEqual(graphemes('\uAC00\u11A8'), ['\uAC00\u11A8']) | ||||||
| self.assertEqual(graphemes('\u1160\u1160'), ['\u1160\u1160']) | ||||||
| self.assertEqual(graphemes('\u1160\u11A8'), ['\u1160\u11A8']) | ||||||
| # GB8 | ||||||
| self.assertEqual(graphemes('\uAC01\u11A8'), ['\uAC01\u11A8']) | ||||||
| self.assertEqual(graphemes('\u11A8\u11A8'), ['\u11A8\u11A8']) | ||||||
| # GB9 | ||||||
| self.assertEqual(graphemes('a\u0300'), ['a\u0300']) | ||||||
| self.assertEqual(graphemes('a\u200D'), ['a\u200D']) | ||||||
| # GB9a | ||||||
| self.assertEqual(graphemes('\u0905\u0903'), ['\u0905\u0903']) | ||||||
| # GB9b | ||||||
| self.assertEqual(graphemes('\u06dd\u0661'), ['\u06dd\u0661']) | ||||||
| # GB9c | ||||||
| self.assertEqual(graphemes('\u0915\u094d\u0924'), | ||||||
| ['\u0915\u094d\u0924']) | ||||||
| self.assertEqual(graphemes('\u0915\u094D\u094D\u0924'), | ||||||
| ['\u0915\u094D\u094D\u0924']) | ||||||
| self.assertEqual(graphemes('\u0915\u094D\u0924\u094D\u092F'), | ||||||
| ['\u0915\u094D\u0924\u094D\u092F']) | ||||||
| # GB11 | ||||||
| self.assertEqual(graphemes( | ||||||
| '\U0001F9D1\U0001F3FE\u200D\u2764\uFE0F' | ||||||
| '\u200D\U0001F48B\u200D\U0001F9D1\U0001F3FC'), | ||||||
| ['\U0001F9D1\U0001F3FE\u200D\u2764\uFE0F' | ||||||
| '\u200D\U0001F48B\u200D\U0001F9D1\U0001F3FC']) | ||||||
| # GB11 | ||||||
|
Member
Above is also |
||||||
| self.assertEqual(graphemes('\U0001F1FA\U0001F1E6'), | ||||||
| ['\U0001F1FA\U0001F1E6']) | ||||||
|
|
||||||
|
|
||||||
| class Unicode_3_2_0_FunctionsTest(UnicodeFunctionsTest): | ||||||
| db = unicodedata.ucd_3_2_0 | ||||||
|
|
@@ -624,6 +834,11 @@ class Unicode_3_2_0_FunctionsTest(UnicodeFunctionsTest): | |||||
| if quicktest else | ||||||
| 'f217b8688d7bdff31db4207e078a96702f091597') | ||||||
|
|
||||||
| test_grapheme_cluster_break = None | ||||||
| test_indic_conjunct_break = None | ||||||
| test_extended_pictographic = None | ||||||
| test_grapheme_break = None | ||||||
|
|
||||||
|
|
||||||
| class UnicodeMiscTest(unittest.TestCase): | ||||||
| db = unicodedata | ||||||
|
|
@@ -848,5 +1063,61 @@ class MyStr(str): | |||||
| self.assertIs(type(normalize(form, MyStr(input_str))), str) | ||||||
|
|
||||||
|
|
||||||
| class GraphemeBreakTest(unittest.TestCase): | ||||||
| @staticmethod | ||||||
| def check_version(testfile): | ||||||
| hdr = testfile.readline() | ||||||
| return unicodedata.unidata_version in hdr | ||||||
|
|
||||||
| @requires_resource('network') | ||||||
|
Member
Should it not be |
||||||
| def test_grapheme_break(self): | ||||||
| TESTDATAFILE = "auxiliary/GraphemeBreakTest.txt" | ||||||
| TESTDATAURL = f"https://www.unicode.org/Public/{unicodedata.unidata_version}/ucd/{TESTDATAFILE}" | ||||||
|
|
||||||
| # Hit the exception early | ||||||
| try: | ||||||
| testdata = open_urlresource(TESTDATAURL, encoding="utf-8", | ||||||
| check=self.check_version) | ||||||
| except PermissionError: | ||||||
| self.skipTest(f"Permission error when downloading {TESTDATAURL} " | ||||||
| f"into the test data directory") | ||||||
| except (OSError, HTTPException) as exc: | ||||||
| self.skipTest(f"Failed to download {TESTDATAURL}: {exc}") | ||||||
|
|
||||||
| with testdata: | ||||||
| self.run_grapheme_break_tests(testdata, unicodedata) | ||||||
|
|
||||||
| def run_grapheme_break_tests(self, testdata, ucd): | ||||||
| for line in testdata: | ||||||
| line, _, comment = line.partition('#') | ||||||
| line = line.strip() | ||||||
| if not line: | ||||||
| continue | ||||||
| comment = comment.strip() | ||||||
|
|
||||||
| chunks = [] | ||||||
| breaks = [] | ||||||
| pos = 0 | ||||||
| for field in line.replace('×', ' ').split(): | ||||||
| if field == '÷': | ||||||
| chunks.append('') | ||||||
| breaks.append(pos) | ||||||
| else: | ||||||
| chunks[-1] += chr(int(field, 16)) | ||||||
| pos += 1 | ||||||
| self.assertEqual(chunks.pop(), '', line) | ||||||
| input = ''.join(chunks) | ||||||
| with self.subTest(line): | ||||||
| result = list(unicodedata.iter_graphemes(input)) | ||||||
|
Member
Did you mean to use the passed
Suggested change
|
||||||
| self.assertEqual(list(map(str, result)), chunks, comment) | ||||||
| self.assertEqual([x.start for x in result], breaks[:-1], comment) | ||||||
| self.assertEqual([x.end for x in result], breaks[1:], comment) | ||||||
| for i in range(1, len(breaks) - 1): | ||||||
| result = list(unicodedata.iter_graphemes(input, breaks[i])) | ||||||
|
Member
Suggested change
Continues above. |
||||||
| self.assertEqual(list(map(str, result)), chunks[i:], comment) | ||||||
| self.assertEqual([x.start for x in result], breaks[i:-1], comment) | ||||||
| self.assertEqual([x.end for x in result], breaks[i+1:], comment) | ||||||
|
|
||||||
|
|
||||||
| if __name__ == "__main__": | ||||||
| unittest.main() | ||||||
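
The line parsing done in `run_grapheme_break_tests` can be tried on its own, without the network download. The sketch below restates that loop as a standalone function. The name `parse_break_test_line` and the sample line are illustrative assumptions, not part of the PR; the sample only follows the `÷` (break) and `×` (no break) notation that the test code handles.

```python
def parse_break_test_line(line):
    """Split one GraphemeBreakTest-style line into clusters and break offsets.

    Returns (chunks, breaks), or None for blank and comment-only lines.
    `chunks` holds the expected grapheme clusters; `breaks` holds the
    code point offsets of every boundary, including 0 and the total length.
    """
    data, _, _comment = line.partition('#')
    fields = data.replace('×', ' ').split()  # '×' means "no break": drop it
    if not fields:
        return None
    chunks, breaks, pos = [], [], 0
    for field in fields:
        if field == '÷':
            # A boundary: open a new (possibly final, empty) chunk.
            chunks.append('')
            breaks.append(pos)
        else:
            if not chunks:
                raise ValueError("line must start with a break marker")
            chunks[-1] += chr(int(field, 16))
            pos += 1
    # Every line ends with '÷', which leaves one empty trailing chunk.
    if chunks.pop() != '':
        raise ValueError("line must end with a break marker")
    return chunks, breaks


chunks, breaks = parse_break_test_line("÷ 0020 × 0308 ÷ 0020 ÷\t# illustrative sample")
print([ascii(c) for c in chunks], breaks)
```

With this shape, `breaks[:-1]` and `breaks[1:]` are the expected `start` and `end` values for each cluster, which is how the test above compares them.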
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,8 @@ | ||
| Add the :func:`~unicodedata.iter_graphemes` function in the | ||
| :mod:`unicodedata` module to iterate over grapheme clusters according to | ||
| rules defined in `Unicode Standard Annex #29, "Unicode Text Segmentation" | ||
| <https://www.unicode.org/reports/tr29/>`_. Add | ||
| :func:`~unicodedata.grapheme_cluster_break`, | ||
| :func:`~unicodedata.indic_conjunct_break` and | ||
| :func:`~unicodedata.extended_pictographic` functions to get the properties | ||
| of the character which are related to the above algorithm. |
The order of functions in this file doesn’t seem to be alphabetical or topical.
I think another ticket should be created to add a quick links table at the top.