From 5472ecd25ff2f96c3e055efe8e84950e32c00b2a Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Tue, 18 Mar 2025 22:22:16 +0900
Subject: [PATCH 1/7] Add a note on flankingness around ill-formed code unit subsequences

---
 spec.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/spec.txt b/spec.txt
index d76255e0..7aa167d8 100644
--- a/spec.txt
+++ b/spec.txt
@@ -6185,6 +6185,8 @@ followed by [Unicode whitespace] or a [Unicode punctuation character].
 For purposes of this definition, the beginning and the end of
 the line count as Unicode whitespace.

+(Note: If the [delimiter run] adjoins [ill-formed code unit subsequences](https://www.unicode.org/glossary/#ill_formed_code_unit_subsequence) (including isolated surrogate code units), both whether the [delimiter run] is a [left-flanking delimiter run] and whether it is a [right-flanking delimiter run] are [unspecified](http://eel.is/c++draft/defns.unspecified).)
+
 Here are some examples of delimiter runs.

   - left-flanking but not right-flanking:

From faa49a0d9a9f27d0f290adae223256f75e0a2c16 Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Sun, 27 Apr 2025 21:50:45 +0900
Subject: [PATCH 2/7] Move to the definition of character

---
 spec.txt | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/spec.txt b/spec.txt
index 7aa167d8..6404656b 100644
--- a/spec.txt
+++ b/spec.txt
@@ -294,10 +294,20 @@ In the examples, the `→` character is used to represent tabs.
 Any sequence of [characters] is a valid CommonMark
 document.

-A [character](@) is a Unicode code point. Although some
-code points (for example, combining accents) do not correspond to
-characters in an intuitive sense, all code points count as characters
-for purposes of this spec.
+A [character](@) is a
+[Unicode encoded character](https://www.unicode.org/glossary/#encoded_character)
+(or [assigned character](https://www.unicode.org/glossary/#assigned_character)).
+Although some code points (for example, combining accents) do not correspond to
+characters in an intuitive sense, all encoded characters count as characters
+for purposes of this spec. However,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point),
+[reserved code points](https://www.unicode.org/glossary/#reserved_code_point),
+or [Unicode noncharacters](https://www.unicode.org/glossary/#noncharacter)
+are not included. If an implementation meets a code point that is not
+included as a character or an
+[ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
+at the place where it expects a character, the behavior is
+[unspecified](http://eel.is/c++draft/defns.unspecified).

 This spec does not specify an encoding; it thinks of lines as composed
 of [characters] rather than bytes. A conforming parser may be limited
@@ -6185,8 +6195,6 @@ followed by [Unicode whitespace] or a [Unicode punctuation character].
 For purposes of this definition, the beginning and the end of
 the line count as Unicode whitespace.

-(Note: If the [delimiter run] adjoins [ill-formed code unit subsequences](https://www.unicode.org/glossary/#ill_formed_code_unit_subsequence) (including isolated surrogate code units), both whether the [delimiter run] is a [left-flanking delimiter run] and whether it is a [right-flanking delimiter run] are [unspecified](http://eel.is/c++draft/defns.unspecified).)
-
 Here are some examples of delimiter runs.

   - left-flanking but not right-flanking:

From 4eaa0e4e651e7de0af23b3569c912215cea6746d Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Sun, 27 Apr 2025 21:54:13 +0900
Subject: [PATCH 3/7] Make the behavior undefined

---
 spec.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec.txt b/spec.txt
index 6404656b..9fc60e88 100644
--- a/spec.txt
+++ b/spec.txt
@@ -307,7 +307,7 @@ are not included.
If an implementation meets a code point that is not
 included as a character or an
 [ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
 at the place where it expects a character, the behavior is
-[unspecified](http://eel.is/c++draft/defns.unspecified).
+[undefined](https://eel.is/c++draft/defns.undefined).

 This spec does not specify an encoding; it thinks of lines as composed
 of [characters] rather than bytes. A conforming parser may be limited

From 266166c35ca6ee3961382efe31dc02502cca62cf Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Thu, 15 May 2025 23:25:05 +0900
Subject: [PATCH 4/7] Consider non-Unicode character sets

---
 spec.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/spec.txt b/spec.txt
index 9fc60e88..a218d004 100644
--- a/spec.txt
+++ b/spec.txt
@@ -303,8 +303,8 @@ for purposes of this spec. However,
 [surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point),
 [reserved code points](https://www.unicode.org/glossary/#reserved_code_point),
 or [Unicode noncharacters](https://www.unicode.org/glossary/#noncharacter)
-are not included. If an implementation meets a code point that is not
-included as a character or an
+are not included. If an implementation meets a code unit that is not
+a part of a character, for example, a part of an
 [ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
 at the place where it expects a character, the behavior is
 [undefined](https://eel.is/c++draft/defns.undefined).
From 1f046ea98f7b6261a1697aa4778384df10ab3633 Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Thu, 15 May 2025 23:27:10 +0900
Subject: [PATCH 5/7] Follow HTML Living Standard in numeric character reference

---
 spec.txt | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/spec.txt b/spec.txt
index a218d004..192c79d2 100644
--- a/spec.txt
+++ b/spec.txt
@@ -671,9 +671,12 @@ references and their corresponding code points.
 references](@)
 consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
-Unicode character. Invalid Unicode code points will be replaced by
-the REPLACEMENT CHARACTER (`U+FFFD`). For security reasons,
-the code point `U+0000` will also be replaced by `U+FFFD`.
+number. The parsed number is replaced with
+another Unicode scalar value according to
+[the rules stipulated in HTML Living Standard](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
+if applicable. For example,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
+and the code point `U+0000` will be replaced by `U+FFFD`.

 ```````````````````````````````` example
 &#35; &#1234; &#992; &#0;
@@ -685,8 +688,9 @@ the code point `U+0000` will also be replaced by `U+FFFD`.
 [Hexadecimal numeric character
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
-They too are parsed as the corresponding Unicode character (this
-time specified with a hexadecimal numeral instead of decimal).
+They too are parsed and sanitized as the corresponding Unicode scalar value
+according to the rules of HTML Living Standard
+(this time specified with a hexadecimal numeral instead of decimal).
 ```````````````````````````````` example
 &#X22; &#XD06; &#xcab;
@@ -710,7 +714,7 @@ Here are some nonentities:
 ````````````````````````````````


-Although HTML5 does accept some entity references
+Although HTML Living Standard does accept some entity references
 without a trailing semicolon (such as `&copy`), these are not
 recognized here, because it makes the grammar too ambiguous:

@@ -721,7 +725,7 @@ recognized here, because it makes the grammar too ambiguous:
 ````````````````````````````````


-Strings that are not on the list of HTML5 named entities are not
+Strings that are not on the list of HTML Live Standard named entities are not
 recognized as entity references either:

 ```````````````````````````````` example

From a96e926587f02a749c26b2605e0878ef5405abe1 Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Sun, 1 Jun 2025 23:07:02 +0900
Subject: [PATCH 6/7] =?UTF-8?q?HTML=20Living=20Standard=20=E2=86=92=20HTML?= =?UTF-8?q?5?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Delegates it to another PR.
---
 spec.txt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/spec.txt b/spec.txt
index 192c79d2..763c1258 100644
--- a/spec.txt
+++ b/spec.txt
@@ -673,7 +673,7 @@ consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
 number. The parsed number is replaced with
 another Unicode scalar value according to
-[the rules stipulated in HTML Living Standard](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
+[the rules stipulated in HTML5](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
 if applicable. For example,
 [surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
 and the code point `U+0000` will be replaced by `U+FFFD`.
@@ -689,7 +689,7 @@ and the code point `U+0000` will be replaced by `U+FFFD`.
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
 They too are parsed and sanitized as the corresponding Unicode scalar value
-according to the rules of HTML Living Standard
+according to the rules of HTML5
 (this time specified with a hexadecimal numeral instead of decimal).

 ```````````````````````````````` example
@@ -714,7 +714,7 @@ Here are some nonentities:
 ````````````````````````````````


-Although HTML Living Standard does accept some entity references
+Although HTML5 does accept some entity references
 without a trailing semicolon (such as `&copy`), these are not
 recognized here, because it makes the grammar too ambiguous:

@@ -725,7 +725,7 @@ recognized here, because it makes the grammar too ambiguous:
 ````````````````````````````````


-Strings that are not on the list of HTML Live Standard named entities are not
+Strings that are not on the list of HTML5 named entities are not
 recognized as entity references either:

 ```````````````````````````````` example

From a8afc0ceff2965e303429838c442a48e8dbdf3ce Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino
Date: Sun, 1 Jun 2025 23:41:03 +0900
Subject: [PATCH 7/7] Revert introduction of HTML codepoint replacing rule

---
 spec.txt | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/spec.txt b/spec.txt
index 763c1258..1ee49082 100644
--- a/spec.txt
+++ b/spec.txt
@@ -671,12 +671,10 @@ references and their corresponding code points.
 references](@)
 consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
-number. The parsed number is replaced with
-another Unicode scalar value according to
-[the rules stipulated in HTML5](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
-if applicable. For example,
-[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
-and the code point `U+0000` will be replaced by `U+FFFD`.
+number.
The parsed number will be replaced by
+the REPLACEMENT CHARACTER (`U+FFFD`) if it does not represent
+a Unicode scalar value. For security reasons,
+the code point `U+0000` will also be replaced by `U+FFFD`.

 ```````````````````````````````` example
 &#35; &#1234; &#992; &#0;
@@ -689,7 +687,6 @@ and the code point `U+0000` will be replaced by `U+FFFD`.
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
 They too are parsed and sanitized as the corresponding Unicode scalar value
-according to the rules of HTML5
 (this time specified with a hexadecimal numeral instead of decimal).

 ```````````````````````````````` example