From 5472ecd25ff2f96c3e055efe8e84950e32c00b2a Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Tue, 18 Mar 2025 22:22:16 +0900
Subject: [PATCH 1/7] Add a note on flankingness around ill-formed code unit
 subsequences

---
 spec.txt | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/spec.txt b/spec.txt
index d76255e0..7aa167d8 100644
--- a/spec.txt
+++ b/spec.txt
@@ -6185,6 +6185,8 @@ followed by [Unicode whitespace] or a [Unicode punctuation character].
 For purposes of this definition, the beginning and the end of
 the line count as Unicode whitespace.
 
+(Note:  If the [delimiter run] adjoins [ill-formed code unit subsequences](https://www.unicode.org/glossary/#ill_formed_code_unit_subsequence) (including isolated surrogate code units), both whether the [delimiter run] is a [left-flanking delimiter run] and whether it is a [right-flanking delimiter run] are [unspecified](http://eel.is/c++draft/defns.unspecified).)
+
 Here are some examples of delimiter runs.
 
   - left-flanking but not right-flanking:

From faa49a0d9a9f27d0f290adae223256f75e0a2c16 Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Sun, 27 Apr 2025 21:50:45 +0900
Subject: [PATCH 2/7] Move to the definition of character

---
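Note (illustrative only, not part of the change to spec.txt): a minimal
Python sketch of the character test described in this patch. It assumes
that the general categories reported by Python's `unicodedata` module can
stand in for the Unicode glossary terms (`Cs` for surrogate code points,
`Cn` for reserved code points and noncharacters). `is_noncharacter` and
`is_character` are hypothetical helper names, and the result for reserved
code points depends on the Unicode version bundled with the interpreter.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """True for U+FDD0..U+FDEF and for code points ending in FFFE or FFFF."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_character(cp: int) -> bool:
    """Approximate the patched definition of [character].

    Assumption: a code point counts as a character unless it is a
    surrogate (Cs), or unassigned (Cn, which covers reserved code
    points and noncharacters).
    """
    if not 0 <= cp <= 0x10FFFF:
        return False
    if is_noncharacter(cp):
        return False
    return unicodedata.category(chr(cp)) not in ("Cs", "Cn")
```

Under these assumptions, combining accents and private-use code points
still count as characters, while isolated surrogates do not.
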
 spec.txt | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/spec.txt b/spec.txt
index 7aa167d8..6404656b 100644
--- a/spec.txt
+++ b/spec.txt
@@ -294,10 +294,20 @@ In the examples, the `→` character is used to represent tabs.
 Any sequence of [characters] is a valid CommonMark
 document.
 
-A [character](@) is a Unicode code point.  Although some
-code points (for example, combining accents) do not correspond to
-characters in an intuitive sense, all code points count as characters
-for purposes of this spec.
+A [character](@) is a
+[Unicode encoded character](https://www.unicode.org/glossary/#encoded_character)
+(or [assigned character](https://www.unicode.org/glossary/#assigned_character)).
+Although some code points (for example, combining accents) do not correspond to
+characters in an intuitive sense, all encoded characters count as characters
+for purposes of this spec. However,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point),
+[reserved code points](https://www.unicode.org/glossary/#reserved_code_point),
+or [Unicode noncharacters](https://www.unicode.org/glossary/#noncharacter)
+are not included. If an implementation meets a code point that is not
+included as a character or an
+[ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
+at the place where it expects a character, the behavior is
+[unspecified](http://eel.is/c++draft/defns.unspecified).
 
 This spec does not specify an encoding; it thinks of lines as composed
 of [characters] rather than bytes.  A conforming parser may be limited
@@ -6185,8 +6195,6 @@ followed by [Unicode whitespace] or a [Unicode punctuation character].
 For purposes of this definition, the beginning and the end of
 the line count as Unicode whitespace.
 
-(Note:  If the [delimiter run] adjoins [ill-formed code unit subsequences](https://www.unicode.org/glossary/#ill_formed_code_unit_subsequence) (including isolated surrogate code units), both whether the [delimiter run] is a [left-flanking delimiter run] and whether it is a [right-flanking delimiter run] are [unspecified](http://eel.is/c++draft/defns.unspecified).)
-
 Here are some examples of delimiter runs.
 
   - left-flanking but not right-flanking:

From 4eaa0e4e651e7de0af23b3569c912215cea6746d Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Sun, 27 Apr 2025 21:54:13 +0900
Subject: [PATCH 3/7] Make the behavior undefined

---
 spec.txt | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/spec.txt b/spec.txt
index 6404656b..9fc60e88 100644
--- a/spec.txt
+++ b/spec.txt
@@ -307,7 +307,7 @@ are not included. If an implementation meets a code point that is not
 included as a character or an
 [ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
 at the place where it expects a character, the behavior is
-[unspecified](http://eel.is/c++draft/defns.unspecified).
+[undefined](https://eel.is/c++draft/defns.undefined).
 
 This spec does not specify an encoding; it thinks of lines as composed
 of [characters] rather than bytes.  A conforming parser may be limited

From 266166c35ca6ee3961382efe31dc02502cca62cf Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Thu, 15 May 2025 23:25:05 +0900
Subject: [PATCH 4/7] Consider non-Unicode character sets

---
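Note (illustrative only, not part of the change to spec.txt): a small
Python sketch of how an implementation that reads UTF-8 might notice a
code unit that is not part of a character. It relies on the strict
`utf-8` codec raising `UnicodeDecodeError`; `first_ill_formed_offset` is
a hypothetical helper name, and other encodings would need their own
check.

```python
from typing import Optional


def first_ill_formed_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first ill-formed UTF-8 subsequence.

    Returns None when the whole input decodes cleanly.
    """
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        return exc.start
    return None
```

An implementation could use such an offset to reject the input or to
substitute U+FFFD before parsing; the patch leaves that choice open.
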
 spec.txt | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/spec.txt b/spec.txt
index 9fc60e88..a218d004 100644
--- a/spec.txt
+++ b/spec.txt
@@ -303,8 +303,8 @@ for purposes of this spec. However,
 [surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point),
 [reserved code points](https://www.unicode.org/glossary/#reserved_code_point),
 or [Unicode noncharacters](https://www.unicode.org/glossary/#noncharacter)
-are not included. If an implementation meets a code point that is not
-included as a character or an
+are not included. If an implementation meets a code unit that is not
+a part of a character, for example, a part of an
 [ill-formed code unit subsequence](https://www.unicode.org/glossary/#ill-formed_code_unit_subsequence)
 at the place where it expects a character, the behavior is
 [undefined](https://eel.is/c++draft/defns.undefined).

From 1f046ea98f7b6261a1697aa4778384df10ab3633 Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Thu, 15 May 2025 23:27:10 +0900
Subject: [PATCH 5/7] Follow HTML Living Standard in numeric character
 reference

---
 spec.txt | 18 +++++++++++-------
 1 file changed, 11 insertions(+), 7 deletions(-)

diff --git a/spec.txt b/spec.txt
index a218d004..192c79d2 100644
--- a/spec.txt
+++ b/spec.txt
@@ -671,9 +671,12 @@ references and their corresponding code points.
 references](@)
 consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
-Unicode character. Invalid Unicode code points will be replaced by
-the REPLACEMENT CHARACTER (`U+FFFD`).  For security reasons,
-the code point `U+0000` will also be replaced by `U+FFFD`.
+number.  The parsed number is replaced with
+another Unicode scalar value according to 
+[the rules stipulated in HTML Living Standard](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
+if applicable.  For example,
+[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
+and the code point `U+0000` will be replaced by `U+FFFD`.
 
 ```````````````````````````````` example
 &#35; &#1234; &#992; &#0;
@@ -685,8 +688,9 @@ the code point `U+0000` will also be replaced by `U+FFFD`.
 [Hexadecimal numeric character
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
-They too are parsed as the corresponding Unicode character (this
-time specified with a hexadecimal numeral instead of decimal).
+They too are parsed and sanitized as the corresponding Unicode scalar value
+according to the rules of HTML Living Standard
+(this time specified with a hexadecimal numeral instead of decimal).
 
 ```````````````````````````````` example
 &#X22; &#XD06; &#xcab;
@@ -710,7 +714,7 @@ Here are some nonentities:
 ````````````````````````````````
 
 
-Although HTML5 does accept some entity references
+Although HTML Living Standard does accept some entity references
 without a trailing semicolon (such as `&copy`), these are not
 recognized here, because it makes the grammar too ambiguous:
 
@@ -721,7 +725,7 @@ recognized here, because it makes the grammar too ambiguous:
 ````````````````````````````````
 
 
-Strings that are not on the list of HTML5 named entities are not
+Strings that are not on the list of HTML Live Standard named entities are not
 recognized as entity references either:
 
 ```````````````````````````````` example

From a96e926587f02a749c26b2605e0878ef5405abe1 Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Sun, 1 Jun 2025 23:07:02 +0900
Subject: [PATCH 6/7] =?UTF-8?q?HTML=20Living=20Standard=20=E2=86=92=20HTML?=
 =?UTF-8?q?5?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Delegates it to another PR.
---
 spec.txt | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/spec.txt b/spec.txt
index 192c79d2..763c1258 100644
--- a/spec.txt
+++ b/spec.txt
@@ -673,7 +673,7 @@ consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
 number.  The parsed number is replaced with
 another Unicode scalar value according to 
-[the rules stipulated in HTML Living Standard](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
+[the rules stipulated in HTML5](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
 if applicable.  For example,
 [surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
 and the code point `U+0000` will be replaced by `U+FFFD`.
@@ -689,7 +689,7 @@ and the code point `U+0000` will be replaced by `U+FFFD`.
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
 They too are parsed and sanitized as the corresponding Unicode scalar value
-according to the rules of HTML Living Standard
+according to the rules of HTML5
 (this time specified with a hexadecimal numeral instead of decimal).
 
 ```````````````````````````````` example
@@ -714,7 +714,7 @@ Here are some nonentities:
 ````````````````````````````````
 
 
-Although HTML Living Standard does accept some entity references
+Although HTML5 does accept some entity references
 without a trailing semicolon (such as `&copy`), these are not
 recognized here, because it makes the grammar too ambiguous:
 
@@ -725,7 +725,7 @@ recognized here, because it makes the grammar too ambiguous:
 ````````````````````````````````
 
 
-Strings that are not on the list of HTML Live Standard named entities are not
+Strings that are not on the list of HTML5 named entities are not
 recognized as entity references either:
 
 ```````````````````````````````` example

From a8afc0ceff2965e303429838c442a48e8dbdf3ce Mon Sep 17 00:00:00 2001
From: Tatsunori Uchino <tats.u@live.jp>
Date: Sun, 1 Jun 2025 23:41:03 +0900
Subject: [PATCH 7/7] Revert introduction of HTML codepoint replacing rule

---
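Note (illustrative only, not part of the change to spec.txt): a minimal
Python sketch of the replacement rule as worded after this patch. The
digits of a numeric character reference are parsed as a number, and the
result becomes U+FFFD when the number is zero or is not a Unicode scalar
value. `resolve_numeric_reference` is a hypothetical helper name; it
assumes the caller has already matched the `&#...;` syntax and passes
only the digits.

```python
REPLACEMENT_CHARACTER = "\ufffd"


def resolve_numeric_reference(digits: str, base: int = 10) -> str:
    """Map the digits of a numeric character reference to one character.

    Use base=10 for decimal references and base=16 for hexadecimal ones.
    """
    number = int(digits, base)
    # Unicode scalar values are U+0000..U+10FFFF minus the surrogates.
    is_scalar = number <= 0x10FFFF and not 0xD800 <= number <= 0xDFFF
    if number == 0 or not is_scalar:
        return REPLACEMENT_CHARACTER
    return chr(number)
```

For example, `resolve_numeric_reference("35")` yields `#`, while
`resolve_numeric_reference("D800", 16)` yields U+FFFD.
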
 spec.txt | 11 ++++-------
 1 file changed, 4 insertions(+), 7 deletions(-)

diff --git a/spec.txt b/spec.txt
index 763c1258..1ee49082 100644
--- a/spec.txt
+++ b/spec.txt
@@ -671,12 +671,10 @@ references and their corresponding code points.
 references](@)
 consist of `&#` + a string of 1--7 arabic digits + `;`. A
 numeric character reference is parsed as the corresponding
-number.  The parsed number is replaced with
-another Unicode scalar value according to 
-[the rules stipulated in HTML5](https://html.spec.whatwg.org/multipage/parsing.html#numeric-character-reference-end-state)
-if applicable.  For example,
-[surrogate code points](https://www.unicode.org/glossary/#surrogate_code_point)
-and the code point `U+0000` will be replaced by `U+FFFD`.
+number.  The parsed number will be replaced by
+the REPLACEMENT CHARACTER (`U+FFFD`) if it does not represent
+a Unicode scalar value. For security reasons,
+the code point `U+0000` will also be replaced by `U+FFFD`.
 
 ```````````````````````````````` example
 &#35; &#1234; &#992; &#0;
@@ -689,7 +687,6 @@ and the code point `U+0000` will be replaced by `U+FFFD`.
 references](@) consist of `&#` +
 either `X` or `x` + a string of 1-6 hexadecimal digits + `;`.
 They too are parsed and sanitized as the corresponding Unicode scalar value
-according to the rules of HTML5
 (this time specified with a hexadecimal numeral instead of decimal).
 
 ```````````````````````````````` example