spec(v3.1): \uXXXX escape, key: [] encoding, §15 smuggling note #51
Merged
Closes the JSON representability gap for control characters U+0000-U+001F not covered by \n, \r, \t. Decoders accept either case in hex digits; encoders SHOULD emit lowercase. Lone surrogates U+D800-U+DFFF are rejected; supplementary code points use literal UTF-8 (no surrogate pair recognition). Refs spec#39
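The decoder rules above can be sketched in a few lines. This is an illustration only, not the reference implementation: `unescape` is a hypothetical helper, and the backslash and double-quote entries in the table are an assumption (this PR text only names `\n`, `\r`, `\t`).

```python
import re

# Escapes with a fixed one-character meaning. The backslash and quote
# entries are assumed; the PR text itself only names \n, \r and \t.
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\", '"': '"'}
_HEX4 = re.compile(r"[0-9A-Fa-f]{4}")  # either case accepted on decode


def unescape(body: str) -> str:
    """Decode the escapes inside an already-unquoted string body."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        kind = body[i + 1]
        if kind in _SIMPLE:
            out.append(_SIMPLE[kind])
            i += 2
        elif kind == "u":
            digits = body[i + 2:i + 6]
            if not _HEX4.fullmatch(digits):
                raise ValueError("truncated or malformed \\u escape")
            cp = int(digits, 16)
            # No surrogate-pair recognition: any surrogate is an error.
            if 0xD800 <= cp <= 0xDFFF:
                raise ValueError("surrogate code point in \\u escape")
            out.append(chr(cp))
            i += 6
        else:
            # Uppercase forms such as \N land here and are rejected.
            raise ValueError(f"unknown escape \\{kind}")
    return "".join(out)
```

With this sketch, `\u001B` and `\u001b` decode to the same ESC character, while `\u00`, `\uD800`, and `\N` all raise `ValueError`.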
Empty object-field arrays now canonically encode as `key: []`; decoders MUST accept both the canonical form and the legacy `key[0]:` for backward compatibility. Bare `key:` remains reserved for empty/nested objects (§8). Inner empty arrays in arrays-of-arrays (§9.2) retain `- [0]:` for parse clarity. This change introduces a new value-token meaning (`[]` after a colon), documented in §4 and §9.1. VERSIONING.md is extended to clarify that a decoder requirement broadening accepted input without invalidating prior encoder output is MINOR. Refs spec#50
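As a rough illustration of the three empty-value shapes a decoder has to tell apart, here is a minimal sketch. `parse_empty_forms` is a hypothetical helper that handles unquoted keys only, and it treats bare `key:` as an empty object without looking at following lines, which a real decoder must do to detect nested objects.

```python
from typing import Any, Tuple


def parse_empty_forms(line: str) -> Tuple[str, Any]:
    """Classify one key-value line that carries an empty value.

    Sketch only: unquoted keys, no nesting, no non-empty values.
    """
    key, sep, rest = line.partition(":")
    if not sep:
        raise ValueError("missing colon")
    key = key.strip()
    value = rest.strip()
    if key.endswith("[0]"):
        # Legacy form `key[0]:` must still decode as an empty array.
        if value:
            raise ValueError("zero-length array header with a value")
        return key[:-3], []
    if value == "[]":
        # Canonical form `key: []`.
        return key, []
    if value == "":
        # Bare `key:` stays reserved for objects (empty here; a real
        # decoder would check the next lines for nested fields).
        return key, {}
    raise ValueError("not an empty-value form")
```

Both `tags: []` and `tags[0]:` yield `("tags", [])`, while `meta:` yields `("meta", {})`.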
Documents that control characters now representable via \uXXXX (§7.1) round-trip verbatim per §17.1 and that encoders MUST NOT strip them during normalization. Pins downstream sanitization (terminals, logs, markup) as a recipient responsibility – TOON is a transport format, not a sanitization boundary. Refs spec#39
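On the encoder side, the "escape, do not strip" requirement might look like the sketch below. The helper names are hypothetical, the backslash and double-quote escapes are assumed, and `needs_quotes_for_controls` covers only the control-character clause of the quoting rule, not the other conditions that force quoting.

```python
# Short escapes; the backslash and quote entries are an assumption.
_SHORT = {"\n": "\\n", "\r": "\\r", "\t": "\\t", "\\": "\\\\", '"': '\\"'}


def escape_body(s: str) -> str:
    """Escape a string body, keeping every control character as data."""
    out = []
    for ch in s:
        if ch in _SHORT:
            out.append(_SHORT[ch])
        elif ord(ch) < 0x20:
            # Remaining U+0000-U+001F: lowercase \uXXXX, never dropped.
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)


def needs_quotes_for_controls(s: str) -> bool:
    """True if any U+0000-U+001F character forces the string to be quoted."""
    return any(ord(ch) < 0x20 for ch in s)


def quote(s: str) -> str:
    """Emit s as a quoted string with all control characters escaped."""
    return '"' + escape_body(s) + '"'
```

For example, `quote("a\x1bb")` produces `"a\u001bb"` with the ESC byte preserved as an escape; any neutralizing for a terminal or log is left to the recipient.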
- Bump spec version banner to 3.1, README badge to v3.1 and tests-368, package.json to 3.1.0; dates left as YYYY-MM-DD until release day.
- §7.2: extend control-character quoting rule to all U+0000–U+001F so strings containing newly representable control chars (NUL, ESC, etc.) are forced to be quoted, closing a gap §7.1 alone could not fix.
- §7.1: restrict the MAY-use-\uXXXX clause to non-surrogate code points in U+0000–U+FFFF, removing the self-contradiction with the surrogate rejection rule below.
- §5: add an explicit root-form branch for the empty-array sentinel `[]`, so a one-line `[]` document decodes as an empty root array before the single-primitive rule applies.
- examples: update primitive-arrays.toon and examples/README.md to the canonical `key: []` form.
- Fill release date 2026-05-18 in SPEC.md, README.md, CHANGELOG.md, and Appendix D (was YYYY-MM-DD placeholder).
- §7.1 ABNF: tighten case of `n`, `r`, `t` terminals via `%x6E`, `%x72`, `%x74` so the grammar rejects the same uppercase forms the prose rejects (RFC 5234 quoted literals are case-insensitive by default).
- §15: drop the §17.1 cross-reference (which is JSON-interop scoped, not control-character scoped) and reword to "preserved as data values".
- §10: explicit cross-reference to §9.1 for empty arrays in list-item object fields so `- data: []` is unambiguously specified.
- App B.5: add the `[]` empty-array branch to the key-value parsing sketch so a literal implementation does not fall through to string.
- CHANGELOG / App D: correct "arbitrary code points" wording to "BMP code points": `\uXXXX` is restricted to non-surrogate U+0000-U+FFFF per §7.1, and supplementary code points must use literal UTF-8.
- CHANGELOG: add an explicit Changed entry for the encoder `\uXXXX` MUST, noting that v3.0 had no defined escape for these characters and that this release formalizes previously undefined behavior without invalidating prior conformant output (per VERSIONING.md's "previously undefined behavior" carve-out).
Cuts redundant restatements and rationale prose added across the two codex review rounds; no normative change.
- SPEC.md §4: drop denormalized escape list (§7.1 owns it)
- SPEC.md §9.1: drop "(a key followed by the two-character literal `[]`)"
- SPEC.md §9.2: collapse to the negation that's actually load-bearing
- SPEC.md §10: drop the §9.1 cross-ref bullet (already in §9.1)
- SPEC.md §13.3: drop the validator style-diagnostic bullet
- SPEC.md §15: drop the downstream-sanitization sentence
- SPEC.md Appendix D: reword §15 bullet to match
- CHANGELOG.md: drop §9.2 carry-over note and SemVer justification prose
Grilled-audit pass. No normative change.
- CHANGELOG: fold encoder \uXXXX MUST into the Added bullet (§7.1) and reword the §15 bullet to surface the actual MUST NOT instead of "added a security note". Drop the duplicate Changed entry.
- SPEC.md §13.1: split the glued bullet at line 705 into two checklist items so the `key: []` SHOULD matches the §13.2 decoder bullet.
- README.md: drop the tests-368 badge; the actual count is 413 and the badge has drifted twice. tests/README.md is the canonical source.
- VERSIONING.md: tighten the new MINOR rule to be version-agnostic.
- tests/fixtures/decode/validation-errors.json: test name said \u00 but input is \u00b; rename to match.
Summary
- `\uXXXX` Unicode escape (§7.1, closes #39): closes the JSON representability gap for control characters U+0000–U+001F not covered by `\n`, `\r`, `\t`. Decoders accept either case in hex digits; encoders SHOULD emit lowercase. Lone surrogates U+D800–U+DFFF rejected; supplementary code points use literal UTF-8.
- `key: []` (§9.1, closes #50): encoders SHOULD emit `key: []`; decoders MUST accept both canonical and legacy `key[0]:`. Bare `key:` remains reserved for empty/nested objects (§8). Inner empty arrays in arrays-of-arrays (§9.2) retain `- [0]:` for parse clarity.
- VERSIONING.md gains one bullet clarifying that decoder requirements broadening accepted input without invalidating prior encoder output are MINOR.
Cross-reference sweeps: §4, §13.1, §13.2, §13.3, §14.1, §15:802, §19, §20, Appendix A:1058, Appendix B.4, Appendix D, CHANGELOG.md.
Three commits, one per concern, surgical scope each:
- bad304c feat(spec): add \uXXXX unicode escape (§7.1)
- 6cab9b9 feat(spec): canonicalize empty array encoding as `key: []` (§9.1)
- 1e6fe45 feat(spec): add §15 control-character smuggling note

Each was produced by a parallel Opus + GPT-5.5 edit run against a shared brief; commits represent the diffed-and-merged best-of-both result.
Test plan
- `\uXXXX` fixtures pass (encode U+0004; decode U+0004; decode mixed-case hex; reject truncated `\u00`; reject lone surrogate `\uD800`)
- `key: []`, quoted key, empty-string key, root `[]`, bare-`key:` → empty object regression)
- `key[0]:` decode fixtures continue to pass (backward compatibility)
- `arrays-primitive`, `arrays-nested`, `arrays-objects` produce canonical `key: []` shape
- `YYYY-MM-DD` placeholders in CHANGELOG.md and SPEC.md Appendix D on release day