spec(v3.1): \uXXXX escape, key: [] encoding, §15 smuggling note #51

Merged: johannschopplich merged 7 commits into main from v3.1 on May 18, 2026

@johannschopplich (Contributor)

Summary

VERSIONING.md gains one bullet clarifying that a decoder requirement which broadens accepted input without invalidating prior encoder output is a MINOR change.

Cross-reference sweeps: §4, §13.1, §13.2, §13.3, §14.1, §15:802, §19, §20, Appendix A:1058, Appendix B.4, Appendix D, CHANGELOG.md.

Three commits, one per concern, each surgically scoped:

  • bad304c feat(spec): add \uXXXX unicode escape (§7.1)
  • 6cab9b9 feat(spec): canonicalize empty array encoding as `key: []` (§9.1)
  • 1e6fe45 feat(spec): add §15 control-character smuggling note

Each was produced by a parallel Opus + GPT-5.5 edit run against a shared brief; commits represent the diffed-and-merged best-of-both result.

Test plan

  • Reference impl (toon-format/toon) updated to emit/accept new forms and ships with green CI
  • Five new \uXXXX fixtures pass (encode U+0004; decode U+0004; decode mixed-case hex; reject truncated \u00; reject lone surrogate \uD800)
  • Five new empty-array fixtures pass (canonical key: [], quoted key, empty-string key, root [], bare-key: → empty object regression)
  • Eight existing legacy key[0]: decode fixtures continue to pass (backward compatibility)
  • Encoder fixtures for arrays-primitive, arrays-nested, arrays-objects produce canonical key: [] shape
  • Replace YYYY-MM-DD placeholders in CHANGELOG.md and SPEC.md Appendix D on release day
  • README badge/test count refreshed if applicable

Closes the JSON representability gap for control characters U+0000-U+001F
not covered by \n, \r, \t. Decoders accept either case in hex digits;
encoders SHOULD emit lowercase. Lone surrogates U+D800-U+DFFF are
rejected; supplementary code points use literal UTF-8 (no surrogate pair
recognition).

Refs spec#39

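
The decoder side of the commit message above can be sketched roughly as follows. This is a minimal illustration, not the reference implementation: the function name `unescape` is hypothetical, and the set of short escapes beyond `\n`, `\r`, `\t` (here `\\` and `\"`) is an assumption about §7.1. It accepts either hex case, rejects truncated escapes, and rejects every surrogate code point, so no surrogate pair is ever recognized.

```python
import re

# Assumed short escapes; only \n, \r, \t are named in the commit message.
_SIMPLE = {"n": "\n", "r": "\r", "t": "\t", "\\": "\\", '"': '"'}
_HEX4 = re.compile(r"[0-9A-Fa-f]{4}")  # decoders accept either case


def unescape(body: str) -> str:
    """Decode the contents of a quoted string (surrounding quotes removed)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash")
        esc = body[i + 1]
        if esc in _SIMPLE:
            out.append(_SIMPLE[esc])
            i += 2
        elif esc == "u":
            digits = body[i + 2 : i + 6]
            if not _HEX4.fullmatch(digits):
                raise ValueError("truncated or malformed \\u escape")
            code_point = int(digits, 16)
            # Every surrogate is rejected, so pairs are never combined.
            if 0xD800 <= code_point <= 0xDFFF:
                raise ValueError("surrogate code point in \\u escape")
            out.append(chr(code_point))
            i += 6
        else:
            raise ValueError("invalid escape: \\" + esc)
    return "".join(out)
```

Under these assumptions, `unescape("\\u001B")` and `unescape("\\u001b")` both yield ESC, while `unescape("\\u00")` and `unescape("\\uD800")` raise `ValueError`.
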
Empty object-field arrays now canonically encode as `key: []`; decoders
MUST accept both the canonical form and the legacy `key[0]:` for
backward compatibility. Bare `key:` remains reserved for empty/nested
objects (§8). Inner empty arrays in arrays-of-arrays (§9.2) retain
`- [0]:` for parse clarity.

This change introduces a new value-token meaning (`[]` after a colon),
documented in §4 and §9.1. VERSIONING.md is extended to clarify that a
decoder requirement broadening accepted input without invalidating
prior encoder output is MINOR.

Refs spec#50

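
As a rough illustration of the field-level behavior described above, the sketch below accepts both the canonical `key: []` and the legacy `key[0]:` forms and keeps bare `key:` mapped to an empty object. The names `parse_field` and `encode_empty_array` are hypothetical, only simple unquoted keys are handled, and non-empty values are returned as raw tokens; the real grammar is richer.

```python
import re

# Assumption: simple unquoted keys only (no quoted or empty-string keys).
_KEY = r"(?P<key>[A-Za-z_][A-Za-z0-9_.]*)"
_LEGACY_EMPTY = re.compile(r"^" + _KEY + r"\[0\]:\s*$")
_KEY_VALUE = re.compile(r"^" + _KEY + r":(?:\s+(?P<value>.*))?$")


def parse_field(line: str):
    """Parse one object-field line into (key, value)."""
    line = line.strip()
    match = _LEGACY_EMPTY.match(line)
    if match:
        return match["key"], []  # legacy `key[0]:` still decodes
    match = _KEY_VALUE.match(line)
    if not match:
        raise ValueError("not a key-value line: " + repr(line))
    value = match["value"]
    if value is None or value == "":
        return match["key"], {}  # bare `key:` stays an (empty) object
    if value == "[]":
        return match["key"], []  # canonical empty array
    return match["key"], value  # raw token; primitive parsing omitted


def encode_empty_array(key: str) -> str:
    """Emit the canonical form for an empty object-field array."""
    return key + ": []"
```

With this sketch, `parse_field("tags: []")` and `parse_field("tags[0]:")` both return `("tags", [])`, whereas `parse_field("tags:")` returns `("tags", {})`. A quoted `"[]"` token is left as a raw string token rather than treated as an array.
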
Documents that control characters now representable via \uXXXX (§7.1)
round-trip verbatim per §17.1 and that encoders MUST NOT strip them
during normalization. Pins downstream sanitization (terminals, logs,
markup) as a recipient responsibility – TOON is a transport format,
not a sanitization boundary.

Refs spec#39

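
A minimal encoder-side sketch of the "escape, do not strip" rule follows. The helper names are hypothetical, the short-escape table (`\\`, `\"`, `\n`, `\r`, `\t`) is an assumption about §7.1, and `encode_string` only covers the control-character quoting condition from §7.2, not the spec's other quoting rules. Remaining control characters are emitted as lowercase `\uXXXX`.

```python
# Assumed short escapes; all other U+0000-U+001F characters use \uXXXX.
_SHORT_ESCAPES = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape_string(value: str) -> str:
    """Escape a string body without dropping any control character."""
    out = []
    for ch in value:
        if ch in _SHORT_ESCAPES:
            out.append(_SHORT_ESCAPES[ch])
        elif ord(ch) <= 0x1F:
            # Lowercase hex, matching the encoder SHOULD.
            out.append("\\u{:04x}".format(ord(ch)))
        else:
            out.append(ch)
    return "".join(out)


def has_control_char(value: str) -> bool:
    """True if the string contains any character in U+0000-U+001F."""
    return any(ord(ch) <= 0x1F for ch in value)


def encode_string(value: str) -> str:
    """Quote and escape when control characters are present."""
    if has_control_char(value):
        return '"' + escape_string(value) + '"'
    return value  # other quoting conditions are out of scope here
```

For example, a string carrying an ANSI escape sequence such as `"\x1b[31mred"` is encoded as `"\u001b[31mred"` with the ESC preserved as data. Whether that value is safe to print to a terminal or a log is left to the recipient.
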
- Bump spec version banner to 3.1, README badge to v3.1 and tests-368,
  package.json to 3.1.0; dates left as YYYY-MM-DD until release day.
- §7.2: extend control-character quoting rule to all U+0000–U+001F so
  strings containing newly representable control chars (NUL, ESC, etc.)
  are forced to be quoted, closing a gap §7.1 alone could not fix.
- §7.1: restrict the MAY-use-\uXXXX clause to non-surrogate code points
  in U+0000–U+FFFF, removing the self-contradiction with the surrogate
  rejection rule below.
- §5: add an explicit root-form branch for the empty-array sentinel
  `[]`, so a one-line `[]` document decodes as an empty root array
  before the single-primitive rule applies.
- examples: update primitive-arrays.toon and examples/README.md to the
  canonical `key: []` form.
- Fill release date 2026-05-18 in SPEC.md, README.md, CHANGELOG.md, and
  Appendix D (was YYYY-MM-DD placeholder).
- §7.1 ABNF: tighten case of `n`, `r`, `t` terminals via `%x6E`, `%x72`,
  `%x74` so the grammar rejects the same uppercase forms the prose
  rejects (RFC 5234 quoted literals are case-insensitive by default).
- §15: drop the §17.1 cross-reference (which is JSON-interop scoped, not
  control-character scoped) and reword to "preserved as data values".
- §10: explicit cross-reference to §9.1 for empty arrays in list-item
  object fields so `- data: []` is unambiguously specified.
- App B.5: add the `[]` empty-array branch to the key-value parsing
  sketch so a literal implementation does not fall through to string.
- CHANGELOG / App D: correct "arbitrary code points" wording to "BMP
  code points" — `\uXXXX` is restricted to non-surrogate U+0000-U+FFFF
  per §7.1 and supplementary code points must use literal UTF-8.
- CHANGELOG: add an explicit Changed entry for the encoder `\uXXXX`
  MUST, noting that v3.0 had no defined escape for these characters and
  that this release formalizes previously undefined behavior without
  invalidating prior conformant output (per VERSIONING.md's
  "previously undefined behavior" carve-out).

Cuts redundant restatements and rationale prose added across the two
codex review rounds; no normative change.

- SPEC.md §4: drop denormalized escape list (§7.1 owns it)
- SPEC.md §9.1: drop "(a key followed by the two-character literal `[]`)"
- SPEC.md §9.2: collapse to the negation that's actually load-bearing
- SPEC.md §10: drop the §9.1 cross-ref bullet (already in §9.1)
- SPEC.md §13.3: drop the validator style-diagnostic bullet
- SPEC.md §15: drop the downstream-sanitization sentence
- SPEC.md Appendix D: reword §15 bullet to match
- CHANGELOG.md: drop §9.2 carry-over note and SemVer justification prose

Grilled-audit pass. No normative change.

- CHANGELOG: fold encoder \uXXXX MUST into the Added bullet (§7.1) and
  reword the §15 bullet to surface the actual MUST NOT instead of
  "added a security note". Drop the duplicate Changed entry.
- SPEC.md §13.1: split the glued bullet at line 705 into two checklist
  items so the `key: []` SHOULD matches the §13.2 decoder bullet.
- README.md: drop the tests-368 badge — actual count is 413, badge has
  drifted twice. tests/README.md is the canonical source.
- VERSIONING.md: tighten the new MINOR rule to be version-agnostic.
- tests/fixtures/decode/validation-errors.json: test name said \u00 but
  input is \u00b; rename to match.
johannschopplich merged commit 3eb5b88 into main on May 18, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

  • spec: encode empty arrays as instead of (from #19)
  • [RFC]: Add \uXXXX unicode escape sequences for control characters in strings
