spec(v3.1): \uXXXX escape, key: [] encoding, §15 smuggling note #51

Merged

johannschopplich merged 7 commits into main on May 18, 2026

Contributor

johannschopplich commented May 18, 2026

Summary

- \uXXXX Unicode escape (§7.1; closes #39, "[RFC]: Add \uXXXX unicode escape sequences for control characters in strings"): closes the JSON representability gap for control characters in U+0000–U+001F that \n, \r, \t do not cover. Decoders accept either case in hex digits; encoders SHOULD emit lowercase. Lone surrogates U+D800–U+DFFF are rejected; supplementary code points use literal UTF-8.
- Canonical empty-array encoding key: [] (§9.1; closes #50, "spec: encode empty arrays as key: [] instead of key[0]: (from #19)"): encoders SHOULD emit key: []; decoders MUST accept both the canonical form and the legacy key[0]:. Bare key: remains reserved for empty/nested objects (§8). Inner empty arrays in arrays-of-arrays (§9.2) retain - [0]: for parse clarity.
- §15 control-character smuggling note (rides on #39): specifies that encoders MUST NOT strip control characters during normalization; round-trip fidelity is preserved per §17.1; downstream sanitization is the recipient's responsibility.

VERSIONING.md gains one bullet clarifying that a decoder requirement that broadens accepted input without invalidating prior encoder output is a MINOR change.

Cross-reference sweeps: §4, §13.1, §13.2, §13.3, §14.1, §15:802, §19, §20, Appendix A:1058, Appendix B.4, Appendix D, CHANGELOG.md.

Three commits, one per concern, surgical scope each:

bad304c feat(spec): add \uXXXX unicode escape (§7.1)
6cab9b9 feat(spec): canonicalize empty array encoding as `key: []` (§9.1)
1e6fe45 feat(spec): add §15 control-character smuggling note

Each was produced by a parallel Opus + GPT-5.5 edit run against a shared brief; commits represent the diffed-and-merged best-of-both result.

Test plan

- Reference impl (toon-format/toon) updated to emit/accept new forms and ships with green CI
- Five new \uXXXX fixtures pass (encode U+0004; decode U+0004; decode mixed-case hex; reject truncated \u00; reject lone surrogate \uD800)
- Five new empty-array fixtures pass (canonical key: [], quoted key, empty-string key, root [], bare-key: → empty object regression)
- Eight existing legacy key[0]: decode fixtures continue to pass (backward compatibility)
- Encoder fixtures for arrays-primitive, arrays-nested, arrays-objects produce canonical key: [] shape
- Replace YYYY-MM-DD placeholders in CHANGELOG.md and SPEC.md Appendix D on release day
- README badge/test count refreshed if applicable

johannschopplich added 7 commits

May 18, 2026 08:09


          feat(spec): add \uXXXX unicode escape (§7.1)

bad304c

Closes the JSON representability gap for control characters U+0000-U+001F
not covered by \n, \r, \t. Decoders accept either case in hex digits;
encoders SHOULD emit lowercase. Lone surrogates U+D800-U+DFFF are
rejected; supplementary code points use literal UTF-8 (no surrogate pair
recognition).

Refs spec#39
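
The decoder side of this commit can be sketched roughly as follows. This is a minimal illustration in Python, not the reference implementation: the function name `unescape` is hypothetical, and the set of short escapes other than `\n`, `\r`, `\t` (here `\\` and `\"`) is an assumption about §7.1 rather than something stated in this PR.

```python
import re

# Assumed short escapes; only \n, \r, \t are named in this PR.
_SIMPLE = {"\\": "\\", '"': '"', "n": "\n", "r": "\r", "t": "\t"}
_HEX4 = re.compile(r"[0-9a-fA-F]{4}")  # either case is accepted when decoding


def unescape(body: str) -> str:
    """Decode escapes in a quoted string body (surrounding quotes already removed)."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        if i + 1 >= len(body):
            raise ValueError("dangling backslash at end of string")
        esc = body[i + 1]
        if esc in _SIMPLE:
            out.append(_SIMPLE[esc])
            i += 2
        elif esc == "u":
            digits = body[i + 2 : i + 6]
            if not _HEX4.fullmatch(digits):
                raise ValueError("\\u must be followed by exactly four hex digits")
            code_point = int(digits, 16)
            # No surrogate-pair recognition: any surrogate is rejected.
            if 0xD800 <= code_point <= 0xDFFF:
                raise ValueError("surrogate code points are not allowed in \\uXXXX")
            out.append(chr(code_point))
            i += 6
        else:
            raise ValueError(f"invalid escape sequence \\{esc}")
    return "".join(out)
```

Under these assumptions, `unescape("\\u001B")` and `unescape("\\u001b")` both yield U+001B, while a truncated `\u00` or a lone `\uD800` raises `ValueError`.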


          feat(spec): canonicalize empty array encoding as key: [] (§9.1)

6cab9b9

Empty object-field arrays now canonically encode as `key: []`; decoders
MUST accept both the canonical form and the legacy `key[0]:` for
backward compatibility. Bare `key:` remains reserved for empty/nested
objects (§8). Inner empty arrays in arrays-of-arrays (§9.2) retain
`- [0]:` for parse clarity.

This change introduces a new value-token meaning (`[]` after a colon),
documented in §4 and §9.1. VERSIONING.md is extended to clarify that a
decoder requirement broadening accepted input without invalidating
prior encoder output is MINOR.

Refs spec#50
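
As a rough illustration of the compatibility rule in this commit, the sketch below decodes a single object-field line. It is deliberately simplified and not taken from the spec or the reference implementation: the function names and the key pattern are hypothetical, only unquoted keys are handled, and a bare `key:` is always treated as an empty object (a real decoder would also inspect following indented lines for a nested object).

```python
import re

# Hypothetical, simplified key pattern: unquoted keys only.
_KEY = r"(?P<key>[A-Za-z_][\w.]*)"
_CANONICAL = re.compile(rf"^{_KEY}: \[\]$")   # key: []
_LEGACY = re.compile(rf"^{_KEY}\[0\]:$")      # key[0]:
_BARE = re.compile(rf"^{_KEY}:$")             # key:


def decode_field(line: str):
    """Return (key, value) for an empty-array or empty-object field line."""
    match = _CANONICAL.match(line) or _LEGACY.match(line)
    if match:
        # Both the canonical and the legacy form decode to an empty array.
        return match["key"], []
    match = _BARE.match(line)
    if match:
        # Bare `key:` stays reserved for objects, never arrays.
        return match["key"], {}
    raise ValueError(f"not an empty-array or empty-object field: {line!r}")


def encode_empty_array(key: str) -> str:
    """Emit the canonical form that encoders SHOULD produce."""
    return f"{key}: []"
```

Because old output (`tags[0]:`) still decodes and new output (`tags: []`) is only a wider accepted input, existing documents remain valid, which is the reasoning behind classifying the change as MINOR.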


          feat(spec): add §15 control-character smuggling note

1e6fe45

Documents that control characters now representable via \uXXXX (§7.1)
round-trip verbatim per §17.1 and that encoders MUST NOT strip them
during normalization. Pins downstream sanitization (terminals, logs,
markup) as a recipient responsibility – TOON is a transport format,
not a sanitization boundary.

Refs spec#39
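
A minimal encoder-side sketch of the same rule, assuming the short escapes are `\\`, `\"`, `\n`, `\r`, `\t` (the first two are an assumption; this PR only names the last three). The function name `escape` is hypothetical. The point it illustrates is that control characters are escaped and kept, never dropped.

```python
# Assumed short escapes; every other control character falls back to \uXXXX.
_SHORT = {"\\": "\\\\", '"': '\\"', "\n": "\\n", "\r": "\\r", "\t": "\\t"}


def escape(value: str) -> str:
    """Escape a string for use inside a quoted value without removing anything."""
    out = []
    for ch in value:
        if ch in _SHORT:
            out.append(_SHORT[ch])
        elif ord(ch) < 0x20:
            # Remaining U+0000-U+001F: lowercase hex, as encoders SHOULD emit.
            out.append(f"\\u{ord(ch):04x}")
        else:
            out.append(ch)
    return "".join(out)
```

For example, a value containing an ANSI escape sequence keeps its ESC byte as `\u001b` in the encoded text. Whether that is safe to print to a terminal or log is left to the consumer.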


          fix(spec): address pre-release review findings for v3.1

4ed8340

- Bump spec version banner to 3.1, README badge to v3.1 and tests-368,
  package.json to 3.1.0; dates left as YYYY-MM-DD until release day.
- §7.2: extend control-character quoting rule to all U+0000–U+001F so
  strings containing newly representable control chars (NUL, ESC, etc.)
  are forced to be quoted, closing a gap §7.1 alone could not fix.
- §7.1: restrict the MAY-use-\uXXXX clause to non-surrogate code points
  in U+0000–U+FFFF, removing the self-contradiction with the surrogate
  rejection rule below.
- §5: add an explicit root-form branch for the empty-array sentinel
  `[]`, so a one-line `[]` document decodes as an empty root array
  before the single-primitive rule applies.
- examples: update primitive-arrays.toon and examples/README.md to the
  canonical `key: []` form.
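
Two of the fixes above lend themselves to a short sketch. Both helpers are hypothetical and cover only the conditions named in this commit: `has_control_char` is just one of the conditions that would force quoting under §7.2, and `decode_single_line_root` only shows the ordering of the `[]` check relative to the single-primitive rule, with the primitive parser passed in as a caller-supplied function.

```python
from typing import Any, Callable


def has_control_char(value: str) -> bool:
    """True if the string contains any U+0000-U+001F character and so must be quoted."""
    return any(ord(ch) < 0x20 for ch in value)


def decode_single_line_root(line: str, parse_primitive: Callable[[str], Any]) -> Any:
    """Decode a one-line document that is not a key-value pair.

    The empty-array sentinel is checked first so that `[]` is never
    handed to the primitive parser and misread as a string.
    """
    if line == "[]":
        return []
    return parse_primitive(line)
```

Checking the sentinel before the primitive rule matters because a generic primitive parser would otherwise fall through to treating `[]` as an unquoted string.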


          fix(spec): address second-round codex review for v3.1

ecf4cab

- Fill release date 2026-05-18 in SPEC.md, README.md, CHANGELOG.md, and
  Appendix D (was YYYY-MM-DD placeholder).
- §7.1 ABNF: tighten case of `n`, `r`, `t` terminals via `%x6E`, `%x72`,
  `%x74` so the grammar rejects the same uppercase forms the prose
  rejects (RFC 5234 quoted literals are case-insensitive by default).
- §15: drop the §17.1 cross-reference (which is JSON-interop scoped, not
  control-character scoped) and reword to "preserved as data values".
- §10: explicit cross-reference to §9.1 for empty arrays in list-item
  object fields so `- data: []` is unambiguously specified.
- App B.5: add the `[]` empty-array branch to the key-value parsing
  sketch so a literal implementation does not fall through to string.
- CHANGELOG / App D: correct "arbitrary code points" wording to "BMP
  code points" — `\uXXXX` is restricted to non-surrogate U+0000-U+FFFF
  per §7.1 and supplementary code points must use literal UTF-8.
- CHANGELOG: add an explicit Changed entry for the encoder `\uXXXX`
  MUST, noting that v3.0 had no defined escape for these characters and
  that this release formalizes previously undefined behavior without
  invalidating prior conformant output (per VERSIONING.md's
  "previously undefined behavior" carve-out).


          chore(spec): trim review-cycle noise from v3.1

43c4a45

Cuts redundant restatements and rationale prose added across the two
codex review rounds; no normative change.

- SPEC.md §4: drop denormalized escape list (§7.1 owns it)
- SPEC.md §9.1: drop "(a key followed by the two-character literal `[]`)"
- SPEC.md §9.2: collapse to the negation that's actually load-bearing
- SPEC.md §10: drop the §9.1 cross-ref bullet (already in §9.1)
- SPEC.md §13.3: drop the validator style-diagnostic bullet
- SPEC.md §15: drop the downstream-sanitization sentence
- SPEC.md Appendix D: reword §15 bullet to match
- CHANGELOG.md: drop §9.2 carry-over note and SemVer justification prose


          chore(spec): fourth-round trim and verified-bug fixes

45facd8

Grilled-audit pass. No normative change.

- CHANGELOG: fold encoder \uXXXX MUST into the Added bullet (§7.1) and
  reword the §15 bullet to surface the actual MUST NOT instead of
  "added a security note". Drop the duplicate Changed entry.
- SPEC.md §13.1: split the glued bullet at line 705 into two checklist
  items so the `key: []` SHOULD matches the §13.2 decoder bullet.
- README.md: drop the tests-368 badge — actual count is 413, badge has
  drifted twice. tests/README.md is the canonical source.
- VERSIONING.md: tighten the new MINOR rule to be version-agnostic.
- tests/fixtures/decode/validation-errors.json: test name said \u00 but
  input is \u00b; rename to match.

johannschopplich merged commit 3eb5b88 into main

1 check passed

johannschopplich mentioned this pull request

Empty arrays (eg matches[0]:) are often misinterpreted by LLMs #19

Closed

3 tasks
