C++ pybind11 extension for ASUN (Array-Schema Unified Notation).
It provides seven functions, with no manual schema strings required for encoding:
`encode`, `encodeTyped`, `encodePretty`, `encodePrettyTyped`, `decode`, `encodeBinary`, `decodeBinary`.
The wheel also ships `asun.pyi` and `py.typed`, so editors and static type checkers can understand the extension module without a separate stub package.
Standard JSON repeats every field name in every record. When you send structured data to an LLM, over an API, or across services, that repetition wastes tokens, bytes, and attention:

```json
[
  { "id": 1, "name": "Alice", "active": true },
  { "id": 2, "name": "Bob", "active": false },
  { "id": 3, "name": "Carol", "active": true }
]
```

ASUN declares the schema once and streams data as compact tuples:

```asun
[{id, name, active}]:
(1,Alice,true),
(2,Bob,false),
(3,Carol,true)
```
Fewer tokens, smaller payloads, clearer structure, and faster parsing than repeated-object JSON.
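As a rough illustration of where the savings come from, here is a hand-rolled sketch of the schema-once text layout (the `asun_text` helper is hypothetical, not the extension's encoder) compared byte-for-byte against `json.dumps`:

```python
import json

rows = [
    {"id": 1, "name": "Alice", "active": True},
    {"id": 2, "name": "Bob", "active": False},
    {"id": 3, "name": "Carol", "active": True},
]

def asun_text(rows):
    # Hand-rolled sketch: declare field names once in a header,
    # then emit each record as a bare tuple. Not the real encoder.
    fields = list(rows[0])
    header = "[{" + ",".join(fields) + "}]:"
    def fmt(v):
        return "true" if v is True else "false" if v is False else str(v)
    body = ",\n".join(
        "(" + ",".join(fmt(r[f]) for f in fields) + ")" for r in rows
    )
    return header + "\n" + body + "\n"

json_bytes = len(json.dumps(rows).encode())
asun_bytes = len(asun_text(rows).encode())
# Key names appear once in the ASUN header instead of once per record,
# so the payload shrinks as the row count grows.
```

The gap widens with more rows, since the per-record overhead in JSON is proportional to the total key-name length.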
| Tool | Version |
|---|---|
| g++ | ≥ 11 (C++17) |
| python3-dev | any (provides Python.h) |
| Python | ≥ 3.8 |
pybind11 2.13.6 headers are vendored in vendor/pybind11/ — no separate installation needed.
```bash
# Option A — shell script (auto-installs python3-dev via sudo if missing)
bash build.sh

# Option B — Makefile
make

# Option C — CMake
cmake -B build && cmake --build build
```

Python values map to ASUN types as follows:

| Python value | Inferred ASUN type |
|---|---|
| `bool` | `bool` |
| `int` | `int` |
| `float` | `float` |
| `str` | `str` |
| `None` | optional (e.g. `str?`, `int?`) |
Cross-row type merging for lists: when encoding a list, all rows are scanned to compute the final type:

- A field that is non-`None` in row 0 but `None` in some later row is promoted to optional (e.g. `str` → `str?`, `int` → `int?`).
- Type conflicts between non-`None` values (e.g. `int` in row 0, `str` in row 1) fall back to `str`.

This means `encodeTyped` is safe to use even when only some rows have `None` for a given field.
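The merging rule above can be sketched in pure Python (the `infer_type` and `merge_field_type` helpers are illustrative names; the real inference lives in the C++ extension):

```python
def infer_type(v):
    # Map a Python scalar to an ASUN type name.
    # bool must be checked before int: bool is a subclass of int.
    if isinstance(v, bool):
        return "bool"
    if isinstance(v, int):
        return "int"
    if isinstance(v, float):
        return "float"
    return "str"

def merge_field_type(rows, field):
    # Scan every row, not just row 0:
    # - exactly one non-None type seen -> use it
    # - conflicting non-None types     -> fall back to "str"
    # - any None value                 -> promote to optional ("?")
    seen = {infer_type(r[field]) for r in rows if r[field] is not None}
    base = seen.pop() if len(seen) == 1 else "str"
    optional = any(r[field] is None for r in rows)
    return base + "?" if optional else base
```

For example, `merge_field_type([{"tag": "hello"}, {"tag": None}], "tag")` yields `"str?"`, matching the `tag@str?` binding shown in the `encodeTyped` example below.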
```python
asun.encode({"id": 1, "name": "Alice"})
# → '{id,name}:\n(1,Alice)\n'

asun.encode([{"id": 1, "name": "Alice"}, {"id": 2, "name": "Bob"}])
# → '[{id,name}]:\n(1,Alice),\n(2,Bob)\n'
```

Decode semantics without scalar hints: when decoded with `decode()`, terminal field values are returned as strings because the schema omits scalar type hints. Structural bindings such as `@{}` and `@[]` still remain in the schema. Use `encodeTyped` when you need a type-preserving round-trip.
Type is inferred from all rows (not just the first). A field that is `None` in any row is made optional:
```python
asun.encodeTyped({"id": 1, "name": "Alice", "active": True})
# → '{id@int,name@str,active@bool}:\n(1,Alice,true)\n'

# Optional field inferred from cross-row merging:
asun.encodeTyped([{"id": 1, "tag": "hello"}, {"id": 2, "tag": None}])
# → '[{id@int,tag@str?}]:\n(1,hello),\n(2,)\n'
```

```python
pretty = asun.encodePretty(rows)
```

```python
pretty = asun.encodePrettyTyped(rows)
```

`decode` handles both schemas with scalar hints and schemas without scalar hints embedded in the text:
```python
# schema with scalar hints → values restored as Python types
rec = asun.decode('{id@int, name@str}:\n(1,Alice)\n')   # {'id': 1, 'name': 'Alice'}
rows = asun.decode('[{id@int, name@str}]:\n(1,Alice),\n(2,Bob)\n')

# schema without scalar hints → scalar values returned as strings
rec2 = asun.decode('{id,name}:\n(1,Alice)\n')           # {'id': '1', 'name': 'Alice'}
```

Block comments are supported anywhere whitespace is allowed:
```python
rec = asun.decode('/* top */ {id@int,name@str}: /* row */ (1, /* name */ Alice)')
```

```python
data = asun.encodeBinary(rows)
```

A schema is required because the binary wire format carries no embedded type information:

```python
rows = asun.decodeBinary(data, "[{id@int, name@str}]")
```

asun-py includes inline typing support for the compiled extension:
```python
from asun import decode

rows = decode("[{id@int, name@str}]:(1,Alice),(2,Bob)")
```

Type checkers will infer `dict[str, Any] | list[dict[str, Any]]` for decode results and validate function signatures from the bundled `asun.pyi`.
Little-endian layout, identical to asun-rs and asun-go:
| Type | Bytes |
|---|---|
| `int` | 8 (i64 LE) |
| `uint` | 8 (u64 LE) |
| `float` | 8 (f64 LE) |
| `bool` | 1 |
| `str` | 4-byte length LE + UTF-8 bytes |
| `optional` | 1-byte tag (0=null, 1=present) + value |
| `slice` | 4-byte count LE + elements |
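The scalar and optional rows of that table can be sketched with Python's `struct` module (the `encode_value` helper is a hypothetical illustration of the layout, not the extension's C++ encoder):

```python
import struct

def encode_value(v, t):
    # Encode one value per the wire-format table above, little-endian.
    if t == "int":
        return struct.pack("<q", v)               # 8 bytes, i64 LE
    if t == "uint":
        return struct.pack("<Q", v)               # 8 bytes, u64 LE
    if t == "float":
        return struct.pack("<d", v)               # 8 bytes, f64 LE
    if t == "bool":
        return struct.pack("<?", v)               # 1 byte
    if t == "str":
        raw = v.encode("utf-8")
        return struct.pack("<I", len(raw)) + raw  # 4-byte length LE + UTF-8 bytes
    if t.endswith("?"):
        # optional: 1-byte tag (0=null, 1=present) + value
        return b"\x00" if v is None else b"\x01" + encode_value(v, t[:-1])
    raise ValueError(f"unknown type: {t}")
```

For example, `encode_value("Alice", "str")` produces a 4-byte little-endian length `5` followed by the UTF-8 bytes, and `encode_value(None, "str?")` is the single tag byte `\x00`. The `slice` row follows the same pattern: a 4-byte count, then each element encoded in turn.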
```bash
# after building:
python3 -m pytest tests/ -v
```

```python
import asun

users = [
    {"id": 1, "name": "Alice", "score": 9.5},
    {"id": 2, "name": "Bob", "score": 7.2},
]

# Schema is inferred automatically—no schema string needed
text = asun.encode(users)               # schema binding without scalar hints
textTyped = asun.encodeTyped(users)     # schema binding with scalar hints
pretty = asun.encodePrettyTyped(users)  # pretty + scalar hints
blob = asun.encodeBinary(users)         # binary (schema inferred internally)

assert asun.decode(textTyped) == users  # round-trip with scalar hints
assert asun.decode(pretty) == users
assert asun.decodeBinary(blob, "[{id@int, name@str, score@float}]") == users
```

Measured on this machine with:

```bash
bash build.sh
PYTHONPATH=. python3 examples/bench.py
```

Headline numbers:
- Flat 1,000-record dataset: ASUN text serialize `118.98ms` vs JSON `403.32ms`; deserialize `221.21ms` vs JSON `441.89ms`
- Flat 10,000-record dataset: ASUN text serialize `81.70ms` vs JSON `293.38ms`; deserialize `158.39ms` vs JSON `317.44ms`
- Size summary for 1,000 flat records: JSON `137,674 B`, ASUN text `57,761 B` (`58%` smaller), ASUN binary `74,454 B` (`46%` smaller vs JSON)
- Throughput summary on 1,000 records: ASUN text was `3.58x` faster than JSON for serialize and `2.01x` faster for deserialize
- Binary mode was even faster: `7.18x` faster than JSON on serialization and `4.16x` faster on deserialization in the benchmark summary