Structure-Aware Tokenization for JSON
preprint
Overview
General-purpose subword tokenizers such as cl100k_base treat JSON as flat text, splitting structural characters across multiple tokens and re-encoding identical key names in every object. json-tokenizer instead assigns dedicated single tokens to JSON grammar elements ('{', '}', '[', ']', ':', ','), learns a compact key vocabulary from training data, and applies BPE only to value content.
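The core idea can be sketched in a few lines. This is a minimal illustration with a hand-picked key vocabulary and hypothetical function names, not the shipped json-tokenizer API; it uses per-byte fallback for values where the real system applies BPE:

```python
import json

STRUCTURAL = ["{", "}", "[", "]", ":", ","]
# Tiny hand-picked key vocabulary; the real system learns this from data.
KEY_VOCAB = ['"type"', '"coordinates"', '"name"']

# Match multi-byte key strings before single structural characters.
SPECIALS = KEY_VOCAB + STRUCTURAL
TOK = {s: i for i, s in enumerate(SPECIALS)}
BYTE_BASE = len(TOK)  # ids >= BYTE_BASE encode one raw UTF-8 byte

def encode(obj):
    """Serialize canonically, then emit one token per grammar element,
    known key, or fallback byte."""
    data = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        for s in SPECIALS:
            sb = s.encode("utf-8")
            if data.startswith(sb, i):
                ids.append(TOK[s])
                i += len(sb)
                break
        else:
            ids.append(BYTE_BASE + data[i])
            i += 1
    return ids

def decode(ids):
    """Concatenate the byte strings behind each token, then parse.
    Pure concatenation makes the round trip lossless by construction."""
    parts = [SPECIALS[t].encode("utf-8") if t < BYTE_BASE else bytes([t - BYTE_BASE])
             for t in ids]
    return json.loads(b"".join(parts).decode("utf-8"))

obj = {"type": "Feature", "coordinates": [1.5, 2.5]}
assert decode(encode(obj)) == obj  # lossless round trip
```

Because decoding is plain byte concatenation, losslessness does not depend on the key vocabulary being unambiguous; a key string appearing inside a value still decodes to the same bytes.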
Key Results
- 5–15% token savings over cl100k_base on schema-repetitive JSON workloads
- 25–90× smaller vocabulary (1,100–4,100 tokens vs. 100,256)
- Batch encoding of JSON arrays yields savings of up to 9.3%
- Lossless: decoded output is byte-identical across 4,200+ test objects
- Stable: standard deviation <0.01% across five random seeds
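The schema-repetition effect behind these numbers can be illustrated directly. The sketch below is not a cl100k_base comparison; it uses a one-token-per-byte baseline and a hypothetical three-key vocabulary to show that per-key savings accumulate with every repeated object in an array:

```python
import json

# Hypothetical learned key vocabulary (illustrative only).
KEY_VOCAB = ['"id"', '"lat"', '"lon"']

def byte_count(text):
    # Baseline: one token per UTF-8 byte.
    return len(text.encode("utf-8"))

def keyaware_count(text):
    # Each occurrence of a vocabulary key collapses to a single token.
    n = byte_count(text)
    for key in KEY_VOCAB:
        n -= text.count(key) * (len(key.encode("utf-8")) - 1)
    return n

records = [{"id": i, "lat": 1.0, "lon": 2.0} for i in range(100)]
text = json.dumps(records, separators=(",", ":"))
saved = 1 - keyaware_count(text) / byte_count(text)
print(f"{saved:.1%} fewer tokens than the per-byte baseline")
```

The savings fraction grows toward a schema-dependent ceiling as the array lengthens, which is why batch encoding of arrays pays off more than encoding objects one at a time.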
Target Workloads
- GeoJSON features
- Observability telemetry
- Kubernetes manifests
- LLM structured output pipelines
- API response caching
Honest Limitations
General-purpose tokenizers remain superior for prose-heavy JSON where schema repetition is minimal.
Implementation Notes
- HuggingFace PreTrainedTokenizer compatible
- Zero-dependency Python implementation