Structure-Aware Tokenization for JSON
preprint
Overview
General-purpose subword tokenizers such as cl100k_base treat JSON as flat text, splitting structural characters across multiple tokens and re-encoding identical key names in every object. json-tokenizer instead assigns dedicated single tokens to JSON grammar elements ('{', '}', '[', ']', ':', ','), learns a compact key vocabulary from training data, and applies BPE only to value content.
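The core idea can be sketched in a few lines. This is a minimal illustration with a hand-picked key vocabulary and hypothetical function names, not the shipped json-tokenizer API; it uses per-byte fallback for values where the real system applies BPE:

```python
import json

STRUCTURAL = ["{", "}", "[", "]", ":", ","]
# Tiny hand-picked key vocabulary; the real system learns this from data.
KEY_VOCAB = ['"type"', '"coordinates"', '"name"']

# Match multi-byte key strings before single structural characters.
SPECIALS = KEY_VOCAB + STRUCTURAL
TOK = {s: i for i, s in enumerate(SPECIALS)}
BYTE_BASE = len(TOK)  # ids >= BYTE_BASE encode one raw UTF-8 byte

def encode(obj):
    """Serialize canonically, then emit one token per grammar element,
    known key, or fallback byte."""
    data = json.dumps(obj, separators=(",", ":")).encode("utf-8")
    ids, i = [], 0
    while i < len(data):
        for s in SPECIALS:
            sb = s.encode("utf-8")
            if data.startswith(sb, i):
                ids.append(TOK[s])
                i += len(sb)
                break
        else:
            ids.append(BYTE_BASE + data[i])
            i += 1
    return ids

def decode(ids):
    """Concatenate the byte strings behind each token, then parse.
    Pure concatenation makes the round trip lossless by construction."""
    parts = [SPECIALS[t].encode("utf-8") if t < BYTE_BASE else bytes([t - BYTE_BASE])
             for t in ids]
    return json.loads(b"".join(parts).decode("utf-8"))

obj = {"type": "Feature", "coordinates": [1.5, 2.5]}
assert decode(encode(obj)) == obj  # lossless round trip
```

Because decoding is plain byte concatenation, losslessness does not depend on the key vocabulary being unambiguous; a key string appearing inside a value still decodes to the same bytes.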
Key Results
- 5–15% token savings over cl100k_base on schema-repetitive JSON workloads
- 25–90× smaller vocabulary (1,100–4,100 tokens vs. 100,256)
- Batch encoding of JSON arrays yields savings of up to 9.3%
- Lossless: decoded output is byte-identical across 4,200+ test objects
- Stable: standard deviation <0.01% across five random seeds
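The schema-repetition effect behind these numbers can be illustrated directly. The sketch below is not a cl100k_base comparison; it uses a one-token-per-byte baseline and a hypothetical three-key vocabulary to show that per-key savings accumulate with every repeated object in an array:

```python
import json

# Hypothetical learned key vocabulary (illustrative only).
KEY_VOCAB = ['"id"', '"lat"', '"lon"']

def byte_count(text):
    # Baseline: one token per UTF-8 byte.
    return len(text.encode("utf-8"))

def keyaware_count(text):
    # Each occurrence of a vocabulary key collapses to a single token.
    n = byte_count(text)
    for key in KEY_VOCAB:
        n -= text.count(key) * (len(key.encode("utf-8")) - 1)
    return n

records = [{"id": i, "lat": 1.0, "lon": 2.0} for i in range(100)]
text = json.dumps(records, separators=(",", ":"))
saved = 1 - keyaware_count(text) / byte_count(text)
print(f"{saved:.1%} fewer tokens than the per-byte baseline")
```

The savings fraction grows toward a schema-dependent ceiling as the array lengthens, which is why batch encoding of arrays pays off more than encoding objects one at a time.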
Target Workloads
- GeoJSON features
- Observability telemetry
- Kubernetes manifests
- LLM structured output pipelines
- API response caching
Honest Limitations
General-purpose tokenizers remain superior for prose-heavy JSON where schema repetition is minimal.
Implementation Notes
- HuggingFace PreTrainedTokenizer compatible
- Zero-dependency Python implementation