
Structure-Aware Tokenization for JSON

preprint

Overview

General-purpose subword tokenizers such as cl100k_base treat JSON as flat text, splitting structural syntax across token boundaries and re-encoding identical key names in every object. json-tokenizer instead assigns dedicated single tokens to JSON grammar elements, learns a compact key vocabulary from training data, and applies BPE only to value content.
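To make the scheme concrete, here is a minimal sketch of the idea, not json-tokenizer's actual API: the token IDs, the key vocabulary, and the byte-level fallback below are all invented for illustration, and real value content would go through BPE rather than raw bytes.

```python
# Hypothetical sketch: one dedicated token per JSON grammar character,
# one token per learned key, byte-level fallback standing in for BPE.
STRUCTURAL = {c: i for i, c in enumerate('{}[]:,"')}   # grammar chars: IDs 0-6
KEY_VOCAB = {"name": 7, "id": 8}                       # learned from training data
BYTE_BASE = 16                                         # fallback byte tokens

def encode(obj: dict) -> list[int]:
    """Encode a flat dict of string values (quotes folded into key tokens)."""
    toks = [STRUCTURAL["{"]]
    for i, (key, val) in enumerate(obj.items()):
        if i:
            toks.append(STRUCTURAL[","])
        toks.append(KEY_VOCAB[key])                    # whole quoted key = 1 token
        toks.append(STRUCTURAL[":"])
        toks.append(STRUCTURAL['"'])
        toks.extend(BYTE_BASE + b for b in val.encode("utf-8"))
        toks.append(STRUCTURAL['"'])
    toks.append(STRUCTURAL["}"])
    return toks

def decode(toks: list[int]) -> str:
    """Invert encode() byte-for-byte (the lossless property)."""
    inv_struct = {i: c.encode() for c, i in STRUCTURAL.items()}
    inv_key = {i: f'"{k}"'.encode() for k, i in KEY_VOCAB.items()}
    out = bytearray()
    for t in toks:
        out += inv_struct.get(t) or inv_key.get(t) or bytes([t - BYTE_BASE])
    return out.decode("utf-8")
```

In this sketch, `"name"` costs one token no matter how many objects repeat it, which is the source of the savings on schema-repetitive workloads.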

Key Results

  • 5-15% token savings over cl100k_base on schema-repetitive JSON workloads
  • 24-91x smaller vocabulary (1,100–4,100 tokens vs. cl100k_base's 100,256)
  • Batch encoding of JSON arrays yields savings of up to 9.3%
  • Lossless: decoded output is byte-identical across 4,200+ test objects
  • Stable: standard deviation <0.01% across five random seeds

Target Workloads

  • GeoJSON features
  • Observability telemetry
  • Kubernetes manifests
  • LLM structured output pipelines
  • API response caching

Honest Limitations

General-purpose tokenizers remain superior for prose-heavy JSON where schema repetition is minimal.

Implementation

  • HuggingFace PreTrainedTokenizer compatible
  • Zero-dependency Python implementation