Syntax Experiments
Nine syntax variants for the same language, tested against Claude Haiku. Most initial assumptions about what would save tokens were wrong.
Each variant implements the same five programs: a simple function, one with dependencies, a data transform, a tool interaction with error handling, and a multi-step workflow. Measured with cl100k_base token counts, then tested against Claude Haiku with zero prior knowledge - generating correct code from just a spec and examples.
The variants
Section titled “The variants”The same function across three variants:
idea1 - readable, Haskell-ish:
total price:number quantity:number rate:number -> number subtotal = multiply price quantity tax = multiply subtotal rate add subtotal taxidea7 - dense wire format:
tot p:n q:n r:n>n;s=*p q;t=*s r;+s tidea9 - ultra-dense with short names (the winner):
tot p:n q:n r:n>n;s=*p q;t=*s r;+s tResults
Section titled “Results”Idea Tokens vs Py Chars vs Py Score--------------------------------------------------------------------python-baseline 871 1.00x 3635 1.00x -idea1 921 1.06x 3108 0.86x 10.0idea1-compact 677 0.78x 2564 0.71x 10.0idea2-tool-calling 983 1.13x 3203 0.88x 10.0idea3-constrained-decoding 598 0.69x 2187 0.60x 10.0idea4-ast-bytecode 584 0.67x 1190 0.33x 9.8idea5-workflow-dag 710 0.82x 2603 0.72x 10.0idea6-mcp-composition 956 1.10x 2978 0.82x 9.5idea7-dense-wire 351 0.40x 1292 0.36x 10.0idea8-ultra-dense 285 0.33x 901 0.25x 10.0idea9-ultra-dense-short 287 0.33x 787 0.22x 10.0idea9 uses 0.33x the tokens and 0.22x the characters of Python, with 10/10 generation accuracy. idea4 (a bytecode-like format) was the only one that dipped below 10.
Surprises
Section titled “Surprises”Positional arguments are the biggest win. Going from charge(pid:pid, amt:amt) to just charge pid amt eliminates parens, colons, and repeated parameter names. The single largest token reduction across all variants. The concern about parameter-swap errors was completely unfounded - 10/10 accuracy across all task types.
Short variable names don’t save tokens. order and ord are both single tokens in cl100k_base. The tokenizer already handles common English words efficiently. Abbreviating only saves characters, not tokens. That’s why idea8 (285 tokens) and idea9 (287 tokens) are nearly identical despite idea9 being 114 characters shorter.
Sigils beat keywords, but not by as much as you’d think. Replacing match with ? and for with @ saves characters more than tokens. The real win is disambiguation - a sigil can never be confused with a variable name, which reduces generation errors.
Spec quality trumps syntax cleverness. Going from 8/10 to 10/10 generation accuracy just by adding better operator examples to the spec. The spec is part of the prompt. If it’s ambiguous, the LLM will struggle no matter how clean your syntax is.
The bytecode format was a trap. idea4 (integer-ID based AST) had great character efficiency (0.33x) but was the only variant that dropped below 10/10 accuracy. Making things machine-optimal at the character level can make them harder for LLMs to work with. The model needs some semantic signal.
Cold-LLM test
Section titled “Cold-LLM test”The real validation: give Haiku the spec and examples, then ask it to write completely new programs in unfamiliar task domains. Not reformatting - actually understanding the language and producing novel code.
Per-task breakdown (Full test, /10):
Idea workflow data_pipe decision api_orch--------------------------------------------------------------------idea7-dense-wire 10.0 10.0 10.0 10.0idea8-ultra-dense 10.0 10.0 10.0 10.0idea9-ultra-dense-short 10.0 10.0 10.0 10.0idea4-ast-bytecode 10.0 10.0 9.0 10.0idea6-mcp-composition 9.0 10.0 9.0 10.0The dense formats weren’t just parseable, they were learnable.
Takeaway
Section titled “Takeaway”idea1 through idea3 were designed by intuition, without proper benchmarks. The comparison harness made the difference. Measure first, design second.