JSON Tokenizer (C)
Topics: parsing, state machines, zero-copy
Problem
Implement a minimal, allocation-free JSON tokenizer in the style of
jsmn. It does not build a tree or copy strings; it
scans the input once and fills a caller-provided array of tokens, each recording a
type and the byte span of its text.
typedef enum { JSON_UNDEFINED, JSON_OBJECT, JSON_ARRAY, JSON_STRING, JSON_PRIMITIVE } JsonType;
typedef struct { JsonType type; int start; int end; int size; } JsonToken;
int json_parse(const char *js, size_t len, JsonToken *out, size_t maxtok);
// returns the token count, or a negative error code
Each token's [start, end) indexes into the original input (strings exclude the
quotes). Containers record their child count in size; an object key records 1
(its value). Errors: -1 not enough tokens, -2 invalid character or mismatched
bracket, -3 truncated input.
For {"a":1,"b":[2,3]} the tokens are: object(size 2), "a"(size 1), 1, "b"(size 1),
array(size 2), 2, 3.
Key concepts
- Zero-copy, no heap. The genius of jsmn is that a token is just
(type, start, end, size). No strings are duplicated, no nodes allocated, so it runs in constant
extra memory on a microcontroller. The caller decides how many tokens to allow.
- Track the enclosing token. Keep an index to the current "super" (the open
container or the key awaiting a value). Opening a container pushes it as the new
super; closing one pops back to its parent;
: makes the value belong to the key;
, re-points to the container. Every token bumps its super's size.
- Closing brackets walk back. To close
} or ], scan the token array backward
for the nearest still-open container, verify its type matches, and set its end.
- Validation is mostly structural: an unmatched bracket is
INVAL, a container or
string still open at the end is PART.
This is exactly how constrained systems (IoT firmware, embedded HTTP clients) parse
JSON without a dynamic allocator.
Run
cc -std=c11 -O2 -Wall tests.c -o /tmp/json && /tmp/json
Sign in to submit your solution.