UTF-8 Decode & Validate (C)

Topics: bit manipulation, encoding, Unicode

Problem

Decode a UTF-8 byte stream into Unicode code points, validating as you go and rejecting every kind of malformed input. The implementation must not allocate memory.

long utf8_decode(const unsigned char *in, size_t len, unsigned int *out, size_t outcap);
bool utf8_valid(const unsigned char *in, size_t len);

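utf8_decode writes code points into out[0..outcap) and returns the total number of code points in the input. If outcap is too small, it still returns the full count, so a call can double as a sizing call. It returns -1 if the input is not valid UTF-8. utf8_valid reports whether the input is valid UTF-8.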

The encoding

Bytes	Pattern	Code points
1	`0xxxxxxx`	`U+0000..U+007F`
2	`110xxxxx 10xxxxxx`	`U+0080..U+07FF`
3	`1110xxxx 10xxxxxx 10xxxxxx`	`U+0800..U+FFFF`
4	`11110xxx 10xxxxxx 10xxxxxx 10xxxxxx`	`U+10000..U+10FFFF`

Concatenate the x bits (high to low) to form the code point.
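
As a small worked example of the bit concatenation, the sketch below assembles one 3-byte sequence by hand. The helper name assemble3 is hypothetical and not part of the required API, and it assumes the three bytes already match the 3-byte pattern in the table.

```c
#include <stdint.h>

/* Hypothetical helper for illustration only: assemble a code point from a
   3-byte sequence. Assumes the bytes already match
   1110xxxx 10xxxxxx 10xxxxxx; no validation is done here. */
static uint32_t assemble3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    uint32_t cp = b0 & 0x0F;        /* 4 payload bits from the lead byte */
    cp = (cp << 6) | (b1 & 0x3F);   /* 6 bits from the first continuation */
    cp = (cp << 6) | (b2 & 0x3F);   /* 6 bits from the second continuation */
    return cp;
}
```

For the bytes E2 82 AC the payload bits are 0010, 000010 and 101100, which concatenate to 0x20AC (U+20AC).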

What counts as invalid

A lead byte of 10xxxxxx (a lone continuation) or 11111xxx (5+ bytes).
A continuation byte that isn't 10xxxxxx.
A truncated sequence (fewer continuation bytes than the lead promises).
An overlong encoding: the code point is below the minimum for that length (e.g. C0 80 for U+0000). Check cp >= min for the byte count, where min is 0x80, 0x800 and 0x10000 for 2, 3 and 4 bytes.
A surrogate code point U+D800..U+DFFF.
Anything beyond U+10FFFF.
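
The last three rules in the list above are plain comparisons once the code point has been assembled. The sketch below shows one way to write them; the helper name cp_ok and its nbytes parameter (the sequence length implied by the lead byte) are assumptions made for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: returns true if cp is acceptable for a sequence of
   nbytes bytes (1..4). Lead-byte and continuation-byte checks are assumed
   to have happened already. */
static bool cp_ok(uint32_t cp, int nbytes)
{
    /* Smallest code point that legitimately needs nbytes bytes. */
    static const uint32_t min_cp[5] = { 0, 0x0000, 0x0080, 0x0800, 0x10000 };

    if (nbytes < 1 || nbytes > 4) return false;
    if (cp < min_cp[nbytes]) return false;            /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;   /* surrogate */
    if (cp > 0x10FFFF) return false;                  /* beyond Unicode */
    return true;
}
```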

Key concepts

Mask the lead byte to learn the length ((b & 0xE0) == 0xC0 → 2 bytes, etc.) and the seed bits; shift in 6 bits per continuation ((cp << 6) | (c & 0x3F)). The parentheses around the mask matter: in C, == binds tighter than &.
Validate ranges after assembling the code point: overlong, surrogate, and the U+10FFFF ceiling are all range checks once you have cp.
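
Putting the two ideas together, a decoder might look like the sketch below. It is one possible approach, not a reference solution. It assumes unsigned int is at least 32 bits wide and that passing out as NULL with outcap 0 is acceptable for a sizing call.

```c
#include <stdbool.h>
#include <stddef.h>

long utf8_decode(const unsigned char *in, size_t len,
                 unsigned int *out, size_t outcap)
{
    size_t i = 0, count = 0;

    while (i < len) {
        unsigned char b = in[i];
        unsigned int cp, min;
        size_t extra;   /* continuation bytes promised by the lead byte */

        if (b < 0x80)                { cp = b;        extra = 0; min = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; min = 0x80; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; min = 0x800; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; min = 0x10000; }
        else return -1;   /* lone continuation (10xxxxxx) or 11111xxx */

        if (extra > len - i - 1) return -1;   /* truncated sequence */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80) return -1;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        /* Range checks on the assembled code point. */
        if (cp < min) return -1;                        /* overlong */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1;    /* surrogate */
        if (cp > 0x10FFFF) return -1;                   /* beyond Unicode */

        if (count < outcap) out[count] = cp;   /* keep counting when full */
        count++;
        i += extra + 1;
    }
    return (long)count;
}

bool utf8_valid(const unsigned char *in, size_t len)
{
    /* Validation is decoding with nowhere to store the results. */
    return utf8_decode(in, len, NULL, 0) >= 0;
}
```

Implementing utf8_valid on top of utf8_decode keeps a single copy of the rules. A dedicated validation loop could be faster, but that is a trade-off this sketch does not explore.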

Run

cc -std=c11 -O2 -Wall tests.c -o /tmp/utf8 && /tmp/utf8
