UTF-8 Decode & Validate (C)
Topics: bit manipulation, encoding, Unicode
Problem
Decode a UTF-8 byte stream into Unicode code points, validating as you go and
rejecting every kind of malformed input. Do not allocate memory.
long utf8_decode(const unsigned char *in, size_t len, unsigned int *out, size_t outcap);
bool utf8_valid(const unsigned char *in, size_t len);
utf8_decode writes code points into out[0..outcap) and returns the total number
of code points in the input (the full count even if outcap is too small, so the
call doubles as a sizing call), or -1 if the input is not valid UTF-8.
The encoding
| Bytes | Pattern | Code points |
|---|---|---|
| 1 | 0xxxxxxx | U+0000..U+007F |
| 2 | 110xxxxx 10xxxxxx | U+0080..U+07FF |
| 3 | 1110xxxx 10xxxxxx 10xxxxxx | U+0800..U+FFFF |
| 4 | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx | U+10000..U+10FFFF |
Concatenate the x bits (high to low) to form the code point.
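As a worked example of the concatenation, the sketch below assembles the code point for one 3-byte sequence. The helper name assemble3 is hypothetical (it is not part of the required API), and it performs no validation at all; it only shows where the x bits end up.

```c
/* Hypothetical helper: assemble a code point from a 3-byte sequence
   1110xxxx 10xxxxxx 10xxxxxx. No validation is performed. */
static unsigned int assemble3(unsigned char b0, unsigned char b1, unsigned char b2)
{
    unsigned int cp = b0 & 0x0F;       /* 4 payload bits from the lead byte */
    cp = (cp << 6) | (b1 & 0x3F);      /* next 6 bits from the first continuation */
    cp = (cp << 6) | (b2 & 0x3F);      /* low 6 bits from the second continuation */
    return cp;
}
```

For the bytes E2 82 AC the payload bits are 0010, 000010, and 101100, which concatenate to 0x20AC (the euro sign, U+20AC).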
What counts as invalid
- A lead byte of 10xxxxxx (a lone continuation) or 11111xxx (5+ bytes).
- A continuation byte that isn't 10xxxxxx.
- A truncated sequence (fewer continuation bytes than the lead promises).
- An overlong encoding: the code point is below the minimum for that length
  (e.g. C0 80 for U+0000). Check cp >= min for the byte count.
- A surrogate code point U+D800..U+DFFF.
- Anything beyond U+10FFFF.
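The last three rules apply to the assembled code point rather than to individual bytes, so they can be grouped into one check. The following is a minimal sketch under that assumption; cp_in_range and its min_cp table are hypothetical names, with the minimums taken from the encoding table above.

```c
#include <stdbool.h>

/* Hypothetical helper: range checks on an already assembled code point.
   nbytes is the sequence length (1..4) promised by the lead byte. */
static bool cp_in_range(unsigned int cp, int nbytes)
{
    /* Smallest code point that legitimately needs each length (index = nbytes). */
    static const unsigned int min_cp[5] = { 0, 0x0, 0x80, 0x800, 0x10000 };

    if (nbytes < 1 || nbytes > 4) return false;
    if (cp < min_cp[nbytes]) return false;           /* overlong encoding */
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  /* surrogate */
    if (cp > 0x10FFFF) return false;                 /* beyond the Unicode ceiling */
    return true;
}
```

With this shape, C0 80 is rejected because it assembles to cp = 0 with nbytes = 2, which is below the 2-byte minimum of 0x80.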
Key concepts
- Mask the lead byte to learn the length ((b & 0xE0) == 0xC0 → 2 bytes, etc.;
  the parentheses matter because == binds tighter than & in C) and the seed
  bits; shift in 6 bits per continuation ((cp << 6) | (c & 0x3F)).
- Validate ranges after assembling the code point: overlong, surrogate, and the
  U+10FFFF ceiling are all range checks once you have cp.
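One possible way to combine the lead-byte masks with the 6-bit shift is sketched below for a single sequence. decode_one is a hypothetical helper, not the required utf8_decode; it handles the byte-level rules (bad lead, bad continuation, truncation) and deliberately leaves the range checks on cp to the caller.

```c
#include <stddef.h>

/* Hypothetical helper: decode one sequence starting at in[0].
   Returns the number of bytes consumed (1..4) and stores the raw code
   point in *cp. Returns 0 on a bad lead byte, a bad continuation byte,
   or a truncated sequence. Overlong, surrogate, and U+10FFFF checks are
   NOT done here. */
static size_t decode_one(const unsigned char *in, size_t len, unsigned int *cp)
{
    size_t n;
    unsigned int c;

    if (len == 0) return 0;

    /* Classify the lead byte: sequence length plus seed (payload) bits. */
    if      ((in[0] & 0x80) == 0x00) { n = 1; c = in[0]; }
    else if ((in[0] & 0xE0) == 0xC0) { n = 2; c = in[0] & 0x1F; }
    else if ((in[0] & 0xF0) == 0xE0) { n = 3; c = in[0] & 0x0F; }
    else if ((in[0] & 0xF8) == 0xF0) { n = 4; c = in[0] & 0x07; }
    else return 0;                       /* 10xxxxxx or 11111xxx */

    if (n > len) return 0;               /* truncated sequence */

    for (size_t i = 1; i < n; i++) {
        if ((in[i] & 0xC0) != 0x80) return 0;   /* not 10xxxxxx */
        c = (c << 6) | (in[i] & 0x3F);          /* shift in 6 payload bits */
    }

    *cp = c;
    return n;
}
```

A full decoder could call a helper like this in a loop, advance by the returned length, apply the range checks to each cp, and store the result only while the running count is below outcap.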
Run
cc -std=c11 -O2 -Wall tests.c -o /tmp/utf8 && /tmp/utf8