Hi everyone! In this first installment of the Whoa, that’s fantastic! series, we’re gonna talk about UTF-8 and its mysterious, mind-bending ways. (This blog post is complementary to this video.)
But let’s take this from the beginning.
You see, back in my day, if you wanted to have a voice chat with your BFF you needed to put coins into these huge-ass smartphones they had bolted on to the sidewalk, then turn a dial based on some random number assigned to your friend, and then that’d make a slightly smaller smartphone plugged into a wall at your friend’s house make noises, and it meant someone wanted to have a voice chat.
My point is, I’m old. And as an old person, I grew up using ASCII.
And ASCII is simple:
x := "I'm so hungry right now."
for i := 0; i < len(x); i++ {
	fmt.Printf("%-3v → %4d → %08b\n", string(x[i]), x[i], x[i])
}
↓
I → 73 → 01001001
' → 39 → 00100111
m → 109 → 01101101
→ 32 → 00100000
s → 115 → 01110011
o → 111 → 01101111
→ 32 → 00100000
h → 104 → 01101000
u → 117 → 01110101
[...]
One byte per character. One character per byte. Beautiful.
Then I took a break from computering for a few years, and when I came back there was this whole UTF-8 thing.
So I learned about runes and whatever (that’s Golang if you’re wondering), and for a while, functionally at least, everything was fine.
But y’know, I’m a curious person, and there’s nothing I don’t get curious about eventually.
Which led me to…
x := "Ó o auê aí, ô!" // Perfectly valid Portuguese. (Not really.)
for i := 0; i < len(x); i++ {
	fmt.Printf("%-3v → %4d → %08b\n", string(x[i]), x[i], x[i]) // %08b keeps the leading zeros
}
↓
Ã → 195 → 11000011
→ 147 → 10010011
→ 32 → 00100000
o → 111 → 01101111
→ 32 → 00100000
a → 97 → 01100001
u → 117 → 01110101
Ã → 195 → 11000011
ª → 170 → 10101010
[...]
I mean, of course, right? UTF-8 needs more than one byte per character, else how would it encode over a million different characters?
s := "Ó o auê aí, ô!"
for len(s) > 0 {
	char, size := utf8.DecodeRuneInString(s) // decode the first rune of s
	fmt.Printf("%-2c → %3d → Size: %1v byte(s).\n", char, char, size)
	s = s[size:] // drop the bytes we just decoded
}
↓
Ó → 211 → Size: 2 byte(s).
[...]
ê → 234 → Size: 2 byte(s).
[...]
í → 237 → Size: 2 byte(s).
[...]
ô → 244 → Size: 2 byte(s).
[...]
So those funny characters take more than one byte (and in UTF-8 a character may take up to four).
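If you just want the two numbers side by side, here's a minimal sketch. `byteAndRuneCount` is a name I made up for illustration; `len` and `utf8.RuneCountInString` are the real standard library pieces doing the work.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// byteAndRuneCount returns the length of s in bytes and in runes (code points).
// Hypothetical helper, only here to put the two counts next to each other.
func byteAndRuneCount(s string) (nBytes, nRunes int) {
	return len(s), utf8.RuneCountInString(s)
}

func main() {
	for _, s := range []string{"I'm so hungry right now.", "Ó o auê aí, ô!"} {
		b, r := byteAndRuneCount(s)
		fmt.Printf("%q: %d bytes, %d runes\n", s, b, r)
	}
}
```

For the ASCII sentence both counts should match. For the Portuguese-ish one, the four accented letters each cost an extra byte, so it should report 18 bytes for 14 runes.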
Fine.
But… how does it work?
Let’s see:
// a + ~ = ã, right?
ã := []byte{97, 126}
fmt.Println(string(ã))
// And can we spot a pattern of repeating separators?
x := "Ó o auê aí, ô!"
for _, v := range x {
	fmt.Printf("%v ", v)
}
↓
a~
211 32 111 32 97 117 234 32 97 237 44 32 244 33
No and no. Well gosh darn it.
(And about #3, my otherworldly consultants said it isn’t.)
So… how?
At this point I spent ages trying to solve this on my own, and then, at last, I RTFM’d. Here’s the important bit, adapted from Wikipedia:
| Number of bytes | Bits for code point | Byte 1 | Byte 2 | Byte 3 | Byte 4 |
|---|---|---|---|---|---|
| 1 | 7 | 0xxxxxxx | | | |
| 2 | 11 | 110xxxxx | 10xxxxxx | | |
| 3 | 16 | 1110xxxx | 10xxxxxx | 10xxxxxx | |
| 4 | 21 | 11110xxx | 10xxxxxx | 10xxxxxx | 10xxxxxx |
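Meaning: the leading bits of the first byte tell you how many bytes the character takes (0 for one, 110 for two, 1110 for three, 11110 for four), every byte after the first starts with 10, and the x’s are the bits of the code point itself.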
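The table reads almost like a switch statement, so here's a rough sketch of it as one. `seqLen` is a hypothetical helper of mine, not something from the standard library, and it only looks at the first byte. It doesn't check that the following bytes are actually there or valid.

```go
package main

import "fmt"

// seqLen reports how many bytes a UTF-8 sequence should have, judging only
// by its first byte. It returns 0 for a continuation byte (10xxxxxx) or for
// a byte that matches no row of the table (11111xxx).
// Hypothetical helper for illustration; it does not validate the sequence.
func seqLen(first byte) int {
	switch {
	case first&0b1000_0000 == 0b0000_0000: // 0xxxxxxx
		return 1
	case first&0b1110_0000 == 0b1100_0000: // 110xxxxx
		return 2
	case first&0b1111_0000 == 0b1110_0000: // 1110xxxx
		return 3
	case first&0b1111_1000 == 0b1111_0000: // 11110xxx
		return 4
	default:
		return 0
	}
}

func main() {
	for _, b := range []byte{0x61, 0xC3, 0xE9, 0xF0, 0xA3} {
		fmt.Printf("%08b → %d\n", b, seqLen(b))
	}
}
```

Each mask keeps just the signaling bits of that row and compares them with the expected pattern. The last byte in `main`, 0xA3, starts with 10, so it's a continuation byte and can't begin a character.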
To see if this works:
x := "aã香🤔"
for i := 0; i < len(x); i++ {
	fmt.Printf("%08b ", x[i])
}
Let’s ponder this before I show you the output. We have four characters there, and their code points are: U+0061, U+00E3, U+9999, and U+1F914. They should be one, two, three, and four bytes long respectively.
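Where do those lengths come from? From the "bits for code point" column: a code point goes in the smallest row that has room for it. A small sketch, assuming a made-up helper called `bytesNeeded`, with the real `utf8.RuneLen` printed next to it as a sanity check:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// bytesNeeded picks the smallest row of the table whose payload (7, 11, 16
// or 21 bits) can hold r. Hypothetical helper; it ignores invalid values
// such as surrogates, negative runes, or anything above U+10FFFF.
func bytesNeeded(r rune) int {
	switch {
	case r < 1<<7:
		return 1
	case r < 1<<11:
		return 2
	case r < 1<<16:
		return 3
	default:
		return 4
	}
}

func main() {
	for _, r := range "aã香🤔" {
		fmt.Printf("U+%04X: %d byte(s), utf8.RuneLen says %d\n", r, bytesNeeded(r), utf8.RuneLen(r))
	}
}
```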
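Meaning we should have one byte starting with 0, then a 110 byte followed by one 10 byte, then a 1110 byte followed by two 10 bytes, and finally a 11110 byte followed by three 10 bytes.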
Let’s see:
01100001 11000011 10100011 11101001 10100110 10011001 11110000 10011111 10100100 10010100
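Reshuffled a bit for clarity, one character per line:
a → 01100001
ã → 11000011 10100011
香 → 11101001 10100110 10011001
🤔 → 11110000 10011111 10100100 10010100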
Awesome!
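You don't have to do that reshuffling by eye, by the way. Since every continuation byte starts with 10, a new character begins at any byte that doesn't. Here's a sketch of that idea. `isContinuation` and `groupBytes` are names I invented, and the code assumes the input is valid UTF-8.

```go
package main

import (
	"fmt"
	"strings"
)

// isContinuation reports whether b has the form 10xxxxxx.
func isContinuation(b byte) bool {
	return b&0b1100_0000 == 0b1000_0000
}

// groupBytes formats the bytes of s in binary and starts a new group at
// each byte that is not a continuation byte. Assumes s is valid UTF-8.
func groupBytes(s string) []string {
	var groups []string
	for i := 0; i < len(s); i++ {
		bits := fmt.Sprintf("%08b", s[i])
		if isContinuation(s[i]) && len(groups) > 0 {
			// Belongs to the character we are already in.
			groups[len(groups)-1] += " " + bits
		} else {
			// First byte of a new character.
			groups = append(groups, bits)
		}
	}
	return groups
}

func main() {
	fmt.Println(strings.Join(groupBytes("aã香🤔"), "\n"))
}
```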
Now let’s see if the remaining bits, apart from all that signaling, actually form the numbers we’re looking for:
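Dropping the signaling bits and gluing what’s left together gives us three binary strings, which we can convert back to numbers: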
a, _ := strconv.ParseInt("00011100011", 2, 64)
b, _ := strconv.ParseInt("1001100110011001", 2, 64)
c, _ := strconv.ParseInt("000011111100100010100", 2, 64)
fmt.Printf("%x %x %x", a, b, c)
The output here is e3 9999 1f914. In other words: U+00E3, U+9999, and U+1F914, the very code points we started with.
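The same bit surgery can be done with masks and shifts instead of hand-typed strings. What follows is a minimal sketch, not how the standard library does it: `decodeFirst` is a hypothetical helper, and unlike `unicode/utf8` it doesn't reject overlong encodings, surrogates, or values above U+10FFFF.

```go
package main

import "fmt"

// decodeFirst strips the signaling bits from the first UTF-8 sequence in b
// and glues the remaining bits together into a code point. It returns the
// code point and the number of bytes used, or (-1, 0) if b does not start
// with a sequence shaped like a row of the table.
// Hypothetical helper: it does not reject overlong encodings, surrogates,
// or values above U+10FFFF.
func decodeFirst(b []byte) (rune, int) {
	if len(b) == 0 {
		return -1, 0
	}
	var size int
	var cp rune
	switch {
	case b[0]&0x80 == 0x00: // 0xxxxxxx: 7 payload bits
		size, cp = 1, rune(b[0])
	case b[0]&0xE0 == 0xC0: // 110xxxxx: 5 payload bits
		size, cp = 2, rune(b[0]&0x1F)
	case b[0]&0xF0 == 0xE0: // 1110xxxx: 4 payload bits
		size, cp = 3, rune(b[0]&0x0F)
	case b[0]&0xF8 == 0xF0: // 11110xxx: 3 payload bits
		size, cp = 4, rune(b[0]&0x07)
	default:
		return -1, 0
	}
	if len(b) < size {
		return -1, 0
	}
	for _, c := range b[1:size] {
		if c&0xC0 != 0x80 { // every following byte must be 10xxxxxx
			return -1, 0
		}
		cp = cp<<6 | rune(c&0x3F) // make room, then append its six payload bits
	}
	return cp, size
}

func main() {
	b := []byte("aã香🤔")
	for len(b) > 0 {
		cp, size := decodeFirst(b)
		if size == 0 {
			break
		}
		fmt.Printf("U+%04X (%d byte(s))\n", cp, size)
		b = b[size:]
	}
}
```

The loop in `main` mirrors the earlier `utf8.DecodeRune` one: decode a character, chop off the bytes it used, repeat.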
So there we go.
All is good in the world now.
We know how UTF-8 does its magic.