Code points to binaries in Elixir
A code point is a numerical representation of a character in the Unicode Standard. Elixir uses UTF-8 to encode strings and binaries.
For example, code point for letter ę
equals to 281
:
iex(1)> ?ę
281
Here is how UTF-8 encoding serializes numbers on multiple bytes:
Length | First byte | Following bytes |
Single byte | 0XXXXXXX | N/A |
Two bytes | 110XXXXX | 10XXXXXX |
Three bytes | 1110XXXX | 10XXXXXX |
Four bytes | 11110XXX | 10XXXXXX |
Our 281
code point requires 2 bytes for representation in UTF-8 encoding (a single UTF-8 byte can encode numbers from 0 to 127 only).
If we convert 281
straight to a binary, we will receive 100011001
. Placing this sequence of 1/0 on placeholders for two-bytes length from the table above gives the following bytes: 11000100 11011001
. In the base-10 system, they are represented as 196
and 153
which are our binaries:
iex(2)> "ę" <> <<0>>
<<196, 153, 0>>
iex(3)> "foo" <> <<0>>
<<102, 111, 111, 0>>
Concatenating a string with with the null byte <<0>>
returns its inner binary representation.