Binary / Text encoding
If you wish to efficiently encode binary data as Unicode text:
- in UTF-8, use Base64 or Base85
- in UTF-16, use Base32768
- in UTF-32, use Base65536
see also
ASCII‑constrained
- Base1 *
- Base16, Hex - 50% UTF-8 efficiency - the usual form of hash output, e.g. a9eb85ea214a6cfa6882f4be041d5cce7bee3e45
- Base32 - standard 32-character set: twenty-six upper-case letters A–Z and the digits 2–7.
- Base36 - Arabic numerals 0–9 and the Latin letters A–Z
- Base58 - avoids both non-alphanumeric characters (+ and /) and letters which might look ambiguous when printed (0 - zero, I - capital i, O - capital o and l - lower case L).
- Base64 / uuencode - 75% UTF-8 efficiency
- Base85, Ascii85 - 80% UTF-8 efficiency - more efficient than uuencode or Base64, but the output may contain characters that need escaping, such as backslash and quote †
- basE91
- Base-122
- yEnc - 8-bit encoding method, 252 of the 256 possible bytes are passed through unencoded as a single byte, whether that result is a printable ASCII character or not. Only NUL, LF, CR, and = are escaped.
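The efficiency figures in the list above can be checked with Python's standard `base64` module. This is only a sketch: `utf8_efficiency` is a helper name made up here, and `base64.b85encode` uses the Git-style Base85 alphabet rather than Ascii85, although the size ratio is the same. Every output character is ASCII, so it occupies one byte in UTF-8.

```python
import base64

def utf8_efficiency(encoded: bytes, raw_len: int) -> float:
    """Payload bytes divided by encoded bytes (ASCII output is 1 byte per char in UTF-8)."""
    return raw_len / len(encoded)

# 60 bytes splits evenly into 5-byte (Base32), 3-byte (Base64) and
# 4-byte (Base85) groups, so padding does not distort the ratios.
payload = bytes(range(60))

results = {
    "Base16": utf8_efficiency(base64.b16encode(payload), len(payload)),
    "Base32": utf8_efficiency(base64.b32encode(payload), len(payload)),
    "Base64": utf8_efficiency(base64.b64encode(payload), len(payload)),
    "Base85": utf8_efficiency(base64.b85encode(payload), len(payload)),
}

for name, ratio in results.items():
    print(f"{name}: {ratio:.1%}")
```

For short or oddly sized inputs the padded encodings come out slightly worse than these ratios.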
† Base85 is listed for completeness but all variants use characters which are considered hazardous for general use in
* The Base1 encoding is not as simple as taking the binary as a place-value base 256 number. This would give no way to distinguish buffers with leading null bytes from one another. We have to encode the length of the source buffer as well. We do this by sorting all possible buffers by length and then lexicographically, then simply returning the index of the buffer in the list.
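A minimal sketch of the indexing described in the Base1 note, assuming byte order is big-endian within each length. The function names `base1_index` and `base1_buffer` are made up for illustration and are not taken from any Base1 library.

```python
def base1_index(buf: bytes) -> int:
    """Index of buf among all buffers sorted by length, then lexicographically."""
    # Count of all buffers strictly shorter than buf: 256**0 + ... + 256**(n-1).
    shorter = sum(256 ** k for k in range(len(buf)))
    # Rank among buffers of the same length is the big-endian integer value.
    return shorter + int.from_bytes(buf, "big")

def base1_buffer(index: int) -> bytes:
    """Inverse of base1_index: recover the buffer from its index."""
    n = 0
    # Peel off whole length classes until the index falls inside one.
    while index >= 256 ** n:
        index -= 256 ** n
        n += 1
    return index.to_bytes(n, "big")
```

With this scheme the empty buffer gets index 0, `b"\x00"` gets 1 and `b"\x00\x00"` gets 257, so buffers that differ only in leading null bytes stay distinct.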
Unicode
To encode one additional bit per code point, we need to double the number of code points we use from 65,536 to 131,072. This would be a new encoding, Base131072, and its UTF-32 encoding efficiency would be 53% vs. 50% for Base65536. (Note that in UTF-16, Base32768 significantly outperforms either choice, and in UTF-8, Base64 remains the preferred choice.)
Base - efficiency
- Base2048 56%
- Base32768 63%
Base2048 sadly renders Base65536 obsolete for its original intended purpose of sending binary data through Twitter. Using Base2048, up to 385 octets can fit in a single Tweet. Compare with Base65536, which manages only 280 octets.
However, Base65536 remains the state of the art for sending binary data through text-based systems which naively count Unicode code points, particularly those using the fixed-width UTF-32 encoding.
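The numbers quoted above follow from simple arithmetic, sketched here. The helper names are made up. The Tweet figures rest on an assumption not stated in these notes: that Twitter counts each Base2048 character as one towards its 280 limit but each Base65536 character as two, leaving room for only 140 of them.

```python
def utf32_efficiency(bits_per_code_point: int) -> float:
    """Payload bits carried per 32-bit UTF-32 code unit."""
    return bits_per_code_point / 32

def octets_that_fit(bits_per_code_point: int, code_points: int) -> int:
    """Whole octets carried by the given number of code points."""
    return bits_per_code_point * code_points // 8

base65536_utf32 = utf32_efficiency(16)    # 16 of 32 bits carry payload
base131072_utf32 = utf32_efficiency(17)   # 17 of 32 bits carry payload

# Assumed character budgets per Tweet (see the note above the block).
base2048_tweet = octets_that_fit(11, 280)    # 11 bits per character
base65536_tweet = octets_that_fit(16, 140)   # 16 bits per character

print(f"{base65536_utf32:.0%} {base131072_utf32:.0%}")
print(base2048_tweet, base65536_tweet)
```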
BMP‑constrained
Full Unicode
- Base65536 56%
Scheme and overhead
- base2048 / HN - a binary encoding optimised for transmitting data through Twitter, up to 385 octets can fit in a single Tweet. Compare with Base65536, which manages only 280 octets.
- qntm/ base65536
- online example
- Base32768 - a binary encoding optimised for UTF-16-encoded text.
- Base65536 / HN - Base65536 is a binary encoding optimised for UTF-32-encoded text.
Base65536 encodes data in a similar fashion to Base64, but its alphabet, instead of being 64 characters long, is 65,536 characters long. This means one can map 16 bits of data into a single Unicode code point. It is of course terribly inefficient if you count the output bytes (especially when UTF-8 encoded), but if you count just the number of Unicode characters, as for example Twitter does for its length limit, you can fit double the data per character.
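To illustrate the 16-bits-per-code-point idea, here is a toy sketch. It does NOT use the real Base65536 alphabet: `PAIR_BASE` and `TAIL_BASE` are hypothetical offsets chosen only because they give contiguous ranges, and many of those code points are unassigned, so this is unsuitable for real use.

```python
# Hypothetical alphabet, for illustration only (not the Base65536 character set).
PAIR_BASE = 0x20000  # 65,536 consecutive code points, one per 16-bit value
TAIL_BASE = 0x30000  # 256 code points for a trailing odd byte

def toy_encode(data: bytes) -> str:
    out = []
    # Pack each pair of bytes into one code point.
    for i in range(0, len(data) - 1, 2):
        value = (data[i] << 8) | data[i + 1]
        out.append(chr(PAIR_BASE + value))
    # An odd final byte gets its own code point from a separate range.
    if len(data) % 2:
        out.append(chr(TAIL_BASE + data[-1]))
    return "".join(out)

def toy_decode(text: str) -> bytes:
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if PAIR_BASE <= cp < PAIR_BASE + 0x10000:
            value = cp - PAIR_BASE
            out += bytes([value >> 8, value & 0xFF])
        elif TAIL_BASE <= cp < TAIL_BASE + 0x100:
            out.append(cp - TAIL_BASE)
        else:
            raise ValueError(f"unexpected code point U+{cp:04X}")
    return bytes(out)

encoded = toy_encode(b"hello world")
print(len(encoded), len(encoded.encode("utf-8")))
```

Eleven bytes become six code points, but those six code points take far more than eleven bytes once written out as UTF-8, which is the trade-off described above.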
What is the most efficient binary to text encoding?
Base64/UUencode
install
Usage
Ruby: uudecode = string.unpack('u').first
C++
- tomykaira/Base64.h - single header base64 decode/encoder.
- boost/beast/core/detail/base64.hpp (1.74)