The bytes/str dichotomy in Python 3

January 30th, 2012 at 7:48 pm

Arguably the most significant new feature of Python 3 is a much cleaner separation between text and binary data. Text is always Unicode and is represented by the str type, and binary data is represented by the bytes type. What makes the separation particularly clean is that str and bytes can’t be mixed in Python 3 in any implicit way. You can’t concatenate them, look for one inside the other, or, more generally, pass one to a function that expects the other. This is a good thing.
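
As a small sketch of what "can't be mixed" means in practice, the snippet below tries two implicit mixes and records whether Python accepts them. The variable names (text, data, mixed_ok, search_ok) are made up for this illustration only.

```python
text = "price: "
data = b"\xe2\x82\xac20"  # the UTF-8 bytes for '€20'

# Concatenating str and bytes is rejected with a TypeError.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Looking for a bytes object inside a str is rejected as well.
try:
    data in text
    search_ok = True
except TypeError:
    search_ok = False

# The explicit route: decode the bytes first, then stay on the text side.
combined = text + data.decode("utf-8")
print(mixed_ok, search_ok, combined)
```

The only way across the boundary is an explicit decode (or encode) call, which is exactly where the encoding has to be named.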

However, boundaries between strings and bytes are inevitable, and this is where the following diagram is always important to keep in mind:

http://eli.thegreenplace.net/wp-content/uploads/2012/01/py3_string_bytes.png

Strings can be encoded to bytes, and bytes can be decoded back to strings.

>>> '€20'.encode('utf-8')
b'\xe2\x82\xac20'
>>> b'\xe2\x82\xac20'.decode('utf-8')
'€20'

Think of it this way: a string is an abstract representation of text. A string consists of characters, which are also abstract entities not tied to any particular binary representation. When manipulating strings, we’re living in blissful ignorance. We can split and slice them, concatenate and search inside them. We don’t care how they are represented internally and how many bytes it takes to hold each character in them. We only start caring about this when encoding strings into bytes (for example, in order to send them over a communication channel), or decoding strings from bytes (for the other direction).
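
To make the "blissful ignorance" concrete, here is a minimal sketch that compares what len and indexing report on the string side and on the bytes side. The names are illustrative, and UTF-8 is just one possible choice of encoding.

```python
s = "€20"

# On the string side we count and index abstract characters (code points).
n_chars = len(s)      # 3
first_char = s[0]     # '€'

# The byte count only becomes visible once the string is encoded.
utf8 = s.encode("utf-8")
n_bytes = len(utf8)   # 5, because the euro sign takes three bytes in UTF-8
first_byte = utf8[0]  # indexing a bytes object yields an int (0xe2 here)

print(n_chars, n_bytes, first_char, hex(first_byte))
```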

The argument given to encode and decode is the encoding (or codec). The encoding is a way to represent abstract characters in binary data. There are many possible encodings. UTF-8, shown above, is one. Here’s another:

>>> '€20'.encode('iso-8859-15')
b'\xa420'
>>> b'\xa420'.decode('iso-8859-15')
'€20'
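
The same comparison can be run over several codecs at once. This is only a sketch: the list of codecs is an arbitrary selection, and the names encoded, round_trips and ascii_ok are invented for the example.

```python
s = "€20"

# An arbitrary selection of encodings that can all represent the euro sign.
codecs_to_try = ["utf-8", "utf-16-le", "iso-8859-15", "cp1252"]
encoded = {name: s.encode(name) for name in codecs_to_try}

for name, data in encoded.items():
    print(f"{name:12} {len(data)} bytes  {data!r}")

# Decoding with the same codec that did the encoding gives the text back.
round_trips = all(data.decode(name) == s for name, data in encoded.items())

# Not every encoding can represent every character: ASCII has no euro sign.
try:
    s.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False
```

The string is the same in every case; only its binary representation differs, both in content and in length.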

The encoding is a crucial part of this translation process. Without the encoding, the bytes object b'\xa420' is just a bunch of bits. The encoding gives it meaning. Using a different encoding, this bunch of bits can have a different meaning:

>>> b'\xa420'.decode('windows-1255')
'₪20'

That’s 80% of the money lost due to using the wrong encoding, so be careful ;-)
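
A wrong encoding does not always produce plausible-looking text. The sketch below decodes the same three bytes with a few codecs; the choice of codecs and the variable names are assumptions made for the illustration.

```python
data = b"\xa420"

as_latin9 = data.decode("iso-8859-15")   # '€20', the intended text
as_hebrew = data.decode("windows-1255")  # '₪20', silently wrong
as_latin1 = data.decode("iso-8859-1")    # '¤20', also silently wrong

# Some codecs refuse the input outright: 0xa4 cannot start a UTF-8 sequence.
try:
    data.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

# errors="replace" trades the exception for U+FFFD replacement characters.
lossy = data.decode("utf-8", errors="replace")

print(as_latin9, as_hebrew, as_latin1, utf8_ok, repr(lossy))
```

The silent cases are the dangerous ones: nothing in the bytes themselves says which reading was intended.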

12 Responses to “The bytes/str dichotomy in Python 3”

  1. Nick Coghlan Says:

    Nice summary. One minor technicality: Python 3 strings are a sequence of code points rather than characters. While those are effectively the same thing for most text processing, they aren’t *exactly* the same thing in the Unicode standard.

  2. eliben Says:

    Nick,

    Thanks for the clarification. I was actually planning to write a more in-depth article on Unicode with Python 3 at some point (because I couldn’t really find something that was good enough online). Maybe I’ll get to it one day…

  3. donn Says:

    Quick and neat. I like the diagram too. Thx.

  4. Tobu Says:

    AIUI this is complicated by the fact that Python compiled in UCS2 mode will only use code points that are represented by a single UTF-16 code unit, using surrogate pairs for the astral plane. This applies to Windows builds of CPython, and on Linux, to non-distribution builds of CPython (due to a questionable configure default); I believe this will be fixed by PEP 393 in Python 3.3.

  5. eliben Says:

    Tobu,

    Thanks to your comment I’ve just learned a new acronym – AIUI :-)

  6. Mariuz Says:

    I like to think of encoding strings by analogy with encrypting your plain text into a random stream of bytes

    string (plain text) -> encode (encrypt) -> bytes (stream)
    bytes (stream) -> decode (decrypt) -> string (plain text)

    and the codec you choose is your key (utf-8 …)

  7. Robert Says:

    Since I have started using Python 3, I do not understand strings in Python 2 anymore. :-/

  8. Tshepang Lekhonkhobe Says:

    I found the chapter on Unicode in DiveIntoPython3 excellent. Have you read it?

  9. anatoly techtonik Says:

    encode()/decode() is confusing/misleading/hard to remember.
    tobytes()/frombytes() would be more clear.

  10. eliben Says:

    anatoly,

    I’d argue this is a matter of preference. Personally I find encode / decode clearer.

  11. Sandra Walker Says:

    This is a good change. People far too often don’t appreciate the difference between text and bytes. Especially, I am afraid, Americans who live far from Europe and only use ASCII themselves.

    In more recently designed frameworks such as .NET, there is the same distinction, and you go through either encodings or, more usually, TextReader/Writers, to read/write text into various byte streams (such as files).

    I’ve unfortunately spent too many hours arguing with people about the need to handle this correctly in open source software I am involved in.

  12. Chris Graham Says:

    encode()/decode() makes more sense because of the parameters you can pass in for the different types of encoding schemes. If it was tobytes()/frombytes() that would suggest that there was a single standard byte representation for a given string which is not the case.
