Unicode Fundamentals

Presenter Notes

  • This talk covers both Python 2 and Python 3.
  • Who's been bitten by Unicode?

In the beginning there was ASCII

7-bit encoding (128 values from 0x00 to 0x7f)

ASCII strings are usually represented as a string of bytes, one character per byte

      0 1 2 3 4 5 6 7 8 9 A B C D E F
0x0   (control characters)
0x1   (control characters)
0x2     ! " # $ % & ' ( ) * + , - . /
0x3   0 1 2 3 4 5 6 7 8 9 : ; < = > ?
0x4   @ A B C D E F G H I J K L M N O
0x5   P Q R S T U V W X Y Z [ \ ] ^ _
0x6   ` a b c d e f g h i j k l m n o
0x7   p q r s t u v w x y z { | } ~

(0x20 is the space character; 0x7f is DEL, a control character)

What about non-English text?

  • Let's use those extra 128 combinations

    • Proliferation of encodings for bytes 0x80 to 0xff (e.g. ISO-8859-1)
  • But some languages have (far) more than 256 characters

    • multi-byte encodings (e.g. Shift JIS)

A string of bytes can mean lots of different things!

Presenter Notes

You cannot know what text a series of bytes represents without knowing its encoding
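
A rough illustration of that point, in Python 3 syntax. The byte value and the three codecs are picked only for the example:

```python
# One byte, three different characters depending on the assumed encoding.
raw = b'\xe9'

as_latin1 = raw.decode('iso8859-1')    # Western European
as_cyrillic = raw.decode('iso8859-5')  # Cyrillic
as_greek = raw.decode('iso8859-7')     # Greek

for label, text in [('ISO-8859-1', as_latin1),
                    ('ISO-8859-5', as_cyrillic),
                    ('ISO-8859-7', as_greek)]:
    # ascii() keeps the output printable on any terminal
    print(label, ascii(text), 'U+%04X' % ord(text))
```

Nothing in the byte itself says which of the three readings is the intended one.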

Unicode

Universal Character Set - Represents characters as "code points"

Over 1.1 million available code points (U+0000 to U+10FFFF), enough for every language on Earth

Strings of Unicode code points can be represented using different encodings

Encodings example: 'a'


Unicode example: 'a'
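
A minimal sketch of the 'a' case in Python 3 syntax. The list of encodings is an assumption chosen for illustration:

```python
# 'a' is code point U+0061. The ASCII-compatible encodings agree on its bytes;
# the fixed-width ones pad it with zero bytes.
char = 'a'
code_point = ord(char)

encoded = {name: char.encode(name)
           for name in ('ascii', 'iso8859-1', 'utf-8', 'utf-16-be', 'utf-32-be')}

print('U+%04X' % code_point)
for name, data in encoded.items():
    print(name, data.hex())
```

For pure ASCII text, ASCII, ISO-8859-1 and UTF-8 all produce the same single byte, which is why encoding bugs tend to stay hidden until non-ASCII text shows up.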

Encodings example: 'é'


Unicode example: 'é'
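
A minimal sketch of the 'é' case in Python 3 syntax. The encodings shown are an assumption chosen for illustration:

```python
# 'é' is code point U+00E9. Here the encodings no longer agree,
# and ASCII cannot represent the character at all.
char = '\u00e9'
code_point = ord(char)

latin1_bytes = char.encode('iso8859-1')   # one byte
utf8_bytes = char.encode('utf-8')         # two bytes
utf16_bytes = char.encode('utf-16-be')    # two bytes, different values

try:
    char.encode('ascii')
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print('U+%04X' % code_point, latin1_bytes.hex(), utf8_bytes.hex(),
      utf16_bytes.hex(), ascii_ok)
```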

Python 2

Byte strings in Python 2

str type - 8-bit strings (byte strings)

String literals are 8-bit strings (str) by default

They are not "ASCII strings"

>>> 'hello'
'hello'
>>> '\x62\x72\x10\xff'
'br\x10\xff'

str is a generic byte string that happens to treat byte values 0-127 as the corresponding ASCII characters (e.g. in methods like upper and lower)

Bytes are not text!

Text processing with byte strings is impractical unless you are guaranteed that everything is pure ASCII

Devices (terminal, files, network) give you bytes (unless the API decodes them for you)

You might think you're safe if you know that every encoded string uses the same encoding. Some things will work (e.g. equality tests), but other things will not.
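
A sketch of what does and does not work on consistently encoded bytes. It uses Python 3 bytes objects, which behave like Python 2 str here; the sample name is an assumption:

```python
text = 'andr\u00e9'            # 'andré' as text
data = text.encode('utf-8')    # the same name as UTF-8 bytes

# Works: equality between byte strings in the same encoding.
same = (data == 'andr\u00e9'.encode('utf-8'))

# Does not work: case conversion on bytes only touches ASCII letters.
upper_bytes = data.upper()
upper_text = text.upper()

# Does not work: slicing can cut a multi-byte character in half.
try:
    data[:5].decode('utf-8')
    slice_ok = True
except UnicodeDecodeError:
    slice_ok = False

print(same, upper_bytes, ascii(upper_text), slice_ok)
```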

Bytes are not text! (Example 1)

>>> name = raw_input('Name: ')
Name: André
>>> print "That has {} letters".format(len(name))
That has 6 letters

Some characters take up multiple bytes in certain encodings
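
To make the byte counts concrete, a small Python 3 sketch. The set of encodings is an assumption for illustration:

```python
name = 'Andr\u00e9'   # 'André': five code points

# Number of bytes the same five characters occupy in each encoding.
sizes = {enc: len(name.encode(enc))
         for enc in ('iso8859-1', 'utf-8', 'utf-16-le', 'utf-32-le')}

print(len(name), sizes)
```

The 6 reported in the example above is the UTF-8 byte count, not the number of letters.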

Bytes are not text! (Example 2)

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> raw_input('Name: ')
Name: André
'Andr\xc3\xa9'

>>> import sys
>>> sys.stdin.encoding
'ISO8859-1'
>>> raw_input('Name: ')
Name: André
'Andr\xe9'
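
A sketch, in Python 3 syntax, of what happens when those two byte strings are decoded with the right and the wrong terminal encoding. The variable names are illustrative only:

```python
utf8_input = b'Andr\xc3\xa9'    # bytes from the UTF-8 terminal
latin1_input = b'Andr\xe9'      # bytes from the ISO8859-1 terminal

# Decoding each with its own terminal encoding gives the same text.
from_utf8 = utf8_input.decode('utf-8')
from_latin1 = latin1_input.decode('iso8859-1')

# Guessing wrong either garbles the text silently or fails outright.
garbled = utf8_input.decode('iso8859-1')
try:
    latin1_input.decode('utf-8')
    wrong_guess_ok = True
except UnicodeDecodeError:
    wrong_guess_ok = False

print(from_utf8 == from_latin1, ascii(garbled), wrong_guess_ok)
```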

Unicode strings in Python 2

unicode type - sequence of code points

Use 'u' prefix for unicode literals

unicode.encode(encoding) → str

str.decode(encoding) → unicode

Python 2 Unicode encoding

>>> u'hello'.encode('utf-8')
'hello'
>>> u'André'.encode('utf-8')
'Andr\xc3\xa9'
>>> u'André'.encode('ISO8859-1')
'Andr\xe9'
>>> u'André'.encode('ascii')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode
character u'\xe9' in position 4:
ordinal not in range(128)

Python 2 Unicode decoding

>>> 'hello'.decode('utf-8')
u'hello'
>>> 'Andr\xc3\xa9'.decode('utf-8')
u'Andr\xe9'
>>> 'Andr\xe9'.decode('ISO8859-1')
u'Andr\xe9'
>>> 'Andr\x80'.decode('ascii')
Traceback (most recent call last):
...
UnicodeDecodeError: 'ascii' codec can't decode
byte 0x80 in position 4:
ordinal not in range(128)

Unicode confusion in Python 2

>>> u'hello'.decode('utf-8')
u'hello'
>>> u'André'.decode('utf-8')
Traceback (most recent call last):
...
UnicodeEncodeError: 'ascii' codec can't encode
character u'\xe9' in position 4:
ordinal not in range(128)
...

?????

Why does a unicode object even have a decode method??
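
One reading of that traceback: decode needs bytes, so Python 2 first converts the unicode object to str with the default 'ascii' codec, and that hidden encode step is what fails. A sketch in Python 3 syntax that spells out both steps explicitly (an emulation, not Python 2 itself):

```python
text = 'Andr\u00e9'

try:
    as_bytes = text.encode('ascii')    # the step Python 2 performs silently
    result = as_bytes.decode('utf-8')  # the step that was actually requested
    error = None
except UnicodeEncodeError as exc:
    result = None
    error = exc

# Pure-ASCII text survives both steps, which is why u'hello' appeared to work.
ascii_result = 'hello'.encode('ascii').decode('utf-8')

print(type(error).__name__, ascii_result)
```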

Automatic string coercion in Python 2

Python 2 will automatically convert between str and unicode if necessary.

>>> import sys
>>> sys.getdefaultencoding()
'ascii'

Mixing str and unicode freely in code will seem to work fine... until you start to receive non-ASCII characters.

Python is normally a strongly typed language, which makes this implicit conversion policy extra weird.
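
A rough emulation of that coercion in Python 3 syntax. The helper py2_style_concat and the constant DEFAULT_ENCODING are hypothetical names for this sketch; the assumption is that Python 2 decodes the str operand with the default encoding when it is combined with unicode:

```python
DEFAULT_ENCODING = 'ascii'  # what sys.getdefaultencoding() reports in Python 2

def py2_style_concat(text, data):
    """Hypothetical helper: join text and bytes the way Python 2 joins unicode and str."""
    return text + data.decode(DEFAULT_ENCODING)

# Pure-ASCII bytes: the implicit decode succeeds and nobody notices it.
works = py2_style_concat('Name: ', b'Andre')

# Non-ASCII bytes: the same line of code now raises.
try:
    py2_style_concat('Name: ', b'Andr\xc3\xa9')
    failed = False
except UnicodeDecodeError:
    failed = True

print(works, failed)
```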

Python 3

Python 3 makes it all better

str: text (sequence of Unicode code points)

bytes: bytes!

String literals are Unicode text (str) by default; use the 'b' prefix for bytes literals

encode/decode in Python 3

str.encode(encoding) → bytes

bytes.decode(encoding) → str


str.decode doesn't exist!

bytes.encode doesn't exist!
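
A short Python 3 sketch of the round trip and of the two missing methods (the sample text is an assumption):

```python
text = 'Andr\u00e9'

data = text.encode('utf-8')          # str -> bytes
round_trip = data.decode('utf-8')    # bytes -> str

# Each type only has the method that makes sense for it.
has_str_decode = hasattr(str, 'decode')
has_bytes_encode = hasattr(bytes, 'encode')

print(data, round_trip == text, has_str_decode, has_bytes_encode)
```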

Presenter Notes

In Python 3, the delineation between bytes and text is very clear.

Automatic string coercion in Python 3?

No!

Trying to combine bytes and str (e.g. with +) raises a TypeError
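
A minimal Python 3 sketch of that behaviour, with sample values chosen for illustration:

```python
text = 'Andr\u00e9'
data = b' Smith'

# Concatenating text and bytes is rejected instead of silently coerced.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Comparison does not raise, but text and bytes never compare equal.
equal = ('abc' == b'abc')

# The fix is an explicit decode with a known encoding.
joined = text + data.decode('ascii')

print(mixed_ok, equal, ascii(joined))
```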

Best practices for Unicode in Python

  • Always know what you have (text or bytes)
    • Don't mix them freely (Python 2 will let you, Python 3 won't)
  • Decode bytes to Unicode at the 'boundaries' (files, network, etc.), use Unicode everywhere internally, and encode again on the way out
  • Test with non-ASCII characters (preferably code point 256 and above)
    • ℝεα∂@ßʟ℮ ☂ℯṧт υηḯ¢☺ḓ℮
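
The 'boundaries' advice can be sketched like this in Python 3. The helpers read_text and write_text are hypothetical, io.BytesIO stands in for a real file or socket, and UTF-8 is an assumed encoding:

```python
import io

def read_text(stream, encoding='utf-8'):
    """Hypothetical helper: decode bytes as they enter the program."""
    return stream.read().decode(encoding)

def write_text(stream, text, encoding='utf-8'):
    """Hypothetical helper: encode text as it leaves the program."""
    stream.write(text.encode(encoding))

incoming = io.BytesIO(b'Andr\xc3\xa9')   # stands in for a file or socket
outgoing = io.BytesIO()

name = read_text(incoming)               # text from here on
greeting = 'Hello, ' + name.upper()      # internal work happens on code points
write_text(outgoing, greeting)           # bytes again at the exit

print(ascii(greeting), outgoing.getvalue())
```

Keeping the decode and encode calls at the edges means the rest of the code only ever sees one kind of string.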

Give examples of non-ascii readable text

The End

See also:
