Unicode


Read my free ebook: Programming with Unicode!

Encodings

  • Windows
    • OEM code page: used by stdin, stdout and stderr in the Windows console
    • ANSI code page: used by the other Windows “ANSI” functions: filenames, command line arguments, environment variables, etc.
  • UNIX: “Locale encoding”
    • LC_CTYPE locale
    • used for filenames, command line arguments, environment variables, the console (stdin, stdout, stderr)
  • Common encodings:
    • UTF-8
    • ISO 8859-1 aka Latin1 (close to, but not identical to, Windows code page 1252, which is a superset)
    • ASCII
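On Windows, the two code pages can be queried directly from the Win32 API. A minimal sketch (Windows only) using ctypes; the numbers depend on the system configuration:

import ctypes

kernel32 = ctypes.windll.kernel32
print("ANSI code page:", kernel32.GetACP())    # e.g. 1252 on a Western European system
print("OEM code page:", kernel32.GetOEMCP())   # e.g. 850, used by the console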

Python

See my talk (in French) “Comprendre les erreurs Unicode” (“Understanding Unicode errors”, PyCon FR 2009 in Paris): slides (PDF) and video.

Narrow and wide builds, PEP 393

Python 3.3 introduced the Flexible String Representation (PEP 393) and supports the whole Unicode range (U+0000 - U+10ffff) on all platforms.

Older Python versions had a “narrow or wide” compilation option:

  • UNIX and Mac OS X use the wide mode: the unicode type uses 32-bit code units. In Unicode, this is called the UCS-4 encoding.
  • Windows uses the narrow mode: the unicode type uses 16-bit code units, and non-BMP characters (Unicode range U+10000 - U+10ffff) are encoded as surrogate pairs (two 16-bit code units). In Unicode, this is called the UTF-16 encoding. This mode is preferred on Windows because the Windows kernel also uses UTF-16 internally.

Use sys.maxunicode == 0xffff to check if Python is compiled in narrow mode. Otherwise, sys.maxunicode is equal to 0x10ffff.
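For example, a quick way to check the build mode (output depends on the Python version and compilation options):

import sys

if sys.maxunicode == 0xffff:
    print("narrow build: 16-bit code units, non-BMP characters use surrogate pairs")
else:
    # sys.maxunicode == 0x10ffff: wide build, or any Python >= 3.3 (PEP 393)
    print("full Unicode range supported")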

Python 2

  • str type and "abc" are strings of bytes, unicode type is a string of characters
  • “Default encoding”
    • sys.getdefaultencoding()
    • used by unicode.encode() and str.decode() when no encoding is specified
    • ASCII by default; must not be modified (sys.setdefaultencoding())
  • File system encoding
    • sys.getfilesystemencoding()
    • used to encode filenames and environment variables
    • used on UNIX by os.listdir(unicode) to decode filenames
    • ANSI code page (mbcs) on Windows, utf-8 on Mac OS X, the locale encoding on UNIX
  • Locale encoding
    • locale.getpreferredencoding()
    • used by default by io.TextIOWrapper
    • ANSI code page on Windows, LC_CTYPE locale on UNIX
  • OEM code page (Windows only)
    • sys.stdin.encoding, sys.stdout.encoding and sys.stderr.encoding
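A small Python 2 sketch showing the str/unicode split and the encodings listed above (the values depend on the platform and locale):

# Python 2
import sys, locale

text = u"caf\xe9"                     # unicode: string of characters
data = text.encode("utf-8")           # str: string of bytes, 'caf\xc3\xa9'
print sys.getdefaultencoding()        # 'ascii': used by implicit str <-> unicode conversions
print sys.getfilesystemencoding()     # e.g. 'mbcs' on Windows, 'utf-8' on Mac OS X
print locale.getpreferredencoding()   # locale encoding, used by io.TextIOWrapper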

Python 3

  • bytes type is a string of bytes, str type and "abc" are strings of characters
  • UTF-8
    • used for the default encoding of the source code
  • “Locale encoding”
    • locale.getpreferredencoding()
    • ANSI code page on Windows, LC_CTYPE locale on UNIX
    • used by sys.stdin, sys.stdout, sys.stderr, and by default by open() (and io.TextIOWrapper)
  • “File system encoding”
    • sys.getfilesystemencoding()
    • ANSI code page (mbcs) on Windows, utf-8 on Mac OS X, the locale encoding on UNIX
    • used for filenames, command line arguments, environment variables
  • “Default encoding”
    • sys.getdefaultencoding(), hardcoded to utf-8
    • used by bytes.decode() and str.encode() when no encoding is specified
  • OEM code page (Windows only)
    • sys.stdin.encoding, sys.stdout.encoding and sys.stderr.encoding
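The Python 3 counterpart, showing where each encoding is used (the values depend on the platform and locale):

# Python 3
import sys, locale

text = "caf\xe9"                       # str: string of characters
data = text.encode()                   # bytes: b'caf\xc3\xa9', UTF-8 by default
print(sys.getdefaultencoding())        # always 'utf-8'
print(sys.getfilesystemencoding())     # filenames, command line arguments, environment variables
print(locale.getpreferredencoding())   # used by open() and io.TextIOWrapper by default
print(sys.stdout.encoding)             # console encoding (OEM code page on Windows)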

Issues

  • GB2312 codec uses a wrong conversion table: closed as WONTFIX. “It’s a bug, but one which is present in a lot of other systems as well, so we’d potentially make it impossible to write GB2312 data which is supposed to be read back by these other systems.”

Test non-ASCII characters with locales

FreeBSD 11 does not seem to support all encodings: only Latin1 and UTF-8 appear to be implemented correctly; at least KOI8-R, Big5 and CP1131 are not implemented properly.

Windows locales: “fr-FR”, “en-US”, “ja-JP”, etc.

On Windows, before setlocale(LC_CTYPE, "") is called, LC_CTYPE uses the Latin1 encoding in practice (see Python issue #29571). Call setlocale(LC_CTYPE, "") to use the ANSI code page.
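The effect can be observed from Python as well (sketch; the exact locale string depends on the Windows configuration):

import locale

print(locale.setlocale(locale.LC_CTYPE, None))   # startup LC_CTYPE locale, usually 'C'
locale.setlocale(locale.LC_CTYPE, "")            # switch to the user's locale (ANSI code page on Windows)
print(locale.setlocale(locale.LC_CTYPE, None))   # e.g. 'French_France.1252'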

My tools:

  • test_all_locales.py: test the Python implementation of locales. Only supports a few operating systems.
  • all_locales.py: script to list all working locales; the result can differ from “locale -a”
  • c_locale.c: basic info on the “C” locale

Use cases

  • Latin1 or UTF-8 encoding (locale other than C and POSIX)
  • C or POSIX locale: ASCII encoding on Linux, Latin1 encoding on FreeBSD/Solaris (but ASCII announced by nl_langinfo(CODESET))
  • LC_NUMERIC != LC_CTYPE: the fun localeconv() bug, https://bugs.python.org/issue31900
  • Python 3.7 -X utf8 (UTF-8 Mode, PEP 540); see the sketch after this list
  • macOS and Android: UTF-8 is always used
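A minimal sketch to check whether the UTF-8 Mode mentioned above is enabled (Python 3.7 and newer):

import sys, locale

print(sys.flags.utf8_mode)             # 1 when started with 'python3.7 -X utf8' (or PYTHONUTF8=1)
print(locale.getpreferredencoding())   # 'UTF-8' when the UTF-8 Mode is enabled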

Locale, announced encoding, effective encoding

Inconsistent:

Operating system   Locale       Announced encoding   Effective encoding
FreeBSD            C, POSIX     US-ASCII             ISO-8859-1
FreeBSD            zh_TW.Big5   Big5                 ? (not Big5)
macOS              C, POSIX     US-ASCII             ISO-8859-1
macOS              zh_TW.Big5   Big5                 ? (not Big5)

Consistent, announced encoding = effective encoding:

Operating system   Locale        Encoding
Fedora 27          C, POSIX      ASCII
FreeBSD            fr_FR.UTF-8   UTF-8
macOS              fr_FR.UTF-8   UTF-8
Fedora 27          fr_FR.UTF-8   UTF-8
Fedora 27          zh_TW.Big5    Big5

Tested operating systems:

  • macOS 10.13.2
  • FreeBSD 11.1
  • Fedora 27 (glibc 2.26)

localeconv()

Fedora 27:

LC_ALL locale   Encoding   Field             Bytes                 Text
es_MX.utf8      UTF-8      thousands_sep     0xE2 0x80 0x89        U+2009
fr_FR.UTF-8     UTF-8      currency_symbol   0xE2 0x82 0xAC        U+20AC (€)
ps_AF.utf8      UTF-8      thousands_sep     0xD9 0xAC             U+066C (٬)
uk_UA.koi8u     KOI8-U     currency_symbol   0xC7 0xD2 0xCE 0x2E   U+0433 U+0440 U+043D U+002E (грн.)
uk_UA.koi8u     KOI8-U     thousands_sep     0x9A                  U+00A0

macOS 10.13.2:

LC_ALL locale     Encoding    Field             Bytes              Text
ru_RU.ISO8859-5   ISO8859-5   currency_symbol   b'\xe0\xe3\xd1.'   U+0440 U+0443 U+0431 U+002E (руб.)

FreeBSD 11:

LC_ALL locale   Encoding   Field             Bytes                               Text
ar_SA.UTF-8     UTF-8      decimal_point     b'\xd9\xab'                         U+066B (٫)
ar_SA.UTF-8     UTF-8      thousands_sep     b'\xd9\xac'                         U+066C (٬)
ar_SA.UTF-8     UTF-8      currency_symbol   b'\xd8\xb1.\xd8\xb3.\xe2\x80\x8f'   U+0631 U+002E U+0633 U+002E U+200F ('ر.س.\u200f')
zh_TW.Big5      Big5       currency_symbol   b'\xa2\xdc\xa2\xe2\xa2\x43'         U+FF2E U+FF34 U+FF04 (NT$)
zh_TW.Big5      Big5       decimal_point     b'\xa1\x44'                         U+FF0E (.)
zh_TW.Big5      Big5       thousands_sep     b'\xa1\x41'                         U+FF0C (,)

Note: On FreeBSD with LC_CTYPE="zh_TW.Big5", mbstowcs() doesn’t use Big5 but a different encoding and so returns mojibake.

Windows 7.1:

LC_ALL locale   Encoding   Field             Bytes     Text
fr-FR           cp1252     currency_symbol   b'\x80'   U+20AC (€)
fr-FR           cp1252     thousands_sep     b'\xA0'   U+00A0
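These values can be reproduced with a short script (sketch; assumes the requested locale, here uk_UA.koi8u, is installed):

import locale

locale.setlocale(locale.LC_ALL, "uk_UA.koi8u")   # assumes this locale is installed
conv = locale.localeconv()
print(repr(conv["currency_symbol"]))             # decoded from the locale encoding in Python 3
print(repr(conv["thousands_sep"]))
print(repr(conv["decimal_point"]))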

strftime(), tzname

Fedora 27:

LC_ALL locale   Encoding   Month, %b   Bytes        Text
fr_FR           Latin1     December    b'd\xe9c.'   'd\xe9c.' (déc.)

Windows 8.1:

LC_ALL locale   Encoding   Date, format   Bytes        Text
fr-FR           cp1252     December, %b   b'd\xe9c.'   'd\xe9c.' (déc.)
ja-JP           cp932?     Monday, %a     N/A          '\u6708' (月)

Python 2:

vstinner@apu$ python2
>>> import time, locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR")
'fr_FR'
>>> time.strftime("%A, %d %B %Y", time.localtime(time.mktime((2018, 2, 1, 12, 0, 0, 0, 0, 0))))
'jeudi, 01 f\xe9vrier 2018'
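A Python 3 counterpart (sketch; assumes the fr_FR.UTF-8 locale is installed):

import locale, time

locale.setlocale(locale.LC_ALL, "fr_FR.UTF-8")
# strftime() output is decoded from the LC_CTYPE locale encoding
date = time.localtime(time.mktime((2018, 2, 1, 12, 0, 0, 0, 0, 0)))
print(time.strftime("%A, %d %B %Y", date))   # 'jeudi, 01 février 2018'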

strerror()

LC_ALL locale     Encoding     Bytes                                    Text
fr_FR.ISO8859-1   ISO-8859-1   b'Fichier ou r\xe9pertoire inexistant'   'Fichier ou r\xe9pertoire inexistant' (“No such file or directory” in French)
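The same message can be obtained from Python (sketch; assumes the fr_FR.ISO8859-1 locale is installed):

import errno, locale, os

locale.setlocale(locale.LC_ALL, "fr_FR.ISO8859-1")
# os.strerror() output is decoded from the current locale encoding in Python 3
print(os.strerror(errno.ENOENT))   # 'Fichier ou répertoire inexistant'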


Political and regional differences

Unicode provides a single standard and so cannot have special cases depending on country or recent political changes. Examples: