Unicode¶

Read my free ebook: Programming with Unicode!
Encodings¶
- Windows
  - OEM code page: used by stdin, stdout and stderr in the Windows console
  - ANSI code page: used by all other Windows “ANSI” functions. Some examples: filenames, command line arguments, environment variables, etc.
- UNIX: “Locale encoding”
  - LC_CTYPE locale
  - used for filenames, command line arguments, environment variables, the console (stdin, stdout, stderr)
- Common encodings:
  - UTF-8
  - ISO 8859-1 aka Latin1, and the closely related Windows code page 1252 (the two differ only in the 0x80-0x9F range)
  - ASCII
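As a rough illustration of why the choice among these common encodings matters, the sketch below decodes the same bytes with each of them. The helper name `decode_with` is made up for this example; only the standard codecs shipped with Python are assumed.

```python
def decode_with(raw, encodings=("utf-8", "latin-1", "cp1252", "ascii")):
    """Decode the same bytes with several encodings.

    Returns a dict mapping each encoding name to the decoded text,
    or to None when the bytes are not valid in that encoding.
    """
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None
    return results


if __name__ == "__main__":
    # "é" encoded as UTF-8 is two bytes; single-byte encodings see two characters.
    print(decode_with("é".encode("utf-8")))
    # 0x80 is a C1 control character in Latin1 but the euro sign in cp1252.
    print(decode_with(b"\x80"))
```

Decoding with the wrong single-byte encoding does not fail, it silently produces mojibake, which is why the encoding has to be known rather than guessed.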
Python¶
See my talk (in French) “Comprendre les erreurs Unicode” (PyCon FR 2009 in Paris): slides (PDF) and video.
Narrow and wide builds, PEP 393¶
Python 3.3 introduced the Flexible String Representation (PEP 393) and supports the whole Unicode range (U+0000-U+10ffff) on all platforms.
Older Python versions had a “narrow or wide” compilation option:
- UNIX and Mac OS X use wide mode: the unicode type uses 32-bit code units. In Unicode, it is called the UCS-4 encoding.
- Windows uses narrow mode: the unicode type uses 16-bit code units, and non-BMP characters (Unicode range U+10000-U+10ffff) are stored as a surrogate pair (two 16-bit code units). In Unicode, it is called the UTF-16 encoding. This mode is preferred on Windows because the Windows kernel also uses the UTF-16 encoding internally.
Use sys.maxunicode == 0xffff to check if Python is compiled in narrow mode. Otherwise, sys.maxunicode is equal to 0x10ffff.
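The check above, and the surrogate pairs used by narrow builds, can be sketched as follows. The helper names `is_narrow_build` and `utf16_code_units` are invented for this example. On Python 3.3 and newer the first one is always expected to return False, so the second helper only simulates what a narrow build stored by encoding to UTF-16.

```python
import struct
import sys


def is_narrow_build(maxunicode=sys.maxunicode):
    """Return True if the interpreter is limited to 16-bit code units."""
    return maxunicode == 0xFFFF


def utf16_code_units(char):
    """Return the 16-bit code units used to store `char` in UTF-16."""
    raw = char.encode("utf-16-le")
    return struct.unpack("<%dH" % (len(raw) // 2), raw)


if __name__ == "__main__":
    print("narrow build:", is_narrow_build())
    # A BMP character needs one code unit, a non-BMP character needs two.
    print([hex(unit) for unit in utf16_code_units("A")])
    print([hex(unit) for unit in utf16_code_units("\U00010000")])
```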
Python 2¶
- str type and "abc" are strings of bytes, unicode type is a string of characters
- “Default encoding”
  - sys.getdefaultencoding()
  - used by unicode.encode() and str.decode() when no encoding is specified
  - ASCII by default, must not be modified (sys.setdefaultencoding())
- File system encoding
  - sys.getfilesystemencoding()
  - used to encode filenames and environment variables
  - used on UNIX by os.listdir(unicode) to decode filenames
  - ANSI code page (mbcs) on Windows, utf-8 on Mac OS X, the locale encoding on UNIX
- Locale encoding
  - locale.getpreferredencoding()
  - used by default by io.TextIOWrapper
  - ANSI code page on Windows, LC_CTYPE locale on UNIX
- OEM code page (Windows only)
  - sys.stdin.encoding, sys.stdout.encoding and sys.stderr.encoding
Python 3¶
- bytes type is a string of bytes, str type and "abc" are strings of characters
- UTF-8
  - used for the default encoding of the source code
- “Locale encoding”
  - locale.getpreferredencoding()
  - ANSI code page on Windows, LC_CTYPE locale on UNIX
  - used by sys.stdin, sys.stdout, sys.stderr, and by default by open() (and io.TextIOWrapper)
- “File system encoding”
  - sys.getfilesystemencoding()
  - ANSI code page (mbcs) on Windows, utf-8 on Mac OS X, the locale encoding on UNIX
  - used for filenames, command line arguments, environment variables
- “Default encoding”
  - sys.getdefaultencoding(), hardcoded to utf-8
  - used by bytes.decode() and str.encode() when no encoding is specified
- OEM code page (Windows only)
  - sys.stdin.encoding, sys.stdout.encoding and sys.stderr.encoding
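A small sketch to inspect these Python 3 encodings on a given machine. The function name `python3_encodings` is invented for this example. The values depend on the platform, the locale and the Python version, so no particular output is implied apart from the default encoding.

```python
import locale
import sys


def python3_encodings():
    """Collect the encodings described above for the running interpreter."""
    return {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "locale": locale.getpreferredencoding(False),
        # A standard stream can be None (for example without a console).
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
        "stderr": getattr(sys.stderr, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in python3_encodings().items():
        print("%10s: %s" % (name, value))
```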
Issues¶
- GB2312 codec uses a wrong conversion table: WONTFIX. It’s a bug, but one which is present in a lot of other systems as well, so we’d potentially make it impossible to write GB2312 data which is supposed to be read back by these other systems.
Test non-ASCII characters with locales¶
It seems like FreeBSD 11 doesn’t support all encodings: only Latin1 and UTF-8 seem to be implemented. At least, KOI8-R, Big5 and CP1131 are not implemented properly.
Windows locales: “fr-FR”, “en-US”, “ja-JP”, etc.
On Windows, before setlocale(LC_CTYPE, "") is called, LC_CTYPE uses the Latin1 encoding in practice (see Python issue #29571). Call setlocale(LC_CTYPE, "") to use the ANSI code page.
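From Python, the same call can be sketched with the locale module. The function name `switch_to_user_ctype` is invented for this example. Depending on the platform and the Python version, the interpreter may already have set LC_CTYPE at startup, so the "before" value is not guaranteed to be the "C" locale.

```python
import locale


def switch_to_user_ctype():
    """Set LC_CTYPE from the user's environment.

    Returns (locale name before, locale name after, preferred encoding).
    """
    # Passing None only queries the current setting, it changes nothing.
    before = locale.setlocale(locale.LC_CTYPE, None)
    try:
        after = locale.setlocale(locale.LC_CTYPE, "")
    except locale.Error:
        # The environment names a locale that is not installed.
        after = before
    return before, after, locale.getpreferredencoding(False)


if __name__ == "__main__":
    print(switch_to_user_ctype())
```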
My tools:
- test_all_locales.py: test Python implementation of locales. Only supports a few operating systems.
- all_locales.py: script to list all working locales, can be different than “locale -a”
- c_locale.c: basic info on the “C” locale
Use cases¶
- Latin1 or UTF-8 encoding (locale different than C and POSIX)
- C or POSIX locale: ASCII encoding on Linux, Latin1 encoding on FreeBSD/Solaris (but ASCII announced by nl_langinfo(CODESET))
- LC_NUMERIC != LC_CTYPE: the fun localeconv() bug, https://bugs.python.org/issue31900
- Python 3.7 -X utf8 (UTF-8 Mode)
- macOS and Android UTF-8
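To tell which of these use cases applies to a running interpreter, a sketch along the following lines may help. The function names are invented for this example. nl_langinfo() is not available on every platform (notably Windows), hence the fallback, and it only reports the announced encoding, which is not always the effective one.

```python
import locale
import sys


def announced_encoding():
    """Return the encoding announced by the current LC_CTYPE locale."""
    nl_langinfo = getattr(locale, "nl_langinfo", None)
    codeset = getattr(locale, "CODESET", None)
    if nl_langinfo is not None and codeset is not None:
        return nl_langinfo(codeset)
    # Fallback for platforms without nl_langinfo(CODESET).
    return locale.getpreferredencoding(False)


def utf8_mode_enabled():
    """Return True if Python runs in UTF-8 Mode (-X utf8, Python 3.7+)."""
    return bool(getattr(sys.flags, "utf8_mode", 0))


if __name__ == "__main__":
    print("announced encoding:", announced_encoding())
    print("UTF-8 Mode:", utf8_mode_enabled())
```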
Locale, announced encoding, effective encoding¶
Inconsistent:
Operating system | Locale | Announced encoding | Effective encoding |
---|---|---|---|
FreeBSD | C, POSIX | US-ASCII | ISO-8859-1 |
FreeBSD | zh_TW.Big5 | Big5 | ? (not Big5) |
macOS | C, POSIX | US-ASCII | ISO-8859-1 |
macOS | zh_TW.Big5 | Big5 | ? (not Big5) |
Consistent, announced encoding = effective encoding:
Operating system | Locale | Encoding |
---|---|---|
Fedora 27 | C, POSIX | ASCII |
FreeBSD | fr_FR.UTF-8 | UTF-8 |
macOS | fr_FR.UTF-8 | UTF-8 |
Fedora 27 | fr_FR.UTF-8 | UTF-8 |
Fedora 27 | zh_TW.Big5 | Big5 |
Tested operating systems:
- macOS 10.13.2
- FreeBSD 11.1
- Fedora 27 (glibc 2.26)
localeconv()¶
Fedora 27:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
es_MX.utf8 | UTF-8 | thousands_sep | 0xE2 0x80 0x89 | U+2009 |
fr_FR.UTF-8 | UTF-8 | currency_symbol | 0xE2 0x82 0xAC | U+20AC (€) |
ps_AF.utf8 | UTF-8 | thousands_sep | 0xD9 0xAC | U+066C (٬) |
uk_UA.koi8u | KOI8-U | currency_symbol | 0xC7 0xD2 0xCE 0x2E | U+0433 U+0440 U+043d U+002E (грн.) |
uk_UA.koi8u | KOI8-U | thousands_sep | 0x9A | U+00A0 |
macOS 10.13.2:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
ru_RU.ISO8859-5 | ISO8859-5 | currency_symbol | b'\xe0\xe3\xd1.' | U+0440 U+0443 U+0431 U+002e (руб.) |
FreeBSD 11:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
ar_SA.UTF-8 | UTF-8 | decimal_point | b'\xd9\xab' | U+066b ('٫') |
ar_SA.UTF-8 | UTF-8 | thousands_sep | b'\xd9\xac' | U+066c ('٬') |
ar_SA.UTF-8 | UTF-8 | currency_symbol | b'\xd8\xb1.\xd8\xb3.\xe2\x80\x8f' | U+0631 U+002e U+0633 U+002e U+200f ('ر.س.\u200f') |
zh_TW.Big5 | Big5 | currency_symbol | b'\xa2\xdc\xa2\xe2\xa2\x43' | u'\uff2e\uff34\uff04' (NT$) |
zh_TW.Big5 | Big5 | decimal_point | b'\xa1\x44' | u'\uff0e' (.) |
zh_TW.Big5 | Big5 | thousands_sep | b'\xa1\x41' | u'\uff0c' (,) |
Note: On FreeBSD with LC_CTYPE="zh_TW.Big5", mbstowcs() doesn’t use Big5 but a different encoding and so returns mojibake.
Windows 7.1:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
fr-FR | cp1252 | currency_symbol | b'\x80' | U+20AC |
fr-FR | cp1252 | thousands_sep | b'\xA0' | U+00A0 |
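Several byte/text pairs from the localeconv() tables above can be checked with Python's own codecs, independently of the C library of each operating system. This is only a sketch: the `SAMPLES` list copies a few rows from the tables, and the codec names are the ones assumed to correspond to each locale encoding.

```python
# (Python codec name, bytes returned by localeconv(), expected text)
SAMPLES = [
    ("utf-8", b"\xe2\x80\x89", "\u2009"),          # es_MX.utf8 thousands_sep
    ("utf-8", b"\xe2\x82\xac", "\u20ac"),          # fr_FR.UTF-8 currency_symbol
    ("utf-8", b"\xd9\xac", "\u066c"),              # ps_AF.utf8 thousands_sep
    ("koi8_u", b"\xc7\xd2\xce\x2e", "\u0433\u0440\u043d."),  # uk_UA.koi8u
    ("koi8_u", b"\x9a", "\u00a0"),                 # uk_UA.koi8u thousands_sep
    ("iso8859_5", b"\xe0\xe3\xd1.", "\u0440\u0443\u0431."),  # ru_RU.ISO8859-5
    ("cp1252", b"\x80", "\u20ac"),                 # fr-FR on Windows
]


def decode_samples(samples=SAMPLES):
    """Decode each sample and report whether it matches the expected text."""
    return [(codec, raw, raw.decode(codec) == expected)
            for codec, raw, expected in samples]


if __name__ == "__main__":
    for codec, raw, ok in decode_samples():
        print(codec, raw, "OK" if ok else "MISMATCH")
```

The point of the tables is that these fields must be decoded from the encoding of the locale that produced them; decoding them all as UTF-8 would fail for the KOI8-U, ISO 8859-5 and cp1252 rows.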
strftime(), tzname¶
Fedora 27:
LC_ALL locale | Encoding | Month %b | Bytes | Text |
---|---|---|---|---|
fr_FR | Latin1 | December | b'd\xe9c.' | 'd\xe9c.' (déc.) |
Windows 8.1:
LC_ALL locale | Encoding | Date, format | Bytes | Text |
---|---|---|---|---|
fr-FR | cp1252 | December, %b | b'd\xe9c.' | 'd\xe9c.' (déc.) |
ja-JP | cp932? | Monday, %a | N/A | '\u6708' |
Python 2:
vstinner@apu$ python2
>>> import time, locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR")
'fr_FR'
>>> time.strftime("%A, %d %B %Y", time.localtime(time.mktime((2018, 2, 1, 12, 0, 0, 0, 0, 0))))
'jeudi, 01 f\xe9vrier 2018'
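In the Python 2 session above, strftime() returns a byte string. A sketch of how such bytes become text, written in Python 3 syntax: the helper name `decode_locale_bytes` is invented for this example, and the example call assumes that the fr_FR locale uses ISO 8859-1, as in the Fedora 27 table above.

```python
import locale


def decode_locale_bytes(raw, encoding=None):
    """Decode bytes produced by a C locale function such as strftime().

    When no encoding is given, fall back to the current locale encoding.
    """
    if encoding is None:
        encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding)


if __name__ == "__main__":
    # Bytes copied from the Python 2 session; fr_FR is assumed to be Latin1.
    print(decode_locale_bytes(b"jeudi, 01 f\xe9vrier 2018", "iso-8859-1"))
```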
- non-ASCII tzname on Windows: “‘東京 (標準時)’ means ‘Tokyo (Standard Time)’ in Japanese.”
- https://bugs.python.org/issue5905
- https://bugs.python.org/issue13560
- https://bugs.python.org/issue16322
- Commit af02e1c8: Add PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale() “Fix time.strftime() (if wcsftime() is missing): decode strftime() result from the current locale encoding, not from the filesystem encoding.”
- Commit 720f34a3: Issue #5905: “time.strftime() is now using the locale encoding, instead of UTF-8, if the wcsftime() function is not available.”
strerror()¶
LC_ALL locale | Encoding | Bytes | Text |
---|---|---|---|
fr_FR.ISO8859-1 | ISO-8859-1 | b'Fichier ou r\xe9pertoire inexistant' | 'Fichier ou r\xe9pertoire inexistant' |
Links:
- non-ASCII strerror: “os.strerror(23) = ‘Trop de fichiers ouverts dans le syst\xe8me’.”
- https://bugs.python.org/issue13560
- Commit 1f33f2b0: “Issue #13560: os.strerror() now uses the current locale encoding instead of UTF-8”
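The kind of failure behind the os.strerror() change can be reproduced with the bytes from the table above: they are valid ISO 8859-1 but not valid UTF-8. This is a simplified sketch, not the actual CPython code, and the helper name `try_decode` is invented for this example.

```python
import errno
import os

# strerror() result for a missing file in the fr_FR.ISO8859-1 locale (from the table).
RAW = b"Fichier ou r\xe9pertoire inexistant"


def try_decode(raw, encoding):
    """Return the decoded text, or None if the bytes are invalid in `encoding`."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError:
        return None


if __name__ == "__main__":
    print("as UTF-8:     ", try_decode(RAW, "utf-8"))
    print("as ISO 8859-1:", try_decode(RAW, "iso-8859-1"))
    # In Python 3, os.strerror() already returns decoded text (str).
    print(os.strerror(errno.ENOENT))
```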
Political and regional differences¶
Unicode provides a single standard and so cannot have special cases depending on country or recent political changes. Examples: