Unicode¶
Read my free ebook: Programming with Unicode!
Encodings¶
Windows
OEM code page: used by stdin, stdout and stderr in the Windows console
ANSI code page: used by all other Windows “ANSI” functions. Some examples: filenames, command line arguments, environment variables, etc.
UNIX: “Locale encoding”
LC_CTYPE locale
used for filenames, command line arguments, environment variables, the console (stdin, stdout, stderr)
Common encodings:
UTF-8
ISO 8859-1 aka Latin1, similar to Windows code page 1252 (which differs only in the 0x80-0x9F byte range)
ASCII
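As a minimal illustration of how these encodings differ, the sketch below encodes the same text with each codec in Python 3. The sample characters are arbitrary choices made for this example.

```python
# Encode one non-ASCII character with each of the common encodings.
text = "é"  # U+00E9, arbitrary sample character

utf8 = text.encode("utf-8")          # two bytes in UTF-8
latin1 = text.encode("iso-8859-1")   # one byte in Latin1
cp1252 = text.encode("cp1252")       # same byte as Latin1 for this character

# ASCII only covers U+0000-U+007F, so encoding "é" fails.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

# Latin1 and cp1252 disagree in the 0x80-0x9F byte range:
# the euro sign can be encoded to cp1252 but not to Latin1.
euro_cp1252 = "€".encode("cp1252")
try:
    "€".encode("iso-8859-1")
    euro_in_latin1 = True
except UnicodeEncodeError:
    euro_in_latin1 = False

print(utf8, latin1, cp1252, ascii_ok, euro_cp1252, euro_in_latin1)
```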
Python¶
See my talk (in French) “Comprendre les erreurs Unicode” (PyCon FR 2009 in Paris): slides (PDF) and video.
Narrow and wide builds, PEP 393¶
Python 3.3 introduced the Flexible String Representation (PEP 393) and supports the whole Unicode range (U+0000 - U+10ffff) on all platforms.
Older Python versions had a “narrow or wide” compilation option:
UNIX and Mac OS X use wide mode: the unicode type uses 32-bit code units. In Unicode, it is called the UCS-4 encoding.
Windows uses narrow mode: the unicode type uses 16-bit code units; non-BMP characters (Unicode range U+10000 - U+10ffff) are stored as a surrogate pair (two 16-bit code units). In Unicode, it is called the UTF-16 encoding. This mode is preferred on Windows because the Windows kernel also uses the UTF-16 encoding internally.
Use sys.maxunicode == 0xffff to check if Python is compiled in narrow mode. Otherwise, sys.maxunicode is equal to 0x10ffff.
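The check can be sketched as follows. The helper name `is_narrow_build` and the sample character are hypothetical, chosen only for this example; on Python 3.3 and newer the function is expected to return False.

```python
import sys


def is_narrow_build():
    """Return True if Python was compiled in narrow mode (16-bit code units)."""
    return sys.maxunicode == 0xFFFF


char = "\U0001F600"  # arbitrary non-BMP character

# UTF-16 always needs a surrogate pair (two 16-bit code units) for it.
utf16_units = len(char.encode("utf-16-le")) // 2

# len() counts code units of the internal representation:
# 2 on a narrow build, 1 on a wide build or with PEP 393.
length = len(char)

print(is_narrow_build(), utf16_units, length)
```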
Python 2¶
str type and "abc" are strings of bytes, unicode type is a string of characters
“Default encoding”
sys.getdefaultencoding()
used by unicode.encode() and str.decode() when no encoding is specified
ASCII by default, must not be modified (sys.setdefaultencoding())
File system encoding
sys.getfilesystemencoding()
used to encode filenames and environment variables
used on UNIX by os.listdir(unicode) to decode filenames
ANSI code page (mbcs) on Windows, utf-8 on Mac OS X, the locale encoding on UNIX
Locale encoding
locale.getpreferredencoding()
used by default by io.TextIOWrapper
ANSI code page on Windows, LC_CTYPE locale on UNIX
OEM code page (Windows only)
sys.stdin.encoding, sys.stdout.encoding and sys.stderr.encoding
Python 3¶
bytes type is a string of bytes, str type and "abc" are strings of characters
UTF-8: used as the default encoding of the source code
“Locale encoding”
locale.getpreferredencoding()
ANSI code page on Windows, LC_CTYPE locale on UNIX
used by sys.stdin, sys.stdout, sys.stderr, and by default by open() (and io.TextIOWrapper)
“File system encoding”
sys.getfilesystemencoding()
ANSI code page (mbcs) on Windows, utf-8 on Mac OS X, the locale encoding on UNIX
used for filenames, command line arguments, environment variables
“Default encoding”
sys.getdefaultencoding(), hardcoded to utf-8
used by bytes.decode() and str.encode() when no encoding is specified
OEM code page (Windows only)
sys.stdin.encoding, sys.stdout.encoding and sys.stderr.encoding
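To see which value each of these encodings takes on a given machine, a small script along these lines can help. It is only a sketch: the function name `python3_encodings` and the dictionary keys are made up for this example, and the printed values depend on the platform and on the environment.

```python
import locale
import sys


def python3_encodings():
    """Collect the encodings used by Python 3 on the current platform."""
    return {
        # Locale encoding; False means "do not call setlocale()".
        "locale": locale.getpreferredencoding(False),
        # Filenames, command line arguments, environment variables.
        "filesystem": sys.getfilesystemencoding(),
        # bytes.decode() and str.encode() without an explicit encoding.
        "default": sys.getdefaultencoding(),
        # sys.stdout may have been replaced by an object without .encoding.
        "stdout": getattr(sys.stdout, "encoding", None),
    }


encodings = python3_encodings()
for name, value in sorted(encodings.items()):
    print("%s: %s" % (name, value))
```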
Issues¶
GB2312 codec is using a wrong conversion table: WONTFIX. It’s a bug, but one which is present in a lot of other systems as well, so we’d potentially make it impossible to write GB2312 data which is supposed to be read back by these other systems.
Test non-ASCII characters with locales¶
It seems like FreeBSD 11 doesn’t support all encodings: only Latin1 and UTF-8 seem to be implemented. At least, KOI8-R, Big5 and CP1131 are not implemented properly.
Windows locales: “fr-FR”, “en-US”, “ja-JP”, etc.
On Windows, before setlocale(LC_CTYPE, "") is called, LC_CTYPE uses the Latin1 encoding in practice (see Python issue #29571). Call setlocale(LC_CTYPE, "") to use the ANSI code page.
My tools:
test_all_locales.py: test Python implementation of locales. Only supports a few operating systems.
all_locales.py: script to list all working locales; the result can differ from “locale -a”
c_locale.c: basic info on the “C” locale
Use cases¶
Latin1 or UTF-8 encoding (locale different than C and POSIX)
C or POSIX locale: ASCII encoding on Linux, Latin1 encoding on FreeBSD/Solaris (but ASCII announced by nl_langinfo(CODESET))
LC_NUMERIC != LC_CTYPE: the fun localeconv() bug, https://bugs.python.org/issue31900
Python 3.7 with -X utf8 (UTF-8 Mode)
macOS and Android: UTF-8
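The encoding announced by nl_langinfo(CODESET) can be read from Python as sketched below. The helper name `announced_encoding` is hypothetical, and the fallback for platforms without nl_langinfo() (such as Windows) is an assumption made for this example. Note that this only reports the announced encoding: as described above, the effective encoding may differ.

```python
import locale


def announced_encoding():
    """Return the encoding announced for the current LC_CTYPE locale."""
    if hasattr(locale, "nl_langinfo"):
        return locale.nl_langinfo(locale.CODESET)
    # Assumption: without nl_langinfo(), fall back to the preferred encoding.
    return locale.getpreferredencoding(False)


print("LC_CTYPE locale:", locale.setlocale(locale.LC_CTYPE))
print("Announced encoding:", announced_encoding())
```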
Locale, announced encoding, effective encoding¶
Inconsistent:
Operating system | Locale | Announced encoding | Effective encoding |
---|---|---|---|
FreeBSD | C, POSIX | US-ASCII | ISO-8859-1 |
FreeBSD | zh_TW.Big5 | Big5 | ? (not Big5) |
macOS | C, POSIX | US-ASCII | ISO-8859-1 |
macOS | zh_TW.Big5 | Big5 | ? (not Big5) |
Consistent, announced encoding = effective encoding:
Operating system | Locale | Encoding |
---|---|---|
Fedora 27 | C, POSIX | ASCII |
FreeBSD | fr_FR.UTF-8 | UTF-8 |
macOS | fr_FR.UTF-8 | UTF-8 |
Fedora 27 | fr_FR.UTF-8 | UTF-8 |
Fedora 27 | zh_TW.Big5 | Big5 |
Tested operating systems:
macOS 10.13.2
FreeBSD 11.1
Fedora 27 (glibc 2.26)
localeconv()¶
Fedora 27:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
es_MX.utf8 | UTF-8 | thousands_sep | | U+2009 |
fr_FR.UTF-8 | UTF-8 | currency_symbol | | U+20AC (€) |
ps_AF.utf8 | UTF-8 | thousands_sep | | U+066C (٬) |
uk_UA.koi8u | KOI8-U | currency_symbol | | U+0433 U+0440 U+043d U+002E (грн.) |
uk_UA.koi8u | KOI8-U | thousands_sep | | U+00A0 |
macOS 10.13.2:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
ru_RU.ISO8859-5 | ISO8859-5 | currency_symbol | | U+0440 U+0443 U+0431 U+002e (руб.) |
FreeBSD 11:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
ar_SA.UTF-8 | UTF-8 | decimal_point | | U+066b ('٫') |
ar_SA.UTF-8 | UTF-8 | thousands_sep | | U+066c ('٬') |
ar_SA.UTF-8 | UTF-8 | currency_symbol | | U+0631 U+002e U+0633 U+002e U+200f ('ر.س.\u200f') |
zh_TW.Big5 | Big5 | currency_symbol | | |
zh_TW.Big5 | Big5 | decimal_point | | |
zh_TW.Big5 | Big5 | thousands_sep | | |
Note: On FreeBSD with LC_CTYPE="zh_TW.Big5", mbstowcs() doesn’t use Big5 but a different encoding and so returns mojibake.
Windows 7.1:
LC_ALL locale | Encoding | Field | Bytes | Text |
---|---|---|---|---|
fr-FR | cp1252 | currency_symbol | | U+20AC |
fr-FR | cp1252 | thousands_sep | | U+00A0 |
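Values like the ones in these tables can be collected from Python with locale.localeconv(). The following is a rough sketch, not the script used to build the tables: the helper name `localeconv_fields` is hypothetical, and a locale that is not installed on the system is simply reported as None.

```python
import locale


def localeconv_fields(locale_name,
                      fields=("decimal_point", "thousands_sep", "currency_symbol")):
    """Return selected localeconv() fields for locale_name, or None if unavailable."""
    saved = locale.setlocale(locale.LC_ALL)  # query the current setting
    try:
        locale.setlocale(locale.LC_ALL, locale_name)
    except locale.Error:
        return None  # locale not supported on this system
    try:
        conv = locale.localeconv()
        return {name: conv[name] for name in fields}
    finally:
        # Always restore the previous locale.
        locale.setlocale(locale.LC_ALL, saved)


c_fields = localeconv_fields("C")
print("C:", c_fields)
# May print None if the locale is not installed.
print("fr_FR.UTF-8:", localeconv_fields("fr_FR.UTF-8"))
```

Use ascii() on the returned strings to get the code points (for example '\u2009') rather than the rendered characters.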
strftime(), tzname¶
Fedora 27:
LC_ALL locale | Encoding | Month %b | Bytes | Text |
---|---|---|---|---|
fr_FR | Latin1 | December | | |
Windows 8.1:
LC_ALL locale | Encoding | Date, format | Bytes | Text |
---|---|---|---|---|
fr-FR | cp1252 | December, %b | | |
ja-JP | cp932? | Monday, %a | N/A | |
Python2:
vstinner@apu$ python2
>>> import time, locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR")
'fr_FR'
>>> time.strftime("%A, %d %B %Y", time.localtime(time.mktime((2018, 2, 1, 12, 0, 0, 0, 0, 0))))
'jeudi, 01 f\xe9vrier 2018'
non-ASCII tzname on Windows: “‘東京 (標準時)’ means ‘Tokyo (Standard Time)’ in Japanese.”
Commit af02e1c8: Add PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale() “Fix time.strftime() (if wcsftime() is missing): decode strftime() result from the current locale encoding, not from the filesystem encoding.”
Commit 720f34a3: Issue #5905: “time.strftime() is now using the locale encoding, instead of UTF-8, if the wcsftime() function is not available.”
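The effect of these fixes can be illustrated with the byte string returned by Python 2 in the session above. This sketch assumes the fr_FR locale uses Latin1 (ISO-8859-1), as in the Fedora table; decoding the same bytes from UTF-8 fails.

```python
# Byte string returned by time.strftime() on Python 2 with the fr_FR locale.
raw = b"jeudi, 01 f\xe9vrier 2018"

# Decoding from the locale encoding (assumed to be Latin1) gives the expected text.
text = raw.decode("iso-8859-1")

# Decoding from UTF-8 fails: 0xE9 starts a multi-byte sequence,
# but the next byte ("v") is not a continuation byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(text, utf8_ok)
```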
strerror()¶
LC_ALL locale | Encoding | Bytes | Text |
---|---|---|---|
fr_FR.ISO8859-1 | ISO-8859-1 | | |
Links:
non-ASCII strerror: “os.strerror(23) = ‘Trop de fichiers ouverts dans le syst\xe8me’.”
Commit 1f33f2b0: “Issue #13560: os.strerror() now uses the current locale encoding instead of UTF-8”
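In Python 3, os.strerror() returns a str that has already been decoded. A short sketch, assuming a Linux system where errno.ENFILE is 23 (the error number quoted above); the message text depends on the current locale.

```python
import errno
import os

# os.strerror() returns text (str), decoded from the current locale encoding.
message = os.strerror(errno.ENFILE)
print(ascii(message))

# The French message quoted above, as bytes in ISO-8859-1.
raw = b"Trop de fichiers ouverts dans le syst\xe8me"
decoded = raw.decode("iso-8859-1")
print(decoded)
```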
Political and regional differences¶
Unicode provides a single standard and so cannot have special cases depending on country or recent political changes. Examples: