+++++++
Unicode
+++++++
.. image:: unicode.png
:alt: Unicode logo
:align: right
:target: http://unicodebook.readthedocs.org/
Read my free ebook: `Programming with Unicode
`_!
Encodings
=========
* Windows
* OEM code page: used by stdin, stdou and stderr in the Windows console
* ANSI code page: used by all other Windows "ANSI" functions. Some examples:
filenames, command line arguments, environment variables, etc.
* UNIX: "Locale encoding"
* ``LC_CTYPE`` locale
* used for filenames, command line arguments, environment variables,
the console (stdin, stdout, stderr)
* Common encodings:
* UTF-8
* ISO 8859-1 aka Latin1 or Windows code page 1252
* ASCII
.. _python-unicode:
Python
======
See my conference (in french) "Comprendre les erreurs Unicode" (Pycon FR 2009
at Paris): `slides (PDF)
`_
and `video `_.
Narrow and wide builds, PEP 393
-------------------------------
Python 3.3 introduced the `Flexible String Representation (PEP 393)
`_ and supports the whole Unicode
range (``U+0000`` - ``U+10ffff``) on all platforms.
Older Python versions had a "narrow or wide" compilation option:
* UNIX and Mac OS X uses wide mode: the ``unicode`` type uses 32-bit code
points. In Unicode, it is called the ``UCS-4`` encoding.
* Windows uses narrow mode: the ``unicode`` type uses 16-bit code points,
non-BMP characters (unicode range ``U+10000`` - ``U+10ffff``) are used as
a surrogate pair (two 16-bit code points). In Unicode, it is called the
``UTF-16`` encoding. This mode is preferred on Windows because Windows kernel
uses also the ``UTF-16`` encoding internally.
Use ``sys.maxunicode == 0xffff`` to check if Python is compiled in narrow mode.
Otherwise, ``sys.maxunicode`` is equal to ``0x10ffff``.
Python 2
--------
* ``str`` type and ``"abc"`` are strings of *bytes*, ``unicode`` type is a
string of *characters*
* "Default encoding"
* ``sys.getdefaultencoding()``
* used by ``unicode.encode()`` and ``str.decode()`` when no encoding is
specified
* ASCII by default, must not by modified (``sys.setdefaultencoding()``)
* File system encoding
* ``sys.getfilesystemencoding()``
* used to encode filenames and environment variables
* used on UNIX by ``os.listdir(unicode)`` to decode filenames
* ANSI code page (``mbcs``) on Windows, ``utf-8`` on Mac OS X, the locale
encoding on UNIX
* Locale encoding
* ``locale.getpreferredencoding()``
* used by default by io.TextIOWrapper
* ANSI code page on Windows, ``LC_CTYPE`` locale on UNIX
* OEM code page (Windows only)
* ``sys.stdin.encoding``, ``sys.stdout.encoding`` and ``sys.stderr.encoding``
Python 3
--------
* ``bytes`` type is a string of *bytes*, ``str`` type and ``"abc"`` are strings
of *characters*
* UTF-8
* used for the default encoding of the source code
* "Locale encoding"
* ``locale.getpreferredencoding()``
* ANSI code page on Windows, ``LC_CTYPE`` locale on UNIX
* used by ``sys.stdin``, ``sys.stdout``, ``sys.stderr``, and by default by
``open()`` (and io.TextIOWrapper)
* "File system encoding"
* sys.getfilesystemencoding()
* ANSI code page (``mbcs``) on Windows, ``utf-8`` on Mac OS X, the locale
encoding on UNIX
* used for filenames, command line arguments, environment variables
* "Default encoding"
* ``sys.getdefaultencoding()``, hardcoded to ``utf-8``
* used by ``bytes.decode()`` and ``str.encode()`` when no encoding is
specified
* OEM code page (Windows only)
* ``sys.stdin.encoding``, ``sys.stdout.encoding`` and ``sys.stderr.encoding``
Issues
------
* `GB2312 codec is using a wrong covert table
`_: WONTFIX, It's a bug, but one which is
present in a lot of other systems as well, so we'd potentially make it
impossible to write GB2312 data which is supposed to be read back by these
other systems.
Test non-ASCII characters with locales
======================================
It seems like FreeBSD 11 doesn't support all encodings: only Latin1 and UTF-8
seem to be implemented. At least, KOI8-R, Big5 and CP1131 are not implemented
properly.
Windows locales: "fr-FR", "en-US", "ja-JP", etc.
On Windows, before setlocale(LC_CTYPE, "") is called, LC_CTYPE uses the Latin1
encoding in practice (see `Python issue #29571
`_). Call setlocale(LC_CTYPE, "")
to use the ANSI code page.
My tools:
* `test_all_locales.py
`_:
test Python implementation of locales. Only support a few operating systems.
* `all_locales.py
`_:
script to list all **working** locales, can be different than "locale -a"
* `c_locale.c `_:
basic info on the "C" locale
Use cases
---------
* Latin1 or UTF-8 encoding (locale different than C and POSIX)
* C or POSIX locale: ASCII encoding on Linux, Latin1 encoding on
FreeBSD/Solaris (but ASCII announced by nl_langinfo(CODESET))
* LC_NUMERIC != LC_CTYPE: the fun localeconv() bug, https://bugs.python.org/issue31900
* python 3.7 -X utf8
* macOS and Android UTF-8
Locale, announced encoding, effective encoding
----------------------------------------------
Inconsistent:
================ ========== ================== ==================
Operating system Locale Announced encoding Effective encoding
================ ========== ================== ==================
FreeBSD C, POSIX US-ASCII ISO-8859-1
FreeBSD zh_TW.Big5 Big5 ? (not Big5)
macOS C, POSIX US-ASCII ISO-8859-1
macOS zh_TW.Big5 Big5 ? (not Big5)
================ ========== ================== ==================
Consistent, announced encoding = effective encoding:
================ =========== ==================
Operating system Locale Encoding
================ =========== ==================
Fedora 27 C, POSIX ASCII
FreeBSD fr_FR.UTF-8 UTF-8
macOS fr_FR.UTF-8 UTF-8
Fedora 27 fr_FR.UTF-8 UTF-8
Fedora 27 zh_TW.Big5 Big5
================ =========== ==================
Tested operating systems:
* macOS 10.13.2:
* FreeBSD 11.1
* Fedora 27 (glibc 2.26)
localeconv()
------------
Fedora 27:
============== ======== =============== ======================== ===================================
LC_ALL locale Encoding Field Bytes Text
============== ======== =============== ======================== ===================================
es_MX.utf8 UTF-8 thousands_sep ``0xE2 0x80 0x89`` U+2009
fr_FR.UTF-8 UTF-8 currency_symbol ``0xE2 0x82 0xAC`` U+20AC (€)
ps_AF.utf8 UTF-8 thousands_sep ``0xD9 0xAC`` U+066C (٬)
uk_UA.koi8u KOI8-U currency_symbol ``0xC7 0xD2 0xCE 0x2E`` U+0433 U+0440 U+043d U+002E (грн.)
uk_UA.koi8u KOI8-U thousands_sep ``0x9A`` U+00A0
============== ======== =============== ======================== ===================================
macOS 10.13.2:
=============== ========= =============== ======================== ==================================
LC_ALL locale Encoding Field Bytes Text
=============== ========= =============== ======================== ==================================
ru_RU.ISO8859-5 ISO8859-5 currency_symbol ``b'\xe0\xe3\xd1.'`` U+0440 U+0443 U+0431 U+002e (руб.)
=============== ========= =============== ======================== ==================================
FreeBSD 11:
=============== ========= =============== ===================================== =================================================
LC_ALL locale Encoding Field Bytes Text
=============== ========= =============== ===================================== =================================================
ar_SA.UTF-8 UTF-8 decimal_point ``b'\xd9\xab'`` U+066b ('٫')
ar_SA.UTF-8 UTF-8 thousands_sep ``b'\xd9\xac'`` U+066c ('٬')
ar_SA.UTF-8 UTF-8 currency_symbol ``b'\xd8\xb1.\xd8\xb3.\xe2\x80\x8f'`` U+0631 U+002e U+0633 U+002e U+200f ('ر.س.\u200f')
zh_TW.Big5 Big5 currency_symbol ``b'\xa2\xdc\xa2\xe2\xa2\x43'`` ``u'\uff2e\uff34\uff04'`` (NT$)
zh_TW.Big5 Big5 decimal_point ``b'\xa1\x44'`` ``u'\uff0e'`` (.)
zh_TW.Big5 Big5 thousands_sep ``b'\xa1\x41'`` ``u'\uff0c'`` (,)
=============== ========= =============== ===================================== =================================================
Note: On FreeBSD with LC_CTYPE="zh_TW.Big5", mbstowcs() doesn't use Big5 but a
different encoding and so returns mojibake.
Windows 7.1:
=============== ========= =============== =========== ======
LC_ALL locale Encoding Field Bytes Text
=============== ========= =============== =========== ======
fr-FR cp1252 currency_symbol ``b'\x80'`` U+20AC
fr-FR cp1252 thousands_sep ``b'\xA0'`` U+00A0
=============== ========= =============== =========== ======
strftime(), tzname
------------------
Fedora 27:
============== ======== =============== ============== ===========================
LC_ALL locale Encoding Month %b Bytes Text
============== ======== =============== ============== ===========================
fr_FR Latin1 December ``b'd\xe9c.'`` ``'d\xe9c.'`` (déc.)
============== ======== =============== ============== ===========================
Windows 8.1:
============== ======== =============== ============== ====================
LC_ALL locale Encoding Date, format Bytes Text
============== ======== =============== ============== ====================
fr-FR cp1252 December, %b ``b'd\xe9c.'`` ``'d\xe9c.'`` (déc.)
ja-JP cp932? Monday, %a N/A ``'\u6708'``
============== ======== =============== ============== ====================
Python2::
vstinner@apu$ python2
>>> import time, locale
>>> locale.setlocale(locale.LC_ALL, "fr_FR")
'fr_FR'
>>> time.strftime("%A, %d %B %Y", time.localtime(time.mktime((2018, 2, 1, 12, 0, 0, 0, 0, 0))))
'jeudi, 01 f\xe9vrier 2018'
* `non-ASCII tzname on Windows `_:
"'東京 (標準時)' means 'Tokyo (Standard Time)' in Japanese."
* https://bugs.python.org/issue5905
* https://bugs.python.org/issue13560
* https://bugs.python.org/issue16322
* `Commit af02e1c8: Add PyUnicode_DecodeLocaleAndSize() and PyUnicode_DecodeLocale()
`_
"Fix time.strftime() (if wcsftime() is missing): decode strftime() result
from the current locale encoding, not from the filesystem encoding."
* `Commit 720f34a3: Issue #5905
`_:
"time.strftime() is now using the locale encoding, instead of UTF-8, if the
wcsftime() function is not available."
strerror()
----------
=============== ========== =========================================== ===========================================
LC_ALL locale Encoding Bytes Text
=============== ========== =========================================== ===========================================
fr_FR.ISO8859-1 ISO-8859-1 ``b'Fichier ou r\xe9pertoire inexistant'`` ``'Fichier ou r\xe9pertoire inexistant'``
=============== ========== =========================================== ===========================================
Links:
* `non-ASCII strerror `_:
"os.strerror(23) = 'Trop de fichiers ouverts dans le syst\\xe8me'."
* https://bugs.python.org/issue13560
* `Commit 1f33f2b0
`_:
"Issue #13560: os.strerror() now uses the current locale encoding instead
of UTF-8"
Political and regional differences
==================================
Unicode provides a single standard and so cannot have special cases depending
on country or recent political changes. Examples:
* 2018: `lower() on Turkish letter "İ" returns a 2-chars-long string
`_
* 2017: `Germany made the upper case ß official. 'ß'.upper() should now return ẞ.
`_