utf8mb4
Jacques Distler: Remarkably, even after a decade of such pain, Unicode is, in 2012, still “cutting edge.”
Ouch.
It’s just data
Amusing exercise: compile a list of OS vendors who, by default, ship a non-Unicode-capable ("narrow") build of Python.
Posted by Jacques Distler at
I gather that I’m one of the lucky ones:
$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unichr(0x1D49C)
u'\U0001d49c'
Posted by Sam Ruby at
MacOSX 10.7.4 ("Lion"), with XCode 4.3.3:
% python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05)
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unichr(0x1D49C)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)
(N.B.: XCode 4.3.3 is the latest release of the Developer tools (June 11, 2012).)
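For reference, a minimal sketch of checking for a narrow build without triggering the ValueError above. It relies on `sys.maxunicode`, which is 0xFFFF on a narrow build and 0x10FFFF on a wide one; the names `NARROW`, `script_a`, and `expected_len` are illustrative only.

```python
from __future__ import print_function  # keeps print() consistent on Python 2

import sys

# sys.maxunicode is 0xFFFF on a narrow (UTF-16) build and 0x10FFFF on a
# wide build, as well as on every Python 3.3+ interpreter.
NARROW = sys.maxunicode == 0xFFFF

# U+1D49C MATHEMATICAL SCRIPT CAPITAL A, the character used in this thread.
script_a = u"\U0001d49c"

# A narrow build stores the literal as a surrogate pair, so len() reports 2
# there; a wide build reports 1.
expected_len = 2 if NARROW else 1

print("narrow build:", NARROW)
print("len(script_a):", len(script_a), "expected:", expected_len)
```

The `u"\U0001d49c"` literal is accepted by both kinds of build; it is `unichr()` that refuses arguments above 0xFFFF on a narrow one.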
Posted by Jacques Distler at
A nice summary of the state of Unicode support in various languages. Fittingly, his slides make heavy use of emoji characters from Unicode 6. So, if you didn’t already know that most languages’ Unicode support is a 💩, you’ll need to use Safari (or, alternatively, install the Symbola font) to view them properly.
Posted by Jacques Distler at
I think it’s not correct to characterize UTF-16 builds of Python as non-Unicode-capable. UTF-16 can represent all of Unicode. It was and continues to be a terrible, terrible idea to make the meaning of .py programs depend on whether the interpreter is compiled in UTF-32 or UTF-16 mode, though. Also, if a language moves away from UTF-16, UTF-32 is the wrong way to go. UTF-8 storage with library support for per-code point iteration is the right way to go.
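To make those two claims concrete, here is a rough sketch of how a code point beyond U+FFFF becomes a UTF-16 surrogate pair, and of what per-code point iteration over UTF-8 storage could look like. Both function names are hypothetical and not taken from any library, and the decoder assumes well-formed input, so it does no validation.

```python
def utf16_code_units(code_point):
    """Return the one or two UTF-16 code units that encode code_point."""
    if code_point < 0x10000:
        return [code_point]
    offset = code_point - 0x10000
    # High surrogate carries the top 10 bits, low surrogate the bottom 10.
    return [0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)]


def iter_code_points(utf8_bytes):
    """Yield code points from well-formed UTF-8 (no error checking)."""
    data = bytearray(utf8_bytes)
    i = 0
    while i < len(data):
        lead = data[i]
        # The lead byte determines the sequence length and its payload bits.
        if lead < 0x80:
            length, value = 1, lead
        elif lead < 0xE0:
            length, value = 2, lead & 0x1F
        elif lead < 0xF0:
            length, value = 3, lead & 0x0F
        else:
            length, value = 4, lead & 0x07
        # Each continuation byte contributes six more bits.
        for byte in data[i + 1:i + length]:
            value = (value << 6) | (byte & 0x3F)
        yield value
        i += length


print([hex(u) for u in utf16_code_units(0x1D49C)])
print([hex(cp) for cp in iter_code_points(b"A\xf0\x9d\x92\x9c")])
```

U+1D49C takes two code units in UTF-16 and four bytes in UTF-8, yet the iterator still hands it back as a single code point, which is the behaviour the per-code point library support would need to provide.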
Posted by Henri Sivonen at
“I think it’s not correct to characterize UTF-16 builds of Python as non-Unicode-capable”
How about defining "non-Unicode-capable" to mean builds of Python for which the following fails?
unichr(0x1D49C)
Posted by Sam Ruby at
Hey, Python 3.3 has been released and drops the wide-build/narrow-build distinction (by picking the most compact internal representation as needed, from ASCII to UCS-4). Prior to that, Linux builds did the right thing (due to no UCS-2 syscall legacy), while OS X & Windows builds did not.
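One way to observe that variable-width storage, as a sketch that assumes CPython 3.3 or later (`sys.getsizeof` reports implementation-specific sizes, so other interpreters may behave differently). The helper `bytes_per_char` is made up for illustration.

```python
import sys


def bytes_per_char(ch, n=1000):
    """Estimate per-character storage by comparing two string sizes.

    Subtracting the size of a 1-character string from that of an
    (n + 1)-character string cancels the fixed object header.
    """
    return (sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch)) // n


samples = [
    ("ASCII", "a"),
    ("BMP, beyond Latin-1", "\u20ac"),   # EURO SIGN
    ("beyond the BMP", "\U0001d49c"),    # MATHEMATICAL SCRIPT CAPITAL A
]

for label, ch in samples:
    print(label, bytes_per_char(ch))
```

On CPython 3.3+ this should report 1, 2, and 4 bytes per character respectively, since the widest character in a string determines the width used for the whole string.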
Posted by anonymous at