It’s just data

utf8mb4

Jacques Distler: Remarkably, even after a decade of such pain, Unicode is, in 2012, still “cutting edge.”

Ouch.


Amusing exercise: compile a list of OS vendors who, by default, ship a non-Unicode-capable ("narrow") build of Python.

Posted by Jacques Distler at

I gather that I’m one of the lucky ones:

$ python
Python 2.7.3 (default, Apr 20 2012, 22:39:59) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> unichr(0x1D49C)
u'\U0001d49c'
Posted by Sam Ruby at

MacOSX 10.7.4 ("Lion"), with XCode 4.3.3:

% python
Python 2.7.1 (r271:86832, Jun 16 2011, 16:59:05) 
[GCC 4.2.1 (Based on Apple Inc. build 5658) (LLVM build 2335.15.00)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> unichr(0x1D49C)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: unichr() arg not in range(0x10000) (narrow Python build)

(N.B.: XCode 4.3.3 is the latest release of the Developer tools (June 11, 2012).)

Posted by Jacques Distler at
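
For readers comparing the two transcripts above: a minimal sketch of how one might tell the builds apart without triggering the ValueError, and what a narrow build stores instead of a single code point. The helper names `is_narrow_build` and `surrogate_pair` are made up for this illustration; only `sys.maxunicode` comes from Python itself.

```python
import sys


def is_narrow_build():
    # sys.maxunicode is 0xFFFF on narrow builds and 0x10FFFF on wide builds
    # (and on every build of Python 3.3 or later).
    return sys.maxunicode == 0xFFFF


def surrogate_pair(code_point):
    """Split a supplementary code point into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point: %#x" % code_point)
    offset = code_point - 0x10000
    # Top 10 bits go into the high surrogate, bottom 10 bits into the low one.
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    print("narrow build" if is_narrow_build() else "wide (or 3.3+) build")
    high, low = surrogate_pair(0x1D49C)
    print("U+1D49C as UTF-16: %04X %04X" % (high, low))
```

On a narrow build, U+1D49C has to be held as those two code units, which is why `unichr()` refuses to return it as one character there.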

A nice summary of the state of Unicode support in various languages. Fittingly, his slides make heavy use of emoji characters from Unicode 6. So, if you didn’t already know that most languages' Unicode support is a 💩, you’ll need to use Safari (or, alternatively, install the Symbola font) to view them properly.

Posted by Jacques Distler at

I think it’s not correct to characterize UTF-16 builds of Python as non-Unicode-capable. UTF-16 can represent all of Unicode. It was and continues to be a terrible, terrible idea to make the meaning of .py programs depend on whether the interpreter is compiled in UTF-32 or UTF-16 mode, though. Also, if a language moves away from UTF-16, UTF-32 is the wrong way to go. UTF-8 storage with library support for per-code point iteration is the right way to go.

Posted by Henri Sivonen at
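
To make "UTF-8 storage with library support for per-code point iteration" concrete, here is a rough sketch of what such a library routine could look like. The function name `iter_code_points` is invented for this example, and the decoder is deliberately simplified: it checks lead and continuation bytes but does not reject overlong encodings or encoded surrogates, which a real implementation would have to do.

```python
def iter_code_points(data):
    """Yield integer code points from a UTF-8 encoded byte string."""
    buf = bytearray(data)  # gives integer bytes on both Python 2 and 3
    i, n = 0, len(buf)
    while i < n:
        lead = buf[i]
        if lead < 0x80:
            length, cp = 1, lead
        elif lead >> 5 == 0b110:
            length, cp = 2, lead & 0x1F
        elif lead >> 4 == 0b1110:
            length, cp = 3, lead & 0x0F
        elif lead >> 3 == 0b11110:
            length, cp = 4, lead & 0x07
        else:
            raise ValueError("invalid lead byte at offset %d" % i)
        if i + length > n:
            raise ValueError("truncated sequence at offset %d" % i)
        for b in buf[i + 1:i + length]:
            if b >> 6 != 0b10:
                raise ValueError("invalid continuation byte near offset %d" % i)
            # Each continuation byte contributes six more bits.
            cp = (cp << 6) | (b & 0x3F)
        yield cp
        i += length


if __name__ == "__main__":
    # "A", U+00E9 and U+1D49C, stored as UTF-8 bytes.
    for cp in iter_code_points(b"A\xc3\xa9\xf0\x9d\x92\x9c"):
        print("U+%04X" % cp)
```

The caller only ever sees whole code points, so the program means the same thing however the interpreter was compiled.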

I think it’s not correct to characterize UTF-16 builds of Python as non-Unicode-capable

How about defining "non-Unicode-capable" as any build of Python on which the following fails?

unichr(0x1D49C)
Posted by Sam Ruby at

Hey, Python 3.3 has been released and drops the wide-build/narrow-build distinction (by picking the most compact internal representation as needed, from ASCII to UCS-4). Prior to that, Linux builds did the right thing (due to no UCS-2 syscall legacy), while OS X & Windows builds did not.

PEP 393

Posted by anonymous at
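
One rough way to observe the flexible representation from PEP 393, assuming CPython 3.3 or later: grow a string by one code point and see how much its reported size grows. This is only a sketch. `bytes_per_code_point` is a name made up here, and `sys.getsizeof` reports a CPython implementation detail, so other interpreters may show different numbers.

```python
import sys


def bytes_per_code_point(ch, n=100):
    # Adding one more code point to a string grows it by the per-code-point
    # storage width that the interpreter chose for that string.
    return sys.getsizeof(ch * (n + 1)) - sys.getsizeof(ch * n)


# ASCII, Latin-1, BMP and supplementary-plane sample characters.
samples = ["a", "\u00e9", "\u20ac", "\U0001d49c"]
widths = [bytes_per_code_point(ch) for ch in samples]

if __name__ == "__main__":
    for ch, width in zip(samples, widths):
        print("U+%04X: %d byte(s) per code point" % (ord(ch), width))
    # With no narrow/wide split, a supplementary character has length 1.
    print(len("\U0001d49c"))
```

On such a build the widths should come out as 1, 1, 2 and 4 bytes, and the string containing U+1D49C has length 1.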
