Truth be told, however, it would also be considered 100%
valid XHTML 1.1, encoded as iso-8859-1 (roman), iso-8859-5
(cyrillic), win-1252 (Microsoft), or macroman (Apple).
In fact, it would also be 100% valid if it were declared as
encoded as us-ascii.
So, I’m not actually disagreeing with Jacques, and
therefore probably could avoid a smack. I would, however,
quibble with his first line:
i18n is hard.
Don’t let anyone tell you any different.
Certain versions of certain tools don’t handle utf-8 well:
that much I agree with. I would also add that finding the
right combination for a given configuration and keeping it working
through upgrades is a bit of an effort.
But i18n != utf-8. You can i18n with us-ascii just
fine. Just use numeric entities. In my Python
implementation, the logic to do this is a bit spread out, but I
have built a more compact
Moral of the story: don’t convert to utf-8 in a plugin
unless you are certain that every link in the chain
can handle utf-8 properly. If you feel that you must convert
to utf-8, then it is best to do it as some sort of post-processing
filter after all the other logic has taken place.
But, for me, numeric entities are just fine.
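As a minimal sketch of the "just use numeric entities" approach, assuming modern Python 3 (this is not necessarily how the implementation mentioned above does it), the standard `xmlcharrefreplace` error handler turns every character that us-ascii cannot represent into a numeric character reference. The function name `to_ascii_ncr` is hypothetical.

```python
def to_ascii_ncr(text: str) -> str:
    """Return text as pure us-ascii, with non-ASCII characters as NCRs."""
    # 'xmlcharrefreplace' substitutes &#NNNN; for anything ascii cannot encode.
    # Hypothetical helper for illustration; ASCII input passes through unchanged.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


if __name__ == "__main__":
    print(to_ascii_ncr("caf\u00e9 costs 3 \u20ac"))
```

Note that this only handles the character set question. Markup characters such as `&` and `<` are already ASCII and pass through untouched, so any escaping of those has to happen separately.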
“Just use numeric entities” assumes you are writing for (or at least publishing to) the web. There are lots of offline applications of i18n. (By the way, lots of web-centric software doesn’t handle numeric entities well either.) [link]
Is there a toolchain (for this type of application) whose elements are unicode-safe?
Versions of MySQL before 4.1 are not. Versions of Perl before 5.8 are not (or not well). No version of DBD::mysql is. MovableType (built on the above) is not.
Using NCRs is fine for me, and fine for you, but it’s not fine for someone who is really using international characters in a serious way. Converting to utf-8 at the end of the toolchain doesn’t cut it, if you are really trying to do character-based manipulations in some international character set. (See, e.g., Zack’s example of Urdu usernames in phpbb.)
Sam, I am not in the arms race because I am no biggie. But I ate my fair share of trouble by going all UTF and I can say the following:
1. Numeric entities are NOT ok - they consume a lot of space, they make the source unreadable, they are not handled anywhere besides XML and HTML. They are not handled by databases - no search for you. They are not handled by terminals - which means no automation. Actually they are not handled anywhere except XML/HTML land. PHP4 has no mechanism for decoding them either.
2. When starting on any programming project you can reasonably expect every component of the toolchain to have broken UTF support. Every single one of them. From database to the bindings to the scripting language to the web server. Rest assured - there is no SQL call for client connection charset. Rest assured - full-text searching will be broken. Rest assured - field length of your DB counts bytes instead of chars.
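The bytes-versus-characters point can be seen without any database at all. A small sketch in Python 3, where the helper `fits_in_bytes` is hypothetical and only illustrates what a byte-counted length limit does to non-ASCII text:

```python
def fits_in_bytes(text: str, limit: int, encoding: str = "utf-8") -> bool:
    """True if text fits in a field whose length limit is counted in bytes."""
    # Hypothetical helper: mimics a column limit that counts bytes, not characters.
    return len(text.encode(encoding)) <= limit


word = "\u042e\u043d\u0438\u043a\u043e\u0434"  # a six-letter Cyrillic word

if __name__ == "__main__":
    print(len(word))                  # number of characters
    print(len(word.encode("utf-8")))  # number of bytes once encoded
    print(fits_in_bytes(word, 10))    # would a 10-byte field hold it?
```

Each Cyrillic letter takes two bytes in utf-8, so a field sized for ten "characters" but measured in bytes cannot hold a six-letter word.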
What is important (and what many don’t realise) - all of this is wrong and it has to be fixed. Goddamn it’s long overdue if you ask me.
Ruby itself has no UTF support at all (and thus no notion of character and no notion of normalization - which means “no case conversions for you” etc.) - which is a pity because in the case of Ruby this is a political matter. I had to go through some intensive hacking to make it work. You can’t stick your head in the sand and pretend that “I receive some non-ASCII content every now and then but I can’t even read it”. Many, many, many people need to edit it, enter it, search it. And I deeply sympathize with people like Jacques or Zack who just go and hack the bits other developers blissfully ignored. And only if we keep on hacking might the situation improve somewhat.
I already know how to call SET NAMES UTF8 in 4 programming languages to be precise. I think it’s absolutely ridiculous that every single piece of software that has database configuration needed manual introduction of this call. It’s just plain wrong.
Numeric entities are not okay for someone like me who writes in a non-English language because search doesn’t work.
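To illustrate why search breaks, here is a rough Python 3 sketch. The stored string and the naive `search` helper are made up for the example; `html.unescape` is the standard library call that decodes numeric character references.

```python
import html

# Hypothetical stored value: text saved as numeric entities.
stored = "caf&#233; &#1084;&#1080;&#1088;"


def search(haystack: str, needle: str) -> bool:
    """Naive substring match, standing in for a plain-text search."""
    return needle in haystack


if __name__ == "__main__":
    needle = "caf\u00e9"
    print(search(stored, needle))                 # entity-encoded text
    print(search(html.unescape(stored), needle))  # decoded before searching
```

The match only succeeds once the entities are decoded, which is exactly the step a database or a terminal tool will not do for you.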
And every programming language and tool I have used for the web required hacking to work with UTF-8. And I still haven’t gotten everything working for a host of projects involving python, perl, php, MT, WP, phpbb, etc.
BTW your page displays horribly and completely uselessly on my Treo.
It took me quite a while before I felt I understood what the heck I was doing with unicode... I think over the past month or so I moved from bewilderment through dawning realization and now I’m in the rueful acceptance phase... and that’s with a language that tries to ease the unicode pain - Python.
It certainly helps if all the key weapons in your toolbelt support unicode sanely - “qp”, a new cousin to the elder Quixote http framework - does this. But if you don’t like python object db’s or need to code for Winders, don’t bother.
You can browse the tarball or download QP from:  [link]
You’ll also need to check out qpy and durus, also from the same site noted above.
From the readme: The abbreviation “qp” stands for “quantum placet”, the Latin phrase meaning “as much as you please”.
Julik doesn’t like UTF: When starting on any programming project you can reasonably expect every component of the toolchain to have broken UTF support. Every single one of them. From database to the bindings to the scripting language to the web...
Shouldn’t the message re: UTF-8 be - if you’re going to use it, convert ALL of your content offline in a “migration”, rather than attempting conversions on-the-fly (the exception being remote feeds)? The latter seems to be a path to madness. Meanwhile MySQL < 4.1.14 shouldn’t actually convert (or destroy) an encoding.
Been putting together a PHP-specific wiki page at [link] (apologies IBM - wiki requires signup but is open to all if the initial section needs updating) - mainly piecing together the bits Sam’s been reporting over the years (although I may have made mistakes). I imagine most of this should apply to Ruby (iconv, PCRE etc.).
Also have a bunch of functions for validation as well as “utf-8 aware” versions of PHP’s string functions which I need to get round to releasing under CVS here: [link] (really intended as a last resort on shared hosts). Found PCRE can do a lot to help with UTF-8, as it “understands” UTF-8 up to 6 bytes (bearing this in mind: [link]) and otherwise this is surprisingly fast and strict: [link].
And think UTF-8 can be bearable (if not easy) in languages like PHP - Dokuwiki is a great example (stores content in files) - [link]
no, i18n != utf-8. it’s much, much more complicated than getting the text encoding right. Plone has been fairly successful in europe largely because it does a better job with i18n than most competing content management systems. dig through its i18n architecture and you’ll see that it’s not trivial to do right. eg, see: [link]
utf-8 support isn’t hard to get right, it just usually requires discipline and a commitment to testing. some languages make it easier than others. eg, python’s unicode support is relatively sane, with a separate “unicode string” data type and methods to convert that to and from byte strings in various encodings. to make your application unicode safe, you just use unicode strings everywhere internally and then make sure that the boundaries of your program do the right encoding/decoding.
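a minimal sketch of that "decode at the edges" pattern, written for Python 3 (where `str` is the unicode string type and `bytes` is the byte string; at the time of this comment the types were `unicode` and `str`). the function name `shout` and the default encodings are assumptions chosen for the example:

```python
def shout(raw: bytes,
          in_encoding: str = "iso-8859-1",
          out_encoding: str = "utf-8") -> bytes:
    """Upper-case a byte string, working on characters internally."""
    text = raw.decode(in_encoding)    # boundary in: bytes -> unicode string
    text = text.upper()               # internal logic sees characters only
    return text.encode(out_encoding)  # boundary out: unicode string -> bytes


if __name__ == "__main__":
    # b"caf\xe9" is the word with e-acute, encoded as iso-8859-1.
    print(shout(b"caf\xe9"))
```

the point of the design is that only the first and last lines know anything about encodings; everything in between can ignore them.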
it’s still not trivial though, depending on what libraries you use and how well they support unicode. i actually just yesterday wrote up my notes of my experience doing a unicode audit and overhaul of a python web application: [link]
Upcoming.org uses numeric entities to display characters that it can’t store in iso-8859-1 (ugh), so when I go back to edit the indie rock concert info that I’ve inputted into the Upcoming Shanghai metro, I get a mess of numbers and ampersands in the form’s textarea, instead of characters that I can read and deal with.
Perl’s utf-8 support is decent, but still atoning for past mistakes.
Unicode is a wonderful thing. it is also occasionally the bane of my existence. Joel Spolsky has a classic article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets that covers the...
Neither Python nor Perl get this right. They both make Unicode optional and subclass (or flag) unicode strings, and of course every developer on Earth who doesn’t know (or care) just uses them as-is. As long as Unicode is optional, nobody will care. What these silly hacks do in Perl and Python is force you to typecast strings to each other! (in a scripting language? are you out of your mind?). Which, most certainly, nobody actually does.
Java and C# do it right OTOH. When you have a string - it’s UTF-16, and you got NO choice. That’s the way it should be, otherwise all English developers will just stick their heads in the sand forever. Or emigrate to the US prior to doing that.