It’s just data

Dragons be gone

Jacques Distler: if anyone tells you: “i18n is easy, just use utf-8!” … go ahead and smack them.

Luckily, I’m outside of arm’s reach.  You see, my weblog is 100% valid XHTML 1.1, encoded as utf-8.

Truth be told, however, it would also be 100% valid XHTML 1.1 if it were declared as iso-8859-1 (roman), iso-8859-5 (cyrillic), win-1252 (Microsoft), or macroman (Apple).

In fact, it would also be 100% valid if it were declared as us-ascii: every byte in the page falls within the ASCII range, which all of those encodings share.
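That claim is easy to check for yourself. Here is a minimal sketch, assuming Python 3 and its standard codec names (`cp1252` for win-1252, `mac_roman` for macroman); the function name and the sample markup are made up for illustration and are not taken from my weblog's code.

```python
# Encodings named above, spelled the way Python's codec registry expects.
ENCODINGS = ["ascii", "utf-8", "iso-8859-1", "iso-8859-5", "cp1252", "mac_roman"]


def decodes_identically(data: bytes) -> bool:
    """Return True if `data` decodes to the same text under every encoding."""
    decoded = set()
    for name in ENCODINGS:
        try:
            decoded.add(data.decode(name))
        except UnicodeDecodeError:
            # At least one encoding cannot read these bytes at all.
            return False
    return len(decoded) == 1


# Hypothetical page fragment: non-ASCII characters appear only as numeric entities.
page = b"<p>caf&#233; &#1076;&#1088;&#1072;&#1082;&#1086;&#1085;</p>"

print(decodes_identically(page))                    # True
print(decodes_identically("café".encode("utf-8")))  # False
```

The second call fails because the raw utf-8 bytes for "é" are outside the ASCII range, so the declared encoding starts to matter.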

So, I’m not actually disagreeing with Jacques, and therefore probably could avoid a smack.  I would, however, quibble with his first line:

i18n is hard. Don’t let anyone tell you any different.

Certain versions of certain tools don’t handle utf-8 well: that much I agree with.  I would also add that finding the right combination for a given configuration and keeping it working through upgrades is a bit of an effort.

But i18n != utf-8.  You can i18n with us-ascii just fine.  Just use numeric entities.  In my Python implementation, the logic to do this is a bit spread out, but I have built a more compact Ruby version.
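For the curious, here is roughly what the numeric entity approach looks like. This is a minimal sketch in Python 3, not the code from either of my implementations; it leans on the standard `xmlcharrefreplace` error handler, and the function name is made up for illustration.

```python
import html


def to_ascii_with_entities(text: str) -> bytes:
    """Encode text as us-ascii, turning every non-ASCII character
    into a decimal numeric character reference such as &#233;."""
    return text.encode("ascii", errors="xmlcharrefreplace")


encoded = to_ascii_with_entities("naïve дракон")
print(encoded)
# b'na&#239;ve &#1076;&#1088;&#1072;&#1082;&#1086;&#1085;'

# A browser (or html.unescape) turns the references back into characters.
print(html.unescape(encoded.decode("ascii")) == "naïve дракон")  # True
```

Note that this only touches characters outside the ASCII range. Markup characters such as `&` and `<` pass through unchanged, so it is meant to run on text that is already properly escaped.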

Moral of the story: don’t convert to utf-8 in a plugin unless you are certain that every link in the chain can handle utf-8 properly.  If you feel that you must convert to utf-8, then it is best to do it as some sort of post-processing filter after all the other logic has taken place.
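To make the "post-processing filter" idea concrete, here is one way it might look. This is a sketch under assumptions: the `Plugin` signature and the `render` function are hypothetical and do not describe any particular weblog tool. The point is only that plugins pass text along untouched by encoding concerns, and bytes are produced exactly once, at the very end.

```python
from typing import Callable, Iterable

# Hypothetical plugin signature: text in, text out. No plugin ever sees bytes.
Plugin = Callable[[str], str]


def render(text: str, plugins: Iterable[Plugin], encoding: str = "us-ascii") -> bytes:
    """Run every plugin on the text, then encode once as the final step."""
    for plugin in plugins:
        text = plugin(text)
    # Characters the target encoding cannot represent become numeric entities.
    return text.encode(encoding, errors="xmlcharrefreplace")


# Two toy plugins, purely for illustration.
plugins = [str.strip, lambda t: t.replace("(c)", "\u00a9")]

print(render("  (c) Sam  ", plugins))                    # b'&#169; Sam'
print(render("  (c) Sam  ", plugins, encoding="utf-8"))  # b'\xc2\xa9 Sam'
```

Switching the output between us-ascii and utf-8 is then a one-argument change, and none of the plugins need to know which one was chosen.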

But, for me, numeric entities are just fine.