Scott Johnson: ɥɦɐ I just had to try out
some funky characters to see what would happen. :)
An advantage of declaring this page as utf-8 is that I can
distinguish between somebody typing ɥɦɐ and somebody typing the
numeric character references &#613;&#614;&#592;, meaning that people
don’t have to double escape if they want to talk about
numeric entities on my weblog.
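A rough sketch of why the declared encoding matters here. It assumes the browser replaces any character the page encoding cannot represent with a decimal numeric character reference when submitting a form, which is the common behavior; the variable names are purely illustrative.

```python
# What a commenter might type into the form.
typed_chars = "\u0265\u0266\u0250"        # the three characters themselves
typed_entities = "&#613;&#614;&#592;"     # the same characters spelled as entities

# Assumption: on an iso-8859-1 page, unrepresentable characters are
# submitted as numeric character references, so both inputs arrive
# as identical bytes and the server cannot tell them apart.
legacy_chars = typed_chars.encode("iso-8859-1", "xmlcharrefreplace")
legacy_entities = typed_entities.encode("iso-8859-1", "xmlcharrefreplace")
same_on_legacy_page = legacy_chars == legacy_entities

# On a utf-8 page every character is representable, so the two inputs
# arrive as different byte sequences.
utf8_chars = typed_chars.encode("utf-8")
utf8_entities = typed_entities.encode("utf-8")
same_on_utf8_page = utf8_chars == utf8_entities

print(same_on_legacy_page, same_on_utf8_page)
```

Under that assumption, only the utf-8 page preserves the difference, which is what removes the need for double escaping.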
But don’t try to
search
for ɥɦɐ. While such a query will be
properly URI encoded based on utf-8, that particular string does
not appear in any text files.
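To make the failure concrete, here is a minimal sketch. It assumes entries are stored as ASCII text with non-ASCII characters written as decimal numeric character references; `stored_text` is a made-up sample, not the weblog's actual storage format.

```python
from urllib.parse import quote, unquote

query = "\u0265\u0266\u0250"  # the three characters being searched for

# Percent-encode the query's UTF-8 bytes, as happens for a utf-8 page.
encoded = quote(query)

# Hypothetical stored entry: the characters are kept as numeric
# character references inside a plain ASCII text file.
stored_text = "Scott Johnson: &#613;&#614;&#592; I just had to try out"

# The query decodes back to the right characters, but a plain substring
# search over the stored text has nothing to match against.
decoded = unquote(encoded)
found = decoded in stored_text

print(encoded, found)
```

The encoding step is correct at every stage; the mismatch is purely between the form of the query and the form of the stored text.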
So, sometimes the dragon wins. If you have a requirement
for full-text search, and you haven’t outsourced it to
Google, then you need a database that understands encodings, and
all of
Julik’s points apply.
Before I deploy my Ruby-based weblog, I want to make sure that
both FastCGI and a database that supports utf-8 are in place
(Cornerhost is currently running MySQL 3.23.58).
Some footnotes:
Beyond
Java? At the present time, Java is better than Python,
PHP, Perl, and Ruby in handling Unicode.
Taking
charge of your own destiny? Sure, I have access to the
full source to MySQL, but you think I’m going to hack Unicode
support in there? Heck no, there be dragons in there!
It’s cheaper to switch databases (or, in this case, upgrade
to a new version).
Actually innovative? I believe more strongly than ever
that internationalization is an excellent litmus test as to whether
or not that flashy startup has an expensive rewrite in their future.
I realize that some people disdain
edge
cases, but what makes this an art more than a science is
knowing which edge cases are important and which can be
YAGNI'd
away, coupled with the spirit of
purports
to conform.
<pedantic>
I sure hope you have mysql 3.23.58, not mysql 3.2.58...
</pedantic>
And to actually say something useful: The next version of PHP will have full Unicode support (it’s already in CVS), but I am not sure when that version will be stable. And then there is of course the issue of when you’ll be able to get it at your shared-hosting provider. I mean, mysql4.1 is relatively old, but look at how many (or few) places offer you mysql4.1 or up. And it also looks like PHP4 is still way ahead of PHP5.x in terms of availability.
Sam, I’m currently working on a project that relies heavily on UTF-8 (imagine having to search and display massive amounts of data in every language), and we have managed to get it working properly with MySQL (4.1+) and Rails. There are some limitations, of course, but it’s not as bad as we thought it would be (it seems MySQL does actually enable full-text search with Unicode when you use MyISAM, and Ruby’s String#each_char is UTF-8 aware if $KCODE = 'u').
As you’re guaranteed to have only numeric character entities (NCR) in your files, can’t you encode the search query from UTF-8 into us-ascii+NCR before actually searching?
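A minimal sketch of that idea, assuming the files use decimal references without leading zeros (hexadecimal or zero-padded forms would not match). The helper names `to_ncr` and `search` and the sample documents are hypothetical.

```python
def to_ncr(query: str) -> str:
    """Return the query as ASCII, with each non-ASCII character
    replaced by its decimal numeric character reference."""
    return query.encode("ascii", "xmlcharrefreplace").decode("ascii")


def search(query: str, documents: list[str]) -> list[str]:
    """Plain substring search using the NCR form of the query."""
    needle = to_ncr(query)
    return [doc for doc in documents if needle in doc]


# Hypothetical stored entries, already in us-ascii+NCR form.
documents = [
    "Scott Johnson: &#613;&#614;&#592; I just had to try out",
    "So, sometimes the dragon wins.",
]

hits = search("\u0265\u0266\u0250", documents)
print(hits)
```

Queries that are already plain ASCII pass through unchanged, so the same search path would serve both cases.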
Sam Ruby: Sometimes the dragon wins. Not here: ɥɦɐ. And you can search for it. My dirty little secret, however, is that I’m storing everything in MySQL in fields declared to be encoded in latin1, storing UTF-8 in there anyway, and trusting the...
Sam Ruby: Sometimes the dragon wins: “I believe more strongly than ever that internationalization is an excellent litmus test as to whether or not that flashy startup has an expensive rewrite in their future.” Also in that post: "At...
Tcl has had good Unicode support for several years. It slows things down a little bit if all you ever need to handle is plain ASCII, but that’s because it’s thoroughly integrated into the core of the language. In my experience, Tcl handles most i18n tasks with aplomb.
I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI’d away.
Yep. I’d question whether a basic understanding of, and conformance to, baseline character encoding conventions is an edge case at all. It fits solidly into the 80 for a large portion of the world’s population.