Scott Johnson: ɥɦɐ I just had to try out some funky characters to see what would happen. :)
An advantage of declaring this page as utf-8 is that I can distinguish between somebody typing the literal characters ɥɦɐ and somebody typing the numeric entities for those same characters, meaning that people don’t have to double escape if they want to talk about numeric entities on my weblog.
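To make the distinction concrete, here is a minimal sketch in Python. It is not the code behind this weblog, and `render_comment` is a hypothetical name. It assumes the page is served as utf-8, so literal characters can pass through untouched while a typed entity is escaped once and shown as text.

```python
import html

def render_comment(text: str) -> str:
    """Hypothetical helper: escape markup characters, leave non-ASCII alone."""
    # html.escape turns "&" into "&amp;", so a typed "&#613;" is displayed
    # literally instead of being decoded by the browser.
    return html.escape(text, quote=False)

literal = "\u0265\u0266\u0250"          # the three characters themselves
typed_entities = "&#613;&#614;&#592;"   # the same code points, written as decimal entities

print(render_comment(literal))          # characters are emitted unchanged
print(render_comment(typed_entities))   # ampersands are escaped exactly once
```

Because the two inputs stay different all the way to the output, nobody has to type `&amp;#613;` by hand to talk about an entity.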
But don’t try to search for ɥɦɐ. While such a query will be properly URI encoded based on utf-8, that particular string does not appear in any text files.
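For the curious, this is roughly what "URI encoded based on utf-8" means for that query. This is a sketch using Python's standard library, not the search form's actual code. Each character is first turned into its utf-8 bytes, and each byte is then percent-encoded.

```python
from urllib.parse import quote, unquote

query = "\u0265\u0266\u0250"      # the three characters from the comment

# quote() encodes to utf-8 by default, then percent-encodes each byte.
encoded = quote(query, safe="")
print(encoded)                    # %C9%A5%C9%A6%C9%90

# Decoding with the same assumption (utf-8) recovers the original string.
assert unquote(encoded) == query
```

The encoding round-trips fine; the search still comes up empty because the index behind it never saw those characters.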
So, sometimes the dragon wins. If you have a requirement for full text search, and you haven’t outsourced it to Google, then you need a database that understands encodings, and all of Julik’s points apply.
Before I deploy my Ruby-based weblog, I want to make sure that both FastCGI and a database that supports utf-8 are in place (Cornerhost is currently running MySQL 3.23.58).
Some footnotes:
<pedantic>
I sure hope you have MySQL 3.23.58, not MySQL 3.2.58...
</pedantic>
And to actually say something useful: the next version of PHP will have full Unicode support (it’s already in CVS), but I am not sure when that version will be stable. And then there is of course the issue of when you’ll be able to get it at your shared-hosting provider. I mean, MySQL 4.1 is relatively old, but look at how many (or few) places offer you MySQL 4.1 or up. It also looks like PHP 4 is still way ahead of PHP 5.x in terms of availability.
I sure hope you have MySQL 3.23.58, not MySQL 3.2.58...
Fixed, thanks!
PHP 6’s support for Unicode looks very nice.
Sam, I’m currently working on a project that relies heavily on UTF-8 (imagine having to search and display massive amounts of data in every language), and we have managed to get it working properly with MySQL (4.1+) and Rails. There are some limitations, of course, but it’s not as bad as we thought it would be (it seems MySQL does actually enable full-text search with Unicode when you use MyISAM, and Ruby’s String#each_char is UTF-8 aware if $KCODE = 'u').
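The difference that a UTF-8 aware `each_char` makes can be sketched as follows. Note that this is Python rather than Ruby, so it does not reproduce `$KCODE` itself; it only illustrates the underlying gap between iterating over characters and iterating over bytes.

```python
text = "\u0265\u0266\u0250"       # three non-ASCII characters

raw = text.encode("utf-8")        # the bytes as they would sit in a file or database

chars = list(text)                # character-wise iteration: 3 items
byte_values = list(raw)           # byte-wise iteration: 6 items, 2 per character

print(len(chars), len(byte_values))
```

Code that walks the bytes sees six meaningless fragments; code that walks the characters sees the three letters that were typed.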
Thomas: what I have is a word search, meaning that you won’t find “ear” in “search”.
This is implemented using swish-e, which doesn’t handle utf-8 very well.
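A rough illustration of word search versus substring search, in Python. This is not how swish-e is implemented; `words` and `word_search` are hypothetical helpers, and splitting on `\w+` is just one simple assumption about what counts as a word.

```python
import re

def words(text: str) -> set:
    """Split text into lowercase words; \\w is Unicode-aware for str patterns."""
    return {w.lower() for w in re.findall(r"\w+", text)}

def word_search(term: str, text: str) -> bool:
    """Match only whole words, never fragments inside a longer word."""
    return term.lower() in words(text)

doc = "full text search for \u0265\u0266\u0250"

print(word_search("search", doc))              # whole word: found
print(word_search("ear", doc))                 # fragment of "search": not found
print("ear" in doc)                            # plain substring test: found
print(word_search("\u0265\u0266\u0250", doc))  # works only if the tokenizer understands the encoding
```

The last line is the part that bites: a tokenizer that does not understand the encoding never produces the non-ASCII word in the first place.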
I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI’d away.
Yep. I’d question whether a basic understanding of, and conformance to, baseline character encoding conventions is an edge case at all. It fits solidly into the 80 of the 80/20 rule for a large portion of the world’s population.