Sam Ruby

Sometimes the dragon wins

2005-11-04T05:08:47-08:00

Scott Johnson: ɥɦɐ I just had to try out some funky characters to see what would happen. :)

An advantage of declaring this page as utf-8 is that I can distinguish between somebody typing ɥɦɐ and somebody typing &#613;&#614;&#592;, meaning that people don’t have to double escape if they want to talk about numeric entities on my weblog.

But don’t try to search for ɥɦɐ. While such a query will be properly URI encoded based on utf-8, that particular string does not appear in any text files.
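To illustrate the mismatch, here is a small Python sketch. It assumes the text files hold these characters as decimal numeric character references (the `stored_text` value is made up for illustration), while the query arrives percent-encoded as UTF-8 bytes.

```python
from urllib.parse import quote, unquote

query = "ɥɦɐ"

# Percent-encode the query as UTF-8 bytes, as a browser would on a utf-8 page.
encoded_query = quote(query, encoding="utf-8")

# Assumption: the file on disk stores numeric character references,
# not the raw UTF-8 characters.
stored_text = "&#613;&#614;&#592; I just had to try out some funky characters"

# The server decodes the query back to the three characters...
decoded_query = unquote(encoded_query, encoding="utf-8")

# ...but that string never occurs in the stored text, so the search misses.
found = decoded_query in stored_text

print(encoded_query)
print(found)
```

The query itself survives the round trip intact; the miss comes purely from the two sides using different representations of the same characters.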

So, sometimes the dragon wins. If you have a requirement for full text search, and you haven’t outsourced it to Google, then you need a database that understands encodings, and all of Julik’s points apply.

Before I deploy my Ruby-based weblog, I want to make sure that both FastCGI and a database that supports utf-8 are in place (Cornerhost is currently running MySQL 3.23.58).

Some footnotes:

Beyond Java? At the present time, Java is better than Python, PHP, Perl, and Ruby in handling Unicode.
Taking charge of your own destiny? Sure, I have access to the full source to MySQL, but you think I’m going to hack Unicode support in there? Heck no, there be dragons in there! It’s cheaper to switch databases (or, in this case, upgrade to a new version).
Actually innovative? I believe more strongly than ever that internationalization is an excellent litmus test as to whether or not that flashy startup has an expensive rewrite in their future. I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI'd away, coupled with the spirit of purports to conform.

Sometimes the dragon wins

2005-11-04T07:05:21-08:00

I sure hope you have MySQL 3.23.58, not MySQL 3.2.58...

And to actually say something useful: the next version of PHP will have full Unicode support (it’s already in CVS), but I am not sure when that version will be stable. And then there is of course the issue of when you’ll be able to get it at your shared-hosting provider. I mean, MySQL 4.1 is relatively old, but look at how many (or few) places offer you MySQL 4.1 or up. It also looks like PHP4 is still way ahead of PHP5.x in terms of availability.

Sometimes the dragon wins

2005-11-04T07:27:13-08:00

I sure hope you have MySQL 3.23.58, not MySQL 3.2.58...

Fixed, thanks!

PHP 6’s support for Unicode looks very nice.

Sometimes the dragon wins

2005-11-04T09:13:59-08:00

Sam, I’m currently working on a project that relies heavily on UTF-8 (imagine having to search and display massive amounts of data in every language), and we have managed to get it working properly with MySQL (4.1+) and Rails. There are some limitations, of course, but it’s not as bad as we thought it would be: it seems MySQL does actually enable full-text search with Unicode when you use MyISAM, and Ruby’s String#each_char is UTF-8 aware if $KCODE = 'u'.

Documentation is here, here and here.

Sometimes the dragon wins

2005-11-04T09:20:03-08:00

As you’re guaranteed to have only numeric character references (NCRs) in your files, can’t you encode the search query from UTF-8 into us-ascii+NCR before actually searching?
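A minimal Python sketch of that idea follows. The helper names `to_ascii_ncr` and `ncr_search` and the sample documents are hypothetical, and it assumes the files use decimal references; files written with hexadecimal references would need a different conversion. It does a plain substring match, not a word search.

```python
def to_ascii_ncr(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    # The built-in "xmlcharrefreplace" error handler emits &#NNN; for each
    # character that the ascii codec cannot represent.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def ncr_search(query: str, documents: list[str]) -> list[str]:
    """Return the documents containing the query once it is rewritten as ASCII+NCR."""
    needle = to_ascii_ncr(query)
    return [doc for doc in documents if needle in doc]


# Hypothetical file contents, stored as us-ascii plus numeric references.
documents = [
    "&#613;&#614;&#592; I just had to try out some funky characters",
    "plain ascii comment",
]

print(to_ascii_ncr("ɥɦɐ"))
print(ncr_search("ɥɦɐ", documents))
```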

Sometimes the dragon wins

2005-11-04T10:58:29-08:00

Thomas: what I have is a word search, meaning that you won’t find “ear” in “search”.

This is implemented using swish-e, which doesn’t handle utf-8 very well.

Sam Ruby: Sometimes the dragon wins

2005-11-04T11:15:29-08:00

Sam Ruby: Sometimes the dragon wins. Not here: ɥɦɐ. And you can search for it. My dirty little secret, however, is that I’m storing everything in MySQL in fields declared to be encoded in latin1, storing UTF-8 in there anyway, and trusting the...
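The trick of putting UTF-8 bytes into fields declared as latin1 can be sketched in Python. This is only an illustration of the byte-level round trip, using Python's `latin-1` codec as a stand-in; it is an assumption that the database passes every byte through unchanged, and MySQL's own latin1 handling may differ in detail.

```python
original = "ɥɦɐ"

# The application writes UTF-8 bytes into the field.
utf8_bytes = original.encode("utf-8")

# The field is declared latin1, so anything that trusts the declaration
# sees one character per byte: six characters instead of three.
as_latin1 = utf8_bytes.decode("latin-1")

# An application that knows the secret reverses the step and gets the text back.
round_tripped = as_latin1.encode("latin-1").decode("utf-8")

print(len(original), len(as_latin1))
print(round_tripped == original)
```

The round trip works only as long as nothing in between interprets or converts the bytes according to the declared encoding; lengths, sorting, and case-insensitive comparison done on the latin1 view operate on bytes rather than characters.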

Sam Ruby: Sometimes the dragon wins

2005-11-04T11:15:31-08:00

Check the comments....

Sam Ruby: Sometimes the dragon wins

2005-11-05T00:45:14-08:00

Sam Ruby: Sometimes the dragon wins: “I believe more strongly than ever that internationalization is an excellent litmus test as to whether or not that flashy startup has an expensive rewrite in their future.” Also in that post: "At...

Sometimes the dragon wins

2005-11-05T01:40:43-08:00

Tcl has had good Unicode support for several years. It slows things down a little bit if all you ever need to handle is plain ASCII, but that’s because it’s thoroughly integrated into the core of the language. In my experience, Tcl handles most i18n tasks with aplomb.

Sometimes the dragon wins

2005-11-05T01:43:49-08:00

I realize that some people disdain edge cases, but what makes this an art more than a science is knowing which edge cases are important and which can be YAGNI’d away.

Yep. I’d question whether a basic understanding of, and conformance to, baseline character encoding conventions is an edge case at all. It fits solidly into the 80% for a large portion of the world’s population.