intertwingly

It’s just data

Yahoo Search and Iñtërnâtiônàlizætiøn


OK, so we’ve established that you’ve got a tool that you want to ensure is internationalized. The first thing I want you to do is to copy the string

Iñtërnâtiônàlizætiøn

into your tool and observe what comes out the other side.

This is from my Survival guide to i18n.  Let's see how the recently announced Yahoo! Search Web Services fare.

Let's start simple, with a Web Search.

From the documentation:

Parameter | Value | Description
query | string (required) | The query to search for. This query supports the full search language of Yahoo! Search, including meta keywords. For details on constructing queries, see Search Tips.

Hmm.  No mention of encoding.  But clearly this information goes into the URI, so let's look at the relevant RFC for more information.  There you find instructions on how to percent-encode your binary octets.  Still no encoding requirement, but in section 2.5 you see some rather strong hints:

A system that internally provides identifiers in the form of a different character encoding, such as EBCDIC, will generally perform character translation of textual identifiers to UTF-8 [STD63] (or some other superset of the US-ASCII character encoding) at an internal interface, thereby providing more meaningful identifiers than those resulting from simply percent-encoding the original octets.

and

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding

So, my read is that utf-8 is suggested as the way forward, but existing schemes (like http) are grandfathered and exempt from the recommendation.  The only suggestion is that a superset of ASCII be chosen, a category that includes not only utf-8 but also character sets such as mac roman, win-1252, iso-8859-1, and cp 437.
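To make the ambiguity concrete, here is a minimal sketch of percent-encoding the test string under several ASCII supersets. It uses only Python's standard `urllib.parse.quote`; the helper name `percent_encode` is made up for illustration. cp 437 is left out because Python's cp437 codec cannot encode every character in the test string (it has no ø).

```python
from urllib.parse import quote

TEST = "Iñtërnâtiônàlizætiøn"

# A few of the ASCII supersets mentioned above, by their Python codec names.
ENCODINGS = ["utf-8", "iso-8859-1", "cp1252", "mac_roman"]


def percent_encode(text, encoding):
    """Turn text into octets with the given encoding, then percent-encode them."""
    return quote(text, safe="", encoding=encoding)


if __name__ == "__main__":
    for enc in ENCODINGS:
        print(f"{enc:12} {percent_encode(TEST, enc)}")
```

The ASCII letters pass through untouched in every case, but the accented characters do not: utf-8 gives `I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n`, while iso-8859-1 gives `I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n`. Nothing in the URI itself says which one was meant.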

So, let's try utf-8 first: Iñtërnâtiônàlizætiøn.

Hmmm.  That didn't work out too well, did it?  As a guess, let's try iso-8859-1 next.  Iñtërnâtiônàlizætiøn.  Much better.

Sorta.

It seems that the results come back in utf-8.  Which is slightly unexpected given that utf-8 doesn't work as input.  Furthermore, they get the content-type and/or charset incorrect.
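One plausible explanation for the input side, and it is only an assumption about what the service does internally, is that the query octets are decoded as iso-8859-1 no matter what the client sent. The sketch below simulates that guess with the standard `urllib.parse` functions; `seen_by_server` is a hypothetical helper, not anything from the Yahoo! API.

```python
from urllib.parse import quote, unquote

TEST = "Iñtërnâtiônàlizætiøn"


def seen_by_server(text, client_encoding, server_encoding="iso-8859-1"):
    """Percent-encode as the client would, then decode as the (assumed) server would."""
    wire = quote(text, safe="", encoding=client_encoding)
    return unquote(wire, encoding=server_encoding, errors="replace")


if __name__ == "__main__":
    # utf-8 in, iso-8859-1 assumed: every accented character becomes two characters.
    print(seen_by_server(TEST, "utf-8"))
    # iso-8859-1 in, iso-8859-1 assumed: the string survives the round trip.
    print(seen_by_server(TEST, "iso-8859-1"))
```

Under that assumption the utf-8 query turns into a 27-character string of mojibake instead of the 20-character original, which would account for the poor results.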

Next I tried YahooSearchExample.php.  How well this works depends on the browser you have chosen, as the character set is not specified.  Internet Explorer follows the guidance provided by section 3.7.1 of the HTTP specification, which indicates that iso-8859-1 is to be taken as the default, and then correctly displays gibberish.  Mozilla, on the other hand, takes advantage of quirks mode, senses that the data would be better displayed as utf-8, and proceeds to do so.
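The by-the-book behavior can be sketched in a few lines: read the charset from the Content-Type header and fall back to iso-8859-1 when none is given. This is only an illustration of the default rule, not how any particular browser is implemented; `declared_charset` and `render` are hypothetical names, and the header parsing leans on the standard library's `email.message.Message`.

```python
from email.message import Message


def declared_charset(content_type, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset(default)


def render(body, content_type):
    """Decode the response body strictly according to the declared charset."""
    return body.decode(declared_charset(content_type), errors="replace")


if __name__ == "__main__":
    body = "Iñtërnâtiônàlizætiøn".encode("utf-8")
    # No charset declared: the iso-8859-1 default applies and gibberish results.
    print(render(body, "text/html"))
    # Charset declared: the same octets display correctly.
    print(render(body, "text/html; charset=utf-8"))
```

Declaring the charset on the response is all it takes to remove the guesswork for a client that follows the header.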

Recommendation

As it currently stands, Yahoo! Search Web Services is effectively only able to process searches for languages covered by latin-1.  It is my recommendation that Yahoo! change the service to support utf-8 on input and update both the documentation and the examples accordingly.

If the server change were made, I would be willing to submit patches that would bring any or all of the examples up to compliance.