It’s just data

Yahoo Search and Iñtërnâtiônàlizætiøn

OK, so we’ve established that you’ve got a tool that you want to ensure is internationalized. The first thing I want you to do is to copy the string

Iñtërnâtiônàlizætiøn

into your tool and observe what comes out the other side.

This is from my Survival guide to i18n.  Let's see how the recently announced Yahoo! Search Web Services fare.

Let's start simple, with a Web Search.

From the documentation:

Parameter | Value | Description
query | string (required) | The query to search for. This query supports the full search language of Yahoo! Search, including meta keywords. For details on constructing queries, see Search Tips.

Hmm.  No mention of encoding.  But clearly this information goes into the URI, so let's look at the relevant RFC (RFC 3986) for more information.  There you find instructions on how to percent-encode your binary octets.  Still no encoding requirement, but in section 2.5 you see some rather strong hints:

A system that internally provides identifiers in the form of a different character encoding, such as EBCDIC, will generally perform character translation of textual identifiers to UTF-8 [STD63] (or some other superset of the US-ASCII character encoding) at an internal interface, thereby providing more meaningful identifiers than those resulting from simply percent-encoding the original octets.

and

When a new URI scheme defines a component that represents textual data consisting of characters from the Universal Character Set [UCS], the data should first be encoded as octets according to the UTF-8 character encoding

So, my read is that utf-8 is suggested as the way forward, but existing schemes (like http) are grandfathered and therefore exempt.  The only suggestion is that a superset of ASCII be chosen, which covers not only utf-8 but also character sets such as Mac Roman, windows-1252, iso-8859-1, and cp437.
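To make the difference concrete, here is a minimal Python sketch (mine, not something from the Yahoo! documentation) showing that the test string percent-encodes to different octets depending on which character encoding is applied first. The variable names are illustrative only.

```python
from urllib.parse import quote

# The test string, written with escapes so this file's own encoding is irrelevant.
TEST = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"

# Percent-encode the octets produced by each candidate character encoding.
as_utf8 = quote(TEST, encoding="utf-8")
as_latin1 = quote(TEST, encoding="iso-8859-1")

print(as_utf8)    # I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
print(as_latin1)  # I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n
```

Both results are legal in a URI, which is why a service has to say which one it expects.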

So, let's try utf-8 first: Iñtërnâtiônàlizætiøn.

Hmmm.  That didn't work out too well, did it?  As a guess, let's try iso-8859-1 next.  Iñtërnâtiônàlizætiøn.  Much better.

Sorta.

It seems that the results come back in utf-8, which is slightly unexpected given that utf-8 doesn't work as input.  Furthermore, the response gets the content-type and/or charset wrong.

Next I tried the YahooSearchExample.php.  How well this works depends on the browser you have chosen, as the example does not specify a character set.  Internet Explorer follows the guidance provided by section 3.7.1 of the HTTP specification, which indicates that iso-8859-1 is to be taken as the default, and then correctly displays gibberish.  Mozilla, on the other hand, takes advantage of quirks mode and senses that that data would be better displayed as utf-8, and proceeds to do so.
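The "correctly displays gibberish" case can be reproduced with a short Python sketch. It assumes a response whose body is utf-8 but whose Content-Type carries no charset parameter; the `text/xml` header values and the `effective_charset` helper are hypothetical, chosen only for illustration.

```python
from email.message import Message

TEST = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"

def effective_charset(content_type, default="iso-8859-1"):
    """Return the charset named in a Content-Type value, or the assumed default."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

body = TEST.encode("utf-8")               # octets as sent: utf-8
charset = effective_charset("text/xml")   # no charset given, so the default applies
garbled = body.decode(charset)            # each non-ASCII letter becomes two characters

# With the charset declared, the same octets decode cleanly.
clean = body.decode(effective_charset("text/xml; charset=utf-8"))

print(charset)
print(garbled)
print(clean == TEST)
```

A client that applies the iso-8859-1 default is following the rules; the fix belongs on the side that omits the charset.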

Recommendation

As it currently stands, Yahoo! Search Web Services is effectively only able to process searches for languages covered by latin-1.  It is my recommendation that Yahoo! change the service to support utf-8 on input and update both the documentation and the examples accordingly.

If the server change were made, I would be willing to submit patches that will bring any or all of the examples into compliance.


Yahoo Search and Iñtërnâtiônàlizætiøn

Your feeds of this post contain double-escaped ">" entity references, which don't render correctly in any news reader I tried.  It looks like they're supposed to be marking quoted lines, but even if they are unescaped correctly, some necessary line breaks are missing.

Posted by Matt Brubeck at

Yahoo Search and Iñtërnâtiônàlizætiøn

They look doubly escaped to you, because they are doubly escaped: once for the HTML, and again for the XML.

That is the appropriate level of escaping.
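A small Python sketch of the two layers, assuming the feed carries HTML as escaped text inside an XML element; the sample line is made up for illustration:

```python
from html import escape as html_escape, unescape as html_unescape
from xml.sax.saxutils import escape as xml_escape, unescape as xml_unescape

quoted_line = "> a quoted line"

# Layer 1: the character is escaped so that it is valid HTML.
as_html = html_escape(quoted_line)      # "&gt; a quoted line"
# Layer 2: the HTML is escaped again so that it can sit inside an XML element.
in_feed = xml_escape(as_html)           # "&amp;gt; a quoted line"

# A consumer reverses the layers in the opposite order.
after_xml_parse = xml_unescape(in_feed)        # back to the HTML markup
rendered = html_unescape(after_xml_parse)      # what the reader should see

print(in_feed)
print(after_xml_parse)
print(rendered)
```

An aggregator that undoes only one layer shows the intermediate form, which is the symptom reported above.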

You need to find new aggregators, if they don't get this. The RSS 2 feed works fine in NewsGator. I'm sure that many other aggregators have no problem with this.

Posted by Brad Wilson at

Yahoo Search and Iñtërnâtiônàlizætiøn

Mozilla, on the other hand, takes advantage of quirks mode and senses that that data would be better displayed as utf-8, and proceeds to do so.

When opening the page with Firefox 1.0.1, it is displayed as ISO-8859-1, thus rendering it as gibberish. After manually setting the encoding to UTF-8, it is rendered as it should be, and after closing and reopening the page it is still in UTF-8 (while setting it to ISO-8859-1 and closing and reopening the page renders it as gibberish again).

So it would seem that Firefox actually doesn't guess the charset but goes by user preference.

Posted by porges at

Yahoo Search and Iñtërnâtiônàlizætiøn

So, as you mentioned and as I just learned, the results actually come back as US-ASCII.

Posted by Anne at

Yahoo Search and Iñtërnâtiônàlizætiøn

As usual, your analysis of encoding issues is spot on. This made me go check what the MSN Search Result RSS feeds do, and from the results of the query for Iñtërnâtiônàlizætiøn it looks like we are doing the right thing with UTF-8 input.

PS: The post also looks fine in RSS Bandit.

Posted by Dare Obasanjo at

Yahoo Search and Iñtërnâtiônàlizætiøn

Thanks for the heads up. The query is supposed to be issued in utf-8 and we're updating the docs to reflect that. Your first query should work fine now.

Posted by Toby Elliott at

Yahoo Search and Iñtërnâtiônàlizætiøn

Toby: cool!  It looks like the query will now accept valid utf-8, and then fall back to iso-8859-1 if the input is not valid utf-8.  It also looks like the charset on the content-type is now specified.
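The fallback behaviour as I observe it from the outside could be sketched like this in Python. This is a guess at the logic, not Yahoo!'s actual implementation, and `decode_query` is a hypothetical name.

```python
from urllib.parse import unquote_to_bytes

def decode_query(percent_encoded):
    """Decode a percent-encoded query: strict utf-8 first, iso-8859-1 as fallback."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        return raw.decode("utf-8")        # strict: raises on any invalid sequence
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")   # never fails: every octet maps to a character

expected = "I\u00f1t\u00ebrn\u00e2ti\u00f4n\u00e0liz\u00e6ti\u00f8n"
from_utf8 = decode_query("I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n")
from_latin1 = decode_query("I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n")
print(from_utf8 == expected, from_latin1 == expected)
```

The heuristic is not airtight: an iso-8859-1 query whose octets happen to form valid utf-8 would be read as utf-8, though that is rare in real text.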

There are a number of changes that should be made to the samples to make them correct.  I offered to provide patches.  Let me know if you are interested...

Posted by Sam Ruby at

Yahoo Search and Iñtërnâtiônàlizætiøn

I offered to provide patches.  Let me know if you are interested...

No guarantees (there'd be a race to kill me first), but please do send patches and/or fixes and we'll do what we can. You can email me directly, or send them to yws-feedback at yahoo-inc.com. Thanks!

Posted by Toby Elliott at

Sam Ruby: Yahoo Search and Iñtërnâtiônàlizætiøn

Wayne Burkett : Sam Ruby: Yahoo Search and Iñtërnâtiônàlizætiøn...

Excerpt from HotLinks - Level 1 at

Ampersands are like Unicode lite

.flickr-photo { border: solid 2px #000000; } .flickr-yourcomment { } .flickr-frame { text-align: left; padding: 3px; } .flickr-caption { font-size: 0.8em; margin-top: 0px; } ampersand, originally uploaded by David Ascher. Ampersands, as I’ve...

Excerpt from david ascher at

Yahoo Search and Iñtërnâtiônàlizætiøn

Looking into the reason for the somewhat bizarre excerpt above, it looks like WordPress’s description elements in RSS 2.0 feeds, and summary elements in Atom feeds, are somewhat suboptimal for items/entries which represent pictures.

Posted by Sam Ruby at

Yahoo Search and Iñtërnâtiônàlizætiøn

Yuck. Other than the obvious (forty lashes for anyone who considers putting non-content, non-body elements into a post: is that style element coming from Flickr?), what should WordPress do for maximum plain-text safety? I have the feeling that the no-HTML description comes from just running strip_tags() on the post body. Should it add an allowable_tags param to the strip_tags() call, with a list of every HTML element that shouldn’t have its contents shown (script, style, um, noscript?, frameset/frame?, er, ...), and then try to strip them by hand from the strip_tags() output, hoping that it will at least be a little less likely to be destroyed by regex parsing at that point? Or is that a bug in strip_tags(), that it doesn’t realize it shouldn’t leave the content of a style element behind?
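For comparison, here is what "strip the tags, but also drop the contents of style and script" might look like with a real parser rather than regexes. This is a Python sketch of the idea only, not WordPress code; `ExcerptExtractor`, `plain_excerpt`, and the sample markup are all hypothetical.

```python
from html.parser import HTMLParser

# Elements whose text content should never appear in a plain-text excerpt.
SKIPPED = {"script", "style"}

class ExcerptExtractor(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self.skip_depth == 0:
            self.parts.append(data)

def plain_excerpt(html_text):
    parser = ExcerptExtractor()
    parser.feed(html_text)
    parser.close()
    # Collapse runs of whitespace left behind by the removed markup.
    return " ".join("".join(parser.parts).split())

sample = ('<style type="text/css">.flickr-photo { border: solid 2px #000000; }</style>'
          '<p>ampersand, originally uploaded by <a href="#">David Ascher</a>.</p>')
print(plain_excerpt(sample))  # ampersand, originally uploaded by David Ascher.
```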

Posted by Phil Ringnalda at

Yahoo Search and Iñtërnâtiônàlizætiøn

Yuck is right. The real source of the problem has something to do with how David got that flickr reference in there (style elements don’t belong in the body). But a script element could be legit, and would leave its content behind as well. If I can find time (and somebody else doesn’t beat me to it), I’ll try to patch our excerpt generation to make an attempt to guard against “garbage” like above.

Any other suggestions while I’m in there?

Posted by Dougal Campbell at

Yahoo Search and Iñtërnâtiônàlizætiøn

Yowza, those lashes hurt!

Dougal’s right, the problem is the template that Flickr uses to post to WordPress via the MetaWeblog API.  It’s not WordPress’s fault.

Flickr lets me customize the template, so I’ll probably rip out the style info from the template and put it in my WordPress CSS file, but it’d be nice to fix the problem for all other Flickr users. 

The only way I can think of that would leave the functionality intact and would be standards-compliant would be to put the styling info in HTML style attributes, but that’s somewhat offensive as well.

Any other ideas?

Posted by David Ascher at

Yahoo Web Services

Yahoo has released its web services, and they’re pretty darn spiffy. You can use the API to search the Web, images, video, and news, or to do a local search. Of interest to me, as a developer of web services, is the fact that they’ve gone with a...

Excerpt from rc3.org Daily at

Carnet : Yahoo et Internationalization

Yahoo and Internationalization. A problem detected, then fixed. Encouraging? Perhaps, but it’s a shame that it takes an influential voice for things to change. It is in fact the case that, weblog or not, nothing changes. I18N, W3C...

Excerpt from Karl & Cow - Le carnet Web at

