The query to search for.
This query supports the full search language of Yahoo! Search,
including meta keywords. For details on constructing queries, see
Hmm. No mention of encoding. But clearly this
information goes into the URI, so let's look at the
relevant RFC (RFC 3986) for
more information. There you find instructions on how to
percent-encode your binary octets. Still no encoding requirement, but in
section 2.5 you see some rather strong hints:
A system that internally provides identifiers in the form of a
different character encoding, such as EBCDIC, will generally
perform character translation of textual identifiers to UTF-8
[STD63] (or some other superset of the US-ASCII character encoding)
at an internal interface, thereby providing more meaningful
identifiers than those resulting from simply percent-encoding the
original octets.
When a new URI scheme defines a component that represents
textual data consisting of characters from the Universal Character
Set [UCS], the data should first be encoded as octets according to
the UTF-8 character encoding
So, my read is that utf-8 is suggested as a way forward, but
existing schemes (like http) are grandfathered out. The only
suggestion is that a proper superset of ASCII be chosen...
something that includes not only utf-8, but also character sets
such as mac roman, win-1252, iso-8859-1, and cp 437.
Hmmm. That didn't work out too well, did it? As a
guess, let's try iso-8859-1 next.
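To make the difference between the two attempts concrete, here is a minimal Python sketch of how the same query text turns into different percent-encoded octets depending on the charset chosen. The helper name `percent_encode` is my own, and nothing here calls the Yahoo! service; it only illustrates what ends up in the URI.

```python
from urllib.parse import quote


def percent_encode(text, charset):
    """Encode text to octets in the given charset, then percent-encode them."""
    return quote(text, encoding=charset)


query = "Iñtërnâtiônàlizætiøn"

# Every charset here is a superset of US-ASCII, so the ASCII letters
# pass through unchanged; only the accented characters differ.
for charset in ("utf-8", "iso-8859-1", "cp1252", "mac_roman"):
    print(f"{charset:>10}: {percent_encode(query, charset)}")

# A single character shows the split clearly:
print(percent_encode("ñ", "utf-8"))       # %C3%B1
print(percent_encode("ñ", "iso-8859-1"))  # %F1
```

A server that receives `%F1` or `%C3%B1` has no way of knowing which charset was intended unless the API documents one, which is the gap described above.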
It seems that the results come back in utf-8. Which is
slightly unexpected given that utf-8 doesn't work as input.
Furthermore, they get the content type
and/or charset incorrect.
Next I tried the
YahooSearchExample.php. How well this works depends on
the browser you have chosen, as the character set is not
specified. Internet Explorer follows the guidance provided by
the HTTP specification, section 3.7.1, which indicates that iso-8859-1 is
to be taken as the default, and then correctly displays
gibberish. Mozilla, on the other hand, takes advantage of quirks
mode and senses that that data would be better displayed as
utf-8, and proceeds to do so.
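As a rough illustration of why the missing charset matters, the following Python sketch picks a charset from a Content-Type header value and falls back to iso-8859-1 when none is given, which is the default described above. The function names are hypothetical, and this is not how any particular browser is implemented.

```python
from email.message import Message


def charset_from_content_type(header_value, default="iso-8859-1"):
    """Return the charset parameter of a Content-Type value, or the default."""
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset(failobj=default)


def decode_body(body, header_value):
    """Decode response octets using the declared (or defaulted) charset."""
    return body.decode(charset_from_content_type(header_value))


utf8_body = "Iñtërnâtiônàlizætiøn".encode("utf-8")

# No charset declared: utf-8 octets are read as iso-8859-1, giving gibberish.
print(decode_body(utf8_body, "text/html"))

# Charset declared: the text survives the round trip.
print(decode_body(utf8_body, "text/html; charset=utf-8"))
```

The first call shows the "correctly displays gibberish" case: the decoder is following the rules, and the server simply did not say what it sent.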
As it currently stands, Yahoo! Search Web Services is
effectively only able to process searches for languages covered by
latin-1. It is my recommendation that Yahoo! change to
support utf-8 on input and to update both the documentation and the
If the server change were made, I would be willing to submit
patches that will bring any or all of the examples up to
Yahoo Search and Iñtërnâtiônàlizætiøn
Your feeds of this post contain double-escaped "&gt;" entity references, which don't render correctly in any news reader I tried. It looks like they're supposed to be marking quoted lines, but even if they are unescaped correctly, some necessary line breaks are missing.
Mozilla, on the other hand, takes advantage of quirks mode and senses that that data would be better displayed as utf-8, and proceeds to do so.
When opening the page with Firefox 1.0.1 it is displayed as ISO-8859-1, thus rendering it in gibberish. After manually setting the value to UTF-8 it is rendered as it should be, and closing/reopening the page it is still in UTF-8 (and setting it to ISO-8859-1 and closing/reopening the page renders it in gibberish again).
So it would seem that Firefox actually doesn't guess the charset but goes by user preference.
As usual your analysis of encoding issues is spot on. This made me go check what the MSN Search Result RSS feeds do and from looking at the results of the query for Iñtërnâtiônàlizætiøn it looks like we are doing the right thing with UTF-8 input.
Toby: cool! It looks like the query will now accept valid utf-8, and then fall back to iso-8859-1 if the input is not valid utf-8. It also looks like the charset on the content-type is specified now.
There are a number of changes that should be made to the samples to make them correct. I offered to provide patches. Let me know if you are interested...
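The fallback behaviour observed above (accept the query as utf-8 when it is valid utf-8, otherwise treat it as iso-8859-1) can be sketched in a few lines of Python. This is only my guess at the logic from the outside; the function name is hypothetical and the real server code is not known.

```python
from urllib.parse import unquote_to_bytes


def decode_query(percent_encoded):
    """Decode a percent-encoded query: utf-8 if valid, else iso-8859-1."""
    raw = unquote_to_bytes(percent_encoded)
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Every octet sequence is valid iso-8859-1, so this cannot fail.
        return raw.decode("iso-8859-1")


print(decode_query("%C3%B1"))  # utf-8 input
print(decode_query("%F1"))     # iso-8859-1 input, not valid utf-8
```

The approach works because most iso-8859-1 text containing accented characters is not valid utf-8, though a few octet sequences are valid in both and would be read as utf-8.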
Looking into the reason for the somewhat bizarre excerpt above, it looks like WordPress’s description elements in RSS 2.0 feeds, and summary elements in Atom feeds, are somewhat suboptimal for items/entries which represent pictures.
Yuck. Other than the obvious (forty lashes for anyone who considers putting non-content, non-body elements into a post: is that style element coming from Flickr?), what should WordPress do for maximum plain-text safety? I have the feeling that the no-HTML description comes from just running strip_tags() on the post body - should it add in an allowable_tags param in the strip_tags() call, with a list of every HTML element that shouldn’t have contents shown (script, style, um, noscript?, frameset/frame?, er, ...), and then try to strip them by hand from the strip_tags() output, hoping that it will at least be a little less likely to be destroyed by regex parsing at that point? Or is that a bug in strip_tags(), that it doesn’t realize it shouldn’t leave the content of a style element behind?
Yuck is right. The real source of the problem has something to do with how David got that flickr reference in there (style elements don’t belong in the body). But a script element could be legit, and would leave its content behind as well. If I can find time (and somebody else doesn’t beat me to it), I’ll try to patch our excerpt generation to make an attempt to guard against “garbage” like above.
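One way to guard an excerpt against leftover style or script content is to use a real parser and drop the text inside those elements instead of stripping tags with a regex. The sketch below uses Python's standard html.parser purely as an illustration; it is not WordPress's excerpt code, and the class and function names are made up for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, ignoring anything inside script or style."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)


def plain_text(html):
    """Return the visible text of an HTML fragment with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.parts).split())


print(plain_text("<style>p { color: red }</style><p>Hello <b>world</b></p>"))
```

Other elements whose contents should not be shown could be added to `SKIP`; which ones belong there is a judgment call.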
Yahoo has released its web services, and they’re pretty darn spiffy. You can use the API to search the Web, images, video, and news, or to do a local search. Of interest to me, as a developer of web services, is the fact that they’ve gone with a...
Yahoo and Internationalization: A problem detected, then fixed. Encouraging? Perhaps, but it is a pity that it takes an influential voice for things to change. In fact, it shows that, weblog or not, nothing changes. I18N, W3C...