The query to search for.
This query supports the full search language of Yahoo! Search,
including meta keywords. For details on constructing queries, see
Search Tips.
Hmm. No mention of encoding. But clearly this
information goes into the URI, so let's look at the
relevant RFC (RFC 3986) for
more information. There you find instructions on how to
percent-encode
your binary octets. Still no encoding requirement, but in
section
2.5 you see some rather strong hints:
A system that internally provides identifiers in the form of a
different character encoding, such as EBCDIC, will generally
perform character translation of textual identifiers to UTF-8
[STD63] (or some other superset of the US-ASCII character encoding)
at an internal interface, thereby providing more meaningful
identifiers than those resulting from simply percent-encoding the
original octets.
and
When a new URI scheme defines a component that represents
textual data consisting of characters from the Universal Character
Set [UCS], the data should first be encoded as octets according to
the UTF-8 character encoding
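To make the percent-encoding step concrete, here is a minimal Python sketch. It is not taken from the RFC or from the Yahoo! documentation. It assumes the two candidate encodings are UTF-8 and iso-8859-1, and it uses only the standard library's `urllib.parse`. The same text produces different octets, and therefore different percent-encoded strings, depending on which character encoding is applied first.

```python
from urllib.parse import quote, unquote

query = "Iñtërnâtiônàlizætiøn"

# Step 1: characters -> octets (this is where the charset choice happens).
# Step 2: octets -> percent-encoded ASCII (this is all the RFC pins down).
as_utf8 = quote(query, encoding="utf-8")
as_latin1 = quote(query, encoding="iso-8859-1")

print(as_utf8)    # I%C3%B1t%C3%ABrn%C3%A2ti%C3%B4n%C3%A0liz%C3%A6ti%C3%B8n
print(as_latin1)  # I%F1t%EBrn%E2ti%F4n%E0liz%E6ti%F8n

# Decoding only round-trips if the receiver assumes the same charset.
assert unquote(as_utf8, encoding="utf-8") == query
assert unquote(as_latin1, encoding="iso-8859-1") == query
assert unquote(as_latin1, encoding="utf-8") != query
```

A receiver that guesses the wrong charset still gets a syntactically valid URI, which is why the mismatch shows up as bad search results rather than as an error.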
So, my read is that utf-8 is suggested as a way forward, but
existing schemes (like http) are exempt. The only
suggestion is that a proper superset of ASCII be chosen,
something that includes not only utf-8, but also character sets
such as MacRoman, windows-1252, iso-8859-1, and cp437.
Hmmm. Sending the query as utf-8 didn't work out too well, did it? As a
guess, let's try iso-8859-1 next.
Iñtërnâtiônàlizætiøn.
Much better.
Sorta.
It seems that the results come back in utf-8, which is
slightly unexpected given that utf-8 doesn't work as input.
Furthermore, they get the
content-type
and/or charset incorrect.
Next I tried the
YahooSearchExample.php. How well this works depends on
the browser you have chosen, as the character set is not
specified. Internet Explorer follows the guidance provided by
HTTP
specification section 3.7.1, which indicates that iso-8859-1 is
to be taken as the default, and then correctly displays
gibberish. Mozilla, on the other hand, takes advantage of
quirks
mode and senses that the data would be better displayed as
utf-8, and proceeds to do so.
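The "gibberish" is what appears when utf-8 octets are read as iso-8859-1. A minimal Python sketch follows, assuming the response body is utf-8 and the reader applies the iso-8859-1 default. The variable names are illustrative only, and nothing here reproduces actual browser code.

```python
text = "Iñtërnâtiônàlizætiøn"

# What the service sends: utf-8 octets, with no charset declared.
body = text.encode("utf-8")

# What a reader sees if it applies the iso-8859-1 default: every
# two-octet utf-8 sequence turns into two unrelated Latin-1 characters.
misread = body.decode("iso-8859-1")
print(misread)

# Nothing is lost, though: re-encoding and decoding correctly recovers the text.
recovered = misread.encode("iso-8859-1").decode("utf-8")
assert recovered == text
```

Because iso-8859-1 assigns a character to every octet value, the wrong decode never fails. It silently produces the wrong characters.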
Recommendation
As it currently stands, Yahoo! Search Web Services is
effectively only able to process searches for languages covered by
latin-1. It is my recommendation that Yahoo! change to
support utf-8 on input and update both the documentation and the
examples accordingly.
If the server change were made, I would be willing to submit
patches that will bring any or all of the examples up to
compliance.
Yahoo Search and Iñtërnâtiônàlizætiøn
Your feeds of this post contain double-escaped "&gt;" entity references, which don't render correctly in any news reader I tried. It looks like they're supposed to be marking quoted lines, but even if they are unescaped correctly, some necessary line breaks are missing.
They look doubly escaped to you, because they are doubly escaped: once for the HTML, and again for the XML.
That is the appropriate level of escaping.
You need to find new aggregators, if they don't get this. The RSS 2 feed works fine in NewsGator. I'm sure that many other aggregators have no problem with this.
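The two levels of escaping can be sketched in Python as follows. This is an illustration only, assuming an HTML fragment carried as escaped text inside an XML element such as an RSS description. It uses the standard library's `html` and `xml.sax.saxutils` modules and is not the feed generator's actual code.

```python
from html import escape as html_escape, unescape as html_unescape
from xml.sax.saxutils import escape as xml_escape, unescape as xml_unescape

quoted_line = "> a quoted line"

# First level: the text becomes HTML, so ">" becomes "&gt;".
as_html = html_escape(quoted_line, quote=False)
# Second level: the HTML becomes XML character data, so "&" becomes "&amp;".
in_feed = xml_escape(as_html)

print(as_html)  # &gt; a quoted line
print(in_feed)  # &amp;gt; a quoted line

# An aggregator undoes the levels in reverse order: XML first, then HTML.
assert xml_unescape(in_feed) == as_html
assert html_unescape(xml_unescape(in_feed)) == quoted_line
```

An aggregator that unescapes only once ends up displaying the literal `&gt;` instead of the `>` character.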
Mozilla, on the other hand, takes advantage of quirks mode and senses that the data would be better displayed as utf-8, and proceeds to do so.
When I open the page with Firefox 1.0.1, it is displayed as ISO-8859-1, which renders it as gibberish. After I manually set the encoding to UTF-8, it is rendered as it should be, and after closing and reopening the page it is still in UTF-8 (and setting it to ISO-8859-1 and closing and reopening the page renders it as gibberish again).
So it would seem that Firefox actually doesn't guess the charset but goes by user preference.
As usual, your analysis of encoding issues is spot on. This made me go check what the MSN Search Result RSS feeds do, and judging by the results of the query for Iñtërnâtiônàlizætiøn, it looks like we are doing the right thing with UTF-8 input.
Thanks for the heads up. The query is supposed to be issued in utf-8 and we're updating the docs to reflect that. Your first query should work fine now.
Toby: cool! It looks like the query will now accept valid utf-8, and then fall back to iso-8859-1 if the input is not valid utf-8. It also looks like the charset on the content-type is specified now.
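The fallback described above could be sketched like this in Python. It is a guess at the observable behaviour, not Yahoo!'s server code, and `decode_query` is a hypothetical name introduced for illustration.

```python
def decode_query(raw: bytes) -> str:
    """Decode query octets as utf-8, falling back to iso-8859-1.

    Hypothetical helper: strict utf-8 decoding rejects most Latin-1
    input, so a failed decode is treated as a sign of iso-8859-1.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # iso-8859-1 maps every octet to a character, so this cannot fail.
        return raw.decode("iso-8859-1")


text = "Iñtërnâtiônàlizætiøn"
assert decode_query(text.encode("utf-8")) == text
assert decode_query(text.encode("iso-8859-1")) == text
```

The order matters. Trying iso-8859-1 first would never fail, so valid utf-8 input would always be misread.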
There are a number of changes that should be made to the samples to make them correct. I offered to provide patches. Let me know if you are interested...
I offered to provide patches. Let me know if you are interested...
No guarantees (there'd be a race to kill me first), but please do send patches and/or fixes and we'll do what we can. You can email me directly, or send them to yws-feedback at yahoo-inc.com. Thanks!
Looking into the reason for the somewhat bizarre excerpt above, it looks like WordPress’s description elements in RSS 2.0 feeds, and summary elements in Atom feeds, are somewhat suboptimal for items/entries which represent pictures.
Yuck. Other than the obvious (forty lashes for anyone who considers putting non-content, non-body elements into a post: is that style element coming from Flickr?), what should WordPress do for maximum plain-text safety? I have the feeling that the no-HTML description comes from just running strip_tags() on the post body. Should it add an allowable_tags param to the strip_tags() call, with a list of every HTML element that shouldn’t have its contents shown (script, style, um, noscript?, frameset/frame?, er, ...), and then try to strip them by hand from the strip_tags() output, hoping that it will at least be a little less likely to be destroyed by regex parsing at that point? Or is that a bug in strip_tags(), that it doesn’t realize it shouldn’t leave the content of a style element behind?
Yuck is right. The real source of the problem has something to do with how David got that Flickr reference in there (style elements don’t belong in the body). But a script element could be legit, and would leave its content behind as well. If I can find time (and somebody else doesn’t beat me to it), I’ll try to patch our excerpt generation to make an attempt to guard against “garbage” like above.
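One way to guard against that kind of garbage is to drop the contents of script and style elements while stripping tags. The sketch below is in Python using the standard library's `html.parser`. It is not WordPress's excerpt code and not PHP's strip_tags(). The names `ExcerptExtractor` and `plain_excerpt` are hypothetical, and the list of elements to skip is an assumption.

```python
from html.parser import HTMLParser


class ExcerptExtractor(HTMLParser):
    """Collect text content, ignoring anything inside script or style."""

    SKIP = {"script", "style"}  # assumed list; extend as needed

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self):
        # Collapse runs of whitespace left behind by removed markup.
        return " ".join("".join(self._chunks).split())


def plain_excerpt(html_body: str) -> str:
    parser = ExcerptExtractor()
    parser.feed(html_body)
    parser.close()
    return parser.text()


post = '<style type="text/css">.photo { border: 1px; }</style><p>A <b>photo</b> post</p>'
print(plain_excerpt(post))  # A photo post
```

Using a parser rather than regular expressions keeps the decision about what counts as element content in one place.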
Dougal’s right, the problem is the template that Flickr uses to post to WordPress via the metaweblog API. It’s not WordPress’s fault.
Flickr lets me customize the template, so I’ll probably rip out the style info from the template and put it in my WordPress CSS file, but it’d be nice to fix the problem for all other Flickr users.
The only way I can think of that would leave the functionality intact and would be standards-compliant would be to put the styling info in the HTML attributes, but that’s somewhat offensive as well.
Yahoo has released its web services, and they’re pretty darn spiffy. You can use the API to search the Web, images, video, and news, or to do a local search. Of interest to me, as a developer of web services, is the fact that they’ve gone with a...
Yahoo and Internationalization: A problem detected, then fixed. Encouraging? Perhaps, but it’s a shame that it takes an influential voice for things to change. The fact is that, weblog or not, nothing changes. I18N, W3C...