I point this out, as a
permathread has re-erupted on atom-syntax. What bothers me
about the permathread is that people seem to take certain things as
absolutes when reality is so much more deliciously complicated.
Now, let's take a look at what SharpReader says when it
encounters a feed it can't handle:
Error parsing RSS XML: Undefined root element: html
Very simple, honest, and to the point. It doesn't proclaim
that the "feed" is invalid. It simply states that it
encountered an error during parse, specifies what the error is,
suggests a way to independently validate the feed, and suggests a
way to provide feedback to the tool author if you think that there
is likely a bug in SharpReader.
Based on experiments,
I am very much convinced that every possible permutation of
validity, successfully passing the validator, and being able to be
meaningfully consumed by your favorite aggregator exists out there
in the wild.
In an absolute sense, the feedvalidator is not perfect.
Does that mean that it is not useful? The best we can observe
is that there is a high correlation between correctness and
usefulness. This also is true for feeds. A feed may be
technically valid but not useful. A feed may be technically
invalid but useful.
In the midst of all this noise, a
sensible suggestion re-emerged: a totally opt-in feature
which enables feedback to be provided. That said, I
have the following concerns:
Placing the information on how to handle invalid feeds inside
the feed itself seems counterproductive. This seems like a
perfect use case for an HTTP header, with an element as a
fallback. If the fallback is used, the element should
be placed near the top of the feed (for the benefit of stream,
pull, or SAX parsers) and should rigidly match a regular
expression. The pingback spec can be used for inspiration.
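A minimal sketch of that discovery order, assuming a header named X-Feed-Feedback and a fallback element of the form `<link rel="feedback" href="..."/>`. Both names are hypothetical, invented here for illustration; nothing in any feed spec defines them. The rigid pattern is in the spirit of pingback autodiscovery: a consumer can find the element without running an XML parser at all.

```python
import re
from typing import Mapping, Optional

# Hypothetical names: neither the header nor the element is defined by any spec.
FEEDBACK_HEADER = "X-Feed-Feedback"

# Deliberately rigid: fixed attribute order, double quotes only.
FEEDBACK_LINK = re.compile(r'<link rel="feedback" href="([^"]+)" ?/?>')

# Only the start of the document is scanned, so a streaming consumer can stop early.
SCAN_LIMIT = 4096


def discover_feedback_url(headers: Mapping[str, str], body: str) -> Optional[str]:
    """Return the feedback URL from the HTTP header, else from the fallback element."""
    # HTTP header names are case-insensitive, so normalise before the lookup.
    lowered = {name.lower(): value for name, value in headers.items()}
    url = lowered.get(FEEDBACK_HEADER.lower())
    if url:
        return url.strip()
    match = FEEDBACK_LINK.search(body[:SCAN_LIMIT])
    return match.group(1) if match else None
```

Because the header is checked first, a server can advertise the endpoint even when the body it serves is not a feed at all (an HTML error page, say).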
From a security perspective, I have grave concerns about the
ability for a single person to orchestrate a DDoS attack on a
third party. Such an effort could easily be cloaked.
However, overall, the idea of an optional feature based on an
HTTP GET coupled with a User-Agent header seems like it can't
do much harm, and might actually prove useful.
RE: Feedback loops
Why would the feed at [link] be considered invalid? It's in a valid encoding, isn't it? As for what Luke does with feeds that he's had difficulty parsing, RSS Bandit does the same. Also, if we've successfully parsed the feed in the past, we provide the contact info from the admin:errorReportsTo, managingEditor, etc. elements in a mailto: link so folks can send a message to the site admins.
The Postel's law argument is getting old. Some people will parse feeds liberally and others won't. Most will to some degree or other and provide some UI hints to users when the feed crosses the line they can handle.
Can't you guys just move along and produce something useful. Is ATOM 0.3 all we are going to get out of the ATOM effort?
Encodings are a known and accepted interoperability problem. If you publish a feed in something other than utf-8, you may lose.
The admonition not to publish feeds as application/octet-stream seems a bit pedantic but it's going to cause trouble for some applications so I suppose it's reasonable for a validator to point it out.
By the fuzzy specifications of RSS, the feed might be perfectly valid. But serving feeds as 'application/octet-stream' and in 'gb2312' encoding is something I would consider, at least, a bit awkward. And I think the feed validator is very correct in pointing these peculiar things out.
Asbjørn,
Besides the fact that whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people, what exactly is wrong with gb2312?
And what would be wrong with my encoding my feeds in x-ringnalda-klingon-37? The XML rec doesn't say I can't, no RSS spec says I can't, no XML parser can read it. Should I not be notified that I'm going to be unparseable? Should a gb2312 encoder not be notified that they will be unparseable in any PHP-based aggregator (at least pre-PHP5), in any Python-based aggregator not run by someone who understands the ways of ICONV, in any...
Norman may be right that encoding problems are known and accepted in the inner sanctum of XML, but for those of us with very little exposure to it, it's a big surprise that we can either use utf-8, or risk being given the cold shoulder (yes, in theory we should be able to use utf-8 or utf-16, but since my parser fails to accept utf-16 despite the rec requiring it to, I say utf-8 is the only possible choice).
Instead, Dare, could you give me one single good reason why the feed shouldn't be encoded in UTF-8? Instead of leaving the burden of parsing it up to (possibly) millions of consumers, isn't it better to fix the «problem» (although you seem to think it's not) at the source, leaving the burden on one single head -- the producer?
I think the most can be learned by asking the author of the blog why. Maybe his community of friends doesn't have a problem w/ the feed. That's likely all that matters to him. Maybe he didn't know how to get his characters working w/ UTF-8 and MT, in which case SixLog should likely spend some time writing a HowTo, which they might already have, and you can redirect users accordingly from the Feed Validator.
James, perhaps I should update this to say used to choke ;-)
A fix was made by Joseph Walton to not only handle this case properly, but to display the Chinese characters correctly in utf-8.
Pedantic warnings are still issued for use of encodings which are not widely supported, use of non-XML content types, and explicitly overriding the XML encoding with an HTTP charset.
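Those three checks can be sketched roughly as follows. This is an illustration under stated assumptions, not the feed validator's actual code: the function name, the warning wording, and above all the WIDELY_SUPPORTED set are placeholders, since the validator's real list is not given in this thread.

```python
from email.message import Message
from typing import List, Optional

# Assumption: a stand-in set; the validator's real list is not given here.
WIDELY_SUPPORTED = {"utf-8", "utf-16", "us-ascii", "iso-8859-1"}
XML_TYPES = {"text/xml", "application/xml"}


def encoding_warnings(content_type: str, xml_encoding: Optional[str]) -> List[str]:
    """Return pedantic warnings for a Content-Type header and an XML declaration."""
    # Borrow the stdlib header parser to split the media type from its charset.
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    http_charset = msg.get_content_charset()
    declared = xml_encoding.lower() if xml_encoding else None

    warnings = []
    if media_type not in XML_TYPES and not media_type.endswith("+xml"):
        warnings.append(f"non-XML content type: {media_type}")
    if http_charset and declared and http_charset != declared:
        warnings.append(
            f"HTTP charset {http_charset} overrides XML encoding {declared}"
        )
    # The HTTP charset wins when present; XML defaults to utf-8 otherwise.
    effective = http_charset or declared or "utf-8"
    if effective not in WIDELY_SUPPORTED:
        warnings.append(f"encoding not widely supported: {effective}")
    return warnings
```

Under these assumptions, a feed served as application/octet-stream with a gb2312 declaration picks up two warnings and is otherwise left alone, which matches the spirit of the comment above: the feed is still parsed, the warnings are only advisory.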
re. "I am very much convinced that every possible permutation of validity..."
Therein lies the rub - there isn't any consistent notion of RSS validity. The Validator is as useful as can be, but without real standardisation of the formats involved, it's never going to be correct. RSS is notionally XML, but implementers generally have to treat it as loosely-tagged text. One key consequence is that there isn't the option of corrective feedback for feed producers (how it's done is another matter). It will be a shame if Atom doesn't learn from this mistake.
The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8.
Probably an OK, but very strict, solution to the problem. I think it should be UTF-X, though, so UTF-16 and -32 can be used in the future. «Life is like a box of chocolates, you never know what you're gonna get».
"The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8."
That sort-of sounds like "US rules". Remember that UTF-8 is effectively the interna...zation of ASCII, where most non-ASCII characters require three to six bytes. Non-ASCIIans generally look at UTF-8 and say "I don't think so". What about UCS-2?
Asbjørn,
It seems you are assuming content syndication is only of interest to people in the western world. UTF-8 is popular in the Western world but isn't in places like China. Effectively banning the most popular encoding used by a sizable chunk of this planet's population sounds pretty rude to me.
Read [link] for some idea of what encodings are actually used outside of the Western hemisphere by people who actually have to work with this stuff for a living. Note that UTF-8 isn't listed there.
So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. (An earlier UTF-8 specification allowed even higher code points to be represented, using 5 or 6 bytes, but this is no longer supported.)
and
Ideographs use 3 bytes in UTF-8, but only 2 in UTF-16. So Chinese/Japanese/Korean text will take up more space when represented in UTF-8.
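The byte counts quoted above are easy to check. A small Python sketch; the sample characters are my own picks, and "utf-16-le" is used so that the byte-order mark is not counted:

```python
# One sample character per range discussed above (sample choice is illustrative).
SAMPLES = {
    "ASCII letter A": "A",
    "Greek alpha": "\u03b1",
    "Cyrillic zhe": "\u0436",
    "CJK ideograph U+4E2D": "\u4e2d",
    "emoji U+1F600 (outside the BMP)": "\U0001F600",
}


def encoded_sizes(ch: str) -> tuple:
    """Return (UTF-8 byte count, UTF-16 byte count) for a single character."""
    # "utf-16-le" avoids the 2-byte byte-order mark that plain "utf-16" prepends.
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


if __name__ == "__main__":
    for label, ch in SAMPLES.items():
        utf8_len, utf16_len = encoded_sizes(ch)
        print(f"{label}: UTF-8 = {utf8_len} bytes, UTF-16 = {utf16_len} bytes")
```

The ideograph comes out at 3 bytes in UTF-8 against 2 in UTF-16, while the ASCII letter is 1 against 2, so which encoding is smaller depends on the mix of markup and text in the feed.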
Besides the fact that whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people
That's crap. It says the feed is valid, but it issues a warning on the encoding. It only uses a real XML parser; it wouldn't know the feed was valid unless it could read it.
Do your homework before you go making unfounded accusations of incompetence (again).
The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8.
That sort-of sounds like "US rules"
That's crap too. The only encodings guaranteed to be supported by XML parsers are UTF-8 and UTF-16. Atom didn't make it that way; the XML specification did. Want to replace the XML specification? Oh, how we would line up behind you.
I think it should be UTF-X, though, so UTF-16 and -32 can be used in the future.
UTF-32 is not guaranteed to be supported by XML parsers, so I don't see how this is an improvement.
We could mandate us-ascii too (all the element names in Atom are us-ascii, and all content can be represented with numeric entities), but I don't see how that would be an improvement either.
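For the curious, the us-ascii option looks roughly like this in Python. It is only a sketch of the idea; the helper name is made up, and the element name is just an example:

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> bytes:
    """Encode text as pure us-ascii, using numeric character references."""
    # Escape &, < and > first; xmlcharrefreplace only touches non-ASCII characters.
    return escape(text).encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    payload = b"<title>" + to_ascii_xml_text("Asbj\u00f8rn & \u4e2d") + b"</title>"
    print(payload)
    # Any conforming parser turns the references back into the original characters.
    print(ET.fromstring(payload).text)
```

Every non-ASCII character becomes a reference such as `&#248;`, which costs more bytes than either UTF-8 or UTF-16 would.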
Mark,
Sam said it choked, and when I clicked on the page I saw error messages. I didn't expect to have to scroll down all the way to the bottom of the page to see that it still said the feed was valid.
This looks like a problem with the feed validator UI. It wasn't clear at all that the feed was considered valid from following Sam's link without scrolling all the way past an XML feed that looked like garbage in my browser.
Mark,
Whatever! Your argument is invalid. What does XML have to do w/ UTF-8 just being a forward kludge of ASCII, a.k.a. US-ASCII, a.k.a. American Standard Code...? Just because the XML groups implemented this pro-Western BS kludge doesn't mean it's not a pro-Western BS kludge.
Things I can change: I can issue warnings when such conditions are detected so that feed producers who care to have the widest possible interoperability can make informed choices.
My understanding was that GB2312 had officially been superseded by GB18030, which doesn’t generate a warning. The other non-obscure encodings were taken from Syndic8’s statistics, and probably reflect some Western bias:
Here's an interesting thread that's getting very political. Let's start w/ a great comment... MSXML Dude: Can't you guys just move along and produce something useful. Is ATOM 0.3 all we are going to get out of the ATOM effort? Finally... Atom Dude:...
Randy Charles Morin:
Remember that UTF-8 is effectively the interna...zation of ASCII, where most non-ASCII characters require three to six bytes. Non-ASCIIans generally look at UTF-8 and say "I don't think so".
I really don't find the byte size problem that big of a problem. Why is everyone so damned protective of their bandwidth when it comes to the size of the encoding, but couldn't care less when it's about using XML or other (better) formats (like Enamel), and doesn't give a heck about enabling gzip on their HTTP 1.1 served content?
This is arguing for argument's sake. UTF-8 solves an incredible number of problems, and the rest of the UTF family solves the rest. I'm a home-grown, radical Non-ASCIIan (you can tell from the 'ø' in my name), but still love UTF-8 more than my own mother. And I'm not alone.
Dare Obasanjo:
It seems you are assuming content syndication is only of interest to people in the western world. UTF-8 is popular in the Western world but isn't in places like China.
No, I'm not. That's why I also mentioned UTF-16 and UTF-32.
Mark:
UTF-32 is not guaranteed to be supported by XML parsers, so I don't see how this is an improvement.
Maybe UTF-32 isn't, but UTF-8 is an incredible improvement on the insane mess we have today. All languages that fit in UTF-8 have absolutely no reason to use another encoding. The ISO-8859 family is dead, or at least should be. The languages that don't fit in there fit in UTF-16, which is (or will be) supported by both the specification and most tools, no?
Yea, that solves a lot of problems, doesn't it? Why the heck use XML at all? CSV is much more bandwidth friendly! Heck, why send the characters as literals at all? Binary objects are much better. Yes, US-ASCII-enforced characters in a binary format. That's the future! :-)
Randy Charles Morin:
Just because the XML groups implemented this pro-Western BS kludge doesn't mean it's not a pro-Western BS kludge.
You don't find understanding in the fact that US-ASCII was the most widely used character set before the days of Unicode, and that Unicode probably wouldn't be more than another dead specification if it wasn't backward compatible with ASCII? I understand that very well, even though the ASCII characters only cover a part of the letters I write on a daily basis.
Besides the fact that whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people, what exactly is wrong with gb2312?
GB18030 has been a mandatory standard in Mainland China since September 1, 2001, plus or minus a day. Which is not to say GB-2312 isn't a legal encoding, but both theory and practice appear to support the warning message of "obscure".
Mark,
Considering that Syndic8 claims that there are 0 feeds whose encoding is GB18030, does that mean that none exist? Using an English language site mostly utilized by people who speak English to speculate on the number of Chinese feeds seems to be an error prone endeavor.
On the other hand Googling for "GB2312" returns almost 10 times more results than Googling for "GB18030".
There's really no point in going back and forth about this anyway. The feed validator says the encoding is valid and warns users that the encoding may not be widely supported. This seems fair to me. Arguing about whether GB2312 is actually widely used or not when neither of us is familiar with XML usage in mainland China is a waste of time.
Using an English language site mostly utilized by people who speak English to speculate on the number of Chinese feeds seems to be an error prone endeavor.
And slinging unfounded insults like "whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people" without any data to back up either the first half or the second half is... well, it's par for the course for you, Dare. But that doesn't mean you should get away with it.