It’s just data

Feedback loops

Is this feed valid?  Both SharpReader and Bloglines handle it flawlessly.  In fact, there are active Bloglines subscribers.

The feedvalidator chokes on it.

I point this out, as a permathread has re-erupted on atom-syntax.  What bothers me about the permathread is that people seem to take certain things as absolutes when reality is so much more deliciously complicated.

Now, let's take a look at what SharpReader says when it encounters a feed it can't handle:

Error parsing RSS XML: Undefined root element: html

Please try to validate this feed. If this feed validates as correct RSS, you can send an error report.

Very simple, honest, and to the point.  It doesn't proclaim that the "feed" is invalid.  It simply states that it encountered an error during parse, specifies what the error is, suggests a way to independently validate the feed, and suggests a way to provide feedback to the tool author if you think that there is likely a bug in SharpReader.
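As a rough illustration of that style of reporting, here is a minimal Python sketch. It is not SharpReader's code; the function names and the set of accepted root elements are assumptions made for the example. It reports what went wrong during parsing without claiming the feed itself is invalid.

```python
import xml.etree.ElementTree as ET

# Root element names this hypothetical reader understands (an assumption).
KNOWN_ROOTS = {"rss", "RDF", "feed"}


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def describe_parse_problem(document):
    """Return None if the root looks like a feed, else a message for the user."""
    try:
        root = ET.fromstring(document)
    except ET.ParseError as exc:
        # Not well-formed XML: pass the parser's own message along.
        return f"Error parsing RSS XML: {exc}"
    name = local_name(root.tag)
    if name not in KNOWN_ROOTS:
        return f"Error parsing RSS XML: Undefined root element: {name}"
    return None


print(describe_parse_problem("<html><body/></html>"))
```

The message only describes the reader's own experience with the document, which leaves room for the feed being fine and the reader being wrong.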

Based on experiments, I am very much convinced that every possible permutation of validity, successfully passing the validator, and being able to be meaningfully consumed by your favorite aggregator exists out there in the wild.

In an absolute sense, the feedvalidator is not perfect.  Does that mean that it is not useful?  The best we can observe is that there is a high correlation between correctness and usefulness.  This also is true for feeds.  A feed may be technically valid but not useful.  A feed may be technically invalid but useful.

In the midst of all this noise, a sensible suggestion re-emerged.  A totally opt-in feature which enables feedback to be provided.  This being said, I have the following concerns:

However, overall, the idea of an optional feature built on an HTTP GET coupled with a User-Agent header seems like it can't do much harm, and might actually prove useful.
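The details of the proposal are not spelled out here, so the following Python sketch only shows the two ingredients named above: a GET request and a User-Agent header identifying the tool. The URL, the tool name, and the function name are all hypothetical, and the request is built but never sent.

```python
from urllib.request import Request


def build_feedback_request(feedback_url, tool_name, tool_version):
    """Build (but do not send) a GET request that identifies the tool."""
    user_agent = f"{tool_name}/{tool_version}"
    return Request(feedback_url, headers={"User-Agent": user_agent}, method="GET")


# Hypothetical values, for illustration only.
request = build_feedback_request("http://example.org/feedback", "ExampleReader", "1.0")
print(request.get_method(), request.get_header("User-agent"))
```

Note that urllib stores header names with only the first letter capitalized, which is why the lookup uses "User-agent".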


RE: Feedback loops

Why would the feed at [link] be considered invalid? It's in a valid encoding, isn't it? As for what Luke does with feeds that he's had difficulty parsing, RSS Bandit does the same. Also, if we've successfully parsed the feed in the past, we provide the contact info from the admin:errorReportsTo, managingEditor, etc. elements in a mailto: link so folks can send a message to the site admins.

The Postel's law argument is getting old. Some people will parse feeds liberally and others won't. Most will to some degree or the other and provide some UI hints to users when the feed crosses the line they can handle.

Can't you guys just move along and produce something useful. Is ATOM 0.3 all we are going to get out of the ATOM effort?

Message from Dare Obasanjo at


The feed looks entirely valid to me.

Encodings are a known and accepted interoperability problem. If you publish a feed in something other than utf-8, you may lose.

The admonition not to publish feeds as application/octet-stream seems a bit pedantic but it's going to cause trouble for some applications so I suppose it's reasonable for a validator to point it out.

Posted by Norman Walsh at

By the fuzzy specifications of RSS, the feed might be perfectly valid. But serving feeds as 'application/octet-stream' and in 'gb2312' encoding is something I would consider, at least, a bit awkward. And I think the feed validator is very correct in pointing these peculiar things out.

Posted by Asbjørn Ulsberg at

RE: Feedback loops

Asbjørn,
  Besides the fact that whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people, what exactly is wrong with gb2312?

Message from Dare Obasanjo at

And what would be wrong with my encoding my feeds in x-ringnalda-klingon-37? The XML rec doesn't say I can't, no RSS spec says I can't, no XML parser can read it. Should I not be notified that I'm going to be unparseable? Should a gb2312 encoder not be notified that they will be unparseable in any PHP-based aggregator (at least pre-PHP5), in any Python-based aggregator not run by someone who understands the ways of ICONV, in any...

Norman may be right that encoding problems are known and accepted in the inner sanctum of XML, but for those of us with very little exposure to it, it's a big surprise that we can either use utf-8, or risk being given the cold shoulder (yes, in theory we should be able to use utf-8 or utf-16, but since my parser fails to accept utf-16 despite the rec requiring it to, I say utf-8 is the only possible choice).

Posted by Phil Ringnalda at

Instead, Dare, could you give me one single good reason why the feed shouldn't be encoded in UTF-8? Instead of leaving the burden of parsing it up to (possibly) millions of consumers, isn't it better to fix the «problem» (although you seem to think it's not) at the source, leaving the burden on one single head -- the producer?

Posted by Asbjørn Ulsberg at

In what way does the feed validator choke? It emits some warnings, but marks it as valid RSS at the bottom of the report ...

Posted by James Aylett at

I think most can be learned by asking the author of the blog why? Maybe his community of friends don't have a problem w/ the feed. That's likely all that matters to him. Maybe he didn't know how to get his characters working w/ UTF-8 and MT, in which case, SixLog should likely spend some time writing a HowTo, which they might already have and you can redirect users accordingly from the Feed Validator.

Posted by Randy Charles Morin at

James, perhaps I should update this to say used to choke ;-)

A fix was made by Joseph Walton to not only handle this case properly, but to display the Chinese characters correctly in utf-8.

Pedantic warnings are still issued for use of encodings which are not widely supported, use of non-XML content types, and explicitly overriding the XML encoding with an HTTP charset.
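To make the last two of those warnings concrete, here is a minimal Python sketch. It is not the feedvalidator's code: the function name, the message wording, and the simplified regular expression for the XML declaration are all assumptions for illustration.

```python
import re
from email.message import Message

# Simplified match for the encoding in an XML declaration (an assumption:
# it only handles an ASCII-compatible declaration at the very start).
XML_DECL_ENCODING = re.compile(rb"""^<\?xml[^>]*encoding=["']([A-Za-z0-9._-]+)["']""")


def transport_warnings(content_type_header, body):
    """Return warnings about how a feed is served over HTTP."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    media_type = msg.get_content_type()       # lowercased media type
    http_charset = msg.get_content_charset()  # lowercased charset, or None

    warnings = []
    is_xml = media_type in ("text/xml", "application/xml") or media_type.endswith("+xml")
    if not is_xml:
        warnings.append(f"non-XML content type: {media_type}")

    match = XML_DECL_ENCODING.match(body)
    if match and http_charset:
        xml_encoding = match.group(1).decode("ascii").lower()
        if xml_encoding != http_charset:
            warnings.append(
                f"HTTP charset {http_charset} overrides XML encoding {xml_encoding}"
            )
    return warnings


feed = b'<?xml version="1.0" encoding="gb2312"?><rss/>'
print(transport_warnings("application/octet-stream", feed))
print(transport_warnings("text/xml; charset=utf-8", feed))
```

Both checks return warnings rather than errors, matching the idea that the feed can still be valid while being served in a way that some consumers will trip over.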

Posted by Sam Ruby at

re. "I am very much convinced that every possible permutation of validity..."

Therein lies the rub - there isn't any consistent notion of RSS validity. The Validator is as useful as can be, but without real standardisation of the formats involved, it's never going to be correct. RSS is notionally XML, but implementers generally have to treat it as loosely-tagged text. One key consequence is that there isn't the option of corrective feedback for feed producers (how it's done is another matter). It will be a shame if Atom doesn't learn from this mistake.

Posted by Danny at

RSS is notionally XML, but implementers generally have to treat it as loosely-tagged text.

Danny, forgive me, but I think you have missed the point here.  This feed is well formed XML.  It is valid RSS 1.0.  It is valid RDF/XML.

It is also encoded using a character set that not only isn't required to be supported by XML, it in fact isn't supported by many popular XML parsers.

The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8.

Posted by Sam Ruby at

The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8.

Probably an OK, but very strict, solution to the problem. I think it should be UTF-X, though, so UTF-16 and -32 can be used in the future. «Life is like a box of chocolate, you never know what you're gonna get».

Posted by Asbjørn Ulsberg at

"The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8."

That sort-of sounds like "US rules". Remember that UTF-8 is effectively the interna...zation of ASCII, where most non-ASCII characters require three to six bytes. Non-ASCIIans generally look at UTF-8 and say "I don't think so". What about UCS-2?

Posted by Randy Charles Morin at

Asbjørn,
It seems you are assuming content syndication is only of interest to people in the western world. UTF-8 is popular in the Western world but isn't in places like China. Effectively banning the most popular encoding used by a sizable chunk of this planet's population sounds pretty rude to me.

Read [link] for some idea of what encodings are actually used outside of the Western hemisphere by people who actually have to work with this stuff for a living. Note that UTF-8 isn't listed there.

Posted by Dare Obasanjo at

From wikipedia:

So the first 128 characters need one byte. The next 1920 characters need two bytes to encode. This includes Greek, Cyrillic, Coptic, Armenian, Hebrew, and Arabic characters. The rest of the UCS-2 characters use three bytes, and additional characters are encoded in 4 bytes. (An earlier UTF-8 specification allowed even higher code points to be represented, using 5 or 6 bytes, but this is no longer supported.)

and

Ideographs use 3 bytes in UTF-8, but only 2 in UTF-16. So Chinese/Japanese/Korean text will take up more space when represented in UTF-8.
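The byte counts quoted above are easy to check. This short Python sketch (the function name and sample characters are chosen for illustration) encodes a few characters both ways, using the big-endian UTF-16 codec so that no byte order mark is counted.

```python
def encoded_sizes(character):
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    utf8_size = len(character.encode("utf-8"))
    # "utf-16-be" avoids the 2-byte byte order mark that "utf-16" prepends.
    utf16_size = len(character.encode("utf-16-be"))
    return utf8_size, utf16_size


for sample in ["A", "\u00f8", "\u03a9", "\u4e2d"]:
    print(repr(sample), encoded_sizes(sample))
# 'A' (1, 2)
# 'ø' (2, 2)
# 'Ω' (2, 2)
# '中' (3, 2)
```

So ASCII text is half the size in UTF-8, Latin and Greek letters outside ASCII come out even, and ideographs cost one extra byte each in UTF-8.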



Posted by Sam Ruby at

Besides the fact that whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people

That's crap.  It says the feed is valid, but it issues a warning on the encoding.  It only uses a real XML parser; it wouldn't know the feed was valid unless it could read it.

Do your homework before you go making unfounded accusations of incompetence (again).

Posted by Mark at

The only way that Atom could learn from this mistake is to require all Atom feeds to be utf-8.

That sort-of sounds like "US rules"

That's crap too.  The only encodings guaranteed to be supported by XML parsers are UTF-8 and UTF-16.  Atom didn't make it that way; the XML specification did.  Want to replace the XML specification?  Oh, how we would line up behind you.

Posted by Mark at

I think it should be UTF-X, though, so UTF-16 and -32 can be used in the future.

UTF-32 is not guaranteed to be supported by XML parsers, so I don't see how this is an improvement.

We could mandate us-ascii too (all the element names in Atom are us-ascii, and all content can be represented with numeric entities), but I don't see how that would be an improvement either.
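For what the us-ascii option would look like in practice, here is a small Python sketch (the function name is made up for the example). It uses the standard "xmlcharrefreplace" error handler to turn every non-ASCII character into a numeric character reference; it does not escape markup characters such as & or <, which would need separate handling.

```python
def to_ascii_xml_text(text):
    """Replace each non-ASCII character with an XML numeric character reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_xml_text("Asbj\u00f8rn"))  # Asbj&#248;rn
print(to_ascii_xml_text("\u4e2d"))        # &#20013;
```

A single ideograph grows from two or three bytes to eight characters this way, which is one reason it is hard to call an improvement.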

Posted by Mark at

Mark,
Sam said it choked and when I clicked on the page I saw error messages. I didn't expect I had to scroll down all the way to the bottom of the page to see that it still said the feed was valid.

This looks like a problem with the feed validator UI. It wasn't clear at all that the feed was considered valid from following Sam's link without scrolling all the way past an XML feed that looked like garbage in my browser.

Posted by Dare Obasanjo at

That's crap too.

Mark,
Whatever! Your argument is invalid. What does XML have to do w/ UTF-8 just being a forward kludge of ASCII, a.k.a. US-ASCII, a.k.a. American Standard Code...? Just because the XML groups implemented this pro-Western BS kludge doesn't mean it's not a pro-Western BS kludge.

Posted by Randy Charles Morin at

Things I can't change:

Things I can change: I can issue warnings when such conditions are detected so that feed producers who care to have the widest possible interoperability can make informed choices.

Posted by Sam Ruby at

My understanding was that GB2312 had officially been superseded by GB18030, which doesn’t generate a warning. The other non-obscure encodings were taken from Syndic8’s statistics, and probably reflect some Western bias:

US-ASCII
ISO-8859-1
EUC-JP
ISO-8859-2
ISO-8859-15
ISO-8859-7
KOI8-R
SHIFT_JIS
WINDOWS-1250
WINDOWS-1251
WINDOWS-1252
WINDOWS-1254
WINDOWS-1255
WINDOWS-1256
GB18030

(Plus UTF-8 and UTF-16.)

The list’s contents are very much up for debate, but I’d rather see it diminish in size than increase.
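As a sketch of how such a list might be applied (this is not the validator's actual code; the function name and message wording are assumptions), a check in Python could look like this:

```python
# Encodings from the list above, plus UTF-8 and UTF-16.
NON_OBSCURE_ENCODINGS = {
    "UTF-8", "UTF-16",
    "US-ASCII", "ISO-8859-1", "EUC-JP", "ISO-8859-2", "ISO-8859-15",
    "ISO-8859-7", "KOI8-R", "SHIFT_JIS", "WINDOWS-1250", "WINDOWS-1251",
    "WINDOWS-1252", "WINDOWS-1254", "WINDOWS-1255", "WINDOWS-1256",
    "GB18030",
}


def obscure_encoding_warning(declared_encoding):
    """Return a warning string for encodings not on the list, else None."""
    name = declared_encoding.strip().upper()
    if name in NON_OBSCURE_ENCODINGS:
        return None
    return f"Obscure encoding: {name} may not be widely supported"


print(obscure_encoding_warning("gb2312"))
print(obscure_encoding_warning("GB18030"))
```

Keeping the list as plain data means that growing or shrinking it is a one-line change.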

Posted by Joseph Walton at

XML encoding=gb2312

Here's an interesting thread that's getting very political. Let's start w/ a great comment... MSXML Dude: Can't you guys just move along and produce something useful. Is ATOM 0.3 all we are going to get out of the ATOM effort? Finally... Atom Dude:...

Excerpt from iBLOGthere4iM at

Randy Charles Morin: Remember that UTF-8 is effectively the interna...zation of ASCII, where most non-ASCII characters require three to six bytes. Non-ASCIIans generally look at UTF-8 and say "I don't think so".

I really don't find the byte size problem that big of a problem. Why is everyone so damned protective of their bandwidth when it comes to the size of the encoding, but couldn't care less when it's about using XML or other (better) formats (like Enamel), and doesn't give a heck about enabling gzip on their HTTP 1.1 served content?

This is arguing for the argument's sake. UTF-8 solves an incredible amount of problems, and the rest of the UTF family solves the rest. I'm a home grown, radical, Non-ASCIIan (you can tell from the 'ø' in my name), but still love UTF-8 more than my own mother. And I'm not alone.

Dare Obasanjo: It seems you are assuming content syndication is only of interest to people in the western world. UTF-8 is popular in the Western world but isn't in places like China.

No, I'm not. That's why I also mentioned UTF-16 and UTF-32.

Mark: UTF-32 is not guaranteed to be supported by XML parsers, so I don't see how this is an improvement.

Maybe UTF-32 isn't, but UTF-8 is an incredible improvement from the insane mess we have today. All languages that fit in UTF-8 have absolutely no reason to use another encoding. The ISO-8859 family is dead, or at least should be. The languages that don't fit in there fit in UTF-16, which is (or will be) supported by both the specification and most tools, no?

Mark: We could mandate us-ascii too

Yea, that solves a lot of problems, doesn't it? Why the heck use XML at all? CSV is much more bandwidth friendly! Heck, why send the characters as literals at all? Binary objects are much better. Yes, US-ASCII-enforced characters in a binary format. That's the future! :-)

Randy Charles Morin: Just because the XML groups implemented this pro-Western BS kludge doesn't mean it's not a pro-Western BS kludge.

You don't find understanding in the fact that US-ASCII was the most widely used character set before the days of Unicode, and that Unicode probably wouldn't be more than another dead specification if it wasn't backward compatible with ASCII? I understand that very well, even though the ASCII characters only cover a part of the letters I write on a daily basis.

Posted by Asbjørn Ulsberg at

Besides the fact that whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people, what exactly is wrong with gb2312?

If by "millions", you mean "10".  Feeds declared as UTF-8: 26058; feeds using GB-2312: 10.

My understanding was that GB2312 had officially been superseded by GB18030

It appears you are correct.  See [link] and [link]

GB18030 has been a mandatory standard in Mainland China since September 1, 2001, plus or minus a day.  Which is not to say GB-2312 isn't a legal encoding, but both theory and practice appear to support the warning message of "obscure".

Posted by Mark at

Mark,
  Considering that Syndic8 claims that there are 0 feeds whose encoding is GB18030 does that mean that none exist? Using an English language site mostly utilized by people who speak English to speculate on the number of Chinese feeds seems to be an error prone endeavor.

On the other hand Googling for "GB2312" returns almost 10 times more results than Googling for "GB18030".

There's really no point in going back and forth about this anyway. The feed validator says the encoding is valid and warns users that the encoding may not be widely supported. This seems fair to me. Arguing about whether GB2312 is actually widely used or not when neither of us is familiar with the XML usage in mainland China is a waste of time.

Have a nice weekend.

Posted by Dare Obasanjo at

I really don't find the byte size problem that big of a problem.

Neither do I, but then iM a Westerner ;)

Posted by Randy Charles Mørin at

Why waste all this time arguing about GB2312 versus GB18030? The code is available; if it's so terribly important to you, stop complaining and fix it.

Posted by John Beimler at

Using an English language site mostly utilized by people who speak English to speculate on the number of Chinese feeds seems to be an error prone endeavor.

And slinging unfounded insults like "whatever XML parser Sam is using is not internationalized enough to support an encoding used by millions of people" without any data to back up either the first half or the second half is... well, it's par for the course for you, Dare.  But that doesn't mean you should get away with it.

Posted by Mark at

On Orange XML Icons

Summary: A useless clarification of a meaningless stance on an inconsequential issue....

Excerpt from firasd.org at
