It’s just data

Longhorn and Standards

What will the Longhorn RSS Platform Sync Engine do with feeds that are not well-formed?

I suspect that it is too much to hope for for Microsoft to suddenly start respecting the widely ignored RFC 3023, so let’s put that aside for a moment.

If Longhorn follows RSS Bandit’s lead and refuses to process feeds that aren’t (locally) well-formed, then this single act will likely send immediate and profound shock waves through the entire syndication community.

If, however, Longhorn follows the lead of virtually every other aggregator out there, then the impact of having a core platform service responsible for delivery of data to applications compensate for XML irregularities won’t be as immediate; but it will likely be even more profound and will ultimately affect a much larger portion of the broader XML community, probably starting with web services, of both the SOAP and non-SOAP varieties.

Either way, we are in for some interesting times.


They told me they won’t be processing ill-formed XML. BTW, NetNewsWire and Thunderbird make no effort to fix ill-formed XML, either.

Posted by Robert Sayre at

What will the Longhorn RSS Platform Sync Engine do with unknown extension elements?  Will it surface them in the API?  Will they be surfaced in the parsed normalized form?  Will applications be able to get to the extensions in any way?  Will application developers be able to plug in support for unknown extensions at the sync parser level?  How will the sync engine handle structural differences between Atom 1.0, RSS 2.0 and RSS 1.0?  For instance, Atom supports multiple enclosures per entry; RSS 2.0 only allows for one enclosure per item.  Will the Longhorn API surface the ability to get to multiple enclosures?  Will the sync engine’s normalization process reduce everything to RSS 2.0 semantics?  If so, how will it compensate for the extended semantics introduced by Atom and RSS 1.0?  For example, how will the sync engine and API compensate for the differences between the atom:source element and the RSS 2.0 source element? ... or for the atom:contributor element?

Lots of questions to ask and get answers for.  Will definitely be interesting.
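To make the enclosure question concrete, here is a minimal Python sketch of what a lossy normalization could look like. The sample XML, the function names and the "keep the first one" policy are all assumptions made for illustration; nothing here reflects what the Longhorn API actually does.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "{http://www.w3.org/2005/Atom}"

# Hypothetical sample data, not taken from any real feed.
ATOM_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Episode 1</title>
  <link rel="enclosure" type="audio/mpeg" href="http://example.org/ep1.mp3"/>
  <link rel="enclosure" type="audio/ogg" href="http://example.org/ep1.ogg"/>
  <link rel="alternate" href="http://example.org/ep1.html"/>
</entry>"""

RSS_ITEM = """<item>
  <title>Episode 1</title>
  <enclosure url="http://example.org/ep1.mp3" length="1234" type="audio/mpeg"/>
</item>"""


def atom_enclosures(entry_xml):
    """Return the href of every atom:link whose rel is "enclosure"."""
    entry = ET.fromstring(entry_xml)
    return [link.get("href")
            for link in entry.findall(ATOM_NS + "link")
            if link.get("rel") == "enclosure"]


def rss_enclosures(item_xml):
    """Return the url of every enclosure child of an RSS 2.0 item."""
    item = ET.fromstring(item_xml)
    return [enc.get("url") for enc in item.findall("enclosure")]


def normalize_to_single(enclosures):
    """Assumed lossy policy: keep only the first enclosure, drop the rest."""
    return enclosures[:1]


print(atom_enclosures(ATOM_ENTRY))
print(normalize_to_single(atom_enclosures(ATOM_ENTRY)))
```

If a normalized store behaved like normalize_to_single, the second Atom enclosure would be unreachable unless the original item XML were also exposed.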

Posted by James Snell at

Longhorn and Valid XML

Sam Ruby asks, “What will the Longhorn RSS Platform Sync Engine do with feeds that are not well-formed?”...

Trackback from Randy Holloway Unfiltered at

BTW, NetNewsWire and Thunderbird make no effort to fix ill-formed XML, either.

Brent Simmons: NetNewsWire’s RSS parser does try to work around some of these errors

Thunderbird appears to handle the feed mentioned in [link] without complaint.

Posted by Sam Ruby at

I think the Platform Sync Engine is actually just the background download service, and what you’re wondering and hoping about is the Common RSS Data Store (well, though the Sync Engine has to get involved in 3023 compliance).

The answer to James’s fourth question is that “It is also possible to get at the actual item xml for such applications which want to perform operations on the xml instead of using the item’s properties.”

Posted by Phil Ringnalda at

That was Brent in January 2004. In December 2004, things changed. I haven’t checked to see what he meant by that, though.

Re: Thunderbird, you’re right. I had forgotten that Mozilla/Gecko’s underlying machinery will scrape characters. Opening that feed in a Mozilla browser shows the behavior for me. IE displays an error.

Posted by Robert Sayre at

Thunderbird will refuse to process feeds with undeclared entities (including HTML ones), will refuse to process feeds with undeclared ns prefixes (1.1a+), and will not attempt to “fix” any well-formedness error reported by the XML processor.
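For illustration, the same two failure classes can be reproduced with any strict XML processor. The sketch below uses Python's expat binding rather than Thunderbird's own parser, and the sample documents and the helper name are made up for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ErrorString, ExpatError


def well_formedness_error(xml_text):
    """Return expat's error message, or None if the document parses."""
    try:
        minidom.parseString(xml_text)
    except ExpatError as err:
        return ErrorString(err.code)
    return None


# Hypothetical sample documents.
UNDECLARED_ENTITY = "<rss><channel><title>Caf&eacute;</title></channel></rss>"
UNBOUND_PREFIX = "<rss><channel><dc:creator>x</dc:creator></channel></rss>"
PREDEFINED_ONLY = "<rss><channel><title>A &amp; B</title></channel></rss>"

for name, doc in [("undeclared entity", UNDECLARED_ENTITY),
                  ("unbound prefix", UNBOUND_PREFIX),
                  ("predefined entity only", PREDEFINED_ONLY)]:
    print(name, "->", well_formedness_error(doc))
```

A processor that reports these errors and stops, without trying to repair the input, behaves the way this comment describes.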

Posted by Robert Sayre at

Many RSS feeds don’t use entities outside the ones predefined by XML (the HTML ones tend to be double escaped).  Many RSS feeds don’t use namespaces at all.  However, a single cut and paste of a “smart quote” is often all it takes to produce an invalid feed in several blogging tools.

If either the Platform Sync Engine or the Common RSS Data Store were to start rejecting feeds with such common encoding errors, I suspect that this would be a rude awakening for many.
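A rough sketch of the smart-quote case, assuming the pasted character arrives as a single windows-1252 byte inside a feed that declares itself as UTF-8. The feed skeleton and helper names are invented for the example.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Right single quotation mark: one byte in windows-1252, three in UTF-8.
SMART_QUOTE_CP1252 = b"\x92"
SMART_QUOTE_UTF8 = "\u2019".encode("utf-8")


def build_feed(title_bytes):
    """Wrap title_bytes in a minimal, hypothetical feed declared as UTF-8."""
    return (b'<?xml version="1.0" encoding="utf-8"?>'
            b"<rss><channel><title>" + title_bytes + b"</title></channel></rss>")


def parses(feed_bytes):
    """Return True if a strict XML parser accepts the bytes."""
    try:
        minidom.parseString(feed_bytes)
        return True
    except ExpatError:
        return False


print(parses(build_feed(b"let" + SMART_QUOTE_UTF8 + b"s")))
print(parses(build_feed(b"let" + SMART_QUOTE_CP1252 + b"s")))
```

The text a reader would see is identical in both cases; only the bytes on the wire differ, which is why the error is so easy to introduce and so hard to spot.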

Posted by Sam Ruby at

Sam Ruby: Longhorn and Standards

[link]...

Excerpt from del.icio.us/yohei at

I suspect that it is too much to hope for for Microsoft to suddenly start respecting the widely ignored RFC 3023, so let’s put that aside for a moment.

Oh let’s not.

Following up from my previous experiment 18 months ago, I scanned the Technorati Top 100.  Of 100 sites, 66 had auto-discoverable feeds.  Of 66 feeds, 15 were not well-formed.  Every single one of the non-wellformedness errors stemmed from interactions between HTTP and XML:

So it appears that the “rude awakening” may not be so rude after all.  Unless, of course, Microsoft chooses to support all the relevant standards, instead of just the ones they find convenient.
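As a reminder of what the HTTP and XML interaction involves, here is a simplified Python sketch of the RFC 3023 precedence rules: an explicit charset parameter always wins, text/xml without one defaults to us-ascii regardless of the XML declaration, and application/xml without one falls back to the XML declaration. The function name is made up, the caller is assumed to pass an XML media type, and byte order mark detection is left out.

```python
import re


def rfc3023_encoding(content_type, xml_bytes):
    """Return the encoding a strict RFC 3023 consumer would assume.

    Simplified sketch: assumes content_type is an XML media type and
    ignores byte order marks.
    """
    media_type, _, params = content_type.partition(";")
    media_type = media_type.strip().lower()

    # 1. An explicit charset parameter on the HTTP header is authoritative.
    match = re.search(r'charset\s*=\s*"?([^";\s]+)"?', params, re.I)
    if match:
        return match.group(1).lower()

    # 2. text/* XML types without a charset default to us-ascii,
    #    whatever the XML declaration says.
    if media_type.startswith("text/"):
        return "us-ascii"

    # 3. application/* XML types defer to the XML declaration, else utf-8.
    decl = re.match(
        rb'<\?xml[^>]*encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']', xml_bytes)
    return decl.group(1).decode("ascii").lower() if decl else "utf-8"


doc = b'<?xml version="1.0" encoding="iso-8859-1"?><rss/>'
print(rfc3023_encoding("text/xml", doc))
print(rfc3023_encoding("application/rss+xml", doc))
```

Under these rules a feed served as text/xml with no charset and any non-ASCII byte is not well-formed, even if its XML declaration is perfectly accurate.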

Posted by Mark at

Phil said: The answer to James’s fourth question is that "It is also possible to get at the actual item xml for such applications which want to perform operations on the xml instead of using the item’s properties."

The “RSS Support in Longhorn” document also says that when the sync engine pulls a feed it will be “Parsed and normalized into a unified format” [emphasis added].  So the question goes back to: will the API provide access to the original feed XML or just the “parsed and normalized unified format”?  If the former, wonderful.  If the latter... what is the “parsed and normalized unified format” and how much of the original will it preserve?

Posted by James Snell at

Mark’s article has always bothered me, here’s why:

“Violations of well-formedness constraints are fatal errors.”

“It is a fatal error when an XML processor encounters an entity with an encoding that it is unable to process. It is a fatal error if an XML entity is determined (via default, encoding declaration, or higher-level protocol) to be in a certain encoding but contains byte sequences that are not legal in that encoding.”

Seems to me that encoding errors are fatal errors that happen during tokenization, while violations of well-formedness constraints are fatal errors in the parser that occur when successfully decoded tokens don’t match the Document grammar. It’s a nitpick, but it drives me nuts.
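One way to picture the distinction is to make the two stages explicit. This is only a sketch: a real XML processor interleaves decoding and parsing rather than running them as separate passes, and the function name and labels are invented.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def classify(raw_bytes, encoding):
    """Label a document 'encoding error', 'well-formedness error' or 'ok'."""
    try:
        text = raw_bytes.decode(encoding)  # stage 1: bytes to characters
    except UnicodeDecodeError:
        return "encoding error"
    try:
        minidom.parseString(text)          # stage 2: characters to grammar
    except ExpatError:
        return "well-formedness error"
    return "ok"


print(classify(b"<a>\xff</a>", "utf-8"))   # byte sequence illegal in UTF-8
print(classify(b"<a><b></a>", "utf-8"))    # decodes fine, tags mismatched
print(classify(b"<a/>", "utf-8"))
```

Both failures are fatal either way, so the distinction changes what the error is called rather than what the processor may do about it.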

Posted by Robert Sayre at

Thunderbird (1.0.2) seems to reject even presumably correct feeds, like yours.

XML Parsing Error: undefined entity
Location: http://www.intertwingly.net/blog/2005/06/26/Longhorn-and-Standards
Line Number 37, Column 4:
let’s put that aside for a moment.

I guess people will either use a different engine, or not read the ‘bad’ feeds.
Hopefully by the time Longhorn has any significant market share, other cross platform APIs and implementations will be well established, so it won’t matter as much as...

Posted by anonymous at

Anonymous, [link] is not a feed, it is an XHTML page.  Try one of these feeds.

Posted by Sam Ruby at

Of 66 feeds, 15 were not well-formed.

Advantage: my subscriptions. Of 434 feeds, four are currently not well-formed: one failure-to-escape-ampersands, one copy-paste encoding error, one charset problem that might be my aggregator (the validator doesn’t have a problem with Nelson’s feed, but Feed on Feeds does) and one with encoding="" in the XML declaration. If only everyone had the sense to mostly subscribe to geeks, especially syndication geeks, we’d be just fine.

Posted by Phil Ringnalda at

Long and horny

Sam Ruby thinks that Microsoft’s involvement in the RSS world will lead to “interesting times”. It is hard to do anything but agree, and I really hope that they will be strict when it comes to XML (and related...

Excerpt from Månhus beta (David Hall) at

Phil, I did my testing with Feed Parser.  What did you use for your tests?

Feed Parser claims bozo=0 for Nelson’s feed.

Posted by Mark at

Re: Thunderbird, I’m using [link] to subscribe, but I unselected ‘show the article summary instead of full page’ - since I want the full page to read offline, and the summary is incomplete. I guess the problem is in the html, not in the feed, but the net result is still that I can’t use it with tbird.
Costin

Posted by Costin Manolache at

Bah. I’ve lost the ability to read: I have non-3023 errors as well as 3023 errors. That makes me worse, not better.

Don’t suppose you have an OPML -> bozo report script I could borrow, to see just how shocked I’ll be?

Posted by Phil Ringnalda at

Costin: Can you explain what you mean by "summary is incomplete"?  That particular feed is RSS 0.91 which variously describes the description as plain text and as a synopsis.  You will find the full marked up content in feeds such as RSS 1.0 and RSS 2.0.

Posted by Sam Ruby at

Don’t suppose you have an OPML -> bozo report script I could borrow

As far as I’m concerned, publishing an OPML file in the first place qualifies the author as a bozo.  Haven’t you heard?  The cool kids are all using XOXO these days.

Posted by Mark at

Sam - I know you have full marked-up content, in all possible formats; the point was that T-bird fails with that particular ‘synopsis’ feed (tbird calls it ‘summary’). I’m not using T-bird for native RSS because it fails with yours and a few other similar feeds - being able to deal with synopsis feeds is kind of a basic test for tbird or any other reader.

Your posting was about Longhorn being tolerant of its input or using strict well-formedness checks. If it is too strict - people will consider it buggy, and make it even harder for it to succeed. I assume it’ll have other obstacles - I doubt the APIs will be supported in older Windows versions, and I hope some cross-platform APIs will be established by then. But failing to work with real-life content can be a fatal problem.

Posted by Costin Manolache at

If it is too strict - people will consider it buggy, and make it even harder for it to succeed.

While undoubtedly true for T-bird and RSS Bandit, I think the social dynamics for something that is baked into the platform API of the next version of the most popular desktop operating system are somewhat different.

Posted by Sam Ruby at

Don’t suppose you have an OPML -> bozo report script I could borrow

Here’s a bare-bones version:

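from urllib.request import urlopen
from xml.dom import minidom
from feedparser import parse

# Fetch the OPML subscription list and print feedparser's bozo flag per feed
opml = urlopen("http://www.bloglines.com/export?id=rubys").read()
for outline in minidom.parseString(opml).getElementsByTagName("outline"):
    # Folder outlines carry no xmlUrl attribute; skip them
    if not outline.hasAttribute("xmlUrl"):
        continue
    feed = outline.getAttribute("xmlUrl")
    try:
        # bozo is truthy when the feed is not well-formed
        print(feed, parse(feed).bozo)
    except Exception:
        pass  # ignore feeds that cannot be fetched or parsed at all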
Posted by Sam Ruby at

Thanks, Sam.

56 as-yet-unsorted bozos, including atomenabled.org and various luminaries and IETF WG chairs. 3023 isn’t exactly the single most popular RFC, is it?

Posted by Phil Ringnalda at

Countdown

Links and a countdown....

Excerpt from Anne’s Weblog about Markup & Style at

Links for 2005-07-17 [del.icio.us]

New screencast blog opens (recording of computer screens) Best Feed Reader for BlackBerry? Sam Ruby: Longhorn and Standards...

Excerpt from Real Geek at
