It’s just data

RSS Bandit

Dare Obasanjo: So this morning I decided to write an RSS News aggregator.

My advice is to test it on Joe's and Shelley's feeds. This requires two simple, albeit a bit unconventional, rules: anything in the namespace of the DocumentElement is equivalent to the null namespace, and items can be either inside or outside of channels.
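
The two rules above can be sketched in a few lines. This is a minimal illustration using Python's standard `xml.etree.ElementTree`, not the approach any particular aggregator takes; the helper names `local_name` and `find_items` and the sample documents are hypothetical. It applies the rules literally, so a document whose root element sits in a different namespace than its items (an `rdf:RDF` root, for example) would need extra handling.

```python
import xml.etree.ElementTree as ET

def local_name(tag, root_ns):
    """Rule 1: a tag in the root element's namespace is treated as un-namespaced."""
    if tag.startswith("{"):
        ns, _, name = tag[1:].partition("}")
        return name if ns == root_ns else tag
    return tag

def find_items(xml_text):
    root = ET.fromstring(xml_text)
    # ElementTree spells namespaced tags as "{uri}name"; pull the uri off the root.
    root_ns = root.tag[1:].partition("}")[0] if root.tag.startswith("{") else ""
    # Rule 2: iter() walks the whole tree, so items match whether they sit
    # inside <channel> or beside it.
    return [el for el in root.iter() if local_name(el.tag, root_ns) == "item"]

# Hypothetical sample feeds (assumptions, not anyone's real feed).
NAMESPACED = ('<rss xmlns="http://example.org/ns" version="2.0"><channel>'
              '<item><title>a</title></item></channel></rss>')
FLAT = ('<rss version="0.91"><channel><title>c</title></channel>'
        '<item><title>b</title></item></rss>')
FOREIGN = '<rss xmlns:x="http://example.org/x"><x:item/></rss>'

if __name__ == "__main__":
    for doc in (NAMESPACED, FLAT, FOREIGN):
        print(len(find_items(doc)))
```

Items in a namespace other than the root's are deliberately left alone, so an extension element that happens to be called `item` is not mistaken for a feed entry.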

And then there are synonyms, e.g., dc:subject vs category...
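
One way to cope with synonyms is a small lookup table that tries each spelling in turn. The sketch below assumes Python's `xml.etree.ElementTree`; the `SYNONYMS` table and `get_field` helper are hypothetical, and pairing `dc:creator` with `author` is an assumption added for illustration (only `dc:subject` vs `category` comes from the text above).

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"  # Dublin Core elements namespace

# Hypothetical synonym table: canonical field -> tags to try, in order.
SYNONYMS = {
    "category": ["category", f"{{{DC}}}subject"],
    "author": ["author", f"{{{DC}}}creator"],  # assumed pairing, for illustration
}

def get_field(item, field):
    """Return the text of the first synonym present on the item, or None."""
    for tag in SYNONYMS.get(field, [field]):
        el = item.find(tag)
        if el is not None and el.text:
            return el.text.strip()
    return None

if __name__ == "__main__":
    item = ET.fromstring(
        f'<item xmlns:dc="{DC}"><dc:subject>XML</dc:subject></item>'
    )
    print(get_field(item, "category"))
```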


I've got to work on making my RSS more difficult to parse. My XHTML and FOAF are both locally famous for their ability to break parsers, but so far nobody has complimented my ability to write hard-to-parse RSS.

If you're making a general list of test feeds, I'd add one that CDATA escapes description, too: I noticed last night that Feedreader displays the CDATA close tag (much better than when I first saw CDATA used, when pretty much everything either broke or refused to read it). Though Dare's no doubt using a real parser and won't ever notice the difference.
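
To illustrate why a real parser never notices the difference: a minimal sketch, assuming Python's standard `xml.etree.ElementTree`, with a hypothetical helper `description_text` and made-up sample markup. Both spellings of the description come back as the same string.

```python
import xml.etree.ElementTree as ET

def description_text(xml_text):
    """Parse a lone <description> element and return its text content."""
    return ET.fromstring(xml_text).text

# The same HTML payload, once wrapped in CDATA and once entity-escaped.
CDATA_FORM = "<description><![CDATA[<b>bold</b> & more]]></description>"
ESCAPED_FORM = "<description>&lt;b&gt;bold&lt;/b&gt; &amp; more</description>"

if __name__ == "__main__":
    print(description_text(CDATA_FORM) == description_text(ESCAPED_FORM))
```

A reader that shows the CDATA close tag is most likely scanning the text by hand rather than handing it to an XML parser.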

Posted by Phil Ringnalda at

Actually neither Shelley nor Joe's feeds needed a code change. A screenshot with Shelley's feed is at http://www.25hoursaday.com/rssbandit3.jpg while one with Joe's is at http://www.25hoursaday.com/rssbandit4.jpg

There does seem to be a problem with the HTML encoding in Joe's content. I can't tell if it's my bug or his.

Dave Winer's feed has been the first so far to make me go back and tweak the code. He uses a <description> but no <title>.

Posted by Dare Obasanjo at

Phil, Tim Appnel's feed would be a good example of that.

However, not all is lost. Your RSS 2.0 feed (and my RSS 2.0 feed for that matter) are good examples as to why one can't ignore namespaces entirely. ;-)

Posted by Sam Ruby at

Phil, your feed is currently invalid because our XML parser can't read past line 1. http://feeds.archive.org/validator/check?url=http%3A%2F%2Fphilringnalda.com%2F

Dunno if this is something we should be compensating for. Are XML documents allowed to have blank lines before the initial XML processing instruction? I was under the impression this had to be the absolute first thing in the document. But what do I know?

Posted by Mark at

Responding to myself,
Fixed the problem with Joe's content. I wasn't escaping the '&amp;'.

Mark,
You are correct about the rules governing white space and the XML declaration [technically it isn't a PI] :)
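
A quick way to see the rule in action, sketched with Python's standard `xml.etree.ElementTree` (the helper `is_well_formed` and the sample strings are hypothetical): a blank line ahead of the XML declaration makes the parser reject the document, and stripping the leading whitespace makes it parse again.

```python
import xml.etree.ElementTree as ET

def is_well_formed(text):
    """Return True if the text parses as XML, False on a parse error."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False

GOOD = '<?xml version="1.0"?>\n<rss version="2.0"/>'
BAD = "\n" + GOOD  # one blank line before the XML declaration

if __name__ == "__main__":
    print(is_well_formed(GOOD))
    print(is_well_formed(BAD))
    # A lenient consumer could choose to strip leading whitespace first.
    print(is_well_formed(BAD.lstrip()))
```

Whether a consumer should compensate this way is a policy choice; the sketch only shows that a conforming parser will not do it for you.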

Posted by Dare Obasanjo at

Mark, Phil's RSS 1.0 and RSS 2.0 feeds are valid RSS. However, his viewable weblog page is valid XHTML despite not being valid XML.

Go figure.

Posted by Sam Ruby at

Tim's feed just broke my aggregator in two ways. :)

I assumed the RSS version attribute was limited to numeric values, which doesn't seem to be the case, since his feed validates fine.

His using an unexpected namespace for the RSS elements was also a breaker.

Posted by Dare Obasanjo at

Dare, Tim's feed validates because the RSS validator expects the RSS elements to be in the namespace of the DocumentElement. Given the history of RSS and namespaces, this seems like a most sane approach.

Posted by Sam Ruby at

Damn, you're right. It's a bug with rssfinder.py. It's not finding the RSS feed, or rather, it thinks the home page is the RSS feed. Probably because of the presence of the damn Trackback data, although I swear I fixed that bug months ago. *sigh* I don't have time to debug it. Just ignore my previous comments for now.

Posted by Mark at

Dare, we don't bother validating version numbers. You could publish an RSS feed version="3.141592653589793/and/your/mother's/ugly" and it would validate. In fact, I think I'm going to go do that.

On an unrelated note, it is truly scary that I still know pi to 15 digits after all these years. That was as far as my calculator would show me in high school, and (being the bored genius in the back of the room in the days before Internet access) I spent my time playing with my calculator and memorizing stupid shit. Did you know that 16435934 in hexadecimal spells FACADE? I always thought that was insanely cool.

Posted by Mark at

Mark: once I fixed my whitespace before the XML declaration issue (a PHP include that I didn't actually need anymore that was outputting a blank line), we're back to your old friend, CDATA escaped Javascript. I take the CDATA out, the validator autodiscovers me, put it back in and it doesn't. I thought you fixed *that* bug months ago.

Sam: Tim's feed is a great parser-breaker, but since lots of things still don't do anything with content:encoded, for completeness we need someone who CDATA's HTML in description. Extra credit if they talk about HTML, so that they've also got entity-encoded stuff inside the CDATA section ;)

Posted by Phil Ringnalda at

Despite the fact that it is invalid for other reasons, Kevin Burton's RSS feed is a prime example of the use of CDATA.

And, of course, morenews is an example of invalid XML.

Posted by Sam Ruby at

