It’s just data


Brent Simmons: My job is to treat Atom and RSS as peers, and to do a great job supporting both formats. I do not prefer one over the other, and I go out of my way to stay far away from the fighting. (One of the beautiful parts of newsreaders is the Unsubscribe button.)

Rogers Cadenhead

Rogers Cadenhead was recently appointed to the RSS 2.0 Advisory Board.  Shortly thereafter, Rogers described a technique that will enable you to insert "unusual" characters like "¿" into your RSS 2.0 feeds without causing your feed to blow up.  In the ensuing discussion, it became apparent that while he succeeded in producing a feed that was entirely valid according to the RSS 2.0 spec, he produced one that simply would not display properly in the Radio UserLand aggregator.  Oopsie.

Rogers later made a post that was based on the assertion that software that reads RSS 2.0 assumes — like Radio UserLand apparently does — that things like channel titles and item titles and item descriptions are entity encoded HTML.  Again, the resulting discussion provided some revelations on that score.

The most amazing thing about this entire discussion is that Rogers maintained his sense of humor throughout, and those who see value in the Atom specification didn't take the opportunity to go for the jugular, something I am ashamed to say I have seen happen too often in previous syndication discussions.  IMHO, this combination of humility and restraint is the key to progress.

Kudos to everybody involved, in particular Rogers.

Reuters does RSS

OK, so some unusual characters work differently across aggregators.  What's the big deal?

Suppose you are Reuters.  You produce a feed with BusinessNews.  Not a non-ASCII character in sight.  Your feed conforms to the specification, but contains some GUIDs which are not sufficiently unique.  This may cause aggregators like RSSBandit to not show you items in feeds that you are subscribed to.  Silently.
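To see how a non-unique GUID can turn into missing items, here is a minimal sketch of an aggregator that keys its store by guid.  The feed snippet and the store_items helper are hypothetical; this is not the code of RSSBandit or any real aggregator, only an illustration of the failure mode.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: two distinct stories that share one guid.
FEED = """<rss version="2.0"><channel>
  <title>Example</title>
  <item><title>Story one</title><guid>news-1</guid></item>
  <item><title>Story two</title><guid>news-1</guid></item>
  <item><title>Story three</title><guid>news-2</guid></item>
</channel></rss>"""


def store_items(feed_xml):
    """Keep the first item seen for each guid, as a naive aggregator might."""
    seen = {}
    for item in ET.fromstring(feed_xml).iter("item"):
        guid = item.findtext("guid")
        if guid not in seen:  # a repeated guid is treated as "already seen"
            seen[guid] = item.findtext("title")
    return seen


stored = store_items(FEED)
print(list(stored.values()))  # "Story two" never appears, and nothing complains
```

The feed is valid, the aggregator's logic is defensible, and the reader still never sees the second story.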

But again, another rare problem that occurs in few aggregators.  Edge cases.

Now, view source on the business feed.  The descriptions are completely valid and conform to the spec in every way.  As you would expect in a business feed from a company like Reuters, many of these descriptions contain stock ticker symbols.  Now subscribe to this feed (or perhaps this snapshot) in your aggregator.  Look for the stock ticker symbols.

We have yet to find a single aggregator which will show the stock ticker symbols.

Not a single one.

This is called data loss.  Silent data loss.  We are not talking about unusual characters.  Or the occasional item in a few aggregators.  We are talking stock ticker symbols.  In a Reuters Business feed.  In every single aggregator that we have tested with so far.  Silently.
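One plausible mechanism is sketched below: the description carries a ticker in angle brackets, correctly escaped for XML, and the aggregator then treats the decoded text as HTML.  The ticker syntax <MSFT.O> and the tag-dropping renderer are assumptions made for illustration; they are not taken from the Reuters feed or from any particular aggregator.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical item: the angle brackets are properly escaped, so the XML is valid.
ITEM = "<item><description>Microsoft &lt;MSFT.O&gt; shares rose.</description></item>"


class TextOnly(HTMLParser):
    """Collect text and ignore every tag, much as a renderer ignores unknown tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def render_as_html(text):
    """Return what is left after the text has been interpreted as HTML."""
    parser = TextOnly()
    parser.feed(text)
    parser.close()
    return "".join(parser.parts)


raw = ET.fromstring(ITEM).findtext("description")
shown = render_as_html(raw)

print(raw)    # the XML parser hands back the ticker intact
print(shown)  # the ticker is gone, and no error was raised
```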

This can be corrected.  I was the first to notice the problem.  I let Mark Pilgrim know and he talked to the responsible person at Reuters.  After enumerating the options, Reuters has elected to update their feeds by wrapping their descriptions in CDATA sections.  In rare circumstances this may cause their feeds to become not well formed, but this is the simplest fix that they can make that will work with the widest range of aggregators.
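A small sketch of the CDATA approach and its rare failure case: a CDATA section cannot contain the sequence ]]>, so a description that happens to include it makes the feed not well formed.  The wrap_cdata and is_well_formed helpers and the sample strings are hypothetical.

```python
import xml.etree.ElementTree as ET


def wrap_cdata(text):
    """Naively wrap text in a CDATA section, with no escaping at all."""
    return "<description><![CDATA[" + text + "]]></description>"


def is_well_formed(xml_text):
    """Report whether the XML parser accepts the text."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


ok = wrap_cdata("Microsoft <MSFT.O> shares rose.")
bad = wrap_cdata("if (a[b[0]]>c) then ...")  # contains "]]>" by accident

print(is_well_formed(ok))   # True
print(is_well_formed(bad))  # False
```

A producer that wants to be safe would need to check for that sequence before wrapping, which is exactly the kind of detail the simplest fix leaves out.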

I asked Reuters' permission before sharing this information.  They agreed that it is an issue worth discussing.  Publicly.  Especially given the fix is something not addressed in the RSS specification itself.

We need to get the word out.  Your titles and descriptions can be 100% valid according to the RSS specification.  And yet not work as you intend in any aggregator.

In an ideal world, the RSS spec would be updated to specify precisely how these cases are to be handled.  However, in a very practical sense, this would introduce a discontinuity.  Problematic feeds that previously were technically valid would suddenly become invalid.

Even if this weren't done, the RSS 2.0 roadmap does leave the door open for clarifications.  If the spec were to be updated to merely say how various textual elements SHOULD be interpreted, I would gladly update the feedvalidator to provide informational messages when problematic values for these elements are detected.  The feedvalidator will remain open source, and Dave and Andrew can choose to update their mirror to the latest version at any time.

We can work together to spread the word, reduce surprises, and improve the user experience with RSS 2.0.  I'm confident it can be done.

RSS Versions

If mangling unusual characters and silent data loss weren't bad enough, there is that nearly four-year-old fork thing to deal with.  Robert Scoble had suggested that companies with limited resources (like Microsoft?) only support RSS 2.0 and Atom.  This would rule out a large number of popular feeds.  Including Slashdot, BoingBoing, Google, Kuro5hin.  And, most importantly, Dilbert.

It is not simply a matter of multiple versions.  It is multiple versions that call themselves RSS.  This inevitably causes some bleed-through.  Consider the Dilbert feed mentioned above.  It is RSS 1.0.  With guids.  Guids are from RSS 2.0.  You won't find a mention of guids in the RSS 1.0 spec.

Bleed-through goes both ways.  ESPN feeds are RSS 2.0.  With rdf:resource and rdf:about attributes.  From, you guessed it, RSS 1.0.  You won't find a mention of rdf in the RSS 2.0 spec.
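In practice, a parser that wants a stable item identifier has to look in both places, whatever version the feed declares.  A minimal sketch, assuming simplified and hypothetical item fragments; the item_id helper is illustrative and not drawn from any existing parser.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# Hypothetical fragments: an RSS 1.0 style item carrying a guid,
# and an RSS 2.0 style item carrying rdf:about.
RSS1_ITEM = (
    '<item xmlns="http://purl.org/rss/1.0/">'
    "<guid>http://example.com/a</guid></item>"
)
RSS2_ITEM = (
    '<item xmlns:rdf="%s" rdf:about="http://example.com/b">'
    "<title>B</title></item>" % RDF_NS
)


def item_id(item):
    """Return an identifier from rdf:about or guid, whichever is present."""
    about = item.get("{%s}about" % RDF_NS)
    if about:
        return about
    for child in item:
        # Compare local names so guid is found with or without a namespace.
        if child.tag.rsplit("}", 1)[-1] == "guid" and child.text:
            return child.text.strip()
    return None


print(item_id(ET.fromstring(RSS1_ITEM)))  # http://example.com/a
print(item_id(ET.fromstring(RSS2_ITEM)))  # http://example.com/b
```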

There are many elements that were introduced in RSS 2.0 that duplicate functionality that was commonly found in namespaces in RSS 1.0.  Jon Udell added Dublin Core information to his feeds in September of 2002.  I felt strongly about this.  As did Rael Dornfest.  And Mark Pilgrim.

But these new elements were added anyway.  And political FAQs were written that simply boil down to a suggestion that you should use one set of optional elements instead of another.

Be that as it may, if you want to get the full value out of Jon Udell's, or Steve Gillmor's, or Larry Lessig's RSS 2.0 feed, you need to understand namespaces that are not on the list of RSS 2.0 namespaces.

What this means to you is that if you want to support any version of RSS completely, you essentially need to support all of them.  And there is no central directory which includes all of them.

But don't despair. There is a Universal Feed Parser.  It handles every known version of RSS.  It even supports Atom.  And CDF.  It supports 40 namespaces.  It is open source.  And even if for some reason you find you can't use it directly, you can still make use of the literally thousands of test cases that come with it.

If it can be done by one person, it can be done by others.  And it is worth doing.  For the user's sake.

Where do we go from here?

Good question.  Essentially, there exists a plurality of standards today.  RSS 1.0 is perhaps the most formal.  RSS 2.0 is the most permissive.  Permissiveness increases adoption rates, but as we see above, at a long-term cost of ambiguity.

Atom is not done yet, but Atom's focus has been on interoperability and fidelity.  There already are some conformance tests.  There will be more.  Lots more.

Atom attempts to strike a balance between formality and simplicity.  This comes at a cost of generality.  Example: the rdf:about attribute is required in RSS 1.0 on items.  The corresponding element in RSS 2.0, guid, is optional.  In Atom, entry ids are currently specified as being required.  If you can't generate unique ids for entries, then perhaps Atom is not the format for you.

Atom has more required elements than RSS.  Atom adds type attributes to titles and links to resolve the ambiguity described above.  It has separate elements for summary and content.  If you want, you can read more here or here or here. If you choose to, you can even Get Involved.
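What an explicit type attribute buys a consumer can be sketched as follows.  This assumes the attribute values "text" and "html" on an un-namespaced title element, which may not match the spelling in the current draft, and the display_text helper is hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextOnly(HTMLParser):
    """Collect text and ignore tags, standing in for an HTML renderer."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def display_text(title):
    """Interpret a title element according to its declared type."""
    value = title.text or ""
    if title.get("type", "text") == "html":
        parser = TextOnly()
        parser.feed(value)  # escaped markup: decode it as HTML
        parser.close()
        return "".join(parser.parts)
    return value  # plain text: show it exactly as written


plain = ET.fromstring('<title type="text">Microsoft &lt;MSFT.O&gt; rises</title>')
marked = ET.fromstring('<title type="html">Microsoft &lt;b&gt;rises&lt;/b&gt;</title>')

print(display_text(plain))   # Microsoft <MSFT.O> rises
print(display_text(marked))  # Microsoft rises
```

With the type declared, the publisher decides whether angle brackets are markup or data, and the consumer no longer has to guess.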

So, if you are a tool vendor and would like a little more structure, rigor, and reproducibility, Atom might be a good choice.  But if you choose to hold back until Atom is done, that's OK too.

However, if you want to do something quick and dirty in RSS 2.0, go for it.  Guilt free.  It will get you up and running quickly.

The key takeaway here is to beware of anybody who preaches one true format or one size fits all.  Each format has its strengths.  And none of them are going away any time soon.

Meanwhile, you can help by spreading the word.  The word is détente.  RSS 1.0 has a reason to exist.  RSS 2.0 has a reason to exist.  And Atom has a reason to exist.

And if anybody tells you differently, and won't listen when you suggest détente, take Brent's suggestion and make use of the handy Unsubscribe button.  That's what it is there for.