intertwingly

It’s just data

Preserving Identity


Mark Pilgrim's Identifying Atom article indirectly makes three assertions about what would be ideal in a syndication protocol with respect to ids, which I will paraphrase thus:

One thing that is true of all current versions of RSS and Atom is that #2 and #3 are underspecified.

No one should be surprised that feeds today are syndicated.  Or aggregated.  Or that the results of syndication and aggregation are themselves published.

However, given the current underspecification of ids, a number of "planet" sites (e.g., Apache, Debian, Gnome, Java, LISP, Mozilla, PHP, Python, RDF, Sun) copy content and summaries from feed entries but do NOT preserve identity.

Nor do query sites, like Feedster, preserve ids.

The inevitable result: people who subscribe to these feeds will see duplicate entries.  And aggregator authors will get complaints.

The solution: spec text that conveys the requirement that ids must be preserved if an entry is relocated, migrated, syndicated, republished, exported or imported.

Note: none of this needs to wait on Atom becoming final.  People who build sites that aggregate content from various sites should consider preserving ids if they are present in the source feed.  Perhaps the RSS Advisory Board should consider making a similar recommendation.
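To make the recommendation concrete, here is a minimal sketch of what preserving an id could look like inside an aggregator. It assumes entries have already been parsed into plain dictionaries; the function name republish_entry and the keys "id", "title", "summary" and "link" are hypothetical and not taken from any particular feed library.

```python
def republish_entry(source_entry, planet_link):
    """Copy a parsed entry into an aggregated feed without touching its id.

    source_entry is a hypothetical dict produced by whatever feed parser
    the aggregator uses; planet_link is the aggregator's own URL for it.
    """
    republished = {
        "title": source_entry.get("title", ""),
        "summary": source_entry.get("summary", ""),
        # The link may legitimately change to point at the aggregator.
        "link": planet_link,
    }
    if "id" in source_entry:
        # Copied character by character: no trimming, no case folding,
        # no URI normalization.
        republished["id"] = source_entry["id"]
    return republished
```

The link is free to change when an entry is republished; the id is the one field that is carried over untouched, and only when the source feed supplied one.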

Comparison

The topic of comparison is secondary to all this but important.  Lacking any other guidance in specifications, producers need to be aware that consumers will be free to perform any of the comparison methods defined in RFC 2396bis in order to lower the risk of false negatives.

If all programmers were angels, specifying a character by character comparison would be sufficient.  Unfortunately, history has shown that not everybody reads specs carefully, and a number of existing libraries are a wee bit too helpful.  This would make such a requirement a bit fragile.  So it might be worthwhile to consider a design choice that makes things more resilient in the face of such deviations.
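As a rough illustration of character by character comparison, the sketch below treats ids as opaque strings and uses them to drop duplicate entries. The helper names same_entry and drop_duplicates, and the dictionary shape of an entry, are assumptions made for this example only.

```python
def same_entry(id_a, id_b):
    """Compare two ids character by character, with no normalization."""
    return id_a == id_b


def drop_duplicates(entries):
    """Keep the first entry seen for each id; entries without an id are kept."""
    seen = set()
    unique = []
    for entry in entries:
        entry_id = entry.get("id")
        if entry_id is not None:
            if entry_id in seen:
                continue  # an entry with exactly this id was already kept
            seen.add(entry_id)
        unique.append(entry)
    return unique
```

Under this comparison, "http://Example.org/entry/1" and "http://example.org/entry/1" are two different ids. That is exactly why an id altered in transit by an overly helpful library shows up to subscribers as a duplicate.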

The initial read on consensus was that requiring canonical ids was effective without being overly burdensome.  In a later read on consensus, the requirement for canonicalization was softened to a recommendation.  Even this is still provisional; it could change again.

In any case, what does this mean?  To most people, nothing.  The RFC 2396bis folks were pretty smart and picked a set of rules that pretty much everybody on the planet is following anyway.  But if you do happen to pick an id that is not canonical, your feed will be fine.  The only problem that is likely to occur is if one of those planet sites uses a library which is too helpful.  For that reason, the Feed Validator will be updated to provide a warning - just a warning, not an error - if an id is not canonical.  This warning will be linked to a help page which will indicate that if you are copying an id which is not canonical, you are doing the right thing by preserving the id from the source feed character by character.  Only if you are generating new ids should you be concerned about canonicalization.  Again, this is not likely to affect very many people.
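For a sense of what "not canonical" can mean in practice, here is a hedged sketch of a checker covering a few of the normalizations described in RFC 2396bis: lowercase scheme and host, no spelled-out default port, no dot segments, and uppercase hex with no needless percent-encoding. It is not the Feed Validator's actual code. The function name canonical_warnings, the warning strings, and the choice of checks are assumptions for illustration, and the list of checks is not exhaustive.

```python
import re

# Generic URI splitter, as given in Appendix B of the URI RFC.
URI_PARTS = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
PERCENT = re.compile(r"%([0-9A-Fa-f]{2})")
DEFAULT_PORTS = {"http": "80", "https": "443"}
UNRESERVED = set(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~"
)


def canonical_warnings(entry_id):
    """Return a list of reasons why entry_id does not look canonical."""
    warnings = []
    match = URI_PARTS.match(entry_id)  # every part is optional, so this matches
    scheme = match.group(2) or ""
    authority = match.group(4)  # None when the id has no "//authority" part
    path = match.group(5)

    if scheme != scheme.lower():
        warnings.append("scheme is not lowercase")

    if authority is not None:
        hostport = authority.rpartition("@")[2]  # drop any userinfo
        host, port = hostport, None
        if ":" in hostport and not hostport.endswith("]"):
            host, _, port = hostport.rpartition(":")
        # Ignore percent-encoded octets when checking the host's case.
        bare_host = PERCENT.sub("", host)
        if bare_host != bare_host.lower():
            warnings.append("host is not lowercase")
        if port == "":
            warnings.append("empty port")
        elif port is not None and port == DEFAULT_PORTS.get(scheme.lower()):
            warnings.append("default port is spelled out")
        if path == "" and scheme.lower() in DEFAULT_PORTS:
            warnings.append("empty path should be '/'")

    if path.startswith("/") and any(s in (".", "..") for s in path.split("/")):
        warnings.append("path contains dot segments")

    for hex_pair in PERCENT.findall(entry_id):
        if hex_pair != hex_pair.upper():
            warnings.append("percent-encoding uses lowercase hex")
        if chr(int(hex_pair, 16)) in UNRESERVED:
            warnings.append("unreserved character is percent-encoded")

    return warnings
```

An id such as "http://example.org/entry/1" or "tag:example.org,2004:Entry/42" yields no warnings here, while "HTTP://Example.org:80/a/../b/%7euser" yields several. A check like this belongs where new ids are minted or validated; an id copied from a source feed is left alone even if it would trigger warnings.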

Summary

And just to repeat for emphasis: if you are syndicating content, please preserve the identity of the entries.  When comparing ids, please do it by comparing character by character.  When copying ids, please do it by copying character by character.

Only when you are generating new ids (just ids, not links or HTML) should you consider normalization.  If you don't normalize, and everybody follows the rules, things will still work.  But if you do normalize and somebody's library routine changes your URI in any way, the Feed Validator will provide a warning on their feed.