It’s just data

Preserving Identity

Mark Pilgrim's Identifying Atom article indirectly makes three assertions about what would be ideal in a syndication protocol with respect to ids, which I will paraphrase thus:

One thing that is true of all current versions of RSS and Atom is that #2 and #3 are underspecified.

No one should be surprised that feeds today are syndicated.  Or aggregated.  Or that the results of syndication and aggregation are themselves published.

However, given the current underspecification of ids, a number of "planet" sites (e.g., Apache, Debian, Gnome, Java, LISP, Mozilla, PHP, Python, RDF, Sun) copy content and summaries from feed entries, but do NOT preserve identity.

Nor do query sites, like Feedster, preserve ids.

The inevitable result: people who subscribe to these feeds will see duplicate entries.  And aggregator authors will get complaints.

The solution: spec text that conveys the requirement that ids must be preserved if an entry is relocated, migrated, syndicated, republished, exported or imported.

Note: none of this needs to wait on Atom becoming final.  People who build sites that aggregate content from various sites should consider preserving ids if they are present in the source feed.  Perhaps the RSS Advisory Board should consider making a similar recommendation.
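As a rough illustration of what preserving an id could look like in an aggregator that republishes entries as RSS 2.0, here is a minimal Python sketch. The function name republish_item and the example.org values are made up for this example, and it assumes the source id is not itself a permalink, which is why the guid is marked isPermaLink="false".

```python
import xml.etree.ElementTree as ET


def republish_item(source_id, title, link):
    """Build an RSS 2.0 <item> that carries the source entry's id unchanged.

    Hypothetical helper: the name and argument list are illustrative only.
    """
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    # Copy the id exactly as found in the source feed: no trimming,
    # no case folding, no round trip through a URI class.
    guid = ET.SubElement(item, "guid", isPermaLink="false")
    guid.text = source_id
    ET.SubElement(item, "link").text = link
    return item


item = republish_item(
    "tag:example.org,2004:1831",       # id taken verbatim from the source feed
    "Example entry",
    "http://example.org/blog/1831",
)
print(ET.tostring(item, encoding="unicode"))
```

The only point of the sketch is that the id travels as an opaque string; everything else about the item is up to the aggregator.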

Comparison

The topic of comparison is secondary to all this but important.  Lacking any other guidance in specifications, producers need to be aware that consumers will be free to perform any of the comparison methods defined in RFC 2396bis in order to lower the risk of false negatives.

If all programmers were Angels, specifying a character by character comparison would be sufficient.  Unfortunately, history has shown that not everybody reads specs carefully, and a number of existing libraries are a wee bit too helpful.  This would make such a requirement a bit fragile.  So it might be worthwhile to consider a design choice that makes things more resilient in the face of such deviations.

The initial read on consensus was that requiring canonical ids was effective without being overly burdensome.  A later read on consensus softened the requirement for canonicalization to a recommendation.  Even this is still provisional; it could change again.

In any case, what does this mean?  To most people, nothing.  The RFC 2396bis folks were pretty smart and picked a set of rules that pretty much everybody on the planet is following anyway.  But if you do happen to pick an id that is not canonical, your feed will be fine.  The only problem that is likely to occur is if one of those planet sites uses a library which is too helpful.  For that reason, the Feed Validator will be updated to provide a warning - just a warning, not an error - if an id is not canonical.  This warning will be linked to a help page which will indicate that if you are copying an id which is not canonical, you are doing the right thing by preserving the id from the source feed character by character.  Only if you are generating new ids should you be concerned about canonicalization.  Again, this is not likely to affect very many people.
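To give a feel for what such a warning might check, here is a small Python sketch. It is not the Feed Validator's code and it covers only a few of the normalizations described in RFC 2396bis (scheme case, host case, an explicit default port, an empty path); the function name canonical_warnings and the DEFAULT_PORTS table are assumptions made for this example.

```python
from urllib.parse import urlsplit

# Assumed table of default ports; only the schemes used in this sketch.
DEFAULT_PORTS = {"http": 80, "https": 443}


def canonical_warnings(entry_id):
    """Return a list of reasons why entry_id does not look canonical.

    Partial, illustrative check only. An empty list means none of the
    checks below fired, not that the id is fully normalized.
    """
    warnings = []

    # urlsplit() lowercases the scheme, so inspect the raw text instead.
    raw_scheme, sep, _ = entry_id.partition(":")
    if sep and raw_scheme != raw_scheme.lower():
        warnings.append("scheme is not lowercase")

    parts = urlsplit(entry_id)

    # .hostname is lowercased by urllib; .netloc keeps the original case.
    if parts.hostname and parts.hostname not in parts.netloc:
        warnings.append("host is not lowercase")

    try:
        port = parts.port
    except ValueError:  # malformed port: leave it to other checks
        port = None
    if port is not None and port == DEFAULT_PORTS.get(parts.scheme):
        warnings.append("default port is explicit")

    if parts.netloc and parts.path == "":
        warnings.append("empty path; canonical form ends the host with '/'")

    return warnings


for candidate in ("http://blah.invalid:80/", "http://blah.invalid/"):
    print(candidate, canonical_warnings(candidate))
```

Note that the function only reports; it never rewrites the id. That matches the advice above: ids copied from a source feed are left alone, and the check matters only when minting new ones.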

Summary

And just to repeat for emphasis: if you are syndicating content, please preserve the identity of the entries.  When comparing ids, please do it by comparing character by character.  When copying ids, please do it by copying character by character.

Only when you are generating new ids (just ids, not links, or html) should you consider normalization.   If you don't normalize, and everybody follows the rules, things will still work.  But if you do normalize and somebody's library routine changes your URI in any way, the Feed Validator will provide a warning on their feed.
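For aggregator authors, the comparison rule boils down to treating ids as opaque strings. The following Python sketch shows one way that could look when merging several feeds; the merge_entries name and the dictionary layout of an entry are assumptions for illustration, not any particular aggregator's data model.

```python
def merge_entries(feeds):
    """Merge entries from several feeds, keeping the first entry seen per id.

    Ids are compared as plain strings, character by character; nothing is
    parsed or normalized along the way.
    """
    seen = set()
    merged = []
    for feed in feeds:
        for entry in feed:
            entry_id = entry["id"]
            if entry_id in seen:
                continue  # same id, same entry: skip the duplicate
            seen.add(entry_id)
            merged.append(entry)
    return merged


# Hypothetical data: the same entry arrives directly and via a planet site.
source_feed = [{"id": "tag:example.org,2004:1831", "title": "Preserving Identity"}]
planet_feed = [{"id": "tag:example.org,2004:1831", "title": "Sam: Preserving Identity"}]

print(len(merge_entries([source_feed, planet_feed])))
```

With the id preserved, the subscriber sees one entry. If the planet site had replaced or altered the id, the same merge would yield two.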


I remember writing about parts of what you've written here on the Atom wiki. I think I was getting a bit ahead of where everyone else was, though, since everyone was still bickering about whether comments and trackbacks are entries at the time. I still think that feeds which aggregate other feeds are very important, and equally important is that clients can identify the relationship between the authoritative entry and the syndicated entry. This is more involved than just keeping the ids intact. Not all republishing feeds are or will be controllable; some will give you stuff from a fixed set of sources with no control whatsoever, and some might even provide only selected entries from hundreds of sources categorised by humans or software. The upshot of this is that I might end up subscribing to two feeds which chuck me the same entry.

This is alright if they are identical, but as soon as one is different I need to know which one is most authoritative so I can discard the others -- or rather have my software do so for me. The easy option is to somehow flag republished entries as non-authoritative, so authoritativeness is a boolean. Some might say it's valuable to have an amount of authoritativeness, but I'm not sure that's all that useful. As a side-feature, it'd be nice to be able to get a URI at which the authoritative version of an unauthoritative entry can be found, although that will of course be troublesome since entries have a habit of vanishing out of feeds after a while.

I can't actually remember where I wrote about all this in the wiki, since it was a long time ago. Still, I feel strongly about this distribution model as it's obvious that we need to move away from the model where thousands of clients all pull data from one source. The cascading aggregation model is a lot more like USENET's model, which was a good one.

As a last-ditch attempt to remain on-topic, I think I have to say that the only reliable way to compare URIs is by exact string matching. Sure, there'll be little oddities that spring up here and there, but if people don't cater to them I'd hope they'd be squashed pretty quickly, and if they do cater to them it's not a massive problem, as I would expect the incidence of someone publishing an ID of htttp://blah.invalid:80/ and a separate one of htttp://blah.invalid/ is very slim. The principle of being strict in what one produces and liberal in what one accepts seems to apply here. Specify the ideal, but always expect that people will screw it up and think about how much damage it'll do when they do. (not a great deal, in this case)

(I made up a fun new protocol because your comment mangler mangled my HTTP URIs)

Posted by Martin Atkins at

the incidence of someone publishing an ID of htttp://blah.invalid:80/ and a separate one of htttp://blah.invalid/ is very slim.

IF they are published by the same person, agreed.

Note: search engines normalize all the time.  Example.

Posted by Sam Ruby at

Isofarro : Preserving Identity - Sam: Search engines normalise all the time....

Excerpt from HotLinks - Level 1 at

Preserving Identity. Mark Pilgrim's Identifying Atom article indirectly makes three assertions about what would be ideal in a syndication protocol with respect to ids, which I will paraphrase thus: IDs are mandatory the semantics on how/when IDs are...

Excerpt from Tralla.org : Search : Debian at

I think there's a little discrepancy as far as RSS 1.0 is concerned - that spec says to use URIs (which would conflict with your char-by-char comparison), but since that spec's release RDF has moved to using URI References, and as the specs say: "Two RDF URI references are equal if and only if they compare as equal, character by character, as Unicode strings."

[link]

Posted by Danny at

Danny, your link refers to the abstract syntax of RDF - separate and distinct from any concrete syntax (like RDF/XML).  In fact there is even an example of the distinction between the two: the abstract syntax does not permit relative URI references, whereas the concrete syntax does.

Without trying it myself, I'm fairly confident that any .Net RDF/XML parser will conform to the abstract syntax by canonicalizing the concrete syntax.

Posted by Sam Ruby at

RE: Preserving Identity

Sam,
  Separate and different from the concrete syntax you say? Interesting considering that the concrete syntax spec links to that definition of URI reference when describing how rdf:ID works. See [link] and [link] for details.

Message from Dare Obasanjo at

On a quick inspection of the RSS feeds of Planet Gnome, Debian and Apache it looks like they are maintaining the <guid>/<link> elements of the aggregated items.  Is there something more that they should be doing?

Posted by James Henstridge at

"the incidence of someone publishing an ID of htttp://blah.invalid:80/ and a separate one of htttp://blah.invalid/ is very slim."

Yahoo's news feeds and news search feeds have links to the same (Yahoo news) articles, but with different URLs that would normalize to the same thing.

Posted by scott reynen at

Dare, one would expect the definition of a concrete syntax to make a reference to the abstract syntax - but that does not make them identical.  And the real question is whether it is clear to every implementer (not just Angels, but every implementer) that URIs are not meant to be normalized.  Any RDF implementation which uses the System.Uri class in any way gets normalization "for free".

James, my ids are not preserved on Planet Apache.

Scott: Thanks!

Posted by Sam Ruby at

RE: Preserving Identity

Sam,
The concrete syntax spec makes a link to the spec defining the concepts behind [not just the abstract syntax of] RDF. It seems quite clear to me and everyone who I've worked with in the other place where URIs are specced this way (XML namespaces) that you are supposed to compare URIs as they appear in the source document.

The only difference between Atom and RDF or XML namespaces is that a lot more average developers will be writing code that processes Atom than developers who've had to write XML or RDF parsers in the past. Such people probably won't read the spec whereas anybody implementing an XML or RDF parser probably will.

For those guys there might be edge cases where canonicalizing URIs bites them on the butt [although the only ones I can think of are contrived unless you involve relative URIs] but their question is fairly easy to answer. Use the string class, not the URI class, when processing Atom identifiers. It's what the folks implementing XML and RDF parsers have had to do as well. Atom developers shouldn't be any different, plus it adds consistency to the Web architecture.

Message from Dare Obasanjo at


Dare: do you know of any .Net RDF parser?  Do any of them make any use of the System.Uri class?

I don't know about you, but when such things happen, I would like to be able to do more than smugly point to the sentence in the spec that clearly spells out how horribly broken their software is.  I'd like to make producers aware of the tradition of search engines and system URI libraries (both of which are very much vibrant parts of the web) to be slightly overzealous in their quest to eliminate false negatives.

Not with a mandate, a shall, or a MUST.  But with a recommendation that this is something that they might want to be aware of.

Posted by Sam Ruby at

RE: Preserving Identity

Welcome to the world of standards development. As someone who's had to implement all sorts of unnatural behavior because that's what the specs say, or told some customer their app is busted because of some brokenness in some W3C spec, I can feel where you are coming from.

However it seems you are optimizing for an edge case. Don't let edge cases dominate your design. It typically leads to unnecessary complexity and overengineering.

Message from Dare Obasanjo at


Dare: by your silence, the first thing I am going to assume is that you are OK with the requirement that IDs are mandatory.  And with the requirement that IDs must be preserved - character by character - if an entry is relocated, migrated, syndicated, republished, exported or imported.  And with the explicit requirement that IDs are to be compared on a character by character basis.

Your only quibble seems to be on a warning.  To be produced by the Feed Validator.  On what you openly admit is an "edge case".  A warning that targets people who "probably won't read the spec".  If it helps, I can promise to make sure that the help page directs people to use whatever string type they can find instead of whatever URI classes which might be available.

Scott has identified Yahoo! feeds as having this problem.  I've also verified that the URI class in a popular platform is what I refer to as an "attractive nuisance".  Finally, we are talking about a suggestion that requires absolutely no changes to RSSBandit.

It's just a warning.  For a real problem.  That requires no changes to RSSBandit to implement, but might reduce the number of duplicate entries that RSSBandit users would see with real feeds that exist today.

Posted by Sam Ruby at

RE: Preserving Identity

Sam,
  I am being an annoying git. The feed validator producing a warning on non-canonical URIs is fine, and as long as Atom doesn't require canonicalization of URIs there isn't anything I have beef with.

My apologies. :)

Message from Dare Obasanjo at


thud

Posted by Sam Ruby at

Sam: it might be that Planet Apache is using an old version of the planet code.  I just downloaded the latest version, and added your atom feed as a test.  The resulting generated rss20 feed included the following:

<item>
        <title>Sam Ruby: Preserving Identity</title>
        <guid>tag:intertwingly.net:2004:1831</guid>
        <link>http://www.intertwingly.net/blog/2004/08/25/Preserving-Identity</link>
        ...
</item>

I guess that if Planet Apache was updated to a newer version of the aggregator code, it would produce the same output.

Posted by James Henstridge at

James, that's pretty good!  Unfortunately, you would also need to specify isPermaLink="false" in order to be valid RSS 2.0.

Posted by Sam Ruby at

Good catch.  I suppose it should be copying over the isPermaLink value for rss2 feeds, and setting it to false for IDs found in atom feeds.  It probably wouldn't be too difficult to do something like that.

Posted by James Henstridge at

James, you might also want to look for rdf:about attributes in RSS 1.0 feeds.  And even in some RSS 2.0 feeds.

Posted by Sam Ruby at

And don't forget to look for atom:id elements in RSS feeds.  Of all versions.  Some people really like to embed Atom in RSS.

Excuse me, I'm going to go sit in the corner and mutter about how wellformedness will save us.

Posted by Mark at

Blood, Sweat and URIs

It’s been proposed that Atom format mandates or recommends that publishers use canonical URIs. If this is accepted, then all the consumers have to do to get fairly accurate URI comparison is use string matching (i.e. char-by-char in the same...

Excerpt from Planet RDF at
