It’s just data

HTML to Atom

Edward O’Connor: Hixie’s not the sort of guy to leave things underspecified, so HTML5 defines (in excruciating detail) how to convert an HTML document to Atom, even when the HTML document in question is, shall we say, less than ideal.

The root of the problem is that the Atom spec specifies:

The content of an atom:id  element MUST be created in a way that assures uniqueness

... and a willingness to violate that specification.

This has spawned a lengthy thread.  In that thread, I requested (one, two, three times) that:

I suggest that you actually test out how common feed aggregators react when they are presented with the same feed differing only in the entry ids.

Given that I have seen no evidence that this request is going to be taken up any time soon, and presuming that some of the people who are or were involved with, or even just use, a tool that consumes feeds may still be reading this weblog, I would like to make a lazyweb request: is anybody out there willing to try the test I described above and post the results anywhere they like, and then leave a pointer to that information either in public_html or as a comment on this weblog entry?


Even when the entry IDs are identical, or even when the feeds are identical (apart from a 301 redirect), Google Reader treats those entries as distinct items. It does the same for entries marked by “People you follow” that are also in your subscription list. I remember this being one of the bigger Atom feed debates, and it was promoted as a very cool feature. At least Google Reader has not implemented it, and I must say it does not bother me much.

Posted by Anne van Kesteren at

Anne: either you misunderstood what I was suggesting, or (more likely) I was unclear.

I was not intending to talk about different feeds with the same ID.  I was intending to talk about the same feed with different IDs.

To make this clearer: consider a CGI script that implements this algorithm.  Now imagine pointing a feed reader at that CGI, and have it repeatedly fetch the same number of entries with the identical entry content at the same URI, etc., where the only difference is the entry ids produced.

What would you expect to occur?  And with any of the existing user agents that process feeds, what actually does occur?
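The script described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the HTML5 algorithm and not the script used in this thread: `render_feed`, `entry_ids`, the sample `ENTRIES`, and the example.org URLs are all hypothetical, and a random UUID simply stands in for whatever causes the ids to change between fetches.

```python
import uuid
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical sample data; the titles and links are placeholders.
ENTRIES = [
    ("First post", "http://example.org/posts/1"),
    ("Second post", "http://example.org/posts/2"),
]


def render_feed(entries, stable_ids):
    """Return an Atom document whose body is fixed except for entry ids.

    With stable_ids=True each entry id is derived from its link, so every
    fetch yields the same bytes. With stable_ids=False each fetch gets
    fresh ids (an assumption standing in for any unstable id scheme).
    """
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        '<feed xmlns="%s">' % ATOM_NS,
        "  <title>Example feed</title>",
        "  <id>http://example.org/feed</id>",
        "  <updated>2010-01-01T00:00:00Z</updated>",
        "  <author><name>Example author</name></author>",
    ]
    for title, link in entries:
        entry_id = link if stable_ids else uuid.uuid4().urn
        lines += [
            "  <entry>",
            "    <title>%s</title>" % escape(title),
            '    <link href="%s"/>' % escape(link, {'"': "&quot;"}),
            "    <id>%s</id>" % escape(entry_id),
            "    <updated>2010-01-01T00:00:00Z</updated>",
            "  </entry>",
        ]
    lines.append("</feed>")
    return "\n".join(lines)


def entry_ids(feed_xml):
    """Extract the atom:id of every entry, in document order."""
    root = ET.fromstring(feed_xml.encode("utf-8"))
    return [e.findtext("{%s}id" % ATOM_NS)
            for e in root.findall("{%s}entry" % ATOM_NS)]
```

Served from a single URI, the `stable_ids=False` variant gives exactly the scenario in question: titles, links, and timestamps never change, and only the ids differ from one fetch to the next.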

Posted by Sam Ruby at

Given my experience with Google Reader so far I would expect the ID to be ignored and the feeds to be treated as separate. Much like the same feed with the same IDs is being treated as separate. And the same entries with the same IDs are being treated as separate.

IDs are somewhat nice in theory, but in practice you rarely need them and my feed reader of choice seems to ignore them altogether.

Posted by Anne van Kesteren at

the feeds

I still think we are talking about quite different things.  Same feed.  Same URI.  Refetched immediately.  The only change to the body of the response is in the entry ids.  Given that scenario I don’t see the combination of ignoring IDs and treating such entries as separate as being a plausible outcome.

I encourage you to try this scenario with your reader of choice and report what you see.

Posted by Sam Ruby at

I just tried with RSS Bandit (varying just the entry IDs, nothing else), and got all entries duplicated.

Posted by Julian Reschke at

Hi Sam -
  In an effort to ensure people are running the test you actually mean to have tested, perhaps you could supply two CGI scripts for the lazyweb to test against: one producing conforming output and a second producing the output you are asking to be tested.  That way people could add both to their reader and then report back in what ways (if any) they behave differently.

Posted by Kevin H at

Two feeds:

Have at it!

Posted by Sam Ruby at

I haven’t tested this issue with Atom feeds, but I have tested RSS’s guid element (which I don’t think it’s unreasonable to assume would be treated similarly in many cases).

Given two entries with different guids, but everything else the same, 13 out of 20 of the aggregators tested would consider those two entries to be different.

Similarly, given two entries with identical guids, but with other fields different, 16 out of the 20 aggregators treated those items as duplicates (the older entry may or may not be updated depending on the feed reader and other factors).

Both of the above situations would be problematic given the algorithm currently defined in the HTML5 spec (assuming, of course, that Atom ids are treated similarly).

Note that these tests were done with two requests of the feed, i.e. the feed is read once, the entries are updated, and then the feed is read again. Testing with a single request of a feed with duplicate ids would produce different results in many cases.
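The behaviour most of the tested aggregators showed can be modelled with a store keyed on the entry id. The following is a minimal sketch of that idea, not the code of any real feed reader: `IdKeyedStore`, `poll`, and the two entry generators are hypothetical names, and the ids and titles are made up.

```python
import itertools


class IdKeyedStore:
    """Toy aggregator that identifies an entry solely by its id."""

    def __init__(self):
        self.items = {}  # maps entry id -> most recently seen entry

    def ingest(self, entries):
        for entry in entries:
            # Same id: treated as an update of the existing item.
            # New id: treated as a new item, whatever the content.
            self.items[entry["id"]] = entry

    def __len__(self):
        return len(self.items)


def poll(store, fetch, times):
    """Simulate reading the feed `times` times in succession."""
    for _ in range(times):
        store.ingest(fetch())


_counter = itertools.count(1)


def stable_entries():
    # Identical id on every fetch.
    return [{"id": "http://example.org/posts/1", "title": "Only post"}]


def unstable_entries():
    # Identical content, but a fresh id on every fetch.
    return [{"id": "tag:example.org,2010:%d" % next(_counter),
             "title": "Only post"}]


stable_store, unstable_store = IdKeyedStore(), IdKeyedStore()
poll(stable_store, stable_entries, 3)
poll(unstable_store, unstable_entries, 3)
print(len(stable_store), len(unstable_store))  # prints "1 3"
```

Under this model a feed whose ids change on every request gains one duplicate per poll, while two genuinely different entries that share an id collapse into a single item.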

I’m still curious to know who this algorithm is aimed at. I got the impression from Julian that it might be something that a feed reader would use when a feed wasn’t available for a web page. But speaking as the author of a feed reader, while I might implement similar functionality, I would never in a million years use that form of algorithm, even if it were bug free.

Is there another more sensible use case that I’m unaware of?

Posted by James Holderness at

I haven’t tested this issue with Atom feeds, but I have tested RSS’s guid element (which I don’t think it’s unreasonable to assume would be treated similarly in many cases).

Agreed.

Is there another more sensible use case that I’m unaware of?

Not that I’m aware of.  The closest I have seen is here.  A thread on the subject has provided little in the way of elaboration.

Posted by Sam Ruby at

Ah yeah. The ID does contribute to entry uniqueness within the same feed. (In Google Reader anyway.)

Posted by Anne van Kesteren at

Google Reader: example.stable: shows just one post.  example.unstable: shows 22 posts now (having added it three days ago).

Posted by Keith Wansbrough at

Since Anne has reported on Google’s feed reader, someone else needs to report on Opera’s: each time one presses the update button, a new “copy” of the unstable post is loaded.

Posted by Leif Halvard Silli at

The closest I have seen is here.  A thread on the subject has provided little in the way of elaboration.

Thanks for the links. I’ve actually found that thread quite illuminating. Two messages that were of particular interest:

Ian Hickson: If you prefer a process-based argument: we can’t progress past CR if we can’t find two interoperable implementations of every feature.

Ian Hickson: To put it another way: the goal here is that if someone wants to get their HTML file turned into a feed, they have a set of steps they can follow that reliably give a predictable result, so that they can use off-the-shelf software to do it and can later change to different software and get the same result.

That gives me the impression that all that is required for the algorithm to be considered interoperable is for two independent implementations to produce the same output. That seems ridiculous when you consider that the output is ultimately only useful if it can be parsed successfully by an Atom client. Yet it seems that Hickson doesn’t consider interoperability with clients to be of any importance.

Is that really the way the W3C works? If there were two independent implementations that produced the same output, regardless of whether it was valid Atom, or whether it could be processed by an Atom parser, would that be considered sufficient to have satisfied the interoperability requirements?

Also, to add to your test results: Snarfer shows one item in the example.stable feed and multiple items in the example.unstable feed.

Posted by James Holderness at

Is that really the way the W3C works?

No.

It is instructive to compare the differences between the ways the WHATWG and W3C work.  That and the editor’s response and W3C Tracker Issue.

Disclosure: I’m one of the co-chairs of the HTML WG.  If you have any input on this, or any subject related to HTML5, I encourage you to send it to the public-html mailing list.

Posted by Sam Ruby at

It is instructive to compare the differences between the ways the WHATWG and W3C work.

I don’t have an issue with that part of the W3C process. From what I’ve seen, I think that’s working reasonably well given the WHATWG’s history.

What I was really looking for was the W3C equivalent of the IETF’s standards process (RFC2026), which appears more or less covered by chapter 7 of the above-mentioned W3C process document.

Unfortunately that W3C document isn’t quite as clear as the IETF’s as to what exactly is meant by “interoperable”, which may be part of the problem. However, I see this was discussed at some length in www-tag a couple of years ago, so I’ll leave it at that.

If you have any input on this

My only input would be to remove the algorithm from the spec, and there’s already a change proposal for that. Anything else is just a waste of time IMO.

Posted by James Holderness at
