AggregateFeeds - Atom Wiki

What are Aggregate Feeds?
Why aggregate feeds?

What does Atom need for this?

Problems
Implementation Possibilities
Discussion?

What are Aggregate Feeds?

An aggregate feed consists wholly or partially of entries which are reproduced from other feeds, or indeed obtained from multiple non-feed sources. An example would be the Daily Entries feed from Javablogs.

Why aggregate feeds?

The most obvious and useful purpose for aggregate feeds is to facilitate the creation and use of Atom syndication proxies (see SuperAggregator). Such a proxy collects and stores multiple feeds on behalf of several, or indeed several thousand, people and produces a single feed for each user containing entries from all of the feeds that user has subscribed to, probably restricted to entries published since the last retrieval. The backend mechanism for this is outside the scope of this discussion.

This can decrease load on the source feeds. If there is some kind of editor, whether human or software, at the site running the proxy, it can also specialize feeds by aggregating only a subset of the entries from a given feed ("Give me articles about weblogs from anywhere").

The site Moreover currently produces aggregate RSS feeds based on its slurps of weblogs and news sites. Users cannot choose what goes into these feeds; instead they pick a broad category, and Moreover's editors (probably just some software) decide what is relevant to each. Moreover's RSS feeds use the RSS description field to indicate the source of the entry, which is far from ideal.

LiveJournal also produces aggregate feeds in the form of friends pages. The system currently doesn't provide these in a syndicatable format, but users can make them so with a few style hacks. If there is good accommodation for aggregate feeds in Atom, LiveJournal could potentially serve as an Atom aggregation proxy, since it already supports syndication of RSS to users and will most probably support Atom too. Of course, enough data would have to be retained to reproduce each entry completely.

What does Atom need for this?

For this to work, it must be possible to identify a feed at entry level. In order to do this unambiguously, possibly yet another unique ID will be needed. Non-aggregate feeds must then give their ID so that when aggregate feeds include data from that feed the originating feed can be identified for the purposes of client-side filtering etc.

Regardless of the mechanism and ambiguity of the specification, it will certainly be necessary to give some indication of the source feed at entry level.
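As a rough illustration of entry-level source identification, the sketch below merges several feeds and stamps each entry with the ID of the feed it came from, so a client can filter by origin afterwards. The field name `source_feed_id`, the function names, and the `feed:a`-style IDs are all hypothetical; nothing here is taken from an Atom draft.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Entry:
    entry_id: str        # entry-level unique ID
    title: str
    source_feed_id: str  # hypothetical field: ID of the originating feed


def aggregate(feeds: dict[str, Iterable[tuple[str, str]]]) -> list[Entry]:
    """Merge entries from several feeds, tagging each with its source feed ID.

    `feeds` maps a feed ID to (entry_id, title) pairs.
    """
    merged = []
    for feed_id, entries in feeds.items():
        for entry_id, title in entries:
            merged.append(Entry(entry_id, title, source_feed_id=feed_id))
    return merged


def from_feed(entries: Iterable[Entry], feed_id: str) -> list[Entry]:
    """Client-side filter: keep only the entries that originated in `feed_id`."""
    return [e for e in entries if e.source_feed_id == feed_id]
```

A client holding the aggregate can then call `from_feed(agg, "feed:b")` to recover just the entries that came from one source, which is the kind of filtering the feed ID is meant to enable.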

Should the feed-level identification still be required in an aggregate feed? Should an aggregate feed be required to identify itself as such, and possibly thus use a slightly different version of the spec? Or, finally, should all of this be handled in a standardized extension in a different namespace? (See discussion below)

Problems

The major problem with implementing this mechanism is dealing with future versions of Atom. What should a proxy for Atom 1.0 do when confronted with a feed in a hypothetical Atom 2.0?

If it just reads the elements it's familiar with, there will be a loss of data compared to the original feed.
If it reads and reproduces all of the elements regardless of whether it knows them or not, the aggregate feed will not be Atom 1.0 even though it claims to be.
It can refuse to deal with any feed with a version it does not know about. This seems like the most robust solution, but will cause frustration during version transitions as proxies play catch-up (or don't bother) learning about the new version.
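The three options above can be sketched as policies in a proxy. This is only an illustration: the policy names (`strip`, `passthrough`, `refuse`), the `KNOWN_ELEMENTS` set, and the dictionary representation of an entry are assumptions made for the example, not part of any Atom specification.

```python
KNOWN_VERSION = "1.0"
# Illustrative subset of element names; not the real list from any spec.
KNOWN_ELEMENTS = {"id", "title", "link", "modified"}


class UnsupportedVersion(Exception):
    """Raised when the proxy refuses a feed whose version it does not know."""


def proxy_entry(entry: dict, feed_version: str, policy: str) -> dict:
    """Return the entry as the proxy would republish it under `policy`."""
    if feed_version == KNOWN_VERSION:
        return dict(entry)
    if policy == "strip":
        # Option 1: keep only familiar elements; data is lost.
        return {k: v for k, v in entry.items() if k in KNOWN_ELEMENTS}
    if policy == "passthrough":
        # Option 2: reproduce everything; the output may no longer be valid 1.0.
        return dict(entry)
    if policy == "refuse":
        # Option 3: reject the feed outright.
        raise UnsupportedVersion(feed_version)
    raise ValueError(f"unknown policy: {policy}")
```

Writing the policies side by side makes the trade-off visible: only `refuse` guarantees that everything the proxy emits is both complete and valid for the version it claims.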

Implementation Possibilities

There has been discussion in AggregatorApi which defines aggregate feeds as a composite of a SuperAggregator and a feed producer.

Discussion?

[MartinAtkins, RefactorOk] Having feeds identify themselves as aggregate has the advantage that client software can then give precedence to the authoritative version if it has it, rather than letting the aggregated copy (which may have lost some information) clobber the original authoritative version. This assumes that there's some overlap between the feeds a user retrieves manually and that user's aggregates.

Maybe it would also be useful to be able to identify at entry level whether an entry is authoritative? LiveJournal, for example, can create guaranteed authoritative versions of its own entries in its own aggregate feeds, but stuff syndicated from elsewhere may not reflect the original feed completely accurately.

The authoritative/proxied relationship might also be important in dealing with edited entries, although the modified date/time should really be covering that one. It must be carefully specified what happens when a non-authoritative source pulls in a newer version (as per the modified time) of a given entry than was retrieved from the authoritative source.

[AsbjornUlsberg] I think this whole "am I an original or syndicated feed?" issue can be quite easily solved by allowing syndicating aggregators to plunge their alternative representation of the feed/entry into the feed/entry itself. The original URI of the feed will *always* be the original <link>. If the base of the URI in <link> is different from the base from which the entry was retrieved, then the entry isn't an original. If the two bases are the same, the entry is the original. The originating feed-service MUST NOT deliver anything other than the original entry when it is being queried. The original entry should also not be altered by the originating feed-service.
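A minimal sketch of the base comparison proposed here, assuming that "base" means the scheme and host of the URI (the proposal does not define it, so this reading is an assumption, as are the function names and example URLs):

```python
from urllib.parse import urlsplit


def uri_base(uri: str) -> tuple[str, str]:
    """Return (scheme, host[:port]) in lower case; one possible reading of 'base'."""
    parts = urlsplit(uri)
    return (parts.scheme.lower(), parts.netloc.lower())


def is_original(link_href: str, retrieved_from: str) -> bool:
    """True if the entry's <link> shares a base with the URL it was fetched from."""
    return uri_base(link_href) == uri_base(retrieved_from)
```

Under this reading, an entry linking to `http://example.org/blog/232` and fetched from `http://example.org/feed.atom` counts as original, while the same entry fetched from `http://proxy.example.net/u/feed.atom` does not. A site that serves its feeds from a different host than its entries would be misclassified, so a real definition of "base" would need more care.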

How syndicating aggregators should plunge their representation of the feed/entry into the feed/entry can be discussed, but one option is having a <link rel="copy" href="..." /> element. This is also my preferred way to announce alternative format representations of the feed, where a feed might be represented in HTML, SVG, etc., in addition to the original Echo format: <link rel="alternative" type="text/html" href=".../232.html" />, <link rel="alternative" type="application/svg+xml" href=".../232.svg" />.

MartinAtkins

AsbjornUlsberg

Getting an entry from a local server with good bandwidth is much better than getting it from an originating server that is connected via an ISDN line, but it just might be interesting to get the authoritative entry as well. Hence, I would like the end-point aggregator or end user to be able to choose where the entry/feed should be collected from. If any of the syndicating servers in the syndication chain hasn't attached a URL to its version of the entry/feed, then the entry/feed can't be collected from that server, which is a shame.

[MartinAtkins] Thinking about the UI in an end-user aggregator (which is collecting the aggregate feeds), I guess the program would display some kind of indication that the entry is non-authoritative and allow the user to request the original if desired, at which point the client could download the entire feed and see if this entry is referenced within. Assuming it is (one of the entries has a matching ID and is authoritative), the local data is updated. If it finds another non-authoritative version (if an aggregate feed is feeding off another aggregate feed), it should trace back until it gets to an authoritative version. If no authoritative entry can be found, just keep the non-authoritative entry with the most recent last-modified time.

Integrating the rest of the entries in the feeds retrieved during this process into the local data is a separate operation: "Get more from where this came from".
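The trace-back described above could look roughly like the following. It is a sketch under several assumptions: each non-authoritative copy carries a hypothetical `source_feed` URL pointing at the feed it was taken from, `fetch_feed` is a caller-supplied function returning that feed's entries, timestamps are ISO 8601 strings in UTC (so string comparison matches time order), and the `max_hops` limit is an added safeguard against loops that the discussion does not mention.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Entry:
    entry_id: str
    modified: str               # ISO 8601 UTC timestamp
    authoritative: bool
    source_feed: Optional[str]  # hypothetical: URL of the feed this copy came from


def resolve(entry: Entry,
            fetch_feed: Callable[[str], Iterable[Entry]],
            max_hops: int = 10) -> Entry:
    """Follow source feeds until an authoritative copy of `entry` is found.

    Falls back to the most recently modified non-authoritative copy seen.
    """
    if entry.authoritative:
        return entry
    best = current = entry
    seen = set()
    for _ in range(max_hops):
        url = current.source_feed
        if url is None or url in seen:
            break  # dead end, or an aggregate loop
        seen.add(url)
        match = next((e for e in fetch_feed(url)
                      if e.entry_id == entry.entry_id), None)
        if match is None:
            break  # the upstream feed no longer carries this entry
        if match.authoritative:
            return match
        if match.modified > best.modified:
            best = match
        current = match
    return best
```

The function only resolves the one requested entry; merging the other entries fetched along the way ("Get more from where this came from") is deliberately left out, in keeping with treating it as a separate operation.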