PossibleHTTPExtensionForEfficientFeedTransfer

Should we extend HTTP?

In summary: probably not (at least right now). Instead, let's look at an AggregatorApi to do the same job.

Proposal

[MartinAtkins : RefactorOk] Can we perhaps make an optional addition to the HTTP requests made by aggregators to indicate that an aggregator has current data as of a particular date? Then only new/changed entries can be delivered.

The header, which could be called something like 'X-Last-Polled' would be optional both for the client to send and the server to honour. Small sites may wish to trade bandwidth for the reduced CPU utilization of the feed being a static file served directly from disk or memory.

Clients should be prepared to re-recieve data they've already got, despite their indication that they already had it. This should "just work" anyway, since the repeats can just be interpreted as editing the entry to be the same as it already is. The aggregator should notice the last-modified time hasn't changed and thus not bother with the entry again.
Servers should be prepared to not get this header. When they do, they should just serve some arbitrary amount of entries they feel is sensible. This will most often happen because the client does not actually cache data locally, or just displays what the feed currently contains. It can also happen when an aggregator first subscribes to a feed, and just wants to grab as many current entries as the server will give it.

Whether the server sends edited entries again in response to the modified time being greater than the last polled time is a server decision. Any server software which does not acknowledges the fact that some clients may end up with stale data. Sites which don't support this header at all will have the edited entry in their static file from which they serve the feed.

X-Last-Polled is the same as If-Modified-Since in syntax, but is a request rather than a question. Hopefully everyone can see why If-Modified-Since is not appropriate for this purpose.

Something like this will have to be standardized, even as a "best practice", or else aggregators will start trying to it themselves in incompatible ways and we'll end up having to send five different headers.

Discussion

[MartinAtkins : RefactorOk] I originally considered that If-Modified-Since would work for this, but then realised that there are some users of RSS (and thus, ultimately Atom) feeds which don't make any effort to cache individual entries locally. Instead, they pull down the feed, transform it into something else (usually HTML) and that's the only data they keep. When they request again, they politely use the last-modified time on their HTML file in the If-Modified-Since header and if they don't get a response they just leave the file as it is and wait until next time. If they get a response they replace their HTML file with the new data which, if it's considered to be X-Last-Polled, will now be at worst blank, and at best only contain new stuff, thus losing anything that hadn't been seen in the mean time.

This may well cause problems for some proxies. However, some sites are already beefed up enough to be able to deal with bypassing proxies. LiveJournal.com, for example, always bypasses proxies because the responses generated are dependant on who is making the request. Assuming my implementation were to be used, it would have to be specified that servers MUST use Cache-control: private when honouring X-Last-Polled. That is, unless it's valid to put X-Last-Polled in the Vary header -- I can't remember how exactly Vary is specified. (probably not best to rely on it anyway, as there are plenty of dodgy proxies out there)

This is not really in the spirit of HTTP, but the benefits of including this functionality in some form are at least twofold:

Sites which can afford to dynamically-generate their Atom feeds, and whose feeds change infreqently, can save bandwith by only returning the changes.
It encourages less frequent automated retrievals or no automated retrievals at all, since you aren't going to miss anything by not retrieving for some time. By the current model, it becomes necessary to request frequently because once an item has been pushed off the bottom of the feed it's no longer gettable, thus encouraging clients to request frequently to avoid missing items. This way, my aggregator only needs to make a request when I'm ready to read, at which point it will get everything that has been added or updated since the last update, regardless of how long ago that was.

The second benefit suggests that feeds should indicate their support of such a feature, since an aggregator will need to know the difference between a static feed (which must be checked frequently for updates) and an 'intelligent' feed, which it can be a lot more lax with.

Other implementation suggestions are welcome, since mine was really just there to support the concept.

[JamesAylett RefactorOk] My problem with X-Last-Polled is that it seems different to other HTTP modifier headers. Unlike other HTTP request headers, the entity you are requesting an instance of (to use RFC 3229's terminology) is different with different X-Last-Polled headers. Unlike variable file formats, transfer encodings, instance manipulations - even different languages - having a request header which gives you a different document just feels wrong to me. (You could argue that the XML serialisation of the feed means that X-Last-Polled is a little like Content-Range. I wouldn't.)

[JamesAylett RefactorOk DeleteOk] (Paraphrased from discussion now moved to AggregatorApi) Are you happy to drop the proposal to extend HTTP for the purpose of feed transfer in favour of concentrating on an AggregatorApi?

[AsbjornUlsberg] Why not just use PUSH instead of PULL?

Of course, some clients will still poll. The client part of an aggregating feed proxy would need to poll, so this would be a good application of server-push if the originating server supports it; there will be far fewer aggregating feed proxies than end-users -- at least, that's the idea.

An aggregating feed proxy could indeed have a persistant connection with a set of its clients, although I doubt many will. The aggregated 'pull' model is similar to how users get their USENET news from their ISP's news server, which itself sucks the news from other sources.

I don't really see much harm in also creating a persistant feed consuming protocol, except that we already have two ways for an aggregator to operate: either it polls a static feed, or it asks for a delta feed generated dynamically using the reader protocol. Hopefully, everyone who has the latter will also keep the former, but adding a third option in increases the likelyhood that feed producers will pick only one or two of the options, fragmenting a system which was supposed to make more integration possible, plus making an aggregator much more difficult to write.

[AsbjornUlsberg] Many good points, Martin. Maybe the PUSH method shuold be an extension that can only be done in another namespace? We definetively need to think more about this, so postponing it to after v1.0 sounds like a good idea.

CategoryApi, CategoryArchitecture