intertwingly

It’s just data

Vary: ETag


Another bandwidth reduction idea, compliments of FooCamp04.

This idea doesn't require coordination between vendors, and leverages the ETag support that is present in many existing aggregator clients.  It is complicated to explain, and would be complicated to implement, but the end result would take the form of an Apache Module (and/or IIS filter) that sysadmins of major sites could just "drop in".

Design

The core idea is that, for sites that are willing to trade a little CPU for bandwidth savings, it may make sense to subset the feed that is returned on a GET based on the ETag that was provided on the request.  Such trade-offs have precedent.  There is a concern that varying the response based on the request would not play nicely with HTTP caches, but that's where the Vary HTTP header comes in.

Before going any further, it's worth backing up.  An Atom feed consists of some header information and a set of entries (similarly, an RSS feed consists of some channel information and a set of items).  There is no predefined contract as to how many entries a server must, or even should, return on every request.  Some servers return the last "n" entries, some return the last "n" days' worth of information.  Entries may "drop off" the end of the feed without warning.  Clients already need to deal with this today.

ETags are metadata returned by a web server.  If clients retain this information and provide it on subsequent GET requests, servers can optimize their response.  A concrete example: Mena Trott's Atom feed contains 15 entries.  Returned with the feed is an ETag, which at the moment is the value "819473-7b36-c4b4a500".  This is computed and handled entirely by the Apache web server.

So, at the moment, GET requests that provide this information in an If-None-Match header (yeah, that's intuitive) obtain a curt response of 304 Not Modified.  This saves a lot of bandwidth, particularly as Mena's last post to this particular weblog was on August 17th.
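For readers who haven't looked at conditional GET lately, here is a minimal sketch of what the server is doing today.  The function name and the tuple it returns are my own invention for illustration; a real server such as Apache also has to cope with lists of ETags, weak validators, and "*", all of which this ignores.

```python
def conditional_get(current_etag, body, if_none_match=None):
    """Return (status, headers, body) for a GET, honoring If-None-Match.

    Simplified on purpose: assumes the client sends back at most one ETag,
    exactly as it received it (quotes included).
    """
    headers = {"ETag": current_etag}
    if if_none_match is not None and if_none_match == current_etag:
        # The client already has this exact representation: send no body.
        return 304, headers, b""
    # No ETag supplied, or it no longer matches: send the whole feed.
    return 200, headers, body
```

A request carrying the ETag it was last given gets the 304 and an empty body; everybody else gets the whole feed.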

Now consider what would happen if Mena were to do a new post.  Requests with the previous ETag would result in the full feed.  All 15 entries, with full content.  Quite likely, this is an order of magnitude more information than one would need, as the only change is one entry.  If you treat the ETag as a sort of bookmark of where you last left off, and warn caches that you are doing this with a Vary: ETag header, then you could safely return a carefully truncated feed - with all the same feed header information, but with only the entries that you haven't seen.

This is more complicated than it seems: it requires an understanding not only of HTTP and the feed format, but also of the usage pattern of the given tool.  I'll use Mena's feed as an example.  Entries are in reverse chronological order, meaning that new entries are added at the top, and old ones drop off the bottom.  So if an Apache module can understand the feed format just enough that it can identify where each entry starts and stops, then it can compute a hash of the full feed, as well as a hash of the full feed minus the last entry, and a hash of the full feed minus the last two entries, etc.  This doesn't have to be a cryptographically secure hash; perhaps something as simple as a 32-bit CRC would do.  And not all permutations of entries present or absent need to be computed, just enough to make a difference.
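To make the hashing step concrete, here is a rough sketch.  Everything about it is an assumption on my part: the function names, the `depth` parameter, and the choice to pack the hashes into one ETag separated by hyphens.  It also assumes the entries have already been split out into strings, newest first, and it hashes only the entries rather than the whole feed, on the theory that feed-level fields such as a modification timestamp may change with each post.

```python
import zlib

def prefix_hashes(entries, depth=4):
    """CRC-32 of the entries with the oldest 0, 1, ..., depth-1 dropped.

    `entries` is newest-first, so dropping the oldest means trimming the tail.
    """
    hashes = []
    for dropped in range(min(depth, len(entries))):
        kept = entries[:len(entries) - dropped]
        crc = zlib.crc32("".join(kept).encode("utf-8"))
        hashes.append(format(crc, "08x"))
    return hashes

def make_etag(entries, depth=4):
    """Pack the hashes into a single quoted ETag value (hypothetical format)."""
    return '"' + "-".join(prefix_hashes(entries, depth)) + '"'
```

The `depth` knob is the "just enough to make a difference" part: a small value keeps the ETag short at the price of falling back to the full feed whenever more than a few entries have rolled off the bottom since the client last looked.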

Clients that receive such ETags will continue to return them, just as they do today.  No change is required to the client to make this work.  On the server, if the ETag exactly matches, a 304 response will be returned, just as is done today.  Where things get interesting is if one of the hashes matches a hash computed for the current feed with the first entry omitted (or first two, or...).  If so, you know which entries this client has seen before, and therefore doesn't need to see again.

For this to work, the ETag returned on such a streamlined response needs to exactly match the ETag of the full feed (including the omitted entries).  The status code needs to be 200 OK, not 206 Partial Content.  And, of course, the Vary: ETag header needs to be added.
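Putting the matching rules together, the server side might look something like the sketch below.  This is not a real Apache module, just the decision logic.  The names (`respond`, `etag_hashes`, `depth`) and the hyphen-separated ETag format are hypothetical, entries are assumed to be pre-split strings in newest-first order, and only the entries are hashed.  The Vary value used is If-None-Match, per the update at the end of this post.

```python
import zlib

def crc(entries):
    return format(zlib.crc32("".join(entries).encode("utf-8")), "08x")

def etag_hashes(entries, depth=4):
    # Hashes of the entry list with the oldest 0..depth-1 entries dropped.
    return [crc(entries[:len(entries) - d])
            for d in range(min(depth, len(entries)))]

def make_etag(entries, depth=4):
    return '"' + "-".join(etag_hashes(entries, depth)) + '"'

def respond(header, entries, if_none_match=None, depth=4):
    """Return (status, http_headers, body_parts) for a GET of the feed."""
    # The ETag always describes the FULL feed, even on a truncated response.
    etag = make_etag(entries, depth)
    http_headers = {"ETag": etag, "Vary": "If-None-Match"}
    if if_none_match == etag:
        return 304, http_headers, []
    full = [header] + entries
    if not if_none_match:
        return 200, http_headers, full
    client_hashes = set(if_none_match.strip('"').split("-"))
    # Omit the newest 1, 2, ... entries from the current feed and see whether
    # what remains is something this client has already been sent.
    for new_count in range(1, len(entries)):
        if crc(entries[new_count:]) in client_hashes:
            # 200 OK (not 206) with the feed header plus only the new entries.
            return 200, http_headers, [header] + entries[:new_count]
    # Edited entry, unrecognized ETag, anything odd: return it all.
    return 200, http_headers, full
```

For example, a client holding the ETag for entries e3, e2, e1 that asks again after e4 was posted (and e1 dropped off) would get back the feed header and e4 alone, along with the ETag for the full e4, e3, e2 feed.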

Part of the design is that in all edge cases, the full feed needs to be returned.  The user somehow managed to update an entry in the middle?  Return it all.  The feed can't be parsed?  Return it all.  No ETag is provided on the request?  Return it all.

When in doubt, return it all.  The purpose of this optimization is to catch enough cases to be cost effective, not to go to heroic efforts to squeeze every last byte out of the response.

Summary

Whew.  I said this was complicated.  But the upside is that there is no change required to the publishing process.  No extra files need to exist on the server.  No database needs to be accessed.  The only change is that a single Apache module needs to be installed.  Casual users with low bandwidth usage today don't need to bother.  This is only for the users with a problem.

However, a single installation on a typepad.com or blogspot.com server could result in a significant bandwidth saving overall.  This would benefit not only servers, but also clients (particularly ones on slow modems), and crawlers (like Feedster or Technorati).

Existing clients that don't provide an ETag on requests won't see any difference.  Existing clients that provide an ETag and retain a memory of what has been seen before will end up with a result that is indistinguishable from what they see today.  Only a client that retains a memory of the ETag but not of the entries would see any difference.  Such a client would be relying on a specific server behavior that isn't guaranteed by any specification.  And the solution is obvious: if you don't retain the entries, then don't provide an ETag.  (Perhaps retain the Last-Modified value instead).
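To illustrate why a client that remembers its entries sees no difference, here is a small sketch of the sort of merge such a client might already be doing.  The dictionary shape and the "id" key are assumptions for the sake of the example, not a description of any particular aggregator.

```python
def merge_entries(stored, received):
    """Merge a newest-first response into newest-first stored entries.

    Entries are dicts keyed by a hypothetical "id" field; anything in the
    response replaces the stored copy with the same id.
    """
    received_ids = {entry["id"] for entry in received}
    kept = [entry for entry in stored if entry["id"] not in received_ids]
    return received + kept
```

Whether the server sends all of the entries or only the one new one, the merged list comes out the same.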

Update: The correct value of the header would be Vary: If-None-Match. This was pointed out to me by Greg Stein offline, and by Carey Evans below. Thanks!