It’s just data

Vary: ETag

Another bandwidth reduction idea, compliments of FooCamp04.

This idea doesn't require coordination between vendors, and leverages the ETag support that is present in many existing aggregator clients.  It is complicated to explain, and would be complicated to implement, but the end result would take the form of an Apache Module (and/or IIS filter) that sysadmins of major sites could just "drop in".

Design

The core idea is that, for sites willing to trade a little CPU for bandwidth savings, it may make sense to subset the feed returned on a GET based on the ETag provided on the request.  Such trade-offs have precedent.  There is a concern that varying the response based on the request would not play nicely with HTTP caches, but that's where the Vary HTTP header comes in.

Before going any further, it's worth backing up.  An Atom feed consists of some header information and a set of entries (similarly, an RSS feed consists of some channel information and a set of items).  There is no predefined contract as to how many entries a server must, or even should, return on every request.  Some servers return the last "n" entries, some return the last "n" days' worth of information.  Entries may "drop off" the end of the feed without warning.  Clients already need to deal with this today.

ETags are metadata returned by a web server.  If clients retain this information and provide it on subsequent GET requests, servers can optimize their response.  A concrete example: Mena Trott's Atom feed contains 15 entries.  Returned with the feed is an ETag, which at the moment is the value "819473-7b36-c4b4a500".  This is computed and handled entirely by the Apache web server.

So, at the moment, GET requests that provide this information in an If-None-Match header (yeah, that's intuitive) obtain a curt response of 304 Not Modified.  This saves a lot of bandwidth, particularly as Mena's last post to this particular weblog was on August 17th.
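As a rough illustration of that exchange, here is a minimal Python sketch of the decision a server makes on a conditional GET.  It is not Apache's actual logic; the `conditional_get` helper and its return shape are made up for this example.

```python
def conditional_get(current_etag, feed_body, if_none_match=None):
    """Return (status, headers, body) for a plain conditional GET.

    Hypothetical helper: `if_none_match` is the raw value the client
    sent in its If-None-Match header, or None if it sent none.
    """
    if if_none_match is not None and if_none_match == current_etag:
        # The client already holds this exact representation.
        return 304, {"ETag": current_etag}, b""
    # Anything else gets the full feed plus the current ETag.
    return 200, {"ETag": current_etag}, feed_body


etag = '"819473-7b36-c4b4a500"'
feed = b"<feed>...</feed>"
status, headers, body = conditional_get(etag, feed, if_none_match=etag)
first_status, _, first_body = conditional_get(etag, feed)
```

A client that echoes the ETag gets the empty 304; a first-time client (no If-None-Match) gets the whole feed.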

Now consider what would happen if Mena were to do a new post.  Requests with the previous ETag would result in the full feed.  All 15 entries, with full content.  Quite likely, this is an order of magnitude more information than one would need, as the only change is one entry.  If you treat the ETag as a sort of bookmark of where you last left off, and warn caches that you are doing this with a Vary: ETag header, then you could safely return a carefully truncated feed - with all the same feed header information, but with only the entries that you haven't seen.

This is more complicated than it seems: it requires an understanding not only of HTTP and the feed format, but also of the usage pattern of the given tool.  I'll use Mena's feed as an example.  Entries are in reverse chronological order, meaning that new entries are added at the top, and dropped off the bottom.  So if an Apache module can understand the feed format just enough that it can identify where each entry starts and stops, then it can compute a hash of the full feed, as well as a hash of the full feed minus the last entry, and a hash of the full feed minus the last two entries, etc.  This doesn't have to be a cryptographically secure hash; perhaps something as simple as a 32 bit CRC could do.  And not all permutations of entries present or absent need to be computed, just enough to make a difference.
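To make the hashing step concrete, here is a minimal Python sketch.  It assumes the entries have already been split out of the feed as strings, newest first; the `window_hashes` name and the `depth` knob (how many truncated windows to hash) are inventions of this example, not part of any existing module.  It uses zlib's running CRC so that every window is covered in one pass.

```python
import zlib


def window_hashes(entries, depth=4):
    """CRC32 of the full feed, then of the feed minus its last 1, 2, ... entries.

    `entries` is newest-first.  `depth` is an assumed limit on how many
    windows are hashed ("just enough to make a difference").
    """
    running = 0
    prefix_crcs = []
    for entry in entries:
        # zlib.crc32 accepts a starting value, so the CRC of the first k
        # entries falls out of a single pass over the feed.
        running = zlib.crc32(entry.encode("utf-8"), running)
        prefix_crcs.append(running)
    # The last `depth` prefixes are: ..., feed minus two, feed minus one, full feed.
    newest_windows = prefix_crcs[-depth:]
    return [format(crc, "08x") for crc in reversed(newest_windows)]


hashes = window_hashes(list("876543210"))
```

The resulting list could be joined with commas to form the ETag value.  Entry boundaries are not hashed here, which is good enough for change detection in a sketch but is a simplification.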

Clients that receive such ETags will continue to return them, just as they do today.  No change is required to the client to make this work.  On the server, if the ETag exactly matches, a 304 response will be returned, just as is done today.  Where things get interesting is if one of the hashes matches a hash computed for the current feed with the first entry omitted (or first two, or...).  If so, you know which entries this client has seen before, and therefore doesn't need to see again.

For this to work, the ETag returned on such a streamlined response needs to exactly match the ETag of the full feed (including the omitted entries).  The status code needs to be 200 OK, not 206 Partial Content.  And, of course, the Vary: ETag header needs to be added.

Part of the design is that in all edge cases, the full feed needs to be returned.  The user somehow managed to update an entry in the middle?  Return it all.  The feed can't be parsed?  Return it all.  No ETag is provided on the request?  Return it all.

When in doubt, return it all.  The purpose of this optimization is to catch enough cases to be cost effective, not to go to heroic efforts to squeeze every last byte out of the response.
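Pulling the design together, the server-side decision might be sketched in Python as follows.  This is an illustration under stated assumptions, not a real Apache module: entries are plain strings (newest first), the ETag is an unquoted comma-separated list of CRC32 values, and `crc`, `etag_for` and `respond` are hypothetical names.

```python
import zlib


def crc(entries):
    """CRC32 (hex) over the concatenated entries; a stand-in for any cheap hash."""
    return format(zlib.crc32("".join(entries).encode("utf-8")), "08x")


def etag_for(entries, depth=4):
    """Hashes of the feed with its 0, 1, ... oldest entries dropped."""
    return ",".join(crc(entries[:len(entries) - dropped])
                    for dropped in range(min(depth, len(entries))))


def respond(entries, if_none_match=None, depth=4):
    """Return (status, etag, entries_to_send) for one GET."""
    etag = etag_for(entries, depth)      # always the ETag of the FULL feed
    if not if_none_match:
        return 200, etag, entries        # no ETag on the request: return it all
    if if_none_match == etag:
        return 304, etag, []             # exact match, same as today
    seen = set(if_none_match.split(","))
    for skipped in range(1, len(entries)):
        # Does the current feed minus its newest `skipped` entries match
        # a window the client has already seen?
        if crc(entries[skipped:]) in seen:
            return 200, etag, entries[:skipped]
    return 200, etag, entries            # when in doubt, return it all


old_feed = list("876543210")
new_feed = list("987654321")
old_etag = etag_for(old_feed)
result = respond(new_feed, old_etag)
```

Note that the truncated response still carries the ETag of the complete current feed and a 200 status; the caller would add the Vary header.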

Summary

Whew.  I said this was complicated.  But the upside is that there is no change required to the publishing process.  No extra files need to exist on the server.  No database needs to be accessed.  The only change is that a single Apache module needs to be installed.  Casual users with low bandwidth usage today don't need to bother.  This is only for the users with a problem.

However, a single installation on a typepad.com or blogspot.com server could result in a significant bandwidth saving overall.  This not only would benefit servers, but also clients (particularly ones across slow modems), and crawlers (like Feedster or Technorati).

Existing clients that don't provide an ETag on requests won't see any difference.  Existing clients that provide an ETag and retain a memory of what has been seen before will end up with an end result that is indistinguishable from what they see today.  It would only be a client which retains a memory of the ETag but not of the entries that would see any difference.  Such a client would be relying on a specific behavior of the server that wasn't guaranteed to be so by any specification.  And the solution is obvious: if you don't retain the entries, then don't provide an ETag.  (Perhaps retain the Last-Modified value instead).

Update: The correct value of the header would be Vary: If-None-Match. This was pointed out to me by Greg Stein offline, and by Carey Evans below. Thanks!


While the big sites would certainly want something as fast as an Apache module, it doesn't have to be one for everyone, does it? Seems to me (without having been in the code for several months) that something that fries up feeds on demand like WordPress could just, as a part of saving a new entry, compute and store hashes for the last n feed windows and the current state, and then look at the ETag in the request to see how many entries it needs to load up.

Posted by Phil Ringnalda at

RFC 2616 says that Vary “indicates the set of request-header fields”, so shouldn’t this be Vary: If-None-Match?

Posted by Carey Evans at

Sam, what you're suggesting is, I think, essentially the same as the "feed-specific" or "entry-oriented" instance manipulation method that I've been proposing that we add to RFC3229. The significant, and nice, addition that you're making is the suggestion that sites could turn on this instance-manipulation method by default rather than requiring that clients request it in HTTP Accept headers. By not requiring clients to be modified, the method will become useful immediately.

While this will serve Atom immediately, I think it makes sense to go the extra step of actually defining the RFC3229 IM method. If this is done, then we'll find that this method can be used to support a wide variety of non-Atom formats such as event logs, and other sequentially updated files. Atom/RSS may be the first application of the IM method, but it will be quite useful in other situations as well. (i.e. using this method on log files would have the same effect as "tail")...

Now, how do we get it built?

bob wyman

Posted by Bob Wyman at

I must be missing something, because I don't see how this works with an Apache module that doesn't know anything beyond the current state of a static file.

You have a feed with entries for 20040910, 20040909, and 20040908. I get it, along with an ETag from hashing that file. You add 20040911, so now your feed consists of 20040911, 20040910, and 20040909. I return an ETag with the 10, 09, 08 hash, but how does the module, which no longer gets to look at the entry for 20040908, tell that it's the hash for that? It can compute 11, 10, 9 or 11, 10 or 10, 9, but none of those matches my 10, 9, 8 hash to tell it that I've seen everything but 11.

Posted by Phil Ringnalda at

Phil: yes, frying up feeds on demand makes things considerably simpler.  In fact, it might be worth having separate "ETaglets" on a per entry basis which are simply concatenated (along with perhaps one for the head).  With some smarts, you can base64 fifteen 32 bit hashes into 80 bytes.  Having each hash available makes the job considerably simpler: filter out the entries that have already been seen before.  If none remain, return a 304 response.
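A minimal Python sketch of the ETaglet idea, assuming entries are available as strings and using CRC32 as the 32 bit hash (the `etaglets` and `unseen` names are made up for this illustration):

```python
import base64
import struct
import zlib


def etaglets(entries):
    """Pack one 32 bit CRC per entry and base64 the result."""
    crcs = [zlib.crc32(entry.encode("utf-8")) for entry in entries]
    packed = struct.pack(">%dI" % len(crcs), *crcs)
    return base64.b64encode(packed).decode("ascii")


def unseen(entries, if_none_match):
    """Entries whose CRC is absent from the client's ETag.

    An empty result means a 304 is appropriate.  An ETag that cannot be
    decoded falls back to the full feed.
    """
    try:
        packed = base64.b64decode(if_none_match)
        seen = set(struct.unpack(">%dI" % (len(packed) // 4), packed))
    except (ValueError, struct.error):
        return list(entries)
    return [e for e in entries if zlib.crc32(e.encode("utf-8")) not in seen]


old_entries = ["entry %d" % i for i in range(15)]
tag = etaglets(old_entries)
```

Fifteen 4 byte hashes are 60 bytes, which base64 expands to exactly 80 characters, matching the figure above.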

Carey: yes, thanks! corrected.

Bob: yes, RFC3229 and this proposal can go on in parallel.  While this proposal has the advantage of working with existing clients, RFC3229 has the potential of being more HTTP cache friendly.

If the implementation costs are small enough, a good way to get started would be to prototype this with dynamic feeds on a product such as WP.  This could then be deployed on an A-list blogger's site or two.  Once the concept is proven, the considerably harder task of building an Apache Module could proceed.

Posted by Sam Ruby at

Before I actually read the caching bits of rfc2616, I didn't quite realize that you are likely to get a whole batch of ETags from a cache in If-None-Match, representing all the representations it has cached (thus the name), to which you are supposed to reply with a 304 and the ETag for the one correct one. So if I read it right, client sends the ETag which will tell you that it only needs the latest two entries, but because you need to reply with the ETag for the full current feed, you wind up telling a cache that has that stored to deliver the whole thing to the client. Still benefits for the publisher, but not for the client.

Posted by Phil Ringnalda at

Phil: to explain the original proposal, let's take a really simple example: "entries" that consist entirely of one character, with an identity hash function.  So, let's define an initial feed of:

876543210

The "hash" for that feed would be:

876543210,87654321,8765432,876543

Now assume a new entry is added, thus:

987654321

What you would first do is see if 987654321 is in the input list.  Nope.  Next you would check 87654321, and sure enough, you find a match.  From that, you can deduce that the only entry that hasn't been seen yet is 9.

That being said, this involves at least two passes through the entries (all the hashes can be computed in parallel in one pass).  It might just be simpler to do ETaglets, as described above.
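The one-character example can be traced in a few lines of Python.  This is only a toy: each feed is a string whose characters are the entries, the "hash" is the identity function, and `etag_list` and `new_entries` are names invented here.

```python
def etag_list(feed, depth=4):
    """The feed, then the feed minus its last 1, 2, ... entries."""
    return [feed[:len(feed) - dropped] for dropped in range(depth)]


def new_entries(current, seen_list):
    """Entries at the top of `current` that the client has not seen."""
    for skipped in range(len(current)):
        if current[skipped:] in seen_list:
            return current[:skipped]   # "" means nothing new: a 304
    return current                     # no match: send everything


seen = etag_list("876543210")
fresh = new_entries("987654321", seen)
```

Here `seen` is the four-item list from the example and `fresh` is the single entry "9".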

Posted by Sam Ruby at

Ah, now I get it. I see "hash" and I think "big long monolithic string of crud that makes my eyes glaze over." So for my own private purposes, I might use entry_id;mod_date:entry_id;mod_date:..., and if I don't match the whole thing, pull off the first one as something I'll send, see if I match what's left, and keep going until I get back before the last thing I changed, or get to the end and send everything.
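(An illustrative Python sketch of the scheme described in this comment, not code from the thread.  It assumes entries are (entry_id, mod_date) pairs, newest first, that neither field contains ";" or ":", and the `make_tag` and `entries_to_send` names are hypothetical.)

```python
def make_tag(entries):
    """entry_id;mod_date:entry_id;mod_date:... for a newest-first feed."""
    return ":".join("%s;%s" % (entry_id, mod_date) for entry_id, mod_date in entries)


def entries_to_send(entries, if_none_match):
    """Peel entries off the top until what's left matches what the client saw."""
    client_tokens = if_none_match.split(":")
    current_tokens = make_tag(entries).split(":")
    for sent in range(len(current_tokens)):
        rest = current_tokens[sent:]
        # Old entries fall off the bottom, so the remainder of the current
        # feed should line up with the start of the client's list.
        if client_tokens[:len(rest)] == rest:
            return entries[:sent]      # empty list: nothing new
    return list(entries)               # got to the end: send everything


old = [("e10", "20040910"), ("e09", "20040909"), ("e08", "20040908")]
new = [("e11", "20040911"), ("e10", "20040910"), ("e09", "20040909")]
to_send = entries_to_send(new, make_tag(old))
```

Because only the mod_date is compared, an edit that leaves mod_date alone does not force a resend, while bumping it on an older entry breaks the match and sends the whole feed.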

Hmm. And that gives me a handle where I decide whether or not to push out an edit: rather than depend on the client telling the difference between typo correction and substantive change, I can decide, and if I just swap ei/ie, I don't change the mod_date, and I can still deliver the corrected version to people who haven't seen it, without having to make those who have re-read. That might make frying worth the trouble.

Posted by Phil Ringnalda at

Ironically your rss2 feed includes a <body> tag in the <item> with full text that most readers won't be able to show, and isn't part of rss 2.0 AFAIK.
And the description duplicates a lot of that data again.

Posted by Bart at

It's ingenious, but seems the same in principle to Bob's proposed RFC 3229 (customised to provide per-entry diffs) approach. Either approach could be implemented with no visible effect to non-supporting clients, both require additional handling to be supported - neither offers much as a short-term fix. So I'd tend to favour the RFC 3229-based approach as a more widely-applicable solution long-term.

Posted by Danny at

Ultimately aggregators are heaven-sent for hosting companies and ISPs, who stand to gain from all the bandwidth consumption.

What's proposed is more like an NNTP-style feed. There's no reason why this couldn't be emulated in HTTP. The ETag is one way, but the other is to permit additional parameters like [link]. Sites that serve feeds statically will fail gracefully, while those which support the additional semantics can save the bandwidth. Servers could be configured to give priority to feeds which use this semantic and provide lower service to aggregators which do not (eg. refuse to provide updates every other hour).

A possible (market-driven) solution is to let advertisers host the RSS feeds (ads can be inserted into the feeds), with proper mirroring to minimize traffic.

Posted by Chui at

Anne van Kesteren : Vary: ETag - After reading the comments I think I got it, seems like another interesting idea for saving feed bandwidth...

Excerpt from HotLinks - Level 1 at

This is risky. If-none-match has a defined purpose already, which is to find out if there is a new version of the page. Your outline only considers the case of a "normal" aggregator which polls the server for changes every now and then. However, there are two other kinds of client which this will break.

The first is newsreaders which don't retain a local copy of each entry but instead retain the entire feed as one lump. This is common for things like news boxes on websites which show the last n entries from a particular feed, like some of Slashdot's nutty little boxes; the software driving it will often just retrieve the feed (with appropriate If-None-Match or If-Modified-Since headers), take the first n entries, format them into some kind of sensible HTML and save the result to disk as a static file. These programs are trying to be polite by using the normal mechanisms to only retrieve new content, but your mechanism punishes them by only returning new stuff and causing them to lose any items from the previous run.

The other kind of client this affects is the good old web browser. If I load your feed in my browser -- which would be a weird thing to do, but I do have my reasons to do this from time to time -- then my browser will likely cache it. If I then load it again, I'll get back a page missing all of the entries that were there when I last loaded it. This is going to confuse and frustrate anyone who was simply trying to get at some of the data thinking it was a static document.

See the Atom Wiki page PossibleHTTPExtensionForEfficientFeedTransfer (which didn't have such a long name when I created it; someone refactored it) where I wrote all this stuff out in a much better way ages ago. You can also see the discussion which resulted from the proposal, some of which would also apply to this ETag proposal.

Posted by Martin Atkins at

Sam Ruby: Vary: ETag

[link]...

Excerpt from del.icio.us/jonas at

Will this actually help much? My feeling (with absolutely no evidence to back it up) is that most bandwidth use is from clients that don't implement conditional-get or handle GZipped feeds.

This proposal will only help decrease the bandwidth of already well-behaved clients, and then only when an extra post is generated.

Say the average feed is 20 kB, the average client polls hourly and the average blogger posts once per day, with a post that is 2 kB. Assume that conditional get uses 200 bytes of server bandwidth for a 304 response.

On the average day a good client will use (23 polls * 0.2 kB) + (1 get * 20 kB) = 4.6 + 20 = 24.6 kB.

Under this proposal it would be something like (23 polls * 0.2 kB) + (1 get * 2 kB) = 4.6 + 2 = 6.6 kB.

That's a saving of 18 kB, which sounds excellent, until you realize it doesn't take into account Gzip compression.

A quick experiment shows a 20 kB feed will compress to about 5 kB, and a 2 kB item will compress to about 1 kB.

Now the math is (23 polls * 0.2 kB) + (1 get * 5 kB) = 4.6 + 5 = 9.6 kB vs (23 polls * 0.2 kB) + (1 get * 1 kB) = 4.6 + 1 = 5.6 kB.

That's a saving of only 4 kB per day.
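The arithmetic above can be replayed in a few lines of Python for anyone who wants to plug in their own numbers.  The constants are the estimates from this comment; `daily_kb` is just a helper name chosen for the sketch.

```python
POLLS_ANSWERED_304 = 23   # hourly polling, one poll per day sees the new post
COST_304_KB = 0.2         # assumed size of a 304 response


def daily_kb(changed_fetch_kb):
    """Per-client daily transfer: 23 cheap 304s plus one real fetch."""
    return round(POLLS_ANSWERED_304 * COST_304_KB + changed_fetch_kb, 1)


saving_plain = round(daily_kb(20) - daily_kb(2), 1)    # full feed vs one item
saving_gzipped = round(daily_kb(5) - daily_kb(1), 1)   # same, after gzip
```

Changing the poll count or the compressed sizes shows how quickly the fixed cost of the 304s comes to dominate.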

Obviously the more you post the more you save - but conversely the more often clients poll the worse off you are. For some big aggregated feeds it might make a lot of sense, but for the average feed it isn't a big difference.

Having said that, if someone does a server to test against, I'll volunteer to make the Rome Fetcher ([link]) work with it.

Posted by Nick Lothian at

Analysis of Bandwidth savings of Vary: ETag

I posted some quick analysis of how much bandwidth Sam Ruby's interesting Vary: ETag will help. I'm interested in any feedback - especially hard numbers!...

Excerpt from BadMagicNumber at

i think phil was reasonably confused, as most hashing functions don't point to the represented data in any recognizable state, whereas your "hash" includes distinguishable entry identifiers. hash functions are called such because they generally follow the common definition of the word: "A jumble; a hodgepodge". but you're not talking about a jumble. you're talking about an "identity hash function." i've used many hash functions, and none of them identified what the hash represented. maybe such a hash exists, but it's at best a rarity. of course, this has nothing to do with the actual proposal - just the language used to describe it.

nick says "For some big aggregated feeds it might make a lot of sense, but for the average feed it isn't a big difference." but that's going to be true of anything that saves bandwidth. those who have plenty of bandwidth to spare won't have much use for it.

Posted by scott reynen at

Bart: I believe that most aggregators support xhtml:body.  In description, I put a summary, in xhtml:body I put the full text.  For those aggregators that don't support xhtml:body, I have more feeds, including RSS 0.91 and RSS 1.0 feeds.

Danny: The key benefit to this proposal is that clients don't have to change.  Proposals that require both the client and the server to change are often hard to get adopted as there is no immediate benefit for the early adopters.

Martin: Such a "nutty little box" implementation could be impacted - but as I said they are relying on a behavior that isn't guaranteed, and if they chose to send an If-Modified-Since header instead of an If-None-Match, they won't be affected.  Your point about the browser is a good one, and one that I hadn't considered.

Nick and Scott: I agree that for an average feed, this won't make a difference.

Scott: I meant Identity function not Identifier. Nobody would use such a function for a hash, except for explanation purposes, which is all I was using it for.

Posted by Sam Ruby at

I don't understand.

I do understand most of this exciting story.

I understand: The client does a conditional-get (good client!) and includes the eTag12, the tag of the version it last fetched.  The server seeing that its current version is eTagXY then thinks.  "Cool I remember eTag12, I know how to delta between it and eTagXY."

I notice: what appears to be a subplot in the story told above about how the server might be able to do this without caching a copy of the version denoted by eTag12 because the content in question is very stylized.  Yeah, eTag could even have secret messages that help the server figure that out.  Ok, that's a clever subplot.  Stateless servers are fun.

I get lost: What does the server return?  What HTTP response codes and headers?

I'm feeling dumb.

Posted by Ben Hyde at

Does this work for subsequent requests? In your example above, it seems like the extra '9' entry would have to be served with ETag: 9 (if it’s served with ETag: 987654321 then it’s indistinguishable from the full feed in caches). The subsequent request with If-None-Match: 9 doesn’t match any historical feed, and so gets the full feed again, so you only bumped serving the full feed along to the next poll.

Holding a full set of hashes (so, when 'A' is added, If-None-Match: 9 just gets A, as would If-None-Match: 987 etc.) seems like the only solution, short of requiring clients to reconstitute the full feed.

Posted by Joseph Walton at

Ben, the response would look something like this:

HTTP/1.1 200 OK
ETag: "eTagXY"
Vary: If-None-Match
Content-Length: nnnn

<feed>
  ...
</feed>

Most clients won't even know what they are missing.  ;-)

Posted by Sam Ruby at

Joseph, it would be served with ETag: 987654321.  That, coupled with a Vary: If-None-Match means that if in the future there is a new entry (let's call it A), that should be cached along with the ETag.

Let's look at a scenario: Joe and Jane use the same ISP which has a smart HTTP cache.

Note that the ETag represents the server's full state, not the partial response.  In the last step, Joe and Jane receive the same data, even though their prior GET returned different data.  The down side is that Joe's second request resulted in a cache miss.

Posted by Sam Ruby at

If 13.6 describes how caches actually work, it sounds to me like Joe's miss goes a little differently. He sends a GET with If-None-Match: "876543210", the cache has expired that representation, so it passes it along as If-None-Match: "876543210", "987654321" because it has an unexpired copy of that from Jane's full fetch, to which the server responds with

304 Not Modified
ETag: "987654321"

and Joe gets the full feed from the cache.
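(A small Python sketch of the situation described here, added for illustration.  RFC 2616 allows If-None-Match to carry a comma-separated list of entity tags; the `parse_if_none_match` and `handle` names are hypothetical, and the special "*" value is not handled.)

```python
import re


def parse_if_none_match(value):
    """Extract the opaque tags from a header such as '"a", W/"b"'."""
    return re.findall(r'(?:W/)?"([^"]*)"', value)


def handle(current_etag, if_none_match):
    """Return (status, etag) when a cache forwards several ETags at once."""
    tags = parse_if_none_match(if_none_match)
    if current_etag in tags:
        # The cache holds the current representation, so the server cannot
        # tell whether the end client is behind; it answers 304.
        return 304, current_etag
    return 200, current_etag


forwarded = '"876543210", "987654321"'
outcome = handle("987654321", forwarded)
```

Once the current ETag appears anywhere in the list, any delta logic never gets a chance to run, which is the trap this comment points at.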

Posted by Phil Ringnalda at

I'm afraid I'm more confused than ever. Isn't the upside what happens in Joe's second call - i.e. a single entry has been added since his last call, he has the previous ETag version, he should only need the new entry..? 

What's more I don't understand your answer to my previous point - "The key benefit to this proposal is that clients don't have to change." - but surely to get any benefit both client and server would have to change..? (Just as they would with RFC3229).

Posted by Danny at

Phil, that's where the Vary: header comes in. Jane's full fetch returns the feed with a Vary: If-None-Match. Because Joe's request will have a different If-None-Match value than Jane's (absent), the cache will miss. Put another way, a request must have an identical If-None-Match header to a previous request in order to result in a cache hit.

Posted by Robert Sayre at

But just saying "the cache misses" is an oversimplification. It doesn't sound to me like it just says "oops, I don't have that, I'll pass on the request completely untouched." Instead, it says "I don't have anything I can return without asking the server, so I'll ask the server if I have anything that suits, by sending every ETag I have stored." So at the server, you see Joe's ETag, plus Jane's ETag which is your current one, and you can't tell whether the request is from Joe wanting a new item, or Jane sitting on her Refresh Feed button, so you say 304 with Jane's ETag either way.

Now, stir it up a bit more: you've updated twice within your Expires: time. Joe is one item behind, you return one item with the ETag that means "current", Jane, who is two items behind, tries to update, and the cache sends the ETags that mean two items behind (from her last refresh), one item behind (Joe's refresh before his last), and current. You respond with a 304 and the ETag that tells the cache to give Jane just one item, don't you?

Posted by Phil Ringnalda at

Now I'm more confused.  The idea is that the ellipsis would contain only the new elements?  Doesn't that tell the client that the old articles have been pulled?

Posted by Ben Hyde at

Danny (re: your comment at 13:20) What Sam is proposing is that an entry-oriented or "feed-aware" instance-manipulation method be implemented and turned on by default when serving Atom files. The reason the client doesn't have to change is that the RFC3229 IM method is being used whether or not the client asks for it. The client doesn't need to know.

This use of IM methods by default is something that is only possible with the class of resources that contains Atom and RSS feeds. The important property of the class is that you can "truncate" a member of the class and the truncated version is still syntactically valid and useful. In fact, given that virtually all Atom and RSS feeds are simply sliding-windows on a conceptually longer feed, feed truncation is a normal part of life in this world anyway. The difference here is that the window-size is being adjusted on a per-connection basis.

RFC3229 only provides for instance-manipulation on demand of the client. Sam is simply pointing out that for this class of resources, instance-manipulation can be safely done by default.

bob wyman

Posted by Bob Wyman at

Ah, thanks Bob - got it: RFC3229, 7.1 Multiple entity tags in the If-None-Match header.
Need to read (and think!) a little more, but the sliding-window reference does suggest a potential snag - how does the client get the Atom head info without sliding the window right back to day one? I guess this might be another situation where the 'introspection' kind of ideas might be useful - stick the head info at a different URI.

Posted by Danny at

Phil, I think I see what you're saying. 13.6 says a cache should add all the ETags it has to If-None-Match, but doesn't give a course of action if If-None-Match is listed in Vary:. Doing so seems to highlight a contradiction in the spec, so your scenario is possible.

Posted by Robert Sayre at

Not so much a contradiction in the spec as a trap for unwary people trying to sneak deltas through: ordinarily, if you are using Vary: If-None-Match, you are quite happy to find that your current ETag is cached, because, well, it's all of your current instance.

But I think the bigger problem is what rfc3229 discusses in section 5.4: dealing with HTTP 1.0 caches. That seems to say that they will allow unknown headers to flow through on a request, so Joe's If-None-Match: gets to the server, but they ignore unknown headers when looking for cached results, so when Jane requests with her two-behind ETag, or Bob requests for the first time ever with no ETag, the HTTP 1.0 cache in the middle will just happily return the one-item version. If that's true, that they may transmit an If-None-Match and then cache the result without understanding it, then whenever the server sends a delta response, it has to make it uncacheable, which really, really complicates the benefit analysis.

Posted by Phil Ringnalda at

Danny wrote "how does the client get the Atom head info without sliding the window right back to day one?"

I've been struggling with this, and a number of other issues, while trying to write up a formal proposal... Basically, where I am at the moment is defining an "abstract" instance-manipulation method which is made "concrete" when combined with a specific content-type. Thus, Atom would have different concrete rules than RSS or W3C Extended Log File Format. In RSS, an item has no meaning if not accompanied by a channel, thus, applying "feed" IM to RSS would always require that Channel be provided. On the other hand, the current draft of Atom allows "entry" as a root element. Thus, we could define the "feed IM" for Atom as allowing a simple unwrapped sequence of entries. But, we could also define the concrete rules for "Atom feed IM" as requiring the atom:feed element and even the atom:head element. The concrete definition would be up to us.

Personally, I think we should, in fact, require that atom:feed and atom:head be present in Atom feeds that have had the "feed IM" applied to them. The cost of requiring this is, in most cases, minimal. Providing atom:feed and atom:head will reduce client complexity somewhat at the cost of some bloat.

Similar issues arise with other kinds of "feed." For instance, let's say you've got a file in W3C Extended Log File Format ([link] ). What you should probably do is insert at least the "#Fields" header at the start of each response to which the "feed IM" is applied.

bob wyman

Posted by Bob Wyman at

WWW cubed: syndication and scale

The rise of RSS reminds us once again that the web doesn't scale, but it's not time to throw the towel in yet....

Excerpt from Bill de hÓra at

Ben: what line in what spec can you point to that provides any semantic meaning to the absence of a specific entry in a feed?  I'm serious: while at first the answer may seem self-evident, it turns out that it is anything but.

Has your Bee Hive entry been pulled from cozy.org?  It is not in your feed.  Hmmm, ... it still seems to be there.  It even seems that Bloglines still has it.  Perhaps it wasn't pulled after all...

Posted by Sam Ruby at

What about using a 207 multistatus HTTP response code for feeds with updates?

Then for each item in a feed that hasn't changed you return a 304, but a 200 and data for the entry that has been modified.

I suspect client support might not be very good for 207's, though!

Posted by Nick Lothian at

I just blogged about this:

[link]

This is very close to the proposal I just sent to atom-syntax:

[link]

[link]

In short we should just use a date range header.  This is trivial to implement, fast, and supports additional functionality such as archive query.  For example this would allow your aggregator to fetch the last 30 days of posts.

This ETag mechanism is just for future optimization, not past optimization.  I.e., there's no way to go into the past and fetch older articles.

Posted by Kevin Burton at

Ok.  I get it.  Thanks to Bob and Sam.

It makes me wince (sometimes that's the prolog to a smile) to think that different users might record sharply different resources under the same eTag, depending on the past history of their interaction with the source server.

If we have Sam's site, Charlie the caching proxy, with Ed and Egor who use Charlie; then Sam's clever server is going to cause Charlie to miss postings in his responses to  Ed and Egor.

Posted by Ben Hyde at

Hmm... it seems to me that SSFF would have been another approach.  With the feed being reduced to a series of pointers (along with etags, as I had once suggested), you would still be able to use an etag for the feed itself and have a relatively small download (no worse than a single entry in the approach above).  A look at new links or new etags for existing links would tell you which entries to GET.

Of course, I realize this is all old news.  I guess I just feel like it still needs to be pointed out.  When I hear a solution that begins with "[it] is complicated to explain, and would be complicated to implement", I have to wonder if it's really the right solution or if it points to a more fundamental question about the design of the feed format...

Posted by Seairth at

Foo Camp 2004

I just got back from Foo Camp up in Sebastopol. I had a blast, and want to thank Tim O'Reilly and Marc Hedlund for the invite. The intellectual firepower there was amazing, and everyone was really friendly and open. Here are some pictures of the...

Excerpt from wingedpig.com at

Internet choking on RSS - film at 10

The past couple of days have seen quite a bit of discussion in the RSS world concerning the growing amount of RSS data being transferred over the Internet. Robert Scoble started off the heated discussion by posting about how MSDN...... [more]

Trackback from The Silent Penguin

at

Using RFC3229 with Feeds.

The other day I wrote that we really should be adopting RFC3229 "Delta Encoding in HTTP" in order to reduce the amount of bandwidth, etc. that is wasted in serving RSS and Atom files. I'm fairly convinced that if the... [more]

Trackback from As I May Think...

at

Another thing to keep in mind is that Internet Explorer is deeply broken concerning the handling of responses having the "Vary" header. So if an Atom feed gets served through IE with the plan to forward it to an external application (triggered by MIME type), this will fail.

See: [link]

Posted by Julian Reschke at

Passing Things Around

Yesterday I posted what seems to me a reasonably immediate solution to the kind of problem MSDN have had with aggregated feed of the whole Cube. It works with the current spec, what could be better? What very likely could be a lot better is a...

Excerpt from Planet RDF at

Another way to save RSS bandwidth - use Vary and ETags to tradeoff server CPU time for bandwidth

Interesting. I don't know if this is the be all and end all of solutions but it's definitely a start! My gut says that there is a simpler solution. From Sam Ruby: Vary: ETag: QUOTE Another bandwidth reduction idea,...... [more]

Trackback from Roland Tanglao's Weblog

at

Can the hashing be disposed of? If we're following the sliding window/sequence model, then wouldn't using the id (URI) of the last entry passed as the ETag be enough..?

Posted by Danny at

Danny: I'm concerned about what happens if somebody updates an older entry.

Posted by Sam Ruby at
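
To make the exchange above concrete, here is a minimal sketch of Danny's scheme, where the ETag is simply the id of the newest entry the client has seen. The feed layout (a newest-first list of dicts with `id` and `updated` keys) and the `entries_after` helper are assumptions made for illustration, not anything from a real server.

```python
def entries_after(feed, last_seen_id):
    """Return the entries that sit above last_seen_id in a newest-first feed."""
    ids = [entry["id"] for entry in feed]
    if last_seen_id not in ids:
        # Unknown ETag: fall back to sending the whole feed.
        return list(feed)
    return feed[:ids.index(last_seen_id)]


feed = [
    {"id": "urn:example:3", "updated": 30},
    {"id": "urn:example:2", "updated": 20},
    {"id": "urn:example:1", "updated": 10},
]

etag = feed[0]["id"]       # the client has seen everything up to entry 3
feed[2]["updated"] = 40    # entry 1 is edited afterwards, but keeps its place
missed = entries_after(feed, etag)  # empty: the edit is never delivered
```

Because the position of an edited entry does not change, nothing sits "above" the client's ETag, which is the case Sam is worried about.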

Sam Ruby: The core idea is that, sites that are willing to trade a little CPU for a bandwidth savings, subsetting the feed that is returned on a GET based on the ETag that was provided on the request may make sense.  Randy: Some more...

Excerpt from RSS at

How about using WebDAV dead properties to publish the metadata?  It's backward compatible too.  Sounds like ETag will kill one bird with one stone.  Is this bird flying alone or in a fat flock?

Posted by Eric Hanson at

Couldn't the ETag be a timestamp, and then the blogging software would just return the posts that have been posted after that If-None-Match timestamp (the time it indicates)?

Posted by Ilkka at

Ilkka: if entries are never modified, yes.

Posted by Sam Ruby at

If a post has a "lastmodified" field, one could compare that to the If-None-Match stamp.

I think I will implement this in ND.

Posted by Ilkka Huotari at
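
A rough sketch of the timestamp idea, under stated assumptions: each entry is a dict with an integer `modified` field (the "lastmodified" Ilkka mentions), the ETag is the newest such value wrapped in quotes, and `select_entries` is a hypothetical helper rather than part of any blogging package.

```python
def select_entries(entries, if_none_match):
    """Return (status, entries_to_send, etag) for a timestamp-style ETag."""
    newest = max((e["modified"] for e in entries), default=0)
    etag = '"%d"' % newest
    if if_none_match is None:
        return 200, list(entries), etag
    try:
        since = int(if_none_match.strip('"'))
    except ValueError:
        # Not one of our timestamps: send the full feed.
        return 200, list(entries), etag
    changed = [e for e in entries if e["modified"] > since]
    if not changed:
        return 304, [], etag
    return 200, changed, etag
```

Comparing against the modification time rather than the creation time is what lets an edited older entry be resent.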

I think Kevin has it. Why not define a new range-unit (per rfc2616) like "created-or-modified-after" which takes a timestamp? Then send a Range request like Range:created-or-modified-after=1095192707213.

I don't follow Atom, but does it put last-modified times in each post? Then you could use xpath as a range unit.

Posted by matt at
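
For illustration only, this is roughly how a server might read the Range header matt proposes. The `created-or-modified-after` unit comes from his comment and is not a registered range unit; `parse_range`, `respond`, and the entry layout (dicts with an integer `modified` field in the same units as the header value) are assumptions.

```python
RANGE_UNIT = "created-or-modified-after"


def parse_range(header):
    """Return the integer timestamp from 'unit=value', or None for other units."""
    unit, sep, value = header.partition("=")
    if not sep or unit.strip() != RANGE_UNIT:
        return None
    try:
        return int(value.strip())
    except ValueError:
        return None


def respond(entries, range_header):
    """Return (status, entries): 206 for an honoured range, 200 otherwise."""
    since = parse_range(range_header) if range_header else None
    if since is None:
        return 200, list(entries)
    return 206, [e for e in entries if e["modified"] > since]
```

A client that does not send the header, or sends a unit the server does not recognise, still gets the full feed. What to return when no entry falls in the range is left open here.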

Why is it that proposed solutions seem to focus on complicating the HTTP layer instead of making the feed more Web-like and splitting it up?

Why not give each item a URL, interlink next/previous items and provide a constant URL that redirects to the latest item? This could even be expressed in chromeless (X)HTML using the link element without inventing new syntax.

Posted by Henri Sivonen at

Why define a new range unit? The range (timestamp or hash(entryURI,time) identifying the entries in the feed stack) could be passed opaquely to the client as the ETag.

Posted by Laurian Gridinoc at
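
One possible reading of the hash(entryURI,time) suggestion, sketched with assumed names: `token` hashes an entry's id and modification time into an opaque quoted string, and `newer_than` treats the feed as a stack ordered by modification time and returns whatever sits above the entry the client's ETag names.

```python
import hashlib


def token(entry):
    """Opaque ETag for one entry: a hash of its URI and modification time."""
    raw = "%s|%d" % (entry["id"], entry["modified"])
    return '"%s"' % hashlib.sha1(raw.encode("utf-8")).hexdigest()


def newer_than(entries, etag):
    """Return entries modified more recently than the one the ETag identifies."""
    stack = sorted(entries, key=lambda e: e["modified"], reverse=True)
    for position, entry in enumerate(stack):
        if token(entry) == etag:
            return stack[:position]
    # ETag not recognised (for example, that entry has since been edited).
    return stack
```

An edited entry gets a new time and so moves to the top of the stack, which means it is resent. The cost in this sketch is that a client whose ETag names an entry that was later edited falls back to the full feed, since the server keeps no history of old tokens.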

ETag as timestamp was implemented here...
[link]

I'm not sure why the trackback doesn't appear in this thread.

Posted by Ilkka Huotari at

James Robinson reports on his blog that he has implemented a change to WordPress to support RFC3229 delta encoding. See:
[link]

I've asked that he extend his support to include the "feed" IM method that I describe at:
[link]
If he, or someone else, does that, then it would be a short step from there to Sam's suggestion of having delta encoding turned on by default -- and we can do it with minimal modification to existing standards!

bob wyman

Posted by Bob Wyman at

Syndication with RFC3229

James E. Robinson, III:  In addition to the lack of Apache support that James mentions, I see three problems:  If the bloginfo changes, it will never be resent. The client is responsible for remembering the original encoding, which again, hopefully n... [more]

Trackback from Sam Ruby

at

Re. using the URI of the last entry published as the ETag, I don't see the problem with revisions. However the deltas are done, the server has to remember the sequence /somehow/ (to be able to patch up the partials), so looking back to the last time a particular URI had its representation published should do the trick, no?

Posted by Danny at

Scratch that. But I'm still not convinced it's necessary to record/hash a seq of entries, it does seem unnecessarily complicated.

Posted by Danny at

James Robinson has updated his WordPress RFC3229 support to include support for the "feed" instance-manipulation method that I proposed on my blog. His comments and source code can be found at: [link]

So, WordPress is the first blogging software to have support for RFC3229. Who can we convince to be second?

bob wyman

Posted by Bob Wyman at

Bob, my read is that he didn't address any of the issues I raised.  I'll comment there.

Posted by Sam Ruby at

Sam, your three issues are addressed by incorporating the "feed" IM method -- although they remain for "diffe" IM. Your issues were:
1. "* If the bloginfo changes, it will never be resent."
2. "* The client is responsible for remembering the original encoding, which again, hopefully never changes."
3. "* Few platforms (actually, none that I know of) have support for parsing XML fragments (sets of elements)."

None of these are issues with the "feed" instance-manipulation method since the result of a GET is always a complete RSS or Atom document which includes bloginfo (rss:channel or atom:head) and encoding instructions, and is not a fragment.

RFC3229 was carefully designed to ensure that bad things didn't happen with caches (although less will be cached). What other issues remain if we talk only of "feed" IM, not "diffe"? I accept that there are many issues with the "diffe" IM method. However, I think "feed" is more appropriate for blogging use even though it is less efficient than diffe or other byte-oriented methods. It is still massively more efficient than what we do today.

bob wyman

Posted by Bob Wyman at
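
A hedged sketch of how a server might answer a request under the "feed" instance-manipulation method described above. The `A-IM` and `IM` headers and the 226 (IM Used) status come from RFC3229; everything else is an assumption for illustration: the `handle_get` name, the dict-based feed with `info` and `entries` keys, exact-case header lookup, and an ETag that is just the newest modification time in quotes.

```python
def handle_get(feed, request_headers):
    """Return (status, response_headers, document) for a feed request."""
    entries = feed["entries"]
    etag = '"%d"' % max((e["modified"] for e in entries), default=0)
    full = (200, {"ETag": etag},
            {"info": feed["info"], "entries": list(entries)})

    if_none_match = request_headers.get("If-None-Match")
    if if_none_match == etag:
        return 304, {"ETag": etag}, None

    # A-IM lists the instance manipulations the client accepts.
    accepted = [t.split(";")[0].strip()
                for t in request_headers.get("A-IM", "").split(",")]
    if "feed" not in accepted or not if_none_match:
        return full
    try:
        since = int(if_none_match.strip('"'))
    except ValueError:
        return full

    changed = [e for e in entries if e["modified"] > since]
    # Still a complete document: the feed-level info is always included.
    return (226, {"ETag": etag, "IM": "feed"},
            {"info": feed["info"], "entries": changed})
```

Clients that never send `A-IM: feed` keep getting ordinary 200 and 304 responses, so the behaviour is opt-in per request.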

Why not use Range/If-Range and specify a range unit? (see rfc2616, section 3.12)

Posted by Yves at

Caching

eTags for XMPP. When a Jabber client logs into its server, it gets a lot of information, mainly the user’s “roster” (see RFC 3921) and service discovery information about the server and its associated services (see JEP-0030). Oftentimes, that...

Excerpt from one small voice at

a heavy atom?

Recently, Sam Ruby posted Vary: ETag which discusses a method for solving a fundamental problem with Atom. The problem is that, because the feed includes complete entries, updating or adding one entry causes all others (which are unchanged) to be...

Excerpt from Seairth Jacobs' Jotspace at

So, What's New?

Ok, so the draft-sayre-atompub-protocol-basic-02 is pretty close to matching the current WG draft in capability. The question is whether that’s......

Excerpt from franklinmint.fm at

Sam Ruby: Vary: ETag

Let's save some bandwidth :) !...

Excerpt from Public marks from user benoit with tag http at

Sam Ruby: Vary: ETag

This idea doesn’t require coordination between vendors, and leverages the ETag support that is present in many existing aggregator clients....

Excerpt from del.icio.us/alan.dean/etag at
