Another bandwidth reduction idea, compliments of FooCamp04.
This idea doesn't require coordination between vendors, and
leverages the
ETag support that is present in many existing aggregator
clients. It is complicated to explain, and would be
complicated to implement, but the end result would take the form of
an Apache Module (and/or IIS filter) that sysadmins of major sites
could just "drop in".
Design
The core idea is that, for sites willing to trade a little
CPU for bandwidth savings, it may make sense to subset the feed
returned on a GET based on the ETag provided with the
request. Such trade-offs have
precedent.
There is a concern that varying the response based on the request
would not play nicely with HTTP caches, but that's where the
Vary HTTP header comes in.
Before going any further, it's worth backing up. An Atom
feed consists of some header information and a set of entries
(similarly, an RSS feed consists of some channel information and a
set of items). There is no specified contract as
to how many entries a server must, or even should, return on every
request. Some servers return the last "n" entries, some
return the last "n" days' worth of information. Entries may
"drop off" the end of the feed without warning. Clients
already need to deal with this today.
ETags are metadata returned by a web server. If clients
retain this information and provide it on subsequent GET requests,
servers can optimize their response. A concrete example:
Mena Trott's
Atom
feed contains 15 entries. Returned with the feed is an
ETag, which at the moment is the value
"819473-7b36-c4b4a500". This is computed and handled entirely
by the Apache web server.
So, at the moment, GET requests that provide this information in
an
If-None-Match header (yes, that's intuitive) obtain a curt
response of
304 Not Modified. This saves a lot of bandwidth,
particularly as Mena's last post to this particular weblog was on
August 17th.
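In protocol terms, that's the standard conditional GET dance. A
minimal sketch in Python (the host and path are placeholders; the
ETag is the value quoted above):

    import http.client

    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/atom.xml", headers={
        "If-None-Match": '"819473-7b36-c4b4a500"',  # ETag from the last fetch
    })
    resp = conn.getresponse()
    if resp.status == 304:   # Not Modified: no body at all, bandwidth saved
        body = None          # the copy we already have is still good
    else:                    # 200 OK: full feed, plus a fresh ETag to keep
        body, etag = resp.read(), resp.getheader("ETag")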
Now consider what would happen if Mena were to do a new
post. Requests with the previous ETag would result in the
full feed. All 15 entries, with full content. Quite
likely, this is an order of magnitude more information than one
would need, as the only change is one entry. If you treat the
ETag as a sort of bookmark of where you last left off, and warn
caches that you are doing this with a Vary: ETag header, then you
could safely return a carefully truncated feed - with all the same
feed header information, but with only the entries that you haven't
seen.
This is more complicated than it seems: it requires an
understanding not only of HTTP and the feed format, but also of
the usage pattern of the given tool. I'll use Mena's feed as
an example. Entries are in reverse chronological order,
meaning that new entries are added at the top, and drop off the
bottom. So if an Apache module can understand the feed format
just enough to identify where each entry starts and stops,
then it can compute a hash of the full feed, as well as a hash of
the full feed minus the last entry, a hash of the full
feed minus the last two entries, and so on. This doesn't have to be
a cryptographically secure hash; perhaps something as simple as a
32-bit CRC would do. And not all permutations of entries present or
absent need to be computed, just enough to make a difference.
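To make that concrete, here is a hedged sketch of the hashing
(Python rather than an Apache module; the function name, the CRC32
choice, and the depth limit are all illustrative assumptions):

    import zlib

    def feed_hashes(head, entries, depth=5):
        # head: the feed-level XML as bytes; entries: the entry
        # elements as bytes, newest first.  Hash the full feed, then
        # the feed minus its oldest entry, minus the two oldest, etc.
        hashes = []
        for drop in range(min(depth, len(entries)) + 1):
            body = head + b"".join(entries[:len(entries) - drop])
            hashes.append(format(zlib.crc32(body), "08x"))
        return hashes

    # The served ETag could then be the list joined up, e.g.
    # etag = '"' + "-".join(feed_hashes(head, entries)) + '"'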
Clients that receive such ETags will continue to return them,
just as they do today. No change is required to the client to
make this work. On the server, if the ETag exactly matches, a
304 response will be returned, just as is done today. Where
things get interesting is if one of the hashes matches a hash
computed for the current feed with the first entry omitted
(or first two, or...). If so, you know which entries this
client has seen before, and therefore doesn't need to see again.
For this to work, the ETag returned on such a streamlined
response needs to exactly match the ETag of the full feed
(including the omitted entries). The status code needs
to be
200 OK, not
206 Partial Content. And, of course, the Vary: ETag
header needs to be added.
Part of the design is that in all edge cases, the full feed
needs to be returned. The user somehow managed to update an
entry in the middle? Return it all. The feed can't be
parsed? Return it all. No ETag provided on the
request? Return it all.
When in doubt, return it all. The purpose of this
optimization is to catch enough cases to be cost effective, not to
go to heroic efforts to squeeze every last byte out of the
response.
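Putting the matching and the fallback rules together, a sketch of
the server-side decision might look like the following (continuing
the Python sketch above; render_feed() is a hypothetical helper,
and a production module would also cope with the comma-separated
multi-ETag form of If-None-Match):

    def drop_newest_hashes(head, entries, depth=5):
        # Hashes of the current feed with the k newest entries
        # omitted, k = 0..depth; the counterpart of feed_hashes(),
        # which truncates from the oldest end.
        return [format(zlib.crc32(head + b"".join(entries[k:])), "08x")
                for k in range(min(depth, len(entries)) + 1)]

    def respond(head, entries, if_none_match):
        headers = {"ETag": '"' + "-".join(feed_hashes(head, entries)) + '"',
                   "Vary": "If-None-Match"}
        client = set((if_none_match or "").strip('"').split("-"))
        for k, h in enumerate(drop_newest_hashes(head, entries)):
            if h in client:
                if k == 0:                     # nothing new at all
                    return 304, headers, b""
                # Seen everything but the k newest: truncated feed,
                # full-feed ETag, 200 OK.
                return 200, headers, render_feed(head, entries[:k])
        # No ETag, unrecognized ETag, mid-feed edit: return it all.
        return 200, headers, render_feed(head, entries)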
Summary
Whew. I said this was complicated. But the upside is
that there is no change required to the publishing process.
No extra files need to exist on the server. No database needs
to be accessed. The only change is that a single Apache
module needs to be installed. Casual users with low bandwidth
usage today don't need to bother. This is only for the users
with a problem.
However, a single installation on a typepad.com or blogspot.com
server could result in a significant bandwidth saving
overall. This would benefit not only servers, but also
clients (particularly ones on slow modems), and crawlers (like
Feedster or
Technorati).
Existing clients that don't provide an ETag on requests won't
see any difference. Existing clients that provide an ETag and
retain a memory of what has been seen before will get results
indistinguishable from what they see
today. It would only be a client which retains a memory of
the ETag but not of the entries that would see any
difference. Such a client would be relying on a specific
behavior of the server that wasn't guaranteed to be so by any
specification. And the solution is obvious: if you don't
retain the entries, then don't provide an ETag. (Perhaps
retain the
Last-Modified value instead).
Update: The correct value of the header would be
Vary: If-None-Match. This was pointed out to me
by Greg Stein offline, and by Carey Evans below. Thanks!
While the big sites would certainly want something as fast as an Apache module, it doesn't have to be one for everyone, does it? Seems to me (without having been in the code for several months) that something like WordPress, which fries up feeds on demand, could just, as a part of saving a new entry, compute and store hashes for the last n feed windows and the current state, and then look at the ETag in the request to see how many entries it needs to load up.
Sam, what you're suggesting is, I think, essentially the same as the "feed-specific" or "entry-oriented" instance manipulation method that I've been proposing that we add to RFC3229. The significant, and nice, addition that you're making is the suggestion that sites could turn on this instance-manipulation method by default rather than requiring that clients request it in HTTP Accept headers. By not requiring clients to be modified, the method will become useful immediately.
While this will serve Atom immediately, I think it makes sense to go the extra step of actually defining the RFC3229 IM method. If this is done, then we'll find that this method can be used to support a wide variety of non-Atom formats such as event logs, and other sequentially updated files. Atom/RSS may be the first application of the IM method, but it will be quite useful in other situations as well. (i.e. using this method on log files would have the same effect as "tail")...
I must be missing something, because I don't see how this works with an Apache module that doesn't know anything beyond the current state of a static file.
You have a feed with entries for 20040910, 20040909, and 20040908. I get it, along with an ETag from hashing that file. You add 20040911, so now your feed consists of 20040911, 20040910, and 20040909. I return an ETag with the 10, 09, 08 hash, but how does the module, which no longer gets to look at the entry for 20040908, tell that it's the hash for that? It can compute 11, 10, 09 or 11, 10 or 10, 09, but none of those matches my 10, 09, 08 hash to tell it that I've seen everything but 11.
Phil: yes, frying up feeds on demand makes things considerably simpler. In fact, it might be worth having separate "ETaglets" on a per-entry basis which are simply concatenated (along with perhaps one for the head). With some smarts, you can base64 fifteen 32-bit hashes into 80 bytes. Having each hash available makes the job considerably simpler: filter out the entries that have already been seen before. If nothing remains, return a 304 response.
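A hedged illustration of those ETaglets (the names are made up; fifteen 4-byte CRCs pack into 60 bytes, which base64 encodes to exactly 80 characters):

    import base64, struct, zlib

    def etaglets(entries):
        # One 32-bit CRC per entry, newest first, packed and encoded:
        # 15 entries * 4 bytes = 60 bytes = an 80-character ETag.
        packed = b"".join(struct.pack(">I", zlib.crc32(e)) for e in entries)
        return base64.b64encode(packed).decode("ascii")

    def unseen_entries(entries, client_etag):
        # Keep only entries whose taglet is absent from the client's
        # previous ETag; an empty result means a 304 is in order.
        raw = base64.b64decode(client_etag)
        seen = {raw[i:i + 4] for i in range(0, len(raw), 4)}
        return [e for e in entries
                if struct.pack(">I", zlib.crc32(e)) not in seen]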
Carey: yes, thanks! Corrected.
Bob: yes, RFC3229 and this proposal can go on in parallel. While this proposal has the advantage of working with existing clients, RFC3229 has the potential of being more HTTP cache friendly.
If the implementation costs are small enough, a good way to get started would be to prototype this with dynamic feeds on a product such as WordPress. This could then be deployed on the sites of an A-list blogger or two. Once the concept is proven, the considerably harder task of building an Apache module could proceed.
Before I actually read the caching bits of rfc2616, I didn't quite realize that you are likely to get a whole batch of ETags from a cache in If-None-Match, representing all the representations it has cached (thus the name), to which you are supposed to reply with a 304 and the ETag for the one correct one. So if I read it right, client sends the ETag which will tell you that it only needs the latest two entries, but because you need to reply with the ETag for the full current feed, you wind up telling a cache that has that stored to deliver the whole thing to the client. Still benefits for the publisher, but not for the client.
Phil: to explain the original proposal, let's take a really simple example: "entries" that consist entirely of one character, with an identity hash function. So, let's define an initial feed of:
876543210
The "hash" for that feed would be:
876543210,87654321,8765432,876543
Now assume a new entry is added, thus:
987654321
What you would first do is see if 987654321 is in the input list. Nope. Next you would check 87654321, and sure enough, you find a match. From that, you can deduce that the only entry that hasn't been seen yet is 9.
That being said, this involves at least two passes through the entries (all the hashes can be computed in parallel in one pass). It might just be simpler to do ETaglets, as described above.
Ah, now I get it. I see "hash" and I think "big long monolithic string of crud that makes my eyes glaze over." So for my own private purposes, I might use entry_id;mod_date:entry_id;mod_date:..., and if I don't match the whole thing, pull off the first one as something I'll send, see if I match what's left, and keep going until I get back before the last thing I changed, or get to the end and send everything.
Hmm. And that gives me a handle where I decide whether or not to push out an edit: rather than depend on the client telling the difference between typo correction and substantive change, I can decide, and if I just swap ei/ie, I don't change the mod_date, and I can still deliver the corrected version to people who haven't seen it, without having to make those who have seen it re-read. That might make frying worth the trouble.
Ironically your rss2 feed includes a <body> tag in the <item> with full text that most readers won't be able to show, and which isn't part of RSS 2.0 AFAIK.
And the description duplicates a lot of that data again.
It's ingenious, but seems the same in principle as Bob's proposed RFC 3229 (customised to provide per-entry diffs) approach. Either approach could be implemented with no visible effect on non-supporting clients; both require additional handling to be supported - neither offers much as a short-term fix. So I'd tend to favour the RFC 3229-based approach as a more widely-applicable solution long-term.
Ultimately aggregators are heaven-sent for hosting companies and ISPs, who stand to gain from all the bandwidth consumption.
What's proposed is more like an NNTP-style feed. There's no reason why this couldn't be emulated in HTTP. The ETag is one way, but the other is to permit additional parameters like [link]. Sites that serve feeds statically will fail gracefully, while those which support the additional semantics can save the bandwidth. Servers could be configured to give priority to feeds which use this semantic and provide lower service to aggregators which do not (e.g. refuse to provide updates every other hour).
A possible (market-driven) solution is to let advertisers host the RSS feeds (ads can be inserted into the feeds), with proper mirroring to minimize traffic.
This is risky. If-none-match has a defined purpose already, which is to find out if there is a new version of the page. Your outline only considers the case of a "normal" aggregator which polls the server for changes every now and then. However, there are two other kinds of client which this will break.
The first is newsreaders which don't retain a local copy of each entry but instead retain the entire feed as one lump. This is common for things like news boxes on websites which show the last n entries from a particular feed, like some of Slashdot's nutty little boxes; the software driving it will often just retrieve the feed (with appropriate If-None-Match or If-Modified-Since headers), take the first n entries, format them into some kind of sensible HTML and save the result to disk as a static file. These programs are trying to be polite by using the normal mechanisms to only retrieve new content, but your mechanism punishes them by only returning new stuff and causing them to lose any items from the previous run.
The other kind of client this affects is the good old web browser. If I load your feed in my browser -- which would be a weird thing to do, but I do have my reasons to do this from time to time -- then my browser will likely cache it. If I then load it again, I'll get back a page missing all of the entries that were there when I last loaded it. This is going to confuse and frustrate anyone who was simply trying to get at some of the data thinking it was a static document.
See the Atom Wiki page PossibleHTTPExtensionForEfficientFeedTransfer (which didn't have such a long name when I created it; someone refactored it) where I wrote all this stuff out in a much better way ages ago. You can also see the discussion which resulted from the proposal, some of which would also apply to this ETag proposal.
Will this actually help much? My feeling (with absolutely no evidence to back it up) is that most bandwidth use is from clients that don't implement conditional GET or handle gzipped feeds.
This proposal will only help decrease the bandwidth of already well-behaved clients, and then only when an extra post is generated.
Say the average feed is 20 kB, the average client polls hourly, and the average blogger posts once per day with a post that is 2 kB. Assume that a conditional GET uses 200 bytes of server bandwidth for a 304 response.
On the average day a good client will use (23 polls * 0.2 kB) + (1 get * 20 kB) = 4.6 + 20 = 24.6 kB.
Under this proposal it would be something like (23 polls * 0.2 kB) + (1 get * 2 kB) = 4.6 + 2 = 6.6 kB.
That's a saving of 18 kB, which sounds excellent, until you realize it doesn't take into account gzip compression.
A quick experiment shows a 20 kB feed will compress to about 5 kB, and a 2 kB item will compress to about 1 kB.
Now the math is (23 polls * 0.2 kB) + (1 get * 5 kB) = 4.6 + 5 = 9.6 kB vs (23 polls * 0.2 kB) + (1 get * 1 kB) = 4.6 + 1 = 5.6 kB.
That's a saving of only 4 kB per day.
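For what it's worth, the same arithmetic as a throwaway script (same assumptions, all sizes in kB):

    polls = 23 * 0.2                           # hourly 304s: 4.6 kB/day
    plain, delta = polls + 20, polls + 2       # uncompressed: 24.6 vs 6.6
    gzfull, gzdelta = polls + 5, polls + 1     # gzipped:       9.6 vs 5.6
    print(plain - delta)                       # -> 18 kB/day saved, no gzip
    print(gzfull - gzdelta)                    # ->  4 kB/day saved with gzip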
Obviously the more you post the more you save - but conversely, the more often clients poll the worse off you are. For some big aggregated feeds it might make a lot of sense, but for the average feed it isn't a big difference.
Having said that, if someone puts up a server to test against, I'll volunteer to make the Rome Fetcher ([link]) work with it.
I posted some quick analysis of how much bandwidth Sam Ruby's interesting Vary: ETag proposal will help. I'm interested in any feedback - especially hard numbers!
I think Phil was reasonably confused, as most hashing functions don't point to the represented data in any recognizable state, whereas your "hash" includes distinguishable entry identifiers. Hash functions are called such because they generally follow the common definition of the word: "a jumble; a hodgepodge". But you're not talking about a jumble; you're talking about an "identity hash function". I've used many hash functions, and none of them identified what the hash represented. Maybe such a hash exists, but it's at best a rarity. Of course, this has nothing to do with the actual proposal - just the language used to describe it.
Nick says "For some big aggregated feeds it might make a lot of sense, but for the average feed it isn't a big difference." But that's going to be true of anything that saves bandwidth. Those who have plenty of bandwidth to spare won't have much use for it.
Bart: I believe that most aggregators support xhtml:body. In description, I put a summary, in xhtml:body I put the full text. For those aggregators that don't support xhtml:body, I have more feeds, including RSS 0.91 and RSS 1.0 feeds.
Danny: The key benefit to this proposal is that clients don't have to change. Proposals that require both the client and the server to change are often hard to get adopted, as there is no immediate benefit for the early adopters.
Martin: Such a "nutty little box" implementation could be impacted - but as I said, they are relying on a behavior that isn't guaranteed, and if they choose to send an If-Modified-Since header instead of an If-None-Match, they won't be affected. Your point about the browser is a good one, and one that I hadn't considered.
Nick and Scott: I agree that for an average feed, this won't make a difference.
Scott: I meant identity function, not identifier. Nobody would use such a function for a hash, except for explanation purposes, which is all I was using it for.
I understand: The client does a conditional GET (good client!) and includes eTag12, the tag of the version it last fetched. The server, seeing that its current version is eTagXY, then thinks: "Cool, I remember eTag12; I know how to delta between it and eTagXY."
I notice: what appears to be a subplot in the story told above about how the server might be able to do this without caching a copy of the version denoted by eTag12, because the content in question is very stylized. Yeah, the ETag could even have secret messages that help the server figure that out. OK, that's a clever subplot. Stateless servers are fun.
I get lost: What does the server return? What HTTP response codes and headers?
Does this work for subsequent requests? In your example above, it seems like the extra '9' entry would have to be served with ETag: 9 (if it's served with ETag: 987654321 then it's indistinguishable from the full feed in caches). The subsequent request with If-None-Match: 9 doesn't match any historical feed, and so gets the full feed again, so you only bumped serving the full feed along to the next poll.
Holding a full set of hashes (so, when 'A' is added, If-None-Match: 9 just gets A, as would If-None-Match: 987 etc.) seems like the only solution, short of requiring clients to reconstitute the full feed.
Joseph, it would be served with ETag: 987654321. That, coupled with a Vary: If-None-Match, means that if in the future there is a new entry (let's call it A), that should be cached along with the ETag.
Let's look at a scenario: Joe and Jane use the same ISP which has a smart HTTP cache.
Initially, the feed looks like 876543210
Joe fetches it, and gets a full feed with an ETag of 876543210
Feed gets updated, now looks like 987654321
Jane fetches it, and gets a full feed with an ETag of 987654321
Joe fetches it, the cache misses, he gets a small feed with just a 9 and an ETag of 987654321
Feed gets updated, now looks like A98765432
Jane fetches it, the cache misses, and she gets a small feed with just an A and an ETag of A98765432
Joe fetches it, and the cache returns the same response as Jane just got.
Note that the ETag represents the server's full state, not the partial response. In the last step, Joe and Jane receive the same data, even though their prior GETs returned different data. The downside is that Joe's second request resulted in a cache miss.
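To see why identical If-None-Match values are what matter, here's a toy sketch of how a Vary-aware shared cache might key its entries (simplified; real caches also honour expiry and revalidation):

    def cache_key(url, headers):
        # With "Vary: If-None-Match" on the stored response, a request
        # may reuse it only if its If-None-Match value matches the one
        # on the request that originally fetched it (RFC 2616, 13.6).
        return (url, headers.get("If-None-Match", ""))

    joe  = cache_key("/feed", {"If-None-Match": '"987654321"'})
    jane = cache_key("/feed", {"If-None-Match": '"987654321"'})
    assert joe == jane    # Joe's third poll reuses Jane's cached
                          # "just an A" response, as described above
    joe2 = cache_key("/feed", {"If-None-Match": '"876543210"'})
    assert joe2 != jane   # but his second poll keyed differently: a miss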
If 13.6 describes how caches actually work, it sounds to me like Joe's miss goes a little differently. He sends a GET with If-None-Match: "876543210"; the cache has expired that representation, so it passes it along as If-None-Match: "876543210", "987654321", because it has an unexpired copy of the latter from Jane's full fetch - to which the server responds with a 304 and the current ETag.
I'm afraid I'm more confused than ever. Isn't the upside what happens in Joe's second call - i.e. a single entry has been added since his last call, he has the previous ETag version, he should only need the new entry..?
What's more I don't understand your answer to my previous point - "The key benefit to this proposal is that clients don't have to change." - but surely to get any benefit both client and server would have to change..? (Just as they would with RFC3229).
Phil, that's where the Vary: header comes in. Jane's full fetch returns the feed with a Vary: If-None-Match. Because Joe's request will have a different If-None-Match value than Jane's (absent), the cache will miss. Put another way, a request must have an identical If-None-Match header to a previous request in order to result in a cache hit.
But just saying "the cache misses" is an oversimplification. It doesn't sound to me like it just says "oops, I don't have that, I'll pass on the request completely untouched." Instead, it says "I don't have anything I can return without asking the server, so I'll ask the server if I have anything that suits, by sending every ETag I have stored." So at the server, you see Joe's ETag, plus Jane's ETag which is your current one, and you can't tell whether the request is from Joe wanting a new item, or Jane sitting on her Refresh Feed button, so you say 304 with Jane's ETag either way.
Now, stir it up a bit more: you've updated twice within your Expires: time. Joe is one item behind, you return one item with the ETag that means "current", Jane, who is two items behind, tries to update, and the cache sends the ETags that mean two items behind (from her last refresh), one item behind (Joe's refresh before his last), and current. You respond with a 304 and the ETag that tells the cache to give Jane just one item, don't you?
Now I'm more confused. The idea is that the ellipsis would contain only the new elements? Doesn't that tell the client that the old articles have been pulled?
Danny (re: your comment at 13:20) What Sam is proposing is that an entry-oriented or "feed-aware" instance-manipulation method be implemented and turned on by default when serving Atom files. The reason the client doesn't have to change is that the RFC3229 IM method is being used whether or not the client asks for it. The client doesn't need to know.
This use of IM methods by default is something that is only possible with the class of resources that contains Atom and RSS feeds. The important property of the class is that you can "truncate" a member of the class and the truncated version is still syntactically valid and useful. In fact, given that virtually all Atom and RSS feeds are simply sliding-windows on a conceptually longer feed, feed truncation is a normal part of life in this world anyway. The difference here is that the window-size is being adjusted on a per-connection basis.
RFC3229 only provides for instance-manipulation on demand of the client. Sam is simply pointing out that for this class of resources, instance-manipulation can be safely done by default.
Ah, thanks Bob - got it: RFC3229, 7.1 Multiple entity tags in the If-None-Match header.
Need to read (and think!) a little more, but the sliding-window reference does suggest a potential snag - how does the client get the Atom head info without sliding the window right back to day one? I guess this might be another situation where the 'introspection' kind of ideas might be useful - stick the head info at a different URI.
Phil, I think I see what you're saying. 13.6 says a cache should add all the ETags it has to If-None-Match, but doesn't give a course of action if If-None-Match is listed in Vary:. Doing so seems to highlight a contradiction in the spec, so your scenario is possible.
Not so much a contradiction in the spec as a trap for unwary people trying to sneak deltas through: ordinarily, if you are using Vary: If-None-Match, you are quite happy to find that your current ETag is cached, because, well, it's all of your current instance.
But I think the bigger problem is what rfc3229 discusses in section 5.4: dealing with HTTP 1.0 caches. That seems to say that they will allow unknown headers to flow through on a request, so Joe's If-None-Match: gets to the server, but they ignore unknown headers when looking for cached results, so when Jane requests with her two-behind ETag, or Bob requests for the first time ever with no ETag, the HTTP 1.0 cache in the middle will just happily return the one-item version. If that's true, that they may transmit an If-None-Match and then cache the result without understanding it, then whenever the server sends a delta response, it has to make it uncacheable, which really, really complicates the benefit analysis.
Danny wrote "how does the client get the Atom head info without sliding the window right back to day one?"
I've been struggling with this, and a number of other issues, while trying to write up a formal proposal... Basically, where I am at the moment is defining an "abstract" instance-manipulation method which is made "concrete" when combined with a specific content-type. Thus, Atom would have different concrete rules than RSS or W3C Extended Log File Format. In RSS, an item has no meaning if not accompanied by a channel, thus, applying "feed" IM to RSS would always require that Channel be provided. On the other hand, the current draft of Atom allows "entry" as a root element. Thus, we could define the "feed IM" for Atom as allowing a simple unwrapped sequence of entries. But, we could also define the concrete rules for "Atom feed IM" as requiring the atom:feed element and even the atom:head element. The concrete definition would be up to us.
Personally, I think we should, in fact, require that atom:feed and atom:head be present in Atom feeds that have had the "feed IM" applied to them. The cost of requiring this is, in most cases, minimal. Providing atom:feed and atom:head will reduce client complexity somewhat at the cost of some bloat.
Similar issues arise with other kinds of "feed." For instance, let's say you've got a file in W3C Extended Log File Format ([link]). What you should probably do is insert at least the "#Fields" header at the start of each response to which the "feed IM" is applied.
Ben: what line in what spec can you point to that provides any semantic meaning to the absence of a specific entry in a feed? I'm serious: while at first the answer may seem self-evident, it turns out that it is anything but.
Has your Bee Hive entry been pulled from cozy.org? It is not in your feed. Hmmm, ... it still seems to be there. It even seems that Bloglines still has it. Perhaps it wasn't pulled after all...
In short we should just use a date range header. This is trivial to implement, fast, and supports additional functionality such as archive queries. For example, this would allow your aggregator to fetch the last 30 days of posts.
This ETag mechanism is just for future optimization, not past optimization. I.e., there's no way to go into the past and fetch older articles.
It makes me wince (sometimes that's the prologue to a smile) to think that different users might record sharply different resources under the same ETag, depending on the past history of their interaction with the source server.
If we have Sam's site, Charlie the caching proxy, with Ed and Egor who use Charlie; then Sam's clever server is going to cause Charlie to miss postings in his responses to Ed and Egor.
Hmm... it seems to me that SSFF would have been another approach. With the feed being reduced to a series of pointers (along with etags, as I had once suggested), you would still be able to use an etag for the feed itself and have a relatively small download (no worse than a single entry in the approach above). A look at new links or new etags for existing links would tell you which entries to GET.
Of course, I realize this is all old news. I guess I just feel like it still needs to be pointed out. When I hear a solution that begins with "[it] is complicated to explain, and would be complicated to implement", I have to wonder if it's really the right solution or if it points to a more fundamental question about the design of the feed format...
Another thing to keep in mind is that Internet Explorer is deeply broken concerning the handling of responses having the "Vary" header. So if an Atom feed gets served through IE with the plan to forward it to an external application (triggered by MIME type), this will fail.
Can the hashing be disposed of? If we're following the sliding window/sequence model, then wouldn't using the id (URI) of the last entry passed as the ETag be enough..?
How about using WebDAV dead properties to publish the metadata? It's backward compatible too. Sounds like ETag will kill one bird with one stone. Is this bird flying alone or in a fat flock?
Couldn't the ETag be a timestamp, and then the blogging software would just return the posts that have been posted after that If-None-Match timestamp (the time it indicates)?
I think Kevin has it. Why not define a new range-unit (per rfc2616) like "created-or-modified-after" which takes a timestamp? Then send a Range request like Range: created-or-modified-after=1095192707213.
I don't follow Atom, but does it put last-modified times in each post? Then you could use XPath as a range unit.
Why is it that proposed solutions seem to focus on complicating the HTTP layer instead of making the feed more Web-like and splitting it up?
Why not give each item a URL, interlink next/previous items and provide a constant URL that redirects to the latest item? This could even be expressed in chromeless (X)HTML using the link element without inventing new syntax.
Why define a new range unit? The range (a timestamp, or hash(entryURI, time) identifying the entries in the feed stack) could be passed opaquely to the client as the ETag.
James Robinson reports on his blog that he has implemented a change to WordPress to support RFC3229 delta encoding. See: [link]
I've asked that he extend his support to include the "feed" IM method that I describe at: [link]
If he, or someone else, does that, then it would be a short step from there to Sam's suggestion of having delta encoding turned on by default -- and we can do it with minimal modification to existing standards!
re. using the URI of the last entry published as the ETag, I don't see the problem with revisions. However the deltas are done the server has to remember the sequence /somehow/ (to be able to patch up the partials), so looking back to the_last_time a particular URI had its representation published should do the trick, no?
James Robinson has updated his WordPress RFC3229 support to include support for the "feed" instance-manipulation method that I proposed on my blog. His comments and source code can be found at: [link]
So, Wordpress is the first blogging software to have support for RFC3229. Who can we convince to be second?
Sam, your three issues are addressed by incorporating the "feed" IM method -- although they remain for "diffe" IM. Your issues were:
1. "* If the bloginfo changes, it will never be resent."
2. "* The client is responsible for remembering the original encoding, which again, hopefully never changes."
3. "* Few platforms (actually, none that I know of) have support for parsing XML fragments (sets of elements)."
None of these are issues with the "feed" instance-manipulation method since the result of a GET is always a complete RSS or Atom document which includes bloginfo (rss:channel or atom:head) and encoding instructions, and is not a fragment.
RFC3229 was carefully designed to ensure that bad things didn't happen with caches (although less will be cached). What other issues remain - if we talk only of "feed" IM, not "diffe"? I accept that there are many issues with the "diffe" IM method. However, I think "feed" is more appropriate for blogging use even though it is less efficient than diffe or other byte-oriented methods. It is still massively more efficient than what we do today.
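For concreteness, a hypothetical exchange using the proposed "feed" IM (the host, path, and ETag value are illustrative, and "feed" is the token Bob has proposed, not a registered IM method; under Sam's default-on variant the server would apply it even without the A-IM header):

    import http.client

    conn = http.client.HTTPConnection("example.org")
    conn.request("GET", "/index.atom", headers={
        "A-IM": "feed",                             # client opts in (RFC 3229)
        "If-None-Match": '"819473-7b36-c4b4a500"',  # the ETag it last saw
    })
    resp = conn.getresponse()
    if resp.status == 226:            # "IM Used": a partial feed holding
        print(resp.getheader("IM"))   # only the unseen entries; says "feed"
        print(resp.getheader("ETag")) # ETag of the full current instance
    body = resp.read()                # otherwise a plain 200 or 304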