Yesterday, I more fully integrated Joe’s threading work into Venus. From an end user’s perspective, one benefit is immediate: the first time you specify spider_threads, the Last-Modified and ETag header values that had previously been captured and stored in the Venus cache will be used.
In debugging this, I took a look at ETag and Last-Modified usage, and found a few surprises. Sure, I found a few sites that provided neither, yet would return the same data, byte for byte, again and again. Most of these sites appeared to compute their feeds dynamically on each request; this includes sites such as IBM Developer Works: example.
As I said, this wasn’t a surprise.
Some sites would provide both ETag and Last-Modified headers, yet would always send the full content whenever an If-None-Match header was provided. They would respect If-Modified-Since, but only if If-None-Match was not provided as well. These sites are typically ones powered by WordPress: example. Anne’s feed also falls into this category, but I can’t determine how it was produced.
While that was surprising, even more puzzling is the fact that there are some feeds out there that only intermittently support ETags. And by intermittently, I do mean occasionally: anywhere from one time in two to about one time in four or less. All such feeds that I could find come from blogs.msdn.com: example.
You can verify this yourself with the following Python 2.4 script. Simply pass one or more URIs as command line parameters:
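import urllib2, sys
for uri in sys.argv[1:]:
    if uri.startswith('-'): continue
    # first request: capture the validators the server hands out
    headers = urllib2.urlopen(uri).headers
    # second request: echo the validators back as a conditional GET
    request = urllib2.Request(uri)
    if headers.has_key('etag') and '-e' not in sys.argv:
        request.add_header('If-None-Match', headers['etag'])
    if headers.has_key('last-modified') and '-m' not in sys.argv:
        request.add_header('If-Modified-Since', headers['last-modified'])
    try:
        print uri, urllib2.urlopen(request).code
    except urllib2.HTTPError, e:
        print uri, e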
Additionally, you can specify either or both of -e and -m to cause the associated header to be omitted.
What you want to see is HTTP Error 304: Not Modified. If, instead, you simply see 200, then the full content was sent both times.
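For readers on Python 3, the header handling at the heart of that check might look something like the sketch below. The function name conditional_headers and its two keyword flags are made up for illustration (they mirror the -e and -m switches); it only builds the request headers and leaves the actual fetching to whatever HTTP library you prefer.

```python
def conditional_headers(previous, use_etag=True, use_last_modified=True):
    """Build conditional-GET request headers from an earlier response's headers.

    `previous` is any mapping of header name to value. The flags are
    hypothetical stand-ins for the -e and -m switches described above.
    """
    # Header names are case-insensitive, so normalise the keys first.
    prev = {name.lower(): value for name, value in previous.items()}
    headers = {}
    if use_etag and "etag" in prev:
        # Echo the entity tag back verbatim, quotes included.
        headers["If-None-Match"] = prev["etag"]
    if use_last_modified and "last-modified" in prev:
        # Echo the date string back exactly as the server sent it.
        headers["If-Modified-Since"] = prev["last-modified"]
    return headers
```

A server that honours either validator should then answer the second request with 304 and no body.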
Recommendation to feed producers: don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing.
And to feed consumers: while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time. I’ve now implemented this for Venus.
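To make the hashing idea concrete, here is a minimal Python 3 sketch. It is not the Venus implementation; the class name ContentHashCache, the in-memory dictionary, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib


class ContentHashCache:
    """Remember a digest of the last body seen per URI (illustrative only)."""

    def __init__(self):
        self._digests = {}  # uri -> hex digest of the last body fetched

    def changed(self, uri, body):
        """Return True if `body` (bytes) differs from the last body seen for `uri`."""
        digest = hashlib.sha1(body).hexdigest()
        if self._digests.get(uri) == digest:
            return False  # byte-for-byte identical: skip parsing entirely
        self._digests[uri] = digest
        return True
```

A consumer would call changed() on every 200 response and only parse the feed when it returns True, which covers servers that never send a 304.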
It’s a bit strong to tell people to only use ETags and Last-Modified when they support validation; they can be used for other things as well.
LM is useful for calculating heuristic freshness in caches, if the server hasn’t bothered to make it explicit.
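As a rough illustration of that heuristic: the HTTP caching specification suggests a cache may treat a response as fresh for some fraction of the time since it was last modified, with 10% given as a typical value. The function name and the default fraction below are assumptions for this sketch, not any particular cache's behaviour.

```python
from email.utils import parsedate_to_datetime


def heuristic_freshness_seconds(date_header, last_modified_header, fraction=0.1):
    """Estimate a freshness lifetime from the Date and Last-Modified headers.

    `fraction` is an assumed tuning knob; 10% is only a commonly cited value.
    """
    date = parsedate_to_datetime(date_header)
    last_modified = parsedate_to_datetime(last_modified_header)
    age = (date - last_modified).total_seconds()
    # A Last-Modified in the future gives a negative age; treat it as stale.
    return max(0.0, age * fraction)
```

So a feed last modified ten days before it was served would be considered fresh for about one day.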
ETags can be used for optimistic concurrency on writes; see [link]
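A minimal sketch of what that looks like, using a toy in-memory store in Python 3. The Store class, its method names, and the returned status codes are assumptions for illustration; a real server would do this check while handling a PUT that carries an If-Match header.

```python
import hashlib


class Store:
    """Toy resource store that honours If-Match on writes (illustrative only)."""

    def __init__(self):
        self._bodies = {}  # path -> bytes

    @staticmethod
    def etag_for(body):
        # A strong validator derived from the content itself.
        return '"%s"' % hashlib.sha1(body).hexdigest()

    def get(self, path):
        body = self._bodies[path]
        return body, self.etag_for(body)

    def put(self, path, body, if_match=None):
        """Write `body` and return an HTTP-style status code."""
        current = self._bodies.get(path)
        if if_match is not None:
            # The client only wants to overwrite the version it last read.
            if current is None or if_match != self.etag_for(current):
                return 412  # Precondition Failed: someone else got there first
        self._bodies[path] = body
        return 201 if current is None else 204
```

Two clients that both read the same version can then no longer silently overwrite each other: the second write comes back 412 and has to re-fetch first.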
Besides, validation is an optimisation; it doesn’t cost the cache any more than an extra header to try it out, and the Web still works correctly without it. Of course, it’s very helpful when it is supported.
Not sure what’s happening with Anne’s feed, but FWIW the Last-Modified date I see is malformed (it’s missing GMT). That will mess up some clients...
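A producer could catch that kind of problem with a quick check like the following Python 3 sketch. It only accepts the preferred fixed-length HTTP date form ending in GMT; the function name is made up, and it assumes the default C/English locale for day and month names.

```python
from datetime import datetime


def is_valid_http_date(value):
    """True if `value` looks like 'Sun, 06 Nov 1994 08:49:37 GMT'."""
    try:
        # The literal ' GMT' at the end is required by this format string.
        datetime.strptime(value, "%a, %d %b %Y %H:%M:%S GMT")
    except ValueError:
        return False
    return True
```

A date without the trailing GMT fails the check, which is the situation described above.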
I believe the intermittent ETag support comes from the ASP.NET caching infrastructure, where the ETag support only works while it has a cached copy of the response in memory. As soon as that cache is flushed (typically on time) it’ll regenerate the feed from the original ASP.NET code and recache, so you’ll see new ETags etc. even though the content hasn’t changed. It’s kinda dumb, which is why my .NET blogging engine generates a real file for the feed and relies on IIS’s ETag support.
Bloglines does ETags, Last-Modified, and multiple levels of hashing:
1) hash whole feed/http body if different than last, continue.
2) Parse feed into objects, hash contents of objects, if different from last, continue.
3) Some detection of bad Content Producers who modify every item every time you fetch it (such as including a timestamp in an <!-- escaped area)
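Steps 2 and 3 above could be sketched roughly as follows in Python 3. This is a guess at the general shape, not Bloglines' code: it assumes the feed has already been parsed into a dictionary of item id to content string, and it treats "detection of bad producers" as simply stripping comments before hashing.

```python
import hashlib
import re

# Matches <!-- ... --> comments, including ones that span several lines.
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def digest(text):
    return hashlib.sha1(text.encode("utf-8")).hexdigest()


def changed_items(items, seen):
    """Return ids of items whose content differs from the digest stored in `seen`.

    `items` maps item id -> content string (assumed already parsed);
    `seen` maps item id -> digest and is updated in place.
    """
    changed = []
    for item_id, content in items.items():
        # Ignore comments so an embedded timestamp does not look like an edit.
        d = digest(COMMENT.sub("", content))
        if seen.get(item_id) != d:
            seen[item_id] = d
            changed.append(item_id)
    return changed
```

An item whose only difference is a timestamp inside a comment is then treated as unchanged, while a real edit still shows up.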
Ok, the wordpress code is seriously lame. The ETag itself is just a hash of the last-modified and it’s getting corrupted by their retarded string escaping. I just removed the header("ETag...") from line 1637 of classes.php (as of wordpress 2.0.5) and I’ll just let last-modified do its thing.
You said: “Recommendation to feed producers: don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing.”
Some people will read that as “don’t send ETag and Last-Modified unless you support validation on them.” Note that I was talking about server, not client, behaviour; I was just pointing out that these headers can be used for other things too.
BTW, a lot of the time you’ll see validation not seeming to work (especially with ETags) because the server is actually a farm, and they’re not syncing their metadata. In the case of Last-Modified, this can happen when there are clock sync problems; for ETags, it’s often because Apache uses the inode to calculate the ETag’s value, by default, and it’s different across the farm. See: [link]
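To see why the inode matters, here is a small Python 3 sketch that builds a file-based ETag with and without it. The function and the exact string format are simplifications for illustration and should not be read as Apache's precise algorithm.

```python
def file_etag(inode, size, mtime, use_inode=True):
    """Build an ETag from file attributes, loosely in the style of a file server.

    The hex-joined format here is an assumption made for the example.
    """
    parts = [size, mtime]
    if use_inode:
        # The inode number is local to one filesystem, so it differs per machine.
        parts.insert(0, inode)
    return '"%s"' % "-".join("%x" % p for p in parts)
```

The same file copied to two servers keeps its size and modification time but gets a different inode on each, so only the variant without the inode produces matching ETags across the farm. In Apache, the FileETag directive controls which of these attributes are used.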
Just curious (with a cowardly cop-out that it’s late and I’m well into a few bottles of Timothy Taylor Landlord) but what’s so bad about using the value of the Last-Modified header to generate an ETag value? That seems reasonably sensible in my current state of thought...
Much as I enjoyed Why PUT and DELETE, I have to question Eliotte’s advice. When crafting a Web API, it’s worth knowing when to use GET over POST, and understanding the value of eTag is going to reap rewards, but why would a publisher...
Sometime last week I read this piece by Sam Ruby, which, summarized, says this: …don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some...