Yesterday, I more fully integrated Joe’s threading work into Venus. From an end user’s perspective, one advantage is that the first time you specify spider_threads, you will see an immediate benefit, as the Last-Modified and ETag header values that had previously been captured and stored in the Venus cache will be used.
With this change, the HttpLib2 cache becomes optional, but may soon provide additional benefits.
In debugging this, I took a look at ETag and Last-Modified usage, and found a few surprises. Sure, I found a few sites that provided neither, yet would return the same data, byte for byte, again and again. Most of these sites appeared to compute their feeds dynamically on each request; this includes sites such as IBM developerWorks: example.
As I said, this wasn’t a surprise.
Some sites, most typically ones powered by WordPress, would provide both ETag and Last-Modified headers, yet would always send the full content whenever an If-None-Match header was present. They would respect If-Modified-Since, but only if If-None-Match was not provided: example. Anne’s feed also falls into this category, but I can’t determine how it was produced.
While that was surprising, even more puzzling is the fact that some feeds out there support ETags only intermittently. And by intermittently, I do mean occasionally: anywhere from one time in two to about one time in four or less. All such feeds that I could find come from blogs.msdn.com: example.
You can verify this yourself with the following Python 2.4 script. Simply pass one or more URIs as command line parameters:
import urllib2, sys

for uri in sys.argv[1:]:
  # skip the -e and -m option flags
  if uri.startswith('-'): continue

  # first fetch: unconditional, to capture the validators
  headers = urllib2.urlopen(uri).headers
  request = urllib2.Request(uri)

  # second fetch: echo back whichever validators the server sent
  if headers.has_key('etag') and '-e' not in sys.argv:
    request.add_header('If-None-Match',headers.get('etag'))
  if headers.has_key('last-modified') and '-m' not in sys.argv:
    request.add_header('If-Modified-Since',headers.get('last-modified'))

  try:
    print uri, urllib2.urlopen(request).code
  except urllib2.HTTPError,e:
    # urllib2 reports 304 Not Modified as an HTTPError
    print e
Additionally, you can specify either or both of -e and -m to cause the associated header to be omitted.
What you want to see is HTTP Error 304: Not Modified. If, instead, you simply see 200, then the full content was sent both times.
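For readers on Python 3, where urllib2 no longer exists, the same check might look roughly like the sketch below. It is an approximation rather than the script used for this post; the function names conditional_headers and check are made up for illustration.

```python
import urllib.error
import urllib.request


def conditional_headers(response_headers, skip_etag=False, skip_modified=False):
    """Build validator headers for a follow-up request from an earlier response.

    response_headers can be any mapping with .get(), such as the .headers
    attribute of a urllib response or a plain dict.
    """
    headers = {}
    etag = response_headers.get("ETag")
    if etag and not skip_etag:
        headers["If-None-Match"] = etag
    modified = response_headers.get("Last-Modified")
    if modified and not skip_modified:
        headers["If-Modified-Since"] = modified
    return headers


def check(uri, skip_etag=False, skip_modified=False):
    """Fetch uri twice and return the status code of the conditional second fetch."""
    with urllib.request.urlopen(uri) as first:
        headers = conditional_headers(first.headers, skip_etag, skip_modified)
    request = urllib.request.Request(uri, headers=headers)
    try:
        with urllib.request.urlopen(request) as second:
            return second.status
    except urllib.error.HTTPError as error:
        # urllib.request, like urllib2, raises HTTPError for 304 Not Modified
        return error.code
```

Calling check(uri) on a well-behaved feed should return 304; the skip_etag and skip_modified arguments play the role of -e and -m.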
Recommendation to feed producers: don’t send ETag and Last-Modified headers unless you really mean it. But if you can support them, please do. It will save you some bandwidth and your readers some processing.
And to feed consumers, while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time. I’ve now implemented this for Venus.
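To illustrate the hashing idea, here is a minimal sketch. It is not the Venus code, which this post does not show; the choice of SHA-256, the in-memory dict, and the names body_digest and needs_parsing are all assumptions made for the example.

```python
import hashlib


def body_digest(body: bytes) -> str:
    """Hash the raw HTTP message body, exactly as it would be fed to the parser."""
    return hashlib.sha256(body).hexdigest()


def needs_parsing(uri, body, seen):
    """Return True when body differs from the last one recorded for uri.

    seen maps each feed URI to the digest of the body last processed; a real
    aggregator would persist this alongside its cache (an assumption here).
    """
    digest = body_digest(body)
    if seen.get(uri) == digest:
        return False  # byte-for-byte identical: skip parsing entirely
    seen[uri] = digest
    return True
```

This catches the servers described above that return 200 with identical bytes on every request: the bandwidth is still spent, but the parse is not.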
Update: WordPress ETag bug [via David Terrell]
It’s a bit strong to tell people to only use ETags and Last-Modified when they support validation; they can be used for other things as well.
LM is useful for calculating heuristic freshness in caches, if the server hasn’t bothered to make it explicit.
ETags can be used for optimistic concurrency on writes; see [link]
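As a rough illustration of the heuristic freshness point: when a response carries Last-Modified but no explicit expiry, a cache may treat it as fresh for some fraction of the time since it last changed. RFC 2616 mentions 10% as a typical fraction; the function name and the default below are otherwise assumptions for the sake of the sketch.

```python
from email.utils import parsedate_to_datetime


def heuristic_freshness_seconds(date_header, last_modified_header, fraction=0.1):
    """Estimate how long a cache might treat a response as fresh.

    Both arguments are HTTP date strings, for example
    "Tue, 15 Nov 1994 12:45:26 GMT". The 10% default is only a typical value.
    """
    served = parsedate_to_datetime(date_header)
    modified = parsedate_to_datetime(last_modified_header)
    age = (served - modified).total_seconds()
    # A Last-Modified later than Date yields no heuristic freshness at all
    return max(0.0, age * fraction)
```

A feed last modified ten hours before it was served would, under this heuristic, be considered fresh for about an hour.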
Besides, validation is an optimisation; it doesn’t cost the cache any more than an extra header to try it out, and the Web still works correctly without it. Of course, it’s very helpful when it is supported.
Not sure what’s happening with Anne’s feed, but FWIW the Last-Modified date I see is malformed (it’s missing GMT). That will mess up some clients...
And to feed consumers, while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time.
Are you hashing the entire stream returned by the server or are you hashing the content of individual nodes in the XML returned?
It’s a bit strong to tell people to only use ETags and Last-Modified when they support validation
I’m confused. Where did I say that? If by supporting validation you mean supporting headers like If-None-Match, then your reaction surprises me.
Let’s look at a scenario: I fetch a feed. In the headers, I get both an ETag and a Last-Modified header. As a respectful consumer, what headers should I send on my next request?
Are you hashing the entire stream returned by the server or are you hashing the content of individual nodes in the XML returned?
The HTTP Message body, i.e., the stuff that would be passed as input to the feed parser. Alternate suggestions welcome.
Yeah, that seems to be the case for Mike’s feed
GET /mikechampion/atom.xml HTTP/1.1
User-Agent: curl/7.15.3 (i586-pc-mingw32msvc) libcurl/7.15.3 OpenSSL/0.9.7d zlib/1.2.2
Host: blogs.msdn.com
HTTP/1.1 200 OK
...
X-Powered-By: ASP.NET
Bloglines does ETags, Last-Modified, and multiple levels of hashing.
For hashing:
1) Hash the whole feed/HTTP body; if it differs from the last one, continue.
2) Parse the feed into objects and hash the contents of each object; if they differ from the last ones, continue.
3) Some detection of bad content producers who modify every item every time you fetch it (such as including a timestamp in an <!-- escaped area)
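The three levels above could be sketched roughly as follows. This is not Bloglines code; the parse callback, the shape of the entries (dicts with "id" and "content" keys), and the comment-stripping regex are all hypothetical stand-ins for whatever a real aggregator uses.

```python
import hashlib
import re

# Matches XML/HTML comments, where some producers embed a per-request timestamp
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)


def digest(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def poll(body, parse, state):
    """Return the ids of entries that changed since the previous poll.

    parse is a caller-supplied function turning the body into a list of
    dicts with "id" and "content" keys; state is a dict kept between polls.
    """
    body_hash = digest(body)
    if state.get("body") == body_hash:
        return []  # level 1: identical body, nothing more to do
    state["body"] = body_hash

    seen = state.setdefault("entries", {})
    changed = []
    for entry in parse(body):  # level 2: compare entry by entry
        # level 3, crudely: ignore comments so an embedded timestamp
        # does not make every entry look modified
        stable = COMMENT.sub("", entry["content"])
        entry_hash = digest(stable)
        if seen.get(entry["id"]) != entry_hash:
            seen[entry["id"]] = entry_hash
            changed.append(entry["id"])
    return changed
```

The cheap whole-body comparison runs first so that the parser is only invoked when something, somewhere, has actually changed.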
Hey Sam,
You said: “Recommendation to feed producers: don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some processing.”
Some people will read that as “don’t send ETag and Last-Modified unless you support validation on them.” Note that I was talking about server, not client, behaviour; I was just pointing out that these headers can be used for other things too.
BTW, a lot of the time you’ll see validation not seeming to work (especially with ETags) because the server is actually a farm, and they’re not syncing their metadata. In the case of Last-Modified, this can happen when there are clock sync problems; for ETags, it’s often because Apache uses the inode to calculate the ETag’s value, by default, and it’s different across the farm. See: [link]
Cheers,
Sometime last week I read this piece by Sam Ruby, which summarized says this:
…don’t send Etag and Last-Modified headers unless you really mean it. But if you can support it, please do. It will save you some bandwidth and your readers some p...
...