Yesterday, I more fully integrated Joe’s threading work into Venus. From an end user’s perspective, one advantage of this is that the first time you specify spider_threads, you will see an immediate benefit, as the ETag header values that had been previously captured and stored in the Venus cache will be used.
With this change, the HttpLib2 cache becomes optional, but may soon provide additional benefits.
In debugging this, I took a look at ETag and Last-Modified usage, and found a few surprises. Sure, I found a few sites that provided neither, yet would return the same data, byte for byte, again and again. Most of these sites appeared to compute their feeds dynamically on each request; this includes sites such as IBM Developer Works: example.
As I said, this wasn’t a surprise.
Some sites would provide both ETag and Last-Modified headers, yet would always return the full content if an If-None-Match header was provided. They would respect If-Modified-Since, but only if If-None-Match was not provided. These sites are typically ones powered by WordPress: example. Anne’s feed also falls into this category, but I can’t determine how it was produced.
While that was surprising, even more puzzling is the fact that there are some feeds out there that intermittently support ETags. And by intermittently, I do mean occasionally: anywhere from one time in two to about one time in four or less. All such feeds that I could find come from blogs.msdn.com: example.
You can verify this yourself with the following Python 2.4 script. Simply pass one or more URIs as command line parameters:
    import urllib2, sys

    for uri in sys.argv[1:]:
      # skip the -e and -m option flags
      if uri.startswith('-'): continue

      # first request: capture the validators the server hands out
      headers = urllib2.urlopen(uri).headers
      request = urllib2.Request(uri)

      # second request: echo the validators back, unless suppressed
      if headers.has_key('etag') and '-e' not in sys.argv:
        request.add_header('If-None-Match',headers.get('etag'))
      if headers.has_key('last-modified') and '-m' not in sys.argv:
        request.add_header('If-Modified-Since',headers.get('last-modified'))

      # a 304 surfaces as an HTTPError; a 200 prints the status code
      try:
        print uri, urllib2.urlopen(request).code
      except urllib2.HTTPError,e:
        print e
Additionally, you can specify either or both of -e and -m to cause the associated header to be omitted: -e omits If-None-Match, and -m omits If-Modified-Since.
What you want to see is HTTP Error 304: Not Modified. If, instead, you simply see 200, then the full content was sent both times.
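For readers on Python 3, where urllib2 no longer exists, the request-building step of the script can be sketched as below. This is a minimal sketch rather than a port of the whole script: the function name `conditional_headers` and its parameters are made up for illustration, the sample header values are invented, and the function takes the first response's headers as a plain dict so it can be tried without touching the network.

```python
def conditional_headers(first_response_headers, use_etag=True, use_last_modified=True):
    """Build the validator headers for a second, conditional GET.

    first_response_headers: mapping of header names to values taken from
    the first (unconditional) response.
    use_etag / use_last_modified: set to False to mimic the -e / -m flags.
    """
    # HTTP header names are case-insensitive, so normalize before lookup.
    seen = {name.lower(): value for name, value in first_response_headers.items()}

    headers = {}
    if use_etag and 'etag' in seen:
        headers['If-None-Match'] = seen['etag']
    if use_last_modified and 'last-modified' in seen:
        headers['If-Modified-Since'] = seen['last-modified']
    return headers


if __name__ == '__main__':
    # Invented sample values, standing in for a real first response.
    first = {'ETag': '"abc123"', 'Last-Modified': 'Sat, 04 Nov 2006 12:00:00 GMT'}
    print(conditional_headers(first))
    print(conditional_headers(first, use_etag=False))
```

The resulting dict can be passed as the `headers` argument of `urllib.request.Request`. As with urllib2, a 304 response is raised as `urllib.error.HTTPError` rather than returned.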
Recommendation to feed producers: don’t send ETag or Last-Modified headers unless you really mean it. But if you can support them, please do. It will save you some bandwidth and your readers some processing.
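For producers, "really meaning it" comes down to answering a conditional request with a 304 when the validators still match. The sketch below is one way to make that decision, not how WordPress or any particular server does it; `not_modified` and its parameters are hypothetical names. It assumes a simple exact-string comparison of ETags (no weak-validator handling, and a naive split on commas), and it follows the HTTP rule that If-Modified-Since is ignored when If-None-Match is present.

```python
from email.utils import parsedate_to_datetime


def not_modified(request_headers, current_etag, last_modified):
    """Return True if a 304 Not Modified response is appropriate.

    request_headers: dict of request header names (any case) to values.
    current_etag: the feed's current ETag string, quotes included.
    last_modified: the feed's Last-Modified value as an HTTP date string.
    """
    headers = {k.lower(): v for k, v in request_headers.items()}

    if_none_match = headers.get('if-none-match')
    if if_none_match is not None:
        # If-None-Match takes precedence; If-Modified-Since is ignored.
        candidates = [tag.strip() for tag in if_none_match.split(',')]
        return '*' in candidates or current_etag in candidates

    if_modified_since = headers.get('if-modified-since')
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            modified = parsedate_to_datetime(last_modified)
            return modified <= since
        except (TypeError, ValueError):
            # Unparseable or incomparable dates: send the full content.
            return False

    # No validators were sent, so this is an ordinary request.
    return False
```

A server that returns True here for a request carrying only If-Modified-Since, but False for the same request with a matching If-None-Match added, would show the inconsistent behavior described earlier.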
And to feed consumers: while supporting these headers can save you bandwidth, computing a hash on the content may save you processing time. I’ve now implemented this for Venus.
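The content-hash idea can be sketched roughly as follows. This is not the Venus implementation; the function name `content_changed`, the in-memory `seen` dict standing in for a cache, and the choice of SHA-1 are all assumptions made for illustration.

```python
import hashlib


def content_changed(uri, body, seen):
    """Return True if body differs from what was last recorded for uri.

    body: the raw response bytes.
    seen: a dict mapping uri -> hex digest; updated in place.
    """
    digest = hashlib.sha1(body).hexdigest()
    if seen.get(uri) == digest:
        # Byte-for-byte identical to last time: parsing can be skipped.
        return False
    seen[uri] = digest
    return True
```

A consumer would call this after every 200 response and only parse the feed when it returns True. This covers the sites that ignore the conditional headers yet return the same bytes again and again.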
Update: WordPress ETag bug [via David Terrell]