It’s just data

Detecting Not Modified Reliably

Yesterday, I more fully integrated Joe’s threading work into Venus.  From an end user’s perspective, the first time you specify spider_threads you will see an immediate benefit: the Last-Modified and ETag header values that had previously been captured and stored in the Venus cache will be used.

With this change, the HttpLib2 cache becomes optional, but may soon provide additional benefits.

In debugging this, I took a look at ETag and Last-Modified usage, and found a few surprises.  Sure, I found a few sites that provided neither, yet would return the same data, byte for byte, again and again.  Most of these sites appeared to compute their feeds dynamically on each request; this includes sites such as IBM developerWorks: example.

As I said, this wasn’t a surprise.

Some sites would provide both ETag and Last-Modified headers, yet would always return the full content whenever an If-None-Match header was sent.  They would respect If-Modified-Since, but only if If-None-Match was not provided.  These sites are typically ones powered by WordPress: example.  Anne’s feed also falls into this category, but I can’t determine how it was produced.

While that was surprising, even more puzzling is the fact that there are some feeds out there that only intermittently support ETags.  And by intermittently, I do mean occasionally: anywhere from one time in two to about one time in four or less.  All such feeds that I could find come from example.

You can verify this yourself with the following Python 2.4 script.  Simply pass one or more URIs as command line parameters:

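import urllib2, sys

for uri in sys.argv[1:]:
  if uri.startswith('-'): continue  # skip the -e and -m flags
  # first request: capture whatever validators the server offers
  headers = urllib2.urlopen(uri).headers
  # second request: send those validators back
  request = urllib2.Request(uri)
  if headers.has_key('etag') and '-e' not in sys.argv:
    request.add_header('If-None-Match', headers['etag'])
  if headers.has_key('last-modified') and '-m' not in sys.argv:
    request.add_header('If-Modified-Since', headers['last-modified'])
  try:
    print uri, urllib2.urlopen(request).code
  except urllib2.HTTPError,e:
    print e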

Additionally, you can specify either or both of -e and -m to cause the associated header to be omitted.

What you want to see is HTTP Error 304: Not Modified.  If, instead, you simply see 200, then the full content was sent both times.
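
For readers on Python 3, the same check might be sketched roughly as follows.  This is not the script I actually ran, only an approximation of it; the function names conditional_headers and check are made up for illustration, and the skip_etag and skip_modified parameters stand in for the -e and -m flags.

```python
import urllib.error
import urllib.request


def conditional_headers(response_headers, skip_etag=False, skip_modified=False):
    """Build the validator headers for a second request.

    response_headers is the header mapping from a first, unconditional
    response (anything with a .get() method will do).
    """
    result = {}
    etag = response_headers.get('ETag')
    if etag and not skip_etag:
        result['If-None-Match'] = etag
    modified = response_headers.get('Last-Modified')
    if modified and not skip_modified:
        result['If-Modified-Since'] = modified
    return result


def check(uri, skip_etag=False, skip_modified=False):
    """Fetch uri twice and return the status code of the second request."""
    with urllib.request.urlopen(uri) as first:
        headers = conditional_headers(first.headers, skip_etag, skip_modified)
    request = urllib.request.Request(uri, headers=headers)
    try:
        with urllib.request.urlopen(request) as second:
            return second.status  # 200: the full content was sent again
    except urllib.error.HTTPError as e:
        return e.code  # urllib reports 304 Not Modified as an HTTPError
```

As with the original script, a result of 304 from check is the good outcome, and 200 means the server ignored the validators it was given.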


Recommendation to feed producers: don’t send ETag and Last-Modified headers unless you really mean it.  But if you can support them, please do.  It will save you some bandwidth and your readers some processing.
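
To make "really mean it" a little more concrete, here is a minimal Python 3 sketch of the decision a producer has to make on each request.  It is an illustration under stated assumptions, not code from WordPress or any other product: the function name not_modified and its arguments are hypothetical, request headers are assumed to arrive as a plain dict with canonical names, and the handling of multiple entity tags is deliberately naive.  The one rule it leans on comes from the HTTP specification: when If-None-Match is present, If-Modified-Since is ignored.

```python
from email.utils import parsedate_to_datetime


def _opaque(tag):
    """Drop a weak W/ prefix so tags are compared using weak comparison."""
    tag = tag.strip()
    return tag[2:] if tag.startswith('W/') else tag


def not_modified(request_headers, etag, last_modified):
    """Return True if a 304 may be sent instead of the full feed.

    request_headers: dict keyed by canonical header name (an assumption).
    etag: the feed's current entity tag, quotes included, e.g. '"v2"'.
    last_modified: timezone-aware datetime of the feed's last change.
    """
    if_none_match = request_headers.get('If-None-Match')
    if if_none_match is not None:
        # If-None-Match wins; If-Modified-Since is not consulted at all.
        if if_none_match.strip() == '*':
            return True
        # Naive split: assumes no entity tag contains a comma.
        candidates = [_opaque(t) for t in if_none_match.split(',')]
        return _opaque(etag) in candidates
    if_modified_since = request_headers.get('If-Modified-Since')
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
            # HTTP dates have one-second resolution.
            return last_modified.replace(microsecond=0) <= since
        except (TypeError, ValueError):
            return False  # unusable date: send the full content
    return False
```

A server that sends either header but never returns True from something like this is advertising validators it does not honor.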

And to feed consumers: while supporting these headers can save you bandwidth, computing a hash of the content may save you processing time when a server sends the same bytes again anyway.  I’ve now implemented this for Venus.
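
The idea can be sketched in a few lines of Python 3.  This is not the Venus implementation, which isn't shown here; the names content_digest and is_unchanged are hypothetical, the cache is assumed to be any dict-like mapping from URI to digest, and SHA-256 is simply one reasonable choice of hash.

```python
import hashlib


def content_digest(body):
    """Return a hex digest of the raw feed bytes."""
    return hashlib.sha256(body).hexdigest()


def is_unchanged(cache, uri, body):
    """Record the digest for uri and report whether it matches the last one.

    cache is a dict-like mapping of uri -> digest (a stand-in for whatever
    persistent store the consumer already has).
    """
    digest = content_digest(body)
    unchanged = cache.get(uri) == digest
    cache[uri] = digest
    return unchanged
```

When is_unchanged returns True, the consumer can skip parsing and everything downstream of it, even though the server answered with a 200 and the full content.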

Update: WordPress ETag bug [via David Terrell]