It’s just data

The H stands for Hyper

Everybody seems to be linking to Pete Lacey’s The S stands for Simple.  And for good reason.  It is quite funny, and I can honestly say, having lived through it myself, that it is also quite accurate.  In fact, if one of the four flavors of feeds that Pete provides were Atom 1.0, I would gladly add his feed to Planet Intertwingly.  Perhaps he will find one of these pointers helpful.

But as Paul Harvey is wont to say, it is time for the Rest of the Story (no pun intended).  I share this because I believe that those who simply poke fun at the alternatives, without realistically describing the pitfalls of REST, achieve only one thing: converting blissful WS-* developers into despairing REST developers.

So with that introduction, I want to share a contemporary example involving REST, and the excellent scripting language named Python.

The Rest of the Story

It started out with a simple feature request for Planet: “Anyway i can download the images in to cache itself ?”  Given the nature of HTTP proxies, this is a common requirement.

The first problem is that, unlike SOAP, image data transported natively over HTTP is not self-contained.  Separate from the data itself are a number of HTTP headers, and often (but not always) these headers are important too.  Because enough people miss this fact, there is a lot of content sniffing going on, and that causes problems too.  But let’s not go there; let’s try to do the right thing and capture the headers too.

Now httplib2 does that, and optimizes requests based on cache control headers, and even stores the data into flat files by default; files that can almost be served asis.  Some small tweaks are required, like changing 304 status codes to 200, and there are some headers like transfer-encoding that only apply to the transfer that already happened (retrieving the image in the first place), and not necessarily to the one that is going to happen (namely, serving the image from the cache).

So, to test this out, I issued the following magic incantation:

sudo a2enmod asis
sudo /etc/init.d/apache2 force-reload

And then I created a .htaccess file with a single line: SetHandler send-as-is, symlinked my httplib2 cache into my public_html directory, and manually edited a cache entry using vim.

And it didn’t work.

The problem turned out to be that httplib2 made use of a module named rfc822 to both read and write RFC 822 style headers.  And despite the fact that this portion of RFC 822 is relatively simple (don’t even get me started on the date format), the Python runtime library manages to get it wrong.  It gets it wrong in Python 2.2.  And in Python 2.3.  And in Python 2.4.  And in Python 2.5.

Instead of putting a CRLF between headers, it only puts an LF.  Even on Windows, and presumably even on Mac OS X.  Of course, the same module is liberal on reading, so it has no problem consuming the invalid messages that it produces.  But not everything is quite so liberal in quite the same way, and somewhere between Apache and Firefox (I haven’t debugged it further), my first test didn’t work.

This turns out to be easy to fix, and here is my initial stab at the code:

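import rfc822
from StringIO import StringIO
# data (the raw cache entry) and divider (offset of the header/body separator) are assumed to be set earlier
headers = rfc822.Message(StringIO(data[:divider+4]))
status = headers.get('status',None)
# a cached 304 has to be served as a 200
if status == '304': status='200'
for header in ['status','content-encoding','transfer-encoding']:
  if headers.has_key(header): del headers[header]
# convert the bare LF line endings to CRLF
headers = str(headers).strip().replace('\n','\r\n')
# the status pseudo-header has to come first
if status: headers = 'status: %s\r\n%s' % (status, headers)
data = headers + data[divider:]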

To be fair, embedded in those few lines is quite a bit of knowledge.  Not only of the workaround and the status code and header changes that I mentioned above, but also a few other things.  Status isn’t really an HTTP header, but many tools (most notably CGI) find it convenient to pretend that it is one, and others have picked up on this convention.  Of course, it only works if this “header” is first, something that isn’t mentioned in the documentation, not even in small print.  It’s just something that “everybody knows”.
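For anyone reading this on Python 3, where the rfc822 module no longer exists, here is a rough sketch of the same rewrite done on raw bytes.  The function name rewrite_cache_entry and the sample data are made up for illustration, and the assumption that a cache entry is simply a header block, a blank line, and a body is mine; it also ignores folded (continuation) header lines.  Treat it as a sketch of the idea, not as what httplib2 does.

```python
import re

# Headers that describe the original transfer (or are pseudo-headers),
# not the response we are about to serve from the cache.
DROP = {b"status", b"content-encoding", b"transfer-encoding"}

def rewrite_cache_entry(raw: bytes) -> bytes:
    """Return the entry with CRLF header lines and the status line first."""
    # Split at the first blank line, whichever line ending was used.
    match = re.search(rb"\r?\n\r?\n", raw)
    if match is None:
        raise ValueError("no header/body separator found")
    head, body = raw[:match.start()], raw[match.end():]

    status = None
    kept = []
    for line in re.split(rb"\r?\n", head):
        name, _, value = line.partition(b":")
        key = name.strip().lower()
        if key == b"status":
            status = value.strip()
        if key not in DROP:
            kept.append(line)

    # A cached 304 has to be served as a 200.
    if status == b"304":
        status = b"200"
    # The status pseudo-header only works if it comes first.
    if status is not None:
        kept.insert(0, b"status: " + status)
    return b"\r\n".join(kept) + b"\r\n\r\n" + body

# Hypothetical cache entry with LF-only header lines and a binary body.
sample = b"status: 304\ncontent-type: image/png\ntransfer-encoding: chunked\n\n\x89PNG\n\n"
rewritten = rewrite_cache_entry(sample)
```

The body is passed through untouched, which matters for binary data such as images.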

There’s also another subtle bug.  Not only does the rfc822 module get the line-endings wrong, it puts two blank lines between the headers and the body.  Effectively this means that the last blank line is considered a part of the body.  This is a problem for binary data.  It even is a problem with XML, if there is an XML prolog involved.

Let me repeat something for emphasis.  RFC822 is “simple”.  Simple enough that the Python runtime library gets it wrong.  And I haven’t even mentioned the various problems and deficiencies that urllib, urllib2, and httplib have that led Joe Gregorio to conclude that it was time to create httplib2.

And if any of you noticed that the rfc822 module is deprecated in favor of email, let me save you the trouble: the new email module has the same bug.
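The line-ending behavior is easy to check for yourself.  The following is a minimal sketch, assuming a current Python 3 standard library; the header values are made up for illustration.  As far as I can tell, the email package still serializes with bare LF by default, and CRLF has to be requested explicitly through a policy.

```python
from email.message import Message

# Build a message with two illustrative headers and no body.
msg = Message()
msg["Content-Type"] = "image/png"
msg["Cache-Control"] = "max-age=3600"

# Default serialization: header lines end in a bare LF.
lf_text = msg.as_string()

# Opting in to CRLF by cloning the message's policy with a different linesep.
crlf_text = msg.as_string(policy=msg.policy.clone(linesep="\r\n"))

print(repr(lf_text))
print(repr(crlf_text))
```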


If you got this far, congratulations.  But if you have come to the conclusion that REST and WS-* are both equally bad, and the primary difference is that WS-* has a more comprehensive approach to tooling, then I failed to adequately convey my key point, which I will now restate for emphasis: with REST, this turns out to be easy to fix.

In addition to all the architectural benefits of REST, as well as all the pragmatic experience the web has built up over time with caching and intermediaries (benefits and experience that WS-* forsakes), there is one other key difference.  HTTP wasn’t a home run all by itself; it was the pair of HTTP and HTML that was successful.  Key to this success is the fact that HTML is a file format that can be authored by a mere mortal in a text editor.  And yes, while I have seen HTML files produced by contemporary versions of Microsoft Word (as well as the SVG files produced by Adobe Illustrator), none of this prevents me from doing something simple myself using only the tools that I have available.

By contrast, WSDL was clearly designed to be produced by tools and consumed by tools.

This difference is crucial.  In simple, pragmatic, operational terms, this difference enables me to always get my job done using only duct tape.

And, in this case, the difference is doubly important, as Joe has already started committing these changes/workarounds into httplib2.  Every indication is that with the next version of httplib2, those who try to serve the cache it produces and maintains asis will find that “it just works”.