intertwingly

It’s just data

RFC 3986bis


TL;DR: URL parsers consume URLs and generate URIs.  Such URIs are not RFC 3986 compliant.  I’d like to fix that.

- - -

Let’s talk a bit about nomenclature.

On the web, particularly in places like values of attributes named href, there are things that people have, at various times, attempted to call web addresses or IRIs.  Neither term has stuck.  In common usage these are called URLs.

In between the markup and servers, there are user agents.  One such user agent is a browser.  Browsers don’t passively send URLs along: they reject some outright and transform others.  There should be a name for the set of outputs of the various cleanups that browsers perform.

Since browsers are programmable, you can directly observe this transformation.  The WHATWG URL specification defines an API which has already been implemented by Firefox and Chrome, and is being evaluated by Microsoft and Apple.  Open a JavaScript console and enter the following:

new URL("hTtP:/EXamPLe.COM/").href

The output you will see is:

"http://example.com/"

The output is clearly much cleaner and more consistent than the input.  In fact, in this case the output is RFC 3986 compliant.
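Rejection can be observed the same way.  What follows is a minimal sketch, assuming a console that implements the URL API; the helper name cleanUp is made up for this illustration and is not part of any specification.

```javascript
// Hypothetical helper: returns the parser's serialized output, or null when
// the parser rejects the input (the URL constructor throws a TypeError).
function cleanUp(input) {
  try {
    return new URL(input).href;
  } catch (e) {
    if (e instanceof TypeError) return null;
    throw e; // anything else is unexpected, so let it surface
  }
}

console.log(cleanUp("hTtP:/EXamPLe.COM/")); // transformed
console.log(cleanUp("not a url"));          // rejected: no scheme and no base URL
```

Returning null rather than letting the exception escape is only a convenience for experimenting with many inputs in a row.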

Unfortunately, in the general case, this isn’t true.  Browsers (and more generally: other libraries like the ones found in pretty much every modern programming language) can produce things that aren’t RFC 3986 compliant.
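One rough way to look for such cases is to test whether a parser’s output sticks to the characters RFC 3986 permits.  The sketch below is an assumption-laden illustration, not a validator: the helper usesOnlyRfc3986Chars and the second input are invented for this example, and the check is character-level only, so passing it does not prove compliance with the full grammar.

```javascript
// Characters RFC 3986 permits somewhere in a URI: unreserved, reserved
// (gen-delims and sub-delims), and "%" as the start of a percent-encoding.
const RFC3986_CHARS = /^[A-Za-z0-9\-._~:\/?#\[\]@!$&'()*+,;=%]*$/;

// Hypothetical helper: a coarse, character-level check only. It does not
// validate the full RFC 3986 grammar.
function usesOnlyRfc3986Chars(s) {
  return RFC3986_CHARS.test(s);
}

// Candidate inputs. The second contains "|", which RFC 3986 does not
// allow unencoded; whether a given parser escapes it is what we want to see.
const inputs = ["hTtP:/EXamPLe.COM/", "http://example.com/a|b"];

for (const input of inputs) {
  const output = new URL(input).href;
  console.log(output, usesOnlyRfc3986Chars(output));
}
```

Any output for which the second column is false is a candidate for closer inspection against the RFC 3986 grammar.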

I’m looking at every browser and every library I can.  I’m specifically looking for differences.  In some cases, I’m pointing out where such outputs are clearly wrong and need to be fixed.

In other cases, the outputs may not be RFC 3986 compliant, but they are useful and they actually work.  What this means in practice is that the set of things that consumers need to be able to correctly process is not defined by RFC 3986 but by these tools.

People can learn this the hard way by starting out to implement RFC 3986 and then finding that they need to reverse engineer other tools.  We can do better.  We can set out to update RFC 3986, or otherwise document the actual set of inputs that can be expected to be processed interoperably.

In general, I have found that it isn’t difficult to talk about places where RFC 3986 can be tightened up.  Where there has been push-back is in exploring any notion of loosening the definition.  The reaction is generally expressed along the lines of “doing so would break things”.

I can see how some see such a position as reasonable.  I don’t, and I’ll tell you why.  What is effectively being said is that documenting how things actually work will break things, which is clearly untrue.

What such an effort will do is not break things, but uncover uncomfortable truths.  To build upon an example from Dave Cridland, one such uncomfortable truth may be that the set of things that everybody except LDAP schemas can handle is different from the set of things LDAP schemas can handle.

There are three ways to handle that.  One would be to change everybody to conform to what LDAP can handle.  One would be to change LDAP.  And one would be to document clearly that the set of things LDAP can handle and the set of things that everybody else expects to be handled are separate sets.  Largely overlapping, yes, but not identical sets.

Documenting three sets (the set of things Chrome and other browsers support, the set of things HTTP and other protocols support, and the set of things LDAP supports) would not be my first choice, but it may be the only option available given the constraints.

If you look at those three sets, ideally each would be a proper subset of the one that precedes it.  That’s not the case at the moment, but as I mentioned, proposals to tighten up RFC 3986 that come with a clear rationale don’t seem to be getting much push-back.

What we need, then, is three names.  URIs seem to be the obvious choice for the name of the set of “things LDAP schemas support”.  For better or worse, URLs seem to be the name that has stuck for the first set.

At this point, a number of people, seeing an opening, suggest IRIs as the name for the set in the middle.  Um, no.  Except for fragments, this set is 100% pure ASCII.  The name for what IRIs attempted to define is URLs.

So this means that we need to define a new name.  That’s not so bad, is it?  It could be worse: at least we don’t have to define a cache invalidation strategy.