It’s just data

Meet the New Boss

Ian Hickson: For what it’s worth, HTML5 requires text/html to only be used for HTML, and defines exact parsing rules, and requires XHTML to only be sent with XML MIME types, and requires that an XML parser then be used, making the whole “send XHTML as text/html” thing completely invalid, and making the whole “but XML is stricter” argument false at the same time.

The more I look at HTML5, the more puzzled I get.

A while back, I was surprised to find that a patch I proposed for the sgmllib module was rejected by the Python folks.  Informally, I had thought of SGML as essentially a lax superset of XML, and the basis for languages like HTML4.  Upon closer inspection, it turns out that neither was precisely true.  There are a number of things in XML that aren’t in HTML4, like support for numeric entities expressed in hexadecimal, and well-defined support for characters that can’t be expressed in one byte in Unicode.

At least HTML5 fesses up and states:

While the HTML form of HTML5 bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

This means that projects like the Feed Parser and Beautiful Soup have to monkey patch sgmllib in order to deal with HTML as practiced.  And projects that use both, like Venus, have to be careful to make sure that these patches are consistent.

Before I go on, I feel the need to state that a lax superset is not inherently a bad thing.  In fact, a fully interoperable lax superset is quite arguably a good thing.  If more documents can be processed successfully and consistently, let’s just say that this is a significant benefit.

Some will point out that there are some compensating drawbacks, and will differ on how to weigh these factors.  That’s OK.  This is essentially the Perl vs Python argument: There Is More Than One Way To Do It vs There should be one—and preferably only one—obvious way to do it.

The Lax Choice

Tim Berners-Lee recently wrote: “The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work.”

That appears to be off the mark to me.

Let’s take those one at a time.

HTML allows, but does not require, quotes around attribute values.  I don’t see that heavily used outside of Google, but this rule is consistent with the general perception that HTML is a lax superset of XHTML.  No quibbles there.

No slashes in empty tags is something I want to talk about further, so let’s push that one on the stack; I’ll get back to it.

xmlns attributes are not defined in the extant HTML DTDs, but in practice unknown attributes are ignored, so it is quite possible to serve well-formed XHTML to ignorant user agents like IE7 and have the results processed successfully.  So, again in practice, one could think of HTML as XHTML where the namespace declarations are optional.  And again, this is consistent with the mental model of HTML being a lax superset of XHTML.  Now admittedly, this is a more complex subject than I have given it credit for here, but I do believe that problems in this area are solvable and I don’t wish to derail my larger point.

From what I see, more common than any of these are a number of other changes.  I do see some usage of UPPERCASE TAGS, but that appears to be on the wane.  I see a strong and persistent desire to use deprecated tags like <font>.  I see some improper nesting of tags, most commonly on formatting elements like <b> and <i>.  But mostly what I see is a lot of implicitly closed block tags like <p>, <li>, and <tr>.

In each of these cases, save one (slashes) or possibly two (namespaces), HTML5 consistently makes the lax choice.


Recapping: at the grammar level, HTML is unquestionably more lax than XML.  Breaking pesky rules like closing every element that you open and always nesting properly is a venial sin at most.
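To make that contrast concrete, here is a rough sketch using only Python 3's standard library.  The names `SNIPPET`, `TagCollector`, `xml_accepts` and `html_start_tags` are mine, made up for illustration.  Note that `html.parser` is a forgiving tokenizer, not an implementation of the HTML5 parsing algorithm, so treat this as an approximation of "HTML as practiced" rather than of the spec.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# List items that are never explicitly closed: common in HTML, fatal in XML.
SNIPPET = "<ul><li>one<li>two</ul>"


class TagCollector(HTMLParser):
    """Records the name of every start tag the tokenizer reports."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        self.start_tags.append(tag)


def xml_accepts(text):
    """Return True if an XML parser considers the text well formed."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        return False


def html_start_tags(text):
    """Return the start tags a lenient HTML tokenizer sees in the text."""
    collector = TagCollector()
    collector.feed(text)
    collector.close()
    return collector.start_tags


if __name__ == "__main__":
    print(xml_accepts(SNIPPET))      # the XML parser gives up
    print(html_start_tags(SNIPPET))  # the HTML tokenizer carries on
```

The XML parser rejects the snippet outright, while the tokenizer happily reports both list items.  Close the `<li>` elements and both tools accept the same input.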

At the lexical level, HTML is quite a different grammar, though HTML5 does make a few changes to close the gap.  &apos; has been added and numeric entities encoded in hex are allowed.
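Python's standard library happens to ship both the HTML 4.01 and the HTML5 entity tables, which gives a quick way to see the `&apos;` change.  This is only a sketch; the helper names `known_in_html4` and `known_in_html5` are my own, and the tables are Python's rendering of the two specs, not the specs themselves.

```python
import html
from html.entities import html5, name2codepoint


def known_in_html4(name):
    """True if the HTML 4.01 entity table defines this named entity."""
    return name in name2codepoint


def known_in_html5(name):
    """True if the HTML5 entity table defines this named entity.

    Keys in the html5 table carry the trailing semicolon.
    """
    return name + ";" in html5


if __name__ == "__main__":
    print(known_in_html4("apos"), known_in_html5("apos"))
    # html.unescape follows the HTML5 rules, so both the named entity
    # and a numeric entity written in hexadecimal are resolved.
    print(html.unescape("&apos;&#x41;"))
```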

But the one place where there is a clear difference is in how elements with a content model of empty are to be handled.  By this, I am referring to elements like img and link that can never have any child elements.

Slashes (a.k.a. U+002F SOLIDUS) are never allowed in the Before attribute name state, and furthermore end tokens for these elements are never allowed.

In both cases, the result is a parse error.  It is not clear to me what the meaning of that is outside of the context of a conformance checker, but it sounds ominous.  Particularly when paired with Ian’s quote above.

And entirely unnecessary.  And counterproductive, to boot.
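For comparison, here is how one forgiving tokenizer, Python 3's `html.parser`, treats the two spellings of an image tag.  The class `StartTagRecorder` and the function `start_tags` are hypothetical names for this sketch, and the behavior shown is that of this particular library, not of the HTML5 tokenizer described in the draft.

```python
from html.parser import HTMLParser


class StartTagRecorder(HTMLParser):
    """Records (tag, attributes) for every start tag reported."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # For <img ... /> the default handle_startendtag() calls this
        # method first and handle_endtag() afterwards, so both spellings
        # end up here.
        self.seen.append((tag, attrs))


def start_tags(text):
    """Return the (tag, attributes) pairs found in the text."""
    recorder = StartTagRecorder()
    recorder.feed(text)
    recorder.close()
    return recorder.seen


if __name__ == "__main__":
    html_style = '<img src="logo.png" alt="logo">'
    xhtml_style = '<img src="logo.png" alt="logo" />'
    print(start_tags(html_style) == start_tags(xhtml_style))
```

Nothing is lost or mangled by the trailing slash here; the same tag name and the same attributes come out either way.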

Look at Mozilla’s site.  An XHTML style image tag there.  Look at Microsoft’s.  Same tag, twice, but this time craftily enclosed in a string and emitted via document.write.  Look at the WhatWG blog.  Lachlan Hunt did a yeoman’s job, but stopped short of correcting the several hundred! places within WordPress that would need to be “corrected”.

But why?  HTML5 has already broken ranks with SGML.  And the reason given was a good one:

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed Web browsers interoperably implemented a different representation — has resulted in this version of HTML returning to a non-SGML basis.

Why can’t a similar line of reasoning apply here?


Start with a minimal XHTML core.  Toss out problematic and rarely used features like PIs anywhere but at the top (used for stylesheets).  Toss out internal DTDs, and probably even external ones too.  This would mean that only the five built-in named entities can be used, but so be it.  Toss out namespace prefixes for element names.  This will make a number of XML people wince, but default namespaces can still be used; in fact, they are pretty much the only way in which people who author XHTML today get their job done.

Only a statistically insignificant percentage of the web pages on the Internet will ever conform to that core, but that’s OK.

On top of that core, go wild with additions.  Quoting of attributes?  Optional!  Closing your list items?  Optional!  Proper nesting of formatting elements?  Don’t bother!  Case sensitive element names?  OK, that still needs to be worked out.  But you get the idea.

Why can’t this work?

The upside?  People who code a single parser could consume all HTML5 documents, and incidentally most legacy HTML and XHTML documents too.  And people who are conservative in what they produce could still have their content processed by existing XML based tools.
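As a small illustration of that last point, the sketch below feeds one conservatively produced document to an XML tool and to a lenient HTML tokenizer and compares what each sees.  It assumes Python 3's standard library; `CONSERVATIVE_DOC`, `names_via_xml` and `names_via_html` are names I made up, and the document sticks to the core described above: a default namespace, no prefixes, no DTD, quoted attributes, everything closed.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

CONSERVATIVE_DOC = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    "<head><title>Example</title></head>"
    '<body><p>Hello<br /><img src="logo.png" alt="logo" /></p></body>'
    "</html>"
)


def names_via_xml(text):
    """Element names in document order, as an XML parser sees them."""
    root = ET.fromstring(text)
    # ElementTree reports tags as "{namespace}name"; keep only the name.
    return [element.tag.split("}")[-1] for element in root.iter()]


class _StartTags(HTMLParser):
    """Collects start tag names; xmlns is just one more attribute here."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def names_via_html(text):
    """Element names in document order, as a lenient tokenizer sees them."""
    collector = _StartTags()
    collector.feed(text)
    collector.close()
    return collector.names


if __name__ == "__main__":
    print(names_via_xml(CONSERVATIVE_DOC) == names_via_html(CONSERVATIVE_DOC))
```

Both tools report the same elements in the same order, which is all that "conservative in what they produce" needs to buy.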