It’s just data

Meet the New Boss

Ian Hickson: For what it’s worth, HTML5 requires text/html to only be used for HTML, and defines exact parsing rules, and requires XHTML to only be sent with XML MIME types, and requires that an XML parser then be used, making the whole “send XHTML as text/html” thing completely invalid, and making the whole “but XML is stricter” argument false at the same time.

The more I look at HTML5, the more puzzled I get.

A while back, I was surprised to find that a patch I had proposed for the sgmllib module was rejected by the Python folks.  Informally, I had thought of SGML as essentially a lax superset of XML, and the basis for languages like HTML4.  It turns out that, upon closer inspection, neither was precisely true.  There are a number of things in XML that aren’t in HTML4, like support for numeric entities expressed in hexadecimal, and well-defined support for characters that can’t be expressed in one byte in Unicode.

At least HTML5 fesses up and states:

While the HTML form of HTML5 bears a close resemblance to SGML and XML, it is a separate language with its own parsing rules.

This means that projects like the Feed Parser and Beautiful Soup have to monkey patch sgmllib in order to deal with HTML as practiced.  And projects that use both, like Venus, have to be careful to make sure that these patches are consistent.

Before I go on, I feel the need to state that a lax superset is not inherently a bad thing.  In fact, a fully interoperable lax superset is quite arguably a good thing.  If more documents can be processed successfully and consistently, let’s just say that this is a significant benefit.

Some will point out that there are some compensating drawbacks, and will differ on how to weigh these factors.  That’s OK.  This is essentially the Perl vs Python argument: There Is More Than One Way To Do It vs There should be one—and preferably only one—obvious way to do it.

The Lax Choice

Tim Berners-Lee recently wrote: “The attempt to get the world to switch to XML, including quotes around attribute values and slashes in empty tags and namespaces all at once didn’t work.”

That appears to be off the mark to me.

Let’s take those one at a time.

HTML allows, but does not require, quotes around attribute values.  I don’t see that heavily used outside of Google, but this rule is consistent with the general perception that HTML is a lax superset of XHTML.  No quibbles there.

The absence of slashes in empty tags is something I want to talk about further, so let’s push that one on the stack; I’ll get back to it.

xmlns attributes are not defined in the extant HTML DTDs, but in practice unknown attributes are ignored, so it is quite possible to serve well formed XHTML to ignorant user agents like IE7 and have the results processed successfully, so — again in practice — one could think of HTML as XHTML where the namespace declarations are optional.  And again, this is consistent with the mental model of HTML being a lax superset of XHTML.  Now admittedly, this is a more complex subject than I have given it here, but I do believe that problems in this area are solvable and I don’t wish to derail my larger point.

From what I see, more common than any of these are a number of other changes.  I do see some usage of UPPERCASE TAGS, but that appears to be on the wane.  I see a strong and persistent desire to use deprecated tags like <font>.  I see some improper nesting of tags, most commonly on formatting elements like <b> and <i>.  But mostly what I see is a lot of implicitly closed block tags like <p>, <li>, and <tr>.

In each of these cases, save one (slashes) or possibly two (namespaces), HTML5 consistently makes the lax choice.

Solidus

Recapping: at the grammar level, HTML is unquestionably more lax than XML.  Breaking pesky rules like closing all the elements that you open and always nesting properly is a venial sin at most.
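To make the grammar-level difference concrete, here is a minimal Python sketch.  It assumes the standard library's xml.etree.ElementTree as a stand-in for a strict XML parser and html.parser as a stand-in for a lax tokenizer; neither is an HTML5 parser, and the fragment, the TagLogger class, and the xml_accepts helper are made up for illustration.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Illustrative fragment: unquoted attribute value, unclosed <li>,
# and misnested <b>/<i>.
LAX = "<ul class=menu><li>one<li><b><i>two</b></i></ul>"


class TagLogger(HTMLParser):
    """Records start and end tags exactly as the tokenizer reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag, dict(attrs)))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))


def xml_accepts(markup):
    """Return True if a strict XML parser considers the markup well-formed."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


logger = TagLogger()
logger.feed(LAX)
logger.close()

print("XML parser accepts it:", xml_accepts(LAX))
for event in logger.events:
    print(event)
```

The XML parser rejects the fragment outright, while the lax tokenizer reports every tag it saw and leaves it to the caller to decide what the unclosed and misnested elements mean.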

At the lexical level, HTML is quite a different grammar, though HTML5 does make a few changes to close the gap.  &apos; has been added and numeric entities encoded in hex are allowed.
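As a rough illustration of those two lexical points, the sketch below leans on Python's standard html module, whose unescape function follows the HTML5 character reference rules, and compares its HTML5 entity table with the older HTML 4 table that ships alongside it.  The choice of Python here is an assumption made for illustration, not something the HTML5 draft prescribes.

```python
import html
from html.entities import html5, name2codepoint

# name2codepoint holds the HTML 4 named entities; html5 holds the HTML5 set.
apos_in_html4_table = "apos" in name2codepoint
apos_in_html5_table = "apos;" in html5

# html.unescape applies HTML5 rules to named, decimal and hex references.
named = html.unescape("&apos;")
decimal = html.unescape("&#47;")
hexadecimal = html.unescape("&#x2F;")

print(apos_in_html4_table, apos_in_html5_table)
print(repr(named), repr(decimal), repr(hexadecimal))
```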

But the one place where there is a clear difference is in how elements with a content model of empty are to be handled.  By this, I am referring to elements like img and link that can never have any child elements.

Slashes (a.k.a. U+002F SOLIDUS) are never allowed in the Before attribute name state, and furthermore end tokens for these elements are never allowed.

In both cases, the result is a parse error.  It is not clear to me what the meaning of that is outside of the context of a conformance checker, but it sounds ominous.  Particularly when paired with Ian’s quote above.

And entirely unnecessary.  And counterproductive, to boot.

Look at Mozilla’s site.  An XHTML-style image tag there.  Look at Microsoft’s.  Same tag, twice, but this time craftily enclosed in a string and emitted via document.write.  Look at the WhatWG blog.  Lachlan Hunt did a yeoman’s job, but stopped short of correcting the several hundred! places within WordPress that would need to be “corrected”.

But why?  HTML5 has already broken ranks with SGML.  And the reason given was a good one:

Some earlier versions of HTML (in particular from HTML2 to HTML4) were based on SGML and used SGML parsing rules. However, few (if any) web browsers ever implemented true SGML parsing for HTML documents; the only user agents to strictly handle HTML as an SGML application have historically been validators. The resulting confusion — with validators claiming documents to have one representation while widely deployed Web browsers interoperably implemented a different representation — has resulted in this version of HTML returning to a non-SGML basis.

Why can’t a similar line of reasoning apply here?

Recommendation

Start with a minimal XHTML core.  Toss out problematic and rarely used features like PIs anywhere but at the top (used for stylesheets).  Toss out internal DTDs, and probably even external ones too.  This would mean that only the five built-in named entities can be used, but so be it.  Toss out namespace prefixes for element names — this will make a number of XML people wince, but default namespaces can still be used; and, in fact, are pretty much the only way in which people who author XHTML today get their job done.

Only a statistically insignificant percentage of the web pages on the Internet will ever conform to that core, but that’s OK.

On top of that core, go wild with additions.  Quoting of attributes?  Optional!  Closing your list items?  Optional!  Proper nesting of formatting elements?  Don’t bother!  Case sensitive element names?  OK, that still needs to be worked out.  But you get the idea.

Why can’t this work?

The upside?  People who code a single parser could consume all HTML5 documents, and incidentally most legacy HTML and XHTML documents too.  And people who are conservative in what they produce could still have their content processed by existing XML-based tools.
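A small sketch of that second point, under stated assumptions: the document below is a made-up example that stays inside an XML-compatible core (default namespace only, quoted attributes, everything closed), xml.etree.ElementTree plays the part of an existing XML tool, and html.parser plays the part of a lax consumer.  Neither implements HTML5; the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# A "conservative" document: well-formed XML using only a default namespace.
CONSERVATIVE = (
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>Example</title></head>'
    '<body><p class="intro">Hello<br/>world</p></body>'
    '</html>'
)


class StartTagCollector(HTMLParser):
    """Collects element names in the order their start tags appear."""

    def __init__(self):
        super().__init__()
        self.names = []

    def handle_starttag(self, tag, attrs):
        self.names.append(tag)


def names_via_xml(markup):
    """Element local names, in document order, as an XML tool sees them."""
    root = ET.fromstring(markup)
    # Tags look like "{namespace}local"; keep only the local part.
    return [el.tag.rpartition("}")[2] for el in root.iter()]


def names_via_html(markup):
    """Element names, in document order, as a lax tokenizer sees them."""
    collector = StartTagCollector()
    collector.feed(markup)
    collector.close()
    return collector.names


xml_names = names_via_xml(CONSERVATIVE)
html_names = names_via_html(CONSERVATIVE)
print(xml_names)
print(html_names)
```

Both consumers report the same elements in the same order, which is the property a conservative producer would be relying on.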


xmlns attributes are not defined in the extant HTML DTDs, but in practice unknown attributes are ignored, so it is quite possible to serve well formed XHTML to ignorant user agents like IE7 and have the results processed successfully, so — again in practice — one could think of HTML as XHTML where the namespace declarations are optional

To my surprise, not a single browser implements XML namespaces. It’s not just “ignorant MSIE”, but really none. Not a single browser allows for <x:html xmlns:x="http://www.w3.org/1999/xhtml"/>. Quite surprising, as all browsers implement some XSLT processing or other, which totally relies on namespaces. And there is no way of making an error here - if the document is well-formed, and contains namespace declarations, how could this be any kind of an error? Also, it’s not exactly difficult to implement namespaces. Why don’t even the good guys (Mozilla, KHTML, Opera) implement this?

Posted by Martin at

Hexadecimal character references are actually available in HTML and SGML, and have been for a long time.  They were introduced with the Annex K Web SGML Adaptations.

I don’t really understand your recommendations.  Are you suggesting that we make those changes to XHTML, or XML in general?

Things like PIs, internal DTD subsets and namespaces are defined by the XML and XMLNS specs.  XHTML can’t change that.  However, I agree with removing DTDs entirely from XML, but that’s a separate issue.  But, FWIW, XHTML5 doesn’t use a DTD at all.  I don’t see what the problem with PIs is, though there are presently few good uses for them.  I’ll stay out of the whole namespace issue for now.

Also, XHTML can’t allow optional quoting of attributes, optional end tags or case insensitivity.  But those features are already available in HTML.  You seem to be suggesting that we turn XHTML into another form of tag soup.  Have I completely misunderstood you?

Posted by Lachlan Hunt at

“Toss out namespace prefixes for element names — this will make a number of XML people wince, but default namespaces can still be used; and, in fact, are pretty much the only way in which people who author XHTML today get their job done.”

Why not toss out namespaces? Really, who actually needs them?

Posted by Bill de hOra at

Martin, they do. Try again. (They being Opera, Konqueror, Safari, Firefox and others perhaps.)

Posted by Anne van Kesteren at

Also, XHTML can’t allow optional quoting of attributes, optional end tags or case insensitivity.  But those features are already available in HTML.  You seem to be suggesting that we turn XHTML into another form of tag soup.  Have I completely misunderstood you?

I’m suggesting that the following be ditched:

For compatibility with existing content and prior specifications, this specification describes two authoring formats: one based on XML (referred to as XHTML5), and one using a custom format inspired by SGML (referred to as HTML5). Implementations may support only one of these two formats, although supporting both is encouraged.

In its place, there should be one authoring format, roughly described as the union of SGML and XHTML.  The primary change I see to HTML5 syntax would be to allow for empty elements.  If there is a strong enough need, one could consider allowing in arbitrary PIs and DTDs, etc., but I am not advocating that.

Given such a format, some people could optionally choose to produce documents that conform to an XML compatible subset, if they desire.

Meanwhile, let me turn this around.  What possible value is there in disallowing empty elements that exist in such places as the Mozilla, Microsoft, and WhatWG Blog websites?

Why not toss out namespaces? Really, who actually needs them?

I personally believe that the embedding of languages such as MathML and SVG is a worthy feature, one whose growth has so far been stunted by the requirement that the embedding document be served with an XML MIME type which is both not universally supported and triggers a non-forgiving browsing mode.

Posted by Sam Ruby at

Sam, please send your feedback to the whatwg list (whatwg@whatwg.org) or directly to me by e-mail (ian@hixie.ch) so I can make sure it is tracked and taken into consideration, and given a proper response.

Regarding the suggestion you give above — only have one serialisation, not two — we basically can’t. We have to support the “lax” version of HTML, since that’s what the Web uses, and it has to be backwards-compatible (which is why we can’t make trailing “/” characters meaningful — it would change how about 49% of the Web is parsed). We also have to support the XML version, because the HTML version is supported by turning it into a DOM, and you can always serialise a DOM as XML. Basically, the XML version isn’t actually in the HTML5 spec, it’s just that the XML and DOM specs define how to take an XML document and turn it into a DOM, and HTML5 is defined in terms of the DOM, so we automatically get that serialisation, for free if you will.

Posted by Ian Hickson at

The WHATWG blog now outputs conforming HTML 5!

Posted by Lachlan Hunt at

please send your feedback to the whatwg list

done

The WHATWG blog now outputs conforming HTML 5!

Sweet!  Were any lines of WordPress code harmed in the process?

Posted by Sam Ruby at

Sweet!  Were any lines of WordPress code harmed in the process?

Only 2 files: index.php and wp-blog-header.php.  We added a function to the beginning of index.php to strip the slashes from any occurrence of ‘/>’, used ob_start() to capture the output buffer and send it through the callback function, and then had to remove the call to gzip_compression(); from wp-blog-header.php so it would work.

It’s not a perfect solution because if someone enters /> into a comment, intending it to be output as a code example, instead of /&gt;, that slash will also get stripped.  However, we can always edit the comments to fix up that issue when it occurs.
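The actual change was made in PHP through an ob_start() callback, and its code is not shown here.  As a rough Python stand-in for the same idea, the hypothetical strip_xml_slashes function below rewrites every ‘/>’ in a finished page, and the sample page (also invented) shows the side effect described above: a ‘/>’ that a commenter typed as literal text loses its slash too.

```python
def strip_xml_slashes(page: str) -> str:
    """Naive output filter: rewrite every '/>' in the page as '>'."""
    return page.replace("/>", ">")


# An XHTML-style empty tag, followed by a comment in which the author typed
# a literal "/>" instead of escaping it as "/&gt;".
page = '<img src="a.png" alt="" /><p>Write &lt;br /> for a line break.</p>'

filtered = strip_xml_slashes(page)
print(filtered)
```

The filter cannot tell markup from text because it works on the serialized string rather than on a parsed tree, which is why the occasional comment has to be fixed by hand.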

Posted by Lachlan Hunt at

Only 2 files

From a recent post to WHATWG, it looks like the ob_start call could be moved to your theme.  Searching further, I found this and this.

None of these are precisely necessary for your usage, but could make both sharing your solution with others and upgrading to newer versions of WP easier.

Posted by Sam Ruby at

From my reading of the HTML5 spec, this basically maintains the status quo.

While it says that a slash in “Before attribute name state” or “Attribute name state” is a parse error, it also says that user agents don’t need to abort on parse errors (and judging by behaviour of current web browsers, I wouldn’t expect HTML5 web browsers to abort).  Using the error recovery in the spec, “<foo/>” would parse equivalent to “<foo>”, and “<foo bar=baz/>” would parse equivalent to “<foo bar=baz>”.

So an XHTML document that conforms to the “HTML Compatibility Guidelines” will be parsed correctly by an HTML5 parser that doesn’t abort on parse errors.  This doesn’t seem to be a backward step from how things are currently handled.

Posted by James Henstridge at

I thought of another reason to be cautious about adding XML-style empty element syntax: compatibility with pre-HTML5 browsers.  Adding empty element syntax causes the same problems as adding new implicitly closed elements.

For example, consider how the HTML fragment “<p>foo <i/>bar</p>” would be interpreted by existing HTML parsers and a theoretical HTML5-plus-xml-empty-element-syntax parser.  Is the word “bar” in italics or not?

I suppose the potential damage could be limited if the empty-element syntax could only be used for elements that are implicitly closed in HTML, but given that you could already do that with HTML5 parsers that recover from errors, it’d be of limited benefit.
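The two readings of that fragment can be sketched in Python.  Two assumptions are in play: xml.etree.ElementTree stands in for a parser that honours XML-style empty elements, and the hypothetical SlashIgnoringParser below overrides html.parser so that a trailing slash is simply dropped, as a stand-in for the error recovery described earlier.  Neither is a browser or an HTML5 parser.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

FRAGMENT = "<p>foo <i/>bar</p>"


class SlashIgnoringParser(HTMLParser):
    """Treats "<i/>" exactly like "<i>" and records which tags enclose each text run."""

    def __init__(self):
        super().__init__()
        self.open_tags = []
        self.contexts = {}

    def handle_startendtag(self, tag, attrs):
        # Ignore the slash: the element stays open.
        self.handle_starttag(tag, attrs)

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Close everything up to and including the matching open tag.
        if tag in self.open_tags:
            while self.open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        self.contexts[data.strip()] = list(self.open_tags)


lax = SlashIgnoringParser()
lax.feed(FRAGMENT)
lax.close()

# XML reading: <i/> is empty, so "bar" follows it rather than sitting inside it.
italic = ET.fromstring(FRAGMENT).find("i")

print("slash ignored, tags around 'bar':", lax.contexts["bar"])
print("XML reading, text inside <i>:", italic.text)
print("XML reading, text after <i>:", italic.tail)
```

Under the slash-ignoring reading the word sits inside the italic element; under the XML reading it follows an empty one, so the same bytes render differently.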

Posted by James Henstridge at

James: you might be interested in this thread.

Posted by Sam Ruby at
