Ian Hickson: Regarding your original suggestion: based on the arguments presented by the various people taking part in this discussion, I’ve now updated the specification to allow “/” characters at the end of void elements.
This is big. PHP’s nl2br function is now HTML5 compliant. WordPress won’t have to completely convert to HTML4 before people who wish to author documents targeting HTML5 can do so using this software. Such efforts can now afford to proceed much more incrementally. This is a much more sensible and practical possibility.
To illustrate the larger context, consider the universe of document bodies that are simultaneously both valid HTML5 and valid XHTML5. Only a few days ago, such bodies could not include any images. Now they can.
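To make that concrete, here is an illustrative fragment (my example, not one from the spec) that is now acceptable under both serializations, since the trailing slash on the void img element no longer renders it invalid HTML5:

```html
<p>An icon: <img src="icon.png" alt="icon"/></p>
```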
Now the remaining differences amount to a few edge cases, and a restriction that exists in the current draft but is in every way as meaningless in real life as the prior prohibition against trailing slashes in void elements. And every bit as provisional: the statement “xmlns attributes on <html> elements are disallowed in HTML 4 and in the WHATWG draft for HTML 5 as it exists on 1 December, 2006” has precisely the same validity as “closing slashes are disallowed on <img> elements in HTML 4 and in the WHATWG draft for HTML 5 as it existed on 29 November, 2006”.
Modulo this one arbitrary — and frankly artificial — difference, the effective overlap between “pure” HTML and “pure” XHTML has been greatly increased. This means that people can incrementally evolve towards one or the other — if they should choose to do so.
So now that pebble has been cast, the landslide is sure to follow. The right questions are already being asked. And Ian’s weak joke concerning atheism is already backfiring.
The truth is that most HTML is authored by pagans. Ones who don’t understand arguments such as these, which amount to stating that the meaning of your document can only be interpreted in the context of some knowledge that doesn’t exist in this universe at all, as it only exists in another plane of existence entirely. Only high priests with AllowOverride FileInfo credentials are permitted to speak to these gods. Which would be fine, if the only difference between Thor and Zeus were that one is forgiving and the other is vengeful. And if these magic incantations could be trusted to work.
Unfortunately, in the real world, they often don’t. Furthermore, the fact is that these two gods will judge your documents differently. They will produce different DOM trees for documents such as the XML specification based on how it is served. And, ironically, the XML specification is served as text/html.
This is an exceedingly subtle point. One that unfortunately does not leap out at you in the existing WHATWG document.
I believe that for HTML5 to be more than an intellectual exercise, it needs to include the pagan view. One that, in the final analysis, is a much simpler one. Pagans are like that.
Pagans might understand the notion that there are two authoring formats if one were, say, based on S-expressions and the other were based on XML. But we are talking angle brackets vs. angle brackets here. Where neither the element names, nor even (generally) the case of those names change. To a pagan’s untrained eye, such documents are indistinguishable.
In the pagan world view, there are documents that are HTML, and there are documents that are XML, and the overlap is called XHTML. In this view, there is a preferred MIME type for “simply HTML”, and a preferred mime type for “simply XML”, and a preferred mime type when you feel the urge to affirmatively declare that your document is both.
In this world view, if you take a document which targets this overlap, a conformance checker for HTML5 would identify one set of errors. Another conformance checker for XML’s well-formedness constraint would identify a possibly different set of errors. What truly would be surprising to such a pagan is for a conformance checker which simultaneously targets both to identify fewer errors than the union of the two. If an empty anchor tag triggers a parse error in HTML5, then by &deity; it should trigger the same parse error in XHTML5, no?
When all the religion was stripped away from the trailing slash in always-empty HTML elements discussion, only one question remained: I think basically the argument is “it would help people” and the counter argument is “it would confuse people”. This is an eminently sane way to approach discussions such as these.
I would argue that it would both help people and reduce confusion if a void <a/> element continued to be invalid HTML5 and, by implication, be invalid in XHTML5. By invalid, I simply mean that a parse error would be reported by a conformance checker whenever such constructs are found in a document. Non-draconian user agents can, of course, choose to recover from this error.
The HTML5/XHTML5 specification can detail the different recovery rules for this parse error based on whether the document is being parsed in HTML5 mode or XHTML5 mode. There are ample historical reasons for this divergence, and I’m certainly not suggesting that they be changed — merely that they be documented.
And, in a somewhat ironic twist, people will find that XML parsers won’t halt on this particular parse error. They will simply silently produce the “wrong” DOM for this invalid document.
The only realistic alternative? Don’t document this difference in behavior. Leave it as an exercise for the student. The prevailing opinion on the WhatWG working group seems to be that the XML serialization is “free” in that somebody else has already done the work. I will counter that it is only free for spec designers. It certainly isn’t free to implementers who must implement two parsers with two test suites and deal with two sets of bugs. And it certainly isn’t free to authors who must deal with the uncanny valley and cognitive dissonance implications of this needless split.
To commemorate this occasion, I’ve gone and updated planet intertwingly to use the (X)HTML5 doctype.
I’ve also gone ahead and created a small SVG icon for WhatWG, one that I can use in place of the comparatively bloated PNG image.
It is my hope that someday a pagan will take a fancy to one of my icons, will view source, and proceed to copy and paste said icon into their CMS. And when it doesn’t work as expected, they will proceed to file a bug report.
Meanwhile, somebody who is entirely apolitical and working on a browser feature to replace the graphics substrate will decide to humor this pagan. The other browser vendors will then shortly follow suit.
I intentionally did not choose the term atheism, and intentionally did choose the term paganism.
In my opinion, Ian’s analogy to atheism misses the mark. While a number of WHATWG members seem to take great pride in the fact that they don’t worship one particular false god, in many cases they seem to have a blind spot that causes them to fail to recognize that they have merely replaced one false god with another false god.
When all is said and done, I want a universal HTML parser to replace the monkey patched sgmllib approach that we’ve been using so far. A parser that is not based on how the SGML spec says that things are supposed to be parsed, but based on careful analysis of how HTML is practiced today.
At the present time, the WHATWG is the closest I’ve ever seen anyone come to undertaking the analysis necessary to make that work. Now if we could just address the matter of that one tiny blind spot, we’d be there.
I have issues with most of your points, but I don’t have time to address everything now, so here’s just a few to get started.
The other argument against the trailing slash, which sadly I didn’t realise at first, but which has just become more and more apparent is that allowing the trailing slash gives people the crazy notion that they can process HTML documents with XML tools and XHTML documents with HTML tools. They are different formats and they must be treated as such.
The reason for providing both serialisations was precisely to avoid this kind of nonsense: XHTML is for processing as XML, HTML is for processing as HTML! Either syntax can be used (in most cases) to represent exactly the same document. There is absolutely no reason whatsoever to ever process one format with tools designed for the other. It’s just completely unnecessary.
just take the XHTML view ... billions of documents ... to hell with them ... XHTML as the one true way ... forced to finally ... salvage all the work ...
Wow ... I ... are you ... you’re not ... oh my ... you must be ... um ... wow.
Mark,
The billions of HTML < 5 documents will continue to work just the same as they always have. I’m not saying we should magically delete them from the web. I’m saying that fewer choices for producers of markup to have to make going forward - not more - are a good idea. I think it’s perfectly reasonable to say: Hey, you want all this cool new stuff? Stop producing this garbage and follow a few simple rules. If the payoff is big enough, people will finally quote attribute values and put trailing slashes on image tags.
In the pagan world view, there are documents that are HTML, and there are documents that are XML, and the overlap is called XHTML. In this view, there is a preferred MIME type for “simply HTML”, and a preferred mime type for “simply XML”, and a preferred mime type when you feel the urge to affirmatively declare that your document is both.
Pagans know about MIME-types?
Next, you’ll tell me that they build landing strips.
stop producing this garbage and follow a few simple rules
rubys@rubypad:~$ python
Python 2.4.4c1 (#2, Oct 11 2006, 21:51:02)
[GCC 4.1.2 20060928 (prerelease) (Ubuntu 4.1.1-13ubuntu5)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> from urllib import urlopen
>>> from xml.dom import minidom
>>> minidom.parse(urlopen('http://www.xml-blog.com/'))
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/minidom.py", line 1915, in parse
    return expatbuilder.parse(file)
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/expatbuilder.py", line 930, in parse
    result = builder.parseFile(file)
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/expatbuilder.py", line 207, in parseFile
    parser.Parse(buffer, 0)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 159, column 71
>>>
Next question?
Perhaps I’m being naive but wouldn’t it be great if more people had reason to do the same?
The issue is that the Utopian vision you describe requires everybody to be bug free, always, in order to work.
Don’t get me wrong. HTML5 is a significant improvement over HTML4. With XHTML (served with the right mime type), either you did things right, or things broke. With HTML4, there was a description of how to do things right, but vendors were expected to reverse engineer IE or NN or something to figure out what to do when things went wrong. With HTML5, the parse error recovery behavior will be interoperably described.
Oh, and re: pegan. Fixed. Thanks!
Oh man, the worst thing in the world has got to be baiting Sam Ruby into validating your blog.
Sam, you should consider teaching a course at UNC on markup and standards. My only regret is that I’m no longer there and wouldn’t be able to take it.
Hmm. There is also a handy link at the bottom of the page to the W3C validator:
Heh. In the time it took me to write this, the error count has gone from 41 to 36.
Justin,
I have enough self-esteem to think it’s worth a little egg on my face if I learn something in the process. :) And I would also enroll in any class Sam would teach.
Sam,
I usually prefer for things to fail fast and fail loudly. That way I can just fix it and move on. I really do wish it were easier to consume markup reliably. Chewing on your words for a bit, maybe my answer is wrong: forcing people to fix their bugs to play in the HTML5 world. Maybe lax is better. Sometimes I focus on only one half of Postel’s Law. In any case, thanks for um... responding ... in a constructive way ... that ... helps ... me to actually ... learn something.
:)
Michael,
Glad you noticed I’m not trying to be a hypocrite. I want my stuff to be correct. :)
Michael,
Sorry, couldn’t resist a little good-natured chiding: clicking on the link to your site, it appears to be down.
I usually prefer for things to fail fast and fail loudly
Change your MIME type to application/xhtml+xml, and you will get your wish. It can be done. I’ve done it for years (press control-i to check). Also the Planet software that I maintain consistently converts a dog’s breakfast of feeds in pretty much every format known to man into consistently well formed output. Upgrade to Atom 1.0, and your feed will also be included on my personal planet.
Sam,
Thanks for the MIME tip. And thanks for offering to add my feed to your planet - that alone is reason enough to upgrade. I’ve frankly been a bit lazy when it comes to my blog software (I’m still running on a pre-4.0 release of Typo).
Also, I received the following error posting this comment, so forgive me if it appears twice:
CGI Failure
traceback: Traceback (most recent call last):
  File "gateway.cgi", line 39, in ?
    post(url)
  File "/home/rubys/mombo/post.py", line 386, in post
    writeComment(entry, title, body, decache=False)
  File "/home/rubys/mombo/post.py", line 230, in writeComment
    raise message
POST limit exceeded
Probably a velocity issue?
Probably a velocity issue?
Yes.
Modulo this one arbitrary — and frankly artificial — difference
The motivation for making HTML5 and XHTML5 artificially disjoint is, I believe, to make conformance checking always fail if an author gets the MIME type and serialization paired the wrong way around. That is, to make the mistake obvious and not silently let it pass in special cases.
The truth is that most HTML is authored by pagans.
They are supposed to stick to HTML5 (at least for the time being) and not even try to use XHTML5.
Only high priests with AllowOverride FileInfo credentials are permitted to speak to these gods.
You are assuming that “pagans” need to serve something other than just HTML5 as text/html.
I believe that for HTML5 to be more than an intellectual exercise, it needs to include the pagan view. One that, in the final analysis, is a much simpler one. Pagans are like that.
The simple answer is supposed to be: Use HTML5 and end your file names with .html and everything will be fine.
A big part of the problem would be solved if the “pagans” weren’t taught that XHTML is somehow cooler than HTML.
Pagans might understand the notion that there are two authoring formats if one were, say, based on S-expressions and the other were based on XML. But we are talking angle brackets vs. angle brackets here. Where neither the element names, nor even (generally) the case of those names change. To a pagan’s untrained eye, such documents are indistinguishable.
Yeah, this is a problem.
In this world view, if you take a document which targets this overlap, a conformance checker for HTML5 would identify one set of errors.
I understand why you want to target the overlap, but I think “pagans” would be better off not trying to target it. Hence, my “professional driver on closed road” remark on the mailing list.
What truly would be surprising to such a pagan is for a conformance checker which simultaneously targets both to identify fewer errors than the union of the two.
Aren’t you advocating for such a weird surprise? (BTW, mine targets both but only one per run.)
If an empty anchor tag triggers a parse error in HTML5, then by &deity; it should trigger the same parse error in XHTML5, no?
No.
When all the religion was stripped away from the trailing slash in always-empty HTML elements discussion, only one question remained: I think basically the argument is “it would help people” and the counter argument is “it would confuse people”. This is an eminently sane way to approach discussions such as these.
Indeed.
I would argue that it would both help people and reduce confusion if a void <a/> element continued to be invalid HTML5 and, by implication, be invalid in XHTML5.
That would entail tampering with XML. What’s the point of having an XML serialization if it isn’t an XML serialization but something yet different? Is your goal actually attacking XML Draconianness by calling for application/xhtml+xml to be processed using something other than a pure XML processor? Breaking XML is too politically incorrect even for the WHATWG.
Non-draconian user agents can, of course, choose to recover from this error.
But there must be no such user agents, unless you want to attack XML.
The prevailing opinion on the WhatWG working group seems to be that the XML serialization is “free” in that somebody else has already done the work. I will counter that it is only free for spec designers.
Rather, an XML serialization couldn’t be wished away nowadays, so it is better to define it in the spec than leaving it for someone else to formulate ad hoc.
Christian,
Yeah, I’m switching hosts soon.
That would entail tampering with XML.
No, it would not. It would be an unusual constraint, I grant you, but conceptually no different than the requirement that SOAP doesn’t permit PIs. It also would be a constraint that would be hard to implement when viewed only through the eyes of the infoset, but I digress.
I’m not suggesting that anybody “breaks” XML, I’m just saying that not all well formed XML documents are valid XHTML5 (Duh!).
Trust me as somebody who consistently produces well formed XHTML and serves it with the proper mime type whenever possible: there is a lot that people don’t tell you. For example, Opera don’t read no external DTD.
I would much prefer that XHTML5 said simply: you must use <!DOCTYPE html>, and therefore limit yourself to the five predefined named entities. If you don’t like that, then use HTML5. For this to work, there can’t be any “gotchas” with HTML5 like “oh, you are using SVG? Sucks to be you.”
Robert’s (as of yet unanswered) question is a good one. It would not take much to add an “if the element has an xmlns attribute” entry to the A start tag token not covered by the previous entries state in the How to handle tokens in the main phase section of the document.
Scripting and CSS behave differently, which will cause much more grief than the serialization differences to those who try to do both at the same time.
Tell me about it. All my application/xhtml+xml pages are served as text/html to IE. And my planet has a reasonable amount of both javascript and css. But what does the WHATWG document say about this? It says that there are two authoring formats, and you can convert between them by simply reading one into a DOM and producing the other. Riiight.
This is fixable. But first the WHATWG has to decide that the XML serialization is not “somebody else’s problem”.
Christian wrote:
I usually prefer for things to fail fast and fail loudly. That way I can just fix it and move on.
A thought experiment for you.
Sam wrote:
consistently converts a dog’s breakfast of feeds in pretty much every format
Including, I might add, non-wellformed XML. (For those who don’t know, I wrote 99% of the feed parser that powers Sam’s planet software. Sam wrote the last 1% that helps guarantee well-formed XHTML output in all cases.)
Henri wrote:
A big part of the problem would be solved if the “pagans” weren’t taught that XHTML is somehow cooler than HTML.
Indeed. Won’t somebody please think of the gerbils?
Henri also wrote:
Is your goal actually attacking XML Draconianness ...
Suddenly I feel my ears ringing.
Sam wrote:
All my application/xhtml+xml pages are served as text/html to IE.
This is, and has always been, the biggest practical problem with the XHTML MIME type: "It might be difficult for some user-agents." It is also the most difficult point to get across to web standards wannabes who haven’t actually used the technologies they defend. (No offense, Christian, you seem willing to learn.) That makes Sam the most deadly kind of advocate — one armed with experience.
Sam wrote the last 1% that helps guarantee well-formed XHTML output in all cases.
Rough design for this logic: if the output of all the cleansing produced by the Feed Parser and by Beautiful Soup is still impure, escape the impure bits. It took me a few iterations to track all those down, but it wasn’t all that hard.
Most of my value add came in after that. By using this software on a day in and day out basis with a large number of rather ugly feeds, I found a number of common errors that could be corrected. I’m still seeing areas where this could be improved, most recently on Bill’s Semantic Review post. I could point out the obvious irony, but that would be just too easy.
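The rough design quoted earlier (if the cleaned output is still impure, escape the impure bits) might be sketched like this. This is purely illustrative Python, not the actual Planet or Universal Feed Parser code, and purify is a made-up name:

```python
# Illustrative sketch of "if the cleaned output is still impure,
# escape the impure bits" -- not the actual Planet/feedparser code.
from xml.dom import minidom
from xml.sax.saxutils import escape

def purify(fragment):
    """Pass well-formed fragments through; escape everything else."""
    try:
        # Wrap in a dummy root so bare text and sibling elements parse.
        minidom.parseString('<div>%s</div>' % fragment)
        return fragment
    except Exception:
        return escape(fragment)

print(purify('<em>fine</em>'))     # <em>fine</em>
print(purify('broken < markup'))   # broken &lt; markup
```

The real logic has to cope with character encodings and undefined entities as well, but the shape of the fallback is the same: when all cleansing fails, escaping guarantees well-formed output.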
That makes Sam the most deadly kind of advocate
Hmmm. Is that a good thing or a bad thing? No wait — don’t answer that. :-)
Tell me about it. All my application/xhtml+xml pages are served as text/html to IE. And my planet has a reasonable amount of both javascript and css. But what does the WHATWG document say about this? It says that there are two authoring formats, and you can convert between them by simply reading one into a DOM and producing the other.
Anyone who thinks that has the slightest resemblance to the real world is invited to fill out a form.
(I should say that one of the biggest headaches about upgrading to MT 3.3x was tracking down instances of this particular impedance mismatch between text/html and application/xhtml+xml in the MT Admin interface.)
There’s a long list of things (about 1200 lines of my 1783 line patch file for MT 3.3x) that demand more than just a reserialization.
I would much prefer that XHTML5 said simply: you must use <!DOCTYPE html>, and therefore limit yourself to the five predefined named entities.
XHTML 5 is even better than that, there is no DOCTYPE at all and you are restricted to the 5 predefined entities. The only reason the DOCTYPE is even present in HTML is to trigger standards mode, otherwise there wouldn’t be one for it either.
The only reason the DOCTYPE is even present in HTML is to trigger standards mode
Sounds like a pretty good reason to me.
I’m entering this comment using lynx. I don’t get to see the SVG icon and it doesn’t remember my name, but that’s what graceful degradtion is all about.
Of course, I get a much better experience in IE7, and get to see the full content in FireFox and Opera.
The page is even served with the XHTML mime type to browsers that support it.
What’s not to like?
but that’s what graceful degradtion is all about.
Oh, and I don’t get spell check either. ;-)
For example, Opera don’t read no external DTD.
No browser does. [link] has some information on that. Your weblog is also the reason I put that question there.
Your weblog is also the reason I put that question there
Was. I no longer use named entities beyond the predefined ones.
I’m using a Macintosh and I do not have access to other platforms, but I would be happy to hear the results from other people. Let’s set XHTML 1.0 aside for a bit and focus on XHTML 1.1, which is stricter by definition and has to be served as application/xhtml+xml.
I created an XHTML 1.1 utf-8 file on my machine (no PI),
and I introduced a simple br, no trailing slash, to make it invalid.
I loaded it with browsers (Camino, Safari, Firefox), so basically outside the HTTP Web server environment. When the file name ends in .html, the file is parsed as tag soup. When it ends in .xhtml, the file is parsed as application/xhtml+xml and logically fails with a “meaningful error” (for geeks) message.
That would be the first step to educate people about the choice of serving XHTML 1.0 or XHTML 1.1 with application/xhtml+xml. I will publish an article on the QA Weblog about it… but before that I need more results from other browsers, so feel free to share.
When you choose to serve as application/xhtml+xml, you indeed cut off access for the user agents not supporting it, which, depending on the sources, is around 50% minimum. This is known to most HTTP geeks. Most people, even designers and Web developers, do not, most of the time, know about HTTP and HTML. And I’m not sure they should have to. The problem is more how to evangelize implementers. That is the hard (utopian) task.
The discussions are encouraging, but there is something I’m worried about. Rules of any kind are, by definition, meant not to be respected. Once Web Apps 1.0 has defined a new set of rules (even including parse error mechanisms), we will have a new set of mistakes, ones that we have not thought about.
Christian: about failing massively. Yes for a developer; more difficult for the customer service departments of big companies. There is a Cascading criSiS going on when you do that. On a personal Web site, you can choose to do it. I did it (I have said bye bye to IE.) BUT on a commercial Web site, it becomes unacceptable. Though I think it would help to fix many implementations in the world, it comes at a cost which seems very difficult to justify (it has been done a few years ago for CSS served as text/plain, and even before for tables not closed, but those were other times.)
I introduced a simple br, no trailing slash, to make it invalid.
Well, first of all, you will need to make clear that the bar to be surmounted is well-formedness, not validity.
This page and this one are unfailingly well-formed XHTML 1.1+MathML+SVG, served as application/xhtml+xml to compatible browsers. They are, however, almost never valid.
Moreover, opening up a .xhtml page in a browser is not a reliable method. Introduce the entity &foo; onto your page. If you open it in Safari, the page will display OK (with the entity replaced by a literal &foo;). Open the same page in Firefox, and you will get a Yellow Screen of Death.
Even more fun, introduce the entities ∮ and © onto your page. Now, whether Firefox displays a Yellow Screen of Death depends on what DOCTYPE you declare at the top of your page.
Rather than testing their pages in every XHTML-capable browser, it would be wiser for your readers to use an automated tool to flag errors. Of course, you should tell them not to bother with the W3C Validator, because it doesn’t actually check for well-formedness, as can plainly be seen by including the link <a href="http://validator.w3.org/"title="broken">Well?</a> in your test page.
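(For what it’s worth, a bare-bones version of such an automated well-formedness check is only a few lines of Python using the bundled expat parser. This is a sketch, not the code behind any real validator; note that the deliberately malformed link above, with no space before its title attribute, is exactly the kind of error it catches:)

```python
# A sketch of a bare-bones well-formedness check using Python's
# bundled expat parser. Real conformance checkers do far more.
import xml.parsers.expat

def well_formed(document):
    """Return True if the byte string parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True = this is the final chunk
        return True
    except xml.parsers.expat.ExpatError:
        return False

# Missing space between attributes is a well-formedness error:
print(well_formed(b'<p><a href="http://validator.w3.org/"title="broken">Well?</a></p>'))  # False
```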
Saving for future reference:
With regard to the second of your two messages, I have an even more basic question. Over and over, I hear the mantra
One DOM, Two Serializations
That is, the HTML5 Spec defines what is (and is not) a valid HTML5 DOM, and specifies two serializations, one suitable for text/html and the other suitable for application/xhtml+xml.
On the other hand, we’re told that
<p xml:lang="en">
  <ul>
    <li>Foo</li>
    <li>Bar</li>
  </ul>
</p>
is valid in the XHTML5 serialization, even though the corresponding DOM cannot be reserialized as HTML5.
Moreover
<input type="hidden" name="dataloss" value="This is a line of text.
This is another line.
This is a third.">
in the HTML5 serialization cannot be reserialized as XHTML5, without serious data-loss.
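(The loss is mechanical, not hypothetical: XML attribute-value normalization replaces literal newlines inside attribute values with spaces, so an XML parser reading the reserialized document sees one line where there were three. A quick sketch with any conforming parser:)

```python
# XML attribute-value normalization turns literal newlines inside an
# attribute value into spaces -- the data loss at issue here.
from xml.dom import minidom

doc = minidom.parseString('<input type="hidden" name="dataloss" '
                          'value="line one\nline two"/>')
value = doc.documentElement.getAttribute('value')
print(repr(value))  # 'line one line two'
```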
Even before we start getting into extending (X)HTML5 by including MathML and SVG content, I see real problems with the above mantra.
If the mantra were really true, then I see no insuperable problem, in principle, with including (well-formed!) subtrees of the form
<svg xmlns="http://www.w3.org/2000/svg"> ... </svg>
in an HTML5 document. After all, parsing it to a DOM, and then reserializing as XHTML5 would produce a well-formed XML document which could be consumed by an XML parser. Presumably, a real-world UA could be made to skip the intervening step.
Hmmm...
I guess that, with the current parsing algorithm, SVG would be a bad example. MathML would work better.
I guess that, with the current parsing algorithm, SVG would be a bad example. MathML would work better.
Relying on no overlap in tag names tightly couples these two efforts. Treating the required math element, or its associated xmlns attribute, as a trigger to a new state, something like consume foreign markup, would be a superior approach.
Rough design for this logic: if the output of all the cleansing produced by the Feed Parser and by Beautiful Soup is still impure, escape the impure bits.
A live example of this can be found on my planet at the moment (search for “”). The place to fix this would be in sgmllib, which is what led me to the desire to create a replacement for this library which could handle (X)HTML as practiced (and as embedded in feeds), which in turn led me to the WHATWG.
Independent of whether the WHATWG officially recognizes SVG, I plan to implement their algorithm as well as the additional states required for handing markup in foreign namespaces, and will make this code available to others.
what’s wrong with escaping the newlines in the attribute as numeric character references?
There’s nothing wrong with doing all sorts of things. I am simply pointing out that it is incorrect to state that a valid HTML5 document can be parsed to a DOM and reserialized as XHTML5 without data-loss. We already know that the converse statement is false.
My point is, simply, that if it were true that the DOMs that can be serialized (without data loss) to XHTML5 were a strict superset of the DOMs that can be serialized to HTML5, then the response to questions about extending HTML5 with foreign content (SVG, MathML):
Oh, for that you want to use the XHTML5 serialization.
would hold more water. There are plenty of other reasons why that response is inadequate. But the fact that XHTML5 is not a strict superset trumps all of the other objections.
Treating the required math element, or its associated xmlns attribute as a trigger to a new state, something like consume foreign markup, would be a superior approach.
Yes, for two reasons.
1. You don’t want to mistakenly stick the svg:a element in the XHTML namespace (for instance).
2. You want to allow empty-element syntax (not just on void HTML5 elements) with this subtree.
A new parsing state would take care of both concerns.
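As a toy illustration of such a state switch (the trigger set and state names below are made up, not taken from any draft):

```python
# Illustrative only: tag names and state labels are invented, not
# from the WHATWG draft. On a trigger element (or an xmlns
# attribute), the parser would hand off the whole subtree to a
# "consume foreign markup" state.
FOREIGN_ROOTS = {'math', 'svg'}

def next_state(tag_name, attrs):
    """Pick the parser state for a start tag token."""
    if tag_name in FOREIGN_ROOTS or 'xmlns' in attrs:
        return 'consume foreign markup'
    return 'in body'

print(next_state('math', {}))           # consume foreign markup
print(next_state('p', {'class': 'x'}))  # in body
```

Inside the foreign-markup state, namespace handling and empty-element syntax would follow XML rules until the matching end tag, addressing both concerns above.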
I am simply pointing out that it is incorrect to state that a valid HTML5 document can be parsed to a DOM and reserialized as XHTML5 without data-loss.
I’ve started a wiki page that can be used to capture these differences.
It’s why I ask for more results on other browsers.
I’m not exactly sure what your question is. You were the one who, earlier, pointed me to a list of XHTML UA’s.
Are you interested in the detail of how they handle well-formedness errors?
I do not get the “Moreover, opening up a .xhtml page in a browser is not a reliable method.” A reliable method for what?
Not a reliable method for determining if the page will be treated as well-formed in another browser. (Recall that Opera will silently escape an undefined entity, whereas Firefox will produce a Yellow Screen of Death. If your page has an XHTML+MathML DOCTYPE, and contains the entity , Firefox will render the page without errors, whereas Safari will issue a parsing error. And so on ...)
We are well aware of ... and it is why we are trying to move to Unicorn.
I’m not sure how moving to Unicorn impacts the issue at hand. Are you planning on replacing OpenSP with another parser (at least, for parsing XML documents)? If not, then moving to Unicorn doesn’t change anything. If yes, might I be so bold as to ask what you are planning on using instead of OpenSP?
[...] the corresponding DOM cannot [always] be reserialized as HTML5 [and] the HTML5 serialization cannot [always] be reserialized as XHTML5, without serious data-loss.
This is true, and it has been known from the beginning. While every attempt has been made to reduce the number of inconsistencies between the two serialisations, backwards compatibility constraints do prevent it from happening in some cases. Such differences include not allowing a p
element to contain structured inline level elements in HTML, a table
not being able to contain child tr
elements, the inability to include processing instructions in HTML, noscript
elements in XHTML and many more.
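One of those differences, the processing-instruction case, is easy to check mechanically: an XML DOM can contain PI nodes, but the HTML serialization has no way to express them. A minimal sketch, using Python’s `xml.dom.minidom` as a stand-in XML processor (the `find_pis` helper is hypothetical, not part of any validator):

```python
# Sketch: scan an XHTML DOM for processing instructions, which the
# HTML serialization cannot represent.
import xml.dom.minidom as minidom

def find_pis(node):
    """Collect the targets of all PI nodes in a DOM subtree."""
    pis = []
    if node.nodeType == node.PROCESSING_INSTRUCTION_NODE:
        pis.append(node.target)
    for child in node.childNodes:
        pis.extend(find_pis(child))
    return pis

doc = minidom.parseString(
    '<?xml-stylesheet href="style.css"?>'
    '<html xmlns="http://www.w3.org/1999/xhtml"><body/></html>'
)
print(find_pis(doc))  # ['xml-stylesheet']
```

A tool reserializing such a DOM as HTML5 would have to drop the stylesheet PI, which is exactly the kind of loss being catalogued.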
But the point is that these aren’t significant issues. Those differences have existed for a long time between HTML4 and XHTML1, yet that hasn’t stopped people serving their XHTML as text/html. Just keep in mind that, for all practical purposes, serving XHTML as text/html is equivalent to reserialising as HTML anyway.
I’ve started a wiki page that can be used to capture these differences.
I’ve completed the list with many more differences that prove just how significantly different the HTML and XHTML serialisations are, despite their similarities on the surface. In the scheme of things, you should now see just how insignificant and irrelevant minor changes, such as allowing the trailing slash and a meaningless xmlns attribute really are in HTML.
I also moved the discussions from the main page to the talk page and responded to a few points.
I’ve completed the list with many more differences that prove just how significantly different the HTML and XHTML serialisations are, despite their similarities on the surface. In the scheme of things, you should now see just how insignificant and irrelevant minor changes, such as allowing the trailing slash and a meaningless xmlns attribute really are in HTML.
XHTML and HTML5 are really only as different as en-us and en-au. Yes, one can compile a large list of differences, but the fact remains that meaningful communication is possible.
I also moved the discussions from the main page to the talk page and responded to a few points.
Excellent!
About the yellow screen of death: I found it useful, but I’m on the geek side of things here. Normal people will certainly not understand what is happening if they get this kind of screen. I have always wished that browsers had a two-mode behaviour (a preference): one mode for developers, which would be unforgiving, and one for everyone else. Though it is wishful thinking.
It is a bit like the W3C Validator: whatever options we decide to keep, add or remove, there will be a group of people who say it is wrong. Some people want it to be exclusively a validator, some a fixing tool, some a conformance checker, some a tool to help people develop documents of an exotic nature (multinamespace documents). The W3C Validator has a very long history, with phases where it was completely stalled for two years. Olivier Théreaux pushed mountains to foster energy around it again, but it takes time to do things. The way it works, the Validator uses OpenSP messages to output error messages. Changing that is not possible without a major refactoring of the code. That is what Bjoern, Terje, Nick and Olivier started a couple of years ago. But all these people have other things on their plate too, plus a private life.
Unicorn is a framework, a kind of online pipe. So basically, when we give it a URI, it can distribute it to one or more tools and gather the results from the different sources.
[link]
Practically, it means you can plug in an RDF validator, an SVG validator, a platypus document validator based on RNG or NVDL, etc. So yes, the plan is to get rid of OpenSP for XML documents.
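The “online pipe” idea can be sketched in a few lines. This is a hypothetical illustration of the fan-out-and-gather shape, not Unicorn’s actual API; all names here (`Observer`, `observe`, the stub tools) are invented:

```python
# Hypothetical sketch of a Unicorn-style framework: fan one URI out to
# several observers (validators) and gather their results.
from typing import Callable, Dict, List

Observer = Callable[[str], List[str]]  # uri -> list of messages

def observe(uri: str, observers: Dict[str, Observer]) -> Dict[str, List[str]]:
    """Run every registered tool against the same URI and collect output."""
    return {name: tool(uri) for name, tool in observers.items()}

# Two stub observers standing in for, e.g., an XML parser and a CSS checker.
results = observe("http://example.org/", {
    "xml": lambda uri: [],                    # no well-formedness errors
    "css": lambda uri: ["unknown property"],  # one stub complaint
})
print(results)  # {'xml': [], 'css': ['unknown property']}
```

The point of the design is that the framework itself knows nothing about any particular tool, which is why swapping OpenSP for another XML parser becomes a local change.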
Do not forget that people will still produce documents based on the HTML 4.01 and HTML 3.2 specifications, and XHTML 1.0 served as text/html… no matter what you advocate. Creating a new language doesn’t remove the old documents and does not change people’s practices. Lesson learned from working more than 6 years at W3C. Last week, I read the class notes of a multimedia master’s course promoting the “font” element and other things of that type. And when people complain about their Web Forms or canvas elements not working in their old browsers, the only possible answer will be “please, upgrade to a new browser”. Fun times ahead.
A bit of digression…
It was easy to create for the Web 10 years ago in Europe and North America, because we were still in a period of rapid growth and quick replacement of products.
In Asia and Africa, the story is a bit different; the rapid growth is in the mobile world. Watch out for Minimo.
[link]
In Africa, they do not have computers and land lines in most places, but they do have mobile access and they use it. That’s another part of the story that we, western people, sometimes tend to forget. In Asia, people mainly use mobile phones; a computer is used only if you are rich enough to get one.
[link]
Look at this image of stats for July on the BBC Web site:
[link]
The way it works, the Validator uses OpenSP messages to output error messages. Changing that is not possible without a major refactoring of the code.
You’ll note that in my local version of the W3C Validator, I didn’t attempt to replace OpenSP; I merely added XML::LibXML
as an additional check when OpenSP declared an XML document to be “valid.” And I didn’t attempt to intercept the messages from libxml2
and replace them with your more “user-friendly” messages, as is done with the messages from OpenSP.
On the other hand, it only took me an hour to implement.
Do not forget that people will still produce documents based on the HTML 4.01 and HTML 3.2 specifications, and XHTML 1.0 served as text/html… no matter what you advocate.
I, personally, don’t advocate anything. But I do put a high premium on ensuring that XHTML pages that pass validation are, in fact, well-formed.
Try the (valid XHTML, according to OpenSP)
<a href="foo"title="bar">fubar</a>
in my comment form.
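Any strict XML parser rejects that fragment immediately, because there is no whitespace between the two attributes. A minimal sketch of the kind of extra check described above, using Python’s stdlib `xml.etree` rather than XML::LibXML:

```python
# Sketch: the fragment OpenSP accepts is not well-formed XML, because
# there is no whitespace between the href and title attributes.
import xml.etree.ElementTree as ET

fragment = '<a href="foo"title="bar">fubar</a>'
try:
    ET.fromstring(fragment)
    well_formed = True
except ET.ParseError as err:
    well_formed = False
    print(err)  # e.g. "not well-formed (invalid token): ..."
print(well_formed)  # False
```

Running a parse like this after the SGML-based validation pass is essentially the one-hour fix described above: it doesn’t replace OpenSP, it just refuses to call an ill-formed document “valid XHTML”.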
No, it would not. That would entail tampering with XML.
It would mean limiting the syntactic sugar of a lower layer spec from a higher layer. I think that counts as tampering with the lower layer spec. Likewise, it would be inappropriate for HTTP to micromanage TCP ARQ or IP MTU.
It would be an unusual constraint, I grant you, but conceptually no different than the requirement that SOAP doesn’t permit PIs.
It is conceptually very different. PIs are something that an XML processor reports to an application. The choice of syntactic sugar is something that the XML processor abstracts away from the app.
It also would be a constraint that would be hard to implement when viewed only through the eyes of the infoset, but I digress.
I think a higher-level spec requiring checking something that is not seen through the SAX2 ContentHandler
interface (ignoring qName
s) is a better indicator of a layering violation.
I’m not suggesting that anybody “breaks” XML, I’m just saying that not all well formed XML documents are valid XHTML5 (Duh!).
To keep the layer cake sound, the constraining should happen on top of the XML processor—not inside it.
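The layering point can be made concrete with Python’s stdlib `xml.sax`: a ContentHandler receives exactly the same events whether the author wrote empty-element syntax or a start/end tag pair, so a constraint on that sugar is invisible at this layer. A minimal sketch (the `Recorder` class is invented for illustration):

```python
# Sketch: a SAX ContentHandler cannot tell <br/> from <br></br>;
# the XML parser abstracts the syntactic sugar away.
import xml.sax

class Recorder(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.events = []
    def startElement(self, name, attrs):
        self.events.append(("start", name))
    def endElement(self, name):
        self.events.append(("end", name))

def events(document: bytes):
    """Parse a document and return the element events it produces."""
    recorder = Recorder()
    xml.sax.parseString(document, recorder)
    return recorder.events

# Both spellings yield identical event streams.
assert events(b"<p><br/></p>") == events(b"<p><br></br></p>")
print(events(b"<p><br/></p>"))
# [('start', 'p'), ('start', 'br'), ('end', 'br'), ('end', 'p')]
```

Anything a higher-level spec wants to say about which spelling was used would have to bypass this interface, which is the layering violation being described.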
Trust me as somebody who consistently produces well formed XHTML and serves it with the proper mime type whenever possible: there is a lot that people don’t tell you. For example, Opera don’t read no external DTD.
Actually, browsers not reading DTDs is a natural expectation given the XML spec (especially Tim Bray’s annotated version).
I would much prefer that XHTML5 said simply: you must use <!DOCTYPE html>, and therefore limit yourself to the five predefined named entities. If you don’t like that, then use HTML5.
Actually, it tells authors not to use a doctype at all in XHTML5, but banning the doctype would overstep the authority of a spec that is supposed to take XML 1.0 seriously.
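The practical effect of a non-validating parser that skips the DTD is exactly the constraint described above: only the five predefined entities (&amp;, &lt;, &gt;, &quot;, &apos;) resolve, and anything else is a fatal well-formedness error. A minimal sketch using Python’s `xml.etree`, which behaves like such a browser:

```python
# Sketch: without a DTD, a non-validating XML parser knows only the
# five predefined entities; &nbsp; is a fatal error.
import xml.etree.ElementTree as ET

ok = ET.fromstring("<p>&amp; &lt; &gt; &quot; &apos;</p>")
print(ok.text)  # & < > " '

try:
    ET.fromstring("<p>&nbsp;</p>")
except ET.ParseError as err:
    print("rejected:", err)  # undefined entity
```

This is also why the same page can render in one browser and produce a parse error in another: it depends entirely on whether the browser happens to know the entity from somewhere other than the DTD.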
For this to work, there can’t be any “gotchas” with HTML5 like “oh, you are using SVG? Sucks to be you.”
Yes, the “just stick to HTML5 and text/html
” party line doesn’t quite work for you. I’d appreciate it if you could share your implementation experience with Venus on the mailing list. My trial balloon on the mailing list was unresearched.
But what does the WHATWG document say about this? It says that there are two authoring formats, and you can convert between them by simply reading one into a DOM and producing the other. Riiight.
It says there is a magic flag.
Suddenly I feel my ears ringing. Is your goal actually attacking XML Draconianness ...
Mark, what would you like the WHATWG to do here?
Moreover
<input type="hidden" name="dataloss" value="This is a line of text.
This is another line.
This is a third." >
in the HTML5 serialization cannot be reserialized as XHTML5, without serious data-loss.
You can have line breaks in attributes in XML if you escape them. See the last line in the table in section 3.3.3 of the XML 1.0 spec. No fatal data-loss. The mantra can go on.
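Python’s ElementTree illustrates the escape being pointed at here: on output it writes attribute newlines as the character reference &#10;, which survives the attribute-value normalization described in section 3.3.3 on the way back in. A minimal sketch:

```python
# Sketch: a literal newline in an attribute round-trips once it is
# serialized as the character reference &#10; (XML 1.0, section 3.3.3).
import xml.etree.ElementTree as ET

el = ET.Element("input", {
    "type": "hidden",
    "name": "dataloss",
    "value": "This is a line of text.\nThis is another line.\nThis is a third.",
})
serialized = ET.tostring(el, encoding="unicode")
print("&#10;" in serialized)  # True: newlines escaped, not literal

roundtrip = ET.fromstring(serialized)
print(roundtrip.get("value") == el.get("value"))  # True: no data-loss
```

A literal (unescaped) newline in the attribute would instead be normalized to a space on re-parse, which is the data-loss the quoted example is worried about.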
Mark, what would you like the WHATWG to do here?
I am not involved in WHATWG in any way, and — for the time being — I need to avoid even the appearance of collaboration.