I see that Henri Sivonen is once again being snarky without backing his position. I’ll state my position, namely that something like the polyglot specification needs to exist, and why I believe that to be the case.
The short version is that I have developed a library that I believe to be polyglot compatible, and by that I mean that if there are differences between what this library does and what polyglot specifies, one or both should be corrected to bring them into compliance.
I didn’t write this library simply because I am loony, but very much to solve a real problem.
The problem is that HTML source files exist that contain artifacts like consecutive <td> elements; people process such documents using tools such as anolis; and such tools often depend — for good reasons — on libraries such as libxml2 which do an imperfect job of parsing HTML correctly. The output produced by such tools when combined with such libraries is incorrect.
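To make the failure mode concrete, here is a small sketch using Ruby’s stdlib REXML as a stand-in for a strict, XML-based consumer (the markup strings are hypothetical, not taken from any real page): implied </td> closes are valid HTML but choke XML tooling, while the explicitly closed, polyglot-style equivalent parses cleanly.

```ruby
require "rexml/document"

# Valid HTML5: the parser implies </td> before the next <td>.
# An XML-based tool, however, sees a well-formedness error.
html_style = "<table><tr><td>a<td>b</tr></table>"
begin
  REXML::Document.new(html_style)
  puts "parsed (unexpected)"
rescue REXML::ParseException
  puts "rejected: implied </td> is not well-formed XML"
end

# Polyglot-style markup closes every element explicitly,
# so the very same bytes are digestible by XML tooling too.
polyglot = "<table><tr><td>a</td><td>b</td></tr></table>"
doc = REXML::Document.new(polyglot)
puts doc.elements.to_a("//td").map(&:text).inspect  # ["a", "b"]
```

The point is not that everyone should run an XML parser; it is that explicitly closing elements widens the set of imperfect tools that happen to get the document right.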
Note that I stop well short of recommending that others serve their content as application/xhtml+xml. Or that tools should halt and catch fire if they are presented with incorrect input. In fact, I would even be willing to say that in general people SHOULD NOT do either of these things.
Now that I have provided instance proofs of the problem and the solution, I’ll proceed with the longer answer. I will start by noting that Postel’s law has two halves, and while the HTML WG has focused heavily on the second half of that law, the story should not stop there.
To get HTML right involves a number of details that people often get wrong. Details such as encoding and escaping. Details that have consequences such as XSS vulnerabilities when the scenario involves integrating content from untrusted sources. Scenarios which include comments on blogs or feed aggregators. Scenarios that lead people to write sanitizers and employ the use of imperfect HTML parsers.
It is well and good that Henri maintains — on a best effort basis only — a superior parser for exactly one programming language. Advertising this library more won’t solve the problem for people who code in languages such as C#, Perl, PHP, Python, or Ruby. Fundamentally, a ‘tools will save us’ response is not an adequate response when the problem is imperfect tools.
This problem that needs to be addressed is very much the flip side, and complement to, the parsing problem that HTML5 has competently solved. Given a handful of browser vendors and an uncountable number of imperfect documents, it very much makes sense for the browser vendors to get together and agree on how to handle error recovery. By the very same token, it makes sense for authors who may produce a handful of pages to be processed by an uncountable number of imperfect tools to agree on restrictions that may go well beyond the minimal logical consequences from normative text elsewhere if those restrictions increase the odds of the document produced being correctly processed.
Yes, it would be great if this weren’t necessary and all tools were perfect. Similarly, it would be great if browser vendors didn’t have to agree on error recovery as this makes the creation of streaming parsers more difficult. The point is that while both would be great, neither will happen, at least not any time soon.
These restrictions may indeed go beyond “always explicitly close all elements” and “always quote all attribute values”. It may include such statements as “always use UTF-8”.
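As a small illustration of why “always quote all attribute values” earns a place on such a list (again using Ruby’s stdlib REXML as a representative strict consumer; the snippet is hypothetical):

```ruby
require "rexml/document"

# Unquoted attribute values are permitted by HTML syntax,
# but are a well-formedness error for any XML-based consumer.
begin
  REXML::Document.new("<p class=note>hi</p>")
rescue REXML::ParseException
  puts "rejected: unquoted attribute value"
end

# Quoting the value keeps the same markup usable by both.
quoted = REXML::Document.new('<p class="note">hi</p>')
puts quoted.root.attributes["class"]  # note
```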
Such restrictions are not a bad thing. In fact, such restrictions are very much a good thing.
Henri: have you written a bug report on this?
Better structured is indeed debatable. I would, however, defend being careful of matters such as escaping does result in a higher quality output.
Ah, my bad. As to the article, I can see how the author of an HTML validator would care about — and indeed advocate for — people writing documents that can be processed by more tools. Even tools that are, as I stated, not conforming to the latest standards.
I might have said “more explicitly express their structure” than “better structured”. But I do agree about “higher quality”.
And therefore, I disagree with “Being polyglot has nothing to do with better structure or quality”. It is helpful and pragmatic advice to people who face the flip side of the very problem that you have focused on for the past several years.
“It is helpful and pragmatic advice to people who face the flip side of the very problem that you have focused on for the past several years.”
If the flip side is writing input that works with non-compliant HTML parsers (as opposed to writing input that works with XML parsers), focusing on Polyglot is missing the point completely. If the problem is “I want to write HTML that the non-compliant HTML parser in libxml2 can parse.”, it would make more sense to document a profile that works in a particular set of widely-used non-compliant HTML parsers than to document what works in XML parsers and hope that the same thing helps with non-compliant HTML parsers, too.
Henri, I encourage you to read what standards libxml2 purports to support. Specifically:
HTML4 parser: [link]
As a person who often codes in Ruby, I make heavy use of Nokogiri, which is based on libxml2.
Henri, I encourage you to read what standards libxml2 purports to support.
That the parser in question is non-compliant to HTML5 because it does not even try to be is immaterial to my point that Polyglot is the wrong solution for working with HTML5-non-compliant HTML parsers.
Forgive me, Henri, but I see that statement as being every bit as false as the following strawman:
That the page in question is non-compliant HTML5 because it does not even try to be is immaterial to the point that the HTML5 specification is the wrong solution for working with HTML5-non-compliant HTML pages.
Non-compliant parsers, as well as non-compliant pages, are a reality. They outnumber you. They are both beyond your personal power to correct. That is reality.
“Non-compliant parsers, as well as non-compliant pages, are a reality. They outnumber you. They are both beyond your personal power to correct. That is reality.”
Correct, but presenting Polyglot as a solution is non sequitur.
I disagree. In fact, I have indisputable evidence to the contrary. My pages (such as this one) are polyglot. They undeniably work better with non-conformant HTML parsers than the source to the HTML5 specification itself does. And they do so because they don’t make the assumption that every parser is aware of every special case parsing rule that exists in the HTML5 specification.
Henri: you certainly can make the case that the Polyglot specification can be improved (bug reports welcome!). Or you can make the case that it isn’t the only solution to this problem (proposals welcome!). In fact, if you can point to another solution, you can even make the case that Polyglot isn’t the best solution available.
The one case you can’t make is that the restrictions that are present in the HTML5 specification alone as it currently exists are sufficient.
There is very likely overlap between the set of restrictions needed to make libxml2’s HTML parser behave and the set of restrictions that make a document Polyglot. But promoting the second set of restrictions instead of the first one is likely to lead to the same kind of detachment from truth as XHTML advocacy of the previous decade.
If you have a solid use case for one set of restrictions, I’d much rather see you promote that set of restrictions instead of promoting another overlapping set of restrictions that has the sort of labeling that will fascinate the uninformed in the same ways that Appendix C did.
I’m not writing down the set of restrictions that I’d prefer you to promote instead, because I don’t have use cases for that set of restrictions.
Detachment from truth? GMAFB
Henri: as previously stated, unclosed consecutive <td> elements cause problems not only with anolis when configured to use libxml2, but also with the Ruby Nokogiri gem.
This is truth, and you can deny it, but doing so will have about the same effect as denying global warming.
Over time, I have developed a successful set of coping mechanisms to deal with this. For example, I not only always use UTF-8, but I also always declare it BOTH in a meta tag AND in the Content-Type header.
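For the curious, that belt-and-braces declaration looks roughly like this (an illustrative fragment, not lifted from any particular page):

```html
<!-- HTTP response header, set by the server:          -->
<!--   Content-Type: text/html; charset=utf-8          -->
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <meta charset="utf-8"/>
    <title>Example</title>
  </head>
  <body>
    <p>The encoding is declared twice: in the document and in the header.</p>
  </body>
</html>
```

A consumer that ignores HTTP headers (say, one reading the file from disk) still sees the meta declaration, and vice versa.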
Should the core HTML5 specification require such? Absolutely not. Should an optional profile be defined which extends the specification to provide entirely voluntary additional constraints that have been proven to make your content more likely to be understood by a variety of consumers? Absolutely.
“The short version is that I have developed a library that I believe to be polyglot compatible, and by that I mean that if there are differences between what this library does and what polyglot specifies, one or both should be corrected to bring them into compliance.”
Sam, are there not a bunch of tests or some external, can-be-pointed-at constraints that let people determine what compatible is? Otherwise it seems subject to whoever can furnish winning rhetoric.
“It is well and good that Henri maintains — on a best effort basis only — a superior parser for exactly one programming language. Advertising this library more won’t solve the problem for people who code in languages such as C#, Perl, PHP, Python, or Ruby. Fundamentally, a ‘tools will save us’ response is not an adequate response when the problem is imperfect tools.”
See also: https://github.com/rubys/feedvalidator - I want to believe you know where this ends up :)
Excellent article; your loving and clarity-oriented language is a joy to read.
ITYM s/compliment/complement/
are there not there a bunch of tests
Never enough. First installments: wunderbar, builder. Admittedly, both are the “wrong” way, namely they test serialization instead of deserialization.
See also: https://github.com/rubys/feedvalidator
The best base to build a polyglot validator upon would be Henri’s excellent validator.nu.
ITYM s/compliment/complement/
Fixed. Thanks!
Not only do I agree with Sam Ruby in that polyglot documents are easier to process for non-compliant parsers (and thus are a good idea on that basis alone), but I even think that the “detachment from truth as XHTML advocacy of the previous decade” has made HTML better. Without it, we surely wouldn’t have lower-case being the preferred way to write HTML, we wouldn’t have attribute quotes being preferred (however optional they may be) and we wouldn’t be closing tags unless not doing so made stuff look weird in our preferred browser.
XHTML has made HTML better. It’s not a scientifically provable fact, but anyone involved in the field of web development since its inception has to admit that it’s the truth. It’s at least hard to disprove the value XHTML advocacy has had with respect to enforcing “Be conservative in what you do” on HTML. Without it, HTML would basically be on the same qualitative level it was 15 years ago. While I agree it was a detour and that HTML5 is better in every way, XHTML has worked as a Sergeant Hartman on all web developers and made the web, and HTML, better.
Sam, regarding declaring UTF-8 both via HTTP and the meta element, if that is an idea you have for Polyglot Markup then perhaps it should be captured and justified in a bug report? I believe it has not yet been captured.
Polyglot Markup offers 3 encoding declaration methods for HTML (BOM, meta@charset, HTTP) and 3 methods for XML (BOM, default, HTTP).
There appear to be some quite legacy parsers that don’t understand HTML5’s new meta@charset attribute - and such parsers may not understand, or may have trouble with, the BOM as well. (I think of text browsers - Lynx and the like.) Is that your motivation? Or perhaps the motivation is to make sure that the glitch is caught via HTTP in case the author forgets to declare the encoding in the file?