It’s just data

Implied Warranty of Fitness

A number of people advocate avoiding templates when producing XML, lest they produce output that is not well formed.  Yet I use templates for this weblog.

Venus produces a DOM, and serializes it via XSLT, so it is pretty safe... or so you would think.  Here are a few ways I have found in which one can produce a DOM that can’t be serialized as well-formed XML:

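from xml.dom import minidom

doc = minidom.getDOMImplementation().createDocument(None, None, None)
root = doc.createElement('9')                   # names may not start with a digit
root.setAttribute(';', '\x0C')                  # bad attribute name, form feed in the value
root.appendChild(doc.createTextNode('\uFFFF'))  # U+FFFF is not a legal XML character
root.appendChild(doc.createComment('-'))        # comment data may not end with '-'
doc.appendChild(root)

# Serialize, then try to parse the result back in
try:
    minidom.parseString(doc.toxml('utf-8'))
except Exception as e:
    print(e)

print()
print(doc.toxml())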

Am I missing anything?

Venus currently handles all of these cases, and it is my intention that it will continue to do so — as well as handle any other cases that I may have missed — as I transition from sgmllib based processing to html5lib based processing.
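
As an illustration only (this is not Venus’s actual code), here is a minimal sketch of the kind of check involved: walk a minidom tree and report each construct from the example above. The names `find_problems`, `ILLEGAL_CHARS`, and `SIMPLE_NAME` are hypothetical, and the name pattern is a deliberately simplified, ASCII-only approximation of the XML Name production.

```python
import re
from xml.dom import minidom, Node

# Characters XML 1.0 forbids: most C0 controls, surrogates, U+FFFE and U+FFFF.
ILLEGAL_CHARS = re.compile(r'[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]')
# Assumption: ASCII-only stand-in for the XML Name production (no colons,
# no non-ASCII letters), good enough for this illustration.
SIMPLE_NAME = re.compile(r'[A-Za-z_][A-Za-z0-9._-]*\Z')


def find_problems(node, problems=None):
    """List constructs in a minidom tree that cannot be serialized as well-formed XML."""
    if problems is None:
        problems = []
    if node.nodeType == Node.ELEMENT_NODE:
        if not SIMPLE_NAME.match(node.tagName):
            problems.append('bad element name %r' % node.tagName)
        for name, value in node.attributes.items():
            if not SIMPLE_NAME.match(name):
                problems.append('bad attribute name %r' % name)
            if ILLEGAL_CHARS.search(value):
                problems.append('illegal character in attribute %r' % name)
    elif node.nodeType == Node.TEXT_NODE:
        if ILLEGAL_CHARS.search(node.data):
            problems.append('illegal character in text')
    elif node.nodeType == Node.COMMENT_NODE:
        if '--' in node.data or node.data.endswith('-'):
            problems.append('comment contains "--" or ends with "-"')
    for child in node.childNodes:
        find_problems(child, problems)
    return problems


# Rebuild the document from the example and report what is wrong with it.
doc = minidom.getDOMImplementation().createDocument(None, None, None)
root = doc.createElement('9')
root.setAttribute(';', '\x0C')
root.appendChild(doc.createTextNode('\uFFFF'))
root.appendChild(doc.createComment('-'))
doc.appendChild(root)

for problem in find_problems(doc):
    print(problem)
```

A checker like this only reports; whether to drop, replace, or reject the offending nodes is a separate policy decision.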


Looks like you’ve got a bug or seven to file against one Python library or another there.

#!/usr/bin/perl
use strict;
use warnings;

use XML::LibXML;

my $doc = XML::LibXML::Document->new();

my $root = $doc->createElement( '9' );
$root->setAttribute( ';', "\x0C" );
$root->appendChild( $doc->createTextNode( "\x{FFFF}" ) );
$root->appendChild( $doc->createComment( '-' ) );
$doc->setDocumentElement( $root );

eval { XML::LibXML->new()->parse_string( $doc->toString ); 1 }
	or die $@;

print "\n", $doc->toString;

=begin output

Unicode character 0xffff is illegal at badxml.pl line 11.
bad name at badxml.pl line 9.

By gradually removing errors I found that the malformed comment slips by libxml2’s eyes, but it catches all other errors (and in the case of the control character in the attribute, it silently drops the character rather than throwing, which appears to be acceptable under the “whitespace in attributes is normalised” legislation).

PS.: for some reason, your comment system replaces \n in block code sections with an actual newline character; I had to add a second backslash to make it come out correctly.

Posted by Aristotle Pagaltzis at

So, Henri Sivonen’s argument against text-based templates says: “Making mistakes with them is extremely easy and taking all cases into account is hard. These systems have failed smart people who have actively tried to get things right.”

On the other hand, what you’ve done here is actively tried to get things wrong — and, more importantly, in every case minidom could be fixed to detect the error. There’s no single component in, say, WordPress which could check for all these errors, because the page is generated by concatenation of a mixture of literal and generated strings that are spread out all over the codebase. That’s a huge difference, isn’t it?

Posted by Adam Fitzpatrick at

It was pointed out to me several times when I was on the XML team at Microsoft that well-formedness checking isn’t part of the contract of the DOM. One of the main reasons for this is performance.

I’m not sure there is any DOM implementation out there that guarantees it will always generate well-formed markup.
...
Wait, I might be wrong. I remember ERH was working on XOM to address this, and according to [link] he seems to have done so, so there is perhaps one implementation of an XML DOM across the various platforms out there that does well-formedness checking.

Unfortunately generating well-formed XML is harder than most people expect it to be.

Posted by Dare Obasanjo at

One thing I noticed recently is that Venus (as viewed on Planet OpenID) takes all of the yucky escaped HTML that my blog on LiveJournal generates and turns it into XHTML, which it includes as XML in its Atom output. Then LiveJournal parses it and, for some inexplicable reason, runs an entity de-escape on it as it would for an escaped HTML entry, turning all of the &lt;s into literal <s, which promptly screw up the HTML.
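
A minimal sketch of the difference, assuming a consumer that applies the unescaping step meant for escaped HTML content to content that is already XHTML. The sample body is made up for illustration; it is not taken from either feed.

```python
import html

# An Atom entry body with type="xhtml": the markup is real XML, and the only
# escaped part is text the author meant to be displayed literally.
xhtml_body = '<p>Write &lt;b&gt; to start bold text.</p>'

# Careful consumer: XHTML content is already markup, so it is used as-is.
correct = xhtml_body

# Mistaken consumer: treats it like escaped HTML and unescapes it once more,
# which turns the displayed text into a live (and unclosed) <b> tag.
broken = html.unescape(xhtml_body)

print(correct)
print(broken)
```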

But then LiveJournal really doesn’t like your blog’s Atom feed either, completely failing to get the entry links out of the entries. I’ve been meaning to investigate this for a while and submit a patch, but it hasn’t concerned me enough yet.

But it does make me wonder whether it’s worthwhile concerning one’s self about the quality of one’s feeds when many consumers are so colossally bad at consuming them that they screw it up anyway.

Posted by Martin Atkins at

what you’ve done here is actively tried to get things wrong

Two comments.

What I actually have done here is distilled down some test cases of DOMs that can be (and actually ARE) produced by html5lib.  These are things that, if I didn’t explicitly test for them separately, would mean that Planet Intertwingly would not render on Firefox, Safari, or Opera, as the page is served as XHTML, which is (currently) a requirement in order to deal with MathML and SVG.

Second, my point here wasn’t that DOM is no better than templates (it clearly is a step up), but that, much to my surprise, the very tools that were meant to save me leave it up to me to discover exactly what they will protect me from and what is my responsibility to deal with.

This has been a common theme.

But it does make me wonder whether it’s worthwhile concerning one’s self about the quality of one’s feeds when many consumers are so colossally bad at consuming them that they screw it up anyway.

So many tools?  Suggestion: why not see how your tool measures up?

I used to produce a number of feeds, each varying a different aspect so that people that cared to could experimentally find the one that worked best for them.  But over time, this got to be a drag.  Every time I made a change to accommodate one tool, it would break another.

So I now produce exactly one feed, and in an unambiguous manner.  Those that care to will process it correctly.  And as to those that don’t care to... well, I guess it isn’t important to them.  There are plenty enough smiley faces on that page for people to choose from.

Posted by Sam Ruby at

root.appendChild(doc.createProcessingInstruction("xml","?>"))
Posted by Robert Sayre at
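
A minimal sketch of why that line belongs on the list, using the same serialize-and-reparse check as the post. The helper name `round_trips` and the `root` element are hypothetical, chosen so that the processing instruction is the only problem in the document.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError


def round_trips(doc):
    """Return True if the serialized document can be parsed back in."""
    try:
        minidom.parseString(doc.toxml('utf-8'))
        return True
    except ExpatError:
        return False


# Start from a document that is fine on its own.
doc = minidom.getDOMImplementation().createDocument(None, 'root', None)
print(round_trips(doc))

# Add a processing instruction whose target is "xml" and whose data is "?>".
doc.documentElement.appendChild(doc.createProcessingInstruction('xml', '?>'))
print(round_trips(doc))
```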

for some reason, your comment system replaces \n in block code sections with an actual newline character; I had to add a second backslash to make it come out correctly.

The reason is that that’s the way that Python’s re.sub works.  My bad.  Fixed:

\n

Thanks!
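
A minimal sketch of the `re.sub` behavior in question, assuming the comment text was being passed as the replacement string. The `BODY` placeholder and the `<pre>` wrapper are hypothetical stand-ins, not the comment system’s real template.

```python
import re

# Comment text as typed: a backslash followed by the letter n.
comment = r'print "\n", $doc->toString;'

# As a replacement *string*, re.sub processes backslash escapes,
# so the two characters \n become one real newline.
mangled = re.sub('BODY', comment, '<pre>BODY</pre>')

# As a replacement *function*, the returned text is inserted verbatim.
intact = re.sub('BODY', lambda match: comment, '<pre>BODY</pre>')

print(repr(mangled))
print(repr(intact))
```

Using a callable (or escaping backslashes in the replacement first) keeps user-supplied text from being interpreted as a replacement template.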

Posted by Sam Ruby at

What I actually have done here is distilled down some test cases of DOMs that can be (and actually ARE) produced by html5lib.

the very tools that were meant to save me leave it up to me to discover exactly what they will protect me from and what is my responsibility to deal with.

Actually, DOM trees produced by the HTML5 parsing algorithm are, by design, not meant to save you in the well-formedness sense. I think this is unfortunate, but compatibility with legacy browser behavior has been valued higher than the DOM being XMLizable in all cases.

Posted by Henri Sivonen at

DOM trees produced by the HTML5 parsing algorithm are, by design, not meant to save you in the well-formedness sense

That sentence apparently is still true if you remove the words “produced by the HTML5 parsing algorithm”.

Posted by Sam Ruby at

