Most people discussing this now are likely unaware that Tim Bray was the singular force behind getting this "fail on first error" language into the XML specification. Virtually everyone in the XML working group disagreed with him, and many people pleaded for a sane method of error recovery -- something XML is uniquely suited to provide, since its redundant markup offers easy reentry points after a parsing error, unlike most binary formats.
"""Browsers do not just need a well-formed XML document. They need a well-formed XML document with a stylesheet in a known location that is syntactically correct and semantically correct (actually applies reasonable styles to the elements so that the document can be read). They need valid hyperlinks to valid targets and pretty soon they may need some kind of valid SGML catalog. There is still so much room for a document author to screw up that well-formedness is a very minor step down the path. The idea that well-formedness-or-die will create a "culture of quality" on the Web is totally bogus. People will become extremely anal about their well-formedness and transfer their laziness to some other part of the system."""
And lo and behold, Disney's well-formed-but-useless RSS proves him right. Sadly, it is 7 years too late to rescue the XML specification from Tim's dogmatic draconianism. But that doesn't mean it was a good idea in the first place.
It seems that you are blaming XML's draconian error handling -- which is what causes most of the content to be well-formed in the first place -- for content showing up that's well-formed but not DTD-valid. Did I get this right?
RE: Tim Bray. XML 1.0 has three editors listed. From the mail thread you referred to, it seems that at least two of them clearly were in favor of draconian error handling.
BTW: Sam's experiments show one thing clearly -- although clients are liberal in what they accept, there is no interoperability at all for invalid content. Different clients accept different kinds of errors, just like with HTML. The same mess.
Sadly, it is 7 years too late to rescue the XML <strong>1.x</strong> specification from Tim's dogmatic draconianism. But that doesn't mean it was a good idea in the first place.
Write a new, backward-compatible specification that implements error handling. Develop interesting technologies on top of it and demonstrate that it succeeds where XML has been a notable failure (human -> human communications). If, at this stage, it's still important to you, try to get the W3C to declare your spec to be XML 2.0.
I invite you to take this to xml-dev and run a poll on how many people appreciate the way XML has been defined. It's certainly not the case that everybody dislikes it. In fact, I'd be surprised if there weren't a majority in favor of it.
Historically, the working group decided to use draconian error handling because the authors of the spec believed it would work, and because both Netscape and Microsoft said they wanted it. Some members of the community said it wouldn't work, but the W3C is a membership-driven organization, and its decision processes work differently from -- for instance -- the IETF's. Do I like how the W3C works? Not always. For instance, I think XMLNS 1.1 is a disaster (my complaints are archived on the appropriate mailing list).
But guess what? Draconian error reporting does work. Almost all content that's targeted at XML parsers is indeed well-formed, and if it isn't, it usually gets fixed quickly. On the other hand, formats that don't have strict error handling, such as HTML and sadly RSS, suffer from interoperability problems and/or extreme implementation complexity (for instance, Microsoft is said to be halting development of IE's rendering engine because it has become so complicated that it is extremely hard to maintain). In fact, this very blog shows that although RSS readers try to be liberal, there's no interoperability.
Of course someone will say that this is because the other implementations aren't as smart as yours. However, I'd call that a lame argument, because it replaces a formal standard with "whatever some specific ultra-liberal client accepts today" (a kind of single-implementation-defined spec, such as Perl -- in some cases this may make sense, but I don't think it's OK for a data interchange format).
The XML spec is about an application component, the XML Processor. It makes no demands of the application as a whole. One legal reaction to an XML parsing error is for the application to alter the input and resend it. Saying the XML spec has been a success is irrelevant to an implementor of a "browser" type application. For an implementor of an Atom Processor, the spec is more relevant. Tim Bray has suggested introducing the concept of an Atom Processor into the spec, but I think he's been guilty of conflating the processor and the application in other discussions.
Secondly, no matter what tricks IE's parser performs, one can assume it passes coherent parsed input on to the rendering engine. So, has development stopped on IE because the parser is too complicated, or the rendering engine is too complicated? Two different problems, and IE handles a superset of HTML at any rate, much of which no one uses. When was the last time you saw client-side VBScript?
Should Atom strive to avoid being defined by the operation of a specific client program? Of course, but an over-reaching spec is a great way to achieve that, in my opinion.
"One legal reaction to an XML parsing error is for the application to alter the input and resend it."
That's an interesting point. It sort-of implies that there is a class of byte streams that a compliant XML parser must reject, while an application may be able to fix.
However, for each of the fixes that gets applied I'd be extremely careful that it never, ever changes the intended semantics of the message. For instance, suppose a client sends an attribute like href="http://example.org/'>
Is the attribute value missing a closing quote character, or did the sender accidentally end the value with an apostrophe instead of a quote? Is this recoverable?
XML well-formedness has the benefit that you don't need to think about these kinds of issues. Stop processing, and tell the sender to fix the document. This works best when they get the same feedback from all clients, not just from you.
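A minimal sketch of that strict behavior in Python, using the standard-library ElementTree parser on a hypothetical fragment with the quote/apostrophe mix-up described above:

```python
import xml.etree.ElementTree as ET

# Illustrative malformed fragment (hypothetical feed): the href value opens
# with a double quote but is "closed" with an apostrophe, so the quoted
# value never ends.
broken = '<feed><link href="http://example.org/\'></feed>'

try:
    ET.fromstring(broken)
    recovered = True
except ET.ParseError as err:
    # A conforming XML processor must report the error and stop here.
    recovered = False
    print("rejected:", err)
```

Every conforming parser rejects this fragment the same way, which is exactly the "same feedback from all clients" property being argued for.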
From what I hear, one of the most popular mistakes is that authors do not put in an encoding declaration at all (although they use an encoding that requires it), or send an incorrect encoding declaration. Frequently these issues arise because people don't understand the APIs they are using (for instance the difference between a string and a byte stream), or don't understand code pages vs. standard encodings vs. Unicode. In all of these cases, it makes a lot of sense to educate those people. Once they've learned it, they're unlikely to make the same mistakes again.
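The encoding-declaration mistake can be demonstrated with a small, hypothetical document: the declaration claims UTF-8, but the byte 0xE9 is actually Latin-1 for "é", which is not a valid UTF-8 sequence.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: declared UTF-8, but byte 0xE9 is really Latin-1.
mislabeled = b'<?xml version="1.0" encoding="utf-8"?><title>caf\xe9</title>'

try:
    ET.fromstring(mislabeled)
    parsed = True
except ET.ParseError as err:
    parsed = False                     # the parser must reject invalid UTF-8
    print("rejected:", err)

# Declaring the encoding the bytes actually use makes the same content parse.
correct = b'<?xml version="1.0" encoding="iso-8859-1"?><title>caf\xe9</title>'
title = ET.fromstring(correct).text
print(title)
```

The fix is purely a matter of labeling the byte stream correctly, which is why educating authors about bytes vs. strings tends to make the problem go away for good.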
Is the attribute value missing a closing quote character, or did the sender accidentally end the value with an apostrophe instead of a quote?
It doesn't really matter. Decide which is the more common case and specify that as the correct behavior for recovery.
Is this recoverable?
Yes. Postel's Law (in the sense of the law predicting what clients will do in order to satisfy their users) demands that clients recover from errors like this. However, the author has no right to expect an invalid document to be interpreted correctly. If an author wishes to ensure that their document is unambiguously interpreted, they should also ensure that it's valid.
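A sketch of what "decide the common case and specify the recovery" might look like, assuming (purely for illustration, this is not any published spec) that an apostrophe ending a double-quoted attribute value is usually a typo for the closing quote:

```python
import re
import xml.etree.ElementTree as ET

# Liberal-recovery sketch: repair the assumed quote typo before re-parsing.
# The regex and the "more common case" it encodes are illustrative choices.
def fix_quote_typo(doc: str) -> str:
    return re.sub(r'="([^"<>]*)\'(\s*/?)>', r'="\1"\2>', doc)

broken = '<feed><link href="http://example.org/\'/></feed>'
root = ET.fromstring(fix_quote_typo(broken))
print(root.find('link').get('href'))  # http://example.org/
```

Note that this recovery happens in the application, before the data ever reaches the XML processor, so the processor itself stays conforming.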
I'd say a large portion of character encoding issues are due to sending Windows-1252 characters in the gremlin range as Unicode, which happens when you copy/paste from Windows apps. I would not expect the parser to perform this conversion for me, but an application may choose to.
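The gremlin-range case is easy to show with a couple of bytes. In the sketch below, the 0x80-0x9F bytes came from a Windows-1252 copy/paste but got labeled as Latin-1; a parser shouldn't guess, but an application may choose to reinterpret them:

```python
# cp1252 curly quotes around a word -- the 0x80-0x9F "gremlin" range.
smart = b'\x93draconian\x94'

print(repr(smart.decode('iso-8859-1')))  # C1 control characters: gremlins
print(smart.decode('cp1252'))            # the curly quotes the author meant
```

The same byte values decode to invisible control characters under Latin-1 but to proper typographic quotes under Windows-1252, which is why this class of error survives a well-formedness check.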
Recovery from bogus input always carries with it the chance of changing the original meaning. The acceptability of this risk varies based on application. As an example, I would say the spec should rigorously define the expected behavior of XML::Atom, not Movable Type's.
Draconian measures are necessary for many XML applications. They are not for others. The problem with RSS is that it is an application platform capable of supporting a wide range of applications. For delivering news, draconian measures are not necessary. But for delivering critical information, you want strict conformance.
The UI-oriented solution I mentioned before on my blog is not enough to address the latter, and should be supplemented with a Conformance Level Program (grade A, AA, AAA, etc.) so that certain RSS applications can require clients that demand grade-AAA feeds.
Roger, you are correct. I didn't look closely enough the first time around. It appears that all programs accept those feeds, but none parses those dates correctly. Considering the Panglossian games that Userland played with the RSS 2.0 namespace, I'm amazed anyone can read these feeds at all. Is everyone just ignoring the namespace and keying off localname?
When RSS 2.0 first came out there were two annoying things I found: (a) a couple of people came up with their own homebrew namespace for the RSS elements -- I seem to remember Timothy Appnel's feed being the first one where I stumbled on this issue -- and (b) keying off the RSS version number is not reliable.
It seems RSS Bandit isn't the only aggregator that has code to handle both cases which is why it seems most of them can display the feeds (minus the dates).
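"Keying off localname" can be sketched in a few lines: ignore whatever (possibly homebrew) namespace a feed put on its RSS 2.0 elements and match on the local part of the tag. The namespace URI below is a made-up example:

```python
import xml.etree.ElementTree as ET

# ElementTree reports namespaced tags as '{uri}local'; strip to the local part.
def localname(tag: str) -> str:
    return tag.rsplit('}', 1)[-1]      # '{uri}title' -> 'title'

feed = ('<rss xmlns:h="http://example.org/homebrew">'
        '<channel><h:title>Example Feed</h:title></channel></rss>')
root = ET.fromstring(feed)
titles = [el.text for el in root.iter() if localname(el.tag) == 'title']
print(titles)  # ['Example Feed']
```

This is exactly the kind of single-client workaround discussed above: it makes one aggregator display the feed, at the cost of treating the namespace declaration as meaningless.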
I can just hear the sound of the long-time XML users' mice clicking to get this horrible topic off the screen ASAP ... but maybe people who are newer to XML will get something out of revisiting one of the oldest controversies: what should be...