The W3C validator says it is not only well formed, but valid. I also run a nightly cron job that validates the pages served that day against the XHTML DTDs. I also serve the content with the XHTML MIME type to browsers that support it, which causes Mozilla, at least, to be ultra-strict about well-formedness.
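The core of such a nightly check can be sketched in a few lines. This is a hypothetical illustration, assuming Python's expat-backed xml.sax module as the conforming parser (it checks only well-formedness, not DTD validity):

```python
import xml.sax

def is_well_formed(data: bytes) -> bool:
    """Return True if a conforming XML parser (expat, via xml.sax)
    accepts the byte stream as well-formed XML."""
    try:
        # ContentHandler does nothing; we only care whether parsing
        # completes without a fatal error.
        xml.sax.parseString(data, xml.sax.ContentHandler())
        return True
    except xml.sax.SAXParseException:
        return False

print(is_well_formed(b"<feed><title>ok</title></feed>"))  # True
print(is_well_formed(b"<feed><title>ok</feed>"))          # False: mismatched tag
```

A cron job would simply run each day's served pages through a function like this and flag any that fail.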
But can I be sure?
Based on these two tests, my conclusion is that, at the present time, the requirement that all XML parsers reject every non-well-formed document presented to them is a Platonic ideal... something that can perhaps be aspired to, but something that is rarely if ever seen in the real world.
So, in a nihilist sense, no, I cannot be sure. I'm relying on imperfect tools to reassure me that I am doing it right.
- - -
Brent and Nick have both stated their intent to reject feeds that are not well formed. They, too, will undoubtedly be relying on imperfect tools to implement this policy. So they, too, can never be quite sure. Frankly, I don't see how a requirement that Atom consumers make up for the inadequacies of whatever XML parser they chose to use, either by providing a front-end filter or by writing their own parser, would substantially improve the situation.
So, why are they doing this? My intuition tells me that this is based on a sincere desire to move from a world in which producers need to conform to whatever a predominance of consumers happens to accept, to a world in which there is a single clear definition of what is acceptable.
It also appears to be a response to the growing recognition that liberal parsing strategies are a slippery slope, and not a particularly evolutionarily stable strategy. My next (and final) test will explore this a bit further.
- - -
If you accept that perfection can't ever be achieved, the next question is whether such policies will substantially improve the quality of inbound feeds, or whether they will in fact cause mass defection of users to other tools or formats. Or both.
Given the data presented so far, I don't see conclusive evidence for the oft-repeated claim that feed parsers that aren't liberal are at a significant competitive disadvantage. SharpReader is (fairly) conservative, has competition, and seems to be doing fine.
Whether ill formed feeds exist because early aggregators were liberal, or whether many of today's aggregators are liberal because of the existence of ill formed feeds, is an imponderable. Both are likely to be true. The key ingredient that appears to be lacking to break this vicious circle is an effective feedback loop.
I do agree that, in an abstract sense, the efforts organized by Syndic8 and the existence of the feedvalidator are the "right" way to address the problem of well-formedness, but to date these efforts do not appear to be sufficient.
I have hopes that the courageous and noble stands being made by Brent and Nick will make a difference. And that the end result will benefit Luke and Dare and others that wish to employ "real" XML parsers. This is because, contrary to popular belief, there are exceptions to Postel's "law".
Note: nothing in this endorsement should be construed to imply that an aggregator needs to be abrasive or abusive in their application of this policy. I may be biased, but I do like SharpReader's approach of linking to the feedvalidator first, and providing the email address for feedback to the aggregator author second.
- - -
One thing that needs to be said is that this needs to be a voluntary action on the part of aggregator authors. Each tool author needs to be free to modify, and potentially reverse entirely, their stated policy based on the feedback they receive, without feeling that they are somehow letting down the Atom community. As tool authors, their first responsibility is not to the producers or to the spec writers, but to their user base.
I continue to believe that codifying into the spec a well-defined set of behaviours that a User Agent MAY implement to process a non-valid document is the best way forward. It should also be stressed that, in this regime, a User Agent wishing to parse a non-valid document MUST use only the error correction specified. This solves two major problems with the idea that UAs should be able to do anything they like to fix up documents: incompatibility between different implementations, and forward compatibility. Both of these problems plague the existing set of HTML 4 documents, to the extent that only one UA will render close to 100% of sites correctly, and further development of that UA will necessarily break backward compatibility.
Incidentally, if your data really needs to be transmitted correctly, the well-formedness of the document at the client end is not a good guide to whether this has occurred. I'm guessing that online banks use HTML just as malformed as the rest of the web. If you need to ensure correct transmission, you need some sort of error detection mechanism.
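To make that last point concrete, here is a minimal sketch (in Python, with hypothetical names) of what such an error detection mechanism looks like: a detached digest verifies that the bytes arrived intact, independently of whether the document happens to parse.

```python
import hashlib

def digest(payload: bytes) -> str:
    """Checksum the exact bytes sent, before any parsing."""
    return hashlib.sha256(payload).hexdigest()

def verify(received: bytes, expected: str) -> bool:
    """The receiver recomputes the digest over the received bytes."""
    return hashlib.sha256(received).hexdigest() == expected

payload = b"<transfer amount='100.00'/>"
d = digest(payload)
print(verify(payload, d))                          # True: arrived intact
print(verify(payload.replace(b"100", b"900"), d))  # False: corrupted in transit
```

Note that both the original and the corrupted version are well formed XML, which is exactly the point: well-formedness at the client end tells you nothing about transmission correctness.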
re: "Well-formed XML is the line I've drawn in the sand for RSS Bandit"
No, it isn't. The line you've drawn in the sand is "whatever my built-in tools happen to do by default", which is neither a particularly courageous nor a particularly interoperability-enhancing choice, especially given your explicit admission that your built-in tools operate in a non-conformant mode by default.
Brent and Nick are also guilty of this "lazy draconianism", which can only be truthfully stated as "I care about well-formedness right up until the point where my own tools are buggy."
Based on these two tests, my conclusion is that, at the present time, the requirement that all XML parsers reject every non-well-formed document presented to them is a Platonic ideal...
I think that's incorrect. We did hear about RSS clients that accepted the samples, but it seems those clients either didn't use a conforming XML parser or messed with the stream before passing it to the parser.
The only actual XML parser bug I've seen is Mozilla's XML parser not rejecting bad UTF-8 byte sequences. This is a known bug (see Bugzilla) and is likely to be fixed at some point. If this proves anything, it's that the authors of XML parsers are absolutely willing to fix conformance issues once they are detected.
So please don't blame the parsers; blame the clients that use "custom" parsers or that explicitly work around problems in the content before passing it to the parser...
Julian
Julian, Brent has explained what he does to circumvent the default processing for RSS feeds.
This process as described would not account for his XML parser's failure to reject control characters. His XML parser is simply buggy.
I've read Mark's comments on the well-formedness issue, and I've read Brent's, and I've read Tim's.
They're all minds I respect, and they all make a lot of sense.
I'm a translator by trade. One of the biggest challenges a translator confronts is mind-reading. Because people don't always write with great clarity--and sometimes say the exact opposite of what they really mean ("I've got spurs that jingle jangle jingle as I go riding merrily along / And they sing 'oh ain't you glad you're single?' / And that song ain't so very far from wrong")--I've got to guess at what they meant. There are often contextual or common-sense cues that help, but sometimes I wind up translating something that I know can't be right, but I just don't know what is right.
Parser authors must confront similar conundrums. I haven't tried very hard, but I am guessing it should be possible to create a test document that is badly formed in such a way that there is nothing to help the parser guess what the generator really meant (perhaps the original document was so poorly constructed that it really didn't mean anything sensible). What to do in that case? Is that where you draw the line in the sand?
Back up from that line a bit. Imagine another error that is commonly committed. You've seen it before, you know why it happens, and you know how to fix it. Another feed generator comes along that generates invalid feeds that contain the same apparent error, but the underlying cause is completely different. Applying your existing fix could just make matters worse, and there's no way of identifying with certainty the feed generator so as to apply different fixes based on the known behavior of that generator.
In short, I'm wondering this: does a liberal parser need to be an indiscriminate parser? How liberal is "liberal"?
re: "the authors of XML parsers are absolutely willing to fix conformance issues once they are detected"
The fact that the bug in Mozilla was reported a year and a half ago and has yet to be fixed does nothing to bolster this claim.
It seems to me that the only thing Brent and Nick are "Guilty" of is trying to do their best to meet some obligations recommended to them by the authors of XML 1.0.
The conclusion "In the real world, producers of ATOM feeds cannot produce valid or well formed markup all of the time" may of course be true; however, XML 1.0 is clear about what a parser must do about that.
The argument that, because neither Brent nor Nick (nor apparently anyone else) can succeed in writing a parser able to detect every instance of non-well-formedness, they are somehow relieved of any requirement to report the instances that they do find, seems to me to fall into the general category of "grasping at straws".
To say that Brent and Nick are taking a "courageous and noble" stand just by attempting to comply as best they can with a W3C recommendation seems an odd thing to say, although I do agree that they have exhibited these qualities in the way they have handled the political pressure and personal innuendo applied to them over this.
If you are uncomfortable with the requirements of XML 1.0 - if you believe it to be so broken and impractical - why did you choose it as the format for Atom?
Even the Mozilla bug is not a case where an off-the-shelf XML tool is broken. Like many other bugs that have popped up, it, too, is a case where messing with the unparsed stream bypasses a well-formedness check.
The XML spec requires XML processors to support UTF-8 and UTF-16. However, it leaves a huge back door open by permitting the use of other encodings as well. The XML spec then makes conformance with the character encoding spec a well-formedness requirement. This works great when the XML processor accepts only the encodings it must accept and implements the character decoder internally. When application writers (for whatever reason) believe they need to support encodings other than those required by the XML spec, they tend to use external off-the-shelf decoders, which tend to be lenient.
What happens with Mozilla is that it converts everything to UTF-16 using its lenient (designed for HTML tag soup) converters. Therefore, expat doesn't see the original UTF-8 bytes.
(In a way, supporting extra encodings is pointless, because content generators should want to use only UTF-8 or UTF-16, since those are the only two encodings that are guaranteed to work.)
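The lenient-decoder effect Henri describes can be demonstrated directly. In this sketch (Python, purely illustrative), a strict UTF-8 decoder rejects a bad byte sequence outright, while a lenient, tag-soup-style decoder silently substitutes U+FFFD, so a parser fed the converted text never sees the offending bytes:

```python
# 0xE9 is a lone Latin-1 'é': not a valid UTF-8 byte sequence.
bad = b"<title>caf\xe9</title>"

# Strict decoding, which is what the XML spec effectively requires:
try:
    bad.decode("utf-8")
    strict_accepted = True
except UnicodeDecodeError as e:
    strict_accepted = False
    print("strict decoder rejects:", e.reason)

# Lenient decoding, HTML-tag-soup style: the bad byte becomes U+FFFD
# and the well-formedness violation disappears before parsing begins.
repaired = bad.decode("utf-8", errors="replace")
print(strict_accepted)       # False
print("\ufffd" in repaired)  # True
```

Any pipeline that decodes leniently before handing text to the XML parser is, in effect, bypassing this well-formedness check, regardless of how conformant the parser itself is.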
- - -
Now, if it is decided that Atom documents don't need to be well-formed, why go on pretending that the serialization of Atom is XML when, by definition, it is not if well-formedness isn't required? Wouldn't it be more honest to specify a different-looking infoset serialization that isn't confusingly similar to XML?
Henri,
thanks for the explanation of Mozilla's bug. I suspected something like that, as I knew that Mozilla uses Expat internally.
And yes, if Atom is going to allow acceptance of non-wf XML -- violating (a) XML, (b) the W3C "Architecture of the World Wide Web" and (c) "RFC3470: Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols", section 4.1, then it really shouldn't pretend to use XML and choose another base instead.
Julian
Joe,
having escaped HTML content and having unescaped HTML content inside a CDATA section mean the same thing. So this wouldn't change anything. Try it.
Julian
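The equivalence Julian describes is easy to check. A minimal sketch with Python's ElementTree (illustrative, not anyone's actual aggregator code):

```python
import xml.etree.ElementTree as ET

# The same HTML fragment, once entity-escaped and once wrapped in CDATA:
escaped = "<content>&lt;b&gt;bold&lt;/b&gt;</content>"
cdata   = "<content><![CDATA[<b>bold</b>]]></content>"

# Both parse to identical character data; the parser sees no markup
# inside either one.
print(ET.fromstring(escaped).text)  # <b>bold</b>
print(ET.fromstring(cdata).text)    # <b>bold</b>
```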