It’s just data

Is my weblog well formed?

The W3C validator says it not only is well formed, but valid.  I also run a nightly cron job which validates the pages served that day against the XHTML DTDs, and I serve the content with the XHTML MIME type to browsers that support it, which causes Mozilla, at least, to be ultra-strict about well-formedness.
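A nightly well-formedness check of this kind can be sketched in a few lines of Python. This is only an illustration: the directory path is hypothetical, and expat checks well-formedness, not DTD validity, so real validation against the XHTML DTDs would need a validating parser.

```python
import glob
import xml.parsers.expat

def is_well_formed(path):
    """Return True if the file at `path` parses as well-formed XML."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        with open(path, "rb") as f:
            parser.ParseFile(f)
        return True
    except xml.parsers.expat.ExpatError as err:
        print(f"{path}: {err}")
        return False

if __name__ == "__main__":
    # Run from cron over the day's served pages (path is illustrative)
    for page in glob.glob("/var/www/served/*.html"):
        is_well_formed(page)
```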

But can I be sure?

Based on these two tests, my conclusion is that, at the present time, the requirement that all XML parsers reject all non-well-formed documents they are presented with is a Platonic ideal... something that perhaps can be aspired to, but rarely if ever seen in the real world.

So, in a nihilist sense, no, I can not be sure.  I'm relying on imperfect tools to reassure me that I am doing it right.

- - -

Brent and Nick have both stated their intent to reject feeds that are not well formed.  They, too, will undoubtedly be relying on imperfect tools to implement this policy.  So they, too, can never be quite sure.  Frankly, I don't see how requiring Atom consumers to make up for the inadequacies of whatever XML parser they chose to use, either by providing a front-end filter or by writing their own parser, would substantially improve the situation.

So, why are they doing this?  My intuition tells me that this is based on a sincere desire to move from a world in which producers need to conform to whatever a predominance of consumers may happen to accept into a world in which there is a single clear definition of what is acceptable.

It also appears to be a response to the growing recognition that liberal parsing strategies are a slippery slope, and not a particularly evolutionarily stable strategy.  My next (and final) test will explore this a bit further.

- - -

If you accept that perfection can't ever be achieved, the next question is whether such policies will substantially improve the quality of inbound feeds, or whether they will in fact cause mass defection of users to other tools or formats.  Or both.

Given the data presented so far, I don't see conclusive evidence for the oft-repeated claim that feed parsers that aren't liberal will be at a significant competitive disadvantage.  SharpReader is (fairly) conservative, has competition, and seems to be doing fine.

Whether ill formed feeds exist because early aggregators were liberal or whether many of today's aggregators are liberal because of the existence of ill formed feeds is an imponderable.  Both are likely to be true.  The key ingredient that appears to be lacking to break this vicious circle is an effective feedback loop.

I do agree that, in an abstract sense, the efforts organized by Syndic8 and the existence of the feedvalidator are the "right" way to address the problem of well-formedness, but these efforts to date do not appear to be sufficient.

I have hopes that the courageous and noble stands being made by Brent and Nick will make a difference.  And that the end result will benefit Luke and Dare and others that wish to employ "real" XML parsers.  This is because, contrary to popular belief, there are exceptions to Postel's "law".

Note: nothing in this endorsement should be construed to imply that an aggregator needs to be abrasive or abusive in their application of this policy.  I may be biased, but I do like SharpReader's approach of linking to the feedvalidator first, and providing the email address for feedback to the aggregator author second.

- - -

One thing that needs to be said is that this needs to be a voluntary action on the part of aggregator authors.  Each tool author needs to be free to modify, and potentially reverse entirely, their stated policy based on the feedback they receive, without feeling that they are somehow letting down the Atom community.  As tool authors, their first responsibilities are not to the producers or to the spec writers, but to their user base.


RE: Is my weblog well formed?

Sam,

A spec is a guideline, not a law punishable by death. The benefit of following the spec is interoperability and predictable behavior. However, sometimes the reality is that developers fail to comply with the spec, either because its requirements are too onerous and thus impractical, or because it is flat out incorrect. Unlike Mark, I don't believe that quotes taken out of context from some spec and repeated dogmatically are laws set in stone that everyone must believe in. However, I do agree with his sentiment that all you can do is draw a line in the sand, and choosing to draw the line at well-formed XML is fairly arbitrary.

Well-formed XML is the line I've drawn in the sand for RSS Bandit, and I bend over backwards to fix other issues in feeds. Between Torsten and me, we autogenerate titles, accept every date format under the sun regardless of whether it conforms to RFC 822 or ISO 8601, and we import OPML files even when they don't have the required header element. If a feed has a description but no link or title, it should still display in RSS Bandit. Some or all of these things are probably grounds for a fatal error if the specs are taken literally, but I'll stick with our decisions.

My reasons for drawing my line in the sand at well-formed XML are at http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=ad9a8344-1eb2-4dcd-ad3c-88d0ed538470 although, if I were honest with myself, the line should be drawn at erroring for any MUST from the RSS specs that a feed fails to comply with. Bah, I have a class to attend. Have a nice day.
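The kind of liberal date handling Dare describes can be sketched in Python. The function name and the exact set of fallback formats are illustrative, not RSS Bandit's actual code: try the RFC 822 flavor first, then a few ISO 8601 / W3CDTF shapes, and give up gracefully.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime

def parse_feed_date(text):
    """Try the common feed date flavors in turn; return None if all fail."""
    # RFC 822 style, as used by RSS 2.0: "Sat, 07 Sep 2002 00:00:01 GMT"
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        pass
    # ISO 8601 / W3CDTF style, as used by RSS 1.0 and Atom drafts
    for fmt in ("%Y-%m-%dT%H:%M:%S%z", "%Y-%m-%dT%H:%M:%SZ", "%Y-%m-%d"):
        try:
            dt = datetime.strptime(text, fmt)
            # Assume UTC when the feed omits a timezone
            return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)
        except ValueError:
            pass
    return None
```

Each added fallback widens what the tool accepts, which is exactly the slippery slope discussed above.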

Message from Dare Obasanjo at

Should Googlebot refuse to index an ill-formed feed? There's a chance that it contains information found nowhere else.

Posted by Robert Sayre at

Robert - I would say that that is a choice that Google needs to make.

Posted by Sam Ruby at

I continue to believe that codifying into the spec a well-defined set of behaviours that a User Agent MAY implement to process a non-valid document is the best way forward. It should also be stressed that, in this regime, a User Agent wishing to parse a non-valid document MUST only use the error correction specified. This solves two major problems with the idea that UAs should be able to do anything they like to fix up documents: non-compatibility between different implementations, and forward compatibility. Both of these problems plague the existing set of HTML 4 documents, to the extent that only one UA will render close to 100% of sites correctly, and further development of that UA will necessarily break backward compatibility.

Incidentally, if your data really needs to be transmitted correctly, the well-formedness of the document at the client end is not a good guide to whether this has occurred. I'm guessing that online banks use HTML just as malformed as the rest of the web. If you need to ensure correct transmission, you need some sort of error detection mechanism.

Posted by jgraham at

re: "Well-formed XML is the line I've drawn in the sand for RSS Bandit"

No, it isn't.  The line you've drawn in the sand is "whatever my built-in tools happen to do by default", which is neither a particularly courageous nor a particularly interoperability-enhancing choice, especially given your explicit admission that your built-in tools operate in a non-conformant mode by default.

Brent and Nick are also guilty of this "lazy draconianism", which can only be truthfully stated as "I care about well-formedness right up until the point where my own tools are buggy."

Posted by Mark at

Sam Ruby on syndication and XML well-formed-ness

Sam Ruby: Is my weblog well-formed? (I’d quote a standout line or two from this, but there’s no way to decide—I’d end up quoting the whole thing. So just go read it.)...

Excerpt from inessential.com at

Based on these two tests, my conclusion is that, at the present time, the requirement that all XML parsers reject all non-well-formed documents they are presented with is a Platonic ideal...

I think that's incorrect. We did hear about RSS clients that accepted the samples, but it seems those clients either didn't use a conforming XML parser or messed with the stream before passing it to the parser.

The only actual XML parser bug I've seen is Mozilla's XML parser not rejecting bad UTF-8 byte sequences. This is a known bug (see Bugzilla) and is likely to be fixed at some point. If this proves anything, it's that the authors of XML parsers are absolutely willing to fix conformance issues once they are detected.

So please don't blame the parsers; blame the clients that use "custom" parsers or that explicitly work around problems in the content before passing it to the parser...

Julian

Posted by Julian Reschke at

Julian, Brent has explained what he does to circumvent the default processing for RSS feeds.

[link]

This process as described would not account for his XML parser's failure to reject control characters.  His XML parser is simply buggy.

Posted by Mark at

I've read Mark's comments on the well-formedness issue, and I've read Brent's, and I've read Tim's.

They're all minds I respect, and they all make a lot of sense.

I'm a translator by trade. One of the biggest challenges that a translator confronts is mind-reading, because people don't always write with great clarity--and sometimes say something that is the exact opposite of what they really mean ("I've got spurs that jingle jangle jingle as I go riding merrily along / And they sing 'oh ain't you glad you're single?' / And that song ain't so very far from wrong")--so I've got to guess at what they meant. There are often contextual or common-sense cues that help, but sometimes I wind up translating something that I know can't be right, but I just don't know what is right.

Parser authors must confront similar conundrums. I haven't tried very hard, but I am guessing it should be possible to create a test document so badly formed that there is nothing to help the parser guess what the generator really meant (perhaps the original document was so poorly constructed that it really didn't mean anything sensible). What to do in that case? Is that where you draw the line in the sand?

Back up from that line a bit. Imagine another error that is commonly committed. You've seen it before, you know why it happens, and you know how to fix it. Another feed generator comes along that generates invalid feeds that contain the same apparent error, but the underlying cause is completely different. Applying your existing fix could just make matters worse, and there's no way of identifying with certainty the feed generator so as to apply different fixes based on the known behavior of that generator.

In short, I'm wondering this: does a liberal parser need to be an indiscriminate parser? How liberal is "liberal"?

Posted by Adam Rice at

re: "the authors of XML parsers are absolutely willing to fix conformance issues once they are detected"

The fact that the bug in Mozilla was reported a year and a half ago and has yet to be fixed does nothing to bolster this claim.

Posted by Mark at

Mark,
  The fact that I use what the .NET Framework's XML parser does by default is an oversight, one compounded by the fact that the flag to tweak the conformance level is badly named, because its semantics are overloaded to do many things besides that one conformance setting. I plan to fix the bug this weekend when I start working on my SIAM implementation; it shouldn't be more than changing a flag in a couple of places in the code.

Posted by Dare Obasanjo at

It seems to me that the only thing Brent and Nick are "Guilty" of is trying to do their best to meet some obligations recommended to them by the authors of XML 1.0.

The conclusion "In the real world, producers of ATOM feeds cannot produce valid or well formed markup all of the time" may of course be true, however XML 1.0 is clear about what a parser must do about that.

The argument that, because neither Brent nor Nick (nor apparently anyone else) can succeed in writing a parser able to detect every instance of non-well-formedness, they are somehow relieved of any requirement to report the instances they do find seems to me to fall into the general category of "grasping at straws".

To say that Brent and Nick are taking a "courageous and noble" stand just by attempting to comply as best they can with a W3C recommendation seems an odd thing to say, although I do agree that they have exhibited these qualities in the way they have handled the political pressure and personal innuendoes applied to them over this.

If you are uncomfortable with the requirements of XML 1.0 - if you believe it to be so broken and impractical, why did you choose it as the format for ATOM?

Posted by Chris Bentley at

Chris - where is this conclusion from which you quote?

Posted by Sam Ruby at

Is this not a conclusion you are asking me to draw?

Posted by Chris Bentley at

Even the Mozilla bug is not a case where an off-the-shelf XML tool is broken. Like many other bugs that have popped up, it, too, is a case where messing with the unparsed stream bypasses a well-formedness check.

The XML spec requires XML processors to support UTF-8 and UTF-16. However, it leaves a huge back door open by permitting the use of other encodings as well. Then the XML spec makes conformance with the character encoding spec a well-formedness requirement. This works great when the XML processor accepts only the encodings it must accept and implements the character decoder internally. When application writers (for whatever reason) believe they need to support encodings other than those required by the XML spec, they tend to use external off-the-shelf decoders, which tend to be lenient.

What happens with Mozilla is that it converts everything to UTF-16 using its lenient (designed for HTML tag soup) converters. Therefore, expat doesn't see the original UTF-8 bytes.

(In a way, supporting extra encodings is pointless, because content generators should want to use only UTF-8 or UTF-16, since those are the only two encodings that are guaranteed to work.)
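The mechanism Henri describes is easy to reproduce with Python's expat binding (a sketch only; Mozilla's actual decoding pipeline differs in detail): feed the raw bytes in and the bad UTF-8 is a fatal error, but launder the stream through a lenient decoder first and the parser never sees it.

```python
import xml.parsers.expat

bad = b"<a>\xc3</a>"  # 0xC3 opens a two-byte UTF-8 sequence that is never completed

# A conforming XML processor fed the raw bytes rejects the document:
strict = xml.parsers.expat.ParserCreate()
try:
    strict.Parse(bad, True)
    strict_ok = True
except xml.parsers.expat.ExpatError:
    strict_ok = False  # expat reports a not-well-formed error

# A lenient front-end converter (like a tag-soup HTML decoder) silently
# substitutes U+FFFD for the bad byte, so the parser downstream never
# sees the original bytes and the document sails through:
laundered = bad.decode("utf-8", errors="replace")
lenient = xml.parsers.expat.ParserCreate()
lenient.Parse(laundered, True)
lenient_ok = True
```

The lenient decode is precisely the kind of "messing with the unparsed stream" that bypasses the well-formedness check.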

- - -

Now, if it is decided that Atom documents don't need to be well-formed, why go on pretending that the serialization of Atom is XML, when by definition it is not if well-formedness isn't required? Wouldn't it be more honest to specify a different-looking infoset serialization that isn't confusingly similar to XML?

Posted by Henri Sivonen at

Henri,

thanks for the explanation of Mozilla's bug. I suspected something like that, as I knew that Mozilla uses Expat internally.

And yes, if Atom is going to allow acceptance of non-wf XML -- violating (a) XML, (b) the W3C "Architecture of the World Wide Web", and (c) RFC 3470, "Guidelines for the Use of Extensible Markup Language (XML) within IETF Protocols", section 4.1 -- then it really shouldn't pretend to use XML, and should choose another base instead.

Julian

Posted by Julian Reschke at

Can I make a stupid suggestion to the strict constructionists?  Have the well-formedness bigot tool correct the RSS.  Have it crawl the blogspace looking for non-well-formed RSS, etc.  Then, if it can be parsed at all, add it to a directory of "fixed and reformed RSS".  Post to their referrer log to supply adequate shame.  If it cannot be parsed at all, it's probably bad enough that they will fix it.

Posted by Andy at

I wish that RSS feeds would wrap their content in CDATA sections. To me, that is the correct way to store HTML content inside any XML format. I display a few blogs on my website using an XSL stylesheet to transform the feed into HTML content, but the content never displays correctly; it displays the markup itself.

Posted by Joe Audette at

Joe,

having escaped HTML content and having unescaped HTML content inside a CDATA section means the same thing. So this wouldn't change anything. Try it.
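This is easy to check with any XML parser. For instance, using Python's standard ElementTree (an illustrative sketch), the escaped form and the CDATA form yield exactly the same character data:

```python
import xml.etree.ElementTree as ET

escaped = "<description>&lt;p&gt;Hello&lt;/p&gt;</description>"
cdata = "<description><![CDATA[<p>Hello</p>]]></description>"

# Both parse to the identical text content: the literal markup string
assert ET.fromstring(escaped).text == "<p>Hello</p>"
assert ET.fromstring(cdata).text == "<p>Hello</p>"
```

CDATA is purely a convenience for authors; it changes nothing in the parsed infoset, which is why Joe's XSL transform sees the markup as text either way.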

Julian

Posted by Julian Reschke at

If people won't go to the validator

I think there are a lot of people who are willing to do a little bit to improve feed quality, if it's not too hard and they can do it from where they are already.... [more]

Trackback from dive into mark at

Feedback loops

Is this feed valid?  Both SharpReader and Bloglines handle it flawlessly.  In fact, there are active blogline subscribers. The feedvalidator  chokes on it. Is this feed valid?  Both SharpReader and Bloglines handle it flawlessly.  In fact, there are a... [more]

Trackback from Sam Ruby at

Vigilance

For the record, despite all of my  efforts, when Evan went to check, my setup had **GASP** stopped serving my main page with the  appropriate mime type to standards compliant browsers.  The problem appears to be an unfortunate interaction between  D... [more]

Trackback from Sam Ruby at

Add your comment