The W3C validator says it not only is well formed, but valid. I
also run a nightly cron job which validates the pages served for
that day against the XHTML DTDs provided. I also serve the content
using the XHTML mime type to browsers which support it, which causes
Mozilla at least to be ultra-strict about well formedness.
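As a rough illustration of the two checks described above (not the actual scripts behind this site), here is a minimal Python sketch. The function names is_well_formed and choose_mime_type are hypothetical. The check uses the expat parser from the standard library and only tests well formedness, not validity against a DTD, and the Accept header handling is deliberately naive in that it ignores q-values.

```python
import xml.parsers.expat

XHTML_MIME = "application/xhtml+xml"
HTML_MIME = "text/html"


def is_well_formed(document: bytes) -> bool:
    """Return True if expat parses the whole document without error.

    This checks well formedness only; it does not validate against a DTD.
    """
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(document, True)  # True marks this as the final chunk
    except xml.parsers.expat.ExpatError:
        return False
    return True


def choose_mime_type(accept_header: str) -> str:
    """Serve XHTML only to clients that explicitly list it (hypothetical helper).

    Assumption: q-values and wildcards are ignored for simplicity.
    """
    media_types = [
        part.split(";")[0].strip().lower() for part in accept_header.split(",")
    ]
    return XHTML_MIME if XHTML_MIME in media_types else HTML_MIME
```

A nightly job could run is_well_formed over each page served that day and report any failures; a real setup would add DTD validation with a validating parser.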
But can I be sure?
Based on these two tests, it is my conclusion that, at the present
time, the requirement that all XML parsers reject all non-well
formed documents that they are presented with is a Platonic ideal...
something that perhaps can be aspired to, but something that is
rarely if ever seen in the real world.
So, in a nihilist sense, no, I can not be sure. I'm relying on
imperfect tools to reassure me that I am doing it right.
- - -
Brent and Nick have both stated their intent to reject feeds that
are not well formed. They, too, will undoubtedly be relying on
imperfect tools to implement this policy. So they, too, can never
be quite sure. Frankly, I don't see how a requirement that Atom
consumers make up for the inadequacies of whatever XML parser they
chose to use, either by providing a front end filter or by writing
their own parser, would substantially improve the situation.
So, why are they doing this? My intuition tells me that
this is based on a sincere desire to move from a world in which
producers need to conform to whatever a predominance of other
consumers may happen to accept and into a world in which there is a
single clear definition as to what is acceptable.
It also appears to be a response to the growing recognition that
liberal parsing strategies are a slippery slope, and not a
particularly evolutionarily stable strategy. My next (and final)
test will explore this a bit further.
- - -
If you accept that perfection can't ever be achieved, the next
question to face is whether such policies will substantially
improve the quality of inbound feeds or whether they will in fact
cause mass defection of users to other tools or formats. Or both.
Given the data presented so far, I don't see conclusive evidence
for the oft repeated claim that feed parsers that aren't liberal
will be at a significant competitive disadvantage. SharpReader is
(fairly) conservative, has competition, and seems to be doing fine.
Whether ill formed feeds exist because early aggregators were
liberal or whether many of today's aggregators are liberal because
of the existence of ill formed feeds is an imponderable. Both are
likely to be true. The key ingredient that appears to be lacking
to break this vicious circle is an effective feedback loop.
I do agree that, in an abstract sense, the efforts organized by
Syndic8 and the existence of the feedvalidator are the "right" way
to address the problem of well formedness, but these efforts to
date do not appear to be sufficient.
I have hopes that the courageous and noble stands being made by
Brent and Nick will make a difference, and that the end result
will benefit Luke and Dare and others that wish to employ "real"
XML parsers. This is because, contrary to popular belief, there
are exceptions to Postel's "law".
Note: nothing in this endorsement should be construed to imply
that an aggregator needs to be abrasive or abusive in its
application of this policy. I may be biased, but I do like
SharpReader's approach of linking to the feedvalidator first, and
providing the email address for feedback to the aggregator author
second.
- - -
One thing that needs to be said is that this needs to be a
voluntary action on the part of aggregator authors. Each tool
author needs to be free to modify, and potentially reverse
entirely, their stated policy based on feedback that they
receive, without feeling that they are somehow letting down
the Atom community. As tool authors, their first
responsibilities are not to the producers or to the spec writers,
but to their user base.
RE: Is my weblog well formed?
Sam,
A spec is a guideline, not a law punishable by death. The benefit of following the spec is interoperability and predictable behavior. However, sometimes the reality is that developers fail to comply with the spec, either because its requirements are too onerous and thus impractical or because it is flat out incorrect. Unlike Mark, I don't believe that quotes taken out of context from some spec and repeated dogmatically are laws set in stone that everyone must believe in. However, I do agree with his sentiment that all you can do is draw a line in the sand, and that choosing to draw the line in the sand at well-formed XML is fairly arbitrary.
Well-formed XML is the line I've drawn in the sand for RSS Bandit and I bend over backwards to fix other issues in feeds. Between Torsten and I we autogenerate titles, accept every date format under the sun regardless of whether they are conformant to RFC 822 or ISO 8601 and we import OPML files even when they don't have the required header element. If a feed has a description but no link or title it should still display in RSS Bandit. Some or all of these things are probably grounds for a fatal error if the specs are taken literally but I'll stick with our decisions.
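The "every date format under the sun" kind of leniency can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not RSS Bandit's code: the function name parse_feed_date is hypothetical, and only the two families named above (RFC 822 style and ISO 8601 style) are tried, using standard library parsers.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_feed_date(value: str) -> Optional[datetime]:
    """Try an RFC 822 style date first, then an ISO 8601 style date.

    Returns None when neither parser accepts the value.
    """
    value = value.strip()
    try:
        return parsedate_to_datetime(value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        pass
    try:
        # fromisoformat on older Pythons does not accept a trailing "Z".
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
```

Each extra format accepted this way is a small step away from the spec, which is exactly the trade-off under discussion.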
My reasons for drawing my line in the sand at well-formed XML are at http://www.25hoursaday.com/weblog/PermaLink.aspx?guid=ad9a8344-1eb2-4dcd-ad3c-88d0ed538470 although if I were honest with myself the line should be drawn at erroring for any MUST from the RSS specs that the feed does not comply with.
Bah, I have a class to attend. Have a nice day.
I continue to believe that codifying into the spec a well defined set of behaviours that a User Agent MAY implement to process a non-valid document is the best way forward. It should also be stressed that, in this regime, a User Agent wishing to parse a non-valid document MUST only use the error correction specified. This solves two major problems with the idea that UAs should be able to do anything they like to fix up documents: non-compatibility between different implementations and forward compatibility. Both of these problems plague the existing set of HTML 4 documents to the extent that only 1 UA will render close to 100% of sites correctly and further development of that UA will necessarily break backward compatibility.
Incidentally, if your data really needs to be transmitted correctly, the wellformedness of the document at the client end is not a good guide to whether this has occurred. I'm guessing that online banks use HTML just as malformed as the rest of the web. If you need to ensure correct transmission you need some sort of error detection mechanism.
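To make that last point concrete, here is a small Python sketch, assuming the sender can publish a SHA-256 digest of the payload alongside it. The helper names digest and received_intact and the sample payloads are made up for illustration. The corrupted payload is still perfectly well-formed XML, so a well-formedness check alone would not notice the change.

```python
import hashlib
import hmac
import xml.etree.ElementTree as ET

SENT = b"<balance>100</balance>"
CORRUPTED = b"<balance>900</balance>"  # altered in transit, still well formed


def digest(payload: bytes) -> str:
    """Hex SHA-256 digest the sender would publish with the payload."""
    return hashlib.sha256(payload).hexdigest()


def received_intact(payload: bytes, expected_digest: str) -> bool:
    """Compare the digest of what arrived with what the sender announced."""
    return hmac.compare_digest(digest(payload), expected_digest)


if __name__ == "__main__":
    expected = digest(SENT)
    # Parsing succeeds for both payloads; only the digest tells them apart.
    print(ET.fromstring(CORRUPTED).text)
    print(received_intact(SENT, expected))
    print(received_intact(CORRUPTED, expected))
```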
re: "Well-formed XML is the line I've drawn in the sand for RSS Bandit"
No, it isn't. The line you've drawn in the sand is "whatever my built-in tools happen to do by default", which is neither a particularly courageous nor a particularly interoperability-enhancing choice, especially given your explicit admission that your built-in tools operate in a non-conformant mode by default.
Brent and Nick are also guilty of this "lazy draconianism", which can only be truthfully stated as "I care about well-formedness right up until the point where my own tools are buggy."
Sam Ruby: Is my weblog well-formed? (I’d quote a standout line or two from this, but there’s no way to decide—I’d end up quoting the whole thing. So just go read it.)...
Based on these two tests, it is my conclusion that at the present time that the requirement that all XML parsers reject all non-well formed documents that they are presented with as a Platonic ideal...
I think that's incorrect. We did hear about RSS clients that accepted the samples, but it seems those clients either didn't use a conforming XML parser, or that they messed with the stream before passing it to the parser.
The only actual XML parser bug I've seen is in Mozilla's XML parser not rejecting bad UTF-8 byte sequences. This is a known bug (see BugZilla) and is likely to be fixed at some point in time. If this proves anything, it's that the authors of XML parsers are absolutely willing to fix conformance issues once they are detected.
So please don't blame the parsers, blame the clients that use "custom" parsers or that explicitly work around problems in the content before passing it to the parser...
I've read Mark's comments on the well-formedness issue, and I've read Brent's, and I've read Tim's.
They're all minds I respect, and they all make a lot of sense.
I'm a translator by trade. One of the biggest challenges that a translator confronts is mind-reading. Because people don't always write with great clarity--and sometimes say something that is the exact opposite of what they really mean ("I've got spurs that jingle jangle jingle as I go riding merrily along / And they sing 'oh ain't you glad you're single?' / And that song ain't so very far from wrong"). I've got to guess at what they meant. There are often contextual or common-sense cues that help, but sometimes I wind up translating something that I know can't be right, but I just don't know what is right.
Parser authors must confront similar conundrums. I haven't tried very hard, but I am guessing it should be possible to create a test document badly formed in a way that there is nothing to help the parser guess what the generator really meant (perhaps the original document was so poorly constructed that it really didn't mean anything sensible). What to do in that case? Is that where you draw the line in the sand?
Back up from that line a bit. Imagine another error that is commonly committed. You've seen it before, you know why it happens, and you know how to fix it. Another feed generator comes along that generates invalid feeds that contain the same apparent error, but the underlying cause is completely different. Applying your existing fix could just make matters worse, and there's no way of identifying with certainty the feed generator so as to apply different fixes based on the known behavior of that generator.
In short, I'm wondering this: does a liberal parser need to be an indiscriminate parser? How liberal is "liberal"?
Mark,
The fact that I use what the .NET Framework's XML parser does by default is an oversight, one compounded by the fact that the flag to tweak the conformance level is badly named because its semantics are overloaded to do many things besides that one conformance setting. I plan to fix the bug this weekend when I start working on my SIAM implementation; it shouldn't be more than changing a flag in a couple of places in the code.
It seems to me that the only thing Brent and Nick are "Guilty" of is trying to do their best to meet some obligations recommended to them by the authors of XML 1.0.
The conclusion "In the real world, producers of ATOM feeds cannot produce valid or well formed markup all of the time" may of course be true, however XML 1.0 is clear about what a parser must do about that.
The argument that neither Brent nor Nick (or apparently anyone else) can succeed in writing a parser which is able to detect every instance of non-well formedness, therefore, somehow relieves them of any requirement to report any such instances that they do find seems to me to fall in the general category - "Grasping at straws".
To say that Brent and Nick are taking a "courageous and noble" stand just by attempting to comply as best they can with a W3C recommendation seems an odd thing to say to me, although I do agree that they have exhibited these qualities in the way they have handled the political pressure and personal innuendoes applied to them over this.
If you are uncomfortable with the requirements of XML 1.0 - if you believe it to be so broken and impractical, why did you choose it as the format for ATOM?
Even the Mozilla bug is not a case where an off-the-shelf XML tool is broken. Like many other bugs that have popped up, it, too, is a case where messing with the unparsed stream bypasses a well-formedness check.
The XML spec requires XML processors to support UTF-8 and UTF-16. However, it leaves a huge back door open when permitting the use of other encodings as well. Then the XML spec makes conformance with the character encoding spec a well-formedness requirement. This works great when the XML processor accepts only the encodings it must accept and implements the character decoder internally. When application writers (for whatever reason) believe they need to support encodings other than those required by the XML spec, they tend to use external off-the-shelf decoders which tend to be lenient.
What happens with Mozilla is that it converts everything to UTF-16 using its lenient (designed for HTML tag soup) converters. Therefore, expat doesn't see the original UTF-8 bytes.
(In a way, supporting extra encoding is pointless, because content generators should want to use only UTF-8 or UTF-16, because those are the only two encodings that are guaranteed to work.)
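The effect is easy to reproduce outside Mozilla. The following Python sketch is an analogy, not Mozilla's code: it feeds the same bytes to expat twice, once raw and once after a lenient decoding step. The sample document and the helper name parses are invented for illustration, and Python's errors="replace" decoding stands in for the lenient converter.

```python
import xml.parsers.expat

# 0xE9 is Latin-1 "e acute"; on its own it is not a valid UTF-8 sequence.
BAD_UTF8 = b'<?xml version="1.0" encoding="utf-8"?><title>caf\xe9</title>'


def parses(data) -> bool:
    """Return True if expat accepts the data (bytes or already-decoded str)."""
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(data, True)
    except xml.parsers.expat.ExpatError:
        return False
    return True


if __name__ == "__main__":
    # Raw bytes: expat sees the bad byte sequence itself.
    print(parses(BAD_UTF8))
    # Lenient front end: the bad byte becomes U+FFFD before expat ever sees it.
    print(parses(BAD_UTF8.decode("utf-8", errors="replace")))
```

In this sketch the parser rejects the raw bytes but accepts the pre-decoded text, because the replacement character is a legal XML character and the evidence of the encoding error is gone.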
- - -
Now, if it is decided that Atom documents don't need to be well-formed, why go on pretending that the serialization of Atom is XML when it, by definition, is not if well-formedness isn't required? Wouldn't it be more honest to specify a different-looking infoset serialization that isn't confusingly similar to XML?
Can I make a stupid suggestion to the strict constructionists? Have the well formed bigot tool correct the RSS. Have it crawl the blogspace looking for non-well formed RSS, etc. Then, if it can be parsed at all, add it to a directory of "fixed and reformed RSS". Post to their referrer log to supply adequate shame. If it cannot be parsed, well, it's probably bad enough that they will fix it.
I wish that RSS feeds would wrap their content in CDATA sections. To me that is the correct way to store html content inside any xml format. I display a few blogs on my website using an xsl stylesheet to transform the feed into html content, but the content never displays correctly, it displays the markup itself.
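For what it is worth, an XML parser hands the application the same character data whether the HTML was entity-escaped or wrapped in a CDATA section. The Python sketch below, using invented sample elements, shows this with the standard library's ElementTree. If that holds for the XSLT processor in use as well, markup showing up literally may have more to do with how the stylesheet writes the text out than with the feed's choice of escaping; that last part is an assumption, not something tested here.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same description (sample data for illustration).
ESCAPED = (
    "<description>&lt;p&gt;Hello &lt;b&gt;world&lt;/b&gt;&lt;/p&gt;"
    "</description>"
)
CDATA = "<description><![CDATA[<p>Hello <b>world</b></p>]]></description>"


def description_text(xml_source: str) -> str:
    """Return the character data of the root element as the parser reports it."""
    return ET.fromstring(xml_source).text


if __name__ == "__main__":
    print(description_text(ESCAPED) == description_text(CDATA))
```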
I think there are a lot of people who are willing to do a little bit to improve feed quality, if it's not too hard and they can do it from where they are already....
Is this feed valid? Both SharpReader and Bloglines handle it flawlessly. In fact, there are active blogline subscribers. The feedvalidator chokes on it...
For the record, despite all of my efforts, when Evan went to check, my setup had **GASP** stopped serving my main page with the appropriate mime type to standards compliant browsers. The problem appears to be an unfortunate interaction between D...