I've thought about Brent's
compromise, and to borrow a phrase that is a favorite of Tim
Bray, I think that there is a way that 80% of the value can be
obtained with 20% of the effort. Is there really a market
requirement to be selectively pedantic on a feed-by-feed basis?
It seems to me that there are two levels of errors:
unrecoverable and recoverable. An HTTP status code of 404 is
something that the aggregator cannot work around. On the
other hand, a malformed date may marginally degrade the user's
experience, but arguably should not prevent the user from seeing
what other data can be salvaged from the feed.
Unrecoverable errors, by necessity, need to be handled each
time a feed is retrieved, but do recoverable errors need to be
reported each time such an error is encountered? I
mean, do thousands of people need to be alerted whenever a stray
smart quote appears on boingboing?
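One way to make that distinction concrete is to handle a recoverable error at the level of a single field. A minimal sketch in Python (the function name and warning text are illustrative, not from any shipping aggregator):

```python
from email.utils import parsedate_to_datetime

def read_pub_date(value):
    """Treat a malformed pubDate as a recoverable error: return the
    parsed date when possible, otherwise a warning, but never abort
    processing of the rest of the feed."""
    try:
        return parsedate_to_datetime(value), None
    except (TypeError, ValueError):
        return None, "date %r may be misinterpreted or ignored" % (value,)
```

A 404, by contrast, leaves nothing to salvage; it has to surface every time.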
From strictly an engineering point of view, is that the right
design for a feedback loop? My experience is that, in
addition to alerting the wrong person, an overabundance of such
alerts tends to dull the message. People will simply tune them out.
An alternative might be to only validate on subscription.
This would certainly reduce the number of such messages. It
would also present such messages to users at a time when they might
act on them.
I would also suggest that all such messages be oriented to their
target audience. If a feed contains encoding errors, let the
user know that some characters may not appear as intended. If
the feed is missing a required element, tell the user what they
will be missing. If a date is not of the appropriate format,
let the user know that such information may be misinterpreted or ignored.
This information could be accompanied by a simple checkbox to
inhibit the display of further messages.
Hopefully, such an approach will ultimately result in a more
educated consumer base. A greater demand for higher quality
feeds would certainly not be an unwelcome side effect.
It also means that feeds would be sampled regularly.
Parting thought: in my opinion, such checks don't have to be
bulletproof, merely effective. Apply the 80/20 rule here
too. The well-formedness checks provided by your off-the-shelf
parser can generally be obtained with a few lines of
code. Ditto for a simple scan for required elements. I
can share the regular expressions used by the feed validator.
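A sketch of what such an 80/20 check might look like in Python; the element list and regexes here are illustrative, not the feedvalidator's actual rules:

```python
import re
import xml.etree.ElementTree as ET

# RSS 2.0 requires these three channel elements.
REQUIRED = ("title", "link", "description")

def quick_check(feed_text):
    """An 80/20 sanity check: well-formedness from an off-the-shelf
    parser, plus a crude regex scan for required elements."""
    problems = []
    try:
        ET.fromstring(feed_text)
    except ET.ParseError as e:
        problems.append("not well-formed: %s" % e)
    for name in REQUIRED:
        if not re.search("<%s[ \t>]" % name, feed_text):
            problems.append("missing <%s>" % name)
    return problems
```

Neither half is bulletproof, but together they catch the overwhelming majority of broken feeds.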
However, I do have one suggestion. I would suggest that
this not lead to a practice whereby each consumer documents what
subset or superset of the various specifications they support at
the moment. It would be better for all concerned if such
checks are made, and errors are reported, in terms of the original specifications.
Just expose the feedvalidator as a webservice :) Preferably, something simple and RESTful. Perhaps a GET, with the URL of the feed provided as a parameter and the result an XML doc describing the errors in the feed.
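A rough sketch of what that could look like; the endpoint, parameter name, and XML vocabulary below are all invented for illustration, since no such service is actually specified:

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# Hypothetical endpoint and parameter name.
feed = "http://example.org/index.rss"
check_url = "http://validator.example/check?" + urlencode({"url": feed})

# One possible shape for the result document (vocabulary invented):
result = """\
<validation feed="http://example.org/index.rss">
  <error line="12" column="8">Undefined entity: rsquo</error>
  <warning line="30" column="5">pubDate is not RFC 822</warning>
</validation>"""

# A client can then report each problem to its user however it likes.
doc = ET.fromstring(result)
errors = [e.text for e in doc.findall("error")]
warnings = [w.text for w in doc.findall("warning")]
```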
Do thousands of people need to be alerted whenever a stray smart quote appears on boingboing?
I think we're approaching this problem from the wrong direction. Instead of trying to make it easier for RSS consumers to validate a feed, we should be making it easier for RSS producers to learn if their feed is invalid.
I believe many people would be happy to fix their feed's errors, but going so far as to stop by feedvalidator every time they make a post is too much trouble. Therefore, if the blogger can't come to the validator, the validator should come to the blogger.
Someone should set up a system where people can sign up to have their blog's validity checked by a parser every n hours, and if there's a problem, they receive an e-mail with the error text explaining how they can fix it.
Alternatively, another solution would be to issue some kind of TrackBack ping to the validator upon each post and have that trigger a validity check on the URL sent. However, I'm not quite sure how to best determine an e-mail address to mail the results, given that nobody uses the <webMaster> tag because of spammers.
I'm not sure exactly of all the specifics here, but I'm sure they could be worked out to balance ease-of-use and server resources issues. I do think it's important that if the feed validates then you don't receive e-mail, because otherwise you'll end up spamming yourself, and it's important for failures to stand out.
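The core of such a service might look like this sketch (the scheduling and actual sending are stubbed out; function names are mine, not any existing system's). The key property from above is preserved: a valid feed produces no mail at all.

```python
import xml.etree.ElementTree as ET
from email.message import EmailMessage

def well_formedness_error(feed_text):
    """Return the parser's error text, or None if the feed is fine."""
    try:
        ET.fromstring(feed_text)
        return None
    except ET.ParseError as e:
        return str(e)

def notify(feed_url, feed_text, author_email):
    """Run by a scheduler every n hours (scheduling itself omitted).
    A valid feed produces no mail, so failures stand out instead of
    becoming background noise."""
    error = well_formedness_error(feed_text)
    if error is None:
        return None
    msg = EmailMessage()
    msg["To"] = author_email
    msg["Subject"] = "Your feed %s is not well-formed" % feed_url
    msg.set_content("Parser says: %s\nFix it and the mail stops." % error)
    # import smtplib; smtplib.SMTP("localhost").send_message(msg)
    return msg
```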
I've been told I'm an aberration often enough about this that I don't expect anyone to build what I want, but I for one would welcome selective alerting: if a member of my tribe slips up, I want an email to them to pop open, so I don't shirk my duty to let them know, but if it's one of my enemies? So be it.
And while I see your point about not selectively reporting missing things, I've spent enough time trying to convince people to add unused elements to their feeds so that aggregator authors would be willing to make use of them that I can also see the value in selectively saying "this feed lacks the <comments> element that would let me put a link to the comments over there on the right, and lacks the <slash:comments> element that would let me tell you how many comments there are already."
Heh. It's twisty, but the Trackback ping to validate might actually work with Movable Type's implementation at least. You can have a URL that you ping for every post in a category, and if you pinged the validator for every category (and categorized everything), then when you pinged from a post that made your feed invalid, the validator could just fail to respond, which would throw an error message in MT, and would tell it that it needed to try pinging again the next time that post was saved. Not exactly a consumer-grade web service, but cute.
It turns out that lots of people use NetNewsWire to monitor their own feeds. They subscribe to their own feeds, and if they stop working in NetNewsWire, then they validate them and fix the bugs.
It's for these people, in part, that I want the ability to be selective about which feeds require well-formed-ness and which don't.
There are also other people who simply care a great deal about this issue, and want to require well-formed-ness for all their subscriptions -- except for, say BoingBoing, which they know will have stray smart quotes sometimes but they want to read it anyway.
And then of course there are the majority of users who don't and shouldn't care about well-formed-ness. For them, the defaults will work nicely -- NetNewsWire will not require well-formed-ness and will not report well-formed-ness errors. In other words, it will work exactly as it works now.
I can't stress enough that all this well-formed-ness checking and error reporting will be optional, off by default.
Also, I don't plan at all to report bugs like malformed dates. Instead I plan to make the Validate this Feed command more prominent so people will use yours and Mark's on-line validator.
I think "negotiation" is an interesting concept to add to this discussion. . . I guess the assumption with RSS is that it is harder to "ask" an RSS server to deal with the format of its RSS than it is to deal with this in the RSS reader. I guess, in some ways, fat clients live on!...
I had the same reaction to the error reporting part of Brent's piece!
Spring, my universal canvas app for OS X, uses XSLT to convert all incoming XML to our XML format (Conceptual Object format) before converting to a dictionary. So, liberal XML parsing doesn't come without high engineering costs. I assume other tools that rely on XSLT face similar costs.
The recent discussions about how strict aggregators should be when reading invalid or ill formed feeds (e.g. RSS or Atom) brought to mind an idea: an automatic service that checks your feed for validity and send e-mail whenever it finds......
Validate on Subscription (or, my turn to compromise...)
Sam Ruby proposes that aggregators validate on subscription, and I have to confess that this makes more sense than my stated position of requiring Atom feeds to be well-formed. What Sam suggests is that aggregators such as FeedDemon inform of......
Hmm, I agree that the time of subscription is a critical one, but I've a feeling that retreating back to a 'warn only' position defeats the whole object. The dialog box might as well just say: "That was invalid, I don't care [OK]".
"Do thousands of people need to be alerted whenever a stray smart quote appears on boingboing?" Well, yes, if you think boingboing should be publishing valid XML. There should be at most one stray quote, before it gets fixed (following thousands of complaints).
Like Robb, I've got XSLT on my front end (nurse!) and there is a cost associated with tooling for tag soup.
But I think the far greater cost will be in losing the 'default' of validity (XML and Atom) that a more draconian approach would provide.
Assuming that everyone runs scared of valid XML, I think I might still try a 2-tier approach. If the data is valid Atom or RDF, it gets first class treatment. If it says it's Atom but doesn't validate, or is one of the looser RSS specs then the feed URI gets tagged as 'potentially unreliable' and left out of processing where junk might mess things up - indexing etc.
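That triage could be as simple as the sketch below, with well-formedness standing in for full validation; the namespaces are the Atom 0.3 and RSS 1.0 (RDF) ones, and the tier labels are just illustrative:

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://purl.org/atom/ns#"                    # Atom 0.3
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"  # RSS 1.0

def tier(feed_text):
    """Two-tier triage: well-formed Atom or RDF gets first-class
    treatment; anything else is flagged and kept out of fragile
    downstream processing such as indexing."""
    try:
        root = ET.fromstring(feed_text)
    except ET.ParseError:
        return "potentially unreliable"
    ns = root.tag[1:].split("}")[0] if root.tag.startswith("{") else ""
    return "first class" if ns in (ATOM_NS, RDF_NS) else "potentially unreliable"
```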
The "well-formed" terminology makes this look like it's purely ideological. Call it a "syntax error" and suddenly it doesn't seem petty. Similarly, "improving the well-formed-ness of feeds" isn't what this is about - it's about whether you can use a standard XML parser without your program becoming a second-class citizen. I think programs that say there's an error but still continue harm that goal, since they're basically saying "people with inferior software can't read this". On the other hand, I do like Sam's idea of telling the user how it might affect them, which at least partially negates the superiority effect.
It's what's known as a compromise Mark. During early testing too many feeds were failing for Shrook to be a viable product. But I'm not going to go any further. There's a big difference between structural flaws and the encoding not having been labeled. Shrook still does everything with a standard parser - crucially it isn't doing any of its own XML parsing. It just has two lines of code that do a Windows-1252 -> UTF-8 translation and try again. (btw Shrook doesn't have "Draconian Unicode error handling", it was just tripped up by Sam's particular test)
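Something like Graham's two-line recovery can be sketched with a standard parser (his actual code isn't shown here; this assumes a feed whose declaration says UTF-8 but whose bytes are really Windows-1252, e.g. a 0x92 smart quote):

```python
import xml.etree.ElementTree as ET

def parse_with_encoding_fallback(data: bytes):
    """First attempt: trust the document as-is, using a standard XML
    parser.  If that fails, assume the bytes are really Windows-1252,
    transcode to UTF-8, and try exactly once more."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        return ET.fromstring(data.decode("windows-1252").encode("utf-8"))
```

The parser itself is never second-guessed; only the byte stream handed to it is.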
Graham, you know where I stand on error recovery. I think second-guessing the encoding is a great idea, the more guessing the better. But some in the draconian camp would say that you're rewarding bigots and racists.
Tim Bray: "You need to know what encoding your data is in, so that for example when you see a Euro sign you know enough to emit &#8364;, not some Microsoft Code Page byte that's guaranteed not to work on lots of browsers. This can be tricky. But the alternative is, you're a parochial bigot. ... If your software can't manage to escape five special characters and fill in end-tags and quote attributes, it's failing to meet such a very low barrier to entry that it's probably pretty lame anyhow. And if developers are not willing to put in the effort to enable the non-white people of the world to use their software, I don't think [we] should condone or reward them."
Sam, according to the draconians, there are exactly 2 camps: the draconians (those who reject all ill-formed XML) and the tolerants (everyone else). Either a document is well-formed XML or it's not. You can't be a little ill-formed, just like you can't be a little pregnant.
The difference comes in how you choose to be tolerant, and here is where people baffle me. The tolerant camp is a very big tent. And according to the draconians, the minute you step into the tolerant camp (no matter which door you enter), you're rewarding bigots who hate brown people. So at that point, why not go all the way? I mean, why be a little tolerant?
Shrook accepts and displays documents that are not RSS, because they have misrepresented their character encoding. NetNewsWire accepts and displays documents that are not RSS, because they have unescaped ampersands. Pretty much everyone accepts and displays documents that are not RSS, because they contain illegal control characters. RSS Bandit accepts and displays documents that are not RSS, because they have invalid date formats (a validity issue, not a well-formedness issue).
My parser is the sum of all these sins, plus some.
Everybody's tolerant, but we're all tolerant in different ways. This is, in fact, exactly the nightmare scenario that the draconians envisioned 7 years ago. Hell, it's exactly the nightmare scenario the tolerants envisioned 7 years ago -- that nobody would be able to stomach "reject on first error" at an application level, so they would play games with their underlying "conforming" XML parsers by second-guessing them and feeding them crap repeatedly until it finally got accepted.
Mandatory draconian error handling hasn't increased interoperability; it's destroyed it. Nobody can stomach it, so everybody skirts it "just a little" -- each in their own way.
From my point of view, well-formedness is just one component of validity checking, and by that criterion, everybody is "a little pregnant". As for the racial references, Let That Be Your Last Battlefield.
Sam, that example is actually one of the reasons I wrote the ultra-liberal feed parser in the first place, because The Register used naked markup like that in their description. If it can no longer handle naked markup like that, I would classify that as a very serious bug indeed.
I'm working on a test suite for the feed parser, which would hopefully prevent regression bugs like this.
Last update: 18/01/04; 14:15:25 EDT