It’s just data

HTML charset vs XML encoding

Dare Obasanjo: Mark Pilgrim has a post entitled Determining the character encoding of a feed where he does a good job of summarizing what the various specs say about determining the character encoding of an XML document retrieved on the World Wide Web via HTTP. The only problem with his post is that although it is a fairly accurate description of what the specs say, it definitely does not reflect reality.

OK, so given this, what should a feedvalidator do?  Should it follow the specifications?  Should it follow Dare's recommendations?  Should it issue a warning if the specifications and Dare's recommendations differ for a given feed?


I hate to be pedantic, but it's called a feedvalidator for a reason.  Validity is determined within a frame of reference, i.e., with reference to the spec(s).  The specs Mark cites are relevant for character encoding, therefore they determine if a document is valid (with respect to character encoding). Or as my kids are tired of hearing me say, "If it's not right, it's wrong."

Taken from another angle, Dare's premise is that because most (web server) software does not support the spec, neither should other software:

The moral of this story is if you are writing an application that consumes XML using HTTP you should use the following rule of thumb for the forseeable [sic] future....

The rule of thumb cited is (paraphrasing, and emphasis mine) to treat text/?xml? mime-types as though they were application/?xml? mime-types, instead of following the specs.  There's a clue, however, in the phrase, "for the foreseeable future".  Which is more likely to occur in the future... the spec will change, or more software will be written/patched to conform to the spec?

Posted by Jason Clark at

Hi,

I believe the author of the XML MIME RFC and the W3C Technical Architecture Group (TAG) have recently discussed this problem and decided to fix RFC3023.

See for instance: TAG mailing list discussion from 2003-10-30.

Julian

Posted by Julian Reschke at

RE: HTML charset vs XML encoding

Julian brings up a point I forgot to mention in my post: many would argue that it is RFC 3023 that is broken. There is also the fact that to actually implement RFC 3023 you have to use an XML parser that gives you the option of overriding the encoding declared in the XML document and using us-ascii. I'm not even sure there are XML processors around that allow you to do this.

I was planning to look into whether it was even possible to do this with the popular XML parsers for various platforms in a follow-up blog post.

Message from Dare Obasanjo at


RE: HTML charset vs XML encoding

Sam,
I'd vote that the validator follow the specs. If it doesn't then there's no way to know that what they do is wrong since the validator will be the primary way people determine whether their feeds are OK or not.

Message from Dare Obasanjo at

Dare, anything is possible once you realize that the XML parser is a component of the system, and not THE system.  I remember a case dealing with BOMs and SOAP interop with Microsoft where the only solution I could come up with was to intercept and filter the stream of bytes that flowed into the parser, as Xerces doesn't support BOMs.  The definition of the XML prolog seems to have been explicitly designed to make it possible to create something that reads to the end of that one declaration, computes the correct information, and then inserts the rest into the stream.
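A minimal sketch of the kind of byte-stream filter described above, written in Python purely for illustration. The helper name `filter_prolog` is hypothetical and not part of any parser's API. It assumes the document is in an ASCII-compatible encoding (so the declaration can be read as raw bytes), that the declaration fits within the first `limit` bytes, and that buffering the whole document in memory is acceptable.

```python
import codecs
import io
import re

# Hypothetical helper for illustration only; not part of any XML parser's API.
_DECL = re.compile(rb'^<\?xml\s[^>]*\?>')
_ENC = re.compile(rb'''encoding\s*=\s*(["'])[A-Za-z][A-Za-z0-9._-]*\1''')
_VER = re.compile(rb'''version\s*=\s*(["'])[^"']*\1''')


def filter_prolog(stream, encoding, limit=1024):
    """Return a byte stream whose XML declaration (if any) names `encoding`.

    A leading UTF-8 BOM is dropped, the declaration is rewritten, and the
    rest of the document is passed through untouched.
    """
    head = stream.read(limit)
    if head.startswith(codecs.BOM_UTF8):
        head = head[len(codecs.BOM_UTF8):]

    match = _DECL.match(head)
    if match:
        decl = match.group(0)
        new_attr = b'encoding="' + encoding.encode("ascii") + b'"'
        if _ENC.search(decl):
            # Replace the existing encoding pseudo-attribute.
            decl = _ENC.sub(lambda m: new_attr, decl, count=1)
        else:
            # The encoding must come right after the version pseudo-attribute.
            decl = _VER.sub(lambda m: m.group(0) + b" " + new_attr, decl, count=1)
        head = decl + head[match.end():]

    # Hand the parser the rewritten prolog followed by the remaining bytes.
    return io.BytesIO(head + stream.read())
```

The filtered stream can then be handed to whatever parser is in use, so the parser itself never needs an "override the declared encoding" option.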

All that being said, I have my doubts that many people have done this.  In any case, here is a test case, adapted from one of the existing tests in the feedvalidator.  The question is: do you see curly quotes and a British pound symbol, or do you see something else?

Finally, I must say that I am troubled by any combination of recommendations that results in the validator doing one thing and aggregators doing another.  Recovery from errors is one thing, but any mangling of what is arguably a correct feed by a consumer has to be treated as a bug, in my opinion.

Posted by Sam Ruby at

Sam,
In your test I see the British pound symbol and two square boxes surrounding the word test in IE and when subscribed in RSS Bandit.

As for aggregators doing one thing and the validator doing another, I'd say that's up to you. However, any aggregator author that actually follows RFC 3023 and treats text/xml as us-ascii when the charset parameter isn't set is going to mangle a large percentage of feeds out there [or error completely at them if they're in the draconian camp].

RFC 3023 got it wrong here, the same way the part of the HTML spec that expects web servers to parse HTML documents looking for META tags before sending them to clients got it wrong as well. This is what happens when you write a spec without talking to implementers first.

Posted by Dare Obasanjo at

Dare, one could apply that same argument to the XML spec.

In any case, I also believe that the worst you can say about IIS and Apache is that they don't automatically detect and configure the character set.  It certainly is possible to configure both servers correctly.

Posted by Sam Ruby at

Warnings anyone?

Posted by Don Park at

Warnings ... now that's a pragmatic approach ;-)

Reminds me of problems I used to pull my hair out over. It was my experience 10+ years ago when writing single-source C++ that ran on multiple platforms (w16, w32, Mac, OS/2, 3+ flavours of Unix) that you had to ignore certain warnings. This was infuriating. Cranking up the warning level all the way and treating them as errors resulted in header files from every OS vendor that couldn't be compiled. Was I supposed to tell management "we can't ship a product on such and such platform -- or any platform"?

Here's a question/observation:  A validator is a convenience, a helpful reminder. Sam's combination of techniques to ensure that somebody hasn't (intentionally) provided bad markup, spell-checking, etc. is one of the most sophisticated I've seen. And his skilled demonstration of supporting virtually every known flavour shows that this is a data-scrub/transform problem. You get metacrap/datacrap everywhere. Wishing it into non-existence won't make dirty data go away.

But I know what markup is [not at the level of some of the folks on this thread, but certainly more than the 10,000+ people a day who start blogs, comment on them, etc.]

The fact that I can use Movable Type for my own blog and basically limit myself to blockquotes and plaintext for everything else means that in virtually all cases a validator would be useless. If it's not well-formed there's something seriously wrong with MT. If the server doesn't add the encoding attribute, it's a bug. The fact that it would take updates to 90%+ of all web servers across all major and minor OS platforms [and suffer a performance penalty] means that it will never happen.

Those that are interested (and skilled enough) to want to build their own tools take on an extra burden -- and for these folks the debate [if it's kept to that] is legitimate.

But this debate has shifted to the public domain given news coverage, etc. The first generation of coverage will follow 'the trend' without digging in because the reporters don't have a context or the ability to do the analysis.

As Dare pointed out on his own blog a few days ago, the work/debate here is biased away from the user experience.

Subsequent generations of coverage will move beyond the entertainment angle and will have the user/business experience in mind and start asking questions:

1. Of the N million feeds, how many are being produced by tools written by the author?

2. How many are being used by tools provided to them by Google et al?

3. What business purpose does Google's current position serve?

They'll produce charts. Charts with a single pie slice aren't very interesting.

Posted by phil at

Validator Warnings

The following is a re-print of a posting I made in Sam Ruby's blog: Warnings ... now that's a pragmatic approach ;-) Reminds me of problems I used to pull my hair out over. It was my experience 10+ years ago when writing single-source C++ that ran on multiple platforms (w16, w32, Mac, OS/2, 3+ flavours of Unix) that you had to ignore certain warnings. This was infuriating. Cranking up the warning level all the way and treating them as...

Trackback from Occasionally Connected at

Don Park wrote:

Warnings anyone?

Don, your feed is a perfect example of a feed for which a warning would be appropriate.  You have explicitly specified the encoding of utf-8; however, you let the HTTP Content-Type default to text/xml, which in turn implies us-ascii, which by the rules defined by RFC 3023 overrides your explicit declaration in the XML.  This is presumably not what you expected or desired.

This can be easily rectified by following the instructions here.
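The precedence rule discussed in this thread can be sketched in a few lines of Python. This is an illustrative reading of RFC 3023 as it is described here, not the feedvalidator's actual code. The function name `effective_encoding` is made up for the example. As a simplification, it only looks at the XML declaration for application/* types and ignores byte order marks and non-ASCII-compatible encodings such as UTF-16.

```python
import re
from email.message import Message

# Matches the encoding pseudo-attribute of an XML declaration at the start
# of the body. Assumption: the body uses an ASCII-compatible encoding.
_DECL_ENCODING = re.compile(
    rb'''^<\?xml\s[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def effective_encoding(content_type, body):
    """Return the encoding an RFC 3023 consumer would use (hypothetical helper)."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()
    charset = msg.get_param("charset")

    # An explicit charset parameter wins for every XML media type.
    if charset:
        return charset.lower()

    # text/xml and text/*+xml without a charset default to us-ascii,
    # regardless of what the XML declaration says.
    is_text_xml = media_type == "text/xml" or (
        media_type.startswith("text/") and media_type.endswith("+xml")
    )
    if is_text_xml:
        return "us-ascii"

    # application/xml and application/*+xml fall back to the document itself.
    match = _DECL_ENCODING.match(body)
    return match.group(1).decode("ascii").lower() if match else "utf-8"
```

Under this reading, a feed that declares utf-8 but is served as plain text/xml is treated as us-ascii, while the same bytes served as application/rss+xml keep their declared encoding. A validator could issue its warning whenever the two answers differ.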

Posted by Sam Ruby at

RE: HTML charset vs XML encoding

Sam,
  The W3C site you linked to shows the fundamental problem with RFC 3023's specification of how text/xml should be interpreted. Both IIS and Apache allow you to attach a charset to a file type. This breaks down if you have XML documents in multiple encodings on your server. In fact, the example in my blog that uses the W3Schools website is an instance of this.

The fact is RFC 3023 is broken and at least one of the authors admits this, as Julian Reschke pointed out in your comments yesterday.

Message from Dare Obasanjo at


Dare, all Don would need to do to be correct with respect to this RFC is to use the mime type of application/rss+xml.  This seems to be exactly what the message cited by Julian Reschke suggests: namely the deprecation of text/xml.

Posted by Sam Ruby at

RE: HTML charset vs XML encoding

Sam,
  True. However, the link you provided shows that both Apache and IIS conflate file extensions with MIME types, which is a gotcha but not a major one.

I'd suggest that the validator and the Atom spec frown upon using text/xml as the MIME type of feeds in general.

PS: It seems I'm banned from posting to your blog from the Web but not via the CommentAPI, so I can still post from RSS Bandit. A bug, intentional or a little bit of both?

Message from Dare Obasanjo at


Vote -> Warnings

Posted by Randy Charles Morin at

Dare, what you have noticed is that the API interface does not currently implement my "frequent flamer" throttle (which enforces a limit of three such posts per 72-hour period, and expires when 72 hours pass without a flame by that individual).

I really would like to get beyond statements like these.

Is that possible?

Posted by Sam Ruby at

Randy, are you aware that your feed uses the non-US ASCII character for trademark (™), along with the problematic and presumably soon to be deprecated text/xml content-type without specifying a character set?

In other words, once RFC 3023 (as currently written) is taken into consideration, your feed would not be considered well-formed XML and therefore would be flagged.

Do you know of any aggregators which have a problem with your feed?

I simply want to be sure that you would still vote for a warning after all of the above is taken under consideration.

Posted by Sam Ruby at

I am of the opinion that XML mime types are not and will never be used as intended by the relevant specs.  This is why I suggested using warnings instead of errors.

Posted by Don Park at

If we're gonna encourage mime-types ... let's not forget to protect against gzip decompression bombs

Yet another thing that a validator implementor needs to worry about ;-(
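One common way to guard against this, sketched here in Python with the standard `zlib` module, is to cap how many bytes the decompressor is allowed to produce. The function name `bounded_gunzip`, the exception `FeedTooLarge`, and the 1 MB default are assumptions chosen for illustration, not anything the feedvalidator is known to do.

```python
import gzip
import zlib


class FeedTooLarge(Exception):
    """Raised when a compressed feed expands beyond the allowed size."""


def bounded_gunzip(compressed, max_bytes=1_000_000):
    """Decompress gzip data, refusing to produce more than max_bytes."""
    # wbits = 16 + MAX_WBITS tells zlib to expect a gzip header and trailer.
    decomp = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)
    # Ask for at most one byte more than the limit; if we get it, the
    # payload is too big and we stop without inflating the rest.
    out = decomp.decompress(compressed, max_bytes + 1)
    if len(out) > max_bytes:
        raise FeedTooLarge("feed expands beyond %d bytes" % max_bytes)
    return out


# Usage: a small feed passes through, an oversized one is rejected early.
small = gzip.compress(b"<rss/>")
feed_bytes = bounded_gunzip(small)
```

The point of the design is that the limit applies to the decompressed size, since the compressed size on the wire says very little about how large the result will be.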

Posted by phil at

I am of the opinion that XML mime types are not and will never be used as intended by the relevant specs.  This is why I suggested using warnings instead of errors.

If it truly is hopeless, my thought is to not even issue a warning at all in such a situation as it would only be ignored.

Posted by Sam Ruby at

It's your call Sam but I feel that having nothing to ignore is different from having something to ignore.

Maybe I should return "How to talk like Yogi Berra in 21 Days" to the bookstore...

Posted by Don Park at

Sam, tell you what, if your FeedValidator or for that matter anybody's tells me my feed isn't valid, then I'll change it ;)

Posted by Randy Charles Morin at

Randy, technically your feed is not valid according to the current version of RFC 3023, but this can be easily rectified by following the instructions here.

Posted by Sam Ruby at

Yes, Sam, I know how to fix it, you don't have to be patronizing. Unfortunately, my Website is shared-hosted. The ISM doesn't work over FTP :) Not much I can do. I'm considering a dedicated box, but that's a lot of money to fix a bug that isn't reported by any of the feed validators.

Posted by Randy Charles Morin at

Interesting. "Peter" simply quoted Dare's comment from 15 February, which in my experience is usually 0.99 spam, though project-web.org seems like an odd domain to spam. Much more interesting, though, is the IP that the putative "Sam Ruby" used to comment. For the most part, I don't have any need to know the intimate details of your day-to-day activities, but since the commenter-IP is the only form of identity-verification you provide, and if you said you were going to make that large an IP-jump I've forgotten and can't find it now, I'm left wondering whether both of the comments above are suspect, or only one.

Posted by Phil Ringnalda at

I've now had an opportunity to look closer at the Apache log entries which led up to the posting by "Peter" and have concluded that it was, in fact, spam, and as such, I have removed both the posting and my reply.

The most damning piece of evidence was the fact that this particular comment was NOT posted by RSS Bandit.

Posted by Sam Ruby at

Really Funky Syndication

I've always wanted to claim to have written the funkiest RSS ever. Here's my attempt. I created a lot of rules that I had to follow to make this a valid attempt on my part. It had to be valid RSS 2.0 according to the feed validator. It had to be not...

Excerpt from iBLOGthere4iM at
