It’s just data

Unicode for Syndication Consumers

Torsten Rendelmann: Hey, partially good news: my local RSS Bandit beta build 109 no longer fails on Sam's test feed, if it is compiled with .NET 1.0.

Whether that is good news (or even news at all) is debatable; in any case, this should not be an accidental feature.  If this is to be pursued, here are a few more things to think about.


According to the XML 1.0 spec, an XML doc fetched via some transport defaults to the transport's encoding, NOT utf-8.  That means that an xml doc (like, say, a feed) that is fetched via http defaults to iso-8859-1

This is one of the areas where people get into trouble....

See:  [link]

specifically:

"In the absence of information provided by an external transport protocol (e.g. HTTP or MIME), it is a fatal error for an entity including an encoding declaration to be presented to the XML processor in an encoding other than that named in the declaration, or for an entity which begins with neither a Byte Order Mark nor an encoding declaration to use an encoding other than UTF-8. Note that since ASCII is a subset of UTF-8, ordinary ASCII entities do not strictly need an encoding declaration."

i.e., in the presence of a transport (http), you default to the encoding used by the transport....

Posted by James Robertson at

Damn.  I'm trying to figure out what order to write these things down, but everything is so damned intertwingled.  :-)

default to the encoding used by the transport....

This is only true if the encoding is specified by the transport.

Note: with HTTP and a MIME type of text/xml with an omitted charset, the default per RFC 3023 is us-ascii.  However, with a MIME type of application/xml with an omitted charset, the XML declaration is to be respected.

I haven't yet decided whether I want to update this document with a forward reference to something I have yet to write, or to place this information directly in here.  Either way, thanks for pointing this out.

Posted by Sam Ruby at
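The RFC 3023 defaults Sam describes can be sketched in a few lines of Python. This is an illustrative function, not any particular library's API, and it assumes the Content-Type header has already been split into a media type and an optional charset parameter:

```python
# Hypothetical sketch of RFC 3023 charset resolution for XML over HTTP.
# media_type and charset come from an already-parsed Content-Type header;
# all names here are illustrative.

def effective_encoding(media_type, charset, xml_decl_encoding):
    """Return the encoding an RFC 3023-conformant processor should use."""
    if charset:
        # An explicit charset parameter always wins, for any */xml type.
        return charset.lower()
    if media_type == "text/xml":
        # text/xml with charset omitted defaults to us-ascii (RFC 3023),
        # NOT iso-8859-1, and NOT whatever the XML declaration says.
        return "us-ascii"
    if media_type == "application/xml":
        # application/xml with charset omitted defers to the document:
        # BOM or encoding declaration, else utf-8 per XML 1.0.
        return (xml_decl_encoding or "utf-8").lower()
    return None  # other media types are out of scope for this sketch

print(effective_encoding("text/xml", None, "utf-8"))         # us-ascii
print(effective_encoding("application/xml", None, "utf-8"))  # utf-8
```

Note how the two branches diverge: the same document, byte for byte, is read differently depending on whether it arrives as text/xml or application/xml, which is the crux of the whole thread.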

James,

I don't see how you can come to the conclusion that it should default to iso-8859-1.

Please see [link] for arguments against it.

Posted by Morten Frederiksen at

Sam,

The wording in that quote is really strange (assuming we're talking about text/xml): if there's no specification, there's a default, which then "counts" as the "encoding used by the transport".

Posted by Morten Frederiksen at

if there's no specification, there's a default

If charset is omitted, there MAY be a default.  For text/xml, there is a default.  For application/xml, there is not.

Note that my read of RFC 3023 is that for an XML doc (like, say, a feed) that is fetched via HTTP with a MIME type of text/xml but with an omitted charset (unfortunately, a rather common combination), the default is us-ascii, not iso-8859-1.

Posted by Sam Ruby at

HTTP encoding:

[link]

specifically:

"The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter"

So - say I implement an aggregator.  Say I grab an xml doc via http, and

1) The http headers include no charset
2) The xml doc includes no encoding info

The default, based on this and my previous post is iso-8859-1.  Not utf-8, and not US-ASCII.  If you don't want your xml docs munged, always include the encoding explicitly....

Posted by James Robertson at

Chasing the HTML link back to the source (RFC 2616, section 3), we find:

The "charset" parameter is used with some media types to define the character set (section 3.4) of the data. When no explicit charset parameter is provided by the sender, media subtypes of the "text" type are defined to have a default charset value of "ISO-8859-1" when received via HTTP.

So, as I see it, my takeaway from all of this is that the recommendation should be to always include the encoding explicitly, both in the Content-Type HTTP header and in the document, AND to be sure that they both match.  If you find that you can't reliably provide the correct encoding in the Content-Type header, then choose the application/xml MIME type.

Posted by Sam Ruby at
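Sam's recommendation above is easy to sketch in code: state the encoding once, then emit it in both places so nothing downstream has to guess. This is an illustrative Python fragment, not any particular server's API:

```python
# A minimal sketch of the recommendation: declare the encoding in both the
# HTTP header and the document itself, and make sure they agree. The
# function and header construction here are illustrative.

ENCODING = "utf-8"

def build_feed_response(feed_xml_body):
    # One constant drives both the XML declaration and the charset
    # parameter, so they cannot drift out of sync.
    body = ('<?xml version="1.0" encoding="%s"?>\n' % ENCODING) + feed_xml_body
    headers = {
        # An application/*+xml type avoids the text/* default-charset
        # pitfalls, and the explicit charset leaves nothing to guess.
        "Content-Type": "application/atom+xml; charset=%s" % ENCODING,
    }
    return headers, body.encode(ENCODING)

headers, body = build_feed_response("<feed/>")
print(headers["Content-Type"])  # application/atom+xml; charset=utf-8
```

The design point is simply that the header and the declaration share a single source of truth; most real-world mismatches come from configuring them in two different places.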

Actually, RFC 3023 unambiguously defines the default encoding for text/xml over HTTP, too:

Conformant with [RFC2046], if a text/xml entity is received with  the charset parameter omitted, MIME processors and XML processors MUST use the default charset value of "us-ascii"[ASCII].  In cases where the XML MIME entity is transmitted via HTTP, the default charset value is still "us-ascii".

It's an utterly useless default, since James is far from the only person who doesn't believe that's the case, but there's no question about the intent of 3023: text/xml defaults to us-ascii.

Posted by Phil Ringnalda at

However, RFC 3023 distinctly states that it's in conflict with the HTTP spec - since both are operative, there's confusion.  It's as clear as mud, which is why people make different choices...

Posted by James Robertson at

Subtle differences

... [more]

Trackback from torsten's .NET blog

at

Sam,
  Since the specs are definitely broken here, I always use the encoding declared in the document and ignore the charset parameter in RSS Bandit. If there is no encoding declared in the document I use UTF-8.

  This has worked for the most part for me. I guess I could reparse with a bunch of different encodings until I got no exceptions but I don't.

Posted by Dare Obasanjo at
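The approach Dare describes can be sketched as follows. This is an illustrative fragment, not RSS Bandit's actual code, and it only handles the common ASCII-compatible case (a real parser would also check the byte order mark and handle 16- and 32-bit encoding families):

```python
import re

# Sketch of "use the document's declared encoding, ignore the transport
# charset, fall back to utf-8". The regex and function are illustrative.

XML_DECL = re.compile(rb'^<\?xml[^>]*encoding=["\']([A-Za-z0-9._-]+)["\']')

def sniff_document_encoding(raw_bytes):
    match = XML_DECL.match(raw_bytes)
    if match:
        # Use the encoding pseudo-attribute from the XML declaration.
        return match.group(1).decode("ascii").lower()
    # No declaration: XML 1.0's own default.
    return "utf-8"

print(sniff_document_encoding(b'<?xml version="1.0" encoding="iso-8859-1"?><rss/>'))
# iso-8859-1
print(sniff_document_encoding(b"<rss/>"))  # utf-8
```

Note that this is exactly the behavior RFC 3023 forbids for text/xml with an omitted charset, which is the disagreement being argued in this thread.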

I don't see how the specs are broken in any way.  RFC 2616 states that text/* content delivered over HTTP defaults to iso-8859-1.  RFC 3023 overrides one specific case of this (text/xml over HTTP) to default to us-ascii instead.  I see nothing broken about a later spec overriding part of an earlier spec, nor do I see anything ambiguous in how RFC 3023 is written.

XML validity depends on character encoding, character encoding depends on transport, and the rules for transporting text/xml over HTTP are defined in RFC 3023.  Dare has made it abundantly clear in the past that he disagrees with the choices made in RFC 3023, but the only "breakage" I see here is the breakage which Dare is manufacturing by ignoring particular specs which he finds inconvenient.

Posted by Mark at

Mark,
  RFC 3023 is broken because it ignores practice in the XML world and this has even been noted by the very authors of the spec who've expressed that they'd like to update it. If RSS Bandit actually followed RFC 3023 then we'd cause our users to have difficulties with a large percentage of the feeds they read since lots of them are served with text/xml MIME types but aren't encoded in us-ascii. 

Specs are not the perfect and irrevocable Word of God, set in stone. Many of them are ambiguous, contradictory and in some cases infeasible to implement.

Posted by Dare Obasanjo at

None of these issues are easy.  Things I would like to note:

To make the point even more clear, consider this feed.  This time, the feed is compliant with the spec.  Furthermore, it does not rely on any default (ambiguous or otherwise).  It is apparently accepted by RSS Bandit when compiled with .NET Framework version 1.0.  Should such a feed be rejected by RSS Bandit when compiled with .NET Framework version 1.1?

Posted by Sam Ruby at

Specs are not the perfect and irrevocable Word of God, set in stone. Many of them are ambiguous, contradictory and in some cases infeasible to implement.

Agreed, but RFC 3023 is neither ambiguous nor contradictory.  Nor is it infeasible to implement, since I've implemented it in my Universal Feed Parser.  It treats any feed served as text/xml with no HTTP charset as us-ascii, in compliance with RFC 3023 (which, as Sam correctly points out, is now explicitly cited in the latest revision of the XML specification itself).  If I fail to parse the feed with a real XML parser in the specified charset, for any reason (character encoding or any other wellformedness issue), the "bozo" bit is set to 1 in the results, indicating that the feed author is an incompetent bozo who can't create a well-formed feed.  (Thanks to Tim Bray for suggesting this terminology.)

As is well-known, my feed parser goes further and falls back to a number of hacks to try to parse the feed anyway.  But it very clearly indicates to the calling application whether the feed was well-formed or not, via the bozo bit and associated bozo_exception, and it is up to the calling application to decide whether to enforce XML's draconian error handling.

You, on the other hand, have apparently decided that XML wellformedness means "XML wellformedness, except for the specs I don't like", and that "draconian error handling" means "draconian error handling, unless it would actually inconvenience anyone".

Posted by Mark at
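The "bozo bit" pattern Mark describes — parse strictly first, record any failure, and let the calling application decide what to do — can be sketched like this. The result structure and function are illustrative, not the Universal Feed Parser's actual API:

```python
import xml.parsers.expat

# Sketch of the strict-parse-then-flag pattern: attempt a real XML parse,
# and on any wellformedness failure set a "bozo" flag and keep the
# exception, rather than silently guessing.

def parse_feed(raw_bytes):
    result = {"bozo": 0, "bozo_exception": None}
    parser = xml.parsers.expat.ParserCreate()
    try:
        parser.Parse(raw_bytes, True)  # True = this is the final chunk
    except xml.parsers.expat.ExpatError as exc:
        # Not well-formed: flag it, but let the caller choose whether to
        # enforce draconian error handling or fall back to looser parsing.
        result["bozo"] = 1
        result["bozo_exception"] = exc
    return result

print(parse_feed(b'<?xml version="1.0"?><rss/>')["bozo"])  # 0
print(parse_feed(b"<rss>not well-formed")["bozo"])         # 1
```

The key design choice is that strictness and recovery are separated: the parser reports the truth, and policy lives in the application.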

RE: Unicode for Syndication Consumers

Mark,
  I wear my biases on my sleeve. I think XML well-formedness is important because I believe developers should be able to use off-the-shelf XML tools to process RSS. There are practical reasons why I think XML well-formedness is important, I'm not religious about specs. There are also practical reasons why I think taking RFC 3023 as anything but a buggy, poorly thought out idea is not feasible.

The issue with the character checking in the XML parser used by RSS Bandit is that we shipped the XML parser in the .NET Framework configured not to be conformant by default. And as I've said before, this won't be the case in the next version of the .NET Framework. My main problem is that, as Torsten complained, I'm loath to have code in RSS Bandit that depends on which version of the .NET Framework is loaded.

Message from Dare Obasanjo at


There is no doubt that RFC 3023 could use a revision.  Basically, you usually shouldn't serve XML with a charset, since an XML processor has a better chance of getting it right at the receiving end than the server does of guessing right at the transmitting end.  This means that you probably shouldn't serve XML as text/xml, since you have to provide a charset or a compliant receiver is forced to assume it's ASCII, which is probably wrong.  But there are also other good reasons for not using text/xml, among them that intermediaries are allowed to "transcode" anything in text/* - this actually happens in Japan I'm told, e.g. from EUC to Shift-JIS - and you just can't transcode XML without being a full-fledged XML parser, which most transcoders aren't.  All this generally sucks.  RFC 3470 has some smart things to say.  Furthermore, serving things as */xml is kind of broken, since normally things that are thus served are actually XHTML or SVG or RSS or whatever, all of which have perfectly good media types beginning in application/ that should be used instead.  Did I mention that this all generally sucks?

Posted by Tim Bray at
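Part of why the receiving XML processor can often do better than the server is that the first few bytes of the document itself reveal the encoding family, per Appendix F of the XML 1.0 spec. A minimal sketch, covering only the common byte-order-mark cases (the full algorithm also recognizes UTF-32 BOMs and BOM-less patterns from the bytes of `<?xml`):

```python
# Illustrative BOM sniffing, per the common cases of XML 1.0 Appendix F.
# Order matters only among the cases listed; UTF-32 is deliberately omitted.

BOMS = [
    (b"\xef\xbb\xbf", "utf-8"),
    (b"\xff\xfe", "utf-16-le"),
    (b"\xfe\xff", "utf-16-be"),
]

def sniff_bom(raw_bytes):
    for bom, encoding in BOMS:
        if raw_bytes.startswith(bom):
            return encoding
    return None  # no BOM; fall back to the encoding declaration

print(sniff_bom(b"\xef\xbb\xbf<?xml version='1.0'?>"))  # utf-8
print(sniff_bom(b"<?xml version='1.0'?>"))              # None
```

The server, by contrast, typically knows nothing about the file's contents and is guessing from configuration.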

Basically, you usually shouldn't serve XML with a charset, since an XML processor has a better chance of getting it right at the receiving end than the server does of guessing right at the transmitting end.

That statement puzzles me.

My server inserts both of the following lines in the output for my atom feed:

Content-Type: application/atom+xml; charset=utf-8

<?xml version="1.0" encoding="utf-8"?>
I guess that those statements could be called a "guess", but they are consistent with how I have my server configured.  And it is exactly this data which enable the XML processor on the receiving end to get it right.

Posted by Sam Ruby at

Sam,
  I assume Tim means that your web server has to be explicitly configured to support various XML types and their encodings, otherwise it guesses. Even then there is the fact that the major web servers (both Apache and IIS) conflate file extensions and MIME types, which seems fairly broken to me, but what do I know.

If I just have a file called rss.xml on my hard drive, it is quite likely that IIS or Apache will serve it with the wrong MIME type and the wrong encoding by default. I have both installed on my machine and I've tried this just to test this assumption. However it is more likely that an XML parser would guess the right encoding for my rss.xml file than the web server.

Posted by Dare Obasanjo at

serving things as */xml is kind of broken since normally things that are thus served are actually XHTML or SVG or RSS or whatever, all of which have perfectly good media types beginning in application/ that should be used instead.

RSS has no registered media type.  There was an Internet Draft to register application/rss+xml, but it expired over two years ago.

[link]

Posted by Mark at

How to write an aggregator [1]

This is a list of resources which are useful when building an RSS/Atom aggregator. I found them useful when building FeedThing, maybe other people will too. Expect this list to grow. Things That FeedThing Does Correctly Specs and Things HTTP Primer...

Excerpt from Gareth Simpson at

application/rss+xml vs. text/xml

I’ve been working on some feed support in MSDN’s new online platform (a beta of which is running [link]) and I had to decide what content-type to use when outputting a RSS feed. I knew this was a contentious issue in the past,...

Excerpt from Code/Tea/Etc. at


Add your comment