It’s just data

On Notice

Sean Lyndersay: We will only support feeds that are well-formed XML.

Gutsy, and welcome, move.

Question: how will IE7 deal with Priorities in the Presence of External Encoding Information?


I was thinking about the same thing. [link]

Posted by Anne van Kesteren at

IE7, Vista and RSS: ‘We will only support feeds that are well-formed XML’

In the context of discussing IE7 and Windows Vista, the Microsoft RSS team declares:

"We will only...

... [more]

Trackback from Alex Barnett blog at

What sorts of bugs occur when RFC3023 is ignored?

Posted by Robert Sayre at

What sort of bugs occur if you get the encoding wrong when external encoding information is present? Well, if you have a policy of only supporting feeds that are well-formed XML, you may reject valid feeds.

Related discussion:  Mark Pilgrim, Tim Bray

Posted by Sam Ruby at

I’m feeling a little dense. What would be a concrete example of this problem? Specifically, is it possible to make Expat incorrectly interpret an XML file by failing to load an external encoding? If so, what happens?

Posted by Robert Sayre at

Robert: perhaps a test-case would help.  The feedparser handles this feed correctly.  The feedvalidator picks up on the charset and only complains about some missing elements.

Now look what expat does if you separate the data from the charset:

>>> from xml.dom import minidom
>>> from urllib2 import urlopen
>>> url='http://feedparser.org/tests/wellformed/encoding/http_text_xml_charset_2.xml'
>>> minidom.parseString(urlopen(url).read())
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/minidom.py", line 1925, in parseString
    return expatbuilder.parseString(string)
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/expatbuilder.py", line 942, in parseString
    return builder.parseString(string)
  File "/usr/lib/python2.4/site-packages/_xmlplus/dom/expatbuilder.py", line 223, in parseString
    parser.Parse(string, True)
xml.parsers.expat.ExpatError: not well-formed (invalid token): line 13, column 23

For further background, see XML on the Web Has Failed (suggestion: search for "There is actually a good reason for this second set of rules.").

Note: those that serve feeds with any application/* MIME type are immune to this issue.
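For anyone who cannot fetch that test feed, the same failure mode can be sketched offline. This is a minimal illustration in current Python 3, not a reproduction of the session above: the payload, the KOI8-R charset, and the helper names are all assumptions made up for the example, and nothing here reflects the actual contents of the feedparser test file.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical payload: a Russian title ("Privet") encoded as KOI8-R, with no
# encoding declaration in the XML body. The charset is assumed to travel only
# in the HTTP Content-Type header.
TITLE = "\u041f\u0440\u0438\u0432\u0435\u0442"
BODY = ('<?xml version="1.0"?><feed><title>%s</title></feed>' % TITLE).encode("koi8-r")
HTTP_CHARSET = "koi8-r"  # assumed value of the charset parameter


def parse_ignoring_header(body):
    """Hand raw bytes to expat; with no declaration it assumes UTF-8."""
    return minidom.parseString(body)


def parse_honoring_header(body, charset):
    """Decode with the externally supplied charset, then parse the text."""
    return minidom.parseString(body.decode(charset))


def title_of(doc):
    """Return the text of the first <title> element."""
    return doc.getElementsByTagName("title")[0].firstChild.data


if __name__ == "__main__":
    try:
        parse_ignoring_header(BODY)
    except ExpatError as err:
        print("without the header:", err)
    print("with the header:", title_of(parse_honoring_header(BODY, HTTP_CHARSET)))
```

The bytes are identical in both calls; only the knowledge of the charset differs. That is the sense in which a "well-formed only" policy can reject a feed that is perfectly valid once the transport-level encoding is taken into account.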

Posted by Sam Ruby at

The general question is whether the parser looks at the charset parameter of the Content-Type HTTP header to get character encoding information.  The more specific question is whether the parser respects the rules described in RFC 3023 for determining the precedence order of different character encoding declarations (one in the HTTP header, one in the XML body) if they differ or if one is absent.  The precedence order depends on the main part of the Content-Type HTTP header (before the charset parameter).

If the parser respects the character encoding in the Content-Type HTTP header when it is present but (incorrectly) falls back to the character encoding defined in the XML body of a feed served with a Content-Type of “text/xml” when the character encoding is absent, you will end up accepting feeds you ought to reject.  This sounds like an edge case but it’s not, because this was the default configuration in Apache until recently (mime.types listed “text/xml” as the default type for static .xml files, and depending on your Apache configuration upgrading to the latest version may not fix it).  On the client side, virtually everyone gets this wrong, including Firefox 1.0.x (I haven’t tested 1.5 yet).

If the parser ignores the Content-Type HTTP header altogether, you will end up misinterpreting characters, i.e. corrupting data.  iTunes 4 and 5 did this; I haven’t tested iTunes 6 yet.
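The precedence rules can be sketched roughly as follows. This is a simplified reading of RFC 3023, not a complete implementation: the function name is hypothetical, BOM sniffing and UTF-16 detection are deliberately left out, and the declaration regex only works for ASCII-compatible encodings.

```python
import re

# Matches the encoding pseudo-attribute of an XML declaration at the very
# start of the body. Only valid for ASCII-compatible encodings (assumption).
_DECL = re.compile(rb"^<\?xml[^>]*?encoding\s*=\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']")

_TEXT_TYPES = ("text/xml", "text/xml-external-parsed-entity")
_APP_TYPES = (
    "application/xml",
    "application/xml-dtd",
    "application/xml-external-parsed-entity",
)


def rfc3023_encoding(content_type, body):
    """Return the encoding RFC 3023 assigns to an XML response (simplified).

    content_type is the raw Content-Type header value; body is the raw bytes.
    Returns None for media types that RFC 3023 does not cover.
    """
    parts = content_type.split(";")
    media_type = parts[0].strip().lower()
    charset = None
    for param in parts[1:]:
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            charset = value.strip().strip("\"'").lower() or None

    is_text = media_type in _TEXT_TYPES or (
        media_type.startswith("text/") and media_type.endswith("+xml"))
    is_app = media_type in _APP_TYPES or (
        media_type.startswith("application/") and media_type.endswith("+xml"))

    if not (is_text or is_app):
        return None
    if charset:
        # An explicit charset parameter wins for both families.
        return charset
    if is_text:
        # text/* with no charset: us-ascii, whatever the XML body declares.
        return "us-ascii"
    # application/* with no charset: fall back to the XML declaration,
    # then to the XML default of UTF-8 (BOM handling omitted here).
    match = _DECL.match(body)
    return match.group(1).decode("ascii").lower() if match else "utf-8"
```

For example, `rfc3023_encoding("text/xml", body)` yields `"us-ascii"` even when the body declares `encoding="utf-8"`, which is exactly the case where a parser that trusts the body ends up accepting feeds it ought to reject.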

Don’t even get me started about BOMs, UTF-16, or any of these issues.

Posted by Mark at

The irony of this post is that it shows unescaped tags in Bloglines using Safari 2.0.2.

See below.

Sean Lyndersay: We will only support feeds that are &lt;a href="http://www.w3.org/TR/REC-xml/#dt-wellformed"&gt;well-formed&lt;/a&gt; XML. Gutsy, and welcome, move. ...

Posted by willc2 at

A gutsy, but rather pointless move on Bloglines' part, treating the <description> in RSS 1.0 as the plain text it was originally specced to be.

Interesting that when I tried to get them to show me all your feeds, to find out which one was displayed that way, by just asking to subscribe to www.intertwingly.net/blog/, instead of autodiscovery they offered to let me join the 26 people who are pointlessly subscribed to the HTML page which has no items, and never will. I wonder why Bloglines wants to frustrate their users in that particular way.

I guess since they are also frustrating some probably rather large percentage of the 1300 people subscribed to your RSS 1.0 feed by giving them the unparsed and truncated HTML source from the <description> rather than the full item from <content:encoded>, even when their preferences explicitly say that they want full content, frustrating 26 people is pretty small potatoes.

Posted by Phil Ringnalda at

Why is it that when transcoding proxies come up, they are almost always mentioned by someone whose native tongue can be written using Windows-1252 and not by the alleged users of the said proxies? A thread that might be of interest. Personally, I consider transcoding proxies that tamper with UTF-8-encoded XML harmful if they exist. Every XML processor is required to support UTF-8, so there is no legitimate need to transcode.

Posted by Henri Sivonen at

Personally, I consider transcoding proxies that tamper with UTF-8-encoded XML harmful if they exist.

Care to take a stab at producing an RFC 3023bis?

Posted by Sam Ruby at

People are already updating RFC 3023. [link] Although I believe the process slowed down a bit after a particular W3C TAG resolution was not accepted by one of the editors.

Posted by Anne van Kesteren at

Anne, I don’t believe that that update captures Henri’s intent.

Posted by Sam Ruby at

Wellformed RSS and RFC 3023

... [more]

Trackback from Better Living Through Software at

Wellformed RSS and RFC 3023

We’ve announced that the RSS platform in Vista will permit only well-formed XML.  Most people are celebrating, but there are some comments that indicate some people may be confused. To clear things up, this statement is ONLY about well-formed...

Excerpt from Better Living Through Software at
