[MarkPilgrim] Clarify the rules for determining well-formedness of an Atom feed served over HTTP, covering issues of character encoding and MIME types. Also specify client requirements for handling non-well-formed feeds.
RFC 3023 defines rules for determining the character encoding of a feed (or any other XML document served over HTTP). The default configuration for most web servers is to serve ".xml" files as "text/xml" with no charset parameter. According to RFC 3023, all of these feeds MUST be parsed as "us-ascii". This leaves UnprivilegedUsers without a way to publish Atom feeds in any encoding other than us-ascii. Furthermore, the Atom-enabled applications that the UnprivilegedUsers are running may not even have enough privileges to determine that they are, in fact, unprivileged in this way.
Therefore, this proposal recommends that all Atom-enabled publishers "assume the worst" and only emit ASCII-compatible XML.
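As a rough illustration of what "assume the worst" could look like in a publishing tool, the Python sketch below escapes markup characters and then replaces every non-US-ASCII character with a numeric character reference. The function name to_ascii_xml_text is hypothetical and not part of this proposal; the sketch covers only element text content, not attribute values or the rest of the serialization.

```python
from xml.sax.saxutils import escape


def to_ascii_xml_text(text: str) -> str:
    """Return `text` as ASCII-only XML character data (hypothetical helper).

    Markup characters (&, <, >) are escaped first, then every character
    outside US-ASCII becomes a decimal numeric character reference.
    """
    escaped = escape(text)
    # "xmlcharrefreplace" turns each unencodable character into &#NNN;
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_ascii_xml_text("Smørrebrød & co"))  # prints: Sm&#248;rrebr&#248;d &amp; co
```

Output produced this way is byte-for-byte identical whether it is later read as us-ascii, utf-8, or iso-8859-1, which is why it survives a "text/xml" server default.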
Insert as section 6:
6. Client processing requirements
Atom feeds served over HTTP MUST be well-formed XML 1.0, as defined in Section 2.1 of the XML specification <http://www.w3.org/TR/REC-xml/#sec-well-formed>. Furthermore, the concept of XML well-formedness relies on first determining the character encoding of the XML document. RFC 3023 defines how to determine the character encoding of XML documents served over HTTP.
6.1 Determining the character encoding of an Atom feed
The rules for determining the character encoding of an Atom feed are the same as those for determining the character encoding of any XML document served over HTTP. The rules are wholly defined by RFC 3023, but they are summarized here because there has been widespread confusion over how RFC 3023 should be interpreted:
When serving an Atom feed, it is RECOMMENDED that publishers include the charset parameter along with the media type in the Content-Type HTTP header. If the charset parameter is present, clients MUST parse the Atom feed in that charset, ignoring any charset declared in the encoding attribute of the XML declaration.
Publishers SHOULD serve all Atom feeds with the media type "application/atom+xml" (registered in Section 8 of this document). Clients MUST treat "application/atom+xml" as "application/xml" and determine the character encoding as per RFC 3023 or its successor.
If a publisher wishes to serve an Atom feed over HTTP, but for some reason they are unable to use the "application/atom+xml" media type, the publisher SHOULD use "application/xml", and clients MUST determine the character encoding as per RFC 3023 or its successor.
If a publisher is unable to serve their Atom feed with a Content-Type of "application/atom+xml" or "application/xml", they MAY use "text/xml". According to RFC 3023, XML documents served as "text/xml" with no charset parameter have a character encoding of "us-ascii".
When serving an Atom feed as "text/xml", publishers MUST escape all non-US-ASCII characters as character references. For example, '&#248;' for the character 'ø'.
When retrieving an Atom feed served with a Content-Type of "text/xml" and no charset parameter, clients MUST parse it with a "us-ascii" encoding. If such a feed contains non-US-ASCII characters, clients MUST reject it as non-well-formed.
Publishers MUST NOT serve Atom feeds with a media type other than "application/atom+xml" (registered in Section 8 of this document) or one of the XML media types defined in RFC 3023 or its successor. In particular, "text/plain" is never an appropriate media type for an Atom feed. When retrieving an Atom feed served with a non-XML media type, clients MUST reject it as non-well-formed.
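The rules summarized above can be sketched as a small client-side decision function. This is a simplified reading, not a complete RFC 3023 implementation: it recognizes only "text/xml", "application/xml", and "application/*+xml" types, and the names encoding_from_content_type and NotXmlMediaType are hypothetical. A return value of None is this sketch's way of saying "no HTTP-level override; fall back to the XML document's own byte order mark or encoding declaration".

```python
from email.message import Message
from typing import Optional


class NotXmlMediaType(ValueError):
    """Raised when the feed was served with a non-XML media type."""


def encoding_from_content_type(content_type: str) -> Optional[str]:
    """Map a Content-Type header value to the encoding a client should use."""
    msg = Message()
    msg["Content-Type"] = content_type
    media_type = msg.get_content_type()   # lowercased "type/subtype"
    charset = msg.get_param("charset")    # None when the parameter is absent

    is_text_xml = media_type == "text/xml"
    is_app_xml = media_type == "application/xml" or (
        media_type.startswith("application/") and media_type.endswith("+xml")
    )
    if not (is_text_xml or is_app_xml):
        # For example "text/plain": the client rejects the feed outright.
        raise NotXmlMediaType(media_type)

    if isinstance(charset, str) and charset:
        # An explicit charset parameter wins over the XML declaration.
        return charset.lower()
    if is_text_xml:
        # "text/xml" with no charset parameter is us-ascii.
        return "us-ascii"
    # application/xml family with no charset: use the document's own rules.
    return None
```

Under these assumptions, "text/xml" yields "us-ascii", "application/atom+xml" yields None, and "text/plain" raises NotXmlMediaType.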
6.2 Handling well-formedness errors
After determining the character encoding by the rules in section 6.1 of this document, clients MUST use a conforming XML parser to parse an Atom feed. In particular, clients MUST stop processing at the first well-formedness error, although they MAY display any information they have parsed before the first well-formedness error.
Here is a non-comprehensive list of things clients have been known to do after encountering a well-formedness error, which this document specifically prohibits:
Clients MUST NOT reparse the feed in any other character encoding.
Clients MUST NOT "tidy" the feed to attempt to fix mismatched start and end tags.
Clients MUST NOT guess at the meaning of undefined entities, including entities defined in the HTML specification.
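A minimal sketch of a client that honours these prohibitions, using Python's standard xml.etree.ElementTree (an expat-based parser that stops at the first error). The names parse_feed_strictly and NotWellFormed are hypothetical. The encoding argument is assumed to come from the section 6.1 rules, with None meaning that the document's own declaration applies. Passing already-decoded text to the parser is how this sketch makes the HTTP-level encoding take precedence over the XML declaration.

```python
import xml.etree.ElementTree as ET
from typing import Optional


class NotWellFormed(Exception):
    """The feed failed to decode or parse; no recovery is attempted."""


def parse_feed_strictly(body: bytes, encoding: Optional[str] = None) -> ET.Element:
    """Parse `body` once, in exactly one encoding, with no repair attempts."""
    try:
        if encoding is None:
            # Let the XML parser apply the document's own encoding rules.
            return ET.fromstring(body)
        # Strict decoding: a stray byte is an error, never a reason to
        # retry in some other character encoding.
        return ET.fromstring(body.decode(encoding, errors="strict"))
    except (UnicodeDecodeError, ET.ParseError) as exc:
        raise NotWellFormed(str(exc)) from exc


# A properly escaped feed parses; the reference resolves to the character.
ok = parse_feed_strictly(b"<feed><title>Sm&#248;r</title></feed>", "us-ascii")
print(ok.find("title").text)  # prints: Smør

# Each of these is rejected instead of being reparsed, tidied, or guessed at.
bad_inputs = [
    b"<feed><title>Sm\xf8r</title></feed>",  # raw non-US-ASCII byte
    b"<feed><title>oops</feed>",             # mismatched start and end tags
    b"<feed><title>&nbsp;</title></feed>",   # undefined (HTML) entity
]
for bad in bad_inputs:
    try:
        parse_feed_strictly(bad, "us-ascii")
    except NotWellFormed:
        print("rejected")
```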
This proposal has significant impact for both publishers and clients. Publishers must be aware of their web server configuration and ensure that Atom feeds are served with the appropriate media type, or, if that is not possible, that all non-US-ASCII characters are properly escaped. Clients must ensure that they properly implement RFC 3023, which few tools currently support.
This is great advice, but it restates a number of requirements in other specifications that are referred to by Atom. That's not good specification practice, as it very often results in conflicts of interpretation. Furthermore, this places a large number of requirements on "clients" that are difficult to test and constrain their behaviours in frankly unrealistic ways.
I strongly suggest this be moved to a primer, another non-normative document, or to WG-external supporting documentation.