[MartinDürst] Make sure that non-ASCII characters can be used in what would otherwise be URIs.




Users will expect to be able to use Internationalized Domain Names (IDNs) in URIs, and will then also want to use non-ASCII characters in other parts of URIs. Now that the IRI spec has been approved by the IESG, this can easily be done by making all the URI elements/attributes IRIs. Also, because Atom uses XML, such IRIs will always be transferable.


The following edits are needed in the format spec (based on -03.txt draft):

  1. Replace all occurrences of "URI" (in upper case) with "IRI" (including things such as "relative URI reference", "URIs",...).

  2. Do *not* replace element/attribute names that read 'uri'. This will lead to somewhat strange sentences such as "The content of atom:uri in a Person construct MUST be an IRI." While this makes the spec somewhat strange to read in a few places, it will work out for users.

  3. Replace all occurrences of "[RFC2396]" in the text (three times just after "URI"), but not the one in the References, by [RFCYYYY].

  4. Replace the reference to [RFC2396] with a reference to RFC2396bis, as follows: [RFCXXXX] Berners-Lee, T., Fielding, R. and L. Masinter, "Uniform Resource Identifier (URI): Generic Syntax (Note to the RFC Editor: Please update this reference with the RFC resulting from draft-fielding-uri-rfc2396bis-xx.txt, and remove this Note)", draft-fielding-uri-rfc2396bis-07 (work in progress), September 2004.

  5. Add a normative reference, as follows: [RFCYYYY] Duerst, M., and Suignard, M., "Internationalized Resource Identifiers (IRIs) (Note to the RFC Editor: Please update this reference with the RFC resulting from draft-duerst-iri-xx.txt, and remove this Note)", draft-duerst-iri-11 (work in progress), November 2004.

  6. Change "[[ discussion of URI escaping and i18n ]]" to something like the following: "Atom allows the use of IRIs [RFCYYYY], rather than only URIs [RFCXXXX]. For resolution, IRIs can easily be converted to URIs. When comparing IRIs serving as Identity Constructs, they MUST NOT be converted to URIs. Please note that by definition, every URI is an IRI, so any URI can be used where an IRI is needed."

  7. In section 3.6.1 (Dereferencing Identity Constructs), replace the bullet point "Ensure that all portions of the URI are utf-8 encoded NFC form Unicode strings." with something like "Ensure that all components of the IRI are appropriately character-normalized, e.g. by using NFC or NFKC."

  8. Maybe add IRI examples in the section on Comparing Identity Constructs. But it's much better to have such examples somewhere else, because they need special notation in IETF specs that are, as of now, still limited to US-ASCII only. And this special notation usually leads to more confusion than benefit.
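The character normalization mentioned in edit 7 can be illustrated with a minimal Python sketch. The function name `normalize_iri` and the example IRIs are made up for illustration; only `unicodedata.normalize` from the standard library is real. It shows that two IRIs which look the same on screen can differ code point by code point until they are normalized.

```python
import unicodedata


def normalize_iri(iri: str, form: str = "NFC") -> str:
    """Return the IRI with its characters normalized (hypothetical helper).

    `form` may be "NFC" or "NFKC", the two forms named in edit 7.
    """
    return unicodedata.normalize(form, iri)


# Hypothetical IRIs: the same visible text, once with "e" followed by a
# combining acute accent (U+0301), once with precomposed U+00E9.
decomposed = "http://example.org/re\u0301sume\u0301"
precomposed = "http://example.org/r\u00e9sum\u00e9"

print(decomposed == precomposed)                 # compared as-is
print(normalize_iri(decomposed) == precomposed)  # compared after NFC
```

This only covers the normalization step; whether normalization should be applied at all when comparing Identity Constructs is a question for the spec text, not for this sketch.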


For users of the Atom format, this will allow Internationalized Domain Names (IDNs) and path/document names with non-ASCII characters to be put directly into the href attribute and other, similar attributes, rather than having to encode them as punycode (for IDNs) and %-escapes (for the rest).

For implementers, it will mean that whenever any of the above attributes is dereferenced (i.e. a link followed), the following is needed:

  1. If not using UTF-8 internally (e.g. UTF-16), then convert the attribute value to UTF-8. From then on, make sure your code is 8-bit clean.

  2. Before handing it over to your URI resolver, escape non-ASCII characters with %HH.

  3. Before resolving the domain name, revert the %-encoding, and hand over the domain name to an IDN library (various open-source implementations are already available) for conversion to punycode.

Step 1 is about one call to a library function that you will have around anyway if you do XML. Step 2 is a few lines of code. Step 3 is a few lines of code plus something like 200k of library code, mostly due to the tables needed for nameprep. In terms of implementation effort, this should be on the order of a few minutes to a few hours, depending on coding skills. In terms of footprint, the IDN library can be a problem on, e.g., mobile phones, but should not be an issue on bigger systems.
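As a rough illustration of the three steps, here is a minimal Python sketch. The function names `escape_non_ascii` and `host_to_punycode` and the example IRI are invented for this page. Python's built-in `idna` codec (which implements the nameprep-based IDNA of RFC 3490) stands in for "an IDN library"; a real implementation might well use something else.

```python
from urllib.parse import unquote, urlsplit


def escape_non_ascii(iri: str) -> str:
    """Steps 1 and 2: encode as UTF-8, then %HH-escape every non-ASCII byte."""
    utf8 = iri.encode("utf-8")  # step 1: work on the UTF-8 bytes
    # Step 2: ASCII bytes pass through unchanged, all others become %HH.
    return "".join(chr(b) if b < 0x80 else f"%{b:02X}" for b in utf8)


def host_to_punycode(uri: str) -> str:
    """Step 3: undo the %-encoding in the host and convert it to punycode."""
    host = urlsplit(uri).hostname or ""
    # unquote() decodes the %HH escapes as UTF-8; the "idna" codec then
    # applies nameprep and punycode label by label.
    return unquote(host).encode("idna").decode("ascii")


# Hypothetical IRI, written with \u escapes so this source stays ASCII.
iri = "http://r\u00e9sum\u00e9.example.org/caf\u00e9"
uri = escape_non_ascii(iri)
print(uri)                    # what would be handed to the URI resolver
print(host_to_punycode(uri))  # what would be handed to DNS
```

The sketch deliberately leaves existing %-escapes and all ASCII characters alone, since the steps above only call for escaping non-ASCII characters. It also ignores userinfo and port handling, which a real resolver would need to preserve.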

Some more details:

  - XML Base: please note that XML Base is already written so that applying a base and converting IRIs to URIs are commutative; see


See also PaceUriOrItsSuccessor.