Abstract
Require atom:id values to be canonical URIs, based primarily on the rules defined in RFC 2396bis.
Status
Open
Rationale
Existing software libraries vary in which normalization rules they apply when comparing URIs. To ensure predictable comparisons, canonicalization is required.
Proposal
In format draft -01, section 5.5 "atom:id" Element, replace the last paragraph with the following text.
-
atom:entry MUST contain exactly one atom:id element. The content of this element MUST be a URI canonicalized as follows:
-
Always provide the URI scheme in lowercase characters.
-
Always provide the host, if any, in lowercase characters.
-
Only perform percent-encoding where it is essential.
-
Always use uppercase A-through-F characters when percent-encoding.
-
Prevent dot-segments appearing in non-relative URI paths.
-
For schemes that define a default authority, use an empty authority if the default is desired.
-
For schemes that define an empty path to be equivalent to a path of "/", use "/".
-
For schemes that define a port, use an empty port if the default is desired.
-
Empty fragment identifiers must be preserved.
-
All portions of the URI must be UTF-8 encoded NFC form Unicode strings.
Examples
-
Always provide the URI scheme in lowercase characters.
valid: http://example.com/
invalid: HTTP://example.com/
-
Always provide the host, if any, in lowercase characters.
valid: http://example.com/
invalid: http://EXAMPLE.COM/
-
Only perform percent-encoding where it is essential.
valid: http://example.com/~jane
invalid: http://example.com/%7Ejane
-
Always use uppercase A-through-F characters when percent-encoding.
valid: http://example.com/?q=1%2F2
invalid: http://example.com/?q=1%2f2
-
Prevent dot-segments appearing in non-relative URI paths.
valid: http://example.com/a/b
invalid: http://example.com/a/./b
invalid: http://example.com/a/../a/b
-
For schemes that define a default authority, use an empty authority if the default is desired.
valid: http://user:password@example.com/
invalid: http://@example.com/
invalid: http://:@example.com/
-
For schemes that define an empty path to be equivalent to a path of "/", use "/".
valid: http://example.com/
invalid: http://example.com
-
For schemes that define a port, use an empty port if the default is desired.
valid: http://example.com:8080/
invalid: http://example.com:80/
-
Empty fragment identifiers must be preserved.
valid: http://www.w3.org/2000/01/rdf-schema#
-
All portions of the URI must be UTF-8 encoded NFC form Unicode strings.
valid: http://example.com/?q=%C3%87 (C-cedilla U+00C7)
valid: http://example.com/?q=%E2%85%A0 (Roman numeral one U+2160)
invalid: http://example.com/?q=%C7 (C-cedilla ISO-8859-1)
invalid: http://example.com/?q=C%CC%A7 (Latin capital letter C + Combining cedilla U+0327)
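The rules above could be checked mechanically along the following lines. This is only a rough sketch in Python, not part of the proposed text. The names canonicalize, is_canonical, DEFAULT_PORTS and the helper functions are hypothetical. The sketch assumes that only http and https have a default port and an empty path equivalent to "/", that an empty port means omitting the colon as well, and that a user-information part of "" or ":" counts as the default. It splits the URI with the generic regular expression from RFC 3986, Appendix B, so that an empty fragment can be told apart from a missing one.

```python
import re
import unicodedata

# Generic URI-splitting expression from RFC 3986, Appendix B.
URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
UNRESERVED = set("ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789-._~")
# Assumption: the only schemes this sketch knows defaults for.
DEFAULT_PORTS = {"http": "80", "https": "443"}


def _fix_escape(match):
    # Decode escapes of unreserved characters; uppercase the hex digits of the rest.
    char = chr(int(match.group(1), 16))
    return char if char in UNRESERVED else "%" + match.group(1).upper()


def _decode_utf8_run(match):
    # A run of escaped non-ASCII bytes must be valid UTF-8, else ValueError is raised.
    return bytes.fromhex(match.group(0).replace("%", "")).decode("utf-8")


def _normalize_component(text):
    if text is None:
        return None
    text = re.sub(r"%([0-9A-Fa-f]{2})", _fix_escape, text)
    text = re.sub(r"(?:%[89A-F][0-9A-F])+", _decode_utf8_run, text)
    text = unicodedata.normalize("NFC", text)
    # Re-escape every non-ASCII character as uppercase percent-encoded UTF-8.
    return "".join(
        c if ord(c) < 128 else "".join("%%%02X" % b for b in c.encode("utf-8"))
        for c in text
    )


def _remove_dot_segments(path):
    # Simplified dot-segment removal for paths that start with "/".
    out = []
    segments = path.split("/")
    for i, seg in enumerate(segments):
        if seg in (".", ".."):
            if seg == ".." and len(out) > 1:
                out.pop()
            if i == len(segments) - 1:
                out.append("")  # keep the trailing slash
        else:
            out.append(seg)
    return "/".join(out)


def canonicalize(uri):
    scheme, authority, path, query, fragment = URI_RE.match(uri).group(2, 4, 5, 7, 9)
    if scheme is None:
        raise ValueError("not an absolute URI: %r" % uri)
    scheme = scheme.lower()
    path = _normalize_component(path)
    if path.startswith("/"):
        path = _remove_dot_segments(path)
    result = scheme + ":"
    if authority is not None:
        userinfo, _, hostport = authority.rpartition("@")
        host, port = re.match(r"^(.*?)(?::(\d*))?$", hostport).groups()
        result += "//"
        if userinfo not in ("", ":"):
            result += _normalize_component(userinfo) + "@"
        result += host.lower()
        if port and port != DEFAULT_PORTS.get(scheme):
            result += ":" + port
        if path == "" and scheme in DEFAULT_PORTS:
            path = "/"
    result += path
    if query is not None:
        result += "?" + _normalize_component(query)
    if fragment is not None:  # an empty fragment is kept as a bare "#"
        result += "#" + _normalize_component(fragment)
    return result


def is_canonical(uri):
    # A URI is canonical if canonicalization leaves it unchanged.
    try:
        return canonicalize(uri) == uri
    except ValueError:
        return False
```

Under these assumptions, is_canonical accepts every "valid" example above and rejects every "invalid" one. A feed producer could call canonicalize before emitting an id, while a validator would only call is_canonical. IPv6 literals, percent-escapes inside the host, and scheme-specific rules for other schemes are not addressed by this sketch.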
Impacts
Few existing ids would be affected, but some feed-producing software may need to be modified to add canonicalization logic.