Ian Hixie: I think it may be time to retire the Content-Type
header, putting to sleep the myth that it is in any way
authoritative, and instead have well-defined content-sniffing rules
for Web content.
The reason why people can safely enter non-Latin-1 characters in
my comments and have them presented properly to all consumers that
have installed the appropriate fonts is that these pages specify
charset=utf-8 in the content-type header.
Sniffing for the character encoding used is clearly not the
answer. Nor am I convinced that meta http-equiv is either.
Don’t throw charset out with the bathwater
But utf-8 is a good example of a “sniffable” encoding. It can be detected quite reliably - the probability that a random sequence of bytes is valid utf-8 is very low. So a browser can try to decode the text as utf-8. If it is valid, render it. Otherwise, fall back to another likely encoding (e.g. the traditional encoding for the user’s locale).
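A minimal sketch of that try-utf-8-first idea, in Python. The function name `sniff_decode` and the choice of `cp1252` as the "traditional encoding for the user's locale" are assumptions made for illustration; a real browser would pick the fallback from the locale.

```python
from typing import Tuple


def sniff_decode(data: bytes, fallback: str = "cp1252") -> Tuple[str, str]:
    """Return (text, encoding_used), trying strict utf-8 first."""
    try:
        # Strict decoding raises on any byte sequence that is not valid utf-8.
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Assumed fallback: a legacy single-byte encoding. Bytes that the
        # fallback leaves undefined are replaced rather than raising.
        return data.decode(fallback, errors="replace"), fallback
```

Pure us-ascii input always takes the utf-8 branch, which is harmless because the two decodings agree on those bytes.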
This technique is not absolutely perfect, but is probably better than trusting charset settings which are often misconfigured.
If you look closely at my page, and the headers that accompany it, I declare it as utf-8, but I could just as easily have declared it iso-8859-1 or us-ascii for that matter. Everything is seven-bit safe, with numeric character references for characters that can’t be directly expressed in us-ascii.
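To illustrate the seven-bit-safe point: Python's standard `xmlcharrefreplace` error handler turns anything outside us-ascii into a numeric character reference, and the resulting bytes decode identically under all three charsets mentioned. The helper name `to_seven_bit` is made up for this sketch, and this is not necessarily how the page itself is generated.

```python
def to_seven_bit(text: str) -> bytes:
    """Encode text as us-ascii, using numeric character references
    (&#NNN;) for every character that us-ascii cannot express."""
    return text.encode("ascii", errors="xmlcharrefreplace")


page = to_seven_bit("café ☃")
# Every byte is below 0x80, so the declared charset no longer matters
# for decoding the markup itself.
same = {page.decode(cs) for cs in ("us-ascii", "iso-8859-1", "utf-8")}
```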
The reason why I declare my page as utf-8 is to influence the choice browsers will make when they send POST data to my application. Note: I also specify utf-8 on accept-charset on the form tag, even though I know that that is basically ignored.
And then, yes, I do sniff the data I receive. If I see a charset in the Content-Type header of the POST request, I assume that that is correct. Otherwise, I default to ‘utf-8’, as that’s how my form is constructed. Then I verify that the post data in question can be interpreted in that character set. Otherwise, I go with HTML’s default of iso-8859-1, as modified by Microsoft as win-1252, as that’s both what the HTML specs say and what is the most common default in my experience.
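Roughly, the steps above could look like the following Python sketch. The function names are hypothetical, it verifies a declared charset as well as the utf-8 default (the description could be read either way), and it uses Python's `cp1252` codec to stand in for win-1252.

```python
from email.message import Message
from typing import Optional, Tuple


def charset_from_content_type(header_value: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type value, if any."""
    if not header_value:
        return None
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()  # lowercased name, or None


def decode_post_body(body: bytes, content_type: Optional[str]) -> Tuple[str, str]:
    """Return (text, charset_used) for a POST body."""
    # 1. Trust a declared charset; otherwise assume the form's own utf-8.
    candidate = charset_from_content_type(content_type) or "utf-8"
    try:
        # 2. Verify that the data can actually be interpreted that way.
        return body.decode(candidate), candidate
    except (UnicodeDecodeError, LookupError):
        # 3. Otherwise fall back to iso-8859-1 as extended by win-1252.
        #    LookupError covers charset names Python does not know.
        return body.decode("cp1252", errors="replace"), "cp1252"
```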
I’m not really convinced that we need to throw Content-Type away anyway. Sure, a lot of software gets it wrong, but a lot of software gets it right too. Not all user-agents are web browsers.
UTF-8 sniffing requires that you buffer up the bytes for sniffing before you initialize an encoding decoder and start parsing the decoded Unicode character stream. That’s bad for performance.
“UTF-8 sniffing requires that you buffer up the bytes for sniffing”
If you assume that the document is a mix, then approaches like Aristotle’s could be adjusted to work with streams. At most you would only ever have to buffer four bytes at a time, and then only if each of these bytes has a value of >= 0x80.
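As a sketch of the streaming point (not Aristotle’s approach, which isn’t reproduced here), Python’s standard incremental utf-8 decoder already works this way: between chunks it holds on to only the trailing bytes of an unfinished multi-byte sequence. The function name `is_utf8_stream` is an assumption for illustration.

```python
import codecs
from typing import Iterable


def is_utf8_stream(chunks: Iterable[bytes]) -> bool:
    """Validate utf-8 chunk by chunk without buffering the whole input."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        for chunk in chunks:
            # Only an incomplete trailing sequence is carried over to the
            # next call; everything else is decoded and discarded.
            decoder.decode(chunk)
        # final=True makes a sequence cut off at end of input an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

A multi-byte character split across two chunks is accepted, while a stray high byte or a sequence truncated at the end of the stream is rejected.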
In a recent article, Content-Type is dead, Ian Hickson takes stock of the HTTP Content-Type header [1] and of how it is actually used (or not) by user agents [2]. For those who do not understand English, the...
FWIW, I agree that you don’t want to drop Content-Type completely. It just would become one of several inputs to the type detection algorithm, instead of the only one.
I’m personally unhappy about sniffing. I still don’t like the fact that all of the Feed Validator testcases (e.g.: atom, rss 2.0) now trigger “do you want to subscribe?” messages in Firefox.
But I digress. My objection here was specific to charset. On that narrower topic, I would be satisfied if the “one of several inputs” approach to charset detection reduces to something akin to “if the charset parameter specifies utf-8, and nothing in the first n bytes indicates otherwise, then we’ll go with that”.
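One way that rule might be sketched in Python, assuming “indicates otherwise” means “contains a byte sequence that is invalid utf-8”. The function name `accept_declared_utf8` is hypothetical, and the caller decides how many leading bytes (the n above) to pass in as `head`.

```python
import codecs
from typing import Optional


def accept_declared_utf8(declared: Optional[str], head: bytes) -> bool:
    """True if the charset parameter says utf-8 and the first bytes agree."""
    if (declared or "").strip().lower() not in ("utf-8", "utf8"):
        return False  # the rule only covers a declared utf-8
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        # final=False: a multi-byte sequence cut off at the n-byte boundary
        # is not counted as evidence against utf-8.
        decoder.decode(head, final=False)
    except UnicodeDecodeError:
        return False
    return True
```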