It’s just data

Don't throw charset out with the bathwater

Ian Hickson: I think it may be time to retire the Content-Type header, putting to sleep the myth that it is in any way authoritative, and instead have well-defined content-sniffing rules for Web content.

The reason why people can safely enter non-Latin-1 characters in my comments and have them presented properly to all consumers that have installed the appropriate fonts is that these pages specify charset=utf-8 in the content-type header.

Sniffing for the character encoding used is clearly not the answer.  Nor am I convinced that meta http-equiv is either.

Don’t throw charset out with the bathwater

But utf-8 is a good example of a “sniffable” encoding. It can be detected quite reliably: the probability that a random sequence of bytes is valid utf-8 is very low. So a browser can try to decode the text as utf-8. If it is valid, render it; otherwise, fall back to another likely encoding (e.g. the traditional encoding for the user’s locale).

This technique is not absolutely perfect, but is probably better than trusting charset settings which are often misconfigured.
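A minimal sketch of this try-utf-8-first approach in Python (windows-1252 stands in here as an assumed locale default; the real fallback would depend on the user's locale):

```python
def sniff_decode(raw: bytes, fallback: str = "windows-1252"):
    """Try UTF-8 first: a random byte sequence is very unlikely to be
    valid UTF-8, so a successful decode is strong evidence."""
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Fall back to a likely legacy encoding (an assumption
        # standing in for "the traditional encoding for the locale").
        return raw.decode(fallback), fallback
```

For example, the utf-8 bytes for “café” decode on the first branch, while the windows-1252 bytes (ending in a bare 0xE9) fail utf-8 validation and take the fallback.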

Posted by oefe at

Don’t throw charset out with the bathwater

Yes, utf-8 can be sniffed.

Now, consider how the charset parameter interacts with HTML forms.

If you look closely at my page, and the headers that accompany it, I declare it as utf-8, but I could just as easily have declared it iso-8859-1 or us-ascii for that matter.  Everything is seven-bit safe, with numeric character references for characters that can’t be directly expressed in us-ascii.
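To illustrate why the declaration is interchangeable (this is just an illustration of the technique, not the tooling actually used to produce the page), Python can make any text seven-bit safe with numeric character references:

```python
page_text = "naïve déjà vu"

# Characters outside us-ascii become numeric character references,
# so the result could honestly be declared us-ascii, iso-8859-1,
# or utf-8: every byte is seven-bit safe.
ascii_safe = page_text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(ascii_safe)  # na&#239;ve d&#233;j&#224; vu
```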

The reason why I declare my page as utf-8 is to influence the choice browsers will make when they send POST data to my application.  Note: I also specify utf-8 on accept-charset on the form tag, even though I know that that is basically ignored.

And then, yes, I do sniff the data I receive.  If I see a charset in the Content-Type header of the POST request, I assume that it is correct.  Otherwise, I default to utf-8, as that’s how my form is constructed, and I verify that the post data in question can actually be interpreted in that character set.  If it can’t, I go with HTML’s default of iso-8859-1, as extended by Microsoft into windows-1252, as that’s both what the HTML specs say and the most common default in my experience.
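The procedure described above could be sketched like this (a simplified reading of the steps, with naive charset-parameter parsing; the function name and signature are hypothetical):

```python
def decode_post(body: bytes, content_type=None) -> str:
    # 1. If the POST request carries an explicit charset parameter,
    #    trust it.
    if content_type and "charset=" in content_type:
        charset = content_type.split("charset=")[1].split(";")[0].strip()
        return body.decode(charset)
    # 2. Otherwise default to utf-8 (how the form page was served),
    #    verifying that the bytes really are valid utf-8.
    try:
        return body.decode("utf-8")
    except UnicodeDecodeError:
        # 3. Fall back to HTML's traditional default, iso-8859-1 as
        #    extended by Microsoft: windows-1252.
        return body.decode("windows-1252")
```

The key design point is step 2: utf-8 is not merely assumed, it is verified, which is what makes the fallback in step 3 safe to take.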

Posted by Sam Ruby at

Don’t throw charset out with the bathwater

I’m not really convinced that we need to throw Content-type away anyway. Sure, a lot of software gets it wrong, but a lot of software gets it right too. Not all user-agents are web browsers.

Posted by Martin Atkins at

Don’t throw charset out with the bathwater

Relevant to the discussion: TAG Finding 12: Authoritative Metadata

Posted by James Snell at

Don’t throw charset out with the bathwater

UTF-8 sniffing requires that you buffer up the bytes for sniffing before you initialize an encoding decoder and start parsing the decoded Unicode character stream. That’s bad for performance.

Posted by Henri Sivonen at

Don’t throw charset out with the bathwater

UTF-8 sniffing requires that you buffer up the bytes for sniffing

If you assume that the document is a mix, then approaches like Aristotle’s could be adjusted to work with streams.  At most you would only ever have to buffer four bytes at a time, and then only if each of these bytes has a value of >= 0x80.
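As a sketch of this streaming idea, Python’s incremental utf-8 decoder validates chunk by chunk, internally buffering only the trailing bytes of an incomplete multi-byte sequence (at most three) rather than the whole document:

```python
import codecs

def stream_is_utf8(chunks) -> bool:
    """Validate a byte stream as UTF-8 incrementally, without
    buffering the whole document up front."""
    dec = codecs.getincrementaldecoder("utf-8")()
    try:
        for chunk in chunks:
            dec.decode(chunk)
        dec.decode(b"", final=True)  # flush: reject a truncated tail
        return True
    except UnicodeDecodeError:
        return False
```

A multi-byte character split across chunk boundaries still validates, while an invalid or truncated sequence is rejected as soon as it can be ruled out.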

Posted by Sam Ruby at

User agents and Content-Type

In a recent article, Content-Type is dead, Ian Hickson takes stock of the HTTP Content-Type header [1] and of how it is actually used (or not) by user agents [2]. For those who do not understand English, the...

Excerpt from Sébastien Guillon at

Don’t throw charset out with the bathwater

FWIW, I agree that you don’t want to drop Content-Type completely. It just would become one of several inputs to the type detection algorithm, instead of the only one.

Posted by Ian Hickson at

Don’t throw charset out with the bathwater

I’m personally unhappy about sniffing.  I still don’t like the fact that all of the Feed Validator test cases (e.g.: atom, rss 2.0) now trigger “do you want to subscribe?” messages in Firefox.

But I digress.  My objection here was specific to charset.  On that narrower topic, I would be satisfied if the “one of several inputs” approach to charset detection reduces to something akin to “if the charset parameter specifies utf-8, and nothing in the first n bytes indicates otherwise, then we’ll go with that”.

Posted by Sam Ruby at
