Character Encoding and HTML Forms

2004-04-15T19:47:16Z

Joe Gregorio: lacking any other indications, a browser will submit the data from a form using the same character encoding that the page is served in.

This mind-blowing statement was embedded in an otherwise interesting article on Atom and wikis. It has caused me to rethink how I serve pages on my weblog, and to begin the switch to utf-8. Here's why:

If you don't declare any encoding, the browser will do whatever it thinks is best, and it is up to the server to guess what has been sent.
If you declare iso-8859-1 (a common encoding covering most Western European languages), characters that cannot be encoded in that scheme will be silently converted to numeric character references by the browser. For forms expecting HTML or XML input, this is fine; otherwise it is a bit unexpected.
If you declare utf-8 in the HTTP headers (or equivalently in the HTML's meta http-equiv as Joe describes), you can be sure that the data you receive is the data that was sent.
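
To make the three cases above concrete, here is a rough Python sketch. It only approximates what a browser does: the helper names browser_submit and server_decode and the form field name "comment" are made up for illustration, and real browsers differ in the details. It relies on the standard library's xmlcharrefreplace error handler to mimic the numeric character reference behaviour.

```python
from urllib.parse import parse_qs, quote_plus


def browser_submit(text: str, page_encoding: str) -> str:
    """Approximate a form body for a hypothetical field named 'comment'.

    Characters the page encoding cannot represent become numeric
    character references, as described for iso-8859-1 above.
    """
    raw = text.encode(page_encoding, errors="xmlcharrefreplace")
    return "comment=" + quote_plus(raw)


def server_decode(body: str, assumed_encoding: str) -> str:
    """Decode the form body using whatever encoding the server assumes."""
    return parse_qs(body, encoding=assumed_encoding)["comment"][0]


# Page declared as iso-8859-1: the euro sign arrives as a character reference.
latin1_result = server_decode(browser_submit("5 €", "iso-8859-1"), "iso-8859-1")

# Page declared as utf-8: the data received is the data that was sent.
utf8_result = server_decode(browser_submit("5 €", "utf-8"), "utf-8")

# Server guesses wrong (utf-8 sent, iso-8859-1 assumed): the text is mangled.
wrong_guess = server_decode(browser_submit("café", "utf-8"), "iso-8859-1")
```

In this sketch latin1_result contains the reference &#8364; in place of the euro sign, utf8_result round-trips unchanged, and wrong_guess shows the familiar two-character garbage where the é used to be.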

Meanwhile, I've been receiving a lot of good input on my i18n survival guide; once the dust settles, this information will be factored in.