Joe Gregorio: lacking any other indications, a browser will
submit the data from a form using the same character encoding that
the page is served in.
This mind-blowing statement was embedded in an otherwise
interesting article on Atom and wikis. It has caused me to
rethink how I serve pages on my weblog, and to begin
the switch to utf-8. Here's why:
If you don't declare any encoding, the browser will do whatever
it thinks is best, and it is up to the server to guess what has
been sent.
If you declare iso-8859-1 (a common encoding covering western
Europe and Latin countries), things not encodable in that scheme
will be silently converted to numeric character references by the
browser. For forms expecting HTML or XML input, this is fine;
otherwise it is a bit unexpected.
If you declare utf-8 in the HTTP headers (or equivalently in
the HTML's meta http-equiv as Joe describes), you can be sure that
the data you receive is the data that was sent.
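A minimal sketch of the iso-8859-1 and utf-8 cases above, using Python's codecs to stand in for the browser. The sample text and variable names are made up for illustration, and `xmlcharrefreplace` only approximates what a browser does when it falls back to numeric character references.

```python
# Text a visitor might type into a form: an accented letter, an en dash,
# a euro sign, and curly quotes. (Hypothetical sample input.)
text = "caf\u00e9 \u2013 \u20ac5 \u201cquoted\u201d"

# Page declared as iso-8859-1: characters outside that charset cannot be
# encoded directly, so they are replaced by numeric character references.
latin1_bytes = text.encode("iso-8859-1", errors="xmlcharrefreplace")
latin1_roundtrip = latin1_bytes.decode("iso-8859-1")

# Page declared as utf-8: every character is encodable, nothing is rewritten.
utf8_bytes = text.encode("utf-8")
utf8_roundtrip = utf8_bytes.decode("utf-8")

print(latin1_roundtrip)        # the dash, euro sign and quotes arrive as &#...; references
print(utf8_roundtrip == text)  # the utf-8 round trip is lossless
```

The server receiving the iso-8859-1 submission cannot tell whether the visitor typed a curly quote or literally typed the reference, which is the "bit unexpected" part.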
Meanwhile, I've been receiving a lot of good input on my
i18n survival guide; once the dust settles, this information will be
factored in.
The best way to control how the web browser will send back data is to use the accept-charset attribute on the <form> element. Without that attribute, all kinds of weird things can happen (e.g., if the user forces the browser to use a non-default character encoding to display the page, the form might get submitted in that encoding).
Another nice thing is that if you set accept-charset="UTF-8", Internet Explorer will send back curly quotes with the correct Unicode values rather than as latin1 control characters (which it will do if you ask for ISO-8859-1).
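To make the "latin1 control characters" concrete, here is a small sketch. It assumes the single bytes IE sends are Windows-1252, which is consistent with the behavior described here but is not something this page confirms.

```python
import unicodedata

left_quote = "\u201c"  # LEFT DOUBLE QUOTATION MARK

# Assumed IE behavior when asked for ISO-8859-1: one Windows-1252 byte.
cp1252_byte = left_quote.encode("cp1252")

# A server that trusts the declared iso-8859-1 decodes that byte as a
# C1 control character instead of a quotation mark.
misread = cp1252_byte.decode("iso-8859-1")
misread_category = unicodedata.category(misread)

# With accept-charset="UTF-8" the same character arrives as three bytes.
utf8_bytes = left_quote.encode("utf-8")

print(cp1252_byte, repr(misread), misread_category, utf8_bytes)
```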
That assumes you want fancy curly quotes and the like. In some cases, for instance when sending email, you will want to convert your utf-8 data to iso-8859-1 (or whatever the common encoding is for email in your language), and then you have to make sure that your utf-8->iso-8859-1 recoder can handle those characters. Some recoders will replace such characters with ?, some will just fail to convert the whole thing, etc.
The main characters to look out for are the various hyphens and dashes, the various special spaces, the ellipsis, and of course the fancy single and double quotes (including the lower quotes).
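One way to keep a recoder from choking on these characters is to map them to plain equivalents before encoding. The table and the `to_latin1` helper below are hypothetical and far from complete; they only sketch the idea.

```python
# Hypothetical fallback table for the characters listed above.
FALLBACKS = {
    "\u2010": "-",    # hyphen
    "\u2013": "-",    # en dash
    "\u2014": "--",   # em dash
    "\u2003": " ",    # em space
    "\u2009": " ",    # thin space
    "\u2026": "...",  # ellipsis
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201a": ",",    # single low-9 quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u201e": '"',    # double low-9 quote
}
TABLE = str.maketrans(FALLBACKS)


def to_latin1(text):
    """Downgrade fancy punctuation, then encode as iso-8859-1.

    Anything still unencodable becomes '?', like the more forgiving
    recoders mentioned above.
    """
    return text.translate(TABLE).encode("iso-8859-1", errors="replace")


print(to_latin1("\u201cWait\u2026\u201d \u2014 caf\u00e9"))
```

A strict encode of the same input (without the table and without `errors="replace"`) raises `UnicodeEncodeError`, which corresponds to the recoders that fail on the whole message.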
You can see these in utf-8 forms even without the accept-charset if people do things like copy & paste text from Word into a form in IE.
In Mozilla, if you declare iso-8859-1, things not encodable in that scheme will be silently converted to Windows-1252 if they are encodable there, and converted to NCRs only if they aren't in 1252 either. Apparently that evil behavior made sense for some situation at some time.
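The fallback chain described above can be sketched per character like this. The function name is made up, and this is a model of the behavior as reported in this comment, not Mozilla's actual code.

```python
def encode_like_reported_mozilla(text):
    """Try iso-8859-1, then Windows-1252, then fall back to an NCR."""
    out = bytearray()
    for ch in text:
        for codec in ("iso-8859-1", "cp1252"):
            try:
                out += ch.encode(codec)
                break
            except UnicodeEncodeError:
                continue
        else:
            # Neither single-byte charset has this character.
            out += ("&#%d;" % ord(ch)).encode("ascii")
    return bytes(out)


print(encode_like_reported_mozilla("\u201chi\u201d \u2665"))
```

The curly quotes come out as single 1252 bytes and only the heart becomes a numeric character reference, so a server that decodes strictly as iso-8859-1 sees control characters where the quotes were.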
As to accept-charset, it allows a space-separated list of options, though apparently only Opera actually tells the server which one it used. Unless, unless, this old 1999 bug still describes the current situation, and adding a hidden form field with the name "{underscore}charset{underscore}" really will cause both Moz and IE to populate it with the charset they are actually using. Now that would be a useful, and incredibly hidden, thing: to actually know what you are getting.
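If browsers do fill in that hidden field (named `_charset_` once the wiki markup is out of the way), the server side could use it roughly as follows. This is a sketch for a urlencoded form body; `decode_form` and the `comment` field are hypothetical names, and whether a given browser populates the field is exactly the open question above.

```python
import codecs
from urllib.parse import parse_qs


def decode_form(raw_body, default="utf-8"):
    """Decode a urlencoded body using its _charset_ field when present."""
    qs = raw_body.decode("ascii")  # urlencoded bodies are plain ASCII

    # First pass with iso-8859-1, which maps every byte to a character,
    # just to read the ASCII-only _charset_ value safely.
    probe = parse_qs(qs, encoding="iso-8859-1")
    charset = probe.get("_charset_", [default])[0]

    # Ignore charset names Python does not know about.
    try:
        codecs.lookup(charset)
    except LookupError:
        charset = default

    return parse_qs(qs, encoding=charset, errors="replace")


fields = decode_form(b"_charset_=windows-1252&comment=%93hi%94")
print(fields["comment"])
```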
(Meta: bloody wiki-like markup. What about people who need to say {underscore}word{underscore} but are too lazy to look up the entity for the underscore character?)
(Mo-meta: preview is rather newline-happy: after one preview, a blank line between paragraphs becomes three blank lines, after another preview, it's up to seven blank lines.)
<cite>If you declare iso-8859-1 (a common encoding covering western Europe and Latin countries)</cite>
That's not correct. iso-8859-1, or Latin1, does cover the Western European languages, but not Latin countries, whatever you call that. Romanian, like Polish, Hungarian, Turkish and other non-Western but Latin-alphabet languages, is not well supported by Latin1.
Phil, I first saw NCRs in Moz when I tried some of the original extended ASCII characters such as ♥. With the configuration I had tried, IE would send the characters as single bytes. I'm not sure which behavior I like least. Oh, and I've fixed the whitespace problem, but the underscore problem is more problematic. Oh, and don't try entering _ or _ as I will simply escape those.
Gabriel, thanks. I'll try to be more precise in the future.
Once again I have come across two pages that deal with the layout of web forms: the form layout experiments by the Man in Blue as well as...
On my plate, in my browser tabs: /~distler/blog/files/MTStripControlChars.pl Musings: MTStripControlChars Sketchbook: m[iA]cro: On NoHTMLEntities and application/xhtml+xml Sam Ruby: Character Encoding......
Hossein Derakhshan: Spread the meme Please test your clients, servers, comments, and feeds. Hossein Derakhshan: I'm doing my part. It took only a few lines of code for me to convert my weblog over to utf-8 (plus changing the content type in a few...
I've been struggling with the problem of encoding and HTML forms for quite some time now. I should have read this post on Sam Ruby's site, who also wrote the i18n survival guide. It's really simple once you set the......
Thanks, the link has been updated. As for the problems with the comment script please accept my apologies. I know how to fix the problem, but I am actually spending my time on migrating from Bulu to pyblosxom....
HTML forms let you determine the encoding of the submitted data by means of the accept-charset attribute. This attribute, as the specification says, lets you declare a list of permitted encodings (a list separated by...
Basically, not much has changed with HTML form processing and input character encoding detection in the last 10 years. It’s still a mess. Some browsers are consistent in using the form data encoding that matches the one in the meta tag within...
Problem statement: the data of a form embedded in an ISO-8859-1 encoded page is to be sent UTF-8 encoded to a (different) URL. As a rule, browsers always submit their form data in the encoding in which...