utf-8 musings
Jacques Distler: If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.
Consider the following observations about Musings:
- Conceptually, each web page is composed of characters from the ISO 10646 character set.
- Physically, each web page is a stream of bytes which nearly always are limited to US-ASCII. In fact, despite a reference to Erdös and the presence of a number of Hebrew characters, this web page was exclusively US-ASCII until I left this comment.
- By declaration, each web page is iso-8859-1.
Now let's take a look at utf-8 in Perl:
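use strict;
use warnings;
use Encode;
# utf-8 bytes, written out as hex digits
my $input = "49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E";
print "$input\n";
# pack the hex digits into bytes, then decode the bytes into characters
$input = decode('utf-8', pack("H*", $input));
# print the ordinal value of each character in hex
for (my $i = 0; $i < length($input); $i++) {
    printf "%X", ord(substr($input, $i, 1));
}
print "\n";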
Which will produce the following output:
49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E
49F174EB726EE27469F46EE06C697AE67469F86E
Astute readers may note that the second line corresponds to iso-8859-1. Confused yet?
Now let's make a few observations which will simplify things:
- The characters in US-ASCII are a proper subset of the characters in iso-8859-1, which in turn are a proper subset of the characters in iso 10646.
- The byte representation of the first 128 characters in the iso 10646 character set is the same in US-ASCII, iso-8859-1, and utf-8.
- In US-ASCII and iso-8859-1, the ordinal value of each character is equal to the numeric value of the corresponding byte. In utf-8, bytes with a numeric value greater than 127 are part of a multibyte sequence.
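These observations can be checked directly. What follows is a minimal sketch, assuming Perl 5.8 or later with the core Encode module; hex_bytes is a name made up for this illustration, and "ascii" is the name Encode uses for US-ASCII.

```perl
use strict;
use warnings;
use Encode qw(encode);

# hex_bytes is a hypothetical helper: encode a character string in the
# named encoding and return the resulting bytes as uppercase hex digits.
sub hex_bytes {
    my ($encoding, $chars) = @_;
    return uc(unpack("H*", encode($encoding, $chars)));
}

# A US-ASCII character: the same single byte in all three encodings.
print join(" ", map { hex_bytes($_, "I") } qw(ascii iso-8859-1 utf-8)), "\n";

# The character at position 0xF1 (n with tilde):
# one byte in iso-8859-1, a two-byte sequence in utf-8.
print hex_bytes("iso-8859-1", "\x{F1}"), " ", hex_bytes("utf-8", "\x{F1}"), "\n";
```

The first line printed should be 49 49 49 and the second F1 C3B1, which matches the two hex dumps shown earlier.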
Now let's try a Gedanken Experiment... imagine an alternate version of Musings in which every character returned in response to an HTTP GET request was precisely limited to US-ASCII. That would mean that independent of how my comment was stored, it would be transmitted as the following (or equivalent):
I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n
Such web pages could be validly declared as us-ascii, iso-8859-1 or utf-8. In fact, the only operational difference would be how data returned from forms are encoded.
The comment forms on Musings have four input fields and one textarea. Encode::decode can be used to convert the utf-8 bytes received into a Perl string. The Perl length, ord, and substr functions work on characters (as of Perl version 5.6). In fact, a loop like the one above can be used to detect which characters have an ordinal value greater than 127 and replace them with numeric character references.
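As a rough sketch of that loop, assuming the form field arrives as raw utf-8 bytes: ascii_safe is a hypothetical name, not a function in MT or in Encode, and the input is simply the byte string from the earlier example.

```perl
use strict;
use warnings;
use Encode qw(decode);

# ascii_safe is a hypothetical helper, not part of MT or Encode.
# It takes the raw utf-8 bytes of one form field and returns a string in
# which every character above 127 is a numeric character reference.
sub ascii_safe {
    my ($bytes) = @_;
    my $chars = decode('utf-8', $bytes);    # bytes -> Perl characters
    my $out = '';
    for (my $i = 0; $i < length($chars); $i++) {
        my $c = substr($chars, $i, 1);
        $out .= ord($c) > 127 ? "&#" . ord($c) . ";" : $c;
    }
    return $out;
}

# The utf-8 bytes from the example above, packed from their hex form.
my $field = pack("H*", "49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E");
print ascii_safe($field), "\n";
```

The result contains only US-ASCII bytes, so it could be stored and served under any of the three declarations discussed here.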
"I saw the best minds of my generation destroyed by madness..."
We will eventually get this right, and understand it completely, and then slowly whip each CMS vendor but one into getting it right too, like we've done with ETags and gzip compression and accessibility and CSS and all the other issues the technical bloggers have advocated in the past few years. And then we'll find Sam wandering the streets muttering incoherently to himself and screaming "THERE'S NO SUCH THING AS A CHARACTER!" to passersby who swerve to avoid him.
And about that time, the next whiz kid with l33t Perl skills will make Gmail-for-blogs that's insanely easy to use but gets all of this wrong, and we'll be back where we started.
Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.
Posted by Mark at
Jacques, you forgot "&#xE9;" ;-)
If the user types "é", what should go into the <textarea> should be any one of the above, allowing the user to see exactly what they typed.
If the user types "&eacute;", what should go into the <textarea> should be "&amp;eacute;" (or "&#38;eacute;" or "&#x26;eacute;"), again allowing the user to see exactly what they typed.
An easy way to achieve this is to use CGI::escapeHTML.
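The idea is that the ampersand the user typed is itself escaped, so the browser shows the typed text again instead of interpreting it. A minimal sketch of that escaping follows; escape_for_textarea is a hypothetical hand-written stand-in for CGI::escapeHTML, used here only so the example has no dependencies.

```perl
use strict;
use warnings;

# escape_for_textarea is a hypothetical stand-in for CGI::escapeHTML.
# It escapes only the four characters that are significant in HTML markup.
sub escape_for_textarea {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;     # must come first, or the later escapes get re-escaped
    $text =~ s/</&lt;/g;
    $text =~ s/>/&gt;/g;
    $text =~ s/"/&quot;/g;
    return $text;
}

# A typed named entity and a typed numeric reference both come back
# as literal text when the escaped form is placed inside a <textarea>.
print escape_for_textarea('&eacute;'), "\n";
print escape_for_textarea('&#233;'), "\n";
```

Because only the markup characters are touched, whichever representation the user entered is the one they get back on preview.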
So, I'm not quite following this, so I'll summarise what I think I understand in the hope that someone can explain where I'm wrong:
- US-ASCII, ISO 10646, and UTF-8 are all equivalent at the byte level for characters < 127
- For characters above 'position' 127 (I use the term position loosely because I seem to remember subtleties) one can always determine which character is being represented and convert it to an equivalent character entity that uses only characters below position 127
- Therefore with sufficiently clever conversion algorithms one may serve characters from many different character sets and claim one's page is UTF-8, ISO 10646 or US-ASCII and be correct whilst only sending bytes corresponding to ASCII characters over the wire
- There is a problem in that one is throwing away the information about which representation was originally entered. This is a problem for e.g. the comment preview page and, one presumes, also for the editing and previewing functionality available to the author.
- There is a slightly tangential issue that isn't raised in this post that no one understands what the default for any of this stuff is but people rely on it anyway
- Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.
So what am I missing?
Posted by jgraham at
So, the first thing I missed was that every time I wrote ISO 10646, I actually meant iso-8859-1. There's nothing like having descriptive names for these things...
Posted by jgraham at
jgraham,
US-ASCII, iso-8859-1, and UTF-8 are all equivalent at the byte level for characters < 127
Yes.
For characters above 'position' 127 (I use the term position loosely because I seem to remember subtleties) one can always determine which character is being represented and convert it to an equivalent character entity that uses only characters below position 127
Yes. In Perl, for example, ord returns the 'position' of a character.
Therefore with sufficiently clever conversion algorithms one may serve characters from many different character sets and claim one's page is UTF-8, iso-8859-1 or US-ASCII and be correct whilst only sending bytes corresponding to ASCII characters over the wire
Yes. FYI: here is the "sufficiently clever conversion algorithm" in Perl:
ord($c)>127 ? "&#".ord($c).";" : $c
There is a problem in that one is throwing away the information about which representation was originally entered. This is a problem for e.g. the comment preview page and, one presumes, also for the editing and previewing functionality available to the author.
If I were to enter the Greek letter Sigma ("Σ") on a weblog served as iso-8859-1, the browser has no way to transmit this information to the server as Sigma is not defined in iso-8859-1. What some browsers will transmit instead is "&#931;". Of course, this is exactly the same thing a browser would transmit had you typed in the characters "&", "#", "9", "3", "1", and ";".
There is a slightly tangential issue that isn't raised in this post that no one understands what the default for any of this stuff is but people rely on it anyway
In this case, you would think that accept-charset would be definitive, but it seems that browsers prefer content-type or http-equiv, unless iso-8859-1 is specified, in which case the browser will generally ignore your preference.
Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.
There's equal bits of hyperbole and truth in Mark's comment. But something worth noting: less than two weeks ago, utf-8 encoding was completely opaque to me and the benefits of using it were completely unknown to me.
Posted by Sam Ruby at
OK, so if I'm following correctly then, assuming you just communicate between browser and server, everything is peachy. A user enters a comment containing Σ and the browser converts it to &#931; before transmitting to the server. Then if the comment is later edited, the server will transmit &#931; but the browser will display Σ. Correct? If I'm following so far, the substantial difficulty that remains is editing content outside of the browser. If I write a post containing some Japanese characters but store it as US-ASCII encoded Unicode as opposed to UTF-8 encoded Unicode then, should I ever wish to open the post in a text editor, I'll be left staring at a bunch of encoded entities rather than real text.
Posted by jgraham at
jgraham: consider preview.
If you write a post containing some Japanese characters on Musings, it will be converted to xhtml numeric character references as opposed to utf-8 encoded Unicode. Consequently, in the preview window you will find that the carefully entered Japanese characters have all been converted to an opaque string of numbers and symbols.
Apparently, Jacques tried to convert to utf-8 in the past, unsuccessfully. I'm offering the suggestion that he try to do this in two steps: first do whatever it takes to convert my comment to 7-bit safe ASCII. Second, mark his comment form as utf-8, and use Perl Encode functions to convert the form fields into a string.
Posted by Sam Ruby at
Apparently, Jacques tried to convert to utf-8 in the past, unsuccessfully. I'm offering the suggestion that he try to do this in two steps: first do whatever it takes to convert my comment to 7-bit safe ASCII. Second, mark his comment form as utf-8, and use Perl Encode functions to convert the form fields into a string.
The round-trip-ability problems I encountered (and the eventual solution) probably would not have arisen if MT used the superior UTF-8 awareness of recent versions of Perl. Unfortunately, MT assumes only Perl 5.004.
Yuan-Chung managed to whip MT into doing the right thing with only some "small" changes to the code. Without that mucking around, my setup does the right thing with characters in the iso-8859-1 repertoire. Any characters entered by the user that are not in iso-8859-1 get converted to numeric entities.
I'd love to do better, but not at the cost of having to rewrite MT's string-handling routines.
Posted by Jacques Distler at
Why writing software stinks
Because you often find yourself doing stuff like this instead of actually solving problems. Sometimes software is living proof that...
Trackback from Notes from Classy's Kitchen at
Anne
Monday 31 May 2004 00:14 No, not very much. utf-8 is just much more universal and nicer, imo. See also: [link] and [link] and...
Excerpt from GoT at
Um... An HTML character reference is supposed to refer to a character in the document’s character set. Thus, if you are using an ISO-8859-1 encoded document, “&#931;” would be an invalid character reference.
Who cares? Well, suppose we are trying to translate a document to a UTF-8 or unicode representation. We could run through the document and replace html character references with valid characters in the document encoding and then convert the entire document to utf-8 or unicode. But for that approach to work, it must be the case that character references are in the correct character set.
So, we could try to convert the document and then go back in and try to convert the character references. But now, when we encounter a numeric character reference, we have to decide if that reference is in the document’s stated encoding or if it uses some other encoding.
So, yes, you can make up your own rules which may just happen to work until someone comes along and doesn’t realize you made up your own rules. Or you can use standards and have some hope of things someday working. If you want to transmit “é” don’t claim your web-page uses us-ascii.
Posted by chuck simmons at
Um... An HTML character reference is supposed to refer to a character in the document’s character set
The Document Character Set for HTML is ISO 10646. The Character Encoding may be ISO-8859-1, US-ASCII, or UTF-8, but this does not affect the Document Character Set nor the way Numeric Character References are constructed.
In fact, the section on Numeric Character References specifically mentions ISO 10646. Twice.
Thus, if you are using an ISO-8859-1 encoded document, “&#931;” would be an invalid character reference.
False
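To make the point concrete, here is a small sketch of resolving numeric character references. It assumes nothing about the document's character encoding, because the number in the reference is a position in ISO 10646. resolve_ncrs is a hypothetical helper written for this illustration, and it deliberately ignores named entities.

```perl
use strict;
use warnings;

# resolve_ncrs is a hypothetical helper. It replaces decimal (&#931;) and
# hexadecimal (&#x3A3;) numeric character references with the character at
# that ISO 10646 position. Named entities such as &eacute; are left alone.
sub resolve_ncrs {
    my ($text) = @_;
    # One pass only, so text produced by a replacement is never re-scanned.
    $text =~ s/&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));/chr(defined $1 ? hex($1) : $2)/ge;
    return $text;
}

my $sigma = resolve_ncrs("&#931;");
printf "U+%04X\n", ord($sigma);    # position of the resolved character
```

The decimal and hexadecimal forms resolve to the same character, whatever encoding the surrounding page happens to declare.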
Posted by Sam Ruby at
Just to clarify, "é", "&eacute;" and "&#233;" are all valid ways of entering a small latin 'e' with an acute accent. They should, of course, all display the same in the user's browser.
But, when echoed back to the user in the <textarea> of the comment form after the user clicks "PREVIEW", we should attempt, wherever possible, to allow the user to edit the same string that he entered.
If he entered "&eacute;", that's what he should see, not "&#233;", and vice versa.
Posted by Jacques Distler at