Jacques Distler: If I converted to UTF-8, presumably, this
problem would be solved. Unfortunately, the last time I tried it,
the interaction between UTF-8 and MT’s Comment Form was such a
horror story that I’m loath to try it again.
Consider the following observations about Musings:
Conceptually, each web page is composed of characters from the
ISO 10646 character set.
Physically, each web page is a stream of bytes which are nearly
always limited to US-ASCII. In fact, despite a reference to Erdös
and the presence of a number of Hebrew characters, this web page
was exclusively US-ASCII until I left this comment.
Astute readers may note that the second line corresponds to
iso-8859-1. Confused yet?
Now let's make a few observations which will simplify things:
The characters in US-ASCII are a proper subset of the
characters in iso-8859-1, which in turn are a proper subset of the
characters in ISO 10646.
The byte representations of the first 128 characters in the ISO
10646 character set are the same in US-ASCII, iso-8859-1, and
utf-8.
In US-ASCII and iso-8859-1, the ordinal value of each character is
equal to the numeric value of the corresponding byte. In
utf-8, bytes with a numeric value greater than 127 are part of a
multibyte sequence.
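These observations can be checked directly. What follows is only a minimal sketch, assuming a Perl with the Encode module (5.8 or later); the helper name hex_bytes is made up for illustration and is not part of any library.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: hex dump of the bytes produced when $text is
# encoded in $charset.
sub hex_bytes {
    my ($charset, $text) = @_;
    my $bytes = encode($charset, $text);
    return join ' ', map { sprintf('%02X', ord($_)) } split //, $bytes;
}

my $e_acute = "\x{E9}";   # U+00E9, small latin e with acute accent

# A character below 128 is the same single byte in all three encodings.
print hex_bytes($_, 'e'), "\n" for qw(US-ASCII iso-8859-1 UTF-8);

# Above 127 the encodings diverge: one byte equal to the ordinal value in
# iso-8859-1, a multibyte sequence in utf-8.
print hex_bytes('iso-8859-1', $e_acute), "\n";   # E9
print hex_bytes('UTF-8',      $e_acute), "\n";   # C3 A9
```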
Now let's try a Gedanken Experiment... imagine an alternate
version of Musings in which every character returned in response to
an HTTP GET request was precisely limited to US-ASCII. That
would mean that independent of how my comment was stored, it would
be transmitted as the following (or equivalent):
Such web pages could be validly declared as us-ascii, iso-8859-1
or utf-8. In fact, the only operational difference would be
how data returned from forms are encoded.
The comment forms on Musings have four input fields and one
textarea. Encode::decode can be used to convert the utf-8 bytes
received into a Perl string. The Perl length, ord, and substr
functions work on characters (as of Perl version 5.6). In fact, a
loop like the above can be used to detect which characters have an
ordinal value greater than 127, and replace such characters with a
numeric character reference.
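A rough sketch of that decode-then-escape step might look like the following. It assumes the form data arrives as utf-8 bytes; the function name ncr_escape is hypothetical, and nothing here is taken from MT's actual code.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: turn utf-8 form bytes into 7-bit-safe text.
sub ncr_escape {
    my ($bytes) = @_;
    my $text = decode('UTF-8', $bytes);   # bytes -> Perl character string
    my $out  = '';
    for my $i (0 .. length($text) - 1) {
        my $c = substr($text, $i, 1);     # one character, not one byte
        # Characters above 127 become decimal numeric character references.
        $out .= ord($c) > 127 ? '&#' . ord($c) . ';' : $c;
    }
    return $out;
}

# "Erd\xC3\xB6s" is the utf-8 byte sequence for "Erdös".
print ncr_escape("Erd\xC3\xB6s"), "\n";   # Erd&#246;s
```

The result contains only US-ASCII characters, so it can be stored and served under any of the three charset declarations discussed above.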
Just to clarify, "é", "&eacute;" and "&#233;" are all valid ways of entering a small latin 'e' with an acute accent. They should, of course, all display the same in the user's browser.
But, when echoed back to the user in the <textarea> of the comment form after the user clicks "PREVIEW", we should attempt, wherever possible, to allow the user to edit the same string that he entered.
If he entered "&eacute;", that's what he should see, not "é", and vice versa.
"I saw the best minds of my generation destroyed by madness..."
We will eventually get this right, and understand it completely, and then slowly whip each CMS vendor but one into getting it right too, like we've done with ETags and gzip compression and accessibility and CSS and all the other issues the technical bloggers have advocated in the past few years. And then we'll find Sam wandering the streets muttering incoherently to himself and screaming "THERE'S NO SUCH THING AS A CHARACTER!" to passersby who swerve to avoid him.
And about that time, the next whiz kid with l33t Perl skills will make Gmail-for-blogs that's insanely easy to use but gets all of this wrong, and we'll be back where we started.
Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.
If the user types "é", what should go into the <textarea> should be any one of the above, allowing the user to see exactly what they typed.
If the user types "&eacute;", what should go into the <textarea> should be "&amp;eacute;" (or "&#38;eacute;" or "&#x26;eacute;"), again allowing the user to see exactly what they typed.
An easy way to achieve this is to use CGI::escapeHTML.
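To illustrate the effect, here is a small dependency-free sketch. CGI::escapeHTML is the function named above; the escape_for_textarea routine below is only a hand-rolled stand-in (a hypothetical name) that escapes the same markup-significant characters, so that the example runs without CGI.pm installed.

```perl
use strict;
use warnings;

# Markup-significant characters and their escaped forms.
my %ESCAPE = ('&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;');

# Hypothetical stand-in for CGI::escapeHTML: escape what the user typed
# so that the browser shows those exact characters in the <textarea>.
sub escape_for_textarea {
    my ($typed) = @_;
    (my $escaped = $typed) =~ s/([&<>"])/$ESCAPE{$1}/g;
    return $escaped;
}

print escape_for_textarea('&eacute;'), "\n";   # &amp;eacute;
print escape_for_textarea('&#233;'),   "\n";   # &amp;#233;
```

Because the ampersand itself is escaped, the browser renders the characters the user typed rather than resolving them to "é".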
So, I'm not quite following this, so I hope that by summarising what I think I understand, someone can explain where I'm wrong:
US-ASCII, ISO 10646, and UTF-8 are all equivalent at the byte level for characters < 127
For characters above 'position' 127 (I use the term position loosely because I seem to remember subtleties) one can always determine which character is being represented and convert it to an equivalent character entity that uses only characters below position 127
Therefore with sufficiently clever conversion algorithms one may serve characters from many different character sets and claim one's page is UTF-8, ISO 10646 or US-ASCII and be correct whilst only sending bytes corresponding to ASCII characters over the wire
There is a problem in that one is throwing away the information about which representation was originally entered. This is a problem for e.g. the comment preview page and, one presumes, also for the editing and previewing functionality available to the author.
There is a slightly tangential issue that isn't raised in this post that no one understands what the default for any of this stuff is but people rely on it anyway
Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.
So, the first thing I missed was that every time I wrote ISO 10646, I actually meant iso-8859-1. There's nothing like having descriptive names for these things...
US-ASCII, iso-8859-1, and UTF-8 are all equivalent at the byte level for characters < 127
Yes.
For characters above 'position' 127 (I use the term position loosely because I seem to remember subtleties) one can always determine which character is being represented and convert it to an equivalent character entity that uses only characters below position 127
Yes. In Perl, for example, ord returns the 'position' of a character.
Therefore with sufficiently clever conversion algorithms one may serve characters from many different character sets and claim one's page is UTF-8, iso-8859-1 or US-ASCII and be correct whilst only sending bytes corresponding to ASCII characters over the wire
Yes. FYI: here is the "sufficiently clever conversion algorithm" in Perl:
ord($c)>127 ? "&#".ord($c).";" : $c
There is a problem in that one is throwing away the information about which representation was originally entered. This is a problem for e.g. the comment preview page and, one presumes, also for the editing and previewing functionality available to the author.
If I were to enter the Greek letter Sigma ("Σ") on a weblog served as iso-8859-1, the browser has no way to transmit this information to the server as sigma is not defined in iso-8859-1. What some browsers will transmit instead is "&#931;". Of course, this is exactly the same thing a browser would transmit had you typed in the characters "&", "#", "9", "3", "1", and ";".
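Perl's Encode module can imitate this browser behaviour, which makes the ambiguity easy to see. The sketch below is an illustration under assumptions: submit_as_latin1 is a hypothetical name, and Encode's FB_HTMLCREF fallback is used here as a stand-in for what such browsers do, not as a description of any particular browser.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: the bytes a browser might submit for $typed when
# the form's charset is iso-8859-1. Characters that iso-8859-1 cannot
# represent fall back to decimal numeric character references.
sub submit_as_latin1 {
    my ($typed) = @_;
    my $copy = $typed;   # work on a copy; encode() may modify its input
    return encode('iso-8859-1', $copy, Encode::FB_HTMLCREF());
}

my $pasted_sigma = submit_as_latin1("\x{3A3}");   # the single character Σ
my $typed_chars  = submit_as_latin1('&#931;');    # six ASCII characters

print "$pasted_sigma\n";                          # &#931;
print $pasted_sigma eq $typed_chars
    ? "indistinguishable on the server\n"
    : "different\n";
```

Once the bytes reach the server, nothing records which of the two inputs produced them. That is the loss of information the quoted point describes.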
There is a slightly tangential issue that isn't raised in this post that no one understands what the default for any of this stuff is but people rely on it anyway
In this case, you would think that accept-charset would be definitive, but it seems that browsers prefer content-type or http-equiv, unless iso-8859-1 is specified, in which case the browser will generally ignore your preference.
Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.
There's equal bits of hyperbole and truth in Mark's comment. But something worth noting: less than two weeks ago, utf-8 encoding was completely opaque to me and the benefits of using it were completely unknown to me.
OK, so if I'm following correctly then, assuming you just communicate between browser and server, everything is peachy. A user enters a comment containing Σ and the browser converts it to &#931; before transmitting to the server. Then if the comment is later edited, the server will transmit &#931; but the browser will display Σ. Correct? If I'm following so far, the substantial difficulty that remains is editing content outside of the browser. If I write a post containing some Japanese characters but store it as US-ASCII encoded Unicode as opposed to UTF-8 encoded Unicode then, should I ever wish to open the post in a text editor, I'll be left staring at a bunch of encoded entities rather than real text.
If you write a post containing some Japanese characters on Musings, it will be converted to xhtml numeric character references as opposed to utf-8 encoded Unicode. Consequently, in the preview window you will find that the carefully entered Japanese characters have all been converted to an opaque string of numbers and symbols.
Apparently, Jacques tried to convert to utf-8 in the past, unsuccessfully. I'm offering the suggestion that he try to do this in two steps: first do whatever it takes to convert my comment to 7-bit safe ASCII. Second, mark his comment form as utf-8, and use Perl Encode functions to convert the form fields into a string.
The round-trip-ability problems I encountered, and the eventual solution, probably would not have arisen if MT used the superior UTF-8 awareness of recent versions of Perl. Unfortunately, MT assumes only Perl 5.004.
Yuan-Chung managed to whip MT into doing the right thing with only some "small" changes to the code. Without that mucking around, my setup does the right thing with characters in the iso-8859-1 repertoire. Any characters entered by the user that are not in iso-8859-1 get converted to numeric entities.
I'd love to do better, but not at the cost of having to rewrite MT's string-handling routines.
Um... An HTML character reference is supposed to refer to a character in the document’s character set. Thus, if you are using an ISO-8859-1 encoded document, “&#931;” would be an invalid character reference.
Who cares? Well, suppose we are trying to translate a document to a UTF-8 or Unicode representation. We could run through the document and replace HTML character references with valid characters in the document encoding and then convert the entire document to UTF-8 or Unicode. But for that approach to work, it must be the case that character references are in the correct character set.
So, we could try to convert the document and then go back in and try to convert the character references. But now, when we encounter a numeric character reference, we have to decide if that reference is in the document’s stated encoding or if it uses some other encoding.
So, yes, you can make up your own rules which may just happen to work until someone comes along and doesn’t realize you made up your own rules. Or you can use standards and have some hope of things someday working. If you want to transmit “é” don’t claim your web-page uses us-ascii.
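For concreteness, the conversion discussed in this comment could be sketched as follows. This is only an illustration under a stated assumption: every numeric character reference is taken to denote a Unicode (ISO 10646) code point, whatever the document's byte encoding, which is how HTML 4 defines them. The names to_utf8 and resolve_ref are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: resolve one decimal reference. References below
# 128 are left alone so that escaped markup such as &#38; stays escaped.
sub resolve_ref {
    my ($code) = @_;
    return $code > 127 ? chr($code) : "&#$code;";
}

# Hypothetical helper: convert a document from $charset to utf-8 bytes.
# ASSUMPTION: numeric references are Unicode code points.
sub to_utf8 {
    my ($bytes, $charset) = @_;
    my $text = decode($charset, $bytes);         # bytes -> characters
    $text =~ s/&#([0-9]+);/resolve_ref($1)/ge;   # decimal references only
    return encode('UTF-8', $text);               # characters -> utf-8 bytes
}

# iso-8859-1 input: a raw 0xF6 byte ("ö") plus a reference to Sigma.
my $utf8 = to_utf8("Erd\xF6s and &#931;", 'iso-8859-1');
print join(' ', map { sprintf('%02X', ord($_)) } split //, $utf8), "\n";
```

Under that assumption the order of the two steps does not matter for correctness. If references were instead interpreted in the document's byte encoding, the ambiguity described above would apply.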