It’s just data

utf-8 musings

Jacques Distler: If I converted to UTF-8, presumably, this problem would be solved. Unfortunately, the last time I tried it, the interaction between UTF-8 and MT’s Comment Form was such a horror story that I’m loath to try it again.

Consider the following observations about Musings:

Now let's take a look at utf-8 in Perl:

 use Encode;
 
 # utf-8 bytes for "Iñtërnâtiônàlizætiøn", written out as hex digits
 $input="49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E";
 print "$input\n";
 
 # pack the hex digits back into raw bytes, then decode those bytes as utf-8
 $input=decode('utf-8',pack("H*",$input));
 
 # print the ordinal value of each character in the decoded string
 for ($i=0; $i<length($input); $i++) {
   printf "%X", ord(substr($input,$i,1));
 }

Which will produce the following output:

 49C3B174C3AB726EC3A27469C3B46EC3A06C697AC3A67469C3B86E
 49F174EB726EE27469F46EE06C697AE67469F86E

Astute readers may note that the second line corresponds to iso-8859-1.  Confused yet? 
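One way to see that (a small follow-up sketch, not part of the script above): encode the decoded string back out as iso-8859-1, and the bytes match the second line exactly:

 use Encode;
 
 # continuing with the decoded $input from the script above
 my $latin1 = encode('iso-8859-1', $input);
 print uc(unpack("H*", $latin1)), "\n";   # 49F174EB726EE27469F46EE06C697AE67469F86E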

Now let's make a few observations which will simplify things:

Now let's try a Gedanken Experiment... imagine an alternate version of Musings in which every character returned in response to an HTTP GET request was precisely limited to US-ASCII.  That would mean that independent of how my comment was stored, it would be transmitted as the following (or equivalent):

I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n

Such web pages could be validly declared as us-ascii, iso-8859-1 or utf-8.  In fact, the only operational difference would be how data returned from forms are encoded.
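A quick way to convince yourself of that, sketched in Perl (the byte string here is just the entity-encoded comment above): a string containing only bytes below 128 decodes to the same characters under any of the three labels:

 use Encode;
 
 # every byte in this string is below 128, so all three decodes agree
 my $bytes = "I&ntilde;t&euml;rn&acirc;ti&ocirc;n&agrave;liz&aelig;ti&oslash;n";
 print decode($_, $bytes), "\n" for ('ascii', 'iso-8859-1', 'utf-8');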

The comment forms on Musings have four input fields and one textarea.  Encode::decode can be used to convert the utf-8 bytes received into a Perl string.  The Perl length, ord, and substr functions work on characters (as of Perl version 5.6).  In fact, a loop like the above can be used to detect which characters have an ordinal value greater than 127, and replace such characters with a numeric character reference.
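Here is a minimal sketch of that flow; the field names are hypothetical and the CGI.pm usage is an assumption, not the actual Musings code:

 use CGI;
 use Encode;
 
 my $q = CGI->new;
 
 # hypothetical field names: four input fields plus one textarea
 for my $field (qw(name email url title comment_text)) {
   my $bytes = $q->param($field);          # raw utf-8 bytes from the form
   next unless defined $bytes;
   my $chars = decode('utf-8', $bytes);    # now a Perl character string
 
   # replace anything above 127 with a numeric character reference
   my $safe = '';
   for (my $i = 0; $i < length($chars); $i++) {
     my $c = substr($chars, $i, 1);
     $safe .= ord($c) > 127 ? "&#" . ord($c) . ";" : $c;
   }
   $q->param($field, $safe);               # store the 7-bit-safe version back
 }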


Just to clarify, "é", "&eacute;" and "&#233;" are all valid ways of entering a small latin 'e' with an acute accent. They should, of course, all display the same in the user's browser.

But, when echoed back to the user in the <textarea> of the comment form after the user clicks "PREVIEW", we should attempt, wherever possible, to allow the user to edit the same string that he entered.

If he entered "é", that's what he should see, not  "&eacute;", and vice versa.

Posted by Jacques Distler at

"I saw the best minds of my generation destroyed by madness..."

We will eventually get this right, and understand it completely, and then slowly whip each CMS vendor but one into getting it right too, like we've done with ETags and gzip compression and accessibility and CSS and all the other issues the technical bloggers have advocated in the past few years.  And then we'll find Sam wandering the streets muttering incoherently to himself and screaming "THERE'S NO SUCH THING AS A CHARACTER!" to passersby who swerve to avoid him.

And about that time, the next whiz kid with l33t Perl skills will make Gmail-for-blogs that's insanely easy to use but gets all of this wrong, and we'll be back where we started.

Internationalization is like parenting: a lifelong cycle of hardship in which no cumulative knowledge is gained.

Posted by Mark at

Jacques, you forgot "&#xE9;" ;-)

If the user types "é", what should go into the <textarea> should be any one of the above, allowing the user to see exactly what they typed.

If the user types "&eacute;", what should go into the <textarea> should be "&amp;eacute;" (or "&#38;eacute;" or "&#x26;eacute;"), again allowing the user to see exactly what they typed.

An easy way to achieve this is to use CGI::escapeHTML.
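For example (a hedged illustration, not the actual comment-handling code): escaping the previewed text before it goes back into the <textarea> preserves whatever the user literally typed:

 use CGI qw(escapeHTML);
 
 print escapeHTML("\x{e9}"), "\n";       # é comes through untouched: nothing to escape
 print escapeHTML("&eacute;"), "\n";     # &amp;eacute;, so the browser shows the literal "&eacute;"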

Posted by Sam Ruby at

So, I'm not quite following this, so I hope that by summarising what I think I understand, someone can explain why I'm wrong:

So what am I missing?

Posted by jgraham at

So, the first thing I missed was that every time I wrote ISO 10646, I actually meant iso-8859-1. There's  nothing like having descriptive names for these things...

Posted by jgraham at

jgraham,

US-ASCII, iso-8859-1, and UTF-8 are all equivalent at the byte level for characters < 127

Yes.

For characters above 'position' 127 (I use the term position loosely because I seem to remember subtleties) one can always determine which character is being represented and convert it to an equivalent character entity that uses only characters below position 127

Yes.  In Perl, for example, ord returns the 'position' of a character.

Therefore with sufficiently clever conversion algorithms one may serve characters from many different character sets and claim one's page is UTF-8, iso-8859-1 or US-ASCII and be correct whilst only sending bytes corresponding to ASCII characters over the wire

Yes.  FYI: here is the "sufficiently clever conversion algorithm" in Perl:

ord($c)>127 ? "&#".ord($c).";" : $c
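Applied to the decoded "Iñtërnâtiônàlizætiøn" string from the script at the top of the post (just an illustration of the expression above):

 my $ascii = join '', map { ord($_) > 127 ? "&#" . ord($_) . ";" : $_ } split //, $input;
 print "$ascii\n";   # I&#241;t&#235;rn&#226;ti&#244;n&#224;liz&#230;ti&#248;n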

There is a problem in that one is throwing away the information about which representation was originally entered. This is a problem for e.g. the comment preview page and, one presumes,  also for the editing and previewing functionality available to the author.

If I were to enter the greek letter Sigma ("Σ") on a weblog served as iso-8859-1, the browser has no way to transmit this information to the server as sigma is not defined in iso-8859-1.  What some browsers will transmit instead is "&#931;".  Of course, this is exactly the same thing a browser would transmit had you typed in the characters "&", "#", "9", "3", "1", and ";".

There is a slightly tangential issue that isn't raised in this post: no one understands what the default for any of this stuff is, but people rely on it anyway

In this case, you would think that accept-charset would be definitive, but it seems that browsers prefer content-type or http-equiv, unless iso-8859-1 is specified, in which case the browser will generally ignore your preference.

Mark believes that because Unicode is harder than not-Unicode people will always create systems that fail to use Unicode and so break  in unpleasant ways only after they are widely enough deployed that I18N becomes an issue.

There are equal bits of hyperbole and truth in Mark's comment.  But something worth noting: less than two weeks ago, utf-8 encoding was completely opaque to me and the benefits of using it were completely unknown to me.

Posted by Sam Ruby at

OK, so if I'm following correctly then, assuming you just communicate between browser and server, everything is peachy. A user enters a comment containing Σ and the browser converts it to &#931; before transmitting to the server. Then if the comment is later edited, the server will transmit &#931; but the browser will display Σ. Correct? If I'm following so far, the substantial difficulty that remains is editing content outside of the browser. If I write a post containing some Japanese characters but store it as US-ASCII encoded unicode as opposed to UTF-8 encoded unicode then, should I ever wish to open the post in a text editor, I'll be left staring at a bunch of encoded entities rather than real text.

Posted by jgraham at

jgraham: consider preview.

If you write a post containing some Japanese characters on Musings, it will be converted to xhtml numeric character references as opposed to utf-8 encoded Unicode.  Consequently, in the preview window you will find that the carefully entered Japanese characters have all been converted to an opaque string of numbers and symbols.

Apparently, Jacques tried to convert to utf-8 in the past, unsuccessfully.  I'm offering the suggestion that he try to do this in two steps: first do whatever it takes to convert my comment to 7-bit safe ASCII.  Second, mark his comment form as utf-8, and use Perl Encode functions to convert the form fields into a string.

Posted by Sam Ruby at

Apparently, Jacques tried to convert to utf-8 in the past, unsuccessfully.  I'm offering the suggestion that he try to do this in two steps: first do whatever it takes to convert my comment to 7-bit safe ASCII.  Second, mark his comment form as utf-8, and use Perl Encode functions to convert the form fields into a string.

The round-trip-ability problems I encountered, and the eventual solution, probably would not occur if MT used the superior UTF-8 awareness of recent versions of Perl. Unfortunately, MT assumes only Perl 5.004.

Yuan-Chung managed to whip MT into doing the right thing with only some "small" changes to the code. Without that mucking around, my setup does the right thing with characters in the iso-8859-1 repertoire. Any characters entered by the user that are not in iso-8859-1 get converted to numeric entities.

I'd love to do better, but not at the cost of having to rewrite MT's string-handling routines.

Posted by Jacques Distler at

Why writing software stinks

Because you often find yourself doing stuff like this instead of actually solving problems. Sometimes software is living proof that...... [more]

Trackback from Notes from Classy's Kitchen at

Sam Ruby: utf-8 musings

ow, my head hurts...

Excerpt from del.icio.us/ffg/i18n at

Anne

Monday 31 May 2004 00:14 No, not very much. utf-8 is just a lot more universal and nicer imo. See also: [link] and [link] and...

Excerpt from GoT at

Um...  An HTML character reference is supposed to refer to a character in the document’s character set.  Thus, if you are using an ISO-8859-1 encoded document, “&#931;” would be an invalid character reference.

Who cares?  Well, suppose we are trying to translate a document to a UTF-8 or unicode representation.  We could run through the document and replace html character references with valid characters in the document encoding and then convert the entire document to utf-8 or unicode.  But for that approach to work, it must be the case that character references are in the correct character set.

So, we could try to convert the document and then go back in and try to convert the character references.  But now, when we encounter a numeric character reference, we have to decide if that reference is in the document’s stated encoding or if it uses some other encoding.

So, yes, you can make up your own rules which may just happen to work until someone comes along and doesn’t realize you made up your own rules.  Or you can use standards and have some hope of things someday working.  If you want to transmit “&eacute;” don’t claim your web-page uses us-ascii.

Posted by chuck simmons at

Um...  An HTML character reference is supposed to refer to a character in the document’s character set

The Document Character Set for HTML is ISO 10646.  The Character Encoding may be ISO-8859-1, US-ASCII, or UTF-8, but this does not affect the Document Character Set nor the way Numeric Character References are constructed.

In fact, the section on Numeric Character References specifically mentions ISO 10646.  Twice.
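A small illustration (using the HTML::Entities module, which is my assumption, not something mentioned above): a numeric character reference resolves to the same ISO 10646 code point regardless of which encoding the page itself used:

 use HTML::Entities;
 
 my $char = decode_entities("&#931;");
 printf "U+%04X\n", ord($char);   # U+03A3, GREEK CAPITAL LETTER SIGMA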

Thus, if you are using an ISO-8859-1 encoded document, “&#931;” would be an invalid character reference.

False

Posted by Sam Ruby at
