Cory Doctorow: The theme for this year's ETech is "Remix,"
encompassing those nexus points of iterative hacking and large
ideas that have a way of transforming technology
Cool.
The Problem
Now let's look at the next line:
* The phone has become a platform, moving beyond mere voice to
smart mobile sensorâ€”and back to phone again, by
way of voice-over-IP.
sensorâ€”and? How did that happen? Let's look at the
O'Reilly source from which Cory copied and pasted:
The phone has become a platform, moving beyond mere voice to
smart mobile sensor—and back to phone again, by way of
voice-over-IP.
Much better. But let's view source:
<li>The phone has become a platform, moving beyond mere voice to smart mobile
sensor—and back to phone again, by way of voice-over-IP.</li>
Now let's view source on the Boing Boing page. What we see
there is sensor—and.
Something clearly happened in the transfer. Let's look closer
at the three bytes, this time in hex: E28094.
This turns out to be the utf-8 representation of U+2014, known
as an "em dash", which is the correct character.
However, my fully standards-compliant browser displays this
clean utf-8 as line noise. What's going on?
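The claim about those three bytes is easy to check. Here is a minimal sketch in Python (my choice for illustration; nothing on the pages involved uses it) that decodes the bytes as utf-8 and asks the Unicode database what the resulting character is called.

```python
import unicodedata

# The three bytes found in the Boing Boing source, written in hex.
raw = bytes.fromhex("E28094")

# Decoding them as utf-8 yields exactly one character.
char = raw.decode("utf-8")

code_point = f"U+{ord(char):04X}"
name = unicodedata.name(char)

# Prints: 3 bytes -> 1 character: U+2014 EM DASH
print(len(raw), "bytes ->", len(char), "character:", code_point, name)
```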
The Cause
Further investigation reveals that the browser is displaying
this as if it were encoded using windows-1252. How did
that happen? The story continues.
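The line noise can be reproduced outside the browser. The sketch below, in Python as an illustrative assumption, takes the utf-8 bytes for the phrase and decodes them twice: once as utf-8, and once with Python's cp1252 codec standing in for windows-1252. The output uses escaped code points so it survives any terminal encoding.

```python
# utf-8 bytes for the phrase as it travels over the wire.
page_bytes = "sensor\u2014and".encode("utf-8")

# What a browser shows when it honours the real encoding.
as_utf8 = page_bytes.decode("utf-8")

# What it shows when it treats the same bytes as windows-1252:
# each of the three em dash bytes becomes a character of its own.
as_cp1252 = page_bytes.decode("cp1252")

print(ascii(as_utf8))
print(ascii(as_cp1252))
print([f"U+{ord(c):04X}" for c in as_cp1252[6:9]])
```

The three code points in the last line are a-circumflex, the euro sign, and a right double quotation mark, which together make up the familiar garbage that stands in for an em dash.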
Viewing source on Boing Boing once again, you will see an
entirely futile attempt to declare the correct encoding:
As we will see later, this is a hack that really shouldn't work,
but does in many cases. Just not this one. Continuing
on the trail, we take a look at the HTTP headers that are returned
by Boing Boing:
The last line is key. As Cory once said, "There's more than one
way to describe something." OK, so Cory was talking about
something else there, but we are dealing with metacrap
nevertheless. In this case, the charset of the page is described
in two different places: one that is completely correct and
completely ignored, and one that is incorrect and partially
ignored.
In this case, the data inside the document is correct, and the
data which accompanies the document in the transfer is
incorrect. I have a theory that, in general, the accuracy of
metadata is inversely proportional to the distance between the
metadata and the data which it purports to describe.
Apparently, the authors of HTTP and HTML disagree with me, as the
priorities are defined to be first the HTTP content type; then the
meta element in HTML head; and finally any charset attributes on
any element in the HTML body. Given that the HTTP Content type
defines a default charset, you would think that the others would
never come into play, but here a bit of reality intrudes. Direct
from the HTML specification itself:
The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1
as a default
character encoding when the "charset" parameter is absent from the
"Content-Type" header field. In practice, this recommendation has
proved useless because some servers don't allow a "charset"
parameter to be sent, and others may not be configured to send the
parameter. Therefore, user agents must not assume any default value
for the "charset" parameter.
OK, so this section of the specification explains why the correct
encoding, which is placed in something clearly designed as a hack
(meta http-equiv) to address exactly this situation, is ignored.
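The first two steps of the priority order described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not how any browser implements it: the function names, the regular expression, the sample page, and the header values are all made up for the example, and the third step (charset attributes on elements in the body) is left out.

```python
import re
from email.message import Message
from typing import Optional, Tuple

# Rough pattern for a charset inside a <meta> tag; a real parser is stricter.
META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def header_charset(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type header value, if any."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def meta_charset(body: bytes) -> Optional[str]:
    """Return the charset declared in the first matching <meta> tag, if any."""
    match = META_CHARSET.search(body)
    return match.group(1).decode("ascii").lower() if match else None


def effective_charset(content_type: str, body: bytes) -> Tuple[Optional[str], str]:
    """Apply the precedence: HTTP header first, then the meta element."""
    from_header = header_charset(content_type)
    if from_header:
        return from_header, "http header"
    from_meta = meta_charset(body)
    if from_meta:
        return from_meta, "meta element"
    return None, "undeclared"


# Hypothetical page that declares utf-8 in its head.
page = (
    b'<html><head><meta http-equiv="Content-Type" '
    b'content="text/html; charset=utf-8"></head></html>'
)

print(effective_charset("text/html; charset=iso-8859-1", page))
print(effective_charset("text/html", page))
```

With a charset in the header, the meta element never gets a say. Drop the charset from the header and the meta element takes over.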
We still haven't fully explained the line noise. Even if
the "wrong" piece of metadata was picked, why was windows-1252
selected? This has been a point of contention for a number of
years, and has inspired the creation of tools such as the
Demoroniser, which has this to say on the subject:
A little detective work revealed that, as is usually the case
when you encounter something shoddy in the vicinity of a computer,
Microsoft incompetence and gratuitous incompatibility were to
blame. Western language HTML documents are written in the ISO
8859-1 Latin-1 character set, with a specified set of escapes for
special characters. Blithely ignoring this prescription, as usual,
Microsoft use their own "extension" to Latin-1, in which a variety
of characters which do not appear in Latin-1 are inserted in the
range 0x82 through 0x95--this having the merit of being
incompatible with both Latin-1 and Unicode, which reserve this
region for additional control characters.
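The difference between the two character sets can be listed directly. The following Python sketch is my own illustration, not part of the Demoroniser; it uses Python's latin-1 and cp1252 codecs as stand-ins for ISO 8859-1 and windows-1252, and it walks the whole 0x80 through 0x9F block rather than only the 0x82 through 0x95 range the quote mentions.

```python
import unicodedata

latin1_controls = []   # bytes that Latin-1 maps to control characters
cp1252_assigned = {}   # bytes that cp1252 maps to an actual character
cp1252_undefined = []  # bytes Python's cp1252 codec refuses to decode

for value in range(0x80, 0xA0):
    byte = bytes([value])
    # Unicode category "Cc" marks a control character.
    if unicodedata.category(byte.decode("latin-1")) == "Cc":
        latin1_controls.append(value)
    try:
        cp1252_assigned[value] = byte.decode("cp1252")
    except UnicodeDecodeError:
        cp1252_undefined.append(value)

print(len(latin1_controls), "control characters in Latin-1")
print(len(cp1252_assigned), "assigned characters in cp1252")
print("unassigned in cp1252:", [hex(v) for v in cp1252_undefined])
```

Every byte in that block is a control character under Latin-1, while cp1252 fills most of the same slots with punctuation and letters, including the euro sign at 0x80 and the quotation mark at 0x94 that show up in the mangled em dash.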
If you stroll through the Mozilla bug database, you can
chronicle the transition through the stages of grief: from
denial to anger to bargaining to depression, and ultimately
acceptance.
Recapping: we have a page which is correctly encoded as
utf-8. Mozilla ignores this as well as the declaration inside
the body that this is so. Instead it chooses to respect the
HTTP header, which it finds to be incorrect, so it compensates by
introducing a Windows-specific encoding.
Solution
Now that we have identified the header that is in error, the
questions that remain are:
* How did Boing Boing get to this state? This is pretty easy. It
probably occurred during the transition from Blogger to
MovableType. Blogger's pages tend to use the iso-8859-1 encoding;
MovableType's pages tend to use utf-8.
* How should this be fixed? This too is easy. Within the Apache
configuration there is either an AddCharset or AddDefaultCharset
directive that specifies iso-8859-1. Such directives should either
be updated to reflect the usage of the utf-8 encoding, or removed
entirely, allowing the meta http-equiv to take effect.
Conclusion
The web as we know it is built upon a foundation of concepts
such as Characters, HTML, and HTTP. These concepts are still
evolving, are not always mutually consistent, and are incomplete.
Sometimes in order to solve problems such as these, you need to not
only know what the standards say, but which parts there is general
agreement on, and which parts are pretty consistently ignored.
Is there a good web resource or book where one can read up on how to figure out what exactly is going on when a site is experiencing character set problems like this?
Scott: not that I'm aware of. The primary reason why I know about this stuff is because I wrote my own weblogging software. Think about it: my page (talking about weird characters) faithfully displays these same characters on a range of browsers. And this entry in my various feeds works too, across a range of aggregators.
Getting that to work properly is much harder than it ought to be.
I can easily see how getting all of this stuff to work on your site would be harder than it ought to be. Especially without good, simple documentation on techniques to accomplish such a feat.
I made a PHP function to get some XML feeds and print them in a web page, but I was getting bad characters in the page. I didn't have much knowledge about character sets and all that stuff, but I spent the whole night reading about it. I realized that I needed to use utf8 to properly display the content, so I used utf8_decode for the job.
Well, it's been a hell of a night. I just didn't know how tricky this part of HTML could be.
Sam Ruby: Copy and Paste, via Simon. I haven't had this problem because I do everything right (all Unicode, all the time). But I've seen other sites where this happens a lot, particularly Roger Simon's site....
Sam,
Just keep writing the blog and the book will write itself ;-)
Thanks for grappling with all this stuff and putting it online in such clear terms. One of these days I'm going to get my head around it, I just know I am.
Shouldn't Apache have noticed the http-equiv and returned the correct encoding? IIRC, http-equiv was created so that web servers would not require such global configuration changes, and would indicate the correct encoding per-resource.
Your efforts to describe and document problems with character encoding have been incredibly useful for me. Following your advice, I switched to using UTF-8 on all of my websites, and it has certainly helped. Despite that, I still get bitten by the odd character which causes my pages to barf when served as application/xhtml+xml.
Adam: where? On the meta element? No, this element is correct as coded. Check out google.com to see an instance where this works. Of course, there it isn't used to affect the rendering of the page; it is used in the hopes that it will influence the encoding used by the browser when submitting the data from the form.
Yet another case where standards are incomplete, inconsistent, and partially respected and partially ignored.
A nice and detailed explanation of meta tags with charset declarations, the HTTP Content-Type header with a charset declaration, and what browsers make of them. As I always say, the Web is a technical garbage dump that happens to...
I'd forgotten this little twist on charset until Gavin reminded me: probably BoingBoing shouldn't have removed the charset from the Content-type header, the solution they seem to have chosen, because they are sending gzipped content. Now they are at risk of having Moz get part of the content, unzip it, discover that it should be in a charset other than the one it expected from HTTP's default, and get into an argument with Apache over what the proper offset is for a re-request.
Dealing with any J2EE implementations and character encoding is an exercise in patience. For anyone struggling with that, I found that (logically when you think about it) any dynamic JSP includes must have the content-type set in the included JSP as they are compiled and executed independently from the JSP calling them. At least in JSP 1.1, which is what we were using.
As far as browsers go, just remember - the HTTP headers trump all. If you've set those, it doesn't matter what else you do. If you haven't set those, then you can concern yourself with the meta-tags.
Personally, I suggest going the HTTP header route. It's at least consistent and explicit.
I've been working on mod_blog for the past few days. Bob, who runs the Friday D&D game, has a site that used to use Blogger but due to reliability problems (as well as certain security issues related to FTP) I switched them over to mod_blog....
The ‘Coltrane’ release of Lotus Freelance Graphics, the 1998 version bundled with Lotus SmartSuite, was famously held up for a month when it was found out that one of the clip art images had a tiny 20-pixel image of Taiwanese currency (rather than...
I gave a presentation called I18n, M17n, Unicode, And All That at the recent 2006 RubyConf in Denver. This piece doesn’t duplicate this presentation; it outlines the problem, some conference conversation, and includes a couple of images that you...
I intend Foliomatic primarily for personal web sites such as this one. It should also be useful for projects and organizations whose Web presence is mostly static content, updated from time to time. It is not going to be a ...