Trackback in, valid out (mostly)

2004-06-29T01:49:50Z

Jacques Distler: You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.

It turns out that by design it is rather hard for a string of bytes to accidentally be valid utf-8, unless that string is pure US-ASCII, in which case it doesn't much matter which encoding you presume.

So, my current heuristics are as follows: if the data is valid utf-8, I accept it as such. If not, I assume windows-1252 and convert it to utf-8. This has failed me once, but my page is still valid.
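A minimal sketch of that heuristic in Python, for illustration only. The function name `to_utf8` is hypothetical, and replacing the few bytes that windows-1252 leaves undefined with U+FFFD is my assumption, not something the heuristic above specifies.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as utf-8, guessing the source charset."""
    try:
        # Non-ASCII text in a legacy charset rarely decodes as valid utf-8,
        # so a clean decode is strong evidence the data already is utf-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Fall back to windows-1252. Python's cp1252 codec leaves a handful
        # of bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D) undefined; errors="replace"
        # turns those into U+FFFD instead of raising (an assumption here).
        text = raw.decode("cp1252", errors="replace")
        return text.encode("utf-8")
```

Pure US-ASCII input passes through the first branch unchanged, which matches the observation that the presumed encoding does not much matter in that case.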

Note: neither windows-1252 nor iso-8859-1 guarantees well-formed XML 1.0. There is still a nasty character range issue to deal with.
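One way to deal with the range issue, sketched in Python. The XML 1.0 `Char` production allows only tab, line feed, carriage return, and the ranges U+0020 to U+D7FF, U+E000 to U+FFFD, and U+10000 to U+10FFFF, so a control byte such as 0x0C decodes cleanly from either legacy charset and still breaks the document. The function name `strip_illegal_xml_chars` and the choice to substitute U+FFFD rather than drop the character are assumptions for illustration.

```python
import re

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_ILLEGAL_XML_CHARS = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def strip_illegal_xml_chars(text: str, replacement: str = "\uFFFD") -> str:
    """Replace characters that XML 1.0 forbids, even as character references."""
    return _ILLEGAL_XML_CHARS.sub(replacement, text)
```

For example, `strip_illegal_xml_chars(b"a\x0cb".decode("cp1252"))` yields the string `a`, U+FFFD, `b`: the form feed decoded without complaint but is not a legal XML 1.0 character.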