Jacques Distler: You gonna turn off Trackbacks (which don't
declare a charset, and could be sent in any charset imaginable, but
very frequently are Windows-1252)? Unless you have a way to guess
the charset and re-encode the result to UTF-8, they will invalidate
your pages as quick as you can sneeze.
It turns out that by design it is rather hard for a string of
bytes to accidentally be
valid utf-8, unless that string is pure US-ASCII, in which case
it doesn't much matter which encoding you presume.
So, my current heuristics are as follows: if the data is valid
utf-8, I accept it as such. If not, I assume windows-1252,
and convert it to utf-8. This has
failed me once, but my page is still
valid.
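A minimal sketch of that heuristic, written for Python 3 rather than the Python of the original post. The function name `to_utf8_text` is made up for illustration, and mapping undefined windows-1252 bytes to U+FFFD is an assumption, not something the post specifies.

```python
def to_utf8_text(data: bytes) -> str:
    """Decode as UTF-8 if the bytes are valid UTF-8, else assume windows-1252."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        # Python's cp1252 codec leaves five bytes (0x81, 0x8D, 0x8F, 0x90, 0x9D)
        # undefined; errors="replace" maps them to U+FFFD instead of raising.
        return data.decode("cp1252", errors="replace")


if __name__ == "__main__":
    print(to_utf8_text(b"caf\xc3\xa9"))  # valid UTF-8, accepted as such
    print(to_utf8_text(b"caf\xe9"))      # not valid UTF-8, read as windows-1252
```

Calling `.encode("utf-8")` on the result gives the re-encoded bytes for the page.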
Note: neither windows-1252 nor iso-8859-1 guarantees well-formed
XML 1.0. There is still a nasty character range issue to deal with.
I don't know the subtleties of trackback, but the URL for the Korean weblog was legible (perhaps you've since hand-modified its encoding to make it legible). Could you attempt a GET on the URL, determine its encoding, and guess that the trackback uses the same encoding? I'm sure there's more to it than that, but is the idea even possible?
That (GET and guess) is probably what I'm going to start doing, but, is there any reason why we are sending pings without a charset? It took me a line and a half to look up the charset in my config and set the header $req->content_type("application/x-www-form-urlencoded; charset=$charset");. Admittedly with my Perl, it's going to take me far more than that to parse it out of received pings, try to run through Text::Iconv, and either fall back to just the URL or refuse the ping if I can't convert it to something useful, but it still doesn't seem unsurmountable.
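For comparison, here is a rough Python 3 sketch of both halves: labeling an outgoing ping with its charset, and reading the charset back out of a received Content-Type header. The helper names `build_ping` and `charset_of` are hypothetical, and nothing here actually sends a request.

```python
from email.message import Message
from urllib.parse import urlencode


def build_ping(fields: dict, charset: str = "utf-8"):
    """Return (body, content_type) for a form-urlencoded ping in `charset`."""
    # urlencode percent-encodes each value using the given charset.
    body = urlencode(fields, encoding=charset).encode("ascii")
    content_type = f"application/x-www-form-urlencoded; charset={charset}"
    return body, content_type


def charset_of(content_type: str, default=None):
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value; it returns `default`
    # when the header carries no charset parameter.
    return msg.get_content_charset(default)


if __name__ == "__main__":
    body, ctype = build_ping({"title": "caf\u00e9"}, "iso-8859-1")
    print(body, ctype, charset_of(ctype))
```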
Ah, that sounds like a nice solution, although I still think it's ugly. However, creating a function that takes the page charset and the page content and outputs it in utf-8 is very hard, especially since not all pages declare a charset.
Should the Atom API possibly tidy up and release a new version of trackbacks and pingbacks that fixes all these issues, or should it bundle its own remote-comment mechanism?
It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK) should improve them to also include character encoding and other issues, or Atom should do something to either deprecate or improve them.
Sam Ruby has also noticed that Trackback has a few conceptual problems. Not only are there still clients that send Trackbacks via GET request (removed from the specification in October 2002), but the encoding causes problems too....
OK, so let me see if I got this right: I should take a look at the content-type on the trackback request on the off chance that Phil Ringnalda is not the only person in the universe who will provide charset information in this manner. If I don't miss my guess, Phil's charset is likely to be utf-8 or iso-8859-1.
Should I not find a charset, I should fetch the page to see what the content-type header says. When there is no charset specified in the header, as is the case here, I should ignore the default of us-ascii (or is it iso-8859-1, depending on what spec you read), and press on.
I can then utilize code such as the following to extract information that was originally meant for the server, but common practice indicates that it is interpreted as an override on the client side:
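from sgmllib import SGMLParser
import urllib, codecs
tb_url='http://blog.webservices.or.kr/hollobit/archives/000561.html'
class httpequiv(SGMLParser):
    # default used when the page declares no charset
    charset="iso-8859-1"
    def start_meta(self, attrs):
        attrs=dict([(x.upper(),y) for x,y in attrs])
        if attrs.get('HTTP-EQUIV','').upper() == 'CONTENT-TYPE':
            # skip the media type and scan the parameters for charset
            for param in attrs.get("CONTENT",'').split(';')[1:]:
                name,value = param.split('=',1)
                if name.strip().upper()=='CHARSET': self.charset=value.strip()
parser=httpequiv()
try:
    parser.feed(urllib.urlopen(tb_url).read())
except:
    pass
# tb_excerpt holds the raw excerpt bytes taken from the trackback request.
# getdecoder raises LookupError for an unknown codec; the decoder itself
# returns a (text, bytes_consumed) pair.
tb_excerpt=codecs.getdecoder(parser.charset)(tb_excerpt)[0]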
... only to find that the encoding in question, EUC-KR, does not have a corresponding python codec installed on my host.
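Since sgmllib was removed in Python 3, a rough modern equivalent of the meta-tag sniffing might look like the sketch below. The names `MetaCharset` and `sniff_charset` are invented for illustration, and the iso-8859-1 default simply mirrors the snippet above.

```python
from html.parser import HTMLParser


class MetaCharset(HTMLParser):
    """Record the charset from <meta http-equiv="Content-Type" content="...">."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        # HTMLParser already lowercases names; values may be None.
        attrs = {name.lower(): (value or "") for name, value in attrs}
        if attrs.get("http-equiv", "").lower() != "content-type":
            return
        # Skip the media type and scan the parameters for charset.
        for param in attrs.get("content", "").split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                self.charset = value.strip().strip("\"'")


def sniff_charset(html_bytes: bytes, default: str = "iso-8859-1") -> str:
    parser = MetaCharset()
    # The meta tag itself is ASCII, so a lossy ASCII decode is enough to find it.
    parser.feed(html_bytes.decode("ascii", errors="replace"))
    return parser.charset or default


if __name__ == "__main__":
    page = (b'<html><head><meta http-equiv="Content-Type" '
            b'content="text/html; charset=EUC-KR" /></head></html>')
    print(sniff_charset(page))
```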
I could then try to return an error to the host, but I'm not sure that such things are looked at anyway. Such an error would likely be written in a language that is not the recipient's first language, and would be about an esoteric topic that confuses experts. In any case, the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.
Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.
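That fallback could be sketched roughly as follows in Python 3. The names `decode_or_drop` and `render_ping` are hypothetical, and the title is assumed to be a string the receiving side already trusts (for instance, one derived locally rather than taken from the ping).

```python
def decode_or_drop(raw: bytes, charset: str):
    """Return the decoded excerpt, or None if it cannot be decoded."""
    try:
        return raw.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # LookupError: no codec by that name is installed.
        # UnicodeDecodeError: the bytes are not valid in the declared charset.
        return None


def render_ping(title: str, raw_excerpt: bytes, charset: str) -> str:
    """Show title, excerpt, and a [more] link; omit an unreadable excerpt."""
    excerpt = decode_or_drop(raw_excerpt, charset)
    if excerpt is None:
        return f"{title} [more]"
    return f"{title}: {excerpt} [more]"


if __name__ == "__main__":
    print(render_ping("Some entry", b"\xff\xfe", "utf-8"))
    print(render_ping("Some entry", b"caf\xc3\xa9", "utf-8"))
```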
And what character encoding would you use for the title?
Good point. I currently default title based on the parent blog entry. If I don't understand the encoding, I probably shouldn't assume ASCII, utf-8, or iso-8859-1.
By the way, beyond looking at the content-type on the trackback request itself, I'm not yet convinced of the value of fetching the page. That would be trading one set of problems and assumptions for another.
Finally, I'll note that pingback does not have any translatable parameters on the request.
Asbjørn, Pingback does not have this encoding flaw. It transfers URLs—not narrative text. I don’t think introducing yet another ping protocol (Atom ping or another incompatible revision of Trackback) is a good idea.
I suggest using Pingback (despite the XML-RPC overkill) and letting the recipient extract what the recipient chooses to extract from the alleged linking resource. (See a description of vaporware and an actual implementation.)
the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.
The page in question purports to be XML (XHTML). As per the XML spec, the two encodings XML recipients are guaranteed to support are UTF-8 and UTF-16. Anyone sending XML using any other encoding is risking the recipient being unable to decode the characters. If communication fails due to the use of a legacy encoding, I think the sender is to blame.
TrackBack definitely needs to be updated to better support i18n; it's something we just weren't aware enough of back when we wrote the spec. We've been talking about updating it for a while now, replacing the current form-urlencoded POST body with an Atom entry.
That would at least give servers the opportunity to do something intelligent with the content, whereas now they're just sort of lost. Of course, it still doesn't help everyone, particularly a lot of Movable Type users--only recent versions of Perl really have decent character encoding support.
By the way, don't think that we're not brutally aware of the lack of i18n support in TrackBack--we got our comeuppance when we had to do a lot of work in TypePad to try to guess the character encoding of an excerpt. :)
This is great. A regular expression that allows you to check if text is valid UTF-8. Via Sam Ruby. I'd previously used a function I found in the PHP manual and reproduced here. I like the regular expression better for aesthetic reasons, because it...
Yes, please: even if MT 3.0Final only sends a charset, and doesn't do anything with it, we can start messing around with plugins that try to use it and fail reasonably, and see what will work for 3.n. I'll ping Matt about getting WordPress to send it, too (though I shudder to think about trying to re-encode in PHP; wonder if anyone's written a library class to try the three different sets of possible PHP extensions and the four different ways to try to get the OS to do it directly?).
Interesting discussion - the link above discusses how international characters in URLs are interpreted by a popular Wiki engine. First we try to recognise the URL as UTF-8 (taking care not to allow over-long UTF-8 encodings for security reasons), and convert from UTF-8 to native; if the URL doesn't match the 'valid UTF-8' regular expression, we just use it unconverted.
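As an illustration of the kind of regular expression being discussed, here is a Python 3 sketch that rejects over-long encodings and surrogates. It is not necessarily the exact expression linked above; the byte ranges follow the UTF-8 definition, and the name `is_valid_utf8` is made up here.

```python
import re

# One alternative per legal UTF-8 byte sequence shape.
VALID_UTF8 = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte; C0 and C1 would be over-long
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding over-long forms
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3, excluding over-long forms
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16, nothing above U+10FFFF
    )*
""", re.VERBOSE)


def is_valid_utf8(data: bytes) -> bool:
    return VALID_UTF8.fullmatch(data) is not None


if __name__ == "__main__":
    print(is_valid_utf8("caf\u00e9".encode("utf-8")))  # well-formed
    print(is_valid_utf8(b"\xc0\xaf"))                  # over-long encoding of "/"
```

In Python 3 a plain `data.decode("utf-8")` inside a try block performs the same check; the regular expression form is mainly useful in languages without a strict decoder.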
Unicode codepoints 0x00-0x08, 0x0B, 0x0C, 0x0E-0x1F are problematic, whether you use NCRs, the corresponding iso-8859-1 characters or utf-8 or .... If you don’t filter these characters out of your utf-8 input, you are in just as much trouble as if you...
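A small Python 3 sketch of such a filter, covering only the control-character range named above. The name `strip_xml_illegal` is hypothetical, and silently dropping the characters (rather than replacing them) is an assumption.

```python
import re

# C0 control characters that XML 1.0 does not allow; tab (0x09),
# line feed (0x0A) and carriage return (0x0D) are legal and kept.
XML_ILLEGAL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def strip_xml_illegal(text: str) -> str:
    """Remove the control characters that would make XML 1.0 ill-formed."""
    return XML_ILLEGAL.sub("", text)


if __name__ == "__main__":
    print(repr(strip_xml_illegal("ok\x00 \x0bstill\tok\n")))
```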
William: my own code, written in Python. In this case, the relevant code is at the top of post.py:
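import cgi
fs = cgi.FieldStorage()
# charset from the request's content-type header, defaulting to utf-8
charset=cgi.parse_header(fs.headers['content-type'])[1].get('charset','utf-8')
def param(key):
    value=(fs.list and fs.has_key(key) and fs.getvalue(key)) or ''
    try:
        return unicode(value,charset)
    except:
        # unknown codec or undecodable bytes: return the raw value
        return value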
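The snippet above is Python 2. A rough Python 3 counterpart, working on the raw POST body instead of `cgi.FieldStorage`, might look like this. The function name `decode_params` is made up, the body is assumed to be pure ASCII as form-urlencoded data should be, and the latin-1 fallback is a stand-in for the original's "return the raw value" branch, since latin-1 maps every byte to one character.

```python
from urllib.parse import parse_qs


def decode_params(body: bytes, charset: str = "utf-8") -> dict:
    """Decode a form-urlencoded body using the charset the sender declared."""
    raw = body.decode("ascii")  # assumption: non-ASCII bytes are percent-encoded
    try:
        parsed = parse_qs(raw, keep_blank_values=True,
                          encoding=charset, errors="strict")
    except (LookupError, UnicodeDecodeError):
        # Unknown codec or undecodable bytes: keep the bytes one-to-one.
        parsed = parse_qs(raw, keep_blank_values=True, encoding="latin-1")
    # Keep only the first value for each key.
    return {key: values[0] for key, values in parsed.items()}


if __name__ == "__main__":
    print(decode_params(b"title=caf%C3%A9&url=http%3A%2F%2Fexample.org%2F"))
    print(decode_params(b"title=caf%E9", "iso-8859-1"))
```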