Jacques Distler: You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.
It turns out that by design it is rather hard for a string of bytes to accidentally be valid utf-8, unless that string is pure US-ASCII, in which case it doesn't much matter which encoding you presume.
So, my current heuristics are as follows: if the data is valid utf-8, I accept it as such. If not, I assume windows-1252, and convert it to utf-8. This has failed me once, but my page is still valid.
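A minimal sketch of that heuristic in Python 3, for illustration only. The function name `to_utf8` is hypothetical, and replacing undefined windows-1252 bytes with U+FFFD is an assumption, not something the heuristic above specifies.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8 bytes, guessing the source charset."""
    try:
        # Strict decoding raises on any byte sequence that is not valid UTF-8.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Assumption: anything that is not valid UTF-8 is windows-1252.
        # A few byte values are undefined in Python's cp1252 codec, so they
        # are replaced with U+FFFD rather than raising a second error.
        return raw.decode("cp1252", errors="replace").encode("utf-8")
```

Pure US-ASCII input passes through unchanged, since it is already valid UTF-8.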
Note: neither windows-1252 nor iso-8859-1 guarantee well formed XML 1.0. There still is a nasty character range issue to deal with.
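To make the character range issue concrete, here is a small Python 3 sketch that looks for characters outside the XML 1.0 `Char` production. The helper name `first_illegal_xml_char` is hypothetical.

```python
import re

# Anything NOT matched by the XML 1.0 "Char" production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def first_illegal_xml_char(text: str):
    """Return the first character XML 1.0 forbids, or None if there is none."""
    match = _NOT_XML_CHAR.search(text)
    return match.group(0) if match else None
```

A byte such as 0x0B decodes without complaint as windows-1252 or iso-8859-1, yet the resulting character is still not allowed in an XML 1.0 document.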
Should the Atom API possibly tidy up and release a new version of trackbacks and pingbacks that fixes all these issues, or should it bundle its own remote-comment mechanism?
It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK) should improve them to also include character encoding and other issues, or Atom should do something to either deprecate or improve them.
IMHO, of course.
Isn't the Atom API already a remote-comment mechanism?
For example, check whether the links in a weblog entry are Atom-enabled; if so, submit something to that site using the Atom API.
OK, so let me see if I got this right: I should take a look at the content-type on the trackback request on the off chance that Phil Ringnalda is not the only person in the universe who will provide charset information in this manner. If I don't miss my guess, Phil's charset is likely to be utf-8 or iso-8859-1.
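Reading the charset off the request's content-type header might look like the following Python 3 sketch. The function name `charset_from_content_type` is hypothetical; the parsing is delegated to the standard library's `email.message.Message`.

```python
from email.message import Message

def charset_from_content_type(header_value: str, default=None):
    """Return the lowercased charset parameter of a Content-Type value."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() returns `default` when no charset parameter exists.
    return msg.get_content_charset(default)
```

Returning `None` by default leaves the "no charset declared" case visible to the caller instead of silently picking an encoding.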
Should I not find a charset, I should fetch the page to see what the content-type header says. When there is no charset specified in the header, as happens here, I should ignore the default of us-ascii (or iso-8859-1, depending on which spec you read), and press on.
I can then use code such as the following to extract information that was originally meant for the server but that, by common practice, is interpreted as an override on the client side:
    # Python 2 code: sgmllib and urllib.urlopen do not exist in Python 3.
    from sgmllib import SGMLParser
    import urllib, codecs

    tb_url='http://blog.webservices.or.kr/hollobit/archives/000561.html'

    class httpequiv(SGMLParser):
      # default used when the page has no <meta http-equiv="Content-Type">
      charset="iso-8859-1"
      def start_meta(self, attrs):
        attrs=dict([(x.upper(),y) for x,y in attrs])
        if attrs.get('HTTP-EQUIV','').upper() == 'CONTENT-TYPE':
          # skip the media type, then look for a charset parameter
          for param in attrs.get("CONTENT",'').split(';')[1:]:
            name,value = param.split('=',1)
            if name.strip().upper()=='CHARSET': self.charset=value.strip()

    parser=httpequiv()
    try:
      parser.feed(urllib.urlopen(tb_url).read())
    except:
      pass

    # decode() returns a (text, length) pair; keep only the text
    tb_excerpt=codecs.lookup(parser.charset).decode(tb_excerpt)[0]
... only to find that the encoding in question, EUC-KR, does not have a corresponding python codec installed on my host.
I could then try to return an error to the host, but I'm not sure that such errors are looked at anyway. Such an error would likely be written in a language which is not the first language of the recipient, and would be about an esoteric topic that confuses experts. In any case, the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.
Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.
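A Python 3 sketch of that fallback, under the assumption that "ignore the excerpt" means returning `None` so the caller shows only the title and a [more] link. The function name `decode_excerpt` is hypothetical.

```python
import codecs

def decode_excerpt(raw: bytes, charset: str):
    """Decode a trackback excerpt, or return None if it cannot be decoded."""
    try:
        codec = codecs.lookup(charset)
    except LookupError:
        # No codec with that name is installed on this host.
        return None
    try:
        text, _consumed = codec.decode(raw)
    except UnicodeDecodeError:
        # The bytes do not match the declared charset.
        return None
    return text
```

Treating a missing codec and undecodable bytes the same way keeps the receiving side simple: in both cases the excerpt is dropped and nothing is guessed.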
Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.
And what character encoding would you use for the title?
It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK)
Pingback is by Stuart Langridge and Ian Hickson. [link]
And what character encoding would you use for the title?
Good point. I currently default the title based on the parent blog entry. If I don't understand the encoding, I probably shouldn't assume ASCII, utf-8, or iso-8859-1.
By the way, beyond looking at the content-type on the trackback request itself, I'm not yet convinced of the value of fetching the page. That would be trading one set of problems and assumptions for another.
Finally, I'll note that pingback does not have any translatable parameters on the request.
Asbjørn, Pingback does not have this encoding flaw. It transfers URLs—not narrative text. I don’t think introducing yet another ping protocol (Atom ping or another incompatible revision of Trackback) is a good idea.
I suggest using Pingback (despite the XML-RPC overkill) and letting the recipient extract what the recipient chooses to extract from the alleged linking resource. (See a description of vaporware and an actual implementation.)
the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.
The page in question purports to be XML (XHTML). As per the XML spec, the two encodings XML recipients are guaranteed to support are UTF-8 and UTF-16. Anyone sending XML using any other encoding is risking the recipient being unable to decode the characters. If communication fails due to the use of a legacy encoding, I think the sender is to blame.
TrackBack definitely needs to be updated to better support i18n; it's something we just weren't aware enough of back when we wrote the spec. We've been talking about updating it for a while now, replacing the current form-urlencoded POST body with an Atom entry.
That would at least give servers the opportunity to do something intelligent with the content, whereas now they're just sort of lost. Of course, it still doesn't help everyone, particularly a lot of Movable Type users--only recent versions of Perl really have decent character encoding support.
By the way, don't think that we're not brutally aware of the lack of i18n support in TrackBack--we got our comeuppance when we had to do a lot of work in TypePad to try to guess the character encoding of an excerpt. :)
Ben: excellent!
Perhaps it is worth fleshing out Phil's suggestion first. The way I see it:
Such an approach would allow an orderly upgrade of existing clients and servers.
Note: neither windows-1252 nor iso-8859-1 guarantee well formed XML 1.0. There still is a nasty character range issue to deal with.
Good point. MTStripControlChars has been updated to deal with the offending byte ranges.
That should guarantee validity (if not sensible handling of trackbacks in charsets other than ISO-8859-1/Windows-1252).
Let's see how your comment system handles some "illegal" input:
***
Will your Trackback system handle it any better?
Will your Trackback system handle it any better?
It does now. ;-)
body=re.compile("[\x01-\x08\x0B\x0C\x0E-\x1F]").sub("*",body)
William: my own code, written in Python. In this case, the relevant code is at the top of post.py:
    fs = cgi.FieldStorage()
    charset=cgi.parse_header(fs.headers['content-type'])[1].get('charset','utf-8')

    def param(key):
      value=(fs.list and fs.has_key(key) and fs.getvalue(key)) or ''
      try:
        return unicode(value,charset)
      except:
        return value