Sam Ruby

Trackback in, valid out (mostly)

2004-06-28T21:49:50-04:00

Jacques Distler: You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.

It turns out that by design it is rather hard for a string of bytes to accidentally be valid utf-8, unless that string is pure US-ASCII, in which case it doesn't much matter which encoding you presume.

So, my current heuristics are as follows: if the data is valid utf-8, I accept it as such. If not, I assume windows-1252, and convert it to utf-8. This has failed me once, but my page is still valid.
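A minimal Python 3 sketch of that heuristic follows. The function name `to_utf8` is made up for illustration, and replacing the handful of bytes that windows-1252 leaves undefined (rather than rejecting the input) is an assumption, not something stated above.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw re-encoded as utf-8, guessing the source charset."""
    try:
        # Bytes that happen to be valid utf-8 are accepted as such.
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Otherwise assume windows-1252. A few byte values (0x81, 0x8D,
        # 0x8F, 0x90, 0x9D) are undefined there; replacing them is an
        # assumption made for this sketch.
        text = raw.decode("windows-1252", errors="replace")
    return text.encode("utf-8")
```

Pure US-ASCII input passes through the first branch unchanged, which matches the observation that the presumed encoding does not matter in that case.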

Note: neither windows-1252 nor iso-8859-1 guarantee well formed XML 1.0. There still is a nasty character range issue to deal with.

Trackback in, valid out (mostly)

2004-06-28T23:17:01-04:00

I don't know the subtleties of trackback, but the URL for the Korean weblog was legible (perhaps you've since hand-modified its encoding to make it legible). Could you attempt a GET on the URL, determine its encoding, and guess that the trackback uses the same encoding? I'm sure there's more to it than that, but is the idea even possible?

Trackback in, valid out (mostly)

2004-06-28T23:24:17-04:00

That (GET and guess) is probably what I'm going to start doing, but is there any reason why we are sending pings without a charset? It took me a line and a half to look up the charset in my config and set the header $req->content_type("application/x-www-form-urlencoded; charset=$charset");. Admittedly with my Perl, it's going to take me far more than that to parse it out of received pings, try to run through Text::Iconv, and either fall back to just the URL or refuse the ping if I can't convert it to something useful, but it still doesn't seem insurmountable.
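For comparison, here is a rough Python 3 sketch of the sending side. The helper name `build_ping` and the example URLs are hypothetical; the only idea taken from the comment above is putting a charset parameter on the content-type header of the form-urlencoded POST. The request is built but not sent.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_ping(ping_url, fields, charset="utf-8"):
    """Build (but do not send) a trackback POST that declares its charset."""
    # Percent-encode the field values using the declared charset.
    body = urlencode(fields, encoding=charset).encode("ascii")
    content_type = "application/x-www-form-urlencoded; charset=%s" % charset
    return Request(ping_url, data=body,
                   headers={"Content-Type": content_type}, method="POST")

# Hypothetical usage; urllib.request.urlopen(req) would actually send it.
req = build_ping("http://example.org/trackback/1",
                 {"url": "http://example.com/post", "title": "caf\u00e9"},
                 charset="iso-8859-1")
```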

Trackback in, valid out (mostly)

2004-06-29T01:21:22-04:00

Ah, that sounds like a nice solution, although I still think it's ugly. However, creating a function that takes the page charset and page content and outputs it in utf-8 is very hard, especially since not all pages declare a charset.

Trackback in, valid out (mostly)

2004-06-29T02:45:45-04:00

Anne van Kesteren: Trackback in, valid out (mostly) - or: "Trackback considered harmful"...

Trackback in, valid out (mostly)

2004-06-29T04:33:10-04:00

Should the Atom API possibly tidy up and release a new version of trackbacks and pingbacks that fixes all these issues, or should it bundle its own remote-comment mechanism?

It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK) should improve them to also include character encoding and other issues, or Atom should do something to either deprecate or improve them.

IMHO, of course.

Trackback and Encoding

2004-06-29T04:45:30-04:00

Sam Ruby has also noticed that Trackback has a few conceptual problems. Not only are there still clients that send trackbacks via GET request (removed from the specification in October 2002), the encoding causes problems too....

Trackback in, valid out (mostly)

2004-06-29T07:11:57-04:00

Asbjorn, although I'd tend to lean towards the re-use camp, Atom is surely about tight specs, which translates better to respecifying.

Trackback in, valid out (mostly)

2004-06-29T08:13:03-04:00

Isn't the Atom API already a remote-comment mechanism?

For example, check whether the links in a weblog entry are Atom-enabled and, if so, submit something using the Atom API to that site.

Trackback in, valid out (mostly)

2004-06-29T08:18:17-04:00

OK, so let me see if I got this right: I should take a look at the content-type on the trackback request on the off chance that Phil Ringnalda is not the only person in the universe who will provide charset information in this manner. If I don't miss my guess, Phil's charset is likely to be utf-8 or iso-8859-1.

Should I not find a charset, I should fetch the page to see what the content-type header says. When there is no charset specified in the header, as is the case in this case, I should ignore the default of us-ascii (or is it iso-8859-1, depending on what spec you read), and press on.

I can then utilize code such as the following to extract information that was originally meant for the server, but common practice indicates that it is interpreted as an override on the client side:

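# Python 2 (sgmllib and urllib.urlopen no longer exist in Python 3)
from sgmllib import SGMLParser
import urllib, codecs
tb_url='http://blog.webservices.or.kr/hollobit/archives/000561.html'

class httpequiv(SGMLParser):
  charset="iso-8859-1"  # fallback when the page declares nothing
  def start_meta(self, attrs):
    # look for <meta http-equiv="Content-Type" content="...; charset=...">
    attrs=dict([(x.upper(),y) for x,y in attrs])
    if attrs.get('HTTP-EQUIV','').upper() == 'CONTENT-TYPE':
      for param in attrs.get("CONTENT",'').split(';')[1:]:
        if '=' not in param: continue
        name,value = param.split('=',1)
        if name.strip().upper()=='CHARSET': self.charset=value.strip()

parser=httpequiv()
try:
  parser.feed(urllib.urlopen(tb_url).read())
except:
  pass  # network or parse failure: keep the iso-8859-1 fallback

# tb_excerpt holds the raw excerpt bytes received in the trackback ping.
# codecs.lookup returns (encoder, decoder, reader, writer) and raises
# LookupError when no codec is installed for the charset.
tb_excerpt=codecs.lookup(parser.charset)[1](tb_excerpt)[0]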

... only to find that the encoding in question, EUC-KR, does not have a corresponding Python codec installed on my host.

I could then try to return an error to the host, but I'm not sure that such things are looked at anyway. Such an error would likely be written in a language which is not the first language of the recipient, would be about an esoteric topic that confuses experts, and in any case, the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.

Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.
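The same idea can be sketched in Python 3 without touching the network. The names `MetaCharset`, `sniff_charset` and `decode_excerpt` are invented for this illustration; the page bytes are assumed to have been fetched already, and returning None stands in for "ignore the excerpt".

```python
from html.parser import HTMLParser

class MetaCharset(HTMLParser):
    """Pick up the charset from <meta http-equiv="Content-Type" ...>."""
    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset:
            return
        attrs = {name.lower(): (value or "") for name, value in attrs}
        if attrs.get("http-equiv", "").lower() != "content-type":
            return
        for param in attrs.get("content", "").split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset":
                self.charset = value.strip().strip("\"'")

def sniff_charset(page_bytes, default="iso-8859-1"):
    parser = MetaCharset()
    # The markup itself is ASCII-compatible, so a lossy decode is enough
    # to find the declaration.
    parser.feed(page_bytes.decode("ascii", errors="replace"))
    return parser.charset or default

def decode_excerpt(raw_excerpt, charset):
    """Return the decoded excerpt, or None if it should be dropped."""
    try:
        return raw_excerpt.decode(charset)
    except (LookupError, UnicodeDecodeError):
        # No codec for this charset, or bytes that do not fit it.
        return None
```

A caller would then render only the title and the [more] link whenever `decode_excerpt` returns None.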

Trackback in, valid out (mostly)

2004-06-29T08:49:28-04:00

Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.

And what character encoding would you use for the title?

Trackback in, valid out (mostly)

2004-06-29T08:51:40-04:00

It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK)

Pingback is by Stuart Langridge and Ian Hickson. [link]

Trackback in, valid out (mostly)

2004-06-29T09:05:40-04:00

And what character encoding would you use for the title?

Good point. I currently default the title based on the parent blog entry. If I don't understand the encoding, I probably shouldn't assume ASCII, utf-8, or iso-8859-1.

By the way, beyond looking at the content-type on the trackback request itself, I'm not yet convinced of the value of fetching the page. That would be trading one set of problems and assumptions for another.

Finally, I'll note that pingback does not have any translatable parameters on the request.

Trackback in, valid out (mostly)

2004-06-29T09:59:52-04:00

Hey Sam,
What's the story behind your favicon.ico? It's cool.
Christian

Knot Theory

2004-06-29T11:11:25-04:00

In response to a question from Christian Romney: this web page was the inspiration for my current favicon.ico....

Trackback in, valid out (mostly)

2004-06-29T14:17:56-04:00

Asbjørn, Pingback does not have this encoding flaw. It transfers URLs—not narrative text. I don’t think introducing yet another ping protocol (Atom ping or another incompatible revision of Trackback) is a good idea.

I suggest using Pingback (despite the XML-RPC overkill) and letting the recipient extract what the recipient chooses to extract from the alleged linking resource. (See a description of vaporware and an actual implementation.)

the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.

The page in question purports to be XML (XHTML). As per the XML spec, the two encodings XML recipients are guaranteed to support are UTF-8 and UTF-16. Anyone sending XML using any other encoding is risking the recipient being unable to decode the characters. If communication fails due to the use of a legacy encoding, I think the sender is to blame.

Trackback in, valid out (mostly)

2004-06-29T19:09:14-04:00

TrackBack definitely needs to be updated to better support i18n; it's something we just weren't aware enough of back when we wrote the spec. We've been talking about updating it for a while now, replacing the current form-urlencoded POST body with an Atom entry.

That would at least give servers the opportunity to do something intelligent with the content, whereas now they're just sort of lost. Of course, it still doesn't help everyone, particularly a lot of Movable Type users--only recent versions of Perl really have decent character encoding support.

By the way, don't think that we're not brutally aware of the lack of i18n support in TrackBack--we got our comeuppance when we had to do a lot of work in TypePad to try to guess the character encoding of an excerpt. :)

Trackback in, valid out (mostly)

2004-06-29T19:47:25-04:00

Ben: excellent!

Perhaps it is worth fleshing out Phil's suggestion first. The way I see it:

Clients MAY provide a charset in the content-type header of the POST request.
Servers are to assume iso-8859-1 if no charset is provided on the request.
If the server does not support the specified encoding, it should treat the title and excerpt as if they were missing.

Such an approach would allow an orderly upgrade of existing clients and servers.
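The three rules above could be sketched on the receiving side roughly as follows, in Python 3. The function names `request_charset` and `parse_ping` are hypothetical, and so is the choice to drop the text fields when the bytes do not decode cleanly in a supported charset; the proposal itself only covers the unsupported-encoding case.

```python
from email.message import Message
from urllib.parse import unquote_to_bytes

DEFAULT_CHARSET = "iso-8859-1"   # rule 2: assumed when none is given

def request_charset(content_type):
    """Rule 1: read the optional charset parameter from content-type."""
    msg = Message()
    if content_type:
        msg["Content-Type"] = content_type
    return msg.get_content_charset() or DEFAULT_CHARSET

def parse_ping(content_type, body):
    """Parse a form-urlencoded ping body (bytes) into a dict of strings."""
    charset = request_charset(content_type)
    raw = {}
    for pair in body.split(b"&"):
        name, _, value = pair.partition(b"=")
        raw[name.decode("ascii", "replace")] = unquote_to_bytes(
            value.replace(b"+", b" "))
    ping = {"url": raw.get("url", b"").decode("ascii", "replace")}
    try:
        for key in ("title", "excerpt"):
            if key in raw:
                ping[key] = raw[key].decode(charset)
    except (LookupError, UnicodeDecodeError):
        # Rule 3: unsupported encoding, so act as if both were missing.
        ping.pop("title", None)
        ping.pop("excerpt", None)
    return ping
```

Existing clients that send no charset keep working under the iso-8859-1 default, which is what makes the upgrade orderly.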

A regular expression to check for valid UTF-8

2004-06-29T20:15:26-04:00

This is great. A regular expression that allows you to check if text is valid UTF-8. Via Sam Ruby. I'd previously used a function I found in the PHP manual and reproduced here. I like the regular expression better for aesthetic reasons, because it...

Trackback in, valid out (mostly)

2004-06-29T22:35:49-04:00

Yes, please: even if MT 3.0Final only sends a charset, and doesn't do anything with it, we can start messing around with plugins that try to use it and fail reasonably, and see what will work for 3.n. I'll ping Matt about getting WordPress to send it, too (though I shudder to think about trying to re-encode in PHP; wonder if anyone's written a library class to try the three different sets of possible PHP extensions and the four different ways to try to get the OS to do it directly?).

Trackback in, valid out (mostly)

2004-06-30T00:56:31-04:00

WordPress has sent charset with trackbacks for a few weeks now, and it will be included in the 1.3 release. A few steps ahead of ya. ;)

Trackback in, valid out (mostly)

2004-06-30T00:59:24-04:00

Heh. I knew I should have kept looking through CVS (I couldn't remember where pings get sent) instead of just opening my beak and cheeping.

Trackback in, valid out (mostly)

2004-06-30T18:17:04-04:00

Note: neither windows-1252 nor iso-8859-1 guarantee well formed XML 1.0. There still is a nasty character range issue to deal with.

Good point. MTStripControlChars has been updated to deal with the offending byte ranges.

That should guarantee validity (if not sensible handling of trackbacks in charsets other than ISO-8859-1/Windows-1252).

MTStripControlChars

2004-06-30T18:50:49-04:00

Introducing the new, improved, MTStripControlChars plugin....

Trackback in, valid out (mostly)

2004-06-30T19:22:01-04:00

Let's see how your comment system handles some "illegal" input:

***

Will your Trackback system handle it any better?

Trackbacks and MTStripControlChars

2004-07-01T02:58:41-04:00

Just when I thought it was safe to enjoy Paris, a new version of MTStripControlChars is called for....

Trackback in, valid out (mostly)

2004-07-01T22:34:22-04:00

Will your Trackback system handle it any better?

It does now. ;-)

body=re.compile("[\x01-\x08\x0B\x0C\x0E-\x1F]").sub("*",body)
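A slightly broader variant, sketched in Python 3 on decoded text rather than raw bytes: besides the control characters in the line above, the XML 1.0 Char production also excludes U+0000, the surrogate range, and U+FFFE/U+FFFF. The names `XML10_ILLEGAL` and `xml_safe` are made up for this example, and substituting "*" simply mirrors the one-liner above.

```python
import re

# Code points that XML 1.0 does not allow in a document
# (tab, LF and CR remain legal).
XML10_ILLEGAL = re.compile(
    r"[\x00-\x08\x0B\x0C\x0E-\x1F\uD800-\uDFFF\uFFFE\uFFFF]")

def xml_safe(text, replacement="*"):
    """Replace characters that would make an XML 1.0 page ill-formed."""
    return XML10_ILLEGAL.sub(replacement, text)
```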

Look upon my works, ye mighty, and despair

2004-07-05T20:35:35-04:00

Why trackbacks will invalidate your page in a second, and some ideas of what to do about it....

Trackback in, valid out (mostly)

2004-12-31T04:15:58-05:00

Interesting discussion - the link above discusses how international characters in URLs are interpreted by a popular Wiki engine. First we try to recognise the URL as UTF-8 (taking care not to allow over-long UTF-8 encodings for security reasons), and convert from UTF-8 to native; if the URL doesn't match the 'valid UTF-8' regular expression, we just use it unconverted.
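A byte-level pattern of the kind described might look like the sketch below (Python 3). It follows a widely circulated form of the "valid UTF-8" regular expression and is not necessarily the exact pattern that engine uses; the names `VALID_UTF8` and `looks_like_utf8` are invented here.

```python
import re

VALID_UTF8 = re.compile(rb"""
    (?:
        [\x00-\x7F]                        # ASCII
      | [\xC2-\xDF][\x80-\xBF]             # 2-byte, no over-long C0/C1 lead
      | \xE0[\xA0-\xBF][\x80-\xBF]         # 3-byte, excluding over-longs
      | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # straight 3-byte
      | \xED[\x80-\x9F][\x80-\xBF]         # 3-byte, excluding surrogates
      | \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
      | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
      | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )*
""", re.VERBOSE)

def looks_like_utf8(raw: bytes) -> bool:
    """True if every byte of raw is part of a well-formed UTF-8 sequence."""
    return VALID_UTF8.fullmatch(raw) is not None
```

The over-long exclusions are what reject inputs such as the two-byte sequence C0 AF, an alternate spelling of "/" that a lenient decoder could let slip past a path check.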

Internationalization and Trackbacks

2005-02-18T23:12:16-05:00

In which our hero gets fed up with gibberish Trackbacks and takes matters in his own hands....

Jacques Distler

2005-03-13T07:53:45-05:00

Unicode codepoints 0x0-0x08, 0x0B, 0x0C, 0x0E-0x1F are problematic, whether you use NCRs, the corresponding iso-8859-1 characters or utf-8 or .... If you don’t filter these characters out of your utf-8 input, you are in just as much trouble as if you...

Trackback in, valid out (mostly)

2005-12-01T06:52:02-05:00

Dumb question: What piece of software are you using to convert data?

Trackback in, valid out (mostly)

2005-12-01T07:25:47-05:00

William: my own code, written in Python. In this case, the relevant code is at the top of post.py:

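# Python 2: cgi.FieldStorage parses the POSTed form fields
import cgi
fs = cgi.FieldStorage()
# charset parameter of the request's content-type header, defaulting to utf-8
charset=cgi.parse_header(fs.headers['content-type'])[1].get('charset','utf-8')
def param(key):
  value=(fs.list and fs.has_key(key) and fs.getvalue(key)) or ''
  try:
    return unicode(value,charset)
  except:
    # unknown codec or undecodable bytes: hand back the raw value
    return value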