It’s just data

Trackback in, valid out (mostly)

Jacques Distler: You gonna turn off Trackbacks (which don't declare a charset, and could be sent in any charset imaginable, but very frequently are Windows-1252)? Unless you have a way to guess the charset and re-encode the result to UTF-8, they will invalidate your pages as quick as you can sneeze.

It turns out that by design it is rather hard for a string of bytes to accidentally be valid utf-8, unless that string is pure US-ASCII, in which case it doesn't much matter which encoding you presume.

So, my current heuristics are as follows: if the data is valid utf-8, I accept it as such. If not, I assume windows-1252 and convert it to utf-8. This has failed me once, but my page is still valid.
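A minimal sketch of that heuristic in present-day Python. The function name `to_utf8` is made up for illustration, and passing `errors="replace"` is an assumption added here to cope with the few byte values that windows-1252 leaves undefined.

```python
def to_utf8(raw: bytes) -> bytes:
    """Return raw as UTF-8 bytes, guessing windows-1252 when it is not already UTF-8."""
    try:
        raw.decode("utf-8")   # strict by default: raises on any malformed sequence
        return raw            # already valid UTF-8 (pure US-ASCII lands here too)
    except UnicodeDecodeError:
        # Assumed fallback: read the bytes as windows-1252. Bytes that
        # windows-1252 leaves undefined become U+FFFD instead of raising.
        return raw.decode("cp1252", errors="replace").encode("utf-8")
```

For example, `to_utf8(b"caf\xe9")` takes the windows-1252 branch, because a lone 0xE9 byte is not a complete UTF-8 sequence.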

Note: neither windows-1252 nor iso-8859-1 guarantees well-formed XML 1.0. There is still a nasty character range issue to deal with.
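To make the character range issue concrete, here is a small sketch that replaces anything outside the XML 1.0 Char production. The helper name `xml_safe` and the choice of "*" as the replacement are assumptions for illustration.

```python
import re

# Everything outside the XML 1.0 Char production:
#   #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
_NOT_XML_CHAR = re.compile(
    "[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)

def xml_safe(text: str, replacement: str = "*") -> str:
    """Replace characters that may not appear in an XML 1.0 document."""
    return _NOT_XML_CHAR.sub(replacement, text)
```

Decoding the byte 0x01 as windows-1252 or iso-8859-1 yields U+0001, which is why a successful charset conversion alone does not make the result safe to embed in XML.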


I don't know the subtleties of trackback, but the URL for the Korean weblog was legible (perhaps you've since hand-modified its encoding to make it legible).  Could you attempt a GET on the URL, determine its encoding, and guess that the trackback uses the same encoding?  I'm sure there's more to it than that, but is the idea even possible?

Posted by Tim at

That (GET and guess) is probably what I'm going to start doing, but is there any reason why we are sending pings without a charset? It took me a line and a half to look up the charset in my config and set the header with $req->content_type("application/x-www-form-urlencoded; charset=$charset"). Admittedly, with my Perl, it's going to take me far more than that to parse it out of received pings, try to run it through Text::Iconv, and either fall back to just the URL or refuse the ping if I can't convert it to something useful, but it still doesn't seem unsurmountable.
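For comparison, a rough Python sketch of the sending side. The function name `build_ping` and the example URLs are hypothetical; the only point is that the form body and the Content-Type header are produced from the same charset.

```python
from urllib.parse import urlencode
from urllib.request import Request

def build_ping(ping_url: str, fields: dict, charset: str = "utf-8") -> Request:
    """Build (but do not send) a form-encoded POST that declares its charset."""
    # Percent-encode the field values using the same charset we declare below.
    body = urlencode(fields, encoding=charset).encode("ascii")
    headers = {
        "Content-Type": f"application/x-www-form-urlencoded; charset={charset}"
    }
    return Request(ping_url, data=body, headers=headers, method="POST")

# Hypothetical ping; passing it to urllib.request.urlopen() would send it.
ping = build_ping(
    "http://example.org/tb/1",
    {"title": "caf\u00e9", "url": "http://example.com/post"},
    charset="iso-8859-1",
)
```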

Posted by Phil Ringnalda at

Ah, that sounds like a nice solution, although I still think it's ugly. However, creating a function that takes the page charset and the page content and outputs it in utf-8 is very hard, especially since not all pages declare a charset.

Posted by Anne at

Anne van Kesteren : Trackback in, valid out (mostly) - or: "Trackback considered harmful"...

Excerpt from HotLinks - Level 1 at

Should the Atom API possibly tidy up and release a new version of trackbacks and pingbacks that fixes all these issues, or should it bundle its own remote-comment mechanism?

It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK) should improve them to also include character encoding and other issues, or Atom should do something to either deprecate or improve them.

IMHO, of course.

Posted by Asbjørn Ulsberg at

Trackback and Encoding

Sam Ruby has also noticed that Trackback has a few conceptual problems. Not only are there still clients that send trackbacks via GET request (removed from the specification in October 2002), the encoding causes problems too....

Excerpt from $PLOG->read(); at

Asbjorn, although I'd tend to lean towards the re-use camp, Atom is surely about tight specs, which translates better to respecifying.

Posted by Randy Charles Mørin at

Isn't the Atom API already a remote-comment mechanism?

For example, check whether the links in a weblog entry are Atom-enabled; if so, submit something to that site using the Atom API.

Posted by Peter Winnberg at

OK, so let me see if I got this right: I should take a look at the content-type on the trackback request on the off chance that Phil Ringnalda is not the only person in the universe who will provide charset information in this manner.  If I don't miss my guess, Phil's charset is likely to be utf-8 or iso-8859-1.

Should I not find a charset, I should fetch the page to see what the content-type header says.  When there is no charset specified in the header, as is the case in this case, I should ignore the default of us-ascii (or is it iso-8859-1, depending on what spec you read), and press on.
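The lookup order described so far (the ping's own Content-Type first, then the fetched page's) might be sketched like this in present-day Python. The function names are hypothetical, and the standard library's `email.message.Message` is used here only as a convenient header-parameter parser.

```python
from email.message import Message
from typing import Optional

def charset_of(content_type: Optional[str]) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None if absent."""
    if not content_type:
        return None
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()  # lower-cased; None when no charset is given

def pick_charset(ping_content_type: Optional[str],
                 page_content_type: Optional[str]) -> Optional[str]:
    """Prefer the charset declared on the ping, then the one on the page."""
    return charset_of(ping_content_type) or charset_of(page_content_type)
```

A result of `None` means neither header said anything, which is the case where the us-ascii or iso-8859-1 defaults would otherwise apply.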

I can then utilize code such as the following to extract information that was originally meant for the server, but common practice indicates that it is interpreted as an override on the client side:

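# Python 2 (sgmllib and urllib.urlopen)
from sgmllib import SGMLParser
import urllib, codecs
tb_url='http://blog.webservices.or.kr/hollobit/archives/000561.html'

class httpequiv(SGMLParser):
  # default when the page has no <meta http-equiv="Content-Type"> charset
  charset="iso-8859-1"
  def start_meta(self, attrs):
    attrs=dict([(x.upper(),y) for x,y in attrs])
    if attrs.get('HTTP-EQUIV','').upper() == 'CONTENT-TYPE':
      # skip the media type, then look for a charset=... parameter
      for param in attrs.get("CONTENT",'').split(';')[1:]:
        name,value = param.split('=',1)
        if name.strip().upper()=='CHARSET': self.charset=value.strip()

parser=httpequiv()
try:
  parser.feed(urllib.urlopen(tb_url).read())
except Exception:
  pass  # best effort: keep whatever charset was found before the failure

# tb_excerpt is the raw excerpt received in the ping; decode() returns (text, length)
tb_excerpt=codecs.lookup(parser.charset).decode(tb_excerpt)[0]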

... only to find that the encoding in question, EUC-KR, does not have a corresponding python codec installed on my host.

I could then try to return an error to the host, but I'm not sure that such things are looked at anyway. Such an error would likely be written in a language which is not the first language of the recipient, and would be about an esoteric topic that confuses experts. In any case, the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.

Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.
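A sketch of that fallback in present-day Python, where `html.parser` stands in for `sgmllib`. The names `MetaCharset` and `decode_excerpt` are hypothetical, and the page is assumed to have been fetched already and passed in as text. A return value of `None` means "drop the excerpt".

```python
import codecs
from html.parser import HTMLParser
from typing import Optional

class MetaCharset(HTMLParser):
    """Remember the charset from <meta http-equiv="Content-Type" content="...">."""
    charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = {name.lower(): (value or "") for name, value in attrs}
        if attrs.get("http-equiv", "").lower() != "content-type":
            return
        # Skip the media type, then look for a charset=... parameter.
        for param in attrs.get("content", "").split(";")[1:]:
            name, _, value = param.partition("=")
            if name.strip().lower() == "charset" and value.strip():
                self.charset = value.strip()

def decode_excerpt(raw_excerpt: bytes, page_html: str) -> Optional[str]:
    """Decode the excerpt with the page's declared charset, or give up."""
    parser = MetaCharset()
    parser.feed(page_html)
    if parser.charset is None:
        return None
    try:
        codecs.lookup(parser.charset)  # LookupError if no such codec is installed
        return raw_excerpt.decode(parser.charset)
    except (LookupError, UnicodeDecodeError):
        return None                    # settle for a title and a [more] link
```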

Posted by Sam Ruby at

Perhaps it would be better for me to ignore the excerpt in such circumstances, as excerpts are optional in any case, and settle simply for a title and a [more] link.

And what character encoding would you use for the title?

Posted by Mark at

It's obvious that both pingback and trackbacks are poor specifications, so either the owners of these (Radio and Movable Type, AFAIK)

Pingback is by Stuart Langridge and Ian Hickson.  [link]

Posted by Mark at

And what character encoding would you use for the title?

Good point. I currently default the title based on the parent blog entry. If I don't understand the encoding, I probably shouldn't assume ASCII, utf-8, or iso-8859-1.

By the way, beyond looking at the content-type on the trackback request itself, I'm not yet convinced of the value of fetching the page.  That would be trading one set of problems and assumptions for another.

Finally, I'll note that pingback does not have any translatable parameters on the request.

Posted by Sam Ruby at

Hey Sam,
<OffTopic>What's the story behind your favicon.ico? It's cool.</OffTopic>
Christian

Posted by Christian Romney at

Knot Theory

In response to a  question from Christian Romney:  this web page was the inspiration for my current favicon.ico.... [more]

Trackback from Sam Ruby

at

Asbjørn, Pingback does not have this encoding flaw. It transfers URLs—not narrative text. I don’t think introducing yet another ping protocol (Atom ping or another incompatible revision of Trackback) is a good idea.

I suggest using Pingback (despite the XML-RPC overkill) and letting the recipient extract what the recipient chooses to extract from the alleged linking resource. (See a description of vaporware and an actual implementation.)

the fact that I don't have the sender's codec installed on my machine really isn't the sender's fault.

The page in question purports to be XML (XHTML). As per the XML spec, the two encodings XML recipients are guaranteed to support are UTF-8 and UTF-16. Anyone sending XML using any other encoding is risking the recipient being unable to decode the characters. If communication fails due to the use of a legacy encoding, I think the sender is to blame.

Posted by Henri Sivonen at

TrackBack definitely needs to be updated to better support i18n; it's something we just weren't aware enough of back when we wrote the spec. We've been talking about updating it for a while now, replacing the current form-urlencoded POST body with an Atom entry.

That would at least give servers the opportunity to do something intelligent with the content, whereas now they're just sort of lost. Of course, it still doesn't help everyone, particularly a lot of Movable Type users--only recent versions of Perl really have decent character encoding support.

By the way, don't think that we're not brutally aware of the lack of i18n support in TrackBack--we got our comeuppance when we had to do a lot of work in TypePad to try to guess the character encoding of an excerpt. :)

Posted by Ben at

Ben: excellent!

Perhaps it is worth fleshing out Phil's suggestion first. The way I see it:

Such an approach would allow an orderly upgrade of existing clients and servers.

Posted by Sam Ruby at

A regular expression to check for valid UTF-8

This is great. A regular expression that allows you to check if text is valid UTF-8. Via Sam Ruby. I'd previously used a function I found in the PHP manual and reproduced here. I like the regular expression better for aesthetic reasons, because it...

Excerpt from Keith's Weblog at

Yes, please: even if MT 3.0Final only sends a charset, and doesn't do anything with it, we can start messing around with plugins that try to use it and fail reasonably, and see what will work for 3.n. I'll ping Matt about getting WordPress to send it, too (though I shudder to think about trying to re-encode in PHP; wonder if anyone's written a library class to try the three different sets of possible PHP extensions and the four different ways to try to get the OS to do it directly?).

Posted by Phil Ringnalda at

WordPress has sent charset with trackbacks for a few weeks now, and it will be included in the 1.3 release. A few steps ahead of ya. ;)

Posted by Matt at

Heh. I knew I should have kept looking through CVS (I couldn't remember where pings get sent) instead of just opening my beak and cheeping.

Posted by Phil Ringnalda at

Note: neither windows-1252 nor iso-8859-1 guarantees well-formed XML 1.0. There is still a nasty character range issue to deal with.

Good Point. MTStripControlChars has been updated to deal with the offending byte ranges.

That should guarantee validity (if not sensible handling of trackbacks in charsets other than ISO-8859-1/Windows-1252).

Posted by Jacques Distler at

MTStripControlChars

Introducing the new, improved, MTStripControlChars plugin.... [more]

Trackback from Musings

at

Let's see how your comment system handles some "illegal" input:

***

Will your Trackback system handle it any better?

Posted by Jacques Distler at

Trackbacks and MTStripControlChars

Just when I thought it was safe to enjoy Paris, a new version of MTStripControlChars is called for.... [more]

Trackback from Musings

at

Will your Trackback system handle it any better?

It does now.  ;-)

body=re.compile("[\x01-\x08\x0B\x0C\x0E-\x1F]").sub("*",body)


Posted by Sam Ruby at

Look upon my works, ye mighty, and despair

Why trackbacks will invalidate your page in a second, and some ideas of what to do about it....

Excerpt from Dave Walker's Bookmarks at

Interesting discussion - the link above discusses how international characters in URLs are interpreted by a popular Wiki engine.  First we try to recognise the URL as UTF-8 (taking care not to allow over-long UTF-8 encodings for security reasons), and convert from UTF-8 to native; if the URL doesn't match the 'valid UTF-8' regular expression, we just use it unconverted.
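A small Python sketch of the same idea. Note the swap: instead of a 'valid UTF-8' regular expression, it relies on Python's strict UTF-8 decoder, which also rejects over-long encodings. The function name `decode_url_path` is hypothetical, and "native" here simply means a Python string.

```python
from urllib.parse import unquote_to_bytes

def decode_url_path(path: str) -> str:
    """Percent-decode a URL path if it is valid UTF-8, else return it untouched."""
    raw = unquote_to_bytes(path)
    try:
        # Strict decoding rejects malformed and over-long sequences alike.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return path  # not UTF-8: use the URL unconverted
```

With this, `%C0%AF` (an over-long encoding of "/") is left as-is rather than being turned into a slash.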

Posted by Richard Donkin at

Internationalization and Trackbacks

In which our hero gets fed up with gibberish Trackbacks and takes matters in his own hands.... [more]

Trackback from Musings

at

Jacques Distler

Unicode codepoints 0x0-0x08, 0x0B, 0x0C, 0x0E-0x1F are problematic, whether you use NCRs, the corresponding iso-8859-1 characters or utf-8 or .... If you don’t filter these characters out of your utf-8 input, you are in just as much trouble as if you...

Excerpt from phil ringnalda dot com: My brother's (feed's) keeper: Comments at

Dumb question: What piece of software are you using to convert data?

Posted by William at

William: my own code, written in Python.  In this case, the relevant code is at the top of post.py:

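import cgi

fs = cgi.FieldStorage()
# charset declared on the request's Content-Type header; assume utf-8 when absent
charset=cgi.parse_header(fs.headers['content-type'])[1].get('charset','utf-8')
def param(key):
  # fetch a form field, or '' when the field is missing
  value=(fs.list and fs.has_key(key) and fs.getvalue(key)) or ''
  try:
    return unicode(value,charset)
  except Exception:
    # unknown or mismatched charset: hand back the raw value unchanged
    return value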
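
For readers on Python 3, where `unicode()` and `has_key()` no longer exist, a roughly equivalent sketch using `urllib.parse.parse_qs`. The function name `decode_params` is hypothetical, and the fallback differs from the code above: instead of returning the raw value, this version assumes utf-8 with replacement characters.

```python
from urllib.parse import parse_qs

def decode_params(body: str, charset: str = "utf-8") -> dict:
    """Decode a form-urlencoded body using the charset declared by the sender."""
    try:
        fields = parse_qs(body, keep_blank_values=True,
                          encoding=charset, errors="strict")
    except (LookupError, UnicodeDecodeError):
        # Unknown codec or bytes that do not fit it: assumed fallback to
        # utf-8, substituting U+FFFD for anything undecodable.
        fields = parse_qs(body, keep_blank_values=True,
                          encoding="utf-8", errors="replace")
    # Keep the first value of each field, as a simple form handler would.
    return {key: values[0] for key, values in fields.items()}
```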
Posted by Sam Ruby at
