It’s just data

Sgmllib patch

The last place I figured I would be patching when I saw this bug was the Python runtime library.

The problem is markup in titles (correction: alt attributes).  No, not those titles, or these titles, rather these alt attributes.

The way it all started was that Georg von Hippel used TeX2PNG to convert a mathematical equation into an image.  That software puts the original LaTeX source into the alt attribute.

Georg then used Konqueror to copy and paste the result into Blogger’s web interface.  Apparently Konqueror inserted a new line into the attribute value.  This arguably is suboptimal, but legal.  Blogger then proceeded to convert the new line (even though it was in an attribute value) to a <br /> sequence.  While this is not what was intended, existing browsers (I’ve tested it on Firefox, IE, and Opera) took this all in stride, as it was within a quoted string.

However, programs based on Python’s sgmllib, like the Universal Feed Parser and BeautifulSoup, throw up a hairball.  If you look in the source, you will see:

# XXX The following should skip matching quotes (' or ")

Ouch.  Test case and patch submitted.  For those who can’t wait, here is a workaround that will work with existing versions of Python:

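import re, sgmllib

# Replace sgmllib's endbracket only when the installed version needs it
if sgmllib.endbracket.search(' <').start(0):
    class EndBracketMatch:
        endbracket = re.compile(r'/?[a-zA-Z][-_.:a-zA-Z0-9]*\s*('
        def search(self,string,index=0):
            self.match = self.endbracket.match(string,index)
            if self.match: return self
        def start(self,n):
            # sgmllib calls start(); report where the match ends instead
            return self.match.end(n)
    sgmllib.endbracket = EndBracketMatch()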
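To see the failure mode in isolation, here is a minimal sketch that does not need sgmllib at all. It compares a naive "first < or >" search with a quote-aware one. The quote_aware pattern and both helper functions are illustrative assumptions, not the regular expression from the patch, and the sample tag is made up to resemble the Blogger output described above.

```python
import re

# Naive approach: the first '<' or '>' after the tag opens ends the tag.
naive_endbracket = re.compile(r'[<>]')

# Hypothetical quote-aware pattern (NOT the one from the patch): consume
# plain characters and whole quoted strings, then require a '<' or '>'.
quote_aware = re.compile(r'''(?:[^'"<>]|"[^"]*"|'[^']*')*(?=[<>])''')

# A made-up tag with markup inside a quoted attribute value.
tag = '<img alt="a <br /> b" src="eq.png">'

def naive_end(s, start=1):
    """Index of the first '<' or '>' at or after start, quotes ignored."""
    return naive_endbracket.search(s, start).start(0)

def quote_aware_end(s, start=1):
    """Index of the first '<' or '>' outside quotes, or -1 if none."""
    m = quote_aware.match(s, start)
    return m.end(0) if m else -1

print(repr(tag[:naive_end(tag)]))        # tag cut short inside the alt value
print(repr(tag[:quote_aware_end(tag)]))  # everything up to the real '>'
```

The naive search stops at the < of the embedded <br />, which is how a parser ends up treating the rest of the attribute value as new markup.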

While testing this patch, I noticed that there is a surprise in the Python SVN Head: character references will be substituted in attribute values.  While this is the way it always should have been, this will come as a surprise to many.  I’ve committed a few changes to the Universal Feed Parser so that it will accommodate both current releases and the SVN Head version.

Particularly problematic is the substitution of character references.  Substituting &lt;, &gt;, and &amp; will cause naïve programs (or programs explicitly coded to the current behavior of sgmllib) which consume and produce HTML to no longer be able to round-trip their results.  But worse is the handling of numeric character references.  Decimal (but not hexadecimal) character references which are expressible in iso-8859-1 are converted to strings (not Unicode, but strings).  If the enclosing data is in another encoding (such as utf-8), this creates a problem.
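Both problems can be reproduced without sgmllib. The sketch below is written for Python 3 and assumes the substitution behaves as described above. substitute_decimal_refs and is_valid_utf8 are hypothetical stand-ins, not sgmllib code, and the sample values are made up.

```python
import re
from html import escape, unescape

def substitute_decimal_refs(data: bytes) -> bytes:
    """Stand-in for the described behavior: a decimal reference that fits
    in iso-8859-1 becomes one raw byte; hex references are left alone."""
    def repl(match):
        code = int(match.group(1))
        return bytes([code]) if code < 256 else match.group(0)
    return re.sub(rb'&#(\d+);', repl, data)

def is_valid_utf8(data: bytes) -> bool:
    try:
        data.decode('utf-8')
        return True
    except UnicodeDecodeError:
        return False

# Encoding problem: an attribute value from a utf-8 document.
original = 'caf&#233; na\u00efve'.encode('utf-8')
mixed = substitute_decimal_refs(original)
print(is_valid_utf8(original), is_valid_utf8(mixed))

# Round-trip problem: after &lt; is substituted, writing the value back
# out verbatim changes the markup; re-escaping restores it.
attr_in = 'x &lt;&lt; y'
attr_parsed = unescape(attr_in)   # what a substituting parser hands back
naive_out = attr_parsed           # written back without escaping
safe_out = escape(attr_parsed)    # written back with escaping
print(naive_out, '|', safe_out)
```

The substituted byte 0xE9 is iso-8859-1 sitting in the middle of utf-8 data, so the result is no longer valid in either encoding as a whole.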

That software puts the original LaTeX source into the title attribute.

The alt attribute, actually. Not that it makes any difference.

Posted by Jacques Distler at

Oh, and here’s an actual real-live value for an alt attribute from the front page of Georg’s blog.


Should be fun.

(Note to Georg: “much less than” is “\ll” (≪), not “<<” (<<), not that that would solve the more general problem.)

Posted by Jacques Distler at

Actually, you meant these titles.

Posted by Jonathan P. at

The character reference change to sgmllib sounds bad to me; that’s an old library, and changing the types that way seems pretty substantial; and it’s just the kind of change that can cause cascading and mysterious errors.  Even if it might already be the cause of mysterious and lingering bugs... the right solution is probably forking the code rather than fixing it in place.

Posted by Ian Bicking at

Ian: you had to drop the ‘F’ word, didn’t you?

I will say that that’s a definite possibility.

Posted by Sam Ruby at
