Bleach Alternatives

2006-05-31T07:36:07Z

Until recently, Dare’s reds would show through on Planet Intertwingly, but Antonio’s yellow’s would be stripped. The reason was that Dare uses the <font> tag, and Antonio uses the style attribute. Both approaches should be equally valid, the only difference is that the latter is more difficult to correctly parse.

The Universal Feed Parser is not known for shying away from difficult problems, and I saw no reason why this situation should be any different. That being said, I didn’t aim to solve the general problem of parsing all possible CSS, I merely aimed to allow through a large subset of CSS that is both simple to parse and known to be safe.

This provided other benefits, for example, inset images on many feeds displayed as inset images on Planet Intertwingly.

Per Feed Customization

While this made a dramatic improvement, it still didn’t capture everything. It turns out that a number of sources either put too much or too little style information into their feeds.

BoingBoing often puts a <br clear="both"> in descriptions. Engadget does something similar with <h6> tags. This has the effect of leaving large gaps when these items appeared alongside the subscription list which “floats” to the top right of the page.

Rogers Cadenhead places class="sourcecode" on paragraphs and span tags when he is referencing source code. This displays using a monospace font on his site, but this style information is not syndicated along with his feed. I do something similar on my site, but I use <pre> and <code> tags as these degrade nicely.

Henri Sivonen places <p> tags inside of <ul> and <ol> elements, and then uses CSS to reduce the gaps between list items.

Gizmodo uses left, right, and center class names on images to cause them to float or to be centered. Again, the style sheet which describes the desired behavior associated with these classes is not placed into the feed.

Most of these issues are solvable with a little css (search for "Accomodations"). However, as the body of Planet Intertwingly is not positioned absolutely and has a floating subscription list, setting the left and right margins to auto does not center an image. But even in this case, display:block is an improvement.

Longer Term

It occurs to me that I’ve seen these problems solved before, and with a better tool. And I even have that the important piece installed on my machine...

I’d love to see all HTML processing in UFP become pluggable, and for a plug-in based on Mozilla to become a reality. Many of the pieces seem to be in place. After an apt-get install python2.4-gtk2, I find that I can import gtkmozembed from within Python. It looks like more pieces to the puzzle are (or will) become available with GtkMozEdit. But I don’t believe that fine grained access to the DOM from within Python is either necessary or even desirable.

To my way of thinking, the ideal would be to run Mozilla in a headless mode. I’d simply construct a MozEmbed object, stream in some data, that data would have some unobtrusive javascript or would use an evalInSandbox technique to make adjustments to the DOM tree, and finally either an HTMLSerializer or an XHTMLSerializer would be used to return back sanitized content.

I’d much rather use DOM/XPath techniques than regular expressions.

At this point, it occurs to me that a number of people who read this weblog have far more experience and/or better contacts than I do to help pull these pieces together.