It’s just data

Bleach Alternatives

Until recently, Dare’s reds would show through on Planet Intertwingly, but Antonio’s yellows would be stripped.  The reason was that Dare uses the <font> tag, and Antonio uses the style attribute.  Both approaches should be equally valid; the only difference is that the latter is more difficult to parse correctly.

The Universal Feed Parser is not known for shying away from difficult problems, and I saw no reason why this situation should be any different.  That being said, I didn’t aim to solve the general problem of parsing all possible CSS, I merely aimed to allow through a large subset of CSS that is both simple to parse and known to be safe.
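That approach can be sketched in a few lines of Python.  To be clear, this is an illustration, not the Universal Feed Parser’s actual code: the property whitelist and value pattern below are hypothetical stand-ins for the (much longer) real lists, but the shape of the idea is the same: only pass through declarations whose property is known and whose value matches a simple, known-safe grammar.

```python
import re

# Hypothetical whitelist; a real sanitizer's list would be much longer.
SAFE_CSS_PROPERTIES = {
    'color', 'background-color', 'font-family', 'font-size',
    'font-weight', 'font-style', 'text-align', 'text-decoration',
}

# Accept only simple values: keywords, lengths, percentages, hex colors.
SAFE_CSS_TOKEN = r'(?:#[0-9a-fA-F]{3,6}|[a-zA-Z-]+|-?\d+(?:\.\d+)?(?:px|em|%)?)'
SAFE_CSS_VALUE = re.compile(r'^%s(?:\s+%s)*$' % (SAFE_CSS_TOKEN, SAFE_CSS_TOKEN))

def sanitize_style(style):
    """Keep only declarations whose property and value look safe."""
    clean = []
    for declaration in style.split(';'):
        if ':' not in declaration:
            continue
        prop, value = declaration.split(':', 1)
        prop, value = prop.strip().lower(), value.strip()
        if prop in SAFE_CSS_PROPERTIES and SAFE_CSS_VALUE.match(value):
            clean.append('%s: %s' % (prop, value))
    return '; '.join(clean)
```

Anything involving parentheses, URLs, or expressions simply fails to match and is dropped, which is the point: the subset that survives is both easy to parse and hard to abuse.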

This provided other benefits; for example, inset images on many feeds now display as inset images on Planet Intertwingly.

Per Feed Customization

While this made a dramatic improvement, it still didn’t capture everything.  It turns out that a number of sources either put too much or too little style information into their feeds.

BoingBoing often puts a <br clear="both"> in descriptions.  Engadget does something similar with <h6> tags.  This has the effect of leaving large gaps when these items appear alongside the subscription list, which “floats” to the top right of the page.

Rogers Cadenhead places class="sourcecode" on paragraphs and span tags when he is referencing source code.  This displays using a monospace font on his site, but this style information is not syndicated along with his feed.  I do something similar on my site, but I use <pre> and <code> tags as these degrade nicely.

Henri Sivonen places <p> tags inside of <ul> and <ol> elements, and then uses CSS to reduce the gaps between list items.

Gizmodo uses left, right, and center class names on images to cause them to float or to be centered.  Again, the style sheet which describes the desired behavior associated with these classes is not placed into the feed.

Most of these issues are solvable with a little CSS (search for "Accomodations").  However, as the body of Planet Intertwingly is not positioned absolutely and has a floating subscription list, setting the left and right margins to auto does not center an image.  But even in this case, display:block is an improvement.
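One way to accommodate such quirks on the aggregator side is a fixup table mapping known class names to equivalent inline styles, which can then pass through the same sanitizer as everything else.  A rough sketch, with a hypothetical mapping and a deliberately naive regex (a real implementation would do this on the parsed tree, not on raw markup):

```python
import re

# Hypothetical fixups: class names seen in feeds mapped to inline
# styles an aggregator can actually honor.
CLASS_TO_STYLE = {
    'left':   'float: left',
    'right':  'float: right',
    'center': 'display: block; margin-left: auto; margin-right: auto',
    'sourcecode': 'font-family: monospace',
}

def inline_classes(html):
    """Replace class="..." with an inline style when the class is known."""
    def substitute(match):
        cls = match.group(1)
        if cls in CLASS_TO_STYLE:
            return 'style="%s"' % CLASS_TO_STYLE[cls]
        return match.group(0)
    return re.sub(r'class="([^"]*)"', substitute, html)
```

Unknown classes are left alone, so the transformation is safe to run over everything.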

Longer Term

It occurs to me that I’ve seen these problems solved before, and with a better tool.  And I even have the important piece installed on my machine...

I’d love to see all HTML processing in UFP become pluggable, and for a plug-in based on Mozilla to become a reality.  Many of the pieces seem to be in place.  After an apt-get install python2.4-gtk2, I find that I can import gtkmozembed from within Python.  It looks like more pieces of the puzzle are (or soon will be) available with GtkMozEdit.  But I don’t believe that fine-grained access to the DOM from within Python is either necessary or even desirable.

To my way of thinking, the ideal would be to run Mozilla in a headless mode.  I’d simply construct a MozEmbed object and stream in some data; that data would carry some unobtrusive JavaScript, or would use an evalInSandbox technique, to make adjustments to the DOM tree; and finally either an HTMLSerializer or an XHTMLSerializer would be used to return the sanitized content.

I’d much rather use DOM/XPath techniques than regular expressions.
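As a small illustration of the difference, even the standard library’s ElementTree can drop unwanted elements structurally, where a regular expression stumbles over attributes, case, and nesting.  (ElementTree here is just a stand-in for the real DOM access the post has in mind; as a sketch, it doesn’t preserve tail text following a removed element.)

```python
import xml.etree.ElementTree as ET

def strip_scripts(xhtml):
    """Drop <script> elements via the tree rather than a fragile regex."""
    root = ET.fromstring(xhtml)
    for parent in root.iter():
        # Iterate over a copy, since we mutate the child list.
        for child in list(parent):
            if child.tag == 'script':
                parent.remove(child)
    return ET.tostring(root, encoding='unicode')
```

No amount of `<script[^>]*>` cleverness is as reassuring as never having the node in the output tree at all.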

At this point, it occurs to me that a number of people who read this weblog have far more experience and/or better contacts than I do to help pull these pieces together.


the style sheet which describes the desired behavior associated with these classes is not placed into the feed.

Sam, I’m curious what you’re advocating here. Do you suggest feeds be constructed so that they contain inline style rules instead of classes, or is there some way (currently unknown to me) to associate an external CSS file with an Atom content element? Would the latter be as simple as including a xml-stylesheet processing instruction at the top of the feed?

Posted by Justin Watt at

[from miyagawa] Sam Ruby: Bleach Alternatives

same as [link]...

Excerpt from del.icio.us/network/torum at

Justin: It would be considerably more difficult to consume an arbitrary external CSS stylesheet safely.  People tend to limit themselves to a very constrained subset of CSS when coding style tags.  Additionally, it would be up to the consumer to resolve the selectors and cascading style sheet rules.  Such an approach would virtually require something like an embedded Mozilla browser to do this correctly.  On the other hand, an embedded Mozilla browser could make this trivial.

For now, I simply advocate giving a moment’s thought to how entries appear in isolation, and making more use of presentational markup and/or limited use of style attributes.  Not to the extent that tools like current versions of Microsoft Word do, more like the direction that Microsoft Word appears to be heading.

Posted by Sam Ruby at

That’ll probably work. Should be easy to hook up an XMLSerializer to get XHTML. There are HTML serializers in the tree (including a sanitizer), but none of them are exposed to JS, AFAIK. Wouldn’t be hard to wrap them in XPCOM, though.

Posted by Robert Sayre at

What I think would be a good solution to the general context-less class problem would be to draw up a set of classes with defined semantics which could be adopted by people. This adoption could somehow be signaled in the feed (or HTML) and picked up by the aggregator. Courtesy of studies such as Ian Hickson’s at google ([link]), we’ve already seen that “de-facto” class name standards are emerging, although we can only guess that class="footer" means the same thing from two different sources, and we still have people missing the point with “classes of a presentational nature”.

Posted by Jon Dowland at

FWIW this post looked like garbage in buglines. Er bloglines.

Posted by Darryl at

That’ll probably work

That means a lot to me, coming from you as you clearly have some knowledge of the codebase involved.  Anything you can do to help?

There clearly is demand for this, across a number of products, and languages.  As I pointed out, the interface doesn’t even have to be very wide: tag soup and JavaScript in, XHTML out.  Nor does the interface have to be particularly language-specific; I imagine that such an interface would be popular with all the “scripting” languages: Perl, PHP, Ruby, Python...

I would also suggest that producing XHTML is sufficient.  If the serializer avoids CDATA and escapes all less-than characters in content, a simple xhtml.replace('/>','>') would suffice to convert this XHTML to HTML.
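Spelled out, with the assumptions made explicit in the docstring (this only works when the serializer has upheld its end of the bargain):

```python
def xhtml_to_html(xhtml):
    """Convert serialized XHTML to HTML by dropping self-closing slashes.

    Assumes the serializer emitted no CDATA sections and escaped every
    literal '<' in text content, so the only occurrences of '/>' are
    at the end of empty-element tags.
    """
    return xhtml.replace('/>', '>')
```

The extra space left behind in tags like `<img src="x" >` is harmless to HTML parsers, which is what makes such a blunt substitution viable.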

Posted by Sam Ruby at

I had to come over here to finish reading the post because Bloglines had trouble following “Engadget does something similar with.”

Posted by Michael Pate at

The reason why I tend to have paragraphs inside list items even if they are not particularly paragraph-like is that OOo Writer/Web (or in my specific case NeoOffice Writer/Web) puts them there and I am too lazy to take them out on a case-by-case basis.

Besides, considering the definition of “paragraph” in the current draft of HTML5, the usage is not even necessarily semantically wrong.

Using Mozilla for sanitizing HTML seems like overkill to me. In Java and Jython, I use TagSoup plus SAX filters. Making TagSoup available to CPython would be a worthy project.

A couple of months ago I pondered compiling TagSoup with gcj and wrapping the resulting C++-compatible classes as a Python module. I didn’t really need it myself and I have other stuff to do, so I dropped the idea. Other possibilities include a Java-to-Python compiler with just enough features to handle TagSoup and ad hoc source munging. I think a one-time manual port would be problematic considering updates.
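For what it’s worth, a pure-Python port wouldn’t have to start entirely from scratch: CPython’s standard library already ships a lenient parser that could serve as the plumbing. This toy sketch only re-emits what the parser recovers (quoting attributes, escaping text); TagSoup’s actual value, inferring implied end tags and repairing nesting, is precisely the part it omits.

```python
from html.parser import HTMLParser
from html import escape

class SoupEcho(HTMLParser):
    """Re-emit whatever the lenient stdlib parser recovers.

    A toy: real TagSoup also infers implied end tags, fixes bad
    nesting, and normalizes element names per its fix-up rules.
    """
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # Re-quote attributes; valueless attributes get an empty value.
        rendered = ''.join(' %s="%s"' % (k, escape(v or '', quote=True))
                           for k, v in attrs)
        self.out.append('<%s%s>' % (tag, rendered))

    def handle_endtag(self, tag):
        self.out.append('</%s>' % tag)

    def handle_data(self, data):
        self.out.append(escape(data))

def clean(html):
    parser = SoupEcho()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)
```

Even this much gets you well-formed attributes and escaped text out of unquoted, entity-laden soup, which suggests the gap a port would have to fill is the repair logic, not the parsing.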

Posted by Henri Sivonen at

One more idea: At fiMUG, we decided to integrate a wiki written in Ruby with an in-house CMS written in PHP. The design I chose was putting a TagSoup-powered screen scraper written in Java in between. The Java process is memory-resident all the time, so that there’s no need to wait for the JVM to start. The PHP side of this project is not done yet, so I can’t say if it is a success on the whole.

Posted by Henri Sivonen at

Sure, I’ll help with this

Sam Ruby had a great idea. He wants to run Mozilla headless with GtkMozembed and use it to sanitize markup....... [more]

Trackback from franklinmint.fm at

I’d love for there to be a pure-Python version of TagSoup, and I’d be willing to work with anyone who wants to construct one.  My truly elegant logic in a truly elegant language — excellent!

When I started planning TagSoup back in 2002, Java made the most sense.  It had full Unicode support, SAX was defined very clearly for it, and it obviously wasn’t going away.  Now those reasons aren’t as pressing, and my development efforts are winding down — probably one more pre-1.0 release, as I just got a fat contributed patch.

The gcj-plus-wrapper idea is a good one, though.  I’ve played with gcj, but there are problems using it on Cygwin, which is my main development platform.

Posted by John Cowan at

Beautiful Soup

Posted by Mark at

A little research shows that the original author is relatively inactive. Mozilla people have been keeping it compiling, but most of the actual gtk patches and reviews come from chpe@gnome.org and marco@gnome.org. The gnome-python stuff looks like a little bit of build hazing, but the bindings are straightforward.

So, the API we would want to add is

gtkmozembed.run_user_script( js_code_string );
gtkmozembed.serialize();

and some serialize-start/serialize-data/serialize-stop signals... sound about right?

It occurs to me that the XMLSerializer will probably write CDATA blocks, since the DOM has CDATA nodes. But that is a solvable Mozilla-internals thing.

Posted by Robert Sayre at

A little research shows that the original author is relatively inactive. Mozilla people have been keeping it compiling, but most of the actual gtk patches and reviews come from chpe@gnome.org and marco@gnome.org. The gnome-python stuff looks like a little bit of build hazing, but the bindings are straightforward.

When I said long term above, I meant it.  If it can be done fast, great, but ideally as much as makes sense would be included in Mozilla itself, and the rest should only be one apt-get install away.  How close we can get to that goal, I leave up to you.

So, the API we would want to add is gtkmozembed.run_user_script( js_code_string ); gtkmozembed.serialize();

Sounds good.

and some serialize-start/serialize-data/serialize-stop signals... sound about right?

I presume that all this is due to the fact that Mozilla is inherently multi-threaded/asynchronous.  In the application I have in mind, a synchronous function call is a better match.  But no matter, this can be accommodated.

It occurs to me that the XMLSerializer will probably write CDATA blocks, since the DOM has CDATA nodes. But that is a solvable Mozilla-internals thing.

Oddly enough, the way that XML-DOM people use the term CDATA is quite separate from the similar term used in serialization.  My experience with DOM serializers is that they tend to avoid the use of CDATA in serialization; that feature tends to be more used by templates.

Posted by Sam Ruby at

Hmm, you have again linked FeedBurner’s http://feeds.feedburner.com/tecosystems?m=1299 redirect URI instead of the http://www.redmonk.com/sogrady/archives/001688.html permalink it ends up at. (Don’t ask me how I specifically stumble over these – I didn’t hover any of the other links in this post.)

Posted by Aristotle Pagaltzis at

I Have Arrived

Sam Ruby and Mark Pilgrim endorse Beautiful Soup....

Excerpt from News You Can Bruise at

I presume that all this is due to the fact that Mozilla is inherently multi-threaded/asynchronous.  In the application I have in mind, a synchronous function call is a better match.

No, it can be synchronous. I am just used to avoiding blocking the event loop, but that doesn’t matter here.

My experience with DOM serializers is that they tend to avoid the use of CDATA in serialization, that feature tends to be more used by templates.

Mozilla code tends to preserve lots of syntax-level detail. Makes sense if you have to support stuff like Composer and Nvu. The result we get here is not what we want:

<html xmlns="http://www.w3.org/1999/xhtml">
 <head>
   <title>DOMParser Test</title>
   <script type="text/javascript">

function testSerializer() {  
  var serializer = new XMLSerializer();
  var xml = serializer.serializeToString(document.getElementById("testdiv"));
  alert(xml);
}
window.onload = testSerializer;

   </script>
 </head>
 <body>
   <div id="testdiv">foo <![CDATA[b&r]]> baz</div>
 </body>
</html>
Posted by Robert Sayre at

The result we get here is not what we want

I agree that Firefox should display that text.

I am comfortable, however, with the serializer returning data that is consistent with what Firefox would display, for a number of reasons.  First, while there will always be edge cases, the people working on Firefox are likely, on average, to do a much better job than the various “soup” parsers out there.  Second, having a correlation between what is returned and what is actually seen is a plus.  Finally (and perhaps most importantly), the inevitable bug reports can be routed to Mozilla.  ;-)

Posted by Sam Ruby at

Sam: Do you know how well your CODE and PRE tags are making it into aggregators? I would write a converter that makes my sourcecode-styled paragraphs use those tags, but I’ve assumed that most aggregators strip all but the most common HTML tags.

It would be cool to have a naming convention for weblog styles for presentation in aggregators, so that things like sourcecode, filename references, books and the like could be tagged with metadata that aids presentation.

Posted by Rogers Cadenhead at

Hmm, you have again linked FeedBurner’s http://feeds.feedburner.com/tecosystems?m=1299 redirect URI instead of the http://www.redmonk.com/sogrady/archives/001688.html permalink it ends up at.

Fixed. Thanks!

Posted by Sam Ruby at

Filtering

Currently, Abdera does not perform any filtering of the content or text elements. Meaning that if some feed decided to include some unsafe style and script in the content, it would be passed through to the application using the parser, which would...

Excerpt from snellspace.com at

Rogers:

Do you know how well your CODE and PRE tags are making it into aggregators?

I can’t think of any reason to strip <code>, and indeed I haven’t seen any aggregator that does. They don’t strip <pre> either, though at least I can see how some aggregator authors might consider that one bad.

You will definitely do no worse than you are currently doing by just punting and slipping class attributes into your feed, which is likely to work nowhere.

Posted by Aristotle Pagaltzis at

I know of only one aggregator that doesn’t handle <code>. It seems to me that it’s basically ignoring anything that would change the font face - that includes <tt>, <samp> and <kbd>. For the same reason, while it seems to keep the formatting provided by <pre>, it doesn’t use a monospace font.

I also know of one other aggregator that has problems with <pre>. Interestingly enough, it gets things wrong in exactly the opposite way. It uses a monospace font, but it loses all the formatting.

Posted by James Holderness at

Add your comment