intertwingly

It’s just data

Dealing with HTML in Feeds


Frédéric Wang: Issue with self closing MathML tags in planet

The problem here is that Frédéric takes the same content that he carefully serves as application/xhtml+xml and places it in an Atom feed as HTML.  Planet Venus is based on the FeedParser, which uses sgmllib to parse HTML content, and sgmllib by design ignores the self-closing tag syntax.

There are a few changes Frédéric should consider in order to make his feed consumable by the widest variety of consumers, but this post focuses on what changes should be made to the feed parser in order to support this case better.

The simplest, and lowest-risk, approach is to automatically close mspace, mglyph, msline, none, mprescripts, malignmark, and maligngroup when inside a math element.  This process will need to be repeated for SVG, which undoubtedly will have a considerably larger number of such elements.
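To make the auto-close idea concrete, here is a minimal sketch.  It is not the feedparser's code: sgmllib does not exist in Python 3, so this uses the standard library's html.parser as a stand-in, and the names MathAutoCloser and auto_close_mathml are made up for illustration.  The only thing it is meant to show is the rule itself: inside math, the seven elements listed above are closed as soon as they are opened, whether or not the author wrote the trailing slash.

```python
from html import escape
from html.parser import HTMLParser

# The MathML elements named in the post that never have content.
MATHML_EMPTY = frozenset(
    ["mspace", "mglyph", "msline", "none", "mprescripts", "malignmark", "maligngroup"]
)


def _open_tag(tag, attrs, self_closing=False):
    """Serialize a start tag from the (name, value) pairs HTMLParser provides."""
    parts = [tag]
    for name, value in attrs:
        if value is None:
            parts.append(name)
        else:
            parts.append('%s="%s"' % (name, escape(value, quote=True)))
    return "<%s%s>" % (" ".join(parts), " /" if self_closing else "")


class MathAutoCloser(HTMLParser):
    """Hypothetical filter: re-serializes markup, closing empty MathML elements.

    Comments and processing instructions are dropped to keep the sketch short.
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.math_depth = 0  # > 0 while inside a <math> element

    def _is_empty(self, tag):
        return self.math_depth > 0 and tag in MATHML_EMPTY

    def handle_starttag(self, tag, attrs):
        self.out.append(_open_tag(tag, attrs))
        if tag == "math":
            self.math_depth += 1
        elif self._is_empty(tag):
            # Close immediately, regardless of how the source spelled the tag.
            self.out.append("</%s>" % tag)

    def handle_startendtag(self, tag, attrs):
        if self._is_empty(tag):
            self.handle_starttag(tag, attrs)
        else:
            self.out.append(_open_tag(tag, attrs, self_closing=True))

    def handle_endtag(self, tag):
        if self._is_empty(tag):
            return  # already closed when it was opened
        if tag == "math" and self.math_depth:
            self.math_depth -= 1
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        # Character references were decoded by the parser, so re-escape them.
        self.out.append(escape(data, quote=False))


def auto_close_mathml(html_text):
    parser = MathAutoCloser()
    parser.feed(html_text)
    parser.close()
    return "".join(parser.out)
```

Feeding it an mspace start tag that lacks the trailing slash inside math yields the same output as the self-closed form, while a stray none tag outside of math is left alone.  An SVG version would need its own, longer, set of element names.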

A more comprehensive approach, and therefore one that carries both greater benefit and greater risk, is to replace the calls to sgmllib with calls to html5lib.  There are two parts to this effort: (a) separate the uses of sgmllib that parse ill-formed feeds from the uses where it is known to be parsing HTML, and (b) if html5lib is available, use dom2sax to produce events that can be mapped to their sgmllib equivalents.
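Part (b) amounts to a small adapter.  The sketch below shows one way the mapping could look, under some loud assumptions: SaxToSgmllib and RecordingTarget are invented names, the target only needs the three sgmllib-style callbacks unknown_starttag, unknown_endtag, and handle_data, and the events here are driven by the standard library's xml.sax on well-formed input rather than by html5lib, so that the example runs without any third-party package.

```python
import xml.sax
from xml.sax.handler import ContentHandler


class RecordingTarget:
    """Hypothetical stand-in for a parser that expects sgmllib-style callbacks."""

    def __init__(self):
        self.events = []

    def unknown_starttag(self, tag, attrs):
        self.events.append(("start", tag, attrs))

    def unknown_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, text):
        self.events.append(("data", text))


class SaxToSgmllib(ContentHandler):
    """Maps SAX events onto the sgmllib-style callbacks of a target object."""

    def __init__(self, target):
        super().__init__()
        self.target = target

    # Plain (non-namespaced) SAX events.
    def startElement(self, name, attrs):
        pairs = [(key.lower(), value) for key, value in attrs.items()]
        self.target.unknown_starttag(name.lower(), pairs)

    def endElement(self, name):
        self.target.unknown_endtag(name.lower())

    # Namespace-aware SAX events: keep only the local name.
    def startElementNS(self, name, qname, attrs):
        pairs = [(local.lower(), value) for (_uri, local), value in attrs.items()]
        self.target.unknown_starttag(name[1].lower(), pairs)

    def endElementNS(self, name, qname):
        self.target.unknown_endtag(name[1].lower())

    def characters(self, content):
        self.target.handle_data(content)


def replay(xml_bytes):
    """Parse well-formed markup and return the sgmllib-style events it produced."""
    target = RecordingTarget()
    xml.sax.parseString(xml_bytes, SaxToSgmllib(target))
    return target.events
```

Because the events come from a parsed tree rather than from sgmllib's tokenizer, a self-closed mspace arrives as a start event followed directly by an end event, which is exactly what the downstream code needs.  Swapping the event source for html5lib's tree-to-SAX conversion should leave the adapter itself unchanged.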

Implementing both results in a number of failures which I have sorted by severity, and will describe below:

That’s pretty much it.  Not too bad, really, until you realize that there is an effort underway to port the feedparser to Python 3, and efforts to port html5lib to Python 3 appear to have stalled.

Finally, the process of repetitively parsing HTML content into a DOM, producing events from the DOM, looking for simple patterns like href attributes which may need to be resolved, producing a string, and then repeating the process again to do sanitization or microformats or whatever is a bit suboptimal.  A better approach would be to convert all HTML once into a DOM and then traverse and scour the DOM as many times as necessary.  That’s the design of Mars, which is a more ambitious refactoring of Planet.
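The parse-once design can be illustrated in a few lines.  This is not code from Mars: it is a sketch that assumes the content is already well-formed, so the standard library's minidom can stand in for an HTML5 parser, and the functions resolve_links, sanitize, and process are names invented here.

```python
from urllib.parse import urljoin
from xml.dom import minidom


def resolve_links(dom, base):
    """Pass 1: make every href on an <a> element absolute."""
    for element in dom.getElementsByTagName("a"):
        if element.hasAttribute("href"):
            element.setAttribute("href", urljoin(base, element.getAttribute("href")))


def sanitize(dom, banned=("script",)):
    """Pass 2: drop elements that should never reach the output."""
    for name in banned:
        # Copy the node list first, since the tree is modified while iterating.
        for element in list(dom.getElementsByTagName(name)):
            element.parentNode.removeChild(element)


def process(xhtml, base):
    dom = minidom.parseString(xhtml)  # parse exactly once
    resolve_links(dom, base)          # each pass walks the same tree
    sanitize(dom)
    return dom.documentElement.toxml()  # serialize exactly once
```

Each additional concern, microformats for example, becomes one more function that takes the same tree, rather than another parse-and-serialize round trip.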