Inside the feedparser is the following comment, originally by Mark Pilgrim:
# This will horribly munge inline content with non-empty qnames,
# but nobody actually does that, so I'm not fixing it.
Two comments: this is now a bit of an overstatement, as I’ve addressed a number of the common use cases for svg and mathml. And it is amazing that we have gotten to 2008 without this being an issue.
That’s the good news. The bad news is that further progress is difficult. The internal model for the feed parser for content is a serialized string. Such a string is repeatedly pulled apart using an SGML parser and put back together. It was the best technology at the time. Workable, but not ideal for HTML. Problematic for XHTML.
That’s what inspired me to produce Mars. Its internal model is a REXML DOM. Atom feeds with xhtml or text content are directly read into that DOM (ideally using libxml2). Content that is escaped html utilizes the html5lib parser to produce a DOM. Further processing (such as sanitization and resolving relative URIs) is done directly on the DOM.
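To illustrate what "processing done directly on the DOM" can look like, here is a minimal sketch in Python using only the standard library. It is not Mars code (Mars is Ruby and uses REXML); the function name `resolve_relative_uris` and the choice of `href`/`src` as the URI-bearing attributes are assumptions made for the example.

```python
from urllib.parse import urljoin
from xml.dom import minidom

# Attributes assumed, for this sketch only, to carry URIs.
URI_ATTRS = ("href", "src")


def resolve_relative_uris(xhtml_fragment, base):
    """Parse a well-formed XHTML fragment and make href/src absolute.

    The fragment is parsed once into a DOM, modified in place, and
    serialized once at the end, rather than being re-parsed as a string
    for every processing step.
    """
    doc = minidom.parseString(xhtml_fragment)
    for element in doc.getElementsByTagName("*"):
        for name in URI_ATTRS:
            if element.hasAttribute(name):
                absolute = urljoin(base, element.getAttribute(name))
                element.setAttribute(name, absolute)
    return doc.documentElement.toxml()


if __name__ == "__main__":
    fragment = (
        '<div xmlns="http://www.w3.org/1999/xhtml">'
        '<a href="../x.html">x</a></div>'
    )
    print(resolve_relative_uris(fragment, "http://example.org/blog/2008/"))
```

Because the content stays a tree between steps, a later step such as sanitization would see the same elements and namespaces that the parser produced, which is the property a serialized-string model loses.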
The downside for Mars at the moment is that its development has focused on relatively good feeds. It does have code in place to attempt to parse non-well-formed feeds using bits and pieces that are part of the HTML5 parser, but that’s only lightly tested at this point. And while it does support a number of the more popular RSS formats out there, it doesn’t attempt to handle Atom 0.3 or some of the more obscure RSS formats.
It also looks like I have a few more patches I can pull from. This one, in particular, looks interesting. Apparently, I hadn’t documented harvest adequately, as it should be able to directly address the ERb issue mentioned.
And it is amazing that we have gotten to 2008 without this being an issue.
I would say, “disappointing,” rather than “amazing.” The Feedparser had a pretty good run, but I, for one, wish it had been shorter.
It’s taken till 2008 to get to the point where browsers are sufficiently capable. Only within the past year has it been feasible to author content that makes this an issue. (Granted, without your efforts, this would have been an issue, with much simpler content, years ago.)
FWIW, I pushed out a fix to Venus to handle nested mathml/svg/mathml. Additionally, the latest Venus will recover more gracefully when the feedparser introduces a problem: as in, it will treat that one reserialized xhtml fragment as tag soup.
The particular bug that your post triggered turned out to have nothing to do with prefixes; the actual problem was nesting. The close of the first nested math element was treated as a close of the outer math element, causing other MathML-related tags to be sanitized, including the close tag for the outer math element itself, which is what caused the well-formedness error.
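The nesting problem can be reproduced in miniature. The sketch below is not the feedparser or Venus code; `MathTracker`, `inside_math_flags`, and the sample markup are all hypothetical, written only to contrast a tracker that leaves "MathML mode" on the first closing math tag with one that counts nesting depth.

```python
from html.parser import HTMLParser


class MathTracker(HTMLParser):
    """Records, for each start tag, whether it was seen inside <math>."""

    def __init__(self, use_depth):
        super().__init__()
        self.use_depth = use_depth
        self.depth = 0   # number of <math> elements believed to be open
        self.seen = []   # (tag, inside_math) pairs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "math":
            self.depth += 1
        self.seen.append((tag, self.depth > 0))

    def handle_endtag(self, tag):
        if tag == "math" and self.depth > 0:
            if self.use_depth:
                # Only the close of the outermost <math> brings this to 0.
                self.depth -= 1
            else:
                # Flag-style behaviour: any </math> ends MathML mode.
                self.depth = 0


def inside_math_flags(markup, use_depth):
    parser = MathTracker(use_depth)
    parser.feed(markup)
    parser.close()
    return parser.seen


# Hypothetical sample with a math element nested inside another.
SAMPLE = "<math><svg><math><mi>x</mi></math></svg><mo>+</mo></math>"

if __name__ == "__main__":
    print(inside_math_flags(SAMPLE, use_depth=False))
    print(inside_math_flags(SAMPLE, use_depth=True))
```

With the flag-style tracker, the `mo` element that follows the inner close is reported as being outside MathML, so a sanitizer relying on it would strip that element as unknown markup. With the depth counter, the element is still recognized as MathML content.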