It’s just data

Planet Hopping

Jacques Distler: I got quite annoyed that the existing software (Venus) was unable to handle my own Atom feed. Apparently, the Universal Feedparser is weak, and easily confused by posts like this one.

Inside the feedparser is the following comment, originally by Mark Pilgrim:

# This will horribly munge inline content with non-empty qnames,
# but nobody actually does that, so I'm not fixing it.

Two comments: this is now a bit of an overstatement, as I’ve addressed a number of the common use cases for svg and mathml.  And it is amazing that we have gotten to 2008 without this being an issue.

That’s the good news.  The bad news is that further progress is difficult.  The feed parser’s internal model for content is a serialized string.  Such a string is repeatedly pulled apart using an SGML parser and put back together.  It was the best technology at the time.  Workable, but not ideal for HTML.  Problematic for XHTML.

That’s what inspired me to produce Mars.  Its internal model is a REXML DOM.  Atom feeds with xhtml or text content are directly read into that DOM (ideally using libxml2).  Content that is escaped html utilizes the html5lib parser to produce a DOM.  Further processing (such as sanitization and resolving relative URIs) is done directly on the DOM.

Additional methods are added to the REXML elements to make traversing the DOM as convenient as the feedparser does.  In fact, it goes further and borrows an idea from JavaScript making properties accessible either via hash index or named attribute notation, for example d['feed']['title'] can be more simply expressed as d.feed.title.  Of course, the full REXML methods (including XPath) are also available.
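To make the dual notation concrete, here is a minimal sketch of the idea in Python rather than in Mars's Ruby. The AttrDict class and the sample feed title are hypothetical, made up for illustration; this is not how Mars extends REXML elements, only one way to get the same two spellings for one lookup.

```python
class AttrDict(dict):
    """A dict whose keys can also be read as attributes (illustrative only)."""

    def __getattr__(self, name):
        # Python calls __getattr__ only after normal attribute lookup fails,
        # so ordinary dict methods such as .keys() keep working.
        try:
            value = self[name]
        except KeyError:
            raise AttributeError(name) from None
        # Wrap nested plain dicts so chained access like d.feed.title works.
        if isinstance(value, dict) and not isinstance(value, AttrDict):
            value = AttrDict(value)
        return value


# Made-up sample data standing in for a parsed feed.
d = AttrDict({"feed": {"title": "Planet Hopping"}})

# Both spellings reach the same value.
same = d["feed"]["title"] == d.feed.title
```

A key that happens to share its name with a dict method (keys, items, and so on) is still only reachable through the index form, which is one reason to keep both notations available.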

The downside for Mars at the moment is that its development has focused on relatively good feeds.  It does have code in place to attempt to parse non-well-formed feeds using bits and pieces that are part of the HTML5 parser, but that’s only lightly tested at this point.  And while it does support a number of the more popular RSS formats out there, it doesn’t attempt to handle Atom 0.3 or some of the more obscure RSS formats.

It also looks like I have a few more patches I can pull from.  This one, in particular, looks interesting.  Apparently, I hadn’t documented harvest adequately, as it should be able to directly address the ERb issue mentioned.


And it is amazing that we have gotten to 2008 without this being an issue.

I would say, “disappointing,” rather than “amazing.” The Feedparser had a pretty good run, but I, for one, wish it had been shorter.

It’s taken till 2008 to get to the point where browsers are sufficiently capable. Only within the past year has it been feasible to author content that makes this an issue. (Granted, without your efforts, this would have been an issue, with much simpler content, years ago.)

Posted by Jacques Distler at

Is libhtml5 meant to say html5lib or is there a new parser out there I don’t know about yet?

Posted by Anne van Kesteren at

I meant html5lib... fixed.

A longer term goal of mine is to attempt to enable (perhaps by default, perhaps as an option) Planet Venus to produce valid HTML5 output.

Hmmm.  I was going to provide a link to HTML5 validator output for the Mars edition of my planet, but I can’t seem to find html5.validator.nu.

Posted by Sam Ruby at

A longer term goal of mine is to attempt to enable (perhaps by default, perhaps as an option) Planet Venus to produce valid HTML5 output.

What do you do with DOMs that cannot be serialized to HTML5?
I would imagine that any DOM produced by Venus/Mars can be serialized to XHTML5. Do you intend that it also be serializable to HTML5?

Posted by Jacques Distler at

I wasn’t clear.  It wasn’t my intent to focus on the differences between HTML5 and XHTML5, but rather on the differences between well-formed but not valid (X)HTML5 and well-formed and valid (X)HTML5.

Now that the HTML5 validator is back online, it looks like the first place I need to focus is on my template.  That’s easy enough.

Posted by Sam Ruby at

FWIW, I pushed out a fix to Venus to handle nested mathml/svg/mathml.  Additionally the latest Venus will recover more gracefully when the feedparser introduces a problem: as in, it will treat that one reserialized xhtml fragment as tag soup.

Posted by Sam Ruby at

The Feedparser had a pretty good run, but I, for one, wish it had been shorter.

"Patches welcome."

Posted by Mark at

“Sometimes.”

Posted by Sam Ruby at

but I can’t seem to find html5.validator.nu.

Hmm. Validator.nu outages seem to happen when you are about to link. :-(

The DNS host had a 15-minute outage that they cannot find on their own logs…

Posted by Henri Sivonen at

FWIW, I pushed out a fix to Venus to handle nested mathml/svg/mathml.

Does this fix depend on the peculiar mix of prefixed and un-prefixed content used on my blog? Or will it, say, handle this feed?

"Patches welcome."

Depends what one means by "welcome."

Posted by Jacques Distler at

The particular bug that your post triggered turned out to have nothing to do with prefixes; the actual problem was nesting.  The close of the first nested math element was treated as the close of the outer math element.  That caused the mathml related tags that followed to be sanitized, including the close tag for the outer math element itself, which is what caused the well-formedness error.
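As a rough illustration of that kind of nesting bug, and not the actual Venus or feedparser code, the Python sketch below keeps a depth counter so that only the close tag that brings the depth back to zero ends the outer math element. The function name, the regular expression, and the sample markup are all hypothetical; treating any tag whose local name is math as MathML, whatever its prefix, is a simplification that a namespace-aware parser would not need.

```python
import re

# Very small tag scanner: group 1 is "/" for close tags, group 2 the tag name.
TAG = re.compile(r"<(/?)([A-Za-z][\w:.-]*)[^>]*>")


def outer_math_span(markup):
    """Return (start, end) offsets of the first outermost math element, or None.

    A depth counter ensures that the close of a nested math element is not
    mistaken for the close of the outer one.
    """
    depth = 0
    start = None
    for match in TAG.finditer(markup):
        closing, name = match.group(1), match.group(2)
        # Simplification: compare local names only and ignore the prefix.
        if name.rsplit(":", 1)[-1] != "math":
            continue
        if match.group(0).endswith("/>"):
            continue  # a self-closing element opens and closes at once
        if not closing:
            if depth == 0:
                start = match.start()
            depth += 1
        elif depth > 0:
            depth -= 1
            if depth == 0:
                return start, match.end()
    return None


# Made-up sample with math nested inside svg nested inside math.
sample = (
    "<p><math><semantics><annotation-xml><svg><foreignObject>"
    "<math><mi>x</mi></math>"
    "</foreignObject></svg></annotation-xml></semantics></math> tail</p>"
)
span = outer_math_span(sample)
```

Without the counter, the first close tag for math would end the element early, and everything between it and the real close would be handled as if it were outside MathML.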

Posted by Sam Ruby at
