Xhtml5lib
While there unquestionably are a lot of applications of XML for which strict, draconian error handling is appropriate, there are also a number of use cases for which robust scavenging is required, as evidenced by the popularity of libraries such as Beautiful Soup and the Universal Feed Parser. I’ve even done likewise for OPML.
HTML5’s grammar is a rich blend of SGML (the common ancestor of both HTML and XML), XML, and custom parsing rules. These rules were arrived at by observing the effective consensus that browser vendors have converged on in dealing with the enormous diversity of documents that exist on the Internet, documents often produced either by hand editing or by copy/pasting portions of templates.
Much of that experience can directly benefit those who find themselves in need of recovering data from malformed XML at any cost, particularly XML documents produced using the same hand editing, copy/pasting, and templating techniques that produce invalid HTML. Additionally, given the rough similarity between HTML and XML syntax, naïve users will often copy things that happen to work in HTML into XML documents.
For these reasons, it should be of no surprise that only some relatively small adaptations to the existing html5lib tokenizer and html5parser are needed to support an XML/XHTML scavenger library. With tests.
Just be aware that in scavenge mode, some data will be interpreted in a manner different than the author intended, as such intent can’t be determined. Also be aware that some of the more advanced XML features that are less commonly used in hand-produced XML, like internal DTD subsets, are not supported by this process. For this reason, it is recommended that data first be parsed by a “real” XML parser and this logic only be used as a fallback.
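The strict-first, scavenge-second recommendation can be sketched roughly as follows. This is a minimal illustration, not code from xhtml5lib: `parse_with_fallback` and its `scavenge` argument are hypothetical names, with `scavenge` standing in for whatever liberal parser is available, and the strict pass assumes only the standard library's `xml.etree.ElementTree`.

```python
import xml.etree.ElementTree as ET


def parse_with_fallback(data, scavenge):
    """Try a strict XML parse first; scavenge only if that fails.

    `scavenge` is a hypothetical caller-supplied callable (an assumption,
    not an API of any library named in this post) that accepts the raw
    text and returns whatever the liberal parser produces.
    Returns a (result, was_scavenged) pair.
    """
    try:
        return ET.fromstring(data), False
    except ET.ParseError:
        # Not well-formed: hand the raw text to the liberal parser.
        return scavenge(data), True


def fake_scavenger(text):
    # Stand-in for a real liberal parser; it only reports the input length.
    return "scavenged %d characters" % len(text)


good = "<opml><body><outline text='a'/></body></opml>"
bad = "<opml><body><outline text='a'></body></opml>"  # unclosed <outline>

print(parse_with_fallback(good, fake_scavenger)[1])  # False
print(parse_with_fallback(bad, fake_scavenger)[1])   # True
```

Returning the `was_scavenged` flag alongside the result lets a caller record which documents needed recovery, which matters because scavenged data may not match what the author intended.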
Charlie: I’ve corrected my redirects, so now you should be able to get to that code. Just be aware that that version can’t handle malformed nested outlines, whereas the version in xhtml5lib can.
Posted by Sam Ruby at
Having a “loose” XML parsing mode would be great, BUT, and this is a huge, massive BUT: If you do this, you absolutely must have a detailed specification first. Otherwise, you’ll just do to XML what happened to HTML: implementations will have to reverse-engineer each other’s parsing algorithms, and you’ll start an arms race which will eventually end up in Tag Soup. Only by having a very fixed specification of exactly how error handling should happen can you avoid such a mess.
Such a specification would need to cover things like how to handle bogus internal subsets, how to handle namespaces when nodes cross each other, and so forth. There are literally dozens of valid ways you could imagine to handle such malformed content; only by ensuring that all implementations do it the same way can you ensure that no “XML Tag Soup” hell comes out of this.
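To make the crossed-nodes case concrete, here is a toy sketch showing two equally plausible recovery rules producing different trees from the same input. Everything in it is hypothetical: the `build` function, its `close_through` flag, and the nested-list tree format are invented for illustration, and neither rule is claimed to be what html5lib, xhtml5lib, or any browser does.

```python
import re

# Matches a bare start or end tag (no attributes) or a run of character data.
TOKEN = re.compile(r"<(/?)(\w+)>|([^<]+)")


def build(text, close_through):
    """Build a nested-list tree, tolerating crossed end tags.

    close_through=True: an end tag closes every open element up to and
    including its match.
    close_through=False: an end tag is honoured only if it matches the
    innermost open element.
    Unmatched end tags are dropped; elements still open at the end of
    input are closed implicitly.
    """
    root = ["#root"]
    stack = [root]
    for closing, name, chars in TOKEN.findall(text):
        if chars:
            stack[-1].append(chars)
        elif not closing:
            node = [name]
            stack[-1].append(node)
            stack.append(node)
        else:
            open_names = [n[0] for n in stack[1:]]
            if close_through and name in open_names:
                while stack.pop()[0] != name:
                    pass
            elif open_names and open_names[-1] == name:
                stack.pop()
    return root


crossed = "<a><b>x</a>y</b>"
print(build(crossed, close_through=True))   # ['#root', ['a', ['b', 'x']], 'y']
print(build(crossed, close_through=False))  # ['#root', ['a', ['b', 'x', 'y']]]
```

Under the first rule the text `y` ends up outside both elements; under the second it ends up inside `b`. Two consumers that picked different rules would disagree about the same document.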
Spec first. Then write tests. Then implement interoperably.
Posted by Ian Hickson at
What Ian Hickson is saying here is what I said on the Liberal XML parsing thread as well. I’m willing to try making html5lib work for both XML and HTML, but there has to be a specification for it.
Posted by Anne van Kesteren at
Spec first. Then write tests. Then implement interoperably.
With all due respect, I don’t believe that this can be done in a waterfall approach.
In many ways, HTML5 can be viewed as a massive reverse engineering effort. Scan the massive database of HTML pages that we call the Internet. Test how existing browsers have adapted to this. See if there are any coherent and consistent rules that closely approximate what browsers have implemented. By the very nature of this activity, this can only be an approximation, as what browsers have implemented in rare edge cases can’t be described in a coherent and consistent fashion; in fact, it may be a moving target as bugs are fixed.
Nor was the Universal Feed Parser done in a waterfall approach. Test, code, real world usage, iterate, then spec would be a closer description of that process.
Additionally, I should state what my goal is: I am dissatisfied with the quality of output that Planet Venus produces. Venus depends on Beautiful Soup and the Universal Feed Parser, both of which depend on sgmllib, which is the true source of my dissatisfaction.
One way or another, I will write a replacement for sgmllib that more accurately reflects RSS, Atom, OPML, HTML and XHTML as practiced today. If I can’t find a good home for that code in places like html5lib, then I will simply look for others to collaborate with. And if I don’t find any, I will simply put it directly into Venus.
None of this should be interpreted as operating without any respect for specifications. For starters, I have chosen to build upon a rather solid foundation, and have taken great care to minimize the deviation from it. At present, that deviation consists solely of three things: case sensitivity, empty elements, and a much simplified phase diagram (as in Root->Element) for non-HTML dialects of XML.
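The first two of those deviations are easy to picture at the tokenizer level. The sketch below is a toy and not html5lib's tokenizer: the function name `start_tag_token`, the `xml_mode` flag, and the shortened `HTML_VOID` set are all assumptions made for illustration. In HTML mode, tag names are lowercased and emptiness depends on the element name; in XML mode, case is preserved and a trailing `/` makes any element empty.

```python
import re

# Shortened, illustrative list of HTML void elements (an assumption,
# not the complete set).
HTML_VOID = {"br", "hr", "img", "input", "link", "meta"}

# A single start tag; attribute values containing '>' are not handled.
START_TAG = re.compile(r"<([A-Za-z][^\s/>]*)[^>]*?(/?)>")


def start_tag_token(text, xml_mode):
    """Return (name, is_empty) for a single start tag such as '<Outline/>'."""
    match = START_TAG.fullmatch(text)
    if match is None:
        raise ValueError("not a start tag: %r" % text)
    name, slash = match.group(1), match.group(2)
    if xml_mode:
        # XML: names are case sensitive; '/>' marks any element as empty.
        return name, slash == "/"
    # HTML: names are case insensitive; the trailing slash is ignored and
    # emptiness depends on the element name alone.
    name = name.lower()
    return name, name in HTML_VOID


print(start_tag_token("<Outline/>", xml_mode=True))   # ('Outline', True)
print(start_tag_token("<Outline/>", xml_mode=False))  # ('outline', False)
print(start_tag_token("<BR>", xml_mode=False))        # ('br', True)
```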
The only major piece of work remaining is a full mapping to either DOM or SAX.
Posted by Sam Ruby at
but there has to be a specification for it.
And there will be. But it needs to be based on real world experience. Or, as Ian Hickson said, get the browser vendors to implement it first (as an experimental mode, for example) to demonstrate that they are willing to do so.
Posted by Sam Ruby at
I might be convinced. Hacking together a liberal XML parser and checking whether it works with the Universal Feed Parser, Venus and similar projects with large data sets might give us some actual input for a specification. It’s also something I have wanted to do myself for some time, but I hadn’t thought that html5lib might be a good foundation for it.
My main worries are the internal subset and namespaces though. Although I suppose you don’t have to handle namespaces at the tokenization level so that’s less of an issue.
Posted by Anne van Kesteren at
The OPML parsing code to which you point is 404 MIA. I’d love to have access to it though!
Regards,
Charlie
Posted by Charlie Wood at