It’s just data

REXML on Expat

Tim Bray: Ruby has a kind of stand-offish attitude towards two of my favorite pieces of infrastructure, XML and Unicode. REXML provides a nice API, but, as Sam Ruby discovered, has big-enough holes that you can’t point it at Arbitrary Internet XML and hope for good results.

I talked to Tim about this at OSCON, and took a look at it on the plane ride back.  It also gives me an opportunity to demonstrate something I only talked about previously: namely converting a SAX parser to a pull parser via continuations.

Preparation

First, one needs to install the Ruby interface to Expat.  With Ubuntu Dapper:

sudo apt-get install libxml-parser-ruby1.8

With other operating systems, it is somewhat harder.

Proof of Concept

The implementation is pretty simple: it calls the Expat parser and, for each event it receives, reformats the data into the structure that REXML::Parsers::BaseParser produces.
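To give a feel for that reformatting step, here is a minimal sketch. The raw event tuples (:start, :end, :cdata) are hypothetical stand-ins, not the actual Expat-Ruby interface, and the output arrays only approximate the BaseParser shape: an event type followed by positional arguments.

```ruby
# Reformat one raw push-parser event into a BaseParser-style array:
# an event type followed by positional arguments.
# The :start/:end/:cdata symbols are assumptions made for this sketch;
# the real Expat binding uses its own constants.
def reformat(type, name, data)
  case type
  when :start then [:start_element, name, data || {}]
  when :end   then [:end_element, name]
  when :cdata then [:text, data]
  else raise ArgumentError, "unhandled event type: #{type.inspect}"
  end
end

# Hypothetical raw events for a one-element document.
raw_events = [
  [:start, 'title', { 'type' => 'text' }],
  [:cdata, nil, 'REXML on Expat'],
  [:end,   'title', nil]
]

raw_events.each { |type, name, data| p reformat(type, name, data) }
```

A real adapter would also have to cover comments, processing instructions, and the DTD events.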

Two methods, push and pull, handle the context switches, and they are very simple: each makes a single call to callcc (saving the current stack) and a single call to the continuation’s call method to resume execution on the other stack.  Priming the pump involves a single additional use of callcc coupled with a return statement, forking the stack.
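The context switch can be sketched roughly as follows. This is an illustration of the idea, not the actual code: the PushToPull class and its instance variables are names invented here, and for brevity it primes the pump by starting the producer inside the first pull instead of using a separate callcc and return. On Ruby 1.9 and later callcc has to be loaded from the continuation library, and recent versions print a deprecation warning.

```ruby
require 'continuation' # needed on Ruby 1.9+; callcc was built in on 1.8

# Hypothetical adapter: wraps push-style code (a block that calls `push`
# once per event) and exposes the same events through `pull`.
class PushToPull
  def initialize(&producer)
    @producer = producer # the push-style code, e.g. a SAX parse loop
    @parser_side = nil   # continuation saved inside push
    @caller_side = nil   # continuation saved inside pull
    @finished = false
  end

  # Runs on the parser's stack: save it, then resume the caller with the event.
  def push(event)
    callcc do |k|
      @parser_side = k
      @caller_side.call(event)
    end
  end

  # Runs on the caller's stack: save it, then resume (or start) the parser.
  def pull
    return nil if @finished
    callcc do |k|
      @caller_side = k
      @parser_side.call if @parser_side
      # Only reached on the first pull: start the producer from here.
      @producer.call(self)
      @finished = true
      @caller_side.call(nil) # tell the most recent caller there is nothing left
    end
  end
end

parser = PushToPull.new do |out|
  [[:start_element, 'a', {}], [:text, 'hi'], [:end_element, 'a']].each do |event|
    out.push(event)
  end
end

while (event = parser.pull)
  p event
end
```

Note that the end-of-input jump goes through @caller_side and not the block-local k: by the time the producer finishes, the stack that is waiting for an answer belongs to the latest pull, not the first one.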

Included is a small but representative set of test cases.  Some ensure that the events produced by this code exactly match the ones produced by the BaseParser.  Others verify that a specific event is produced given a specific input.

I’ve also produced a simple demo application.  A real application would understand Atom’s processing model.

Future

First, I must stress that this is just a proof of concept at this point.  REXML’s base parser, for example, doesn’t resolve entity references, whereas Expat does.  If REXML compensates for this in other places in its code, the results would be incorrect with Expat.  A complete audit and a test suite are in order, bringing the semantics of REXML’s base parser in line with the other XML parsers.

Ideally, the code to allow other parsers to be registered would also be accepted into the REXML code base.  Additionally, the parameter which allows one to select which parser to use would need to be propagated up into interfaces like Document.new.

For Expat, there is more work that needs to be done.  The Expat-Ruby interface does not provide enough information to fully construct the corresponding DTD events.  A full pull parser interface also includes methods such as unshift and peek.  Assuming the REXML registration code is accepted, and given the popularity of REXML in the Ruby community, all of this could be coded in C and included with the Expat-Ruby module itself.
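For what it’s worth, unshift and peek can be layered on top of any object that has a pull method. The sketch below is one way to do that in plain Ruby. BufferedPull and ListSource are hypothetical names, and the assumption that pull returns nil at end of input is mine, not REXML’s.

```ruby
# Hypothetical wrapper: adds look-ahead and push-back to anything that
# responds to `pull` and returns nil once it is exhausted.
class BufferedPull
  def initialize(source)
    @source = source
    @buffer = [] # events read ahead or pushed back, next event first
  end

  # Return the next event, preferring buffered ones.
  def pull
    @buffer.empty? ? @source.pull : @buffer.shift
  end

  # Look at the event `depth` positions ahead without consuming it.
  def peek(depth = 0)
    while @buffer.size <= depth
      event = @source.pull
      return nil if event.nil?
      @buffer.push(event)
    end
    @buffer[depth]
  end

  # Push an event back so that the next pull returns it.
  def unshift(event)
    @buffer.unshift(event)
  end
end

# Stand-in source for the demo: hands out a fixed list of events.
class ListSource
  def initialize(events)
    @events = events.dup
  end

  def pull
    @events.shift
  end
end

parser = BufferedPull.new(ListSource.new([[:text, 'a'], [:text, 'b']]))
p parser.peek(1) # look one event past the next one
p parser.pull    # the next event is still there
```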

Similar efforts could also be made for other parsers, such as libxml2 and xerces-c.  One could then pick the parser one desires based on considerations such as performance, functionality, or portability.


Sam Ruby shows how to integrate Ruby’s REXML API with Expat. Interestingly, REXML is a pull API, while Expat is SAX-based — a difference Sam addresses via continuations. Very cool....... [more]

Trackback from Stefan Tilkov's Random Stuff

at

Sam Ruby: REXML on Expat

Simon Willison : Sam Ruby: REXML on Expat - Sam does something frighteningly clever with continuations....

Excerpt from HotLinks - Level 1 at

Parsing XML with REXML using Expat

Expat is the recognized big daddy of XML parsing. It’s a stream-based XML parser written in C and, as a library, is used for XML parsing functions by many languages. Rubyists have tended towards REXML, however, a more flexible (though infinitely...

Excerpt from Ruby Inside at

A blast from the past — on pull and push SAX APIs

Sam Ruby talks about using continuations in Ruby for SAX pull parsers. His pull-interface uses the structure that REXML::Parsers::BaseParser uses, namely an event type followed by positional arguments based on the event type. Back when I adapted...

Excerpt from Ken MacLeod at

Just for the sake of reference, on FreeBSD (6.1), it’s portinstall ruby18-libxml.

Posted by Keith Gaughan at

From push to pull with javaflow

Just recently Marcus blogged about how Sam turned expat into a pull parser via ruby continuations. I found that pretty interesting and was wondering if you could do the same with javaflow. So let’s run through this little program sketch…
A ...... [more]

Trackback from Torsten's weblog

at

Sam's continuations based REXML parser based on Expat

Just came across Sam’s REXML compatible XML parser based on Expat which had my brain thinking for a bit. The interesting thing about this XML parser (other than it implements the REXML interfaces with an Expat implementation) is that it’s......

Excerpt from crafterm's weblog at
