It’s just data

REXML and Mangled Text

Rick Blommers: ReXML seems to escape items very nicely when setting values.  But it doesn’t unescape the values with … )

A bare minimum amount of functionality that one would expect from an XML parsing library is the ability to round-trip data.  If you parse a document and immediately reserialize the result, you would expect to get the original back.  If you create a DOM, serialize it, and parse the results, you would expect to get the original back.  The version of REXML that comes with Ruby 1.8.4 gives you the latter.  The version of REXML that comes with Ruby 1.8.6 gives you the former.  Neither gives you both.
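The two round trips can be written down as a small check. This is only a sketch, not the test case discussed in this post: the helper names `parse_then_serialize` and `build_then_parse` are made up for illustration, and it assumes a Ruby installation where `rexml/document` can be loaded.

```ruby
require 'rexml/document'

# Round trip 1: parse a document and immediately reserialize it.
# The result should be the same markup that went in.
def parse_then_serialize(xml)
  REXML::Document.new(xml).root.to_s
end

# Round trip 2: build a DOM, serialize it, and parse the result.
# The text that comes back should be the text that was set.
def build_then_parse(value)
  doc = REXML::Document.new
  doc.add_element('a').text = value
  REXML::Document.new(doc.to_s).root.text
end

markup = '<a>&lt;b&gt; &amp; c</a>'
puts parse_then_serialize(markup) == markup

value = '<b> & c'
puts build_then_parse(value) == value
```

A library that round-trips correctly prints true for both checks. The REXML versions described above would be expected to fail one or the other.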

This test case can be used to explore this situation.  When you run it using Ruby 1.8.6 and pass nots (no test serializer) as a command line argument, everything passes.  If you pass notp (no test parser) instead, you will see 30 failures.  With mp notp (monkey patch and no test parser), everything passes, but with mp nots you will see 30 failures.

The root problem is in text.rb.  Line 147 will “normalize” (entity encode) @string in response to calls to to_s.  Line 174 will “unnormalize” (entity decode) @string in response to calls to value.

The key question is: is @string already entity encoded (in which case normalize will double encode it)?  Or is @string already entity decoded (in which case value will double decode it)?  The answer can be found in @raw.  If it is set, @string is assumed to be entity encoded, in which case to_s simply returns it.  If it is not set (the default), you would assume that the reverse would be true and that value would simply return @string, but no such short circuiting exists in value.  Additionally, the keyword return is missing in the first line of value, eliminating a potential optimization.
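To make the symmetry concrete, here is a minimal sketch of what short circuiting in both directions could look like. `SketchText` is a hypothetical stand-in, not the code in text.rb, and it only handles three entities. The point is just that the raw flag decides which of to_s and value has to do any work.

```ruby
# Hypothetical stand-in for REXML::Text; not the library's actual code.
class SketchText
  ENCODE = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze
  DECODE = ENCODE.invert.freeze

  # raw == true means the string passed in is already entity encoded.
  def initialize(string, raw = false)
    @string = string
    @raw = raw
  end

  # Serialized form: encode only when @string is not already encoded.
  def to_s
    return @string if @raw
    @string.gsub(/[&<>]/) { |char| ENCODE[char] }
  end

  # Logical value: decode only when @string is encoded.
  # A single gsub pass means each entity is decoded exactly once.
  def value
    return @string unless @raw
    @string.gsub(/&(?:amp|lt|gt);/) { |entity| DECODE[entity] }
  end
end

decoded = SketchText.new('a & b')            # not raw: holds decoded text
encoded = SketchText.new('a &amp; b', true)  # raw: holds encoded text

puts decoded.to_s == encoded.to_s
puts decoded.value == encoded.value
```

With both short circuits in place, the same logical text gives the same to_s and the same value whichever way it was stored, so neither direction gets encoded or decoded twice.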

There are other issues with the code.  For example, try REXML::Text.unnormalize('&') (which works as expected) and REXML::Text.unnormalize('&&')  (which doesn’t).

“when the world ends, the only things left will be cockroaches, rats, Keith Richards, and mangled text that has been escaped one-too-many or one-too-few times” — Dave Walker

The two things I have yet to find are where I can check out the latest code from SVN, and how to run the existing set of tests.  I would like to submit new tests which expose the problems I have found so far, and patches to correct these issues.  Ideally in time for 3.1.8.

Pointers appreciated.