It’s just data

Producing Well Formed XML with Rails

I started out taking a look at how I could robustly handle i18n in my Rails Weblog implementation, and ended up in a completely different place - ensuring that Weblog produced well formed XML.

As described previously, atom.rxml uses Ruby’s XML builder.  I was going to look into enhancing the escaping function to handle utf-8, iso-8859-1, and windows-1252 for both element and attribute values when I noticed that escaping was only done on element values.

Perhaps this is best explained by example:

code                          output
@xml.title('1<2')             <title>1&lt;2</title>
@xml.title('AT&T')            <title>AT&amp;T</title>
@xml.title('&amp;')           <title>&amp;amp;</title>
@xml.a(:title => '1<2')       <a title="1<2"/>
@xml.a(:title => 'AT&T')      <a title="AT&T"/>
@xml.a(:title => '"x"')       <a title=""x""/>

This is either a case of “everybody knows” that the XML builder expects pre-escaped attribute values, or an oversight.  If the former, then I expect a lot of people who build podcast feeds to produce XML that is not well formed if any of the URIs contain multiple query parameters.
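A minimal sketch of the podcast case, under stated assumptions: the enclosure URL is made up, and escape_attr is a hypothetical helper written for this illustration, not something Builder provides. It only shows why a second query parameter is enough to break a feed when attribute values are copied through verbatim.

```ruby
# Hypothetical helper, not part of Builder: escape the characters that
# are unsafe inside a double-quoted XML attribute value.
ATTR_ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

def escape_attr(value)
  value.to_s.gsub(/[&<>"]/) { |ch| ATTR_ESCAPES[ch] }
end

# Made-up enclosure URL with two query parameters.
url = 'http://example.com/get?show=42&format=mp3'

# Interpolating the raw value leaves a bare "&", which is not well formed.
raw_tag = %(<enclosure url="#{url}"/>)

# Escaping first yields "&amp;", which a parser turns back into "&".
safe_tag = %(<enclosure url="#{escape_attr(url)}"/>)

puts raw_tag
puts safe_tag
```

A URI with a single query parameter contains no ampersand, which is probably why this can go unnoticed for a long time.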

If it is indeed an oversight, then it is one that is easily correctable, even locally, given that classes tend to be “open” (i.e., modifiable) at runtime.
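To illustrate what “open” means here, the following sketch reopens a class at runtime and replaces a single method. TinyMarkup and its attr_value method are hypothetical stand-ins invented for this example; this is not Builder::XmlMarkup, and it is not the contents of the fix mentioned below.

```ruby
# Hypothetical stand-in for an XML builder; NOT Builder::XmlMarkup.
class TinyMarkup
  # Emit an empty element, copying attribute values through verbatim.
  def tag(name, attrs = {})
    pairs = attrs.map { |key, value| %( #{key}="#{attr_value(value)}") }
    "<#{name}#{pairs.join}/>"
  end

  private

  def attr_value(value)
    value.to_s
  end
end

before = TinyMarkup.new.tag(:a, :title => 'AT&T')

# Reopen the class at runtime and replace only the method at fault.
class TinyMarkup
  ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;' }.freeze

  private

  def attr_value(value)
    value.to_s.gsub(/[&<>"]/) { |ch| ESCAPES[ch] }
  end
end

after = TinyMarkup.new.tag(:a, :title => 'AT&T')

puts before
puts after
```

The second class block does not define a new class; it adds to the existing one, so every caller picks up the replacement method without any change to the original source file.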

Here’s an attr_escape_fix.rb with a few tests.


Is that fourth example really invalid XML?  I don’t think that it is. 

Anne van Kesteren just had a post about the greater-than sign where he references this spec.  I think it’s acceptable within the attribute value.

Posted by Joe W. at

Joe W. is correct; the fourth example is (according to libxml2) well-formed XML.  The fifth and sixth examples are not.

Either way, this state of affairs is obviously less than optimal.  Leaving the classes as-is can result in non-wellformed XML; fixing them would introduce a subtle backward incompatibility.

See also: HOWTO Avoid Being Called a Bozo When Producing XML, which — despite the pretentious irony of claiming “There seem to be developers who think that well-formedness is awfully hard” while issuing head-spinning warnings like “when UTF-16 data is converted into UTF-8, the surrogate pair needs to be converted into the scalar value of the code point which is then converted into a 4-byte UTF-8 byte sequence” — is an excellent introduction to the issue of XML well-formedness.

Posted by Mark at

Sam Ruby: Producing Well Formed XML with Rails

[link]...

Excerpt from del.icio.us/tag/ruby at

Joe W: good catch.  I’ve updated the fourth (and first) examples.  Now the fourth example is not well formed, according to libxml2:

require 'xml/libxml'
p = XML::Parser.new
p.string = '<a title="1<2"/>'
p.parse

produces:

Entity: line 1: parser error : Unescaped '<' not allowed in attributes values
<a title="1<2"/>
	   ^
Entity: line 1: parser error : attributes construct error
<a title="1<2"/>
	   ^
Entity: line 1: parser error : Couldn't find end of Start Tag a line 1
<a title="1<2"/>
	   ^
Entity: line 1: parser error : Extra content at the end of the document
<a title="1<2"/>
	   ^
XML::Parser::ParseError: Document didn't parse
	from (irb):21:in `parse'
	from (irb):21
	from :0

Posted by Sam Ruby at

What timing.  This very weekend, after the umpteenth time someone asked why attributes were not escaped[1], I decided to add optional escaping to attribute values.  Create your builder like so to get escaped attributes:

  xml = Builder::XmlMarkup.new(:escape_attrs => true)

This is in the CVS head for Builder, but I did miss the ‘"’ escaping.  I’ll fix it up and release it soon.

Oh, and BTW, thanks for the BOZO link.  Very useful.

[1] An early user of builder was using entities explicitly in attribute values.  Escaping attribute values would make that use very difficult.  On the flip side 99% of the users don’t care about that use case, so that was probably a bad call early on.

Posted by Jim Weirich at

Sam Ruby: Producing Well Formed XML with Rails

“I started out taking a look at how I could robustly handle i18n in my Rails Weblog implementation, and ended up in a completely different place - ensuring that Weblog produced well formed...

Excerpt from Hacking Feeds at

Sam Ruby: Producing Well Formed XML with Rails

deusx : Sam Ruby: Producing Well Formed XML with Rails - "I started out taking a look at how I could robustly handle i18n in my Rails Weblog implementation, and ended up in a completely different place - ensuring that Weblog produced well formed...

Excerpt from HotLinks - Level 1 at

I decided to add optional escaping to attribute values.

Cool!

Jim, while I don’t normally quibble over defaults, I would urge you to reconsider in this case.  I would argue that more people care about producing well formed XML than care about using entities explicitly in attributes.  More importantly, those that care about using entities explicitly are more likely to seek out and set this attribute than those who don’t.

Posted by Sam Ruby at

Hey Sam,
Looking at the copyright notice in your code sample got me thinking...under what terms/license do you publish the content of this site? I noticed none of your feeds make reference to any Creative Commons or OSI license. Is there any particular reason you don’t have a blanket license for most works (including presentations, code, writings)? Also, just looking out for you - as long as you are going to include a copyright statement, you might want to consider a hold harmless clause.

Posted by Christian Romney at

An example Ruby patch

Todd Huss posted his thoughts on dealing with patches to dependencies that you rely on, in response to my Tweaking on the bleeding edge: Ruby vs. Java. Sam Ruby found an issue in Ruby’s XmlMarkup builder. He put up a fix for this which is very clean....

Excerpt from techno.blog("Dion") at

I would argue that more people care about producing well formed XML than care about using entities explicitly in attributes.

However, in this case fixing the problem would break compatibility with apps that expect the old behavior. :-(

I think the API abstraction should completely hide escaping and not allow micromanagement of which characters are escaped and how. That way, the app programmer cannot break things. If the backwards-compatibility issue did not exist, I think using pre-escaped strings should not even be an option.

Posted by Henri Sivonen at

However, in this case fixing the problem would break compatibility with apps that expect the old behavior.

Several counterpoints:

Net: in real life, I tend to find that there are rarely any absolutes.  Every bug is potentially a feature, and therefore every bug fix is potentially a breaking change.  That doesn’t mean that bug fixes shouldn’t be made.  Both builder and rails are comparatively young and — at the moment at least — fairly free from the cruft that this option is an example of.

Three years from now, is this the way we all would like to see this API look?

Posted by Sam Ruby at

Christian: I only included that statement in this example as I didn’t want to either violate the author’s wishes, or to misrepresent the authorship of my changes in case the author is uncomfortable with them.  In general, I don’t seek any of the goals that the creative commons provides for (example: I am quite OK with my name being dropped from this contribution).

Mostly what I am interested in is avoiding litigation and preventing misrepresentation.

Posted by Sam Ruby at

Three years from now, is this the way we all would like to see this API look?

Oh, you know just what arguments pull my strings! :)

You are right, I’m convinced.

Posted by Jim Weirich at

Jim,

If you want to, you can keep your defaults as is.  There’s nothing to stop us (Rails) from setting that attribute on our builder instance.

Cheers

Koz

Posted by Michael Koziarski at

HOWTO Avoid Being Called a Bozo When Producing XML

HOWTO Avoid Being Called a Bozo When Producing XML, via Mark on Sam Ruby’s blog....

Excerpt from Keith's Weblog at

On how to evolve an API

The always brilliant Sam Ruby is porting his blog’s software to Ruby (how metaphysical, right?) and along the way has found himself fixing bugs that exist within the platform. Specifically, while generating the Atom feed for its syndication...

Excerpt from finis coronat opus: On how to evolve an API at

Logistics and teamwork

Amy Hoy’s Atom feed is busted. Amy uses Typo.  Typo uses builder.  Builder had a quirk.  It did not automatically quote attributes.  This has been fixed. Typo picks up builder via rails 1.1 through the magic of svn:externals. The version of builder in svn... [more]

Trackback from Sam Ruby

at

Typo-Atom patch

Seven months ago, I got Jim Weirich to make a change to builder. Monday, I got David Heinemeier Hansson to incorporate that change into Rails. Today I noticed that the title for this entry showed up as Associations aren’t :dependent =&gt; true anym... [more]

Trackback from Sam Ruby

at

What a Beginner Webmaster Should Know

CSS: style sheets. Web pages are made of text and graphics. They can be created with a simple text editor such as Notepad or similar ones. That does not mean the result is poor; on the contrary, the most elaborate pages are...

Excerpt from webyfoto at
