I started out taking a look at how I could robustly handle i18n in my Rails Weblog implementation, and ended up in a completely different place - ensuring that Weblog produced well formed XML.
As described previously, atom.rxml uses Ruby’s XML builder. I was going to look into enhancing the escaping function to handle utf-8, iso-8859-1, and windows-1252 for both element and attribute values when I noticed that escaping was only done on element values.
Perhaps this is best explained by example:
code                          output
@xml.title('1<2')             <title>1&lt;2</title>
@xml.title('AT&T')            <title>AT&amp;T</title>
@xml.title('&')               <title>&amp;</title>
@xml.a(:title => '1<2')       <a title="1<2"/>
@xml.a(:title => 'AT&T')      <a title="AT&T"/>
@xml.a(:title => '"x"')       <a title=""x""/>
This is either a case of “everybody knows” that the
XML builder expects pre-escaped attribute values, or an
oversight. If the former, then I expect a lot of people who
build podcast feeds to produce XML that is not well formed if any
of the URIs contain multiple query parameters.
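To make the podcast case concrete, here is a minimal sketch in plain Ruby, with no Builder involved. The `escape_attr` helper and the enclosure URL are hypothetical, made up for illustration; the sketch only shows the escaping a caller would have to apply by hand if attribute values are expected to arrive pre-escaped.

```ruby
# Hypothetical helper, not part of Builder: escape the characters that
# cannot appear literally in a double-quoted XML attribute value.
def escape_attr(value)
  value.to_s.gsub(/[&<>"]/,
                  '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;')
end

# Made-up enclosure URL with two query parameters.
url = 'http://example.com/episode.mp3?show=1&part=2'

unescaped = %(<enclosure url="#{url}"/>)
escaped   = %(<enclosure url="#{escape_attr(url)}"/>)

puts unescaped  # bare "&" inside the attribute: not well formed
puts escaped    # "&" written as "&amp;": well formed
```

A URI with a single query parameter has no ampersand and so never trips this, which is probably why the problem can go unnoticed for a long time.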
If it is indeed an oversight, then it is one that is easily
correctable, even locally, given that classes tend to be
“open” (i.e., modifiable) at runtime.
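As a rough illustration of what a local correction through an open class could look like, the sketch below uses a made-up `TinyBuilder` class. It is a stand-in, not Builder's real implementation, and the `attr_value` hook is an assumption of this example; Builder's own internals may be organized quite differently.

```ruby
# TinyBuilder is a made-up stand-in for an XML builder; it is NOT Builder's
# real implementation. As first defined, it emits attribute values verbatim.
class TinyBuilder
  def tag(name, attrs = {})
    pairs = attrs.map { |key, value| %( #{key}="#{attr_value(value)}") }.join
    "<#{name}#{pairs}/>"
  end

  private

  def attr_value(value)
    value.to_s
  end
end

# Later, in application code: reopen the class and replace only the
# attribute-value hook so that it escapes markup characters.
class TinyBuilder
  private

  def attr_value(value)
    value.to_s.gsub(/[&<>"]/,
                    '&' => '&amp;', '<' => '&lt;', '>' => '&gt;', '"' => '&quot;')
  end
end

puts TinyBuilder.new.tag(:a, :title => 'AT&T')
puts TinyBuilder.new.tag(:a, :title => '"x"')
```

The second definition replaces the first for every instance, including ones created by a framework, which is what makes this kind of fix possible without touching the library's source.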
Joe W. is correct; the fourth example is (according to libxml2) well-formed XML. The fifth and sixth examples are not.
Either way, this state of affairs is obviously less than optimal. Leaving the classes as-is can result in XML that is not well formed; fixing them would introduce a subtle backward incompatibility.
See also: HOWTO Avoid Being Called a Bozo When Producing XML, which — despite the pretentious irony of claiming “There seem to be developers who think that well-formedness is awfully hard” while issuing head-spinning warnings like “when UTF-16 data is converted into UTF-8, the surrogate pair needs to be converted into the scalar value of the code point which is then converted into a 4-byte UTF-8 byte sequence” — is an excellent introduction to the issue of XML well-formedness.
Joe W: good catch. I’ve updated the fourth (and first) examples. Now the fourth example is not well formed, according to libxml2:
require 'xml/libxml'
p = XML::Parser.new
p.string = '<a title="1<2"/>'
p.parse
produces:
Entity: line 1: parser error : Unescaped '<' not allowed in attributes values
<a title="1<2"/>
^
Entity: line 1: parser error : attributes construct error
<a title="1<2"/>
^
Entity: line 1: parser error : Couldn't find end of Start Tag a line 1
<a title="1<2"/>
^
Entity: line 1: parser error : Extra content at the end of the document
<a title="1<2"/>
^
XML::Parser::ParseError: Document didn't parse
from (irb):21:in `parse'
from (irb):21
from :0
What timing. This very weekend, after the umpteenth time someone asked why attributes were not escaped[1], I decided to add optional escaping to attribute values. Create your builder like so to get escaped attributes:
xml = Builder::XmlMarkup.new(:escape_attrs => true)
This is in the CVS head for Builder, but I did miss the ‘"’ escaping. I’ll fix it up and release it soon.
Oh, and BTW, thanks for the BOZO link. Very useful.
[1] An early user of builder was using entities explicitly in attribute values. Escaping attribute values would make that use very difficult. On the flip side 99% of the users don’t care about that use case, so that was probably a bad call early on.
Sam Ruby: Producing Well Formed XML with Rails “I started out taking a look at how I could robustly handle i18n in my Rails Weblog implementation, and ended up in a completely different place - ensuring that Weblog produced well formed...
I decided to add optional escaping to attribute values.
Cool!
Jim, while I don’t normally quibble over defaults, I would urge you to reconsider in this case. I would argue that more people care about producing well formed XML than care about using entities explicitly in attributes. More importantly, those that care about using entities explicitly are more likely to seek out and set this attribute than those who don’t.
Hey Sam,
Looking at the copyright notice in your code sample got me thinking...under what terms/license do you publish the content of this site? I noticed none of your feeds make reference to any Creative Commons or OSI license. Is there any particular reason you don’t have a blanket license for most works (including presentations, code, writings)? Also, just looking out for you - as long as you are going to include a copyright statement, you might want to consider a hold harmless clause.
Todd Huss posted his thoughts on dealing with patches to dependencies that you rely on, in response to my Tweaking on the bleeding edge: Ruby vs. Java. Sam Ruby found an issue in Ruby’s XmlMarkup builder. He put up a fix for this which is very clean....
I would argue that more people care about producing well formed XML than care about using entities explicitly in attributes.
However, in this case fixing the problem would break compatibility with apps that expect the old behavior. :-(
I think the API abstraction should completely hide escaping and not allow micromanagement of which characters are escaped and how. That way, the app programmer cannot break things. If the backwards-compatibility issue did not exist, I think using pre-escaped strings should not even be an option.
However, in this case fixing the problem would break compatibility with apps that expect the old behavior.
Several counterpoints:
This just moves the problem. Rails automatically instantiates the builder object, and generally eschews configuration. Requiring configuration for an option that 99% of the users will want (or equivalently — will be prone to error if not set) just doesn’t make good sense.
Apparently, the API changed for builder 1.0, so there is precedent for making incompatible changes.
As was previously stated, 99% of the users would find it surprising if they were told that they were responsible for escaping — but only in the case of attributes. Most of these people won’t notice the change, except for the fact that the documents that they produce are not only more often well formed, but also that these documents actually convey the information that they expect.
The users that expect to have control over escaping are more likely to have unit tests that will fail if the default changed - and therefore, would more likely be in a better position to react (i.e., discover and make the one line change).
Net: in real life, I tend to find that there are rarely any absolutes. Every bug is potentially a feature, and therefore every bug fix is potentially a breaking change. That doesn’t mean that bug fixes shouldn’t be made. Both builder and rails are comparatively young and — at the moment at least — fairly free from the cruft that this option is an example of.
Three years from now, is this the way we all would like to see this API look?
Christian: I only included that statement in this example as I didn’t want to either violate the author’s wishes, or to misrepresent the authorship of my changes in case the author is uncomfortable with them. In general, I don’t seek any of the goals that Creative Commons provides for (example: I am quite OK with my name being dropped from this contribution).
Mostly what I am interested in is avoiding litigation and preventing misrepresentation.
The always brilliant Sam Ruby is porting his blog software to Ruby (how metaphysical, right?) and along the way has found himself fixing bugs that exist within the platform. Specifically, in generating the Atom feed for its syndication...
Amy Hoy’s Atom feed is busted. Amy uses Typo. Typo uses builder. Builder had a quirk. It did not automatically quote attributes. This has been fixed. Typo picks up builder via rails 1.1 through the magic of svn:externals. The version of builder in svn...
Seven months ago, I got Jim Weirich to make a change to builder. Monday, I got David Heinemeier Hansson to incorporate that change into Rails. Today I noticed that the title for this entry showed up as Associations aren’t :dependent => true anym...