One of the joys of Ruby and HTML5 is that one can easily extract data from a web page with an XPath expression. For example, the following extracts the URI of the RSD document from a weblog that supports RSD:
    require 'open-uri'
    require 'html5/html5parser'
    doc = HTML5::HTMLParser.parse(open(ARGV[0]))
    rsd = doc.elements['//link[@type="application/rsd+xml"]/@href'].to_s
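For readers who want to try the XPath lookup without fetching a page, here is a minimal sketch that uses REXML alone on an inline document. The sample markup, the `example.com` URL, and the helper name `rsd_href` are assumptions made for illustration. The sample deliberately has no default namespace, so it sidesteps the attribute problem described in this post.

```ruby
require 'rexml/document'

# Hypothetical sample page; note the absence of a default namespace.
SAMPLE = '<html><head>' \
         '<link rel="EditURI" type="application/rsd+xml" href="http://example.com/rsd.xml"/>' \
         '</head><body/></html>'

# Hypothetical helper: returns the href of the first RSD link, or nil if none.
def rsd_href(xml)
  doc = REXML::Document.new(xml)
  # The XPath selects the href attribute node of the matching link element.
  attr = REXML::XPath.first(doc, '//link[@type="application/rsd+xml"]/@href')
  attr && attr.value
end

puts rsd_href(SAMPLE)
```

Returning nil when no link matches keeps the caller from having to rescue a `NoMethodError` on pages that do not advertise RSD.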
Unfortunately, there is a bug in Ruby 1.8.6 that affects documents with a default namespace (even a vestigial one, like those sported by WordPress weblogs): attribute names that are not namespace-qualified stop working in XPath expressions.
The following monkey-patch fixes this:
    require 'rexml/document'
    doc = REXML::Document.new '<doc xmlns="ns"><item name="foo"/></doc>'
    if not doc.root.elements["item[@name='foo']"]
      class REXML::Element
        def attribute( name, namespace=nil )
          prefix = nil
          prefix = namespaces.index(namespace) if namespace
          prefix = nil if prefix == 'xmlns'
          attributes.get_attribute( "#{prefix ? prefix + ':' : ''}#{name}" )
        end
      end
    end
As I am bound to hit this issue frequently, I’ve added it to my monkey_patches file:
export RUBYOPT='-rubygems -r/home/rubys/bin/monkey_patches'
Or put another way:
“There is a bug in Ruby 1.8.6. Until a version is released which fixes this, here you go.”
Interesting concept.
Thanks for steering me in the right direction.
In addition to the problem you’ve described, I found that this method does not work if there are phantom namespaces left over after an XSLT transformation.
Basically, the following test case does not work:
    require 'rexml/document'
    doc = REXML::Document.new(
      '<doc xmlns="ns" xmlns:phantom="ns"><item name="foo">text</item></doc>'
    )
    p doc.text( "/doc/item[@name='foo']" )
    p doc.root.elements["item"].attribute("name", "ns")
    p doc.root.elements["item[@name='foo']"]
These are the test results in Ruby 1.8.6:
    $ ruby test.rb
    nil
    nil
    nil
With the following monkey-patch...
    require 'rexml/document'
    doc = REXML::Document.new(
      '<doc xmlns="ns" xmlns:bar="ns"><item name="foo"/></doc>'
    )
    if not doc.root.elements["item[@name='foo']"]
      class REXML::Element
        def attribute( name, namespace=nil )
          prefix = nil
          prefix = namespaces.index(namespace) if namespace
          prefix = nil if prefix == 'xmlns'
          ret_val = attributes.get_attribute( "#{prefix ? prefix + ':' : ''}#{name}" )
          return ret_val unless ret_val.nil?
          return nil if prefix.nil?
          # now check that prefix'es namespace is not the same as the
          # default namespace
          return nil unless ( namespaces[ prefix ] == namespaces[ 'xmlns' ] )
          attributes.get_attribute( name )
        end
      end
    end
... the test produces expected results
    $ ruby test.rb
    "text"
    name='foo'
    <item name='foo'/>
I’ve submitted a bug report and a proposed fix to REXML.