intertwingly

It’s just data

Ruby HTML5 Parser


I got enough of this running to demonstrate proof of concept:

require 'open-uri'
require 'html5lib/html5parser'

# Fetch the page and parse it into a REXML document
uri = 'http://www.whatwg.org/'
doc = HTML5lib::HTMLParser.parse(open(uri))

# For each "what-to-do" link, print the emphasized title and the href
doc.elements.each('//p[@class="what-to-do"]/a') {|link|
  link.elements.each('em') {|title| print title.children}
  puts ":\t#{link.attribute('href')}"
}

REXML is used for the TreeBuilder, so the parsed document can be queried with REXML's XPath support.
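To try the querying half on its own, here is a minimal sketch that uses only REXML, with no HTML5 parser involved. The sample markup, the URLs in it, and the `results` variable are made up for illustration; the markup only mimics the structure the XPath expression expects and is not the real page.

```ruby
require 'rexml/document'

# Hypothetical well-formed markup standing in for a parsed page
xml = <<~XML
  <html><body>
    <p class="what-to-do"><a href="/specs/"><em>Read</em> the specs</a></p>
    <p class="what-to-do"><a href="/faq/"><em>Check</em> the FAQ</a></p>
    <p class="other"><a href="/ignored/"><em>Skip</em> me</a></p>
  </body></html>
XML

doc = REXML::Document.new(xml)

# Collect "title:<TAB>href" for each link inside a what-to-do paragraph
results = []
doc.elements.each('//p[@class="what-to-do"]/a') do |link|
  title = link.elements['em'].text       # text of the first <em> child
  results << "#{title}:\t#{link.attributes['href']}"
end

puts results
```

Because the tree is plain REXML, anything that works on a `REXML::Document` built from XML should work the same way on the parser's output, assuming the TreeBuilder produces an equivalent structure.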

I’m looking for help.  Interested?  Join the group.

Update: First patch