It’s just data

HTML5 Sanitizer

A while back, I commented that I would likely backport Jacques’s sanitizer to Python.  I still haven’t gotten around to that, but I have ported it to the Ruby version of html5lib (source, tests).

My approach was slightly different.  I made it a subclass of HTMLTokenizer, meaning that the parsing and sanitization are all done in one pass, with the results sent to the treebuilder of your choice.

Example usage:

require 'html5lib/sanitizer'
require 'html5lib/html5parser'
include HTML5lib

# stream holds the untrusted HTML; HTMLSanitizer takes the place of the default tokenizer
HTMLParser.parse(stream, :tokenizer => HTMLSanitizer).to_s
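
To make the one-pass idea concrete, here is a minimal sketch of a sanitizer written as a tokenizer subclass.  It is not html5lib's code: ToyTokenizer, ToySanitizer, ToySerializer and the two whitelists are made-up names for illustration, and the regex-based tokenizer is only a stand-in for a real HTML5 tokenizer.  The point is the shape: the subclass filters each token as it leaves the tokenizer, so whatever consumes the tokens never sees unsafe markup.

```ruby
# Toy stand-in for a tokenizer; NOT html5lib's HTMLTokenizer.
class ToyTokenizer
  TOKEN = /<(\/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>|([^<]+)/

  def initialize(html)
    @html = html
  end

  # Yield tokens one at a time, so a consumer can work as they arrive.
  def each
    @html.scan(TOKEN) do |slash, name, attrs, text|
      if text
        # Entities are not decoded here, so text stays in its source form.
        yield({ type: :Characters, data: text })
      else
        pairs = attrs.scan(/([a-zA-Z-]+)\s*=\s*"([^"]*)"/)
        yield({ type: slash.empty? ? :StartTag : :EndTag,
                name: name.downcase, data: pairs })
      end
    end
  end
end

# Sanitizing subclass: filters each token on its way out of the tokenizer.
class ToySanitizer < ToyTokenizer
  ALLOWED_ELEMENTS   = %w[a b em i p strong].freeze  # illustrative whitelist
  ALLOWED_ATTRIBUTES = %w[href title].freeze         # illustrative whitelist
  ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

  def each
    super do |token|
      if token[:type] == :Characters
        yield token
      elsif ALLOWED_ELEMENTS.include?(token[:name])
        # Allowed tag: keep it, but drop attributes not on the whitelist.
        kept = token[:data].select { |key, _| ALLOWED_ATTRIBUTES.include?(key.downcase) }
        yield token.merge(data: kept)
      else
        # Disallowed tag: hand it on as escaped text instead of markup.
        yield({ type: :Characters,
                data: ToySerializer.tag(token).gsub(/[&<>]/, ESCAPES) })
      end
    end
  end
end

# Minimal consumer standing in for a treebuilder plus serializer.
module ToySerializer
  def self.tag(token)
    return "</#{token[:name]}>" if token[:type] == :EndTag
    attrs = token[:data].map { |key, value| %( #{key}="#{value}") }.join
    "<#{token[:name]}#{attrs}>"
  end

  def self.render(tokenizer)
    out = +''
    tokenizer.each do |token|
      out << (token[:type] == :Characters ? token[:data] : tag(token))
    end
    out
  end
end

html = '<p onclick="x()">Hi <script>alert(1)</script><b>there</b></p>'
puts ToySerializer.render(ToySanitizer.new(html))
# prints: <p>Hi &lt;script&gt;alert(1)&lt;/script&gt;<b>there</b></p>
```

Because the consumer only ever calls each, swapping ToyTokenizer for ToySanitizer is the whole change, which mirrors passing :tokenizer => HTMLSanitizer in the usage above.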

Other differences worth noting:

As the differences were more than I had anticipated, doing this work in two steps, with the results verified via unit tests at each step, will make the overall effort easier.

Update: Python version is now available: source, tests