A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests).
My approach was slightly different. I made it a subclass of HTMLTokenizer, meaning that parsing and sanitization are both done in one pass, with the results sent to the treebuilder of your choice.
    require 'html5lib/sanitizer'
    require 'html5lib/html5parser'
    include HTML5lib
    HTMLParser.parse(stream, :tokenizer => HTMLSanitizer).to_s
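To make the "sanitizer as a tokenizer subclass" idea concrete, here is a minimal, self-contained sketch. It is not html5lib's code: ToyTokenizer, ToySanitizer, ALLOWED and serialize_tokens are hypothetical names invented for illustration, the regex-based tokenizer is a crude stand-in for a real one, and attributes are not inspected at all. The only point is that the subclass filters tokens as they stream past, so whatever consumes the tokens sees sanitized output in a single pass.

```ruby
# Hypothetical stand-in for a tokenizer: yields one token hash at a time.
# A real HTML tokenizer is far more involved than this regex.
class ToyTokenizer
  include Enumerable

  TAG = /<(\/?)([A-Za-z][A-Za-z0-9]*)[^>]*>/

  def initialize(html)
    @html = html
  end

  def each
    pos = 0
    while (m = TAG.match(@html, pos))
      text = @html[pos...m.begin(0)]
      yield({ type: :Characters, data: text }) unless text.empty?
      type = m[1].empty? ? :StartTag : :EndTag
      yield({ type: type, name: m[2], raw: m[0] })
      pos = m.end(0)
    end
    rest = @html[pos..-1]
    yield({ type: :Characters, data: rest }) unless rest.empty?
  end
end

# Hypothetical sanitizer: same interface, but tags that are not on the
# (assumed) whitelist are re-emitted as character data, i.e. "defanged".
class ToySanitizer < ToyTokenizer
  ALLOWED = %w[a b i em strong p img].freeze

  def each
    super do |token|
      if token[:type] == :Characters || ALLOWED.include?(token[:name].downcase)
        yield token
      else
        yield({ type: :Characters, data: token[:raw] })
      end
    end
  end
end

# Stand-in for a treebuilder plus serializer: escapes character data and
# passes allowed tags through untouched.
def serialize_tokens(tokens)
  tokens.map do |t|
    if t[:type] == :Characters
      t[:data].gsub('&', '&amp;').gsub('<', '&lt;').gsub('>', '&gt;')
    else
      t[:raw]
    end
  end.join
end

puts serialize_tokens(ToySanitizer.new('<b>hi</b><script>alert(1)</script>'))
# prints: <b>hi</b>&lt;script&gt;alert(1)&lt;/script&gt;
```

Because the consumer only ever asks for tokens, swapping ToyTokenizer for ToySanitizer requires no change on the consuming side, which is the same shape as passing a different :tokenizer to the parser.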
Other differences worth noting:
For my unit tests, I used REXML to serialize the DOM tree. REXML serializes attribute values using single quotes instead of double quotes, and doesn’t insert a space before the trailing slash of an empty-element tag.
Jacques’s library assumes iso-8859-1. html5lib assumes utf-8. This affected the testing for
in XSS attacks.
Because html5lib has more built in knowledge of HTML, a number of results are different. A few examples:
| input | HTML::Tokenizer | html5lib |
|---|---|---|
| <a>boo</a> | <a>boo</a> | <a>boo</a> |
| <img>boo</img> | <img>boo</img> | <img/>boo |
| <image>boo</image> | <image>boo</image> | <img/>boo |
| <table>boo</table> | <table>boo</table> | boo |
This sanitizer “defangs” scripts and the like by exposing the tags as character data. The Universal Feed Parser (and therefore Venus) simply drops the scripts. This likely will need to be an option.
In several cases, the “defanging” is different with html5lib, but in every case covered by the unit test suite it is just as effective.
HTMLParser preemptively downcases element and attribute names. XHTMLParser does not. Pick your poison.
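The REXML serialization quirk from the list above is easy to see in isolation. The sketch below uses only REXML's standard Document class; the canonical helper is a hypothetical name introduced here to show one way expected and actual markup could be compared after a round trip, and it assumes well-formed input with a single root element.

```ruby
require 'rexml/document'

# Hypothetical helper: round-trip markup through REXML so that quoting and
# spacing differences disappear before comparison.
def canonical(xml)
  REXML::Document.new(xml).to_s
end

# Input written with double quotes and a space before the trailing slash.
puts canonical('<img src="a.png" />')
# prints: <img src='a.png'/>

# Two spellings of the same element compare equal after the round trip.
puts canonical('<img src="a.png" />') == canonical("<img src='a.png'/>")
# prints: true
```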
The differences turned out to be greater than I had anticipated, so doing this work in two steps, and verifying the results with unit tests at each step, should make the overall effort easier.
Update: Python version is now available: source, tests