HTML5 Sanitizer

5/23/2007, 1:01:51 AM

A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests).

My approach was slightly different. I made it a subclass of HTMLTokenizer, meaning that parsing and sanitization are done in a single pass, with the results sent to the treebuilder of your choice.

Example usage:

require 'html5lib/sanitizer'
require 'html5lib/html5parser'
include HTML5lib
HTMLParser.parse(stream, :tokenizer => HTMLSanitizer).to_s
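
To illustrate the single-pass idea without depending on html5lib itself, here is a minimal sketch. Everything in it is an assumption made for illustration: `ToyTokenizer` stands in for HTMLTokenizer, the token hashes and the two allow-lists are invented, and the real sanitizer's rules are considerably more thorough.

```ruby
# Hypothetical stand-in for HTMLTokenizer: yields prebuilt token hashes.
# The real tokenizer produces tokens while reading a character stream.
class ToyTokenizer
  include Enumerable

  def initialize(tokens)
    @tokens = tokens
  end

  def each
    @tokens.each { |token| yield token }
  end
end

# Sanitizing subclass: each token is checked as it is produced, so no
# second pass over a finished tree is needed.
class ToySanitizer < ToyTokenizer
  ALLOWED_ELEMENTS   = %w[a b i p img].freeze    # illustrative list only
  ALLOWED_ATTRIBUTES = %w[href src title].freeze # illustrative list only
  ESCAPES = { '&' => '&amp;', '<' => '&lt;', '>' => '&gt;' }.freeze

  def each
    super { |token| yield sanitize(token) }
  end

  private

  def sanitize(token)
    return token unless %i[start_tag end_tag].include?(token[:type])

    if ALLOWED_ELEMENTS.include?(token[:name])
      return token unless token[:attrs]
      kept = token[:attrs].select { |name, _| ALLOWED_ATTRIBUTES.include?(name) }
      token.merge(attrs: kept)
    else
      # "Defang" a disallowed tag by re-emitting it as escaped text.
      { type: :characters, data: raw_tag(token).gsub(/[&<>]/, ESCAPES) }
    end
  end

  # Rebuild an approximate source form of the tag for display as text.
  def raw_tag(token)
    return "</#{token[:name]}>" if token[:type] == :end_tag
    attrs = (token[:attrs] || {}).map { |name, value| " #{name}=\"#{value}\"" }.join
    "<#{token[:name]}#{attrs}>"
  end
end

tokens = [
  { type: :start_tag, name: 'a', attrs: { 'href' => '/x', 'onclick' => 'evil()' } },
  { type: :characters, data: 'boo' },
  { type: :end_tag, name: 'a' },
  { type: :start_tag, name: 'script', attrs: {} },
  { type: :end_tag, name: 'script' }
]

sanitized = ToySanitizer.new(tokens).to_a
sanitized.each { |token| p token }
```

Because the subclass only overrides `each`, whatever consumes the tokens (a treebuilder, in html5lib's case) never sees the unsanitized stream.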

Other differences worth noting:

For my unit tests, I used REXML to serialize the DOM tree. REXML serializes attribute values using single quotes instead of double quotes, and doesn’t insert a space before the trailing slash.
Jacques’s library assumes iso-8859-1. html5lib assumes utf-8. This affected the tests for non-breaking space characters (U+00A0) in XSS attacks.

Because html5lib has more built-in knowledge of HTML, a number of results are different. A few examples:

input                HTML::Tokenizer      html5lib
<a>boo</a>           <a>boo</a>           <a>boo</a>
<img>boo</img>       <img>boo</img>       <img/>boo
<image>boo</image>   <image>boo</image>   <img/>boo
<table>boo</table>   <table>boo</table>   boo

This sanitizer “defangs” scripts and the like by exposing the tags as character data. The Universal Feed Parser (and therefore Venus) simply drops the scripts. This likely will need to be an option.
In several cases, the “defanging” is different with html5lib, but in every case covered by the unit test suite it is just as effective.
HTMLParser preemptively downcases element and attribute names. XHTMLParser does not. Pick your poison.
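
The REXML serialization behavior mentioned in the first item of this list is easy to observe in isolation. The following is a minimal sketch, assuming the rexml library is available (it ships as a bundled gem with recent Ruby versions); the `serialize` helper and the sample markup are illustrative and are not taken from the sanitizer's test suite.

```ruby
require 'rexml/document'

# Hypothetical helper: parse a well-formed fragment and serialize its root
# element with REXML's default formatter.
def serialize(markup)
  REXML::Document.new(markup).root.to_s
end

puts serialize('<a href="http://example.com/">boo</a>')
puts serialize('<img src="a.png" />')
```

With REXML's defaults, the attribute values come back in single quotes and the empty element is written as `<img src='a.png'/>`, with no space before the slash, so expected values in tests have to be written in that form.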

The differences turned out to be greater than I had anticipated, so doing this work in two steps, and verifying the results with unit tests at each step, should make the overall effort easier.

Update: a Python version is now available (source, tests).