A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests).
My approach was slightly different. I made it a subclass of HTMLTokenizer, meaning that the parsing and sanitization is all done in one pass, with the results sent to the treebuilder of your choice.
Example usage:
require 'html5lib/sanitizer'
require 'html5lib/html5parser'
include HTML5lib

HTMLParser.parse(stream, :tokenizer => HTMLSanitizer).to_s
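The one-pass subclass idea can be sketched with toy stand-ins (ToyTokenizer and ToySanitizer are illustrative names, not html5lib's API): a tokenizer subclass rewrites tokens as they are emitted, so sanitization happens during tokenization rather than in a second pass.

```ruby
# Toy tokenizer: yields one token per whitespace-separated word.
class ToyTokenizer
  include Enumerable

  def initialize(text)
    @text = text
  end

  def each
    @text.split.each { |word| yield({ type: :word, data: word }) }
  end
end

# "Sanitizer" as a subclass: the rewrite happens in the same streaming
# pass, and whatever consumes the tokens sees only the cleaned stream.
class ToySanitizer < ToyTokenizer
  def each
    super do |token|
      yield token.merge(data: token[:data].delete('<>'))
    end
  end
end

ToySanitizer.new('<b> bold').map { |t| t[:data] }  # => ["b", "bold"]
```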
Other differences worth noting:
For my unit tests, I used REXML to serialize the DOM tree. REXML serializes attribute values using single quotes instead of double quotes, and doesn’t insert a space before the trailing slash of empty elements.
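Both quirks are easy to see directly (assuming the REXML that ships with Ruby's standard library):

```ruby
require 'rexml/document'

# Parse an empty element with a double-quoted attribute...
doc = REXML::Document.new('<img src="a.png" />')

# ...and REXML serializes the attribute back with single quotes, with
# no space before the trailing slash.
puts doc.to_s
```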
Jacques’s library assumes iso-8859-1. html5lib assumes utf-8. This affected the testing for non-breaking spaces in XSS attacks.
Because html5lib has more built in knowledge of HTML, a number of results are different. A few examples:
input                HTML::Tokenizer      html5lib
<a>boo</a>           <a>boo</a>           <a>boo</a>
<img>boo</img>       <img>boo</img>       <img/>boo
<image>boo</image>   <image>boo</image>   <img/>boo
<table>boo</table>   <table>boo</table>   boo
This sanitizer “defangs” scripts and the like by exposing the tags as character data. The Universal Feed Parser (and therefore Venus) simply drops the scripts. This likely will need to be an option.
In several cases, the “defanging” is different with html5lib, but in each case in the unit test suite, just as effective.
HTMLParser preemptively downcases element and attribute names. XHTMLParser does not. Pick your poison.
As the differences were more than I had anticipated, doing this work in two steps, and verifying the results via unit tests on each step, will make the overall effort easier.
Ah! Cool. I was thinking that, perhaps, I should do something like this.
Jacques’s library assumes iso-8859-1
Maybe some of the tests, which I cribbed from elsewhere, do. The code itself was designed to work in Instiki/Rails, which is a utf-8 environment.
HTMLParser preemptively downcases element and attribute names. XHTMLParser does not. Pick your poison.
If you’re allowing SVG elements and attributes, then your choice of poison is dictated for you.
Now to roll this back into Instiki ...
The code itself was designed to work in Instiki/Rails, which is a utf-8 environment.
Look how I modified the assignment to val_unescaped
If you’re allowing SVG elements and attributes, then your choice of poison is dictated for you.
On the XHTMLParser TODO list is to lowercase elements and attributes, but only in the xhtml namespace.
Now to roll this back into Instiki
Hopefully shortly, all this will be available as a gem.
Look how I modified the assignment to val_unescaped
I don’t think either of us did it right.
Non-breaking space (U+00A0) is, as you say, encoded as \302\240 in utf-8. But, then, the rest of the characters in the range U+0080 – U+009F are also encoded as \302\200 – \302\237 in utf-8.
Curse Ruby’s lack of Unicode support!
In any case, maybe I was being too lenient, allowing hT\302\240tP:// to match an allowed protocol. I should be just as happy dropping protocol names which include embedded control characters, tabs, non-breaking spaces, and the like.
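The byte values are easy to verify in plain Ruby: U+00A0 and the whole C1 range U+0080 – U+009F share the UTF-8 lead byte 0xC2 (octal \302).

```ruby
# UTF-8 encodes U+00A0 (no-break space) and U+0080–U+009F as two-byte
# sequences, all starting with 0xC2 (octal \302):
"\u00A0".bytes  # => [0xC2, 0xA0]  i.e. octal \302\240
"\u0080".bytes  # => [0xC2, 0x80]  i.e. octal \302\200
"\u009F".bytes  # => [0xC2, 0x9F]  i.e. octal \302\237
```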
So one could just as well drop that gsub() and get a slightly less lenient, but encoding-independent sanitizer.
So one could just as well drop that gsub() and get a slightly less lenient, but encoding-independent sanitizer.
Unfortunately, the result would be more lenient. " javascript:alert('foo')" would be allowed.
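The failure mode can be illustrated with a hypothetical helper (this is a sketch, not Jacques’s actual code): if the scheme check only fires when the text before the first colon looks like a clean scheme, an embedded non-breaking space hides the scheme and the URI sails through as if it were relative.

```ruby
ALLOWED_PROTOCOLS = %w[http https ftp mailto]

# Hypothetical check (illustrative only): treat the URI as having a
# scheme only when the text before the first ':' is a clean ASCII
# scheme; anything else is assumed relative and allowed through.
def unsafe_uri?(uri)
  scheme = uri[/\A([A-Za-z][A-Za-z0-9+.-]*):/, 1]
  return false unless scheme  # "no scheme" => allowed
  !ALLOWED_PROTOCOLS.include?(scheme.downcase)
end

unsafe_uri?("javascript:alert('foo')")        # => true  (blocked)
unsafe_uri?("\u00A0javascript:alert('foo')")  # => false (slips through,
                                              #    yet a browser may run it)
```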
As this particular implementation derives from HTML5lib::HTMLTokenizer, it is safe to presume utf-8. I’ve adjusted the range accordingly. Thanks!
Neat!
I started something similar that worked with the final tree, but the freedom to choose a treebuilder that comes with altering the tokens is a very big win. Any plans to backport this to python, or do I need to add it to my todo list?
Any plans to backport this to python
Done. I noticed a small porting error in Jacques’s version, which I corrected: in Python “” is false, in Ruby it is true. This difference caused an empty background-image: to appear in the platypus output... I changed it to be consistent with the Python version, which deletes this property entirely.
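The truthiness difference is easy to demonstrate in isolation (a minimal sketch, not the sanitizer code itself):

```ruby
# In Ruby, only nil and false are falsey; an empty string is truthy.
style = ''
result = style ? 'property kept' : 'property dropped'
# result is "property kept" – the opposite of Python, where bool("") is
# False, so the equivalent branch would drop the property.
```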
From this point forward, it should not be hard to keep the test cases in sync.
(Writing from a cyber cafe in Kenya...)
Sweet! I should probably see about having the other people who are now working on FeedTools use this to replace the existing sanitization code, which really needs work.
Shouldn’t
require 'html5lib/sanitizer'
require 'html5lib/liberalxmlparser'
include HTML5lib

XHTMLParser.parseFragment(stream, :tokenizer => HTMLSanitizer).to_s
treat attribute names in a case-sensitive manner?
With the code you have, this works:
puts HTML5lib::XHTMLParser.new.parse('<html><FOO/></html>').to_s
If you pull the latest, this will now also work too:
puts HTML5lib::XHTMLParser.parse('<html><FOO/></html>').to_s
(this also applies to the parseFragment methods)
Cool! Thanks.
In return, tests for case-sensitive elements and attributes.
(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)
Just a side-note to anyone wishing to use this sanitizer: the HTML5 parser is quite unfriendly to MathML named entities. So you have to convert them to NCR’s before calling the sanitizer.
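A minimal sketch of such a pre-pass (the two-entry table is illustrative; the real MathML entity table has hundreds of names):

```ruby
# Hypothetical mini entity table mapping names to code points; the
# real MathML table is far larger.
MATHML_ENTITIES = { 'InvisibleTimes' => 0x2062, 'af' => 0x2061 }

# Replace known named entities with hex NCRs; leave unknown ones alone.
def to_ncr(text)
  text.gsub(/&([A-Za-z]+);/) do
    cp = MATHML_ENTITIES[$1]
    cp ? "&#x#{cp.to_s(16).upcase};" : $&
  end
end

to_ncr('a&InvisibleTimes;b')  # => "a&#x2062;b"
```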
In return, tests for case-sensitive elements and attributes.
Committed. If you are interested, I will add you as a committer.
(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)
test case?
Just a side-note to anyone wishing to use this sanitizer: the HTML5 parser is quite unfriendly to MathML named entities. So you have to convert them to NCR’s before calling the sanitizer.
It seems to me that if HTMLTokenizer::initialize simply set @entities = ENTITIES, and both provided an appropriate accessor and made use of it, XHTMLParser could set @tokenizer.entities = HTML5lib::ENTITIES.merge(MATHML_ENTITIES) or some such.
Just a thought.
If you are interested, I will add you as a committer.
Works for me either way. (The good thing about BZR, as I learned from you, is that no such decision is necessary.)
test case? (Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)
I believe the tests should read:
--- HTML5lib/tests/test_sanitizer.rb.orig	2007-05-26 09:31:16.000000000 -0500
+++ HTML5lib/tests/test_sanitizer.rb	2007-05-26 09:43:28.000000000 -0500
@@ -21,11 +21,15 @@
     next if %w[caption col colgroup optgroup option table tbody td tfoot th thead tr].include?(tag_name) ### TODO
     define_method "test_should_allow_#{tag_name}_tag" do
       if tag_name == 'image'
-        assert_equal "<img title=\"1\"/>foo <bad>bar</bad> baz",
+        assert_equal "<img title=\"1\">foo <bad>bar</bad> baz",
           sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
+        assert_equal "<img title=\"1\"/>foo <bad>bar</bad> baz",
+          sanitize_xhtml("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
       elsif VOID_ELEMENTS.include?(tag_name)
-        assert_equal "<#{tag_name} title=\"1\"/>foo <bad>bar</bad> baz",
+        assert_equal "<#{tag_name} title=\"1\">foo <bad>bar</bad> baz",
           sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
+        assert_equal "<#{tag_name} title=\"1\"/>foo <bad>bar</bad> baz",
+          sanitize_xhtml("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
       else
         assert_equal "<#{tag_name.downcase} title=\"1\">foo <bad>bar</bad> baz</#{tag_name.downcase}>",
           sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
It seems to me that if HTMLTokenizer::initialize simply set @entities = ENTITIES, and both provided an appropriate accessor and made use of it, XHTMLParser could set @tokenizer.entities = HTML5lib::ENTITIES.merge(MATHML_ENTITIES) or some such.
That might be useful to some people. For me, it was just as easy to change sanitize_xhtml(text).to_ncr to sanitize_xhtml(text.to_ncr).
-        assert_equal "<img title=\"1\"/>foo <bad>bar</bad> baz",
+        assert_equal "<img title=\"1\">foo <bad>bar</bad> baz",
Let’s be clear what is going on here. The default treebuilder is (currently) rexml, so what is happening here is a REXML::Document is being produced, and then to_s is being called on that document.
Understood.
All I’m saying is that REXML is, perhaps, not the best serializer for producing HTML5 (as opposed to XHTML5) output. It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.
This is, I’ll agree, not an issue for the Sanitizer itself. But it’s still an issue.
It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.
That’s new. Cool. Must port.
Another thing about the Serializer: it should probably treat HTML5 void elements (outputting “<br>” in the HTML5 serialization and “<br />” – note the space – in the XHTML5 serialization) differently from empty elements in other namespaces (outputting them as, e.g., "<none></none>"). That’s the way the SVGfix MovableType plugin works.
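A sketch of the rule being proposed (empty_tag and the mode flag are hypothetical names, not the serializer's API):

```ruby
# HTML5's void elements never take a close tag.
VOID_ELEMENTS = %w[area base br col embed hr img input link meta param]

# Hypothetical serializer branch: void elements get "<br>" in HTML5 and
# "<br />" in XHTML5; empty elements outside that list always get an
# explicit close tag.
def empty_tag(name, xhtml: false)
  if VOID_ELEMENTS.include?(name)
    xhtml ? "<#{name} />" : "<#{name}>"
  else
    "<#{name}></#{name}>"
  end
end

empty_tag('br')               # => "<br>"
empty_tag('br', xhtml: true)  # => "<br />"
empty_tag('none')             # => "<none></none>"
```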
“<br />” (note the space) in the XHTML5 serialization
Just curious, but why? I’ve seen this convention before, but haven’t been able to find a reason behind it. Oddly, the W3C validator seems to accept “<br/>” (no space) as valid HTML4. It also is valid HTML5.
Just curious, but why?
That’s what Appendix C says.
Oddly the W3C validator seems to accept “<br/>” (no space) as valid HTML4.
Just because it’s “valid” (when checked by an SGML parser) doesn’t mean it’s interpreted correctly in legacy HTML UA’s.
I don’t know which (if any) current HTML UAs would stumble over “<br/>”. Perhaps this bit of Appendix C is now obsolete.
Just a note about the HTMLSerializer in html5lib: you should theoretically be able to get an XHTML5 serialization by setting the options quote_attr_values=True, minimize_boolean_attributes=False, use_trailing_solidus=True, omit_optional_tags=False.
The default options should produce a very compact HTML5 serialization, with unquoted attribute values when possible, minimized boolean attributes, omitted optional tags, and void elements without a trailing solidus. I was wondering whether a whitespace-stripper could be added to remove “unnecessary” whitespace (whitespace between <table>, <tr> and <td>; collapsing runs of whitespace into a single space character; etc.). This would need to be disabled by default to avoid conflicts with any white-space: pre styling in CSS.
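Such a filter could look something like this over a token stream (the token shape is illustrative; a real filter would also have to respect <pre> and CSS white-space, which is exactly why it should be off by default):

```ruby
# Sketch: collapse runs of whitespace inside character tokens, leaving
# all other tokens untouched. Tokens here mimic the tokenizer's hashes.
def collapse_whitespace(tokens)
  tokens.map do |tok|
    next tok unless tok[:type] == :Characters
    tok.merge(data: tok[:data].gsub(/[ \t\r\n]+/, ' '))
  end
end

tokens = [{ type: :StartTag, name: 'td' },
          { type: :Characters, data: "  hello\n  world " }]
collapse_whitespace(tokens)[1][:data]  # => " hello world "
```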
Thomas: you’ve been busy!
I’ve been waiting to port it while the code was in flux. Is the code either (a) near stable at this point, or (b) if there were a functioning Ruby equivalent, would you be willing to help maintain both versions?
Sam, there’s one thing I’d like to change before you can start the port to Ruby: changing TreeWalker to an iterable object (i.e. for token in TreeWalker(tree) instead of the current for token in TreeWalker().walk(tree) pattern).
Apart from the above change, the tree walkers API is near stable (it’s just a matter of generating the same tokens as those from the HTMLTokenizer). I’d accept any proposition regarding the tree walkers implementations though ;-)
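In Ruby terms, the proposed change amounts to making the walker itself enumerable instead of requiring an explicit walk call; a toy sketch (the “tree” here is just an array of tokens, not a real treebuilder output):

```ruby
class ToyTreeWalker
  include Enumerable

  def initialize(tree)
    @tree = tree  # toy: the "tree" is just an array of tokens
  end

  # Defining each lets the walker be iterated directly, mirroring
  # Python's `for token in TreeWalker(tree)`.
  def each(&block)
    @tree.each(&block)
  end
end

ToyTreeWalker.new(%w[a b c]).to_a  # => ["a", "b", "c"]
```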
I’ve just changed the HTMLSerializer options and added a few new features (injecting a <meta charset="..."> in the <head> – not yet plugged in –, stripping “unnecessary” whitespace – plugged in but a no-op at the moment –, choosing the “best” quote char for each attribute value – done, needs tests –), but the implementation is now quite stable (optional tags are a port from genshihtml5 with 100% code coverage).
Wrt the Ruby port, I have no knowledge at all in Ruby (and don’t really want to learn about it for now), so I’m afraid I’ll be unable to help.
Sam, there’s one thing I’d like to change before you can start the port to Ruby: changing TreeWalker to an iterable object (i.e. for token in TreeWalker(tree) instead of the current for token in TreeWalker().walk(tree) pattern).
OK... I committed that change... let me know what you think...
OK... I committed that change... let me know what you think...
Heck, you missed test_treewalkers.py
;-)
Also, I’m not sure about it, but how about making __iter__ = walk instead of

def __iter__(self):
    for token in self.walk():
        yield token

It would prevent making a new no-op generator. Would there be any drawback?
Anyway, the “<meta charset>” injector is now plugged in and activated as soon as you give an encoding to serialize and the inject_meta_charset option is true (the default value). It just needs unit tests ;-)
Heck, you missed test_treewalkers.py ;-)
Oops! I must have done the commit from the src directory as I had verified my changes. Thanks!
Also, I’m not sure about it, but how about making __iter__ = walk
Done. Not that it matters much, but I actually did it the other way around, as I gather that __iter__ is the preferred entry point.
Hmmm. I haven’t tried the Python version, but the Ruby version definitely has problems with astral-plane Unicode characters.
Test:
def test_should_handle_astral_plane_characters
  assert_equal "<p>\360\235\222\265 \360\235\224\270</p>",
    sanitize_xhtml("<p>𝒵 𝔸</p>")
end
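The octal escapes in the expected output are the four-byte UTF-8 sequences for these characters; code that assumes at most three bytes per character (the BMP maximum) will mangle them:

```ruby
# Astral-plane characters need four bytes in UTF-8:
"\u{1D4B5}".bytes  # => [0xF0, 0x9D, 0x92, 0xB5]  i.e. octal \360\235\222\265
"\u{1D538}".bytes  # => [0xF0, 0x9D, 0x94, 0xB8]  i.e. octal \360\235\224\270
```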
I’ve gone back to my previous sanitizer. :-(
Jacques, re your comment from Mon 28 May 2007, the serializer’s input is a list of “tokens” (the same that would be produced by the tokenizer, as defined in the HTML5 spec), so there’s no notion of namespace. Wrt empty elements, the serializer uses the void elements list and does not look at the element’s content at all, so a <none> element would always be output as <none></none>.
To serialize a tree, you pass the serializer a treewalker. Currently, treewalker implementations for namespace-aware trees (DOM, PullDOM and ElementTree) expect the elements to be in the null namespace. This could be changed, we need unit tests for those kind of things to catch for potential bugs (I’m sure we’d have weird results with such trees). Actually, unit tests for treewalkers need refactoring: we should make specific tests rather than using the parser tests.
/me: we should make the HTMLSanitizer a mixin or a filter so that it could be used between a treewalker and the serializer (or between a treewalker and the parser, to build a sanitized tree from an existing tree)
the Ruby version definitely has problems with astral-plane Unicode characters.
Fixed. Thanks!
we should make the HTMLSanitizer a mixin or a filter so that it could be used between a treewalker and the serializer (or between a treewalker and the parser, to build a sanitized tree from an existing tree)
Something like this should work:
Index: src/sanitizer.py
===================================================================
--- src/sanitizer.py	(revision 639)
+++ src/sanitizer.py	(working copy)
@@ -2,7 +2,7 @@
 from xml.sax.saxutils import escape, unescape
 from tokenizer import HTMLTokenizer
 
-class HTMLSanitizer(HTMLTokenizer):
+class HTMLSanitizer(object):
     """ sanitization of XHTML+MathML+SVG and of inline style attributes."""
 
     acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b',
@@ -119,6 +119,19 @@
     allowed_svg_properties = acceptable_svg_properties
     allowed_protocols = acceptable_protocols
 
+    def __init__(self, stream, encoding=None, parseMeta=True):
+        if not hasattr(stream, '__iter__') or encoding or not parseMeta:
+            self.source = HTMLTokenizer(stream, encoding, parseMeta)
+        else:
+            self.source = stream
+
+    def getstream(self):
+        try:
+            return self.source.stream
+        except AttributeError:
+            return None
+    stream = property(getstream)
+
     # Sanitize the +html+, escaping all elements not in ALLOWED_ELEMENTS, and
     # stripping out all # attributes not in ALLOWED_ATTRIBUTES. Style
     # attributes are parsed, and a restricted set, # specified by
@@ -131,7 +144,7 @@
     # sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
     # => <a>Click here for $100</a>
     def __iter__(self):
-        for token in HTMLTokenizer.__iter__(self):
+        for token in self.source:
             if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
                 if token["name"] in self.allowed_elements:
                     if token.has_key("data"):
Needs tests.