HTML5 Sanitizer
A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests).
My approach was slightly different. I made it a subclass of HTMLTokenizer, meaning that the parsing and sanitization is all done in one pass, with the results sent to the treebuilder of your choice.
Example usage:
require 'html5lib/sanitizer' require 'html5lib/html5parser' include HTML5lib HTMLParser.parse(stream, :tokenizer => HTMLSanitizer).to_s
Other differences worth noting:
For my unit tests, I used REXML to serialize the DOM tree. REXML serializes attribute values using single quotes instead of double quotes, and doesn’t insert a space before the trailing slash
Jacques’s library assumes iso-8859-1. html5lib assumes utf-8. This affected the testing for
in XSS attacks.Because html5lib has more built in knowledge of HTML, a number of results are different. A few examples:
input HTML::Tokenizer html5lib <a>boo</a> <a>boo</a> <a>boo</a> <img>boo</img> <img>boo</img> <img/>boo <image>boo</image> <image>boo</image> <img/>boo <table>boo</table> <table>boo</table> boo This sanitizer “defangs” scripts and the like by exposing the tags as character data. The Universal Feed Parser (and therefore Venus) simply drops the scripts. This likely will need to be an option.
In several cases, the “defanging” is different with html5lib, but in each case in the unit test suite, just as effective.
HTMLParser preemptively downcases element and attribute names. XHTMLParser does not. Pick your poison.
As the differences were more than I had anticipated, doing this work in two steps, and verifying the results via unit tests on each step, will make the overall effort easier.
The code itself was designed to work in Instiki/Rails, which is a utf-8 environment.
Look how I modified the assignment to val_unescaped
If you’re allowing SVG elements and attributes, then your choice of poison is dictated for you.
On the XHTMLParser TODO list is to lowercase elements and attributes, but only in the xhtml namespace.
Now to roll this back into Instiki
Hopefully shortly, all this will be available as a gem.
Posted by Sam Ruby atLook how I modified the assignment to val_unescaped
I don’t think either of us did it right.
Non-breaking space (U+00A0) is, as you say, encoded as \320\240 in utf-8. But, then, the rest of the characters in the range U+0080 – U+009F are also encoded as \320\200 – \320\237 in utf-8.
Curse Ruby’s lack of Unicode support!
In any case, maybe I was being too lenient, allowing hT\320\240tP:// to match an allowed protocol. I should be just as happy dropping protocol names which include embedded control characters, tabs, non-breaking spaces, and the like.
So one could just as well drop that gsub() and get a slightly less lenient, but encoding-independent sanitizer.
So one could just as well drop that
gsub()and get a slightly less lenient, but encoding-independent sanitizer.
Unfortunately, the result would be more lenient. " javascript:alert('foo')" would be allowed.
As this particular implementation derives from HTML5lib::HTMLTokenizer, it is safe to presume utf-8. I’ve adjusted the range accordingly. Thanks!
Posted by Sam Ruby atNeat!
I started something similar that worked with the final tree but the freedom to choose a treebuilder that comes with altering the tokens is a very big win. Any plans to backport this to python or do I need to add it to my todo list?
Posted by jgraham at[from ludoo] Sam Ruby: HTML5 Sanitizer
A while back, I commented that I would likely backport Jacques’s sanitizer to Python. I still haven’t gotten around to that, but I have ported it to html5lib (source, tests)....Excerpt from del.icio.us/network/becks77 at
Any plans to backport this to python
Done. Noticed a small porting error in Jaques’s version that I corrected. In Python “” is false, in Ruby it is true; this difference lead to a empty background-image: to appear in the platypus output... I changed it to be consistent with the Python version which deletes this property entirely.
From this point forward, it should not be hard to keep the test cases in sync.
Posted by Sam Ruby at(Writing from a cyber cafe in Kenya...)
Sweet! I should probably see about having the other people who are now working on FeedTools use this to replace the existing sanitization code, which really needs work.
Posted by Bob Aman atShouldn’t
require 'html5lib/sanitizer' require 'html5lib/liberalxmlparser' include HTML5lib XHTMLParser.parseFragment(stream, :tokenizer => HTMLSanitizer).to_s
treat attribute names in a case-sensitive manner?
Posted by Jacques Distler atWith the code you have, this works:
puts HTML5lib::XHTMLParser.new.parse('<html><FOO/></html').to_s
If you pull the latest, this will now also work too:
puts HTML5lib::XHTMLParser.parse('<html><FOO/></html').to_s
(this also applies to the parseFragment methods)
Cool! Thanks.
In return, tests for case-sensitive elements and attributes.
(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)
Just a side-note to anyone wishing to use this sanitizer: the HTML5 parser is quite unfriendly to MathML named entities. So you have to convert them to NCR’s before calling the sanitizer.
Posted by Jacques Distler atIn return, tests for case-sensitive elements and attributes.
Committed. If you are interested, I will add you as a committer.
(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)
test case?
Just a side-note to anyone wishing to use this sanitizer: the HTML5 parser is quite unfriendly to MathML named entities. So you have to convert them to NCR’s before calling the sanitizer.
It seems to me that if HTMLTokenizer::initialize simply set @entities = ENTITIES, and both provided an appropriate accessor and made use of it, XHTMLParser could set @tokenizer.entities = HTML5lib::ENTITIES.merge(MATHML_ENTITIES) or some such.
Just a thought.
Posted by Sam Ruby atIf you are interested, I will add you as a committer.
Works for me either way. (The good thing about BZR, as I learned from you, is that no such decision is necessary.)
test case?(Which reminds me that the current handling of void elements in HTML5 – as opposed to XHTML5 – mode seems incorrect to me.)
I believe the tests should read:
--- HTML5lib/tests/test_sanitizer.rb.orig 2007-05-26 09:31:16.000000000 -0500
+++ HTML5lib/tests/test_sanitizer.rb 2007-05-26 09:43:28.000000000 -0500
@@ -21,11 +21,15 @@
next if %w[caption col colgroup optgroup option table tbody td tfoot th thead tr].include?(tag_name) ### TODO
define_method "test_should_allow_#{tag_name}_tag" do
if tag_name == 'image'
- assert_equal "<img title=\"1\"/>foo <bad>bar</bad> baz",
+ assert_equal "<img title=\"1\">foo <bad>bar</bad> baz",
sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
+ assert_equal "<img title=\"1\"/>foo <bad>bar</bad> baz",
+ sanitize_xhtml("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
elsif VOID_ELEMENTS.include?(tag_name)
- assert_equal "<#{tag_name} title=\"1\"/>foo <bad>bar</bad> baz",
+ assert_equal "<#{tag_name} title=\"1\">foo <bad>bar</bad> baz",
sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
+ assert_equal "<#{tag_name} title=\"1\"/>foo <bad>bar</bad> baz",
+ sanitize_xhtml("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
else
assert_equal "<#{tag_name.downcase} title=\"1\">foo <bad>bar</bad> baz</#{tag_name.downcase}>",
sanitize_html("<#{tag_name} title='1'>foo <bad>bar</bad> baz</#{tag_name}>")
It seems to me that if
HTMLTokenizer::initializesimply set@entities = ENTITIES, and both provided an appropriate accessor and made use of it,XHTMLParsercould set@tokenizer.entities = HTML5lib::ENTITIES.merge(MATHML_ENTITIES)or some such.
That might be useful to some people. For me, it was just as easy to change
sanitize_xhtml(text).to_ncr
to
sanitize_xhtml(text.to_ncr)
.
Posted by Jacques Distler at- assert_equal "<img title=\"1\"/>foo <bad>bar</bad> baz", + assert_equal "<img title=\"1\">foo <bad>bar</bad> baz",
Let’s be clear what is going on here. The default treebuilder is (currently) rexml, so what is happening here is a REXML::Document is being produced, and then to_s is being called on that document.
Understood.
All I’m saying is that REXML is, perhaps, not the best serializer for producing HTML5 (as opposed to XHTML5) output. It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.
This is, I’ll agree, not an issue for the Sanitizer itself. But it’s still an issue.
Posted by Jacques Distler atIt seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.
That’s new. Cool. Must port.
Posted by Sam Ruby atThat’s new. Cool. Must port.It seems that the Python version of HTML5lib has an HTMLSerializer class, but not – currently – the Ruby version.
Another thing about the Serializer. It should probably treat HTML5 VOID elements (outputting “<br>” in the HTML5 and “<br />” (note the space) in the XHTML5 serialization) differently from empty elements in other namespaces (outputting them as, e.g. "<none></none>"). That’s the way the SVGfix MovableType plugin works.
“
<br />” (note the space) in the XHTML5 serialization
Just curious, but why? I’ve seen this convention before, but haven’t been able to find a reason behind it. Oddly the W3C validator seems to accept “<br/>” (no space) as valid HTML4. It also is valid HTML5.
Just curious, but why?
That’s what Appendix C says.
Oddly the W3C validator seems to accept “
<br/>” (no space) as valid HTML4.
Just because it’s “valid” (when checked by an SGML parser) doesn’t mean it’s interpreted correctly in legacy HTML UA’s.
I don’t know which (if any) current HTML UAs would stumble over “<br/>”. Perhaps this bit of Appendix C is now obsolete.
Just a note about the HTMLSerializer in html5lib: you should be theoretically able to have an XHTML5 serialization by setting the options to quote_attr_values=True, minimize_boolean_attributes=False, use_trailing_solidus=True, omit_optional_tags=False.
The default options should produce a very compact HTML5 serialization with unquoted attribute values when possible, minimized boolean attributes, omitted optional tags and void elements without a trailing solidus. I was wondering whether a whitespace-stripper could be added to remove “unnecessary” whitespace (whitespace between <table>, <tr> and <td>, collapsing multiple whitespace into a single space character, etc.) This would need to be disabled by default to avoid conflicts with eventual white-space: preserve in CSS.
Thomas: you’ve been busy!
I’ve been waiting to port it while the code was in flux. Is the code either (a) near stable at this point, or (b) if there were a functioning Ruby equivalent, would you be willing to help maintain both versions?
Posted by Sam Ruby atAs an irrelevant aside, every time I see the title “HTML5 Sanitizer” (which is often, because I’m subscribed to the comments feed), my brain wants to read it “Hand Sanitizer” for some reason.
Posted by Aristotle Pagaltzis at
Sam, there’s one thing I’d like to change before you can start the port to Ruby: changing TreeWalker to an iterable object (i.e. for token in TreeWalker(tree) instead of the current for token in TreeWalker().walk(tree) pattern).
Apart from the above change, the tree walkers API is near stable (it’s just a matter of generating the same tokens as those from the HTMLTokenizer). I’d accept any proposition regarding the tree walkers implementations though ;-)
I’ve just changed the HTMLSerializer options and added a few new features (injecting a <meta charset="..."> in the <head> –not yet plugged-in–, stripping “unnecessary” whitespace –plugged-in but a no-op at the moment–, choosing the “best” quote char for each attribute value –done, needs tests–) but the implementation is now quite stable (optional tags are a port from genshihtml5 with 100% code coverage).
Wrt the Ruby port, I have no knowledge at all in Ruby (and don’t really want to learn about it for now), so I’m afraid I’ll be unable to help.
Posted by Thomas Broyer atSam, there’s one thing I’d like to change before you can start the port to Ruby: changing TreeWalker to an iterable object (i.e. for token in TreeWalker(tree) instead of the current for token in TreeWalker().walk(tree) pattern).
OK... I committed that change... let me know what you think...
Posted by Sam Ruby atOK... I committed that change... let me know what you think...
Heck, you missed test_treewalkers.py ;-)
Also, I’m not sure about it but how about making __iter__ = walk instead of
def __iter__(self):
for token in self.walk():
yield token
It would prevent making a new no-op generator. Would there be any drawback?
Anyway, the “<meta charset>” injector is now plugged-in and activated as soon as you given an encoding to serialize and the inject_meta_charset option is true (default value). It just needs unit tests ;-)
Heck, you missed test_treewalkers.py ;-)
Oops! I must have done the commit from the src directory as I had verified my changes. Thanks!
Also, I’m not sure about it but how about making
__iter__ = walk
Done. Not that it matters much, but I actually did it the other way around as I gather that __iter__ is the preferred entry point.
Hmmm. I haven’t tried the Python version, but the Ruby version definitely has problems with astral-plane Unicode characters.
Test:
def test_should_handle_astral_plane_characters
assert_equal "<p>\360\235\222\265 \360\235\224\270</p>",
sanitize_xhtml("<p>𝒵 𝔸</p>")
end
I’ve gone back to my previous sanitizer. :-(
Posted by Jacques Distler atJacques, re your comment from Mon 28 May 2007, the serializer’s input is a list of “tokens” (the same that would be produced by the tokenizer, as defined in the HTML5 spec) so there’s no notion of namespace. Wrt empty elements, the serializer uses the void elements list and does not look at the element’s content at all, so a <none> element would always be output as <none></none>.
To serialize a tree, you pass the serializer a treewalker. Currently, treewalker implementations for namespace-aware trees (DOM, PullDOM and ElementTree) expect the elements to be in the null namespace. This could be changed, we need unit tests for those kind of things to catch for potential bugs (I’m sure we’d have weird results with such trees). Actually, unit tests for treewalkers need refactoring: we should make specific tests rather than using the parser tests.
/me: we should make the HTMLSanitizer a mixin or a filter so that it could be used between a treewalker and the serializer (or between a treewalker and the parser, to build a sanitized tree from an existing tree)
Posted by Thomas Broyer atthe Ruby version definitely has problems with astral-plane Unicode characters.
Fixed. Thanks!
we should make the HTMLSanitizer a mixin or a filter so that it could be used between a treewalker and the serializer (or between a treewalker and the parser, to build a sanitized tree from an existing tree)
Something like this should work:
Index: src/sanitizer.py
===================================================================
--- src/sanitizer.py (revision 639)
+++ src/sanitizer.py (working copy)
@@ -2,7 +2,7 @@
from xml.sax.saxutils import escape, unescape
from tokenizer import HTMLTokenizer
-class HTMLSanitizer(HTMLTokenizer):
+class HTMLSanitizer(object):
""" sanitization of XHTML+MathML+SVG and of inline style attributes."""
acceptable_elements = ['a', 'abbr', 'acronym', 'address', 'area', 'b',
@@ -119,6 +119,19 @@
allowed_svg_properties = acceptable_svg_properties
allowed_protocols = acceptable_protocols
+ def __init__(self, stream, encoding=None, parseMeta=True):
+ if not hasattr('__iter__', stream) or encoding or not parseMeta:
+ self.source = HTMLTokenizer(stream, encoding, parseMeta)
+ else:
+ self.source = stream
+
+ def getstream(self):
+ try:
+ return self.source.stream
+ else:
+ return None
+ stream = property(getstream)
+
# Sanitize the +html+, escaping all elements not in ALLOWED_ELEMENTS, and
# stripping out all # attributes not in ALLOWED_ATTRIBUTES. Style
# attributes are parsed, and a restricted set, # specified by
@@ -131,7 +144,7 @@
# sanitize_html('<a href="javascript: sucker();">Click here for $100</a>')
# => <a>Click here for $100</a>
def __iter__(self):
- for token in HTMLTokenizer.__iter__(self):
+ for token in self.source:
if token["type"] in ["StartTag", "EndTag", "EmptyTag"]:
if token["name"] in self.allowed_elements:
if token.has_key("data"):
Needs tests.
Posted by Sam Ruby atHello World
My blog is full of bombs and or keys . It has been a very entertaining and busy two weeks. In screencasts DHH shows us how easy it is to create a web log using the rails framework. In fact, it only takes 15 minutes. If only . Building my blog took...Excerpt from Sam's Spot at
Hello World
My blog is full of bombs and or keys . It has been a very entertaining and busy two weeks. In screencasts DHH shows us how easy it is to create a web log using the rails framework. In fact, it only takes 15 minutes. If only . Building my blog took...Excerpt from Sam's Spot at
Ah! Cool. I was think that, perhaps, I should do something like this.
Maybe some of the tests, which I cribbed from elsewhere, do. The code itself was designed to work in Instiki/Rails, which is a utf-8 environment.
If you’re allowing SVG elements and attributes, then your choice of poison is dictated for you.
Now to roll this back into Instiki ...
Posted by Jacques Distler at