My comment system is based on a number of regular expressions
which seem to work tolerably well in most instances when coupled
with a preview function. Unfortunately, the results are not
quite as good when used in an API context. So, today, I
finally did something about it.
The way it works is as follows:
If your content is marked as well formed XML (either xhtml:body
or atom:content[@mode='xml']), then a simple scan is done for
objectionable tags. If there are any such tags, the
entire request is rejected. Otherwise, the request is
posted as is.
If all I find is escaped content, I'll continue to support that
as I always have - with the regular expressions that mostly
work. Kinda.
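
To make the first case concrete, here is a minimal sketch of such a scan. It is not the code from entryparser.py: the tag list, the function names, and the choice to treat malformed input as a rejection are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical list: the post does not say which tags count as objectionable.
OBJECTIONABLE = {"script", "object", "embed", "iframe", "applet", "style"}

def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on qualified names."""
    return tag.rsplit("}", 1)[-1].lower()

def accept_xml_content(xml_text):
    """Return True if the content parses as XML and has no objectionable tags."""
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        # Assumption: content that claims to be XML but is not well formed
        # is rejected rather than passed to the regular expressions.
        return False
    # root.iter() visits the root element and every descendant.
    for element in root.iter():
        if local_name(element.tag) in OBJECTIONABLE:
            return False  # one bad tag rejects the whole request
    return True
```

A caller would post the content unchanged when accept_xml_content returns True and refuse the whole request otherwise, which matches the all-or-nothing behaviour described above.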
postcomment.py is a small Python script which posts a comment.
What's wrong with simply putting together a DTD of XHTML modules that are acceptable, and running comments through a validator? It seems like ad-hoc checks here and there are just reinventing the wheel.
Jim, I have no plans to require people to enter well formed XHTML in the online HTML forms-based interface, if that is what you are suggesting.
As to the API: I'm not aware of a validating parser that comes with Python. In any case, the code that we are talking about here is the validate function in entryparser.py. Quite small, actually.
No, I'm not saying that you should require well-formed XHTML. I'm merely suggesting doing it in place of "a simple scan... for objectionable tags". In other words, only requiring well-formed XHTML when it will be treated as XHTML.
I don't know of a purely Pythonic validator either; I was thinking of simply using an external validator like xmllint.
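
A rough sketch of what calling xmllint from Python might look like, assuming xmllint is installed and that the comment and a DTD of acceptable XHTML modules have been written to files. The helper names and file paths are hypothetical; only the --noout and --dtdvalid options are real xmllint flags.

```python
import shutil
import subprocess

def xmllint_command(xml_path, dtd_path):
    """Build the xmllint invocation: validate against a DTD, print no document."""
    return ["xmllint", "--noout", "--dtdvalid", dtd_path, xml_path]

def is_valid(xml_path, dtd_path):
    """Return True if xmllint reports the file as valid against the DTD."""
    if shutil.which("xmllint") is None:
        raise RuntimeError("xmllint is not installed")
    result = subprocess.run(
        xmllint_command(xml_path, dtd_path),
        capture_output=True,
        text=True,
    )
    # xmllint exits with a non-zero status on parse or validation errors.
    return result.returncode == 0
```

The cost of this approach is a process spawn per comment and a dependency outside the Python standard library.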
I was originally considering something along the lines Jim is talking about. I'm working on a restricted subset of XHTML Basic for use in my commenting system. A brief newsgroup discussion of it can be found in Google's archive.
Maybe using some kind of SAX implementation to process the XML input would be a more lightweight way to gather the input, followed by a validation pass at a later point to ensure you caught everything.
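
A small sketch of the SAX idea using Python's standard xml.sax module. The handler only gathers element names as they stream past; the class and function names are made up for illustration, and the later validation step is left out.

```python
import xml.sax

class TagCollector(xml.sax.ContentHandler):
    """Record every element name in document order."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def startElement(self, name, attrs):
        # Called once per opening tag; no tree is built in memory.
        self.tags.append(name)

def collect_tags(xml_text):
    """Parse the text with SAX and return the list of element names seen."""
    handler = TagCollector()
    # Raises xml.sax.SAXParseException if the input is not well formed.
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.tags
```

The collected names could then be checked against a list of allowed elements, with full validation run separately.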