FeedValidator.rb?
This started out as a Random Thought (RT).
background
The Feed Validator is organized as a recursive descent parser for various feed formats. It is implemented in an object oriented fashion, where each element ‘knows’ what the possible children are for that element.
This was all well and good when the vocabulary was relatively small and stable. But now we are getting some rather large new extensions being defined. Some even change the validation rules for existing elements.
The problem is that the current design requires each element to know all potential child elements that can occur, even those from the most obscure and rarely used namespaces.
What would be better is a more modular approach. One where the loading of additional definitions were triggered by the xmlns attribute itself.
Modifying existing classes is impossible in statically compiled languages, like Java. Modifying existing classes is possible in dynamic languages like Python, but difficult enough to be rarely used. Modifying existing classes is trivial and commonplace in Ruby.
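As a minimal sketch of what modifying an existing class looks like in Ruby: a class can simply be reopened later, possibly from another file, and added to. The names `Entry` and `children`, and the element names, are hypothetical and not taken from the validator.

```ruby
# Hypothetical core definition, as it might appear in a base module.
class Entry
  # Class-level list of the child elements this element knows about.
  def self.children
    @children ||= []
  end

  children.push('title', 'updated')
end

# Later, perhaps in a file that is only loaded on demand, the same class
# is reopened. Nothing in the original definition has to anticipate this.
class Entry
  children.push('itunes:author')
end

p Entry.children  # => ["title", "updated", "itunes:author"]
```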
listener
The design starts with a SAX2 listener. For prototyping purposes, I started with REXML, but the more I use it, the more I am convinced that it is not a suitable base for building a validator. My current nemesis: SAX character events receive the text data in a partially digested form. But that’s why I chose SAX2, as that permits me to plug in another parser with relative ease.
The Listener's job is pretty easy:
- initialize name, stack, and parser
- define a default log action of writing to STDERR
- start_prefix_mapping looks up the xmlns, and does a require on that name in the module directory. Subsequent calls to require have no effect, which is exactly what we want.
- for all other methods, method_missing simply forwards the message to the rules on the top of the stack
- start_element calls method_missing and then pushes all child element rules on the stack, and directly executes all attribute rules.
- end_element also calls method_missing and then pops the stack.
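The listener described above might be sketched roughly as follows. This is not the actual code: the class names, the `child_rules` method, the way a namespace URI is turned into a file name, and the temporary module directory are all assumptions made for illustration. Attribute rules are left out, the stack holds one rules object per element rather than a set of rules, and the events are fired by hand instead of by a SAX2 parser.

```ruby
require 'tmpdir'

# Sketch of a listener; every name here is illustrative.
class Listener
  attr_reader :stack

  def initialize(root_rules, module_dir)
    @stack = [root_rules]      # rules for the element currently open
    @module_dir = module_dir   # where per-namespace definitions live
  end

  # Default log action: write to STDERR.
  def log(message)
    $stderr.puts(message)
  end

  # Turn the namespace URI into a file name and require it if present.
  # A second require of the same file is a no-op.
  def start_prefix_mapping(_prefix, uri)
    name = uri.sub(%r{\A\w+://}, '').gsub(/\W+/, '_')
    path = File.join(@module_dir, name)
    require path if File.exist?("#{path}.rb")
  end

  # Forward to the current rules, then push the rules for the new child.
  def start_element(uri, localname, qname, attributes)
    method_missing(:start_element, uri, localname, qname, attributes)
    @stack.push(@stack.last.child_rules(uri, localname))
  end

  # Forward to the element's own rules, then pop them.
  def end_element(uri, localname, qname)
    method_missing(:end_element, uri, localname, qname)
    @stack.pop
  end

  # Every other event goes to whatever is on top of the stack.
  def method_missing(name, *args)
    top = @stack.last
    return super unless top.respond_to?(name)
    top.public_send(name, *args)
  end

  def respond_to_missing?(name, include_private = false)
    @stack.last.respond_to?(name) || super
  end
end

# Stand-in for real rules: it only records the events it receives.
class Recorder
  def initialize(events)
    @events = events
  end

  def child_rules(_uri, _localname)
    Recorder.new(@events)
  end

  def start_element(_uri, localname, _qname, _attributes)
    @events << [:start, localname]
  end

  def end_element(_uri, localname, _qname)
    @events << [:end, localname]
  end

  def characters(text)
    @events << [:text, text]
  end
end

EVENTS = []
Dir.mktmpdir do |dir|
  # A throwaway "module" that counts how often it is actually loaded.
  File.write(File.join(dir, 'example_com_ns_demo.rb'),
             "$demo_loads = $demo_loads.to_i + 1\n")
  listener = Listener.new(Recorder.new(EVENTS), dir)
  2.times { listener.start_prefix_mapping('demo', 'http://example.com/ns/demo') }
  listener.start_element(nil, 'title', 'title', {})
  listener.characters('hello')   # not defined on Listener, so forwarded
  listener.end_element(nil, 'title', 'title')
end
p $demo_loads
p EVENTS
```

The two calls to `start_prefix_mapping` load the definition file only once, and `characters` reaches the rules for `title` purely through `method_missing`.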
element
The Element's job is also straightforward:
- initialize various stuff
- log adds attribute/element/parent name information to the log message and delegates upwards to the parent element
- Three “rule” methods do some minor housekeeping
- Include the SAX2Listener mixin to define default (null) behavior for all SAX2 events
But the real work is in the Element metaclass, which defines methods for defining rules for attributes and elements, and methods for retrieving these rules.
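To make the metaclass idea concrete, here is a minimal sketch of class-level methods that define and retrieve rules. The method names (`element`, `attribute`, `element_rules`, `attribute_rules`, `rule_for_element`) and the `Feed` and `Link` classes are hypothetical; the real methods may well look different.

```ruby
class Element
  class << self
    # Rules live on the class itself, one table per class.
    def element_rules
      @element_rules ||= {}
    end

    def attribute_rules
      @attribute_rules ||= {}
    end

    # Declare a child element and the class that validates it.
    def element(name, klass = Element)
      element_rules[name] = klass
    end

    # Declare an attribute and a block that checks its value.
    def attribute(name, &check)
      attribute_rules[name] = check
    end

    # Retrieve the class registered for a child element, or nil.
    def rule_for_element(name)
      element_rules[name]
    end
  end
end

class Link < Element
  attribute('href') { |value| !value.strip.empty? }
end

class Feed < Element
  element 'link', Link
end

# An extension module could reopen Feed later and add to it.
class Feed
  element 'itunes:owner'
end

p Feed.rule_for_element('link')
p Feed.rule_for_element('itunes:owner')
p Link.attribute_rules['href'].call('')
```

Because the tables hang off each class, a definition loaded later only has to reopen the class and call the same declaration methods.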
Several specialized subclasses are defined:
- TextElement captures the character value for a given element, useful for elements like title
- DataElement extends TextElement, but also throws an error if there is extra whitespace, useful for elements like updated.
- Cardinality is stubbed out right now; ultimately it will be used to implement REQUIRED and MANY, the latter of which will allow multiples of elements like category
- DiscriminatedUnion is a fancy name for elements whose definition depends on the value of an attribute. Useful for elements like summary, and amazingly easy to implement.
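A discriminated union can be sketched in a few lines, which may be why it felt so easy. In this sketch the `discriminate` and `resolve` methods and the `PlainText`, `HtmlText`, and `XhtmlDiv` classes are invented for illustration; only the idea of choosing a definition from an attribute value comes from the description above. The three type values are the ones Atom defines for text constructs.

```ruby
# Captures the character data seen inside an element.
class TextElement
  attr_reader :value

  def initialize
    @value = +''
  end

  def characters(text)
    @value << text
  end
end

# Hypothetical stand-ins for the per-type definitions.
class PlainText < TextElement; end
class HtmlText < TextElement; end
class XhtmlDiv < TextElement; end

class DiscriminatedUnion
  class << self
    # Declare which attribute decides, and what each value maps to.
    def discriminate(attribute, choices, default:)
      @attribute = attribute
      @choices = choices
      @default = default
    end

    # Pick the class for this element from its attributes (nil if unknown).
    def resolve(attributes)
      @choices[attributes.fetch(@attribute, @default)]
    end
  end
end

class Summary < DiscriminatedUnion
  discriminate('type',
               { 'text' => PlainText, 'html' => HtmlText, 'xhtml' => XhtmlDiv },
               default: 'text')
end

handler = Summary.resolve({ 'type' => 'html' }).new
handler.characters('<b>bold</b>')
p handler.class
p handler.value
```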
modules and rules
Modules effectively make use of a domain specific grammar for defining elements, attributes, and their associated validation rules. This is largely declarative, with the ability to seamlessly drop down into code in the instances where it is necessary.
Rules typically involve a regular expression or a table lookup.
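The two kinds of rule could look something like this. It is a sketch under assumptions: `PatternRule`, `LookupRule`, and the convention of returning an error message or nil are made up for the example, and both the date pattern and the table of link relations are simplified.

```ruby
# Each rule responds to call(value) and returns an error message,
# or nil when the value is acceptable.
class PatternRule
  def initialize(pattern, message)
    @pattern = pattern
    @message = message
  end

  def call(value)
    @message unless @pattern.match?(value)
  end
end

class LookupRule
  def initialize(allowed, message)
    @allowed = allowed
    @message = message
  end

  def call(value)
    @message unless @allowed.include?(value)
  end
end

# A regular expression rule: a simplified RFC 3339 date-time check.
DATE_TIME = PatternRule.new(
  /\A\d{4}-\d\d-\d\dT\d\d:\d\d:\d\d(\.\d+)?(Z|[+-]\d\d:\d\d)\z/,
  'invalid date-time'
)

# A table lookup rule: the link relations named in the Atom format.
LINK_REL = LookupRule.new(%w[alternate enclosure related self via],
                          'unknown rel value')

p DATE_TIME.call('2005-11-15T12:00:00Z')
p DATE_TIME.call(' 2005-11-15T12:00:00Z')
p LINK_REL.call('self')
p LINK_REL.call('bogus')
```

Anything that does not fit either shape can still drop down into plain Ruby, since a rule only needs to respond to `call`.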
Initially, the split between elements and rules seemed to make sense, but as implementation has proceeded, the distinction has become less and less self-evident. Ultimately, it may need to be refactored away.
test
test overrides the logging and comment mechanisms of the listener to check if the test was successful. It also initializes an xml:base value.
Ultimately, this would be converted to use Test::Unit. For the moment, I want to stop on first error.
overall
Overall, I’m impressed by how clean and simple a Ruby implementation could be. If I do proceed further with this (at the moment, there probably is only about 20% test coverage), I will definitely need to look into converting to libxml2.
At the moment, there is essentially no UI, but this could easily be provided by Rails. Rails would also make it trivial to add an HTTP Test Suite.
What if you could abstract most of the validation rules in a programming language agnostic document?
Posted by Randy Charles Morin at
Randy: isn’t schematron: "a language for making assertions about patterns found in XML documents"?
How does this address the point, namely "What would be better is a more modular approach. One where the loading of additional definitions were triggered by the xmlns attribute itself"?
Posted by Sam Ruby at
Sam,
I think it would be a stretch to call Schematron a ‘programming language’. Do you also consider XSD or XSLT ‘programming languages’ given that their primary tasks could be performed using traditional programming languages?
How does this address the point, namely "What would be better is a more modular approach. One where the loading of additional definitions were triggered by the xmlns attribute itself"?
This is basically what XML validation languages are designed to do. And Schematron is one of them.
Posted by Dare Obasanjo at
Do you also consider XSD or XSLT ‘programming languages’ given that their primary tasks could be performed using traditional programming languages?
I wrote the first implementation of Gump entirely in XSLT. Gump is a program. Ipso facto, yes, I do consider XSLT a programming language. Perhaps my usage of these terms is too broad, or perhaps yours is too narrow.
Oddly enough, I found it difficult to build a development community around the XSLT implementation of Gump. So I seeded a Python implementation, stepped back, and let a community form around it.
In any case, back to the point. Let’s try a test case. Inside itunes.py there are the lines:
if self.dispatcher.encoding.lower() not in ['utf-8','utf8']:
    from logging import NotUTF8
    self.log(NotUTF8({"parent":self.parent.name, "element":self.name}))
This is in support of the requirement in Apple’s Podcasting and iTunes: Technical Specification.
Without questioning the sanity of the requirement, how would such a test be coded in Schematron?
Posted by Sam Ruby at
Sam,
I wasn’t trying to address your question specifically, but I do think specifying the rules in a declarative language like Schematron would help with modularity.
Also, I believe XSLT is Turing complete and thus, I agree, it’s a programming language. Schematron on the other hand, is not.
Posted by Randy Charles Morin at
As for testing the encoding, I implement the XML validation (minus well-formedness) rules in Schematron and leave the HTTP and well-formedness to a traditional language like Python, Ruby or C#.
Posted by Randy Charles Morin at
Validating feeds in functional tests
In the past I usually tested the feeds a Rails application generated by writing a functional test that checked the HTTP status code and matched certain strings in the feed using a regular expression. If that checked out I hand-tested the feed using...
Excerpt from Fingertips at
Interesting. Does the Ruby version actually handle multibyte Unicode characters? Last I checked, Ruby had its head in the sand saying “a character is just a byte” (see [link]) which is why I’ve pretty much avoided Ruby so far.
Posted by David at