It’s just data

FeedValidator.rb?

This started out as a Random Thought (RT).

background

The Feed Validator is organized as a recursive descent parser for various feed formats.  It is implemented in an object-oriented fashion, where each element ‘knows’ what the possible children are for that element.

This was all well and good when the vocabulary was relatively small and stable.  But now some rather large new extensions are being defined.  Some even change the validation rules for existing elements.

The problem is that the current design requires each element to know all potential child elements that can occur, even those from the most obscure and rarely used namespaces.

What would be better is a more modular approach.  One where the loading of additional definitions were triggered by the xmlns attribute itself.
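A minimal sketch of what namespace-triggered loading could look like.  The `ExtensionRegistry` class and its methods are hypothetical names made up for illustration, not part of the validator; in a real implementation the registered block might simply `require` a file of definitions.

```ruby
# Hypothetical registry: maps a namespace URI to a block that installs
# the extra definitions for that namespace.
class ExtensionRegistry
  def initialize
    @loaders = {}
    @loaded = {}
  end

  # Register a block to run the first time +uri+ is declared.
  def register(uri, &loader)
    @loaders[uri] = loader
  end

  # Called for each xmlns declaration seen in a feed.
  # Returns true only when definitions were actually loaded.
  def declare(uri)
    return false if @loaded[uri] || !@loaders.key?(uri)

    @loaders[uri].call
    @loaded[uri] = true
  end

  # Namespace URIs whose definitions have been loaded so far.
  def loaded
    @loaded.keys
  end
end

# The core vocabulary knows nothing about iTunes.
known_children = { 'channel' => %w[title link description item] }

registry = ExtensionRegistry.new
registry.register('http://www.itunes.com/dtds/podcast-1.0.dtd') do
  known_children['channel'] += %w[itunes:author itunes:category]
end

# Seeing the xmlns declaration triggers the load; unknown namespaces are ignored.
registry.declare('http://www.itunes.com/dtds/podcast-1.0.dtd')
registry.declare('http://example.org/unknown')
puts known_children['channel'].inspect
```

Feeds that never declare the namespace never pay for its definitions, and the core element classes do not have to mention it.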

Modifying existing classes is impossible in statically compiled languages, like Java.  Modifying existing classes is possible in dynamic languages like Python, but difficult enough to be rarely used.  Modifying existing classes is trivial and commonplace in Ruby.
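To illustrate how little ceremony reopening a class takes in Ruby, here is a small sketch.  The `Channel` class and its `children` and `allows?` methods are hypothetical stand-ins, not the validator's actual classes.

```ruby
# Core definition, as it might appear in the base validator.
class Channel
  # Names of the child elements this element accepts.
  def self.children
    @children ||= %w[title link description]
  end

  def self.allows?(name)
    children.include?(name)
  end
end

# Later, in a separately loaded extension file, the same class is
# simply reopened and extended.  No subclassing or patching API is needed.
class Channel
  children.concat(%w[itunes:author itunes:explicit])
end

puts Channel.allows?('itunes:author')
puts Channel.allows?('bogus')
```

The second `class Channel` does not replace the first; it adds to it, which is what makes per-namespace extension files practical.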

listener

The design starts with a SAX2 listener.  For prototyping purposes, I started with REXML, but the more I use it, the more I am convinced that it is not a suitable base for building a validator.  My current nemesis: SAX character events receive the text data in a partially digested form.  But that’s why I chose SAX2, as that permits me to plug in another parser with relative ease.
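As a rough sketch of the starting point, a SAX2 listener driven by REXML might look like the following.  `RecordingListener` and `parse_with` are hypothetical names; the callbacks themselves (`start_prefix_mapping`, `start_element`) come from REXML's `SAX2Listener` module.  Because only the parser object is REXML-specific, swapping in another SAX2-style parser should touch little more than `parse_with`.

```ruby
require 'rexml/parsers/sax2parser'
require 'rexml/sax2listener'

# Hypothetical listener: records namespace declarations and element names.
class RecordingListener
  # Supplies no-op defaults for every SAX2 callback not overridden below.
  include REXML::SAX2Listener

  attr_reader :namespaces, :elements

  def initialize
    @namespaces = []
    @elements = []
  end

  # Called once per xmlns declaration; a natural hook for loading
  # the definitions that belong to that namespace.
  def start_prefix_mapping(prefix, uri)
    @namespaces << uri
  end

  def start_element(uri, localname, qname, attributes)
    @elements << qname
  end
end

# Feed an XML string through REXML's SAX2 parser into the listener.
def parse_with(listener, xml)
  parser = REXML::Parsers::SAX2Parser.new(xml)
  parser.listen(listener)
  parser.parse
  listener
end

FEED = <<~XML
  <rss version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">
    <channel><title>Example</title><itunes:author>Someone</itunes:author></channel>
  </rss>
XML

result = parse_with(RecordingListener.new, FEED)
puts result.namespaces.inspect
puts result.elements.inspect
```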

The Listener's job is pretty easy:

element

The Element's job is also straightforward:

But the real work is in the Element metaclass, which defines methods for defining rules for attributes and elements, and methods for retrieving these rules.
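One way such class-level methods could be written is sketched below.  This is an assumption about the general shape, not the actual code: the method names `attribute`, `element`, `attribute_rule`, and `element_rule`, and the classes `TextElement`, `Item`, and `Entry`, are invented for the example.

```ruby
class Element
  class << self
    # Declare an allowed attribute and the rule its value must satisfy
    # (anything that responds to ===, such as a Regexp).
    def attribute(name, rule)
      attribute_rules[name] = rule
    end

    # Declare an allowed child element and the class that handles it.
    def element(name, handler = Element)
      element_rules[name] = handler
    end

    # Rules declared directly on this class (not inherited ones).
    def attribute_rules
      @attribute_rules ||= {}
    end

    def element_rules
      @element_rules ||= {}
    end

    # Retrieval walks up the superclass chain, so subclasses inherit rules.
    def attribute_rule(name)
      lookup(:attribute_rules, name)
    end

    def element_rule(name)
      lookup(:element_rules, name)
    end

    private

    def lookup(table, name)
      klass = self
      while klass.respond_to?(table)
        rules = klass.public_send(table)
        return rules[name] if rules.key?(name)

        klass = klass.superclass
      end
      nil
    end
  end
end

class TextElement < Element; end

class Item < Element
  attribute 'xml:lang', /\A[a-zA-Z]{1,8}(-[a-zA-Z0-9]{1,8})*\z/
  element 'title', TextElement
  element 'description', TextElement
end

class Entry < Item
  element 'summary', TextElement
end

puts Entry.element_rule('title').inspect   # found via Item
puts Entry.element_rule('bogus').inspect   # no rule anywhere in the chain
```

Keeping the rule tables on the class rather than on instances means an extension can add rules once and every element of that type sees them.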

Several specialized subclasses are defined:

modules and rules

Modules effectively make use of a domain specific grammar for defining elements, attributes, and their associated validation rules.  This is largely declarative, with the ability to seamlessly drop down into code in the instances where it is necessary.

Rules typically involve a regular expression or a table lookup.
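A small sketch of the three flavours of rule: a regular expression, a table lookup, and a drop down into plain code.  The `Rules` module, `RSS_RULES` table, and `valid?` helper are hypothetical; the constraints themselves are taken from RSS 2.0 (`ttl` is a number of minutes, `skipDays` lists weekday names, `skipHours` lists hours from 0 to 23).

```ruby
module Rules
  # Regular-expression rule: the value must match the pattern.
  def self.pattern(regexp)
    ->(value) { regexp.match?(value) }
  end

  # Table-lookup rule: the value must be one of a fixed set.
  def self.one_of(*allowed)
    ->(value) { allowed.include?(value) }
  end
end

RSS_RULES = {
  'ttl'  => Rules.pattern(/\A\d+\z/),
  'day'  => Rules.one_of(*%w[Monday Tuesday Wednesday Thursday Friday Saturday Sunday]),
  # Dropping down into code where a purely declarative rule is awkward.
  'hour' => ->(value) { value.match?(/\A\d+\z/) && value.to_i.between?(0, 23) }
}.freeze

# True when a rule exists for +name+ and +value+ satisfies it.
def valid?(rules, name, value)
  rule = rules[name]
  rule ? rule.call(value) : false
end

puts valid?(RSS_RULES, 'ttl', '60')
puts valid?(RSS_RULES, 'day', 'monday')
puts valid?(RSS_RULES, 'hour', '24')
```

Since every rule ends up as something callable, the table stays declarative to read while still allowing arbitrary logic where needed.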

The split between elements and rules seemed to make sense initially, but as implementation has proceeded the distinction has become less and less self-evident.  Ultimately, it may need to be refactored away.

test

test overrides the logging and comment mechanisms of the listener to check if the test was successful.  It also initializes an xml:base value.

Ultimately, this would be converted to use Test::Unit.  For the moment, I want to stop on first error.
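The approach can be sketched roughly as follows.  Everything here is an assumption made for illustration: the stand-in `Listener` base class, the `TestListener` subclass, the `TestFailure` exception, and in particular the `expect:` comment convention, which is invented for this example.  Events are fed in by hand rather than by a parser to keep the sketch short.

```ruby
# Stand-in for the real listener; only the pieces the test touches.
class Listener
  attr_accessor :xml_base

  def log(event)
    warn event.inspect
  end

  def comment(text); end
end

class TestFailure < StandardError; end

# Overrides logging and comment handling so that each test document
# can state its own expected outcome.
class TestListener < Listener
  attr_reader :expected, :logged

  def initialize(xml_base)
    self.xml_base = xml_base
    @logged = []
  end

  # Assumed convention: a comment containing "expect: EventName"
  # or "expect: valid".  Other comments are ignored.
  def comment(text)
    match = text[/expect:\s*(\S+)/i, 1]
    @expected = match if match
  end

  # Collect events instead of printing them.
  def log(event)
    @logged << event
  end

  # Raising here is what stops the run on the first failing test.
  def verify!
    ok = expected == 'valid' ? logged.empty? : logged.include?(expected)
    raise TestFailure, "expected #{expected.inspect}, logged #{logged.inspect}" unless ok

    true
  end
end

test = TestListener.new('http://example.org/tests/')
test.comment(' expect: NotUTF8 ')
test.log('NotUTF8')
puts test.verify!
```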

overall

Overall, I’m impressed by how clean and simple a Ruby implementation could be.  If I do proceed further with this (at the moment, there probably is only about 20% test coverage), I will definitely need to look into converting to libxml2.

At the moment, there is essentially no UI, but this could easily be provided by Rails.  Rails would also make it trivial to add an HTTP Test Suite.


Interesting.  Does the Ruby version actually handle multibyte Unicode characters?  Last I checked, Ruby had its head in the sand saying “a character is just a byte” (see [link]) which is why I’ve pretty much avoided Ruby so far.

Posted by David at

David: HowToUseUnicodeStrings

Posted by Sam Ruby at

What if you could abstract most of the validation rules in a programming language agnostic document?

[link]

Posted by Randy Charles Morin at

Randy: isn’t Schematron "a language for making assertions about patterns found in XML documents"?

How does this address the point, namely "What would be better is a more modular approach.  One where the loading of additional definitions were triggered by the xmlns attribute itself"?

Posted by Sam Ruby at

Sam,
I think it would be a stretch to call Schematron a ‘programming language’. Do you also consider XSD or XSLT ‘programming languages’ given that their primary tasks could be performed using traditional programming languages?

How does this address the point, namely "What would be better is a more modular approach.  One where the loading of additional definitions were triggered by the xmlns attribute itself"?

This is basically what XML validation languages are designed to do. And Schematron is one of them.

Posted by Dare Obasanjo at

Do you also consider XSD or XSLT ‘programming languages’ given that their primary tasks could be performed using traditional programming languages?

I wrote the first implementation of Gump entirely in XSLT.  Gump is a program.  Ipso facto, yes, I do consider XSLT a programming language.  Perhaps my usage of these terms is too broad, or perhaps yours is too narrow.

Oddly enough, I found it difficult to build a development community around the XSLT implementation of Gump.  So I seeded a Python implementation, stepped back, and let a community form around it.

In any case, back to the point.  Let’s try a test case.  Inside itunes.py there are the lines:

if self.dispatcher.encoding.lower() not in ['utf-8','utf8']:
  from logging import NotUTF8
  self.log(NotUTF8({"parent":self.parent.name, "element":self.name}))

This is in support of the requirement in Apple’s Podcasting and iTunes: Technical Specification.

Without questioning the sanity of the requirement, how would such a test be coded in Schematron?

Posted by Sam Ruby at

Sam,
I wasn’t trying to address your question specifically, but I do think specifying the rules in a declarative language like Schematron would help with modularity.

Also, I believe XSLT is Turing complete and thus, I agree, it’s a programming language. Schematron, on the other hand, is not.

Posted by Randy Charles Morin at

As for testing the encoding, I implement the XML validation (minus well-formedness) rules in Schematron and leave the HTTP and well-formedness to a traditional language like Python, Ruby or C#.

Posted by Randy Charles Morin at

Validating feeds in functional tests

In the past I usually tested the feeds a Rails application generated by writing a functional test that checked the HTTP status code and matched certain strings in the feed using a regular expression. If that checked out I hand-tested the feed using...

Excerpt from Fingertips at
