intertwingly

It’s just data

Venus Filters


Avi Bryant: With Dabble, anyone can now import data from a feed, combine it with data from elsewhere, restructure and filter it as needed, and push it out as another feed so the process can repeat.

What Avi is describing is right on target.  But the reality is that in today’s world you need to deal with feeds that aren’t well formed.  That come in multiple feed formats.  That consist of tag soup.  And that may contain evil.

Concrete example: if you want to extract the name of the author of a comment from an RSS 2.0 feed, you need to be able to deal with the following variants:

<author>jsmith@example.com (John Smith)</author>
<author>John Smith &lt;jsmith@example.com&gt;</author>
<dc:creator>John Smith</dc:creator>
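To make the problem concrete, here is a minimal sketch of what every consumer would otherwise have to write for itself. It assumes the element text has already been unescaped by an XML parser, and the helper name `author_name` and both regular expressions are made up for illustration; this is not how the Universal Feed Parser does it.

```python
import re

# Hypothetical patterns for the first two variants shown above.
# "jsmith@example.com (John Smith)"
_NAME_IN_PARENS = re.compile(r'^\s*\S+@\S+\s*\((?P<name>[^)]*)\)\s*$')
# "John Smith <jsmith@example.com>"
_NAME_BEFORE_ANGLE = re.compile(r'^\s*(?P<name>[^<]*?)\s*<[^>]+>\s*$')


def author_name(text):
    """Return the display name from an already-unescaped author string."""
    for pattern in (_NAME_IN_PARENS, _NAME_BEFORE_ANGLE):
        match = pattern.match(text)
        if match:
            return match.group('name').strip()
    # Third variant (dc:creator): the text is just the name.
    return text.strip()


for raw in ('jsmith@example.com (John Smith)',
            'John Smith <jsmith@example.com>',
            'John Smith'):
    print(author_name(raw))
```

And that only covers three spellings of one field in one format.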

Demonstration

Input list, full content, generated excerpts.

Architecture, configuration, theme, filter, template

Explanation

My goal with Venus is to bring a Greasemonkey-like simplicity to the development of feed processing tools through the use of components that aggressively sanitize and canonicalize the input, namely the Universal Feed Parser and Beautiful Soup.

By this, I mean that a filter designed to convert image URIs to take advantage of the Coral Content Distribution Network need not worry about whether the input is single escaped or double escaped, or whether attribute values are single quoted, double quoted, or not quoted at all.  By eliminating all variability, such a filter can be very simple.
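A rough sketch of such a filter in Python follows. It is an illustration rather than the filter that ships with Venus, and it rests on two assumptions: that the input is a single well-formed XML entry, and that a URI is routed through Coral by appending `.nyud.net` to its hostname. The names `coralize`, `rewrite_images` and `main` are invented for this example.

```python
import xml.dom.minidom
from urllib.parse import urlparse, urlunparse

CORAL_SUFFIX = '.nyud.net'  # assumption: Coral's hostname-suffix convention


def coralize(url):
    """Return url with its host routed through Coral; only http is touched."""
    parts = urlparse(url)
    if parts.scheme != 'http' or not parts.hostname:
        return url
    if parts.hostname.endswith(CORAL_SUFFIX):
        return url  # already rewritten
    # Note: parts.hostname drops any explicit port and lowercases the host.
    return urlunparse(parts._replace(netloc=parts.hostname + CORAL_SUFFIX))


def rewrite_images(entry_xml):
    """Rewrite the src attribute of every img element in the entry."""
    doc = xml.dom.minidom.parseString(entry_xml)
    for img in doc.getElementsByTagName('img'):
        if img.hasAttribute('src'):
            img.setAttribute('src', coralize(img.getAttribute('src')))
    return doc.documentElement.toxml()


def main(instream, outstream):
    """Filter entry point: read one entry, write the rewritten entry."""
    outstream.write(rewrite_images(instream.read()))
```

Calling `main(sys.stdin, sys.stdout)` turns this into a pipe. Note what is absent: no escaping heuristics, no quote handling, no tag soup recovery. All of that is presumed to have happened upstream.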

Furthermore, the design is that both filters and templates read from stdin and write to stdout.  This means that any programming language may be used.  And since filters can be real programs, they need not limit themselves to filtering.  In Unix terms, they can be tees.  They can scan the input for interesting data, and POST the items of interest elsewhere.  They can index the data using something like Lucene.
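A tee-style filter might look something like the sketch below. It assumes each invocation receives one well-formed XML entry on its input stream, passes that entry through untouched, and reports the entry's title to a callback when a keyword shows up anywhere in the text. The names `tee_filter`, `text_of` and `notify` are hypothetical; in practice `notify` could POST to another service or feed an indexer.

```python
import xml.dom.minidom


def text_of(node):
    """Concatenate all character data found beneath a DOM node."""
    if node.nodeType in (node.TEXT_NODE, node.CDATA_SECTION_NODE):
        return node.data
    return ''.join(text_of(child) for child in node.childNodes)


def tee_filter(instream, outstream, keyword, notify):
    """Copy the entry through unchanged; call notify(title) on a match."""
    data = instream.read()
    doc = xml.dom.minidom.parseString(data)
    if keyword.lower() in text_of(doc.documentElement).lower():
        titles = doc.getElementsByTagName('title')
        notify(text_of(titles[0]) if titles else '')
    # The side effect never alters what flows downstream.
    outstream.write(data)
```

Wired up as `tee_filter(sys.stdin, sys.stdout, 'lucene', print_to_log)`, with `print_to_log` being whatever side effect you care about, the rest of the pipeline never knows it was there.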