It’s just data

Venus Filters

Avi Bryant: With Dabble, anyone can now import data from a feed, combine it with data from elsewhere, restructure and filter it as needed, and push it out as another feed so the process can repeat.

What Avi is describing is right on target.  But the reality is that in today’s world you need to deal with feeds that aren’t well formed.  Are in multiple feed formats.  Consist of tag soup.  And may contain evil.

Concrete example: if you want to extract the name of the author of a comment from an RSS 2.0 feed, you need to be able to deal with the following variants:

<author>jsmith@example.com (John Smith)</author>
<author>John Smith &lt;jsmith@example.com&gt;</author>
<dc:creator>John Smith</dc:creator>
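
As a rough illustration of what handling those three variants by hand involves, here is a minimal Python sketch. The function name `author_name` and both regular expressions are my own assumptions for this example, not code from Venus or the Universal Feed Parser, and the input is assumed to be the element text after the XML parser has already decoded entities such as `&lt;`.

```python
import re

# Hypothetical patterns for the two email-bearing variants shown above.
_EMAIL_THEN_NAME = re.compile(r'^\S+@\S+\s*\((?P<name>[^)]*)\)$')
_NAME_THEN_EMAIL = re.compile(r'^(?P<name>[^<]*?)\s*<\S+@\S+>$')


def author_name(text):
    """Return the display name from an author or dc:creator string."""
    text = text.strip()
    for pattern in (_EMAIL_THEN_NAME, _NAME_THEN_EMAIL):
        match = pattern.match(text)
        if match and match.group('name').strip():
            return match.group('name').strip()
    # Neither email form matched: treat the whole string as the name,
    # which covers the bare dc:creator case.
    return text
```

Each of the three variants above reduces to the same string, `John Smith`. Real feeds will contain forms this sketch does not anticipate, which is the argument for leaving the job to a parser that canonicalizes the input.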

Demonstration

Input list, full content, generated excerpts.

Architecture, configuration, theme, filter, template

Explanation

My goal with Venus is to bring a GreaseMonkey-like simplicity to the development of feed processing tools through the use of components that aggressively sanitize and canonicalize the input, namely the Universal Feed Parser and Beautiful Soup.

By this, I mean that a filter designed to convert image URIs to take advantage of the Coral Content Distribution Network need not worry about whether the input is single escaped or double escaped, or whether attribute values are single quoted, double quoted, or not quoted at all.  By eliminating all variability, such a filter can be as simple as this.
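
For readers on Python 3, a sketch of such a filter might look like the following. It mirrors the logic of the Coral filter quoted in the comments below (plain `http` URIs with no explicit port get `.nyud.net:8080` appended to the host); the function names `coralize` and `rewrite_images` are assumptions made for this example, and it presumes the input has already been canonicalized into well-formed XML.

```python
import xml.dom.minidom
from urllib.parse import urlsplit, urlunsplit

CORAL_SUFFIX = '.nyud.net:8080'


def coralize(uri):
    """Append the Coral suffix to plain http URIs that carry no port."""
    parts = urlsplit(uri)
    if parts.scheme == 'http' and parts.netloc and ':' not in parts.netloc:
        parts = parts._replace(netloc=parts.netloc + CORAL_SUFFIX)
    return urlunsplit(parts)


def rewrite_images(xml_text):
    """Rewrite the src of every img element in a well-formed XML string."""
    doc = xml.dom.minidom.parseString(xml_text)
    for node in doc.getElementsByTagName('img'):
        if node.hasAttribute('src'):
            node.setAttribute('src', coralize(node.getAttribute('src')))
    return doc.documentElement.toxml()
```

To use it as a filter, something like `sys.stdout.write(rewrite_images(sys.stdin.read()))` would connect it to stdin and stdout.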

By design, both filters and templates read from stdin and write to stdout.  This means that any programming language may be used.  And because filters can be real programs, they need not limit themselves to filtering.  In Unix terms, they can be tees.  They can scan the input for interesting data, and POST items of interest elsewhere.  They can index the data using something like Lucene.
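
To make the tee idea concrete, here is a minimal Python sketch. The names `tee_filter`, `keyword`, and `on_match` are hypothetical, chosen only for this example: the function copies its input to its output unchanged and hands any line that mentions the keyword to a callback, which in a real filter might POST the line somewhere or feed an indexer.

```python
import io
import re


def tee_filter(source, sink, keyword, on_match):
    """Copy source to sink unchanged; report lines that mention keyword."""
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    for line in source:
        sink.write(line)  # pass every line through untouched
        if pattern.search(line):
            on_match(line.rstrip('\n'))  # side channel for interesting data


# Demonstration with in-memory streams standing in for stdin and stdout.
entry = io.StringIO("<title>Venus Filters</title>\n<title>Other</title>\n")
passed_through = io.StringIO()
interesting = []
tee_filter(entry, passed_through, 'venus', interesting.append)
```

In an actual filter, `sys.stdin` and `sys.stdout` would take the place of the two `StringIO` objects. Matching line by line is a simplification; a real filter would more likely parse the entry first.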


Neat; that’s the kind of thing I was thinking about in this blog post: [link]

Posted by Ian Bicking at

DabbleDB: a better database?

Sam quotes Avi Bryant being right on target: With Dabble, anyone can now import data from a feed, combine it with data from elsewhere, restructure and filter it as needed, and push it out as another feed so the process can repeat. I sort of agree...

Excerpt from Elias Torres at

In Perl land, we have Plagger. Tatsuhiko Miyagawa, the main Plagger author, has done a few presentations on his creation. Here are his slides from YAPC::EU.

Posted by Brian Cassidy at

I had forgotten how cool DabbleDB was.  I’ve really been wanting to come up with a good reason to use DabbleDB, but honestly, I can’t think of one.  It solves some problems in really, really mind-bogglingly cool ways, but it keeps solving problems I don’t have.  Avi needs to hurry up and try to solve a problem I do have so that I can throw some subscription money at him.

On another note... You and I seem to be trying to solve a very similar problem, or at least you’re definitely trying to solve a subset of the problem I’m currently trying to tackle with GentleCMS.

This hypothetical code (or something very close to it) should work perfectly in a few days:

content_node = GentleCMS::ResourceNode.new(
  "file:///someplace/input.html",
  <<-HTML
<html>
  <body>
    <img src="http://sporkmonger.com/files/zomgcute.jpg" />
  </body>
</html>
  HTML,
  {"cms:filters" => "stdio(file:///someplace/coral_cdn_filter.py)"}
)
filter_node = GentleCMS::ResourceNode.new(
  "file:///someplace/coral_cdn_filter.py",
  <<-PYTHON
#!/usr/bin/env python

"""
Remap all images to take advantage of the Coral Content Distribution
Network <http://www.coralcdn.org/>.
"""

import sys, urlparse, xml.dom.minidom

entry = xml.dom.minidom.parse(sys.stdin).documentElement

for node in entry.getElementsByTagName('img'):
    if node.hasAttribute('src'):
        component = list(urlparse.urlparse(node.getAttribute('src')))
        if component[0]=='http' and component[1].find(':')<0:
            component[1] += '.nyud.net:8080'
            node.setAttribute('src', urlparse.urlunparse(component))

print entry.toxml('utf-8')
  PYTHON,
  {"cms:executable" => "ON"}
)
filtered_node = content_node.represent_as(:filtered)
puts filtered_node.content

Because of GentleCMS’s not-yet-finished caching system, I’m pretty sure the above will end up actually working even though the URIs are bogus, since it’ll check the cache for the filter node, and the output would be:

<html>
  <body>
    <img src="http://sporkmonger.com.nyud.net:8080/files/zomgcute.jpg"/>
  </body>
</html>

But yeah, you get the idea.  The main difference is that GentleCMS really wants to see a shebang line.  It’ll make guesses based on the extension of the file, but it really shouldn’t have to.

Btw, normally, GentleCMS isn’t nearly that verbose.  It’s usually more like this:

content_node = GentleCMS::ResourceLoader.load(
  "file:///someplace/input.html")

# Not needed in this example, since the file would actually exist
# filter_node = GentleCMS::ResourceLoader.load(
#   "file:///someplace/coral_cdn_filter.py")

filtered_node = content_node.represent_as(:filtered)
puts filtered_node.content

But that snippet wouldn’t have been even remotely enlightening in this context. :-P

Posted by Bob Aman at

-grumble-

The code above works a lot better if you replace the heredocs with %{} strings.  Stupid comma.

Posted by Bob Aman at

Slightly more advanced example:

content_node = GentleCMS::ResourceNode.new(
  "file:///someplace/input.html",
  "!http://sporkmonger.com/files/zomgcute.jpg(ZOMG Cute)!",
  {
    "cms:filters" =>
      "textile, stdio(file:///someplace/coral_cdn_filter.py)"
  }
)
puts content_node.represent_as(:filtered).content

Output:

<p><img alt="ZOMG Cute" src="http://sporkmonger.com.nyud.net:8080/files/zomgcute.jpg" title="ZOMG Cute"/></p>

Also,

content_node.properties["cms:mime-type"]
# => "text/plain"
content_node.represent_as(:filtered).properties["cms:mime-type"]
# => "application/xhtml+xml"

Any number of these things can be chained together, and there are other more-powerful mechanisms for chaining stuff as well.

Posted by Bob Aman at

FeedMonkey?

Posted by Mark at

Links - 09.01.2006

XStandard is the leading standards-compliant plug-in WYSIWYG editor for desktop applications and browser-based content management systems (IE/Mozilla/Firefox/Opera/Safari/Netscape). [via Tim’s Weblog] Rate That Commentary.com: Top 100 Ten Reasons...

Excerpt from discipline and punish at

Sam has done some amazing things with Venus, turning it into a feed processing platform. The really amazing part is at the bottom of the architecture document. Everything, absolutely everything, is turned into Atom, and not just Atom, but Atom with...

Excerpt from BitWorking | Joe Gregorio at

links for 2006-09-02

From the blogroll… Front Projector Shipments Up 28%, Music to My Ears New podcast: Career Mom Radio Venus Filters How to shoot yourself in the foot with your post From around the web… HOW-TO: Debug JavaScript in Internet Explorer IE,...

Excerpt from The Robinson House at

Project Venus

Sam is working on a feed processing platform called Venus. Just a few months ago, I was thinking a bit about how large scale feed processing might work. It’s great to see what others are thinking. Sam’s architecture includes two notable choices....

Excerpt from discipline and punish at

How did you create the SVG diagram?  By hand?  Via a native SVG authoring tool or exporting it to SVG from standard-diagram software?

Posted by J$ at

How did you create the SVG diagram?

The only tool I used was a text editor named vim.

Posted by Sam Ruby at

Venus

Sam Ruby has been giving out plenty of examples from his version of the Planet software, called Venus. Here are the posts so far: Reading Lists, Filters, MeMeme, Stream Editing. For me, what this needs is to be hooked up to a real database. So the...

Excerpt from ronin at

Venus

As the eagle-eyed among you may already have noticed, Planet Musings is now powered by Sam Ruby’s Venus. What...... [more]

Trackback from Musings

at

Sam Ruby: Venus Filters

[link]...

Excerpt from del.icio.us/akeys at
