Avi Bryant: With Dabble, anyone can now import data from a feed, combine it with data from elsewhere, restructure and filter it as needed, and push it out as another feed so the process can repeat.
What Avi is describing is right on target. But the reality is that in today’s world you need to deal with feeds that aren’t well formed. Are in multiple feed formats. Consist of tag soup. And may contain evil.
Concrete example: if you want to extract the name of the author of a comment from an RSS 2.0 feed, you need to be able to deal with the following variants:
<author>jsmith@example.com (John Smith)</author>
<author>John Smith <jsmith@example.com></author>
<dc:creator>John Smith</dc:creator>
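To make the variants above concrete, here is a minimal Python sketch of pulling the display name out of each one. The helper name `author_name` and the deliberately loose email pattern are assumptions made for illustration; this is not how the Universal Feed Parser does it, and real feeds will contain cases this does not cover.

```python
import re

# Loose, illustrative email pattern (an assumption, not a full RFC 5322 matcher).
_EMAIL = r"[^\s<>()]+@[^\s<>()]+"


def author_name(text):
    """Return the display name from an RSS author or dc:creator value."""
    text = text.strip()
    # Variant 1: "jsmith@example.com (John Smith)"
    match = re.fullmatch(rf"{_EMAIL}\s*\((.+)\)", text)
    if match:
        return match.group(1).strip()
    # Variant 2: "John Smith <jsmith@example.com>"
    match = re.fullmatch(rf"(.+?)\s*<{_EMAIL}>", text)
    if match:
        return match.group(1).strip()
    # Variant 3: a bare name, as found in dc:creator
    return text
```

Every filter that wants an author name would otherwise have to carry logic like this itself, which is the argument for canonicalizing once, up front.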
My goal with Venus is to bring a Greasemonkey-like simplicity to the development of feed processing tools through the use of components that aggressively sanitize and canonicalize the input, namely the Universal Feed Parser and Beautiful Soup.
By this, I mean that a filter designed to convert image URIs to take advantage of the Coral Content Distribution Network need not worry about whether the input is single escaped or double escaped, or whether attribute values are single quoted, double quoted, or not quoted at all. By eliminating all variability, such a filter can be as simple as this.
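For readers without the linked file to hand, a Python 3 sketch of a Coral filter along these lines might look as follows. The function names `coralize`, `filter_entry`, and `run` are hypothetical; the `.nyud.net:8080` suffix and the rule of skipping hosts that already carry a port are taken from the copy of the filter quoted in the comments. It assumes the input has already been sanitized into well-formed XML.

```python
import xml.dom.minidom
from urllib.parse import urlparse, urlunparse


def coralize(url):
    """Route a plain http URL through Coral; leave anything else untouched."""
    parts = list(urlparse(url))
    # Only rewrite http URLs whose host has no explicit port.
    if parts[0] == "http" and ":" not in parts[1]:
        parts[1] += ".nyud.net:8080"
    return urlunparse(parts)


def filter_entry(xml_text):
    """Rewrite the src of every img element in a well-formed XML entry."""
    doc = xml.dom.minidom.parseString(xml_text)
    for img in doc.getElementsByTagName("img"):
        if img.hasAttribute("src"):
            img.setAttribute("src", coralize(img.getAttribute("src")))
    return doc.documentElement.toxml()


def run(stdin, stdout):
    """Filter entry point: read one entry from stdin, write it to stdout."""
    stdout.write(filter_entry(stdin.read()))
```

To use it as a stdin/stdout filter, a script would end with `run(sys.stdin, sys.stdout)` after `import sys`. Note that there is no quoting or escaping logic anywhere in it, because the XML parser only ever sees canonical input.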
Furthermore, by design both filters and templates read from stdin and write to stdout. This means that any programming language may be used. And because filters can be real programs, they need not limit themselves to filtering. In Unix terms, they can be tees. They can scan the input for interesting data, and POST the items of interest elsewhere. They can index the data using something like Lucene.
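The tee idea can be sketched in a few lines of Python. Everything named here (`tee_filter`, the `side_channel` callback, the sample entry) is hypothetical and only illustrates the shape: the data passes through untouched, and anything interesting is also handed to some other consumer, which in practice might POST it or feed an indexer.

```python
import io
import re


def tee_filter(stdin, stdout, side_channel, pattern):
    """Copy stdin to stdout unchanged; hand matching input to side_channel."""
    data = stdin.read()
    stdout.write(data)  # the rest of the pipeline sees the input as-is
    if re.search(pattern, data):
        side_channel(data)  # hypothetical hook: POST it, index it, log it


# Usage with in-memory streams standing in for stdin and stdout.
source = io.StringIO("<entry><title>Venus Filters</title></entry>")
sink = io.StringIO()
hits = []
tee_filter(source, sink, hits.append, r"Venus")
```

In a real filter the first two arguments would be `sys.stdin` and `sys.stdout`.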
Neat; that’s the kind of thing I was thinking about in this blog post: [link]
Sam quotes Avi Bryant being right on target: With Dabble, anyone can now import data from a feed, combine it with data from elsewhere, restructure and filter it as needed, and push it out as another feed so the process can repeat. I sort of agree...
I had forgotten how cool DabbleDB was. I’ve really been wanting to come up with a good reason to use DabbleDB, but honestly, I can’t think of one. It solves some problems in really, really mind-bogglingly cool ways, but it keeps solving problems I don’t have. Avi needs to hurry up and try to solve a problem I do have so that I can throw some subscription money at him.
On another note... You and I seem to be trying to solve a very similar problem, or at least you’re definitely trying to solve a subset of the problem I’m currently trying to tackle with GentleCMS.
This hypothetical code (or something very close to it) should work perfectly in a few days:
content_node = GentleCMS::ResourceNode.new(
  "file:///someplace/input.html",
  <<-HTML,
<html>
<body>
<img src="http://sporkmonger.com/files/zomgcute.jpg" />
</body>
</html>
  HTML
  {"cms:filters" => "stdio(file:///someplace/coral_cdn_filter.py)"}
)
filter_node = GentleCMS::ResourceNode.new(
  "file:///someplace/coral_cdn_filter.py",
  <<-PYTHON,
#!/usr/bin/env python
"""
Remap all images to take advantage of the Coral Content Distribution
Network <http://www.coralcdn.org/>.
"""

import sys, urlparse, xml.dom.minidom

entry = xml.dom.minidom.parse(sys.stdin).documentElement

for node in entry.getElementsByTagName('img'):
    if node.hasAttribute('src'):
        component = list(urlparse.urlparse(node.getAttribute('src')))
        if component[0]=='http' and component[1].find(':')<0:
            component[1] += '.nyud.net:8080'
            node.setAttribute('src', urlparse.urlunparse(component))

print entry.toxml('utf-8')
  PYTHON
  {"cms:executable" => "ON"}
)
filtered_node = content_node.represent_as(:filtered)
puts filtered_node.content
Because of GentleCMS’s not-yet-finished caching system, I’m pretty sure the above will end up actually working even though the URIs are bogus, since it’ll check the cache for the filter node, and the output would be:
But yeah, you get the idea. The main difference is that GentleCMS really wants to see a shebang line. It’ll make guesses based on the extension of the file, but it really shouldn’t have to.
Btw, normally, GentleCMS isn’t nearly that verbose. It’s usually more like this:
content_node = GentleCMS::ResourceLoader.load(
  "file:///someplace/input.html")
# Not needed in this example, since the file would actually exist
# filter_node = GentleCMS::ResourceLoader.load(
#   "file:///someplace/coral_cdn_filter.py")
filtered_node = content_node.represent_as(:filtered)
puts filtered_node.content
But that snippet wouldn’t have been even remotely enlightening in this context. :-P
Sam has done some amazing things with Venus, turning it into a feed processing platform. The really amazing part is at the bottom of the architecture document. Everything, absolutely everything, is turned into Atom, and not just Atom, but Atom with...
Sam is working on a feed processing platform called Venus. Just a few months ago, I was thinking a bit about how large scale feed processing might work. It’s great to see what others are thinking. Sam’s architecture includes two notable choices....
Sam Ruby has been giving out plenty of examples from his version of the Planet software, called Venus. Here are the posts so far: Reading Lists, Filters, MeMeme, Stream Editing. For me, what this needs is to be hooked up to a real database. So the...