Planet Webtuesday is an aggregator of member blogs. One cool feature is that members can add new feeds to the list simply by editing their member page to include a wiki link to a feed, with the word “feed” in the link description.
This setup can also automatically extract an individual user’s posts from a group blog, simply by specifying the desired author’s name (example). This works independently of the character encoding of the source: not everybody in Zürich has the foresight or hospitality to limit their names to the ASCII character set. It also works independently of the feed format. RSS 2.0, for example, doesn’t define a place for people’s names, so some put their names in RFC 822 style comments, others ignore the specification and put their names in place of email addresses, and still others resort to so-called “funky” extensions. In every case, the Universal Feed Parser ferrets this information out and canonicalizes it.
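To make the idea concrete, here is a minimal sketch of that kind of canonicalization using only the standard library. It is not the Universal Feed Parser's code; the function names (canonical_author, by_author), the dict-shaped entries, and the example.com addresses are all made up for illustration, and it only covers the comment style, the angle-bracket style, and the bare-name case.

```python
import re
import unicodedata

# "user@example.com (Full Name)": the RFC 822 comment style.
_COMMENT_STYLE = re.compile(r"^\s*(?P<email>\S+@\S+)\s*\((?P<name>[^)]*)\)\s*$")
# "Full Name <user@example.com>": the angle-bracket style.
_ANGLE_STYLE = re.compile(r"^\s*(?P<name>[^<]*?)\s*<(?P<email>[^>]+)>\s*$")


def canonical_author(raw):
    """Return a (name, email) pair from a raw RSS 2.0 author string."""
    # NFC makes composed and decomposed spellings of "ü" compare equal.
    text = unicodedata.normalize("NFC", raw).strip()
    for pattern in (_COMMENT_STYLE, _ANGLE_STYLE):
        match = pattern.match(text)
        if match:
            return match.group("name").strip(), match.group("email").strip()
    if "@" in text and " " not in text:
        return "", text  # a bare email address, no name given
    return text, ""  # a bare name sitting where an email address belongs


def by_author(entries, wanted):
    """Keep the entries whose canonical author name matches `wanted`."""
    wanted = unicodedata.normalize("NFC", wanted).casefold()
    return [
        entry for entry in entries
        if canonical_author(entry["author"])[0].casefold() == wanted
    ]
```

Once every entry carries a name in one normalized form, picking one person's posts out of a group blog is a plain string comparison, whatever the source feed did with the name.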
Being able to depend on one canonical format for every entry (well-formed, UTF-8, XHTML, fully qualified rather than relative URIs, and Atom 1.0) does make many things easier for designers of filters and templates, but it does require some ability to visualize the mapping. One thing I often found myself doing to test things out is to build a temporary configuration file, create a temporary cache, run a few tests, view the outputs, and then clean up afterwards.
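The manual loop looks roughly like the sketch below. It is only an illustration of the build, use, and clean up pattern, not the script itself: scratch_planet is a hypothetical helper, the section and option names (Planet, name, cache_directory, output_dir) are assumptions that should be checked against the Venus documentation, and the step that actually runs Venus is left out.

```python
import configparser
import contextlib
import shutil
import tempfile
from pathlib import Path


@contextlib.contextmanager
def scratch_planet(feed_url):
    """Yield the path of a throwaway config file; delete everything on exit."""
    workdir = Path(tempfile.mkdtemp(prefix="venus-test-"))
    try:
        # interpolation=None keeps "%" in feed URLs from being interpreted.
        config = configparser.ConfigParser(interpolation=None)
        # Section and option names here are assumptions for illustration.
        config["Planet"] = {
            "name": "scratch",
            "cache_directory": str(workdir / "cache"),
            "output_dir": str(workdir / "output"),
        }
        config[feed_url] = {}  # one empty section per subscribed feed
        (workdir / "cache").mkdir()
        (workdir / "output").mkdir()
        config_path = workdir / "config.ini"
        with config_path.open("w", encoding="utf-8") as handle:
            config.write(handle)
        yield config_path
    finally:
        # Remove the config, the cache, and any output in one go.
        shutil.rmtree(workdir, ignore_errors=True)
```

Everything that touches the temporary files goes inside a "with scratch_planet(url) as config_path:" block, and the cleanup happens even if a test in the middle raises.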
This is the type of repetitive stuff that scripts are good at, so I wrote one. It is called tests/reconstitute.py, and an example usage is as follows:
python tests/reconstitute.py http://feeds.feedburner.com/boingboing/iBag
You can get it here.
Thanks for advocacy and for Venus. Also helps with Dokuwiki storing content in files.
Will get to exploring your suggestion on using URI fragments in the feed URL to set up xpath statements for the xpath_sifter.py soonish.
Have yet to properly research what’s allowed in a URI fragment but it might be nice / easier to use if they avoid percent encoding e.g. ‘_’ as a rule separator (instead of comma), ‘-’ as a test for equality (instead of = or ==), corresponding to a require rule and ‘!-’ corresponding to an excludes rule. And perhaps ‘~’ and ‘!~’ for an xpath statement using a regexp - not sure - could be a road to very interesting escaping issues.
So a feed URL like;
Might mean all entries from this feed where the author is “Jürg Stuker” and which are not labeled with a category “Business”
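To see how the escaping might play out, here is a rough sketch of a parser for that mini-syntax. Everything in it is hypothetical: the rule names, the parse_rule and parse_fragment helpers, the "key, operator, value" shape of a rule, and the example.com URL are my own assumptions, not anything xpath_sifter.py does today. The space and the “ü” in a name still need percent encoding in a URI; only the operators and the separator avoid it.

```python
from urllib.parse import unquote, urlsplit

# Two-character operators come first so "!-" is never read as a plain "-".
OPERATORS = {
    "!-": "exclude",
    "!~": "exclude_regexp",
    "-": "require",
    "~": "require_regexp",
}


def parse_rule(text):
    """Split one rule at its first operator into (kind, key, value)."""
    # Start at 1: a rule needs a non-empty key before the operator.
    for index in range(1, len(text)):
        for symbol, kind in OPERATORS.items():
            if text.startswith(symbol, index):
                key = unquote(text[:index])
                value = unquote(text[index + len(symbol):])
                return kind, key, value
    raise ValueError("no operator found in rule: %r" % text)


def parse_fragment(url):
    """Return the list of rules encoded in the fragment of a feed URL."""
    fragment = urlsplit(url).fragment  # still percent-encoded at this point
    # Split on "_" before decoding, so an encoded "%5F" can sit inside a value.
    return [parse_rule(part) for part in fragment.split("_") if part]


rules = parse_fragment(
    "http://example.com/feed.xml#author-J%C3%BCrg%20Stuker_category!-Business"
)
print(rules)
```

Even this small version shows where the trouble starts: a key containing “-” or “~” would be split in the wrong place, and any value with a literal underscore has to be written as %5F.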