Publishing a Blog From a mod_atom Store
Seth Gordon: Planet (http://www.planetplanet.org/) was designed to crawl all the feeds on the blogroll and produce some appropriately formatted HTML page with all their contents; you could just set it up so it only read your own blog’s mod_atom feed, make some appropriate template, and voila!
That would certainly cover the front page, but that’s about it.
Fortunately, there are bits and pieces that cover the rest. I’ve contributed heavily to Planet, the Universal Feed Parser, and html5lib, and maintain what effectively is the only active development branch of Planet at this point, which I call Venus. As Venus has been refactored, it is easier to discuss this in terms of Venus’s architecture than of Planet’s.
Venus has been split into two phases: Spider, which fetches the data, and Splice, which selects and formats entries. They communicate by means of an Atom store. Let’s look at each in turn.
First, Spider fetches your feed. If you select the right options, it will make use of httplib2, which in general I highly recommend, but in this scenario the data is already on disk, so it isn’t necessary.
The Universal Feed Parser accomplishes multiple things, most notably:
It handles multiple feed formats, date formats, and ill-formed feeds. Just one concrete example to illustrate the point: the author name can be obtained from one of eight different places in your feed; and the Universal Feed Parser even handles cases where people simply put names in place of email addresses or tack on names as comments after an email address.
It partially cleans the HTML. It uses sgmllib to clean up the tokens, then it removes unsafe constructs (like plaintext), and resolves relative URIs.
html5lib completes the HTML cleansing. Truth be told, it has a better tokenizer and a better sanitizer than the one in the feed parser, but for the moment all Venus uses it for is parsing. The output of this phase is unfailingly well-formed.
reconstitute reconstructs an element from the feed parser data.
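To make the author-name cleanup described above a little more concrete, here is a minimal sketch using only the Python standard library. It is not the Universal Feed Parser’s code; `split_author` is a hypothetical helper, and it covers just the two cases named in the text: a bare name where an email address was expected, and a name tacked on as a comment after the address.

```python
from email.utils import parseaddr


def split_author(raw):
    """Return (name, email) from a loosely formatted author string.

    Hypothetical helper for illustration only; either element may be None.
    """
    name, addr = parseaddr(raw)
    if "@" not in addr:
        # No usable address was found, so someone most likely put a bare
        # name where an email address belongs. Keep the whole string.
        return (raw.strip(), None)
    return (name or None, addr)


if __name__ == "__main__":
    for raw in ("jane@example.com (Jane Doe)", "Jane Doe"):
        print(split_author(raw))
```

Here `parseaddr` does the RFC 822 work of pulling a trailing comment out as the display name; the `"@"` check is a deliberately crude fallback for the bare-name case.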
The output of all this is placed on disk, one file per entry. At this point, it is worth considering the internal data format of Tim’s mod_atom, where all data is placed on disk, one file per entry. Hmmm... Atom Store!
The bazillion feed formats issue is a non-issue here, as are the eight ways to specify an author name and the seemingly endless creative ways in which people misuse RFC 822 formatted dates; all that remains unaddressed is the cleansing of the HTML. In terms of this diagram, that simply means that html5lib needs to shift from the left to the right, and Spider is no longer necessary.
Now, let’s look at that right hand side. Splice is brain dead simple. It reads a set of entries, concatenates them into a feed, and then sends that feed to the template engine of your choice.
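The Splice step can be sketched in a few lines of standard-library Python. This is not Venus’s implementation. It assumes a directory holding one Atom entry document per file, a hypothetical `.atom` extension, and `updated` timestamps that all use the same UTC offset so that they sort as plain strings.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

ATOM = "{http://www.w3.org/2005/Atom}"


def splice(entry_dir, title, limit=10):
    """Concatenate one-entry-per-file Atom documents into a single feed."""
    entries = []
    for path in Path(entry_dir).glob("*.atom"):  # assumed file extension
        root = ET.parse(path).getroot()
        if root.tag == ATOM + "entry":
            entries.append(root)

    # Newest first. RFC 3339 timestamps with the same offset (for example
    # a trailing "Z") sort correctly as plain strings.
    entries.sort(key=lambda e: e.findtext(ATOM + "updated", default=""),
                 reverse=True)

    feed = ET.Element(ATOM + "feed")
    ET.SubElement(feed, ATOM + "title").text = title
    feed.extend(entries[:limit])
    return feed
```

The returned element is the feed that would then be handed to whatever template engine you prefer.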
It is simple enough that I don’t believe there will be any code worth reusing. If you are producing your web site dynamically, you need a controller that parses the URI to determine which file(s) to read off of disk, parses those files (an XML parser will do just fine here), sanitizes the HTML (again, all you need is in html5lib), resolves relative URIs, and then passes the output through a template of your choice.
If you are generating your website statically, you do basically the same thing, but place the output on disk instead.
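As a rough illustration of both variants, the sketch below maps a URI to a file, parses it, resolves relative URIs, and fills in a template; the static case simply writes the same output to disk. Everything named here (`entry_path`, `render`, `handle`, `publish`, the `.atom` extension, the page template) is hypothetical, and only `type="xhtml"` content is handled. It deliberately does not sanitize the HTML; that step, for which the post recommends html5lib, would go just before serialization.

```python
import html
import xml.etree.ElementTree as ET
from pathlib import Path
from string import Template
from urllib.parse import urljoin, urlsplit

ATOM = "{http://www.w3.org/2005/Atom}"
XHTML_NS = "http://www.w3.org/1999/xhtml"
ET.register_namespace("", XHTML_NS)  # serialize XHTML without a prefix

PAGE = Template("<html><head><title>$title</title></head>"
                "<body><h1>$title</h1>$body</body></html>")


def entry_path(store, uri):
    """Controller step: pick the file named by the last URI segment."""
    slug = urlsplit(uri).path.rstrip("/").rsplit("/", 1)[-1]
    if not slug or slug.startswith("."):
        raise ValueError("URI does not name an entry: %r" % uri)
    return Path(store) / (slug + ".atom")  # assumed file extension


def render(path, base):
    """Parse one entry, resolve relative URIs, and apply the template."""
    entry = ET.parse(path).getroot()
    title = entry.findtext(ATOM + "title", default="")
    div = entry.find(ATOM + "content/{%s}div" % XHTML_NS)
    if div is None:
        raise ValueError("this sketch only handles type='xhtml' content")
    for node in div.iter():
        for attr in ("href", "src"):
            if attr in node.attrib:
                node.set(attr, urljoin(base, node.get(attr)))
    # NOTE: sanitization is intentionally omitted from this sketch.
    body = ET.tostring(div, encoding="unicode")
    return PAGE.substitute(title=html.escape(title), body=body)


def handle(store, uri, base):
    """Dynamic case: return the page for one request URI."""
    return render(entry_path(store, uri), base)


def publish(store, out_dir, base):
    """Static case: same rendering, but the output is placed on disk."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(store).glob("*.atom")):
        (out / (path.stem + ".html")).write_text(render(path, base),
                                                 encoding="utf-8")
```

Keeping `render` independent of how the file was chosen is what lets the dynamic and static paths share everything except the final destination of the output.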
Oh, and did I mention that html5lib was available in two languages: Python and Ruby?
But enough with hand-waving. Time for some real code. Check out this. Download this. Tailor two lines. And then:
eruby atompub.rhtml
Joe can port it to Python in 10 minutes. Steve to JavaScript in 20 hours or so. Prefer Java? C#? Perl? Go for it!
Wouldn’t it be better to make html5lib part of mod_atom? You really want your atom store to contain clean xhtml.
Posted by Sjoerd Visscher at
I’m starting to agree with Sjoerd. I’m really uncomfortable about accepting raw claims-to-be-HTML from the wild and sticking it in something with a URI which can be publicly fetched by anyone. So a C version of html5lib that I could jam into mod_atom (at least as an optional step) would be a good thing.
Posted by Tim Bray at
I’m really uncomfortable about accepting raw claims-to-be-HTML from the wild and sticking it in something with a URI which can be publicly fetched by anyone.
Yeah, I don’t see how that could possibly work on a large scale. Oh, wait...
Posted by Mark at
So a C version of html5lib that I could jam into mod_atom (at least as an optional step) would be a good thing.
Bring it on. :) mod_atom would be insanely capable with this addition.
Posted by Scott Johnson at
I’ve been puzzled by this. Couldn’t something similar be written in 50 lines of Python combined with some “Script PUT” directives in an .htaccess file? If the POST and PUT handlers generate static XML files, and run the feedvalidator code on incoming requests, you’d catch most of the bogus client implementations out there.
Posted by Robert Sayre at