Reading Lists for Planet

Mary Gardiner: Anyone know of any showstopper bugs preventing a stable release? Anyone else in favour of poking Jeff until he does some magic?

Until now, planet.intertwingly has focused on ensuring that Atom 1.0 support is complete and correct.  It is time to widen the net.

The best and easiest way to do this is to try a diverse bunch of popular feeds.  And as luck would have it Dave Winer just posted a top100.opml file.

The native format for Planet configuration is a config.ini file, so adding support for OPML requires a converter.  The conversion I wrote is rather dumb and unforgiving.  If the document is not well formed, it will not be accepted.  I realize that some people say xmlUrl, others say xmlurl, but only the former is accepted for now.  I realize that some use text, others use title and still others use both, but again, this initial implementation accepts only text.
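A minimal sketch of such a strict converter, in Python, might look like the following.  It is not the converter described above.  The function name opml_to_config is hypothetical, and the output layout (one section per feed URL with a name option) and the fallback to the URL when text is missing are assumptions.

```python
import configparser
import io
import xml.etree.ElementTree as ET


def opml_to_config(opml_text):
    """Convert OPML text into config.ini text (hypothetical helper).

    Strict by design: XML that is not well formed raises ET.ParseError,
    and only the exact attribute names xmlUrl and text are honoured.
    """
    root = ET.fromstring(opml_text)  # raises on malformed documents
    config = configparser.ConfigParser(interpolation=None)
    config.optionxform = str  # keep option names exactly as written
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")  # "xmlurl" is deliberately ignored
        if url is None:
            continue
        # Assumption: fall back to the URL when there is no text attribute.
        config[url] = {"name": outline.get("text", url)}
    out = io.StringIO()
    config.write(out)
    return out.getvalue()
```

Outlines without an xmlUrl attribute, such as folders, are simply skipped.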

When I was done, I realized that I had just implemented Reading Lists.  To use, simply place a list of OPML files, one per line, into your config.ini.  If you are online, these files will be fetched on every run.  If there are fetch errors or you are offline, the last cached version will be used.  Feeds that are no longer in the list will no longer be polled.  Feeds that are added to the list will start to be polled.
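The fetch-or-fall-back behaviour and the change in polled feeds can be sketched roughly as follows.  This is an illustration in Python under stated assumptions, not Planet's code: load_reading_list, subscription_changes and fetch_url are hypothetical names, and the cache is assumed to be a plain file per list.

```python
import urllib.request


def fetch_url(url):
    """Default fetcher; any callable that returns bytes or raises OSError works."""
    with urllib.request.urlopen(url, timeout=30) as response:
        return response.read()


def load_reading_list(url, cache_path, fetch=fetch_url):
    """Return the list's bytes, refreshing the cache when the fetch succeeds."""
    try:
        data = fetch(url)
    except OSError:  # urllib.error.URLError is a subclass of OSError
        # Offline or fetch error: use the last cached version.
        with open(cache_path, "rb") as cached:
            return cached.read()
    with open(cache_path, "wb") as cached:
        cached.write(data)
    return data


def subscription_changes(previous, current):
    """Return (feeds to start polling, feeds to stop polling)."""
    previous, current = set(previous), set(current)
    return sorted(current - previous), sorted(previous - current)
```

If a list has never been fetched successfully there is no cache to fall back on, and this sketch lets the resulting FileNotFoundError propagate.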

Results

See for yourself.

One of the first things I noticed is that the OPML top 100 list is really a top 91 list.  Several weblogs publish multiple feeds, either in multiple formats, or have one that they publish and one that FeedBurner provides.  This means that the popularity of people who publish only one feed is overstated, and the popularity of people who publish multiple feeds is understated.

In all, multiple feeds cause more work for everybody.  People really should pick one x.0 format (RSS 1.0, RSS 2.0, or Atom 1.0) per feed, and stick with it.

That being said, Planet seems to do a fairly decent job of detecting and eliminating these duplicates.

And Planet also excels at handling encoding issues.  £500 comes out as £500 on planet.intertwingly, whereas it shows up as £500 on hosting.opml.  And the ever so popular ’ is correctly displayed as a smart quote.
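Garbling of this kind typically arises when UTF-8 bytes are decoded with a single-byte encoding.  The short Python sketch below reproduces the effect for a pound sign and a right single quotation mark.  The helper name misdecode is hypothetical, and this is only a guess at the cause, since what the affected site actually does is not known.

```python
def misdecode(text, wrong_encoding):
    """Encode text as UTF-8, then decode the bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)


print(misdecode("\u00a3500", "latin-1"))  # Â£500
print(misdecode("\u2019", "cp1252"))      # â€™

# Reversing the two steps repairs the damage when no bytes were lost.
print("Â£500".encode("latin-1").decode("utf-8"))  # £500
```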

It seems that not everybody provides author information, even for group blogs like Make, though this information is in Make’s Atom 0.3 feed.  This is silly.  Pick one feed format.  Preferably with an x.0 version number.  And provide the same information to everyone.

I have yet to spot a relative URI reference issue.

All of the popular feeds seem to be present, active, responsive.

Potential improvements

ExtremeTech’s managing editor value is not exactly an email address, and this confuses the feed parser slightly.  This feed also doesn’t have any pubDates.
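A comment further down reports the raw value as editor@extremetech.com?subject=RSS_feed.  One way a parser could cope with such a value is to pull out the first thing that looks like an address, as in the Python sketch below.  The pattern is deliberately loose and the name extract_email is hypothetical; this is not how the Universal Feed Parser handles the field.

```python
import re

# Loose pattern: local part, "@", then a dotted host name.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def extract_email(raw):
    """Return the first email-like substring of raw, or None."""
    match = EMAIL.search(raw)
    return match.group(0) if match else None


print(extract_email("editor@extremetech.com?subject=RSS_feed"))
# editor@extremetech.com
```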

Ars Technica double escapes titles.  While what the Universal Feed Parser is doing is defensible, there are some heuristics that can be added.  While guessing can never be as good as knowing, the odds can certainly be improved.
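One such heuristic, sketched in Python: if a title still contains well-formed entity references after the XML parser has already removed one level of escaping, unescape it once more.  The name unescape_title is hypothetical, and this is a guess at a workable rule rather than anything the Universal Feed Parser implements.

```python
import html
import re

# A complete, semicolon-terminated named or numeric character reference.
ENTITY = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def unescape_title(title):
    """Unescape once more only if entity references survived parsing."""
    if ENTITY.search(title):
        return html.unescape(title)
    return title


print(unescape_title("Ars &amp; Technica"))  # Ars & Technica
print(unescape_title("Q & A"))               # Q & A
```

The guess fails for a title that really is about the text "&amp;", which is why it can only improve the odds.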

While tools like IE7 can be described as Draconian, there are a few places where the feed parser borders on being Procrustean: <font size="1"> is AOK, but <span style="font-style: italic;"> is not.  A white-list of CSS properties should be defined.
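A white-list filter for style attributes could start out along these lines.  This is a Python sketch only: the property set in ALLOWED_PROPERTIES is an arbitrary assumption, sanitize_style is a hypothetical name, and a real filter would also need to vet the values, which this one passes through untouched.

```python
# Assumption: an illustrative set, not a vetted white-list.
ALLOWED_PROPERTIES = {"font-style", "font-weight", "text-align", "text-decoration"}


def sanitize_style(style):
    """Keep only declarations whose property name is white-listed."""
    kept = []
    for declaration in style.split(";"):  # naive split, ignores quoted ";"
        name, sep, value = declaration.partition(":")
        name, value = name.strip().lower(), value.strip()
        if sep and value and name in ALLOWED_PROPERTIES:
            kept.append(f"{name}: {value}")
    return "; ".join(kept)


print(sanitize_style("position: fixed; font-style: italic;"))
# font-style: italic
```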

Conclusions?

Anybody spot anything I missed?

As near as I can tell, the issues are minor, and the process of tweaking is one that never ends.  It’s time to start wrapping this up.


A white-list of css properties should be defined. ... Anybody spot anything I missed?

You forgot to use the active voice, as in “I should define a white-list of CSS properties and give Mark a patch and a heaping pile of test cases.”

Posted by Mark at

give Mark a patch and a heaping pile of test cases

Now serving number 4

Posted by Sam Ruby at

require 'open-uri'
# URI.open is the open-uri entry point; Kernel#open no longer accepts URLs in Ruby 3
page = URI.open('http://hosting.opml.org/sharemonster/aggregator.html').read

page.scan(/style="(.*?)"/).flatten.join(';').scan(/([-\w]+):\s*.*?;/).flatten.sort.uniq
=> ["border", "clear", "float", "font-size", "height", "margin", "margin-bottom", "margin-left", "margin-top", "padding", "text-align"]

page.scan(/style="(.*?)"/).flatten.join(';').scan(/([-\w]+):\s*(.*?);/).sort.uniq
=> [["border", "0"], ["border", "solid 1px #E5C2C3"], ["clear", "both"], ["float", "left"], ["font-size", "1px"], ["height", "2px"], ["margin", "0"], ["margin", "0 10px 0 0"], ["margin-bottom", "12px"], ["margin-left", "20px"], ["margin-top", "-5px"], ["padding", "0"], ["padding", "8px 0 0 0"], ["text-align", "center"]]
Posted by Sam Ruby at

lookslikehtml.py

Posted by Sam Ruby at


I was a little curious about the ExtremeTech thing, since originally I basically copied Mark’s code for handling author stuff for FeedTools.  Except, FeedTools produces pretty much exactly what I would have expected:

feed = FeedTools::Feed.open('http://rssnewsapps.ziffdavis.com/extreme.xml')
=> #<FeedTools::Feed:0x152d13a URL:http://rssnewsapps.ziffdavis.com/extreme.xml>
feed.author
=> #<FeedTools::Author:0x2a338cc @name=nil, @email="editor@extremetech.com", @href=nil, @raw="editor@extremetech.com?subject=RSS_feed">

I can’t remember though, does the UFP expose the original text value of this field anywhere?

Posted by Bob Aman at


Manual trackback:
SPARQLing FOAFrolls

Posted by Danny at


Themes for Planet

The next feature I added to Planet Venus (though it could easily be backported to classic Planet) is that of themes. The basic idea is refactoring with an eye towards reducing the amount of configuration required to get started with planet.  And the im... [more]

Trackback from Sam Ruby at

Sam Ruby: Themes for Planet

Planet Webservices: The next feature I added to Planet Venus (though it could easily be backported to classic Planet) is that of themes. The basic idea is refactoring with an eye towards reducing the amount of configuration required to get started...

Excerpt from java.blogs Recent Entries at
