It’s just data

Stream Editing

As discussed on Nick Bradbury’s weblog, guids/ids are the ideal way to identify duplicates, except of course for when they don’t work.  Like, for example, in this feed.

The problems aren’t unique to RSS 2.0 feeds: I’ve seen undeclared HTML in summaries and even HTML in author names (in hindsight, I wish that atom:name was a Text Construct).

In another case, I know of a feed where the updated element gets updated frequently, something that is entirely legal, as the spec says that this indicates a change that the publisher considers significant.

So for Venus, I’ve provided for the ability to specify, on a per feed basis, the elements to ignore_in_feed, as well as an ability to override the name_type, title_type, summary_type, and content_type.  In each case, these options are rarely needed, but in the rare instance where they are appropriate, they are very handy.
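As a rough illustration of what such per-feed overrides might look like, here is a minimal sketch that parses an INI-style config with Python's standard configparser. The option names come from the paragraph above; the feed URLs, the values, and the helper feed_options are assumptions made up for the example, not taken from a real Venus configuration.

```python
from configparser import ConfigParser

# Hypothetical config: each section is named after a feed URL (placeholders),
# and holds only the overrides that particular feed needs.
EXAMPLE_CONFIG = """
[http://example.com/frequently-touched.atom]
ignore_in_feed = updated

[http://example.com/html-in-names.atom]
name_type = html
"""


def feed_options(config_text, feed_url):
    """Return the options set for one feed as a plain dict."""
    parser = ConfigParser()
    parser.read_string(config_text)
    return dict(parser[feed_url])


if __name__ == "__main__":
    for url in (
        "http://example.com/frequently-touched.atom",
        "http://example.com/html-in-names.atom",
    ):
        print(url, feed_options(EXAMPLE_CONFIG, url))
```

Keeping each override in the section for a single feed means a workaround for one misbehaving publisher does not change how every other feed is handled.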

Each of those cases is surgical and affects only one type of element.  The more general solution is a filter, and as of now, filters can also be defined on a per-feed basis.  I’ve mentioned previously that they can be in any language, and to demonstrate that, I’ve made use of sed in a filter that strips Yahoo ads.
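To make the filter idea concrete, here is a minimal sketch of the same kind of stream edit in Python rather than sed. It assumes a filter receives an entry as text on standard input and writes the edited text to standard output; the ad markup pattern, the host ads.example.com, and the function names are hypothetical and are not the pattern used by the actual sed script.

```python
import re
import sys

# Hypothetical ad markup: a paragraph wrapping a single link to an ad host.
AD_PARAGRAPH = re.compile(
    r'<p><a href="http://ads\.example\.com/[^"]*">.*?</a></p>',
    re.DOTALL,
)


def strip_ads(entry_text):
    """Remove every paragraph that matches the assumed ad pattern."""
    return AD_PARAGRAPH.sub("", entry_text)


def run_filter(source, sink):
    """Read the whole entry from source, write the cleaned entry to sink."""
    sink.write(strip_ads(source.read()))


# A script used as a filter would end with: run_filter(sys.stdin, sys.stdout)
```

Because the filter only sees text in and text out, the same job could be done in sed, Python, or anything else that can sit in a pipe.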


Feeds like [link] are why some folks at Microsoft came up with the Simple List Extensions for RSS spec. Simply ignoring elements won’t give the right experience in an aggregator even if it is a Planet feed.

However I haven’t seen much take up of SLE and given how I’ve never gotten a complaint about such feeds from RSS Bandit users, I wonder if anyone actually subscribes to such feeds at all.

Posted by Dare Obasanjo at

Guids/ids are the ideal way to identify duplicates, except of course for when they don’t work.  Like, for example, in this feed.

When I was testing RSS duplicate detection, that was one of the things I looked at (not specifically that feed, but a test feed where all guids were the same). The results were all over the place. Out of about 20 aggregators I think I got 9 different interpretations.

I’ve seen undeclared HTML in summaries

I’ve seen this fairly often in Atom 0.3 WordPress feeds. Om Malik and Ian Davis are two high-profile examples. Those feeds only seem to contain numeric entities though - actual HTML elements are rarer.

Posted by James Holderness at

Those feeds only seem to contain numeric entities though - actual HTML elements are rarer.

Just in case I was being too subtle, in the original post I provided a link to an active Atom 1.0 feed that included an HTML character entity reference in an atom:name.

I will agree that it is rare.

Posted by Sam Ruby at

Just in case I was being too subtle, in the original post I provided a link to an active Atom 1.0 feed that included an HTML character entity reference in an atom:name.

That’s an easy one though. If you’re seeing an HTML entity in atom:name, there’s very little chance it was intended to be the sequence of letters AMP, E, T, H, SEMICOLON (or whatever the entity). I would think it’d be fairly safe for an aggregator to check for mistakes like that and auto-correct the entities, no? Not that I’m advocating that in general - just something to consider for fans of the liberal parsing philosophy.

Posted by James Holderness at

I would think it’d be fairly safe for an aggregator to check for mistakes like that and auto-correct the entities, no?

Those that wish to have this behavior with Venus can simply move this option into a [DEFAULT] section.

Liberal/strict is not a binary thing.  When the spec is clear and the feed is valid, I personally prefer the default to be to conform to the specification precisely.

Posted by Sam Ruby at

Venus

As the eagle-eyed among you may already have noticed, Planet Musings is now powered by Sam Ruby’s Venus. What...... [more]

Trackback from Musings

at

Venus

Sam Ruby has been giving out plenty of examples from his version of the Planet software, called Venus. Here are the posts so far: Reading Lists, Filters, MeMeme, Stream Editing. For me, what this needs is to be hooked up to a real database. So the...

Excerpt from ronin: Venus at
