In another case, I know of a feed where the updated element gets updated frequently, something that is entirely legal, since the spec says that this element indicates a change that the publisher considers significant.
So for Venus, I’ve provided for the ability to specify, on a per-feed basis, the elements to ignore_in_feed, as well as an ability to override the name_type, title_type, summary_type, and content_type. In each case, these options are rarely needed, but in the rare instance where they are appropriate, they are very handy.
Each of those cases is surgical and affects only one type of element. The more general solution is a filter, and as of now, filters can also be defined on a per-feed basis. I’ve mentioned previously that they can be in any language, and to demonstrate that, I’ve made use of sed in a filter that strips Yahoo ads.
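To make the filter idea concrete, here is a minimal sketch in Python rather than sed. It assumes a filter is a plain stream editor that reads the entry on standard input and writes the cleaned entry to standard output, as a sed script would. The ad markup it looks for (a div with class "ad") is purely hypothetical and is not the pattern the actual Yahoo filter matches.

```python
import re
import sys

# Hypothetical marker: assume each ad is wrapped in <div class="ad">...</div>.
# Nested divs inside the ad would need a real parser; this is only a sketch.
AD_BLOCK = re.compile(r'<div class="ad">.*?</div>', re.DOTALL)


def strip_ads(text):
    """Return text with every block matching AD_BLOCK removed."""
    return AD_BLOCK.sub("", text)


def main(infile=sys.stdin, outfile=sys.stdout):
    """Act as a stream filter: read everything, strip ads, write the result."""
    outfile.write(strip_ads(infile.read()))
```

Calling main() at the bottom of the script would turn it into something that can sit in a pipe. A regular expression is about as surgical as sed here; anything smarter would want to parse the markup first.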
Feeds like [link] are why some folks at Microsoft came up with the Simple List Extensions for RSS spec. Simply ignoring elements won’t give the right experience in an aggregator even if it is a Planet feed.
However, I haven’t seen much uptake of SLE, and given that I’ve never gotten a complaint about such feeds from RSS Bandit users, I wonder if anyone actually subscribes to such feeds at all.
Guids/ids are the ideal way to identify duplicates, except of course for when they don’t work. Like, for example, in this feed.
When I was testing RSS duplicate detection, that was one of the things I looked at (not specifically that feed, but a test feed where all guids were the same). The results were all over the place. Out of about 20 aggregators I think I got 9 different interpretations.
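For illustration, here is one of the many possible interpretations, sketched in Python. It is not what any particular aggregator does. It assumes entries are plain dicts with optional "id", "link", "title", and "summary" keys (hypothetical names), trusts an id only when it is unique within the feed, and otherwise falls back to a hash of the entry's content.

```python
import hashlib
from collections import Counter


def entry_key(entry, id_counts):
    """Pick a dedup key: the id if it is unique in the feed, else a content hash."""
    entry_id = entry.get("id")
    if entry_id and id_counts[entry_id] == 1:
        return ("id", entry_id)
    # The id is missing or shared by several entries, so do not trust it.
    basis = "\n".join(entry.get(field, "") for field in ("link", "title", "summary"))
    return ("hash", hashlib.sha1(basis.encode("utf-8")).hexdigest())


def dedupe(entries):
    """Return entries in their original order with duplicates dropped."""
    id_counts = Counter(e["id"] for e in entries if e.get("id"))
    seen = set()
    unique = []
    for entry in entries:
        key = entry_key(entry, id_counts)
        if key not in seen:
            seen.add(key)
            unique.append(entry)
    return unique
```

With a feed where every guid is the same, this keeps each distinct item and collapses only the entries whose link, title, and summary all match. Another aggregator could just as defensibly keep only the first entry, which is presumably how the results end up all over the place.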
I’ve seen undeclared HTML in summaries
I’ve seen this fairly often in Atom 0.3 WordPress feeds. Om Malik and Ian Davis are two high-profile examples. Those feeds only seem to contain numeric entities though - actual HTML elements are rarer.
Just in case I was being too subtle, in the original post I provided a link to an active Atom 1.0 feed that included an HTML character entity reference in an atom:name.
That’s an easy one though. If you’re seeing an HTML entity in atom:name, there’s very little chance it was intended to be the sequence of letters AMP, E, T, H, SEMICOLON (or whatever the entity). I would think it’d be fairly safe for an aggregator to check for mistakes like that and auto-correct the entities, no? Not that I’m advocating that in general - just something to consider for fans of the liberal parsing philosophy.
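A rough sketch of that kind of auto-correction, in Python and assuming the name has already been pulled out of the feed as a plain string: it decodes only well-formed, semicolon-terminated references and leaves a bare ampersand alone. The function name fix_name is made up for the example.

```python
import html
import re

# Numeric (&#233; or &#xE9;) or named (&eth;) references, always ending in ';'.
ENTITY = re.compile(r"&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z][A-Za-z0-9]*);")


def fix_name(name):
    """Decode HTML entity references left behind in an atom:name string."""
    # html.unescape leaves unknown names such as &bogus; untouched.
    return ENTITY.sub(lambda match: html.unescape(match.group(0)), name)
```

Requiring the trailing semicolon is a deliberate choice: it keeps names like AT&T intact, at the cost of missing the sloppier references that browsers would still accept.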
Sam Ruby has been giving out plenty of examples from his version of the Planet software, called Venus. Here are the posts so far: Reading Lists, Filters, MeMeme, Stream Editing. For me, what this needs is to be hooked up to a real database. So the...