As discussed on Nick Bradbury’s weblog, guids/ids are the ideal way to identify duplicates, except of course for when they don’t work. Like, for example, in this feed.
The problems aren’t unique to RSS 2.0 feeds; I’ve seen undeclared HTML in summaries and even HTML in author names (in hindsight, I wish that atom:name was a Text Construct).
In another case, I know of a feed where the updated element gets updated frequently, something that is entirely legal, as the spec says that this element indicates a change that the publisher considers significant.
So for Venus, I’ve provided the ability to specify, on a per-feed basis, the elements to ignore_in_feed, as well as the ability to override the content_type. These options are rarely needed, but in the rare instances where they are appropriate, they are very handy.
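To make the per-feed idea concrete, here is a minimal sketch of what such a section might look like, read with Python's standard configparser. The feed URL, the name, and the chosen values are made up for illustration, and the space-separated list for ignore_in_feed is an assumption on my part; this is not Venus's own parsing code.

```python
import configparser

# Hypothetical subscription: the URL, name, and values are illustrative only.
# Assumption: ignore_in_feed takes a space-separated list of element names.
CONFIG = """
[http://example.com/feed.xml]
name = Example Feed
ignore_in_feed = id updated
content_type = html
"""

parser = configparser.ConfigParser()
parser.read_string(CONFIG)

# Each subscription is its own section, keyed by the feed URL.
feed = parser["http://example.com/feed.xml"]

# Elements that should not be trusted for this one feed.
ignored = feed.get("ignore_in_feed", "").split()

# Type override for this feed's content; None when the option is absent.
content_type = feed.get("content_type")

print(ignored, content_type)
```

Because both options live in the feed's own section, every other subscription is left alone.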
Each of those options is surgical and affects only one type of element. The more general solution is a filter, and as of now, filters can also be defined on a per-feed basis. I’ve mentioned previously that they can be in any language, and to demonstrate that, I’ve made use of sed in a filter that strips Yahoo ads.
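For comparison, the same kind of filter could be sketched in Python. This assumes a filter receives the entry on standard input and writes the modified entry to standard output, and that the ad appears as literal, unescaped markup. The ad pattern below (a link to ads.example.com) is hypothetical and is not the actual Yahoo markup; the names strip_ads and AD_PATTERN are mine.

```python
import re
import sys

# Hypothetical ad markup: an <a> element linking to ads.example.com.
# The real Yahoo ad markup differs; adjust the pattern to match it.
AD_PATTERN = re.compile(
    r'<a\s[^>]*href="https?://ads\.example\.com/[^"]*"[^>]*>.*?</a>',
    re.DOTALL,
)


def strip_ads(entry_xml: str) -> str:
    """Return the entry with every match of AD_PATTERN removed."""
    return AD_PATTERN.sub("", entry_xml)


def main(stdin=sys.stdin, stdout=sys.stdout) -> None:
    """Read one entry from stdin and write the filtered entry to stdout."""
    stdout.write(strip_ads(stdin.read()))

# When saved as a standalone filter script, finish the file with: main()
```

A regular expression is about as blunt as sed here; a filter that needs to be structure-aware would parse the entry as XML instead.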