It’s just data

Clone Wars

In RSS 2.0, guid is “a string that uniquely identifies the item.”

In this feed, there are 100 items, all with the same guid.

So far, I’ve tested this feed with four aggregators, and each displays 100 separate items.  Should they?

Is this why we are seeing so many duplicates?


No, they shouldn’t, but I can understand why they do: users would be frustrated if they didn’t see the content they expected.

My homegrown aggregator didn’t display them, but Bloglines does.  Bloglines has some funny dupe/update logic that results in me seeing dupes and updates even after setting all feeds to “Ignore Updates”.

But Yahoo! has had problems for a while: [link]

Posted by James E. Robinson, III at

Duplication in RSS is almost always due to publishers not using GUIDs to uniquely identify each item.  When a feed doesn’t use GUIDs, each aggregator is forced to come up with its own method of uniquely ID-ing items.  This is usually a combination (or hash) of the title, link and pubDate (if it exists).  In the absence of a GUID, duplication will occur when the feed publisher changes any of these elements for an item, causing the aggregator to treat it as a new item.
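To make the fallback concrete, here is a minimal sketch of the kind of identity function described above. It is an illustration only, not FeedDemon's code: the name `item_key`, the dict-shaped item, and the choice of SHA-1 are all assumptions made for the example.

```python
import hashlib


def item_key(item: dict) -> str:
    """Return an identity string for a parsed feed item.

    `item` is assumed to be a plain dict with optional "guid", "title",
    "link" and "pubDate" keys (a hypothetical shape for this sketch).
    """
    guid = item.get("guid")
    if guid:
        # A guid, when present, survives edits to the title or date.
        return "guid:" + guid

    # No guid: fall back to a hash of title, link and pubDate.
    parts = [
        item.get("title") or "",
        item.get("link") or "",
        item.get("pubDate") or "",
    ]
    digest = hashlib.sha1("\x00".join(parts).encode("utf-8")).hexdigest()
    return "hash:" + digest
```

With this fallback, a publisher who edits a title produces a different key and the item shows up again as new, which is exactly the duplication described above. An item carrying a guid keeps the same key through that edit.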

FWIW, FeedDemon displays a single item from this Yahoo! feed due to the duplicate GUIDs.

Posted by Nick Bradbury at

It’s hard to counter this problem. Here’s ESPN: [link]. Duplicate GUIDs and links. This is really fun with an RDF store.

You tend to get bug reports from users who’ve seen all the entries shown by simple implementations that loop through whatever happens to be in the feed.

Posted by Robert Sayre at

Robert, while each item in that feed has the same value for guid and link, as near as I can tell no two items have the same value for guid or link.  What problem do you see with that feed?

Posted by Sam Ruby at

Nick Bradbury hit it on the head. Because not all feeds have GUIDs, the aggregator writer must develop an alternative. The “alternative” that is developed then works for feeds whether they have GUIDs or not. The aggregator writer then has a choice to “support” GUIDs or not (i.e. to show updates as new items or not). I implemented FeederReader with the “alternative” and am just now writing the logic to optionally handle GUIDs. Note that this gets pretty rough when a podcaster updates the show notes to a podcast and the aggregator has already partially or fully downloaded the podcast/enclosure.

Greg Smith
Author, FeederReader - The Pocket PC RSS, podcatcher, videocatcher
www.FeederReader.com - Download on the Road

Posted by Greg Smith at

This confused me, as I agree with you in terms of behavior, so I went looking through the Bloglines code. And what I found (some of the code’s dusty at this point), is that we special case feeds that have all the guids point to the same thing. In that case, we ignore all the guids in that feed. I don’t remember the specific feeds that were broken at the time which caused us to do this (and it’s not mentioned in the CVS logs). I can understand why we do this, because a feed configured like this is almost certainly ‘broken’ and the people responsible don’t mean for it to be this way (if even Yahoo can screw it up, anyone can...).
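A rough sketch of that special case, for readers who want to see the shape of it. This is not the Bloglines code; the function names and the link-plus-title fallback are assumptions chosen for illustration.

```python
def guids_are_trustworthy(items):
    """Return False when every guid in the feed is the same value."""
    guids = [item.get("guid") for item in items if item.get("guid")]
    if len(guids) < 2:
        # Nothing to compare against, so give the feed the benefit of the doubt.
        return True
    return len(set(guids)) > 1


def identity_keys(items):
    """Build one identity key per item, ignoring guids in a 'broken' feed."""
    use_guid = guids_are_trustworthy(items)
    keys = []
    for item in items:
        if use_guid and item.get("guid"):
            keys.append(("guid", item["guid"]))
        else:
            # Assumed fallback for this sketch: link plus title.
            keys.append(("link+title", item.get("link", ""), item.get("title", "")))
    return keys
```

Run against a feed of 100 items that share one guid but have distinct links, this yields 100 distinct keys, which matches the behavior of displaying 100 separate items.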

I think there are other issues with duplicates. Anytime someone moves their feed to/from a service like Feedburner, or turns on/off ads from Feedburner, that causes duplicates. Sometimes, like with Google’s blog for a long time, the feed is served from multiple machines, and the copy on each machine is subtly different by a byte or two. We’ve seen cases where Blogspot will occasionally serve up the wrong feed for a given feed URL, which causes lots of fun.

Posted by Mark Fletcher at

Sam: looks like they’ve fixed it since I last looked. They used to be the same. Sorry!

Posted by Robert Sayre at

The problem of bad feed implementations is so pervasive that in my own aggregator I allow the user to configure a property for each feed.  “Unique key:  _ rdf:about  _ link  _ guid  _ auto”.  (I haven’t gone so far as to add the option for combining any of the above with the subject and/or date-time, but I’ve been tempted!)  The default is “auto”, which puts my code into logic that I’m sure is not that different from what other developers have done: a “best guess” based on whatever information is available in the feed — and with little or no trust placed in whatever dtd, schema, or namespace info might exist.  Giving the users a way to over-ride the best guess is certainly not an ideal solution, but it does help in some cases.  I think that a better solution will have to emerge through some sort of explicit instruction contained in the feed, whose very presence we can take to mean “trust me... my guids really are unique”.
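A per-feed setting like this might look roughly as follows. The sketch is hypothetical: the `UniqueKey` enum, the `unique_key` function, and in particular the order that “auto” tries fields in are assumptions, not the logic of the aggregator described above.

```python
from enum import Enum


class UniqueKey(Enum):
    RDF_ABOUT = "rdf:about"
    LINK = "link"
    GUID = "guid"
    AUTO = "auto"


def unique_key(item, setting=UniqueKey.AUTO):
    """Pick the identity value for an item according to the per-feed setting.

    `item` is assumed to be a dict keyed by element or attribute name.
    """
    if setting is not UniqueKey.AUTO:
        # The user has overridden the best guess for this feed.
        return item.get(setting.value)

    # "auto": best guess from whatever is available (assumed order).
    for field in ("guid", "rdf:about", "link"):
        value = item.get(field)
        if value:
            return value
    return None
```

For a feed like the one in the post, a user could switch the setting from “auto” to “link” so that the cloned guids are bypassed.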

-rich

Posted by Richard H Schwartz at

Why Atom won't be a magic balm

This post from Sam Ruby explains why Atom won’t be the “magic balm” so many people seem to think it will be. Sadly, bad practices don’t get fixed due to a change in the underlying format......

Excerpt from Smalltalk Tidbits, Industry Rants at

Sam: Newzcrawler displayed a single item, as expected.

Annoyingly, JournURL sucked 'em all in and went on its merry way. But it’s fixed now... thanks for bringing it up.

Posted by Roger Benningfield at

If these are our Clone Wars, who’s playing the role of the Sith?

Posted by Mark at

Mark, Clone Wars happened to be the top entry in this feed.

Posted by Sam Ruby at

That’s just funny! I’m certain it’s a bug that will shortly be fixed.

...

And hope.

Posted by Randy Charles Morin at

Sam Ruby: Clone Wars

[link]...

Excerpt from del.icio.us/markstanton at

RSS Duplicate Detection

How does one uniquely identify an item in a feed while still allowing for that item to be updated? RSS 2.0 has a guid element that fits the bill perfectly, but it’s not a required element and many feeds don’t use it. As a result, aggregator authors...

Excerpt from 詹姆斯 at

I’m using:

Opera/9.63 (Windows NT 6.0; U; en) Presto/2.1.1

It displays a single item.

Posted by Matt at
