Bah. Over half of them are bogus. Not sure how that qualifies as “pure greatness”. For an AI company, this seems to be strong on one letter, and weak on the other...
301 is not bogus, it just means that the list does not reflect the currently preferred URI. But if you look at the list itself, two sites are grossly overrepresented, and they account for the majority of the 301s and Timeouts.
If you exclude those two sites, the percentages are much better. What’s left is a cross sample of the craziness that you find on the internet. My favorite is 304. Note that this is my first request to that particular URI...
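For anyone who wants to redo the breakdown with the over-represented sites left out, here is a minimal sketch. It assumes the crawl results are already available as (URI, status) pairs; the function names and the sample data are made up for illustration and are not part of the original analysis.

```python
from collections import Counter
from urllib.parse import urlsplit


def host_of(uri):
    """Return the lowercased host part of a URI, or '' if there is none."""
    return (urlsplit(uri).hostname or "").lower()


def top_hosts(results, n=2):
    """Return the n hosts that contribute the most URIs to the list."""
    hosts = Counter(host_of(uri) for uri, _status in results)
    return [host for host, _count in hosts.most_common(n)]


def tally_statuses(results, exclude_hosts=()):
    """Count status codes, skipping URIs whose host is in exclude_hosts."""
    skip = {h.lower() for h in exclude_hosts}
    counts = Counter()
    for uri, status in results:
        if host_of(uri) in skip:
            continue
        counts[status] += 1
    return counts


# Hypothetical sample data, not taken from the real corpus.
sample = [
    ("http://big.example/a/feed", 301),
    ("http://big.example/b/feed", 301),
    ("http://big.example/c/feed", 408),
    ("http://small.example/rss", 200),
    ("http://other.example/atom.xml", 304),
]

if __name__ == "__main__":
    dominant = top_hosts(sample, n=1)
    print("all hosts:", dict(tally_statuses(sample)))
    print("without", dominant, ":", dict(tally_statuses(sample, dominant)))
```

Matching on the exact host is the simplest choice. If the dominant sites hand out one subdomain per user, you would want to group by registered domain instead.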
Of course, even a 200 OK is no guarantee that the URI returns back a feed. I’ve seen too many misconfigured sites that return back HTML with an OK... Heck, I’ve seen a number of sites that return back a status code of 404... along with the data.
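Since the status code alone says so little, a crawler could look at the body instead. The sketch below is only one rough way to do that: it parses the body as XML and checks the root element name. The name looks_like_feed is hypothetical, and the check is deliberately strict, so it will reject feeds that are not well-formed XML even though a liberal parser might still cope with them.

```python
import xml.etree.ElementTree as ET

# Root element names used by RSS 2.0, Atom, and RSS 1.0 (RDF) respectively.
FEED_ROOTS = {"rss", "feed", "RDF"}


def looks_like_feed(body):
    """Return True if body parses as XML and its root looks like a feed."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML at all (typical for tag-soup HTML).
        return False
    # Drop a "{namespace}" prefix if the root element is namespaced.
    local_name = root.tag.rsplit("}", 1)[-1]
    return local_name in FEED_ROOTS


if __name__ == "__main__":
    print(looks_like_feed(b'<rss version="2.0"><channel/></rss>'))
    print(looks_like_feed(b"<html><body><p>Not a feed</p></body></html>"))
```

The same check can be run on the body of a 404 response, which covers the sites that send the data along with the wrong status.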
Persai has not yet revealed what their backend AI does, other than that it uses Apache Hadoop. Getting a good set of input is a hard, and orthogonal, problem. Bloglines and GoogleReader certainly have that data. One solution (if they have access to the necessary servers to do it) is to set up their own Bloglines / GoogleReader system. If it is attractive enough, the users will come to them, and bring their data with them.
Perhaps Venus could help bootstrap this effort. It even can take care of a lot of the data cleansing needs.
IMHO, a corpus that big is going to be only useful for performance testing and figuring out just how liberal your parser is going to need to be. And in the case of performance testing at least, I don’t see the 301s as likely to be particularly problematic.
The 412 Precondition Failed responses are usually the result of overzealous security code, IIRC. I wouldn’t be surprised if you were able to retrieve the feed with a browser, but not with the Ruby HTTP client as configured by default.
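One way to test that theory would be to repeat the request with a different User-Agent header and compare the status codes. The sketch below uses Python's urllib rather than the Ruby client mentioned above, and the helper names are made up. Note that urlopen follows redirects on its own, so a 301 will not show up as a 301 here.

```python
import urllib.error
import urllib.request


def build_request(uri, user_agent):
    """Build a GET request that announces the given User-Agent."""
    return urllib.request.Request(uri, headers={"User-Agent": user_agent})


def fetch_status(uri, user_agent, timeout=10):
    """Return the final HTTP status code for uri (redirects are followed)."""
    request = build_request(uri, user_agent)
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; keep the code.
        return err.code


if __name__ == "__main__":
    # Placeholder URI and User-Agent strings; substitute a feed that returned 412.
    uri = "http://example.org/feed"
    for agent in ("Python-urllib", "Mozilla/5.0 (compatible; ExampleBot/0.1)"):
        try:
            print(agent, "->", fetch_status(uri, agent))
        except OSError as err:
            # Covers URLError and socket timeouts.
            print(agent, "-> failed:", err)
```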
The 408 is something that your crawler returned for socket timeouts?
If people just want a list of ‘popular’ feeds, we could likely get Bloglines to dump a list sorted by popularity. Lemme know if people are interested in it...
Wow. I’m really not sure how any of you found our posting but this is the internet... The feed corpus is just 2 days worth of work trying to build a list of unique feeds to seed our crawl with. It is very immature and I was being sarcastic when I called it a piece of pure greatness ;)
Ideally we want to compile a list of all known RSS feeds.
@PaulQuerna: We don’t care about popularity. We just want them all. :)
Kyle, you might want to watch rpc.weblogs.com/changes.xml then. The data is very active, but very dirty. In many cases, all you get is the address of a website. But if you do fetch that website, you often can find autodiscovery links in the response.
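The autodiscovery step can be sketched as follows, assuming the page has already been fetched and decoded to text. It looks for link elements with rel="alternate" and an RSS or Atom media type and resolves each href against the page address. The class and function names are invented for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkFinder(HTMLParser):
    """Collect feed autodiscovery links from an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        # Attribute values can be None for bare attributes; normalise to "".
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        media_type = attr.get("type", "").strip().lower()
        href = attr.get("href", "")
        if "alternate" in rels and media_type in FEED_TYPES and href:
            self.found.append(urljoin(self.base_url, href))


def discover_feeds(html_text, base_url):
    """Return absolute feed URIs advertised by the page, in document order."""
    finder = FeedLinkFinder(base_url)
    finder.feed(html_text)
    return finder.found


if __name__ == "__main__":
    page = (
        '<html><head>'
        '<link rel="alternate" type="application/atom+xml" href="/index.atom">'
        '<link rel="stylesheet" href="/style.css">'
        '</head><body></body></html>'
    )
    print(discover_feeds(page, "http://example.org/blog/"))
```

html.parser is tolerant of broken markup, which matters here given how dirty the pinged sites tend to be.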
Off-topic: in searching for that link, I came across yet another Feed Validator. At the present time, Doc Searls’s feed is not well-formed XML, but that “validator” declares it to be “a valid syndication Feed”.
Sam Ruby ran an analysis of the Persai Feed Corpus, showing how many of each HTTP status code he got back when he requested each feed. Even after looking at the site, I don’t really know what Persai is, but I have some experience with long lists...
Sam: I’ve looked into the various ping trackers and all of them are overrun with deceptive feeds and spam. Take a look at the changes.xml for a long list of URLs; the majority appeared to be spam.
Here’s a better source then: planet opml files. If you feel so inclined, here are a few more feeds — I’ve blocked planet intertwingly from being indexed via robots.txt, but you are welcome to include the feeds in your list.
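Pulling the feed addresses out of an OPML file is straightforward: each subscription is an outline element with an xmlUrl attribute. A minimal sketch, assuming the OPML is well-formed XML and already loaded into memory (the function name and sample document are made up):

```python
import xml.etree.ElementTree as ET


def feeds_from_opml(opml_text):
    """Return the unique xmlUrl values from an OPML document, in order."""
    root = ET.fromstring(opml_text)
    seen = set()
    feeds = []
    # Outlines can be nested inside category outlines, so walk the whole tree.
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url and url not in seen:
            seen.add(url)
            feeds.append(url)
    return feeds


if __name__ == "__main__":
    sample_opml = """<opml version="1.1">
      <head><title>Example planet</title></head>
      <body>
        <outline text="Group">
          <outline text="One" xmlUrl="http://one.example/feed.atom"/>
          <outline text="Two" xmlUrl="http://two.example/rss.xml"/>
        </outline>
        <outline text="One again" xmlUrl="http://one.example/feed.atom"/>
      </body>
    </opml>"""
    print(feeds_from_opml(sample_opml))
```

Deduplicating while reading keeps the seed list unique when the same feed appears on several planets.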
Thanks for the suggestion Sam! This list already includes a crawl of the top 1000 opml feeds returned from Google. The planet query should yield many more.
pardon my grammar in the previous comment, it’s gotta be the coffee :)
If people just want a list of ‘popular’ feeds, we could likely get Bloglines to dump a list sorted by popularity. Lemme know if people are interested in it...
I would kill for something like this if it were large enough (say 20,000+ which is the size of my current test list). I wouldn’t say no to a smaller list either. I do a lot of interoperability testing and a good, representative source of feeds is hard to come by. I’m mostly interested in RSS 2.0 feeds (for RSS Advisory Board work), but everything is good.
Valleywag outed the startup day job of the guys who collectively edit the hilarious snark site uncov. The startup, Persai, was “hiding in plain site” since they have a blog and have been pretty open about the tech......
Looks like Rich was playing with the Persai tar.gz web crawl they posted the other day. I got a sinking feeling as I read this. I had curl’d over the corpus already to eyeball it …yeah that’s a list of feeds all right… but...
“Blogging Persai” is the title of the blog run by the Persai guys. If you needed an indication of how this post is going to proceed, a major hint would be that I was sorely tempted to give the title “Flogging Persai” to it. For a bunch of guys who...