ScalableAtomspace


Scaling Atom: P2P and Cached Feeds.

  1. Scaling Atom: P2P and Cached Feeds.
    1. Proposals
    2. The syndicated blogosphere will reach 300 million feeds in 3 years.
    3. Feed payloads will grow 100 to 10,000 times.
    4. Each reader may consume 1000 feeds.
    5. Syndication Growth = Denial Of Service
    6. Discussion

Proposals

The syndicated blogosphere will reach 300 million feeds in 3 years.

We are still very early in the adoption of RSS feeds. Very few publishers. Even fewer readers. How will this change?

Assume growth.

In two years:

  1. Every blogger will publish a main feed.

  2. Each blog's category or topic will have a mirror feed.

  3. Every business system requiring a user ID will customize feeds for each user.

  4. Every major media outlet will drive traffic and affiliation by publishing feeds.

  5. Some consumers will add editorial value by blending existing feeds into new, focused feeds.

I assume AOL, Microsoft, Yahoo!, and Terra will turn on blogging tools in the next 18 months, and that 10% of the online community (70 million people) will become bloggers.

So, many feeds.

Feed payloads will grow 100 to 10,000 times.

TiVo users often record more programming than they can possibly watch (assuming employment and sleep). This assures freedom and choice. There is every reason to believe that newsreader users will behave likewise.

Which brings us to bandwidth...

A picture is worth a thousand words. Literally.

If so, what are audio and video worth? Moblogging and photoblogs will only exacerbate the payload growth.

In bandwidth terms, text is nearly free over land lines. Images, sounds, and video will comprise a growing share of bandwidth costs.

Each reader may consume 1000 feeds.

We'll also grow in our ability to read them.

Newsreaders will help us filter and prioritize our reading.

So our capacity to follow more feeds will also grow by at least one to two orders of magnitude. Most people follow fewer than 100 feeds in their newsreaders now. I follow nearly 1000: 50 religiously, 200 regularly. But all of them are searchable on my hard drive, and they all pop up in a balloon when they update.

And we don't have useful filters now. When the tools start to do more, the number of feeds consumed per reader will grow.

Syndication Growth = Denial Of Service

Let's assume that for every blogger there are two non-bloggers reading. That puts feed readership at around 200 million. So you have 200 million readers, each probing a thousand feeds an hour for updates. That's 200 billion probes an hour. Don't get me started on how many terabytes of flow that represents.
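
As a rough back-of-the-envelope check on those numbers (assuming, purely for illustration, about a kilobyte per headers-only probe):

    # Back-of-the-envelope check on the polling load.
    # The 1 KB per probe is an assumed figure for a headers-only HTTP round trip.
    readers = 200_000_000        # estimated feed readers
    feeds_per_reader = 1_000     # feeds each reader probes per hour
    bytes_per_probe = 1_000      # assumed cost of one "anything new?" probe

    probes_per_hour = readers * feeds_per_reader
    terabytes_per_hour = probes_per_hour * bytes_per_probe / 1e12

    print(f"{probes_per_hour:,} probes/hour")      # 200,000,000,000
    print(f"~{terabytes_per_hour:,.0f} TB/hour")   # ~200 TB/hour, before any payloads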

Do people only probe hourly? How often would you check for the latest scores during the World Cup or Super Bowl? For election results? For your medical report? Some fraction of services must support updates at closer intervals.

What architectures will support this scale? Peer-to-peer (P2P) distribution and caching by intermediaries (communal aggregators) have helped other systems scale. Both add delays to distribution while absorbing the publisher's bandwidth costs and connections. There is no reason why we shouldn't apply both architectures to this problem.
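
To make the caching half concrete, here is a minimal sketch (in Python, with illustrative names and cache structure) of a communal aggregator using HTTP conditional GETs, so many local readers share one cached copy and an unchanged feed costs only headers:

    import urllib.error
    import urllib.request

    # Minimal sketch of a communal aggregator acting as a caching intermediary.
    # Many local readers share the one cached copy; the origin is only asked
    # whether anything changed (conditional GET via ETag / 304 Not Modified).
    cache = {}   # feed_url -> {"etag": ..., "body": ...}   (illustrative structure)

    def fetch_feed(feed_url: str) -> bytes:
        request = urllib.request.Request(feed_url)
        cached = cache.get(feed_url)
        if cached and cached.get("etag"):
            # Lets the origin answer 304 with headers only when the feed is unchanged.
            request.add_header("If-None-Match", cached["etag"])
        try:
            with urllib.request.urlopen(request) as response:
                cache[feed_url] = {"etag": response.headers.get("ETag"),
                                   "body": response.read()}
        except urllib.error.HTTPError as err:
            if err.code != 304:          # 304 means our cached copy is still current
                raise
        return cache[feed_url]["body"]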

So I ask the Echo community: what changes if:

  1. The physical location of the feed is not the feed's original source?

  2. A client must choose from among multiple sources of the same feed?

  3. The publisher, while not abdicating a feed's authoritative URL, wishes to redirect some or all consumers to any of a list of alternative locations? (Think mirrored downloads or bitstream)

  4. The copy of a feed file has been passed on ten times before you receive it?

P.S. I welcome challenges to my assumptions, estimates, and conclusions. While I'm pretty confident in the shape of this analysis, details matter.

Discussion

[AdamRice] Quite an interesting bundle of ideas to chew on.

[MichaelManley RefactorOk] Perhaps mirroring Atomic Feeds is an opportunity to start establishing a PKI infrastructure. If the managing editor or some other authority sign the feed as a whole, that could open up possibilities for mirroring of feeds without fear of the feeds themselves being compromised. On the initial subscription to a feed, the aggregator would connect to the feed originator and get the public key of the keypair used to sign the feed. The aggregator could pick up the feed from any mirror (or other distribution mechanism) and be reasonably assured that the feed had not been tampered with since the original publication by verifying the signature. Mirroring feeds with authentication, alongside whatever caching mechanism the transport provides, could mitigate bandwidth concerns for popular feeds (thinking of feeds distributed via bittorrent, for example). Also, should pointers to public keys be made part of the AutoDiscovery mechanism?
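
A minimal sketch of the sign-then-verify step Michael describes, using detached Ed25519 signatures from the Python cryptography package purely as an illustration; the actual key format, signature placement, and discovery mechanism are exactly what the spec would have to define:

    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    # Publisher side: sign the feed document as a whole.
    signing_key = Ed25519PrivateKey.generate()
    public_key = signing_key.public_key()        # fetched once from the feed originator
    feed_bytes = b"<feed>...</feed>"             # the serialized feed payload
    signature = signing_key.sign(feed_bytes)     # detached signature shipped alongside

    # Aggregator side: the copy may arrive from any mirror; verify before trusting it.
    def feed_is_authentic(body: bytes, sig: bytes) -> bool:
        try:
            public_key.verify(sig, body)
            return True
        except InvalidSignature:
            return False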

[TomasJogin] To me, this sounds like a case of trying to solve problems that would be cool to have in a distant future. The web has been growing at incredible speed for five to ten years now. As the selection of weblogs and webpages grows, the chance that someone will handpick your weblog to subscribe to only shrinks.

[AdamRice] Yeah, Tomas is right. In the meantime, [WWW]Shrook is interesting.

[AsbjornUlsberg] What about providing an HTTP PUSH method for aggregators that don't want to be polled every nth second, and where authors (not readers) want control over when their article is updated all around the world? PUSH requires subscription and some process around it, but that will probably be required in a lot of cases and should be defined anyhow.
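
One possible shape for that push direction, sketched in Python with purely hypothetical names (there is no standard HTTP PUSH method; this only shows a publisher POSTing the new feed to callback URLs collected at subscription time):

    import urllib.request

    # Hypothetical publisher-side push: readers registered a callback URL when they
    # subscribed, and the publisher decides when the new feed goes out to them.
    subscribers = ["https://reader.example/atom-callback"]   # illustrative registry

    def push_update(feed_bytes: bytes) -> None:
        for callback in subscribers:
            request = urllib.request.Request(
                callback,
                data=feed_bytes,
                headers={"Content-Type": "application/atom+xml"},
                method="POST",
            )
            urllib.request.urlopen(request)   # author, not reader, controls the timing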


Original Author: PhilWolff in [WWW]Scaling Echo

CategoryArchitecture, CategoryModel, CategoryApi