It’s just data
There's gotta be something in NNTP in relation to weblogs/notifications ... Since the idea was initially discussed it's been raised time and time again.
Old protocols never die, they (or the ideas of them) just get repurposed.
Posted by DJ
It is getting time for the Newswire beta to become ready. More frequently people (Dan, ...
Trackback from All Things Distributed
NNTP is evil!
See comments in http://www.advogato.org/article/651.html
Really, do we want to replace all-pull with all-push? Even if what we're pushing is highly optimized. Can we at least learn from NNTP's mistakes?
Posted by Mark
I don't want to replace all-pull with all-push, but what I do want to do is make it possible to push (which isn't feasible now) and to push actual useful information (comments, pings, whatever). I also want to yank in public key signing to minimize spoofing, and limits on what's required to be kept, which should set some folks off as well.
NNTP is useful as a source of ideas that worked, half worked, or failed utterly. I don't want to use NNTP as it stands, since I think the problems far outweigh the benefits, and the problem doesn't map quite right onto its solution space.
Also, people aren't really thinking big enough. Think of what you could do with a distributed, moderately secure, resource notification system. Besides just blogs, I'd like this to work well for other sources of info on the web--if it works well for blogs I can see online news sources (AP, Reuters, BBC, CNN, NY Times, EurekAlerts, Weekly World News...) doing it as well.
I, for one, would love a good feed backend system I could plug into the Scientific American, EurekAlert, and NYT science/tech feeds... (And yeah, I can do it now with a series of RSS fetches and scrapes, but dammit I want one source for the client, not many)
Posted by Dan
Mark,
Very interesting article you linked to. I particularly liked the respondent's idea for RSS caching via HTTP. This does beg the question, how do we make better use of the caching capabilities of the internet today?
This set of comments, I should point out, is an example of one of the reasons that I want to build this system. I don't want to have to refresh the page every once in a while to see if more comments have been posted, nor do I want to lose track of the conversation.
What I want is to be able to subscribe to the blog channel and get a ping when a comment or trackback is posted. (And let the end software filter out which bits of the channel's info are important to me, which is a separate problem--I need the data to solve it, though, and what I want is the ability to easily get the data.)
Posted by Dan
Dan, all the comments, trackbacks, pingbacks, and referrers (complete with auto-extracted excerpts) on this site are available as an RSS feed:
http://www.intertwingly.net/blog/comments.rdf
Replace ".rdf" with ".rss" or ".rss2" to get different RSS flavors.
Also, every individual post has its own RSS feed (also in several flavors), such as
http://www.intertwingly.net/blog/1350.rdf
All feeds are ETag-enabled (allowing for conditional HTTP GET) and gzipped (for clients that support such things).
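For anyone who hasn't seen conditional GET from the client side, here is a minimal Python sketch of the request such a client would build. The helper name `build_conditional_request` and the sample ETag value are invented for illustration; only the two header names come from HTTP itself.

```python
import urllib.request

def build_conditional_request(url, etag=None):
    """Build a GET that lets the server answer 304 if nothing changed."""
    req = urllib.request.Request(url)
    # Ask for a gzipped body when the server supports it.
    req.add_header("Accept-Encoding", "gzip")
    if etag is not None:
        # Echo back the ETag remembered from the last successful fetch.
        req.add_header("If-None-Match", etag)
    return req

req = build_conditional_request(
    "http://www.intertwingly.net/blog/comments.rdf",
    etag='"hypothetical-etag"',  # placeholder, not a real value from the site
)
# Normalize header names so they are easy to inspect.
headers = {name.lower(): value for name, value in req.header_items()}
```

Nothing is sent over the network here. A real client would pass the request to `urllib.request.urlopen`, treat a 304 as "keep the cached copy", and decompress a gzipped body itself.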
I thought individual feeds were also auto-discoverable via a LINK tag in the HEAD section of the HTML page, but apparently they are not. (Sam?)
(I am assuming for the purposes of this discussion that you understand everything I just said. If you don't -- and I mean this in the nicest possible way -- please go away and don't come back until you do.)
In short, Sam is doing just about everything you can do to maximize efficiency within the current paradigm. Which is not to say the paradigm shouldn't be thrown out, but it is worth running the numbers on Sam's blog to see exactly what you're proposing to replace.
Let's assume a maximally efficient client (supports conditional GET and compressed content) checking a feed once an hour that only changes twice a day. Sam's main feed (compressed) + associated HTTP headers is about 10K; conditional GETs are about 200 bytes (HTTP headers only).
For a cost of 24 HTTP requests and 25K a day, a single client gets the full content of Sam's posts, within an hour of when they're posted. Halving the wait time doubles the number of requests, but only increases the total bandwidth by 5K.
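The arithmetic above can be checked with a few lines of Python. This is only a sketch of the back-of-the-envelope numbers: the constant and function names are made up here, and "K" is treated as 1,000 bytes.

```python
FULL_FETCH_BYTES = 10_000   # compressed feed plus headers, "about 10K"
NOT_MODIFIED_BYTES = 200    # headers-only 304 response

def daily_cost(polls_per_day, changes_per_day=2):
    """Return (requests, bytes) per client per day for a polling client."""
    # Each change is picked up by exactly one full fetch.
    full = min(changes_per_day, polls_per_day)
    # Every other poll gets a 304 with no body.
    unchanged = polls_per_day - full
    total = full * FULL_FETCH_BYTES + unchanged * NOT_MODIFIED_BYTES
    return polls_per_day, total

hourly = daily_cost(24)              # (24, 24400), roughly 25K
half_hourly = daily_cost(48)         # (48, 29200)
extra = half_hourly[1] - hourly[1]   # 4800, roughly 5K
```

The two full fetches dominate the total, which is why polling twice as often adds so little.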
Multiply all of this per client.
Realistically, any system you propose will need to be an order of magnitude more efficient than this in order to be seriously considered.
Posted by Mark
Your point's well taken, Mark. Luckily I do have an idea of what I'm talking about. (Or at least the delusion that I do, which'll do for now.)
The scheme I'm considering would have, for a new blog posting, an average size of 256 bytes passed upstream, total, for all clients monitoring the blog. (Assuming I'm getting the PGP size right--I'm assuming 128 bytes for the signature, which may be wrong.)
No autoextracted excerpts or anything, as I'm not considering doing that, so changes would still cause a per-client fetch of the feed or part of the feed they were monitoring. All polling for feed changes would be eliminated for clients monitoring the change stream.
Is that enough to be worth it? I dunno. I think so, but that's me. This all may not go anywhere, as nobody may use it. If that happens, well, that's life.
(Plus this all makes clients a bit easier as they only need to monitor data coming in on one stream rather than making fetches from multiple remote hosts for all the RSS feeds, with the potential scheduling fun for them, but that's not really at issue at the moment)
Posted by Dan
Excellent. I understand that all you're proposing is the change notification system; clients would still need to download the actual content later (presumably from the source). So what you're proposing to replace (in my previous example) is the 22 HTTP requests per client per day that result in a 304 (not modified) response from Sam's web server. Instead, they would monitor a "change stream" and only download intertwingly.net/blog/index.rdf when it showed up in the stream as having changed.
Let me know if I'm getting this wrong.
Posted by Mark
Yep, you've got it, Mark. For a single client polling, the removal of load is trivial and likely not worth it. With dozens or hundreds of pollers, it makes rather more difference, and if you get up into the thousands or hundreds of thousands (imagine the watch list for a news website) well, that's a lot of data that isn't being sent, which strikes me as a good thing. Think of it as an explicit 304 check caching webserver network, or something equally buzzword-imbued. :)
The size of the notification record could be significantly reduced, but at this point I'm resigned to the necessity of a cryptographically non-pathetic authentication system, otherwise some slime somewhere will start pitching in tens of thousands of bogus change notifications into the system causing all the clients to pummel the data provider into oblivion. (Not, of course, that we have any malicious weenies on the network or anything...)
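To make the bogus-notification worry concrete, here is a rough Python sketch of a notice that a relay can check before passing it on. It is a stand-in, not the scheme described above: it uses a shared-secret HMAC from the standard library instead of public key signatures, so its 32-byte tag says nothing about the real signature size, and the record layout and function names are invented for illustration.

```python
import hashlib
import hmac

TAG_BYTES = 32  # length of an HMAC-SHA256 digest

def sign_notice(secret: bytes, channel: str, kind: str) -> bytes:
    """Return tag + payload for a change notice (hypothetical layout)."""
    payload = f"{channel}\n{kind}".encode("utf-8")
    tag = hmac.new(secret, payload, hashlib.sha256).digest()
    return tag + payload

def verify_notice(secret: bytes, record: bytes):
    """Return (channel, kind) if the tag checks out, else None."""
    tag, payload = record[:TAG_BYTES], record[TAG_BYTES:]
    expected = hmac.new(secret, payload, hashlib.sha256).digest()
    # Constant-time comparison; forged or altered notices are dropped.
    if not hmac.compare_digest(tag, expected):
        return None
    channel, kind = payload.decode("utf-8").split("\n", 1)
    return channel, kind

secret = b"shared-between-author-and-server"  # placeholder secret
record = sign_notice(secret, "intertwingly.net/blog", "comment")
forged = record[:-1] + b"X"  # same tag, altered payload
```

A relay that calls `verify_notice` and gets `None` back never forwards the notice, so clients are not sent off to hammer the data provider.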
Posted by Dan
Hey, I get 20,000 requests per day on my various feeds, so I'm sympathetic to the problem.
OK, so let's talk about this "change stream" that I'm "monitoring". How would this work? Would it involve a central registry like blo.gs or weblogs.com, but then let clients sign up to receive active pings from the registry when sites they care about change?
Posted by Mark
Re: pummelling into oblivion. Considering that early versions of Zoe requested RSS feeds every 3 minutes (and the author did not consider this a bug), and early versions of Syndirella requested feeds every 10 seconds (but only if the server returned a 404), I can't imagine that malicious weenies could devise a DDoS attack that would be any worse than the current generation of standalone RSS readers behaving normally.
Posted by Mark
The way I'm picturing it is that you, as a blog author, have a server that you notify about changes. Presumably you have some relationship with it and there's some level of trust between you and it. (Which I realize isn't needed with weblogs.com and blo.gs, but I worry about the abuse potential.) When a change happens, your blogging software makes a connection and sends a message to it, noting that your channel has had a change. The message encodes what changed (new post, comment, trackback, whatever).
That server is connected to a number of other servers on the net, much in the same way that news servers are connected (hopefully modulo the crap involved in news, though INN's feed system doesn't suck too badly. Relatively speaking). The notification of change goes from your server to all the servers it talks to that have subscribed to your channel, or just get all notifications.
Meanwhile, you, as a client, have connected to your local server in this distribution network, and told it what channels you're interested in. At some point the blog change notice makes it to the local server, which then passes it on to you, the client, potentially along with other change notices from other channels you're watching. If you're not connected the server will batch up the pending notices to the channels you're subscribed to, either in a real batch or just a "you have channels X, Y, Z, and asked for all notices since time Q" thing. Dunno which, or maybe both, though there are storage issues there potentially.
Nothing really fancy, and I want the messages to be as small as possible. (Though no smaller) There's more stuff involved, public keys of some sort for authentication and verification, but that's about it. Small. As simple as possible, and dodging all the content distribution issues, because that's an IP minefield.
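The flow described above can be sketched as a toy in-memory model in Python. Everything here is an assumption made for illustration: the class and method names, the notice id used to stop loops between servers, and the choice to forward every notice to every peer (the "just get all notifications" case). Authentication is left out entirely.

```python
from collections import defaultdict

class RelayServer:
    """Toy model of one server in the notification network."""

    def __init__(self, name):
        self.name = name
        self.peers = []                      # servers this one forwards to
        self.subscribers = defaultdict(set)  # channel -> client ids
        self.online = set()                  # clients currently connected
        self.delivered = defaultdict(list)   # client id -> notices pushed
        self.pending = defaultdict(list)     # client id -> notices batched
        self.seen = set()                    # notice ids already handled

    def subscribe(self, client, channel):
        self.subscribers[channel].add(client)

    def connect(self, client):
        """Mark the client online and hand over anything batched up."""
        self.online.add(client)
        batch, self.pending[client] = self.pending[client], []
        self.delivered[client].extend(batch)
        return batch

    def disconnect(self, client):
        self.online.discard(client)

    def notify(self, notice_id, channel, kind):
        """Accept a change notice, pass it to clients, then to peers."""
        if notice_id in self.seen:
            return  # already handled; prevents endless flooding
        self.seen.add(notice_id)
        for client in self.subscribers[channel]:
            queue = self.delivered if client in self.online else self.pending
            queue[client].append((channel, kind))
        for peer in self.peers:
            peer.notify(notice_id, channel, kind)

origin, local = RelayServer("origin"), RelayServer("local")
origin.peers.append(local)
local.peers.append(origin)  # deliberate loop to exercise the dedup check
local.subscribe("dan", "intertwingly.net/blog")
origin.notify("n1", "intertwingly.net/blog", "comment")    # dan is offline
batch = local.connect("dan")                               # batched notice
origin.notify("n2", "intertwingly.net/blog", "trackback")  # pushed live
```

The notices carry only the channel and the kind of change, so the client still fetches the content from the source, and the servers never hold any of it.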
Posted by Dan
Much of what Dan/Mark are talking about can be built today using my weblog watcher service
http://pocketsoap.com/weblog/stories/2002/02/0010.html
This is what blogToaster is built on top of. It currently gets updates from weblogs.com (by polling changes.xml) and blo.gs (pushed from blo.gs by their cloud service).
Posted by Simon Fell
Cool. I worry about scaling and robustness issues, so I'm pretty sure that a single central server won't work in the long term (not to mention the sheer amount of data and subscribers it'd have to deal with--Technorati claims 200K+ blogs watched, and if each of those only has 25 watchers on average, that's a lot of traffic on a central system).
Posted by Dan
Ok, this must be a common problem. I know I left a comment on someone's weblog a few days ago, but I can't remember where it was. This happens a lot, where I have to fill my brain with reminders to go back and check sites I've commented on for...
Excerpt from Keith's Weblog
Another option is a middle ground between push and pull using P2P. I think Bittorrent and Jabber are in the ballpark.
I've looked at the guts of Bittorrent and its control channel is similar to NNTP (a flooding network), except that the network is created ad-hoc among interested clients, and the server acts mostly as an "always-on" client. Bittorrent uses data channels to do larger data transfers directly between clients, but the data is also distributed in a flooding style.
Posted by Ken MacLeod
Dan Sugalski vented his disgust with the current distribution mechanism for weblogs. I can sympathize with this. I'm getting a ton of hits that result in 304's. Between that and aggregators that don't do ETag checking or at least...
Trackback from Ted Leung on the air
Dan Sugalski vented his disgust with the current distribution mechanism for weblogs. I can sympathize with this. I'm getting a ton of hits that result in 304's. Between that and aggregators that don't do ETag checking or at least accept gzip...
Excerpt from Ted Leung on the air
A little of everything. I mean, why not? I'm up anyway....
Trackback from dive into mark
I've since added some code to verify that each of the pages in the cache of pages served in the past 24 hours are well formed and valid XHTML. This uncovered an interesting boundary case that I hadn't considered. Last week, I created a nightly job to...
Trackback from Sam Ruby