Should an aggregator honor the robots.txt file? I'd say yes. NewsMonster doesn't - http://diveintomark.org/archives/2003/02/20/robotstxt_support_for_uberaggregators.html
Should aggregators be grabbing content other than what's in a feed? NewsMonster (like MsIE) can be set to grab full content pages, with horrifying frequency. If there are some cases where it makes sense, can robots.txt communicate the site owner's wishes? I think not, because it's not possible (I think) to block a UserAgent from all of the site except rss.xml (or whatever URL is used for a feed).
Blocking certain files, or all but certain files, for a given useragent is fairly easy using mod_rewrite. [TomasJogin]
I think it would be a good idea for all WebRobot-s, including Aggregators, to include some standard string like "robot" in their UserAgent. This could reduce the maintenance headache for robots.txt. Or maybe we should pick one label for aggregators to use (defining an aggregator as a tool which grabs only feeds) and another label for non-aggregator robots?
[MartinAtkins : RefactorOk] I'd argue that an aggregator lives two separate lives. When it's sitting in the background periodically polling feeds for updates, it's a robot. However, when the user is interactively using it, either by forcing it to update a feed "right now" or browsing the items and causing the permalinks referenced to be retrieved, it's not a robot, it's a web browser. A web browser with an unusual interface, but a web browser nonetheless. Thus, aggregators should honor robots.txt when they are being robots, but not when they are being web browsers.
[JamesAylett RefactorOk] I agree. Given that you're advised to be 'liberal' in your parsing of the User-agent lines of robots.txt, it would be useful for all aggregators to match 'aggregator' as well as their own user agent (less version numbers, etc. etc.). It's not terribly useful for something to advertise in HTTP as "MyUserAgent/1.0 (aggregator)", but to recognise "User-agent: aggregator" in robots.txt is a big win. And you can still block NewsMonster (or whatever) across the board, or for non-RSS/non-!Atom or whatever.
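A minimal sketch of the matching suggested above, using Python's standard urllib.robotparser. The name "ExampleReader" and the function may_poll are hypothetical, and the generic "aggregator" token is the proposal from this discussion, not an established convention. The sketch takes the conservative reading: a URL is off limits if either the aggregator's own name or the generic token is disallowed.

```python
from urllib.robotparser import RobotFileParser


def may_poll(robots_txt, own_name, url, generic_token="aggregator"):
    """Return True only if neither the aggregator's own name nor the
    generic token (an assumption from this discussion) is disallowed."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(own_name, url) and parser.can_fetch(generic_token, url)


# Hypothetical robots.txt that keeps all aggregators out of the archives.
ROBOTS_TXT = """\
User-agent: aggregator
Disallow: /archives/
"""

feed_allowed = may_poll(ROBOTS_TXT, "ExampleReader", "/index.rss")
archive_allowed = may_poll(ROBOTS_TXT, "ExampleReader", "/archives/2003/02/20/")
```

A site owner who wants the specific entry to override the generic one would need a different rule; that choice is left open here.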
[DavidJanes, RefactorOk] As a superaggregator, such as BlogMatrix, I'm looking at feeds for tens of thousands of blogs. Having to look at -- and for -- a robots.txt file can be a nasty task. Furthermore, I'm doing the work on behalf of many clients, so the issues are not so clear cut as aggregator/user. Surely the feed itself could define frequency and permissions? ([JamesAylett RefactorOk]: Discussion of this below)
[JamesAylett RefactorOk] I'd say the case for a superaggregator is even stronger. Looking for a robots.txt file isn't that big a deal, because they tend to be highly cacheable (if max-age and expires aren't present, you can probably get away with a heuristic that gives it a fairly high expiry, although there'll probably be a Last-Modified in the response, restricting your freedom in this a little).
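A rough sketch of how a superaggregator might decide how long to keep a cached robots.txt, following the order described above: explicit max-age, then Expires, then a heuristic based on Last-Modified. The one-tenth fraction is the commonly cited HTTP caching heuristic, and the one-day fallback and the function name are assumptions made for illustration.

```python
import re
from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

# Assumed fallback when the response carries no caching information at all.
DEFAULT_LIFETIME = timedelta(days=1)


def freshness_lifetime(headers, now):
    """Estimate how long a fetched robots.txt may be reused.

    headers: dict of response header names to values.
    now: timezone-aware datetime of the fetch.
    """
    match = re.search(r"max-age=(\d+)", headers.get("Cache-Control", ""))
    if match:
        return timedelta(seconds=int(match.group(1)))
    if "Expires" in headers:
        expires = parsedate_to_datetime(headers["Expires"])
        return max(expires - now, timedelta(0))
    if "Last-Modified" in headers:
        # Heuristic: a file untouched for a long time is unlikely to change soon.
        last_modified = parsedate_to_datetime(headers["Last-Modified"])
        return max((now - last_modified) / 10, timedelta(0))
    return DEFAULT_LIFETIME


fetched_at = datetime(2003, 8, 8, 12, 0, 0, tzinfo=timezone.utc)
lifetime = freshness_lifetime(
    {"Last-Modified": "Tue, 29 Jul 2003 12:00:00 GMT"}, fetched_at
)
```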
As an additional suggestion: why not define (or "best practice") a common User-Agent string token for all aggregators? This would allow for "active" blocking on the part of users without depending on the good will or good sense of aggregators.
[AsbjornUlsberg] The USER_AGENT string of Mozilla Firebird 0.6.1 on my Windows XP box is the following:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
An Echo-aggregator can build its user agent string based on the same principles. The principles are listed in the official Mozilla user-agent string specification. Instead of reading RFC 1945 (HTTP 1.0) and RFC 2068 (HTTP 1.1), I think we should just follow in Mozilla's footsteps. This might give us a UA-string like this:
Echo/1.0 (Windows; N; Windows NT 5.1; en-US) EchoAPI/20030808 EchoEditor/1.5 (ExampleOrganization, Inc.)
Just an example, but it's fairly straightforward and easy to interpret. It's also well thought out, so we don't need to do that job if we don't want to. :-)
Also, I agree that all aggregators should pay attention to the robots.txt file. Not doing so is bad web practice.
Frequency of update hints
Should a site owner have some way to set a maximum frequency for having its feed polled by a given user's aggregator? - http://scriptingnews.userland.com/2003/07/09#onceAnHourPlease
[DeveloperDude] This whole pursuit (the pursuit formerly known as Echo, Pie and not Echo) started as a set of best practices for RSS and evolved into a new standard. Vendors will have to implement what works. What works is the best practices. If they don't implement what works, then users will move to vendors who implement what works. As such, it's not really conceivable to move a set of best practices forward until we get practical experience with the result of this process, until we find out what works and what doesn't work. MHO.
[PhilWolff] So you want to manage inbound traffic? Let the feed speak to the aggregator:
If you check more often than this, I'll ban your IP address for a while.
This feed averages updates about n times per week.
Don't bother checking for updates on weekends or after 8pm.
Here's the list of urls where you can find mirrors of this feed. I mirror within n seconds of update.
I ping upon update here and there
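To make the hints above concrete, here is a sketch of how an aggregator might act on them once it has read them from a feed. No such hint format exists yet; the PollingHints fields and the function may_poll_now are hypothetical names chosen for illustration, covering only the minimum interval, weekend, and evening hints.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class PollingHints:
    """Hypothetical hints a feed could publish about polling."""
    min_interval: timedelta                 # "if you check more often than this..."
    skip_weekends: bool = False             # "don't bother checking on weekends"
    quiet_after_hour: Optional[int] = None  # "...or after 8pm" would be 20


def may_poll_now(hints, last_poll, now):
    """Return True if polling at `now` respects every published hint."""
    if now - last_poll < hints.min_interval:
        return False
    if hints.skip_weekends and now.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return False
    if hints.quiet_after_hour is not None and now.hour >= hints.quiet_after_hour:
        return False
    return True


hints = PollingHints(timedelta(hours=1), skip_weekends=True, quiet_after_hour=20)
last_poll = datetime(2003, 7, 9, 10, 0)  # a Wednesday
too_soon = may_poll_now(hints, last_poll, datetime(2003, 7, 9, 10, 30))
allowed = may_poll_now(hints, last_poll, datetime(2003, 7, 9, 11, 30))
```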
[FrançoisGranger] [RefactorOk] This concern applies to robot spiders as well as to aggregators. And, since the ping feature exists and there are ping hosts available, why do the robots and aggregators avoid using them? Some time ago I wrote an entry about this in my weblog:
"There is an increasing number of robots. This results in some excessive behaviours.
In my logs, there is an IP* address, not recognized as a robot, which has read 2618 pages this month. And the total pages seen by the 20 robots which visited is 1648. With a total of 5469 pages seen for the month, the 1648 + 2618 = 4266 pages represent 78% of the total pages seen. I am really honored by so much attention, but this is absolutely unneeded. I publish no more than 20 to 30 stories over the month. And my software pings blo.gs and weblogs.com for each new story. It would be better to develop this ping from blog to some servers than the spidering by robots. This would be more efficient."
[JamesAylett RefactorOk] Isn't this due to a mismatch between the way HTTP works, and the idea of pinging as a push-publish mechanism? In which case I'd argue that you should just shut off the parts of your site to robots that you push update via blo.gs or whatever. This doesn't deal with robots that don't honour robots.txt, but a group-maintained list of offenders would make it fairly easy to block them. (A cronjob could update a list of apache directives for mod_rewrite or similar ...)
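A sketch of the cron job idea: a small script that turns a shared list of offending user agents into mod_rewrite directives answering 403 Forbidden. The agent names are made up, the function name is hypothetical, and the sketch assumes names without spaces or quotes; fetching the group-maintained list and reloading Apache are left out.

```python
import re


def rewrite_rules(blocked_agents):
    """Return mod_rewrite directives that refuse the listed user agents."""
    if not blocked_agents:
        return ""
    lines = ["RewriteEngine On"]
    last = len(blocked_agents) - 1
    for index, agent in enumerate(blocked_agents):
        # Conditions are OR-ed together; the final one carries no OR flag.
        flags = "[NC]" if index == last else "[NC,OR]"
        pattern = re.escape(agent)
        lines.append(f'RewriteCond %{{HTTP_USER_AGENT}} "{pattern}" {flags}')
    lines.append("RewriteRule .* - [F]")
    return "\n".join(lines) + "\n"


# Hypothetical offenders taken from a shared list.
rules = rewrite_rules(["ExampleBadBot", "HypotheticalSpider"])
```

The result could be written to a file that the main Apache configuration includes.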
[AsbjornUlsberg] Just as an idea: What about doing HTTP PUSH by default and rather doing PULL only in special cases? That way, every aggregator needs a subscription list, but that's not a hard nut to crack. The advantages of PUSH over PULL are huge:
Aggregators (server-based, not client) will get the latest version of an entry or feed when it's ready. They don't have to pull for updates every n minutes.
The load of the feeding servers will go drastically down.
The feeding servers will only need to send updated data, e.g. the amount of data sent will decrease dramatically.
An author will trigger the update of his/her entry, and it will spread throughout the Echo-community without much delay. It will be updated everywhere when the author presses [Publish] and not when aggregator X decides it's time to patter over to the feeding server and get any updates.
To get old entries, you'd have to do a PULL. Or, you'd at least initiate the PUSH with a PULL. Such "all entries" update-queries can be placed in a queue which the feeding server decides when to handle. So the consuming server will have to wait for the entries until the feeding server has enough CPU or bandwidth to do it. This way, control of the entries is given to the originating server, and not to the servers wanting the entries.
I think HTTP PUSH is the definitive way to update feeds around the world.
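The push model above might be sketched like this on the feeding server. Everything here is hypothetical: the class and method names are invented, the callback URL is an example, and deliveries are returned as data instead of being sent over HTTP. It shows only the two ideas from the proposal: new entries go to all subscribers at publish time, and "all entries" requests wait in a queue that the feeding server drains when it chooses.

```python
from collections import deque


class FeedPublisher:
    """In-memory sketch of a push-style feeding server."""

    def __init__(self):
        self.entries = []
        self.subscribers = []
        self.backfill_queue = deque()

    def subscribe(self, callback_url, wants_backfill=False):
        self.subscribers.append(callback_url)
        if wants_backfill:
            # Old entries are not sent right away; the request is queued.
            self.backfill_queue.append(callback_url)

    def publish(self, entry):
        """Store the entry and return one delivery per subscriber."""
        self.entries.append(entry)
        return [(url, [entry]) for url in self.subscribers]

    def serve_one_backfill(self):
        """Handle one queued 'all entries' request when resources allow."""
        if not self.backfill_queue:
            return None
        url = self.backfill_queue.popleft()
        return (url, list(self.entries))


publisher = FeedPublisher()
publisher.subscribe("http://aggregator.example/inbox", wants_backfill=True)
deliveries = publisher.publish("entry-1")
backfill = publisher.serve_one_backfill()
```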
Transferring only new/changed data
One or more of the following can be used.
[JamesAylett RefactorOk] RFC 3229 SHOULD be implemented by both aggregators and feed producers. There's very little downside to doing this (except for broken implementations). This should be rolled in with Mark's strong suggestions (linked above) to support use of If-Modified-Since and If-None-Match. Ultimately I think we should be producing a pair of advisory documents, one for aggregators and one for producers, that give best practices, and would be published alongside the XML feed definition, any APIs, and so forth.
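A sketch of the aggregator side of this advice: build the conditional request headers from whatever validators were cached, optionally ask for a delta with the A-IM header from RFC 3229, and classify the response by status code. The function names are hypothetical, and which delta encodings to request (vcdiff here, purely as an example) is an assumption that depends on what the client can actually decode.

```python
def conditional_headers(etag=None, last_modified=None, delta_encodings=()):
    """Build request headers for a conditional, optionally delta-aware GET."""
    headers = {}
    if etag:
        headers["If-None-Match"] = etag
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    # A delta is computed against a known instance, so A-IM is only sent
    # together with the entity tag of the cached copy.
    if etag and delta_encodings:
        headers["A-IM"] = ", ".join(delta_encodings)
    return headers


def classify_response(status):
    """Map an HTTP status code to what the aggregator should do next."""
    if status == 304:
        return "unchanged"  # keep the cached feed
    if status == 226:
        return "delta"      # IM Used: apply the delta to the cached feed
    if status == 200:
        return "full"       # replace the cached feed
    return "error"


request_headers = conditional_headers(
    etag='"abc123"',
    last_modified="Fri, 08 Aug 2003 12:00:00 GMT",
    delta_encodings=("vcdiff",),
)
```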
Use existing HTTP methods
[DavidJanes RefactorOk] You'd have to create a delta each time you generate a new XML (or whatever document) and keep a day or two of the deltas around. This is more or less how CVS does it (except they keep all the deltas). In fact, it's probably even easier than this because the flow of changes is so well understood -- new entries pop in at the top.
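Because new entries pop in at the top, the delta can be sketched without a general diff. The code below assumes entries are kept newest first and that each has a unique id; the dictionary layout and the function name are hypothetical. If the client's last seen id is no longer in the list, the sketch falls back to sending everything.

```python
def entries_since(entries, last_seen_id):
    """Return the entries that appeared above the client's last seen entry.

    entries: list of dicts with an "id" key, newest first (an assumption).
    """
    fresh = []
    for entry in entries:
        if entry["id"] == last_seen_id:
            return fresh
        fresh.append(entry)
    # Unknown or expired id: the client needs the full list.
    return list(entries)


feed = [{"id": "e3"}, {"id": "e2"}, {"id": "e1"}]
delta = entries_since(feed, "e1")
```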