[BillSeitz] [RefactorOk] How much belongs in the "spec" vs perhaps some "suggested" "best practices" or something (maybe that belongs in the [SocialSoftwareAlliance])?

See [WWW]Atom aggregator behavior (HTTP level) and [WWW]Aggregator HTTP tests (test cases).


Should an aggregator honor the robots.txt file? I'd say yes. NewsMonster doesn't.

Should aggregators be grabbing content other than what's in a feed? NewsMonster (like MsIE) can be set to grab full content pages, with horrifying frequency. If there are some cases where it makes sense, can robots.txt communicate the site owner's wishes? I think not, because it's not possible (I think) to block a UserAgent from all of the site except rss.xml (or whatever URL is used for a feed).

I think it would be a good idea for all WebRobot-s, including Aggregators, to include some standard string like "robot" in their UserAgent. This could reduce the maintenance headache for robots.txt. Or maybe we should pick one label for aggregators to use (defining an aggregator as a tool which grabs only feeds) and another label for non-aggregator robots?
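For illustration, blocking a robot from everything except the feed would require the Allow directive, which is a later extension and not part of the original robots.txt spec, so support varies (the robot name here is hypothetical):

```
User-agent: ExampleAggregatorBot
Allow: /rss.xml
Disallow: /
```

Robots that only implement the original spec will see the Disallow and skip the feed too, which is why the concern above is well founded.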

[MartinAtkins : RefactorOk] I'd argue that an aggregator lives two separate lives. When it's sitting in the background periodically polling feeds for updates, it's a robot. However, when the user is interactively using it, either by forcing it to update a feed "right now" or browsing the items and causing the permalinks referenced to be retrieved, it's not a robot, it's a web browser. A web browser with an unusual interface, but a web browser nonetheless. Thus, aggregators should honor robots.txt when they are being robots, but not when they are being web browsers.
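Martin's rule could be sketched like this, using Python's standard urllib.robotparser; the function name, robots.txt contents, and URLs are purely illustrative:

```python
from urllib.robotparser import RobotFileParser

# Example robots.txt: this site blocks all robots entirely.
ROBOTS_TXT = [
    "User-agent: *",
    "Disallow: /",
]

def may_fetch(url, user_agent, interactive):
    """Apply robots.txt only when acting as a robot (background polling)."""
    if interactive:
        # User-initiated fetch: the aggregator is acting as a web browser.
        return True
    rp = RobotFileParser()
    rp.parse(ROBOTS_TXT)
    return rp.can_fetch(user_agent, url)
```

With this rule, a background poll of the feed is refused by robots.txt, but the same fetch succeeds when the user explicitly asks for it.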

[DavidJanes, RefactorOk] If you're a superaggregator, such as [WWW]BlogMatrix, you're looking at feeds for tens of thousands of blogs. Having to look at -- and for -- a robots.txt file can be a nasty task. Furthermore, the work is done on behalf of many clients, so the issues are not so clear cut as aggregator/user. Surely the feed itself could define frequency and permissions? ([JamesAylett RefactorOk]: Discussion of this below)

As an additional suggestion: why not define (or "best practice") a common User-Agent string token for all aggregators? This would allow for "active" blocking on the part of users without depending on the good will or good sense of aggregators.

[AsbjornUlsberg] The USER_AGENT string of Mozilla Firebird 0.6.1 on my Windows XP box is the following:

Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.5a) Gecko/20030728 Mozilla Firebird/0.6.1
An Echo aggregator can build its user agent string up based on the same principles. The principles are listed in [WWW]the official Mozilla user-agent string specification. Instead of reading [WWW]RFC 1945 (HTTP 1.0) and [WWW]RFC 2068 (HTTP 1.1), I think we should just follow in Mozilla's footsteps. This might give us a UA-string like this:
Echo/1.0 (Windows; N; Windows NT 5.1; en-US) EchoAPI/20030808 EchoEditor/1.5 (ExampleOrganization, Inc.)
Just an example, but it's fairly straight-forward and easy to interpret. It's also well thought out, so we don't need to do that job if we don't want to. :-)
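As a quick sketch of how such a string might be assembled (the product names and tokens below are the made-up examples from above, not a proposed API):

```python
def build_ua(product, version, platform_tokens, locale, extras=()):
    """Assemble a Mozilla-style user agent string:
    product/version, parenthesised platform details, then extra product tokens."""
    comment = "; ".join(list(platform_tokens) + [locale])
    parts = ["%s/%s (%s)" % (product, version, comment)]
    parts.extend(extras)
    return " ".join(parts)

ua = build_ua("Echo", "1.0", ["Windows", "N", "Windows NT 5.1"], "en-US",
              ["EchoAPI/20030808", "EchoEditor/1.5 (ExampleOrganization, Inc.)"])
# "Echo/1.0 (Windows; N; Windows NT 5.1; en-US) EchoAPI/20030808 EchoEditor/1.5 (ExampleOrganization, Inc.)"
```

Adding a common "robot" or "aggregator" token to the extras would cover the blocking suggestion above as well.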

Also, I agree that all aggregators should pay attention to the robots.txt file. Not doing so is bad web practice.

Frequency of update hints

Should a site owner have some way to set a maximum frequency for having its feed polled by a given user's aggregator?
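For comparison, RSS already has hints along these lines: RSS 2.0's ttl element and the RSS 1.0 syndication module. An Atom equivalent could work similarly; the snippets below just show the existing RSS conventions:

```xml
<!-- RSS 2.0: minimum number of minutes a feed may be cached -->
<ttl>60</ttl>

<!-- RSS 1.0 syndication module: updated twice per hour -->
<sy:updatePeriod>hourly</sy:updatePeriod>
<sy:updateFrequency>2</sy:updateFrequency>
```

Of course, these are only hints; nothing forces a badly behaved aggregator to honor them, which is the same enforcement problem as robots.txt.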

see also

[DeveloperDude] This whole pursuit (the pursuit formerly known as Echo, Pie and not Echo) started as a best practices for RSS and evolved into a new standard. Vendors will have to implement what works. What works is the best practices. If they don't implement what works, then users will move to vendors who implement what works. As such, it's not really conceivable to move a best practices forward until we get practical experience with the result of this process. Until we find out what works and what doesn't work. MHO.

[PhilWolff] So you want to manage inbound traffic? Let the feed speak to the aggregator:

[FrançoisGranger] [RefactorOk] This concern applies to robot spiders as well as to aggregators. And, since the ping feature exists and there are ping hosts available, why do the robots and aggregators avoid using them? Some time ago I wrote an [WWW]entry about this in my weblog:

[JamesAylett RefactorOk] Isn't this due to a mismatch between the way HTTP works and the idea of pinging as a push-publish mechanism? In which case I'd argue that you should just shut off to robots the parts of your site that you push updates via, or whatever. This doesn't deal with robots that don't honour robots.txt, but a group-maintained list of offenders would make it fairly easy to block them. (A cronjob could update a list of apache directives for mod_rewrite or similar ...)
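A sketch of that blocking idea using Apache's mod_setenvif; the robot names are hypothetical, and the SetEnvIf lines are exactly what a cronjob could regenerate from a shared offender list:

```
# Deny known robots.txt-ignoring aggregators by User-Agent substring
SetEnvIf User-Agent "BadAggregator" bad_robot
SetEnvIf User-Agent "RudeBot"       bad_robot

<Location "/">
    Order Allow,Deny
    Allow from all
    Deny from env=bad_robot
</Location>
```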

[AsbjornUlsberg] Just as an idea: what about doing HTTP PUSH by default and doing PULL only in special cases? That way, every aggregator needs a subscription list, but that's not a hard nut to crack. The advantages of PUSH over PULL are huge:

To get old entries, you'd have to do a PULL. Or, you'd at least initiate the PUSH with a PULL. Such "all entries" update-queries can be placed in a queue which the feeding server decides when to handle. So the consuming server will have to wait for the entries until the feeding server has enough CPU or bandwidth to do it. This way, control of the entries is given to the originating server, and not to the servers wanting the entries.

I think HTTP PUSH is the definitive way to update feeds around the world.
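A toy sketch of the queueing idea described above: the feed's server accepts backfill requests but drains the queue only when it chooses to, so it stays in control of the work. All class and method names here are illustrative, not a proposed protocol:

```python
from collections import deque

class FeedServer:
    def __init__(self, entries):
        self.entries = entries          # full entry archive
        self.backfill_queue = deque()   # pending "all entries" requests

    def subscribe_backfill(self, subscriber):
        # PULL-initiated request: queued, not answered immediately.
        self.backfill_queue.append(subscriber)

    def drain(self, budget):
        # Called when the server has spare CPU/bandwidth;
        # PUSH entries to at most `budget` waiting subscribers.
        while self.backfill_queue and budget > 0:
            subscriber = self.backfill_queue.popleft()
            subscriber.receive(self.entries)
            budget -= 1

class Subscriber:
    def __init__(self):
        self.got = None
    def receive(self, entries):
        self.got = list(entries)

server = FeedServer(["entry-1", "entry-2"])
sub = Subscriber()
server.subscribe_backfill(sub)
server.drain(budget=1)   # the feeding server decides when this happens
```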

Transferring only new/changed data

One or more of the following can be used.

Use existing HTTP methods

[DavidJanes, RefactorOk] Perhaps you should have a look at [WWW]RFC 3229, which defines sending deltas of content.
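Roughly, an RFC 3229 exchange layers on top of ordinary conditional GET. The headers below are a sketch; the "feed" instance-manipulation shown was a later proposal built on the RFC, not part of RFC 3229 itself, and the ETag values are invented for illustration:

```
GET /atom.xml HTTP/1.1
Host: example.org
A-IM: feed
If-None-Match: "abc123"

HTTP/1.1 226 IM Used
ETag: "def456"
IM: feed

...only the entries added since the "abc123" instance...
```

If the server doesn't support deltas, it simply ignores A-IM and answers with a full 200 response (or a 304 if nothing changed), so the scheme degrades gracefully.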

Extend HTTP

Moved to PossibleHTTPExtensionForEfficientFeedTransfer; the consensus was to focus on an AggregatorApi in preference.

Have a separate API over HTTP

Discussion of extending the AtomApi to support bandwidth-friendly aggregation moved to AggregatorApi.