One of the major topics du jour is spam comments. There are those who are wildly optimistic and others who are wildly pessimistic... (I tend to the optimistic side myself: the trick is to keep the cost/benefit ratio in your favor.)
This turns out to be rather timely, given that I was just hit by
143 spams from a single individual over a thirteen hour
period. By all indications this was not automated.
The removal, however, was. All it took to wipe all these comments out was a single command (which I had to issue twice: once before I flew out to ApacheCon, and once after I landed, to clear out the ones created while I was in flight). This is not much trouble for me, but it does tend to get noticed by people who are subscribed to my comments feed.
So... I've implemented a throttle. The code is straightforward, but the policy is difficult to put into words. Suffice it to say that no one can put in three consecutive comments within the period of a day or put in three comments total within a five minute period.
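As an illustration only, here is a minimal Python sketch of the second half of that policy (no three comments within five minutes). It is not the actual throttle code; the class name `CommentThrottle`, the `source` key (an IP address, a name, whatever identifies a commenter), and the choice to leave refused attempts out of the count are all assumptions made for the example.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 5 * 60  # the five minute period from the policy
MAX_IN_WINDOW = 2        # a third comment inside the window is refused


class CommentThrottle:
    """Hypothetical sliding-window throttle, keyed by commenter."""

    def __init__(self):
        # source -> timestamps (in seconds) of recently accepted comments
        self._recent = defaultdict(deque)

    def allow(self, source, now):
        """Return True and record the comment if it may be posted at time `now`."""
        stamps = self._recent[source]
        # Forget comments that have aged out of the window.
        while stamps and now - stamps[0] >= WINDOW_SECONDS:
            stamps.popleft()
        if len(stamps) >= MAX_IN_WINDOW:
            return False  # assumption: refused attempts are not counted
        stamps.append(now)
        return True
```

Usage would look like `throttle.allow("192.0.2.1", time.time())` in the comment handler, with the comment rejected whenever the call returns False.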
RE: Comment Throttle
What do you consider consecutive comments?
PS: Now that I know I'll be at XML 2003 we should have a planned hanging out session.
If I am reading the code correctly (and it's entirely possible that I am not at 3 AM), this is IP-based. If so, it can be defeated with rotating HTTP proxies.
To put it more concretely, no comment will be accepted if it would result in three of the last four comments in this view having the same title (viewable as hover text) on the "by" or "from" link.
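A rough Python sketch of that rule, for the curious. The function name `would_accept` and the idea of passing the hover-text titles as a plain list are assumptions for illustration; this is not the code that runs here.

```python
def would_accept(recent_titles, new_title):
    """Check the "three of the last four" rule for one incoming comment.

    recent_titles: hover-text titles of the comments already in this view,
                   oldest first (hypothetical representation).
    new_title:     the title the incoming comment would carry.
    """
    # The last four comments as they would stand if the new one were accepted.
    last_four = (list(recent_titles) + [new_title])[-4:]
    # Only the newcomer's title is counted: if the rule has been enforced all
    # along, no other title can already hold three of these four slots.
    return last_four.count(new_title) < 3
```

For example, with existing titles `["a", "b", "a"]`, a further comment titled `"a"` is refused, while one titled `"c"` goes through.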
Wouldn't this headache be solved by not putting hyperlinks in comments (eg. you see the commenter's name and their URL next to it rather than being linked)? Would there be a point to spam with plaintext URLs?
Confused: that also wouldn't solve the problem. Spammers have it in their heads now that weblog comments are a vector to exploit. They don't look at individual results and tweak their software to stop bothering individuals. They write generic software that works with millions of sites and goes after them en masse. So you would end up with just as much spam, it would just be displayed with unlinked URLs.
Spammers don't read blogs; they just write to them.
Spammers don't read blogs; they just write to them.
Which is the only reason they can be defeated. A few simple steps, and I have been delightfully spam-free.
Unlike SMTP, or NNTP, the "protocol" for comment submission can be varied in numerous ways. With enough variations deployed, writing a general-purpose 'bot can be made infeasible.
The posts varied from one to three minutes apart. Some were simplistic responses of an Eliza quality, but another specifically cited Dare by name (Dare is not a common name; in fact, it is a common English verb). The response to my Atkins post was as follows:
The atkins diet certainly works. The 2 women I know that each read the book. both felt better and lost weight - not that that is a scientific study...
The user agent was IE.
Not conclusive, but it certainly does not appear to me to have been automated.
A trivial test: did the "human" download your CSS stylesheet? Robots generally don't bother. (I know it can be cached; you may need to look back in your logs.)
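A sketch of how that check might be run over an access log, in Python. It assumes the server writes Common Log Format lines; the function name and the `css_path` and `comment_prefix` parameters are made up for the example and would need to match the real site.

```python
import re

# Client host, then the method and path from the quoted request line, then
# the status code, as laid out in Common Log Format (assumed log format).
LOG_LINE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[[^\]]*\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" (?P<status>\d{3})'
)


def posters_without_stylesheet(log_lines, css_path, comment_prefix):
    """Return hosts that POSTed a comment but never fetched the stylesheet."""
    fetched_css = set()
    posted = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match is None:
            continue  # skip lines that are not in the expected format
        host, method, path = match.group("host", "method", "path")
        if method == "GET" and path == css_path:
            fetched_css.add(host)
        elif method == "POST" and path.startswith(comment_prefix):
            posted.add(host)
    return posted - fetched_css
```

A host that shows up in the result is only a suspect: a cached stylesheet produces the same pattern, so the log window has to reach back far enough.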
What was the REFERER on the atkins diet post? As (some of?) these "crawler" spambots seem to come in via links from other blogs, or via google searches on some keyword, I would not be surprised if the comment was vaguely on-topic.
All the 'bots I have seen claim to be IE, so that means nothing.
The fact that the posts were spaced from 1 to 3 minutes apart makes it more likely that it was a spambot than a human.
A human would be cutting and pasting into your comment-entry form, and would be trying to get through the process as quickly as possible. A 'bot would be hitting hundreds of different weblogs simultaneously, and would prefer to space out its HTTP requests, so as not to set off any alarm bells. (Look at how the better search-engine crawlers behave.)
I can't prove you were hit by a 'bot. But, from everything you've said, it's far more likely than not.
I don't keep my logs that far back, but from memory, the initial referrer was a google query, and the favicon.ico and blog.css were downloaded. Subsequent posts included previews, some pages were visited without leaving a comment, etc.
While I no longer have the logs, I do have the actual spams.
If it was a human, then they were working very inefficiently. If they are going to post manually and waste all that time while doing so, they don't have a bright future in the comment-spam 'biz.
If it was a 'bot, then it went to extraordinary lengths to act "human-like" (downloading your favicon.ico file !?).
I wonder why.
Could it be something really stupid, like a "spambot" written in VBScript, driving IE?
Very interesting ideas forthcoming in the Blog anti-spam debate
I have a couple of vested interests in eradicating SPAM from my blog and from the rest of the Blogosphere. There are some interesting discussions (and disagreements) brewing in the various listservs and dev-blogs that I regularly visit or subscribe ...
Don Box: Comment spam has gone from a curiosity to an irritant to an amusement of mine. Why an amusement? It is fun seeing greedy spammers who can't limit...
Sam Ruby has some stuff on comment spam. I've written before about weblog comment spam and why I don't think it will be a long term problem. Sam's comment throttling is an example of how we have so many more approaches to deal with weblog comment...
This is a trial balloon. What I am trying to explore is what would happen if I were to convert the act of posting a comment into request/response interaction. I would very much like to do this in a way that does not significantly inhibit the sponte...
OK, an initial implementation of my preview required functionality is complete. Other than requiring a preview, most of you should not see any different behavior. I've also relaxed my spam throttle to allow three comments - this allows the first to g...
Based on the lively discussions of the past few days, it certainly appears that requiring a preview does not impede the flow of discussion. Cool. Spam also is way down, despite my having removed and relaxed a number of other defenses. Notably, my spa...
Nice. OK, an initial implementation of my preview required functionality is complete. Other than requiring a preview, most of you should not see any different behavior. I've also relaxed my spam throttle to allow three comments - this allows the...
If they don't come back, it is not possible to have a two way conversation, is it? Robert Castelo: Um, the fact that you are getting paid is supposed to make me feel better? I don't think so. And I have to agree here with what Doc said about conten...
There are two main schools of thought concerning comment spam: the optimists and the defeatists. Optimists believe that comment spam can be beaten with technology; defeatists (maybe I should call them pessimists) believe that comments are as doomed ...
Google and a bunch of the blog vendors have introduced a way of anesthetizing URLs in blog comments so that they don't add PageRank. Just put rel="nofollow" in your link, and it won't count. (See, for instance, Google leads the......
The world is afire this morning with talk of the announcements by Six Apart and the three major search engines (Google, MSN and Yahoo) to support a new HTML attribute named nofollow (in full, rel="nofollow"). By adding this attribute to your link anchors, the search engines will no longer consider the linking page as a component of the linked page's......
Reading between the lines (which in this case isn't particularly hard), this and this (don't forget to view source) suggest that Google are soon to announce that they won't be calculating PageRank for links with a rel="nofollow" attribute. Finally,...
After a couple of rumours banging around for a bit, Google finally announced the decision to fight blog comment spam by ignoring links that had the “rel=nofollow” attribute. MSN and Yahoo quickly jumped on board. Technorati started an official...
Today Google, Yahoo, MSN Search, and other search operators announced their support of the rel="nofollow" attribute for <a href="..." /> tags. Adding this attribute indicates to the search crawlers that the specific links should not contribute...
Whyever not? For some sites, like corporate brochureware, having policy pages is handy. You want to optimize for the main page, and a TOS or privacy page gives you another way to have every page link to something that links back to the front page....
In a prior post someone commented: Wow - I hadn’t heard of the nonofollow movement. It seems to be predominantly peopled by SEO monkeys. Why are you joining up?
Justin Mason: Blog Spam, and a ‘nofollow’ Post-Mortem
An interesting article on blog-spam countermeasures — Google’s embarrassing mistake. Quote: I think it’s time we all agreed that the ‘nofollow’ tag has been a complete failure. For those of you new to the concept,...