It’s just data

Beware of Strangers

Robert Castelo: Think about this though - if it is spam, someone is being paid to add interesting and relevant content to your site!

Um, the fact that you are getting paid is supposed to make me feel better?  I don't think so.  And I have to agree here with what Doc said about content.  'nuff said.

At the present time, the throttle and nonce are both working, so what remains are these hit and run jokers.  Each time I get a spam comment, I look at my Apache logs and I see somebody new, from an ip address I have never seen before, who comes in from a random vector, and never returns.  I tried to see a pattern in the queries and in the content. Without much luck.

Then I read this delightful piece by Stavros The Wonderchicken where he explores the falsely dualistic opposition between two unrelated words (like 'party' and 'publication').  Apparently, given that particular choice, the Wonderchicken will pick party every time.

This led me to think about what I wanted to accomplish with my weblog in general, and with my comments in particular.  Words like Connect, Collaborate, and Communicate came to mind.

Then it struck me: the distinction that the words publish and party imperfectly capture is the distinction between a monologue and a dialog.  One way communication and two way.

Then it struck me: from an ip address I have never seen before.  Light bulb.

Hit and run.  Strangers.  People who never have been here before.  These people are unlikely to be seen again.

And if they don't come back, it is not possible to have a two way conversation, is it?

So I set to work. I wrote a script to scan my Apache logs for everybody who has visited my weblog within the last week.  Bots, aggregators, and an occasional carbon based life form, I make no distinction.
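A minimal sketch of that kind of log scan, not the actual script: it assumes the logs are in Apache's common or combined format, and the function name and the seven day default are placeholders for illustration.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches the start of an Apache common/combined log line:
# client address, identity, user, [timestamp]
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[([^\]]+)\]')

def recent_visitors(lines, now=None, window=timedelta(days=7)):
    """Map each client address to its earliest hit inside the window."""
    if now is None:
        now = datetime.now(timezone.utc)
    first_seen = {}
    for line in lines:
        match = LOG_LINE.match(line)
        if not match:
            continue  # not a log line in the expected format
        ip, stamp = match.groups()
        try:
            when = datetime.strptime(stamp, "%d/%b/%Y:%H:%M:%S %z")
        except ValueError:
            continue  # timestamp did not parse
        if now - when > window:
            continue  # older than the window
        if ip not in first_seen or when < first_seen[ip]:
            first_seen[ip] = when
    return first_seen
```

Keeping the earliest hit per address, rather than just a set of addresses, is what makes a later "visited over an hour ago" check possible.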

Then I add in everybody who has left a comment in the last ninety days.  And not just ip addresses, but also urls.

All these people are welcome to comment freely.  So is everybody who visited any part of my site over an hour ago.

Then I roughed in some moderation code.  Something I am monitoring now.  I'll let you know how it goes.
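The rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code running on this weblog: the function name, argument names, and data shapes are all made up for the example.

```python
from datetime import datetime, timedelta

def is_stranger(ip, url, now, first_seen, commenter_ips, commenter_urls,
                min_age=timedelta(hours=1)):
    """Return True when a comment should get the 'new here' treatment.

    first_seen:     dict of ip -> earliest visit found in the recent logs
    commenter_ips:  set of addresses that left a comment recently
    commenter_urls: set of URLs left by recent commenters
    """
    if ip in commenter_ips:
        return False  # has commented before from this address
    if url and url in commenter_urls:
        return False  # a recent commenter used this URL
    earliest = first_seen.get(ip)
    if earliest is not None and now - earliest > min_age:
        return False  # has been around for more than an hour
    return True
```

Anything the function flags would go to moderation rather than being rejected outright, so a wrong guess costs a delay, not a lost comment.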


Cool-- useful and interesting to hear about what you are doing.

As your moderation system is getting more elaborate, I was thinking you might appreciate Derek Powazek's essay: "Gaming the system: How moderation tools can backfire" [link]

Posted by Jay Fienberg at

You could run into an over-moderation effect with users who read you via a hosted aggregation service, such as Bloglines or My Feedster.  Such users can read your full content and only ever visit your site to comment.  If it were someone's first attempt to post a comment (ever or within ninety days), they would not be on the list of prior commenters.

Granted, most readers of this site will visit to read the comments, but it does raise an issue, especially for sites providing comments via feed.

Posted by Jason Clark at

I wonder, too, how well this will work. I wonder how well technical solutions will work on social problems. If I only read your site via a feed, and normally nothing makes me comment because either I feel nothing or I totally agree, then when that once in a blue moon time comes that I find something to comment about, will I be prevented from commenting?

Hmmm ;-) Perhaps not. I don't know the last time I came to your site.

Posted by David at

I suppose Sam could make a loophole for such people by allowing people whose REFERER was Bloglines or Feedster to comment. (People using conventional NewsAggregators would already be in his Apache logs.)
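A rough sketch of such a loophole, assuming the comment form handler can see the Referer header. The hostnames in the list are assumptions for illustration, and the function name is hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical list of hosted aggregators; the hostnames are assumptions.
HOSTED_AGGREGATORS = {"bloglines.com", "feedster.com"}

def from_hosted_aggregator(referer):
    """True if the Referer header points at a listed hosted aggregator."""
    if not referer:
        return False
    host = (urlsplit(referer).hostname or "").lower()
    return any(host == name or host.endswith("." + name)
               for name in HOSTED_AGGREGATORS)
```

The Referer header is supplied by the client and is easy to forge, so this could only ever be a courtesy for real readers, not a defense.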

Posted by Jacques Distler at

While I agree with you on Stav's essay, my response to your software reaction is: what?

Posted by Shelley at

re: what?  Perhaps an example will help.

Shelley, you have been here before.  I know that because I can compare your ip address to the ones I find in my Apache logs.  You are not a stranger.  That is not a value judgment, simply an observation.

As I eliminate other sources of spam, what is left is people who surf Google or blogrolls looking for open comments.  Strangers.  People whose ip addresses don't match what I find in my logs.  That's what they all appear to have in common.

Somebody with the same ip address as David's above visited my weblog on the 19th and 20th.  I can tell the times, and what pages they accessed, and where they came from (bloglines).  This might not have been David himself, but does provide some valuable insight.

Posted by Sam Ruby at

Sam, isn't it reasonably straightforward for you to compile a list of how many/which previous comments this approach would have blocked? That should give a definitive answer about its potential collateral damage.
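One way such a replay might look, as a sketch only. It assumes the accepted comments and the per-address visit times have already been extracted; the function name, the tuple layout, and the default windows are illustrative assumptions.

```python
from datetime import datetime, timedelta

def would_have_flagged(comments, visits,
                       min_age=timedelta(hours=1),
                       window=timedelta(days=7),
                       memory=timedelta(days=90)):
    """Return the past comments that the stranger rule would have flagged.

    comments: list of (when, ip, url) tuples for comments already accepted
    visits:   dict of ip -> list of visit times taken from the server logs
    """
    flagged = []
    history = []  # comments replayed so far, oldest first
    for when, ip, url in sorted(comments, key=lambda c: c[0]):
        recent = [c for c in history if when - c[0] <= memory]
        known_ip = any(c[1] == ip for c in recent)
        known_url = bool(url) and any(c[2] == url for c in recent)
        # A visit counts if it is older than min_age but still in the window.
        visited = any(min_age < when - v <= window
                      for v in visits.get(ip, []))
        if not (known_ip or known_url or visited):
            flagged.append((when, ip, url))
        history.append((when, ip, url))
    return flagged
```

Every legitimate comment in the returned list is a false positive the rule would have produced, which is the collateral damage in question.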

Posted by Graham at

Graham, as near as I can tell, it would have stopped all of the spam that I have received since the 9th of this month (a total of 19 spams).  What I can not tell is how many false positives this will find.

At the moment, all I have in place is a simple warning that the message will be moderated, but underneath nothing has changed.  So, think of it like the fake home security signs that you can buy at some hardware stores.

So far, no spam today.  And no false positives have been reported.

Posted by Sam Ruby at

And what about people on dialup lines? (Yes, they still exist.)

They don't have a fixed IP, hence may never have appeared in your logs, despite having made dozens of previous trips to your site.

What about folks who are travelling (or whatever)? Again, the new IP address will not be in your logs.

I don't think you can assume a reliable association: Person<=>IP address, with IPv4.

Posted by Jacques Distler at

RE: Beware of Strangers

Sam, I believe my IP address isn't constant even though I have a broadband connection due to the flakiness of the service I get from Comcast. I wonder how your algorithm deals with that?

Message from Dare Obasanjo

at

Jacques and Dare: I also check the URL you provide on the form against comments made in the past six months.

Also, my experience is that spammers are in a hurry.  I almost certainly can lower the one hour boundary safely.

Posted by Sam Ruby at

If the stranger filter is used widely, then spammers will adapt to it.  It's not difficult.  Technically inclined spammers will have no problem writing a bot to visit a site hours or days before the spam attack.

This defense will work for Sam as long as the defense is not widely adopted.  Even then, it might not work for Sam because Sam's Google juice makes him a very attractive target.

Posted by Gary Burd at

Technically inclined spammers

No question... but that is not the variety that I am seeing.  What I am seeing is opportunistic people surfing by and using the web forms.

Posted by Sam Ruby at

comment spam filtering - it's all about the IPs

Sam describes his new comment spam filtering system. Quote: Then it struck me: from an ip address I have never seen before. Light bulb. Hit and run. Strangers. People who never have been here before. These people are unlikely to be seen again. And if...

Excerpt from d2r at

Jacques and Dare: I also check the URL you provide on the form against comments made in the past six months.

Hmmm. That actually makes me feel less secure, rather than more. There's nothing to stop John Q. Spammer from entering his advert for Viagra and assorted porn sites and leaving my URL as his calling card.

Previously, he had no incentive to do so (why waste a perfectly good hyperlink?). Now you have given him one.

Posted by Jacques Distler at

It might be worth pointing out that occasionally I receive worthwhile comments from utter strangers who have arrived at my blog via a Google search.  I'm not sure that this is a very friendly anti-spam strategy.

Posted by jacob at

Jacques, if they take the time to figure out my policies, they should figure out that spam doesn't last very long on my site as I very actively police it.  My strategies are aimed at the folks who don't take the time.

Jacob, here's the warning message:

Welcome!  You seem to be new here.  In order to minimize spam, your comment will not appear on the weblog until it has been moderated in by the owner of this weblog

Be advised that your ip address is being tracked, and this weblog owner is very diligent about removing spam and adding both the ip address and the address of any urls referenced by such posts to his blacklist

Suggestions welcome.

Posted by Sam Ruby at

Sam, using information at the IP layer to affect behaviour two layers up isn't going to work, unless the application protocol gives some guarantee about affinity between IP addresses and agents.

HTTP doesn't; it explicitly allows intermediaries, and it allows clients to use different intermediaries for requests; therefore, two requests from the same IP address can easily be from two different people, and conversely two requests from different IP addresses can be the same person sitting at a browser in a single session. Many proxies are deployed in farms, where users are load-balanced between a number of proxies, and therefore it's likely that requests won't come from the same IP address.

Somewhere between 30% and 50% of Web traffic goes through an intermediary, depending on who you listen to, so these scenarios aren't uncommon. Additionally, IP-based heuristics break things in a way that users can't control, so if you guess wrong, they're out of luck.

Cheers,

Posted by Mark Nottingham at

My weblog URL will be constant, but my e-mail IP address is variable (dynamic), which is out of my control. Verizon, my ISP, supplies the IP address each time I logon to my DSL account. So this might become a problem.

Posted by Curtis Seyfried at

Though I don't much like it, I'm afraid that "your comment may not be visible until the site owner vets it" is something we'll probably have to get used to. Thinking about my crapflooders and Derek's thoughts on trolls got me to thinking about how they developed, and are tolerated, in their /. home. If you only display (and I only rebuild baked pages for) comments from known people by default, and have a separate page (fried, for me) that's the equivalent of reading /. at -1, then they go from a threat to our survival to just the sort of tolerable annoyance that digging through your spam email for false positives is. I'd still rather only not moderate comments that are signed with a key that's posted at a URL I've already approved, but if IP's the best I can hope for, it's better than what I've got now.

Posted by Phil Ringnalda at

As you guys probably know, every email I've sent out in the past N years has been PGP-signed. And I'd be happy to PGP-sign my comments too. But, I'm afraid, that ain't gonna happen soon. Too many people (my very smart colleagues among them) look at a PGP signature and say, "what the f*%$# is that?"

Maybe I'm anomalous, but I've gotten precisely 5 spam comments in 14 weeks, since I set up my current system. I can live with that ratio.

My current objective is to crush the crapflooders, like bugs on my windshield. The spammers are a manageable problem.

Posted by Jacques Distler at

Although I admire the sophistication of what you're trying to do, I'd reiterate the dialup point and make another: in the UK even cable and broadband tends to be on dynamic IPs. So anyone visiting may well be seen as "new" just because their IP has changed. Does this not happen in the US, that it's not a consideration?

Posted by Meri at

Sam,

I like this approach.  When I first read about it, I was instantly worried about false positives.  But when I saw the big red warning message that you posted above, it all became clear.  The moderation eliminates the possibility of any false positives at the expense of a small portion of the blog author's time.  Very nice indeed.  Not quite as automated as your previous solutions, but it is thorough.

Posted by Scott Johnson at

Meri, the UK is not unique in having dynamic IPs.  Most DSL and Cable broadband customers in the US tend to have their address assigned through DHCP, giving them dynamic IP addresses.  Comment spammers using either of these two methods could get around IP blacklists by simply renewing their DHCP lease.

My guess is that the Bayesian methods used with spam may be what is needed.  The best part is that the owner's content may be all that is needed to detect spammy comments.  Since the owner sets the content of the site, comments can be graded based on how well they match the post they're commenting on.
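To make the grading idea concrete, here is a deliberately crude sketch. It is not a Bayesian classifier: it swaps in a plain word overlap score, and the four letter minimum and the function names are arbitrary choices made for the example.

```python
import re

def words(text):
    """Lowercased set of alphabetic words of four or more letters."""
    return set(re.findall(r"[a-z]{4,}", text.lower()))

def relevance(post, comment):
    """Fraction of the comment's distinct words that also occur in the post."""
    comment_words = words(comment)
    if not comment_words:
        return 0.0
    return len(comment_words & words(post)) / len(comment_words)
```

A low score would only be a hint for moderation, since a short but perfectly sincere comment can share few words with the post.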

Posted by Steve Peters at

My current objective is to crush the crapflooders, like bugs on my windshield. The spammers are a manageable problem.

A determined and intelligent crapflooder can not be stopped.  The best you can do is set a bar above the simple scriptkiddie level, and outlast the rest.

There are multiple types of spammers, too.  Setting the bar above the simple scriptkiddie level cuts out a percentage of these too.  But even after that, where you are seeing a spam comment about every third day, I was seeing about three comments a day, with a trend upwards.

Posted by Sam Ruby at

Looks like what's happening in the real world is happening in cyberspace: building walls instead of roads.

Posted by Don Park at

I like it.  Engineering the membrane around communities is the problem.  Not too thin, not too thick.  I love the idea that new-in-the-neighborhood is a useful bit of data.  I like the idea of creating a membrane that rewards lurking, since listen first seems like a good rule of thumb for polite discourse.  Reminds me a bit of the tricks that demand a bit of a Turing test before you are allowed in (i.e. recognizing a picture).

I'm a little unclear whether blog comments are a venue for dialog.  They may be better suited as a means to capture interesting critique, additions, counterpoint, etc.  Given that blogs tend to orbit an individual, they aren't really as good a venue for conversation as the more neutral group provided by a topical mailing list.  Then if you add in that a blog is a bit of a large audience broadcast medium with an unpredictable cascade of distribution, the presumption that the good comment comes from a known community member declines a bit.

Posted by Ben Hyde at

Don: Walls suck, but roads suck too. A good read is Suburban Nation.

(hmm... A Pattern Language is to OO Programming as Suburban nation is to... the Web? I should finish it first ;)

Posted by Mark Nottingham at

Thanks, Don.  I was really waiting for some true insight into the social, political, legal, and technical ramifications of spam, but now I see that the solution is so simple!  Don't build walls, build roads!  Amazing!  It's like two arrows meeting in mid-air.  You should run for Congress.

Posted by Mark at

Bitterness aside, here are some points to consider:

1. Not every offense requires a technical response.

How much damage are these hit-and-run jokers inflicting?  While some extremes are possible, is that what a normal (as in not on the near-vertical side of the Power Law) blogger runs into on a regular basis?

Hit and run usually just means odd comments here and there which readers can easily ignore.  Yes, I have seen some ambiguous comments that tricked me into visiting their spam site, but no real damage is done.

2. Some strangers are new friends.

I don't see an easy way to distinguish jokers and new visitors.  Is loss of fresh visitors and their new viewpoints worth whatever satisfaction or peace of mind gained from throwing up this wall?

3. Spammers are geeks too.

If we can be clever, they can be too.

Maybe I had too much of Bush because I see a bit of Bush-like thinking going on, albeit more at geeky level than Bush's usual angry deer level.

Posted by Don Park at

Don, quite frankly, I don't know what point you are trying to make.  And I really don't get the oblique Bush reference.

To date, I have received 740 spams.  They have been averaging 3 a day.  I received 16 spams on January 3rd alone.

I now require a preview.  This has hardly proven to be a wall.

I now silently issue a nonce.  This is transparent to all non-automata.
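For readers unfamiliar with the technique, one common way to build such a token is sketched below. It is an assumption for illustration, not the mechanism used on this weblog: it issues a signed timestamp in a hidden form field, which proves the form was fetched recently but, unlike a strict nonce, does not by itself prevent reuse within the allowed age.

```python
import hashlib
import hmac
import secrets
import time

# Per-process secret; a real deployment would have to persist it (assumption).
SECRET = secrets.token_bytes(32)

def issue_nonce(now=None):
    """Return 'timestamp:signature' to embed in a hidden form field."""
    stamp = str(int(time.time() if now is None else now))
    sig = hmac.new(SECRET, stamp.encode(), hashlib.sha256).hexdigest()
    return f"{stamp}:{sig}"

def check_nonce(nonce, max_age=3600, now=None):
    """Accept a token only if it is well formed, signed by us, and fresh."""
    stamp, sep, sig = nonce.partition(":")
    if not sep or not stamp.isdigit():
        return False
    expected = hmac.new(SECRET, stamp.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig.encode(), expected.encode()):
        return False
    current = time.time() if now is None else now
    return 0 <= current - int(stamp) <= max_age
```

A browser carries the hidden field along without the reader noticing, while a script that posts directly to the form handler without first fetching the form has nothing valid to send.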

I also now issue a greeting if I encounter both an ip address and a URL that I have not seen recently.  At the present time, there is no moderation system behind it, but I may elect to put one in if spams continue to be a problem.  Even then, the criterion was people who have only been to my site for less than an hour.  When I hear the word "wall", I think in terms of several feet.  An hour hardly qualifies as a speed bump.  And even if there was a moderation system, my inclination would be to moderate in all non spam.

In any case, so far, today is my second consecutive spam free day.

Posted by Sam Ruby at

Sam,

Most, if not all, of my technical posts and comments are written from the perspective of a developer looking at the users' needs.  That was the perspective from which I made the above comment.

I think the misunderstanding occurred because you are apparently looking at the problem from the perspective of a popular blogger with technical ability to do something about spams hitting your blog.

If my comment and attempt at humor appeared to be offensive or cynical, it wasn't intentional and I apologize.

Good Day.

Posted by Don Park at

I think this is an interesting discussion. My bias is towards Don's perspective. The folks commenting here clearly understand the infrastructure side of things (perhaps superficially in some cases).

But anything that requires human intervention will never scale. But if that's okay (as it certainly is for Sam) then there isn't an issue. But if it needs to work in an automated manner -- well this ain't gonna fly. Here are but a few issues and additional comments:

1. I have no issue with SPAM (apart from disliking it). I post everywhere with my real name and my real email address. I list my email address on my blog. I accept comments (not that what I have to say is that interesting).

2. It's the <1% that even know where to find their server logs. I run my blog and portions of some web servers with no logs -- mostly to discourage me from wasting time looking at them. For me they're really not that interesting.

3. Combining knowledge of server logs and denial of service -- well that's definitely in the weeds.

4. DHCP -- yep, DSL and Cable-based Broadband do DHCP here in the states. That said the DHCP is often tied to a lease that lasts for upwards of 72 hours from last renew -- and that renew is tied to a MAC address. So, in many many cases DHCP = Fixed IP. I've had the same DHCP IP for 2 1/2 years. There is a marked difference between DSL and Cable DHCP. Cable networks tend to have huge battery banks to keep the network powered (in case of regional power failure). In fact, during the big blackout last August in the Northeast I was able to stay on the laptop via WiFi because my Cable modem and WiFi router were plugged into a UPS and never lost power. It was a bit weird when you think about it -- no power and no phone (how many people have phones that don't require power anymore -- I keep an old lineman's phone just in case) -- but WiFi internet continued...

5. Back to the scaling point -- how in the world could a host of multiple blogs implement such a policy (correctly)?

Posted by phil at

Sam Ruby: Defying classification

What is Planet Apache?  What is Technorati?  These questions fascinate me. Mailing list vs comments as a venue for dialog?  To me, that's like comparing board meetings vs pubs as venues for making business decisions. I am told ...

Pingback from Sam Ruby: Defying classification

at

Robert Castelo: "Where it gets interesting, is when someone leaves a relevant and interesting comment on your Blog, which has a link to a commercial site. This might be spam or it might not.

Think about this though - if it is spam, someone is being paid to add interesting and relevant content to your site!"

By only quoting the last sentence of my conclusion above you make it sound like I'm advocating spam in general - which I absolutely do not!

Sam Ruby: "Um, the fact that you are getting paid is supposed to make me feel better?  I don't think so."

No, the point I'm making is that if it's an interesting and relevant post, I think it's irrelevant whether someone was paid to write it or not.

Imagine if you made a blog entry about the US Democratic elections and the head of each candidate's campaign team saw it and started posting to the thread. If it was on topic and interesting would you block them because they were being paid to write their entries, and posting with the intention to promote?

Posted by Robert Castelo at

Simon Willison: Solving comment spam

There are two main schools of thought concerning comment spam: the optimists and the defeatists. Optimists believe that comment spam can be beaten with technology; defeatists (maybe I should call them pessimists) believe that comments are as doomed ...

Pingback from Simon Willison: Solving comment spam

at

In the long term, I think the only real solution is for all popular weblog comment tools to convert visitor-submitted URLs to redirection URLs. If there's no PageRank boost that comes from posting spam at places like Intertwingly, the spammers will figure this out and move on.
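A sketch of what that conversion might look like in a comment tool. The /redirect endpoint and the function name are hypothetical, and the regular expression only handles simple double-quoted href attributes, so treat it as an illustration rather than a robust HTML rewriter.

```python
import re
from urllib.parse import quote

# Hypothetical endpoint on the weblog that issues the actual redirect.
REDIRECT_PREFIX = "/redirect?url="

# Absolute http(s) links in double-quoted href attributes.
HREF = re.compile(r'href="(https?://[^"]+)"', re.IGNORECASE)

def redirect_links(comment_html):
    """Point every absolute http(s) link in a comment at the redirector."""
    def rewrite(match):
        target = quote(match.group(1), safe="")
        return 'href="' + REDIRECT_PREFIX + target + '"'
    return HREF.sub(rewrite, comment_html)
```

For example, a link to http://example.com/a?b=c would come out as /redirect?url=http%3A%2F%2Fexample.com%2Fa%3Fb%3Dc, so the page itself no longer links directly to the submitted site.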

As an aside, it appears that I've been identified as a potential troublemaker. I think my DSL provides a fixed IP address, so I must not have dropped by Intertwingly as recently as I thought.

Posted by Rogers Cadenhead at

Rogers, are you going to track down every single person who has ever installed blogging software, and browbeat them into upgrading, so that there's absolutely no possibility of any bot-posted spam ever having any benefit?

I'd start with Yole's Syndirella devblog and Sterling Hughes' old posts, as some of the most encouraging to spammers, but remember, your job's not done until there isn't a single one left anywhere. Otherwise, since the cost to flood a thousand redirecting comments to get one that doesn't redirect is essentially nothing, you'll only have deprived real commenters of a real link without affecting the spammers' behavior at all.

Posted by Phil Ringnalda at

I don't respond to spam emails in any way, but it doesn't seem to have slowed them down.

Rogers does have a point, though; there is a reason to redirect URLs.  Several people have noticed that comment spam tends to attract more comment spam.  There is a very simple reason for this: spammers use "link:" queries on Google to find sites that already link to existing spam sites, then attack those sites with more spam.  Spammers also try to make money by selling lists to other would-be spammers; existing victims of comment spam will make it onto these lists and get more spam in the next wave.

Over time, I suspect comment spam will exhibit (wait for it...) a power law structure, where a handful of sites are totally full of spam, others have a moderate amount, and most have very little.  Total vigilance to deleting spam (ala Sam/Phil/Shelley) is obviously one way to keep yourself in the long tail of that power law.  But redirecting URLs should also help, since it breaks the "link:" feedback loop that is virtually guaranteed to attract the attention of other spammers.

Posted by Mark at

That seems like a reasonable analysis, except for one thing: it makes the same mistake almost all of us have been making every time we think about comment spam, assuming that Google's a hapless bystander rather than an active participant. The link:foo.com syntax is supposed to only work for linkers with a PageRank of 4 or more, and there's no better way to find yourself with PR0 than to put up a bunch of pages that do nothing but link to spammy domains.

It would be interesting to see what sort of power-law variation you would get if, as soon as your blog makes the Technorati Top 100, you are taken out and shot.

Posted by Phil Ringnalda at

Interesting. Interesting is a very spammy word, so search for "interesting site:www.edwardbear.org" and you'll get plenty of Sterling's spammed posts. But, pick one with nice high PageRank, and search for link:www.spammerdomain.com, and you may find some of their other sorts of spam, in guestbooks and non-blog comments and mailing list archives, but you don't seem to find any blog comment spam. At all.

Posted by Phil Ringnalda at

Thinking about comment spam

Whether or not Simon thinks it's boring, I'm still interested in blogging about comment spam. Sue me....

Excerpt from phil ringnalda dot com at

"The link:foo.com syntax is supposed to only work for linkers with a PageRank of 4 or more [...]"

Interesting.  I'd never known this before.  Do you have a source that can verify this information?

Posted by Scott Johnson at

Heh. A pair, both equally bad: somewhere, or more likely numerous wheres, in the depths of the WebmasterWorld forums, and experimenting with the Google toolbar. Which is currently delivering extremely odd results, just to make things that much harder. And since it's coming after the spam filters, it's that much harder to tell why it isn't returning some link you know exists. Hmm. Where to experiment?

Posted by Phil Ringnalda at

re: "It would be interesting to see what sort of power-law variation you would get if, as soon as your blog makes the Technorati Top 100, you are taken out and shot."

Since power laws are self-similar across scale, you would likely just end up with a new power-law-looking Top 100 in a slightly smaller pond.

It would certainly add an exciting new dimension to blogging though.  "Hey, thanks for delinking me yesterday, I was dangerously close to getting shot."  Maybe we would all give up on links altogether and just manually type URLs.

Posted by Mark at

I wonder if I could apply this to email... humm

Posted by Andy at

Sam Ruby: Beware of Strangers

[link]...

Excerpt from del.icio.us/tag/spam at

Sock Puppets

FWIW, my experience is that both trolling and spamming were greatly reduced once I implemented this. Related:  Beware of Strangers  Users Who Share Locations... [more]

Trackback from Sam Ruby

at

Comments Please

I hope soon to begin implementing a comment system for ongoing. This space is my notebook where I’ll work out the design. Since, as of this writing, the system exists only in theory, if you have a suggestion you’ll have to send me an email... [more]

Trackback from Sam Ruby

at

Captcha this!

I’ve noticed an uptick of spam lately.  Not just on my weblog, but on a number of weblogs I follow.  Each time I do this, I adjust my defenses slightly, and the problem goes away — for a while. My best defense to date has been requiring pre... [more]

Trackback from Sam Ruby

at
