[OH-Dev] Removed a bunch of spam (and users) from project pages

Sun Dec 4 17:20:28 UTC 2011

Excerpts from Asheesh Laroia's message of Fri Dec 02 15:50:03 -0500 2011:
> Howdy all,
> 
> I just spent a bit of time cleaning up spam "Answers" to project 
> involvement questions from the site.
> 
> I wrote some simple functions to do that, which I ran inside
> 
>      python manage.py shell_plus
> 
> on the deployment. If the spammers continue their spammerificness, we could 
> maybe turn this into a management command, or add some spam detection code 
> to the answer submission process.
> 
> I did some research on antispam backends, and it seems that Akismet and 
> TypePad Antispam don't have high enough accuracy for what we're 
> doing. I would suggest using http://code.google.com/p/django-spambayes/ 
> instead.
> 
> The user deletion code I ran has two interesting properties:
> 
> * It saves the text of spam users' answers, so we can use it as 
> training data down the road.
> 
> * It emails the users with a note saying that we deleted their accounts.
> 
> It's attached as "antispam.py".

Two more updates that people might want to know about, based on
discussion at the meeting.

1. There is no automatic spam filtering of any kind yet.

I would like to just use Akismet or TypePad's Antispam API, but it seems
those web services won't do what we need. I'm happy to be proven
wrong about that.

http://openhatch.org/bugs/issue624 has the demo code that I used to test
those services. It has my API keys in it, so feel free to try it. You
need the Akismet module installed to run it:
http://www.voidspace.org.uk/python/akismet_python.html

2. Here's how one might build actual antispam:

* The Answer model gets a new field, spamcheck_status, that can be in
  one of three states: UNKNOWN, HAM, SPAM. UNKNOWN is the default value.

* We adjust the default Manager for Answer objects so that it
  only returns the ones that are HAM.

* When an Answer is submitted, we check it against a Bayesian database
  that we maintain using http://code.google.com/p/django-spambayes/
  (this would require adding a dependency or two, but I think that is okay)

  If it is super spammy, we mark it as SPAM and store it in the database.
  We also email the site admins saying there's a new spam comment.

* We create a management command that shows the person running it each
  Answer in the UNKNOWN state and asks whether it is spam. If so, we mark
  the Answer as SPAM in the database, train the Bayesian filter on its
  text, and optionally delete the entire user.
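
To make the flow above concrete, here is a minimal sketch in plain Python,
with no Django and no django-spambayes. Everything in it is hypothetical:
the names (TinyBayes, submit_answer, visible, review), the 0.95 threshold
for "super spammy", and the toy classifier, which only stands in for the
Bayesian database. Marking reviewed non-spam as HAM and training on it is
also my assumption; the plan above only says what happens to spam.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass

UNKNOWN, HAM, SPAM = "UNKNOWN", "HAM", "SPAM"
SPAM_THRESHOLD = 0.95  # assumed cutoff for "super spammy"


@dataclass
class Answer:
    text: str
    spamcheck_status: str = UNKNOWN  # UNKNOWN is the default value


class TinyBayes:
    """Toy naive Bayes filter; a stand-in for the real Bayesian database."""

    def __init__(self):
        self.words = {HAM: Counter(), SPAM: Counter()}
        self.docs = {HAM: 0, SPAM: 0}

    @staticmethod
    def tokens(text):
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, label):
        self.words[label].update(self.tokens(text))
        self.docs[label] += 1

    def spam_probability(self, text):
        if not self.docs[HAM] or not self.docs[SPAM]:
            return 0.5  # no opinion until both classes have been seen
        vocab = len(set(self.words[HAM]) | set(self.words[SPAM]))
        n_spam = sum(self.words[SPAM].values())
        n_ham = sum(self.words[HAM].values())
        log_odds = math.log(self.docs[SPAM] / self.docs[HAM])
        for word in self.tokens(text):
            # Laplace smoothing so unseen words never give a zero probability.
            p_spam = (self.words[SPAM][word] + 1) / (n_spam + vocab)
            p_ham = (self.words[HAM][word] + 1) / (n_ham + vocab)
            log_odds += math.log(p_spam / p_ham)
        log_odds = max(-50.0, min(50.0, log_odds))  # keep exp() in range
        return 1.0 / (1.0 + math.exp(-log_odds))


def submit_answer(text, classifier, notify_admins):
    """Check a new Answer on submission; flag and report it if very spammy."""
    answer = Answer(text)
    if classifier.spam_probability(text) >= SPAM_THRESHOLD:
        answer.spamcheck_status = SPAM
        notify_admins(answer)  # e.g. send an email to the site admins
    return answer


def visible(answers):
    """What the adjusted default Manager would return: HAM only."""
    return [a for a in answers if a.spamcheck_status == HAM]


def review(answers, classifier, is_spam):
    """Management-command loop: ask about each UNKNOWN Answer, then train."""
    for answer in answers:
        if answer.spamcheck_status != UNKNOWN:
            continue
        label = SPAM if is_spam(answer) else HAM
        answer.spamcheck_status = label
        classifier.train(answer.text, label)
```

In the real thing, notify_admins would send the email and is_spam would be
an interactive prompt; here both are passed in as plain callables so the
logic can be exercised without a deployment. Note that with this design a
new Answer stays hidden until someone reviews it, since only HAM is shown.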

I'm going to push this out of 0.11.11 because the actual spam problem
seems to have gone away with my semi-automated account deletion.

-- Asheesh.