[OH-Dev] Antispam updates

Sat Nov 17 19:21:10 UTC 2012

Howdy all,

Last night I wrote a small toolkit for despamming the site.

Summary:

* We mass-export user data into email-like files.

* We use an "off the shelf" email spam filter to learn which users' 
content is spam. Classifying all ~10,000 users as spam or non-spam takes 
about a minute.

* We then semi-manually pass those usernames to the backend, which emails 
the users, archives their data, and then deletes them from the site.
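To make the first step concrete, here is a minimal sketch of what "email-like files" could look like. It is not the code in oh-antispam. The field names (username, display_name, bio), the X-OH-Username header, and the .eml suffix are all assumptions made for illustration, and usernames are assumed to be safe to use as filenames.

```python
from email.message import EmailMessage
from pathlib import Path


def one_line(text):
    # Header values must not contain line breaks, so collapse whitespace.
    return " ".join(str(text).split())


def user_to_message(user):
    """Render one user record (a plain dict with hypothetical fields)
    as an email-like message that a mail spam filter can read."""
    msg = EmailMessage()
    msg["Subject"] = one_line(user.get("display_name", ""))
    msg["X-OH-Username"] = one_line(user["username"])  # hypothetical header
    msg.set_content(user.get("bio", ""))
    return msg


def export_users(users, out_dir):
    """Write one file per user and return the list of paths written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for user in users:
        path = out / f"{user['username']}.eml"
        path.write_text(user_to_message(user).as_string(), encoding="utf-8")
        paths.append(path)
    return paths
```

With one file per user, any filter that expects a directory of mail messages can be pointed at the export directory for training and classification.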

Code and full details here: https://github.com/openhatch/oh-antispam

As a side note, I wrote this in about 24 hours. The code quality is not 
amazing. But I am pretty proud of the speed of execution (75 seconds to 
analyze all ~10,000 users on my unimpressive laptop) and the fact that 
it's an automated, statistical approach. I have not written or maintained 
any whitelist/blacklist as part of this antispam effort, which I find 
thrilling.
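
For anyone curious what "statistical, no whitelist/blacklist" means in practice, here is a toy naive Bayes classifier. It is only a stand-in to show the idea, not the off-the-shelf tool the toolkit uses, and the class name, tokenizer, and smoothing choices are all my own assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase word-ish tokens; real filters tokenize far more carefully.
    return re.findall(r"[a-z0-9']+", text.lower())


class TinyBayes:
    """Toy two-class naive Bayes over word counts (illustrative only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        # label must be "spam" or "ham"
        self.counts[label].update(tokenize(text))
        self.docs[label] += 1

    def spam_log_odds(self, text):
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        totals = {k: sum(c.values()) for k, c in self.counts.items()}
        # Class prior, with add-one smoothing so an empty class is not fatal.
        score = math.log((self.docs["spam"] + 1) / (self.docs["ham"] + 1))
        for tok in tokenize(text):
            # Add-one smoothed per-token likelihoods for each class.
            p_spam = (self.counts["spam"][tok] + 1) / (totals["spam"] + vocab + 1)
            p_ham = (self.counts["ham"][tok] + 1) / (totals["ham"] + vocab + 1)
            score += math.log(p_spam / p_ham)
        return score

    def is_spam(self, text):
        return self.spam_log_odds(text) > 0
```

The only input is a set of hand-labelled examples. The word statistics do the rest, which is why there is no list of banned words or trusted users to maintain.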

Right now, it isn't fully integrated into oh-mainline; it was written more 
as a proof of concept. What I'd *love* to see is someone take this and 
turn it into a reusable Django app, so that it can live as a dependency of 
oh-mainline rather than as a part of it.

If we don't make it a dependency at that level, then within a week someone 
(me or another volunteer) should set up a cron job that at least runs the 
code in its current form and alerts the site admins when a spammy post is 
made.

-- Asheesh.