[OH-Dev] Further thoughts about wiki spam (still a problem)

Sat May 12 21:20:02 UTC 2012

Hey all,

Some of you might have noticed that we're still getting wiki spam. A week 
ago, I thought that it would be enough to just monitor 
Special:RecentChanges and remove things as they pop up, but when spammers 
do attack the wiki, I find it tedious and error-prone to actually remove 
the spam. At the moment, I'm researching some tools that might make it 
easier to react to the vandalism edits (which luckily are not super 
common; 10-20 per week or so).

A question: if I can create a git-based, command-line workflow for 
identifying and reverting spam, would anyone be interested in 
volunteering to keep up with it?

More info:

A look at https://openhatch.org/wiki/Special:RecentChanges suggests that 
spam comes in waves (which I personally find quite curious). Some other 
things that seem true, based on looking at the past week's edits:

* Jessica (jesstess) has been very good about de-vandalizing and 
protecting pages that relate to the Boston Python Workshop; thank you for 
quietly doing this work that you really shouldn't have to.

* Spammers are creating openhatch.org accounts and using those to log into 
the wiki. (I find that fairly impressive.)

* There's a form of spam I hadn't much seen before, which is to abuse 
"Move" repeatedly on the same page. See 9 May 2012 for examples of this. 
This strategy creates lots of new pages.

* Much "spam" doesn't contain external links. In my opinion, this is 
somewhat bizarre. See e.g. 
https://openhatch.org/w/index.php?title=User_talk:207.151.36.229&curid=875&diff=9944&oldid=9935&rcid=9954

* Some spam removes many sections of a page and replaces them with 
irrelevant text, for example 
https://openhatch.org/w/index.php?title=Boston_Python_Workshop_5/Friday/OSX_set_up_Python&curid=486&diff=9936&oldid=8472&rcid=9946C

Interestingly, much of this activity is not "link spamming" -- it's 
just automated vandalism. It could be that the text being left behind 
is a way of using our wiki as a decentralized content store for these 
bots; I can't think of any other purpose.
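
As a rough illustration of how two of the patterns listed above (repeated 
"Move" abuse and edits that strip out most of a page) might be flagged 
automatically, here is a minimal Python sketch. It assumes each change is a 
dict shaped like an entry from MediaWiki's list=recentchanges API output 
(keys "type", "logtype", "user", "oldlen", "newlen"). The function names 
and the thresholds are hypothetical and untuned, and a human would still 
need to review whatever gets flagged.

```python
from collections import Counter


def shrinks_page(change, ratio=0.5):
    """True if an edit leaves the page at less than `ratio` of its old size.

    The 0.5 default is an arbitrary assumption, not a measured cutoff.
    """
    old = change.get("oldlen", 0)
    new = change.get("newlen", 0)
    return change.get("type") == "edit" and old > 0 and new < old * ratio


def repeated_movers(changes, threshold=3):
    """Return the set of users with at least `threshold` page moves."""
    moves = Counter(
        c["user"]
        for c in changes
        if c.get("type") == "log" and c.get("logtype") == "move"
    )
    return {user for user, count in moves.items() if count >= threshold}
```

Neither check looks at links at all, which matches the observation that 
most of these edits don't add any.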

Since most of these edits don't add new links (perhaps because we're 
already blocking those sorts of spam edits effectively), the existing 
link-oriented tools are a poor match. What we need is either humans or 
bots to identify vandalism edits and revert them, and preferably to ban 
the account/IP that caused them. At this moment, I'm investigating 
tooling that should make it easier to:

* Review all edits since a given date

* Revert the ones that are spammy

* Block users/IPs that are spamming
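
The first step, reviewing all edits since a given date, could be sketched 
against MediaWiki's list=recentchanges API roughly as follows. This is only 
an illustration under stated assumptions: the api.php location is guessed 
from the index.php URLs above, and the helper names are hypothetical. It 
builds the query URL and groups results by user so one spammer's edits can 
be reviewed together; fetching the URL, reverting, and blocking are left 
out.

```python
from collections import defaultdict
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: api.php sits next to the index.php seen in the wiki URLs.
API_URL = "https://openhatch.org/w/api.php"


def recent_changes_url(since, limit=500):
    """Build a recentchanges query covering changes back to `since`.

    `since` should be a timezone-aware datetime.
    """
    params = {
        "action": "query",
        "list": "recentchanges",
        # Results are listed newest first by default, so rcend is the
        # oldest timestamp to include.
        "rcend": since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "rcprop": "title|user|ids|sizes|timestamp|comment|loginfo",
        "rclimit": limit,
        "format": "json",
    }
    return API_URL + "?" + urlencode(params)


def group_by_user(changes):
    """Group recentchanges entries (dicts) by their "user" field."""
    grouped = defaultdict(list)
    for change in changes:
        grouped[change.get("user", "(unknown)")].append(change)
    return dict(grouped)


if __name__ == "__main__":
    print(recent_changes_url(datetime(2012, 5, 5, tzinfo=timezone.utc)))
```

Grouping by user is a design choice that fits the goal of blocking the 
account or IP once its edits have been judged spammy.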

For spammers using OpenHatch accounts, we could go the full route of 
deleting the account across all OpenHatch sites that use our central 
login, or we could just block the account in the wiki. For now, it's 
simplest to automate blocking the account in the wiki.

I'm particularly intrigued by the idea of doing this all from within 
'git', via this package: 
https://github.com/Bibzball/Git-Mediawiki/wiki/User-manual

I'm doing a 'git clone' of the wiki now. I will follow up on this thread 
with more information about whether I can make these tools useful for 
reviewing and reverting vandals' edits. If so, I can document what I've 
done for others to see.

-- Asheesh.

P.S. I still think that machine learning would help here, but I'm not 
going to put that on the critical path.