[OH-Dev] Further thoughts about wiki spam (still a problem)
Asheesh Laroia
asheesh at asheesh.org
Sat May 12 21:20:02 UTC 2012
Hey all,
Some of you might have noticed that we're still getting wiki spam. A week
ago, I thought that it would be enough to just monitor
Special:RecentChanges and remove things as they pop up, but when spammers
do attack the wiki, I find it tedious and error-prone to actually remove
the spam. At the moment, I'm researching some tools that might make it
easier to react to the vandalism edits (which luckily are not super
common; 10-20 per week or so).
A question: If I can create a git-based, command-line workflow for
identifying and reverting spam, would anyone be interested in
volunteering to keep up with it?
More info:
A look at https://openhatch.org/wiki/Special:RecentChanges suggests that
spam comes in waves (which I personally find quite curious). Some other
things that seem true, based on looking at the past week's edits:
* Jessica (jesstess) has been very good about de-vandalizing and
protecting pages that relate to the Boston Python Workshop; thank you for
quietly doing this work that you really shouldn't have to.
* Spammers are creating openhatch.org accounts and using those to log into
the wiki. (I find that fairly impressive.)
* There's a form of spam I hadn't much seen before, which is to abuse
"Move" repeatedly on the same page. See 9 May 2012 for examples of this.
This strategy creates lots of new pages.
* Much "spam" doesn't contain external links. In my opinion, this is
somewhat bizarre. See e.g.
https://openhatch.org/w/index.php?title=User_talk:207.151.36.229&curid=875&diff=9944&oldid=9935&rcid=9954
* Some spam removes many sections of a page and replaces them with
irrelevant text, for example
https://openhatch.org/w/index.php?title=Boston_Python_Workshop_5/Friday/OSX_set_up_Python&curid=486&diff=9936&oldid=8472&rcid=9946C
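Two of the patterns above (repeated "Move" on the same page, and edits
that strip out most of a page) could be flagged mechanically. What follows
is only a rough sketch, not something running on the wiki. It assumes each
change is a plain dict with "title", "logtype", "oldlen" and "newlen" keys,
loosely modeled on what the MediaWiki recent-changes API returns; the
function name and both thresholds are made up for illustration. It will
miss vandalism that replaces text without shrinking the page much.

```python
from collections import Counter


def flag_suspicious(changes, shrink_ratio=0.5, move_threshold=3):
    """Return changes matching two rough vandalism patterns.

    `changes` is a list of dicts. The key names ("title", "logtype",
    "oldlen", "newlen") are assumptions modeled on the MediaWiki
    recent-changes API; the thresholds are arbitrary starting points.
    """
    # Count how many times each page title shows up in a move log entry.
    move_counts = Counter(
        c["title"] for c in changes if c.get("logtype") == "move"
    )

    flagged = []
    for c in changes:
        if c.get("logtype") == "move":
            # Pattern 1: the same page was moved many times in this batch.
            if move_counts[c["title"]] >= move_threshold:
                flagged.append(c)
            continue

        # Pattern 2: the edit removed more than half of the page.
        oldlen = c.get("oldlen", 0)
        newlen = c.get("newlen", 0)
        if oldlen > 0 and newlen < oldlen * shrink_ratio:
            flagged.append(c)

    return flagged
```

Anything this flags would still need a human look before reverting; the
point is only to shorten the list someone has to read.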
Interestingly, much of this activity is not "link spamming" -- it's
just automated vandalism. It could be that the particular text being
left behind is a way of using our wiki as a decentralized content
store for these bots; I can't think of any other purpose.
Since most of these edits don't add new links (perhaps because we're
already blocking those sorts of spam edits effectively), the existing
link-oriented tools are a poor match. What we need is either humans or
bots to identify vandalism edits and revert them, and preferably to ban
the account/IP that caused them. At this moment, I'm investigating
tooling that should make it easier to:
* Review all edits since a given date
* Revert the ones that are spammy
* Block users/IPs that are spamming
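As a sketch of the first step, here is one way to ask MediaWiki for every
change since a given date and group the results by user, so that one
spammer's edits can be reviewed together. This is an illustration under
assumptions, not the tooling I'm investigating: the api.php location is a
guess based on where index.php lives, and the function names are made up.
The query parameters (list=recentchanges, rcdir, rcstart, rcprop, rclimit)
are standard MediaWiki API parameters.

```python
from collections import defaultdict
from datetime import datetime, timezone
from urllib.parse import urlencode

# Assumption: api.php sits next to index.php under /w/ on this wiki.
API_URL = "https://openhatch.org/w/api.php"


def recent_changes_url(since, api_url=API_URL, limit=500):
    """Build a recent-changes query URL for edits made at or after `since`.

    `since` should be a timezone-aware datetime; it is converted to UTC.
    """
    start = since.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    params = {
        "action": "query",
        "list": "recentchanges",
        "rcdir": "newer",      # oldest first, so rcstart is the lower bound
        "rcstart": start,
        "rcprop": "title|ids|user|timestamp|sizes",
        "rclimit": limit,
        "format": "json",
    }
    return api_url + "?" + urlencode(params)


def group_by_user(changes):
    """Group change dicts by their "user" field, preserving order."""
    by_user = defaultdict(list)
    for change in changes:
        by_user[change["user"]].append(change)
    return dict(by_user)
```

Fetching the URL and reading the "recentchanges" list out of the JSON
response is left out here; reverting and blocking need a logged-in session
with the right permissions, so those steps are not sketched.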
For spammers using OpenHatch accounts, we could go the full route of
deleting the account across all OpenHatch sites that use our central
login, or we could just block the account in the wiki. For now, it's
simplest to automate blocking the account in the wiki.
I'm particularly intrigued by the idea of doing this all from within
'git', via this package:
https://github.com/Bibzball/Git-Mediawiki/wiki/User-manual
I'm doing a 'git clone' of the wiki now. I will follow up to this thread
with more information about whether I can make these tools useful for
reviewing and reverting vandals' edits. If so, I can document what I've
done for others to see.
-- Asheesh.
P.S. I still think that machine learning would help here, but I'm not
going to put that on the critical path.