[OH-Dev] Thinking about moving oh-bugimporters to depend on scrapy

Fri May 18 21:20:31 UTC 2012

Howdy all,

I spent a few minutes today trying to clean up the way we use Twisted in 
the bug importers. As I did that, I wondered if maybe we should add a 
dependency on a richer library (scrapy), and I wanted feedback from people 
who've played with that part of the code.

In particular, I wanted feedback from Berry, since you've been doing some
hacking on this part of the code recently. Other possibly-interested
people: Jessica McKellar (with your Twisted experience); Jack Grigg, if
you have some time, since you wrote much of our current implementation;
and John Morrissey (who contributed to the oh-bugimporters refactoring).
Also, anyone else with an opinion.

"scrapy" <http://scrapy.org/> is an asynchronous HTTP downloading 
framework, and it seems to be a good fit for what we do with the bug 
importers. In particular, some upsides:

* It does asynchronous HTTP downloading, just like we do right now, so if 
we switch to it, I don't have to give up the event-driven religion. It 
also means bug downloading will stay speedy.

* scrapy lacks some of the bugs of our current implementation -- namely,
our current downloader sometimes seems never to terminate:
http://openhatch.org/+meta/ is stuck at 22% and I don't know if it's a bug
in our downloading code or something else.

* scrapy is well-documented and has a very clean API, unlike the current 
code with its various calls to poorly-named, unmaintained functions. 
(Sorry Jack! This is my fault too. : P)

* Because scrapy can be easily configured to store downloaded data as 
"Item" objects, rather than pass it through to the database, it would be 
easy for people to hack on the bug downloading code and play with its 
output -- they would get a list of JSON objects, and could interact with 
them on a shell, rather than those objects being pushed all the way into 
the database first.

* It *seems* possible to do the migration piecemeal -- we can do e.g. 
trac.py first, and then the other bug importers separately. trac.py is a 
good candidate for the first switch because it's already covered by tests 
within oh-bugimporters.

* Such a transition would hopefully clarify the separation of concerns --
oh-bugimporters contains information about how to download and parse bug
data from the web, and scrapy's terminology ("Item pipeline") is
well-documented and probably easy to explain.

* All in all, I'm pretty sure this will result in way less code that we 
maintain, and way less documentation for us to write.
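
To make the "play with the output on a shell" point a bit more concrete,
here is a rough sketch in plain Python. It deliberately does not use
scrapy's API; parse_bug, items_to_json_lines, the field names, and the
example.com URLs are all made up for illustration. The only idea it shows
is parsing bugs into plain items and dumping them as JSON before anything
touches the database.

```python
import json


def parse_bug(raw):
    """Turn one raw bug record (a dict of strings) into a plain item dict.

    The field names here are assumptions, not the real oh-bugimporters schema.
    """
    return {
        "title": raw["summary"].strip(),
        "status": raw.get("status", "unknown").lower(),
        "url": raw["url"],
    }


def items_to_json_lines(items):
    """Serialize items as one JSON object per line, for easy inspection."""
    return "\n".join(json.dumps(item, sort_keys=True) for item in items)


# Made-up input standing in for data a bug importer would have downloaded.
raw_bugs = [
    {"summary": " Crash on startup ", "status": "NEW",
     "url": "http://example.com/ticket/1"},
    {"summary": "Typo in docs", "url": "http://example.com/ticket/2"},
]

items = [parse_bug(raw) for raw in raw_bugs]
print(items_to_json_lines(items))
```

Someone hacking on a parser could load those lines back with json.loads in
a Python shell and poke at them, without needing a database at all.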

Some downsides, just so that people see I'm trying to make a fair 
evaluation:

* It means giving up quite a bit of pride and deciding we're not smart 
enough to use Twisted directly. (:

* The one part we might have to implement again is the handling of mock 
responses, where we have files in the repository that are snapshots of old 
bug URLs so that we can run our tests.

* It means adding a dependency. I've been watching scrapy for a while, and 
scrapy seems well-maintained and widely-used and widely-liked, and I for 
one like adding dependencies.

* It means continuing to spend time on refactoring. I think we need to 
spend this time anyway, as 22% "stale" is really bad.
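
On the mock-responses point, the piece we might need to rebuild could be
as small as a lookup from bug URL to a snapshot file in the repository.
Here is a minimal sketch, assuming a naming scheme I invented for the
example (one file per URL, named by the URL's SHA-1); snapshot_path,
save_snapshot, and load_snapshot are hypothetical helpers, not existing
oh-bugimporters or scrapy functions.

```python
import hashlib
import os
import tempfile


def snapshot_path(snapshot_dir, url):
    # Hypothetical naming scheme: one file per URL, named by the URL's SHA-1.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(snapshot_dir, digest + ".html")


def save_snapshot(snapshot_dir, url, body):
    """Store the bytes of a downloaded page so tests can replay it later."""
    with open(snapshot_path(snapshot_dir, url), "wb") as f:
        f.write(body)


def load_snapshot(snapshot_dir, url):
    """Return the stored bytes for url.

    Raises FileNotFoundError if no snapshot exists for that URL.
    """
    with open(snapshot_path(snapshot_dir, url), "rb") as f:
        return f.read()


# Demo with a throwaway directory and a made-up bug URL.
snapshot_dir = tempfile.mkdtemp()
bug_url = "http://example.com/ticket/1"
save_snapshot(snapshot_dir, bug_url,
              b"<html><title>#1 Crash on startup</title></html>")
page = load_snapshot(snapshot_dir, bug_url)
print(page.decode("utf-8"))
```

A test would then feed the bytes from load_snapshot to the parser in place
of a real HTTP response, so the test suite never has to hit the network.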

Next steps:

* Are there other pros or cons I've forgotten?

* I'd like an opinion: is this a good idea, or a *great* idea? (;

-- Asheesh.