[OH-Dev] Thinking about moving oh-bugimporters to depend on scrapy

Mon May 21 21:14:27 UTC 2012

Excerpts from Asheesh Laroia's message of Fri May 18 17:20:31 -0400 2012:
> Howdy all,
> 
> I spent a few minutes today trying to clean up the way we use Twisted in 
> the bug importers. As I did that, I wondered if maybe we should add a 
> dependency on a richer library (scrapy), and I wanted feedback from people 
> who've played with that part of the code.
> 
> In particular, I wanted feedback from Berry, since you've been doing some 
> hacking on this part of the code recently. Other possibly-interested 
> people: Jessica McKellar (with your Twisted experience); Jack Grigg, if 
> you have some time, since you wrote much of our current implementation; 
> and John Morrissey, who contributed to the oh-bugimporters refactoring. 
> Also, anyone else with an opinion.
> 
> "scrapy" <http://scrapy.org/> is an asynchronous HTTP downloading 
> framework, and it seems to be a good fit for what we do with the bug 
> importers. In particular, some upsides:
> 
> * It does asynchronous HTTP downloading, just like we do right now, so if 
> we switch to it, I don't have to give up the event-driven religion. It 
> also means bug downloading will stay speedy.
> 
> * scrapy lacks some of the bugs of our current implementation -- namely, 
> our current downloader may never terminate: 
> http://openhatch.org/+meta/ is stuck at 22% stale, and I don't know if 
> that's a bug in our downloading code or something else.
> 
> * scrapy is well-documented and has a very clean API, unlike the current 
> code with its various calls to poorly-named, unmaintained functions. 
> (Sorry Jack! This is my fault too. : P)
> 
> * Because scrapy can be easily configured to store downloaded data as 
> "Item" objects, rather than pass it through to the database, it would be 
> easy for people to hack on the bug downloading code and play with its 
> output -- they would get a list of JSON objects, and could interact with 
> them on a shell, rather than those objects being pushed all the way into 
> the database first.
> 
> * It *seems* possible to do the migration piecemeal -- we can do e.g. 
> trac.py first, and then the other bug importers separately. trac.py is a 
> good candidate for the first switch because it's already covered by tests 
> within oh-bugimporters.
> 
> * Such a transition hopefully clarifies the separation of concerns -- 
> oh-bugimporters contains information about how to download and parse bug 
> data from the web, and Scrapy's terminology ("Item pipeline") is 
> well-documented and probably easy to explain.
> 
> * All in all, I'm pretty sure this will result in way less code that we 
> maintain, and way less documentation for us to write.
> 
> Some downsides, just so that people see I'm trying to make a fair 
> evaluation:
> 
> * It means giving up quite a bit of pride and deciding we're not smart 
> enough to use Twisted directly. (:
> 
> * The one part we might have to implement again is the handling of mock 
> responses, where we have files in the repository that are snapshots of old 
> bug URLs so that we can run our tests.
> 
> * It means adding a dependency. I've been watching scrapy for a while, and 
> scrapy seems well-maintained and widely-used and widely-liked, and I for 
> one like adding dependencies.
> 
> * It means continuing to spend time on refactoring. I think we need to 
> spend this time anyway, as 22% "stale" is really bad.
> 
> Next steps:
> 
> * Are there other pros or cons I've forgotten?
> 
> * I'd like an opinion: is this a good idea, or a *great* idea? (;

As an update on this, berryp on #openhatch seemed to think it was
a good idea, and I still think it's a good idea. There's a current
bug where we are ignoring some data from the Bugzilla bug importers,
and I'm going to try to add scrapy in an effort to make the code
way easier to read while fixing that.

So consider this plan agreed-upon.
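
On the mock-response downside from the quoted message (snapshot files in the repository standing in for old bug URLs during tests), here is a rough sketch of how little code that piece might need. It is only an illustration using the standard library: the names `snapshot_path` and `SnapshotFetcher`, and the hashed-file-name layout, are made up for this example and are not what oh-bugimporters or scrapy actually do.

```python
import hashlib
import os


def snapshot_path(snapshot_dir, url):
    """Map a bug URL to a file inside snapshot_dir.

    Hypothetical layout: the file name is the SHA-1 of the URL, so any
    URL maps to a safe, stable file name.
    """
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return os.path.join(snapshot_dir, digest + ".html")


class SnapshotFetcher(object):
    """Test-only stand-in for a downloader: serves saved page bodies
    from disk instead of touching the network."""

    def __init__(self, snapshot_dir):
        self.snapshot_dir = snapshot_dir

    def save(self, url, body):
        # Record a snapshot (bytes) for later replay in tests.
        with open(snapshot_path(self.snapshot_dir, url), "wb") as f:
            f.write(body)

    def fetch(self, url):
        # Fail loudly when a test asks for a URL with no snapshot,
        # rather than silently going out to the real site.
        path = snapshot_path(self.snapshot_dir, url)
        if not os.path.exists(path):
            raise KeyError("no snapshot recorded for %s" % url)
        with open(path, "rb") as f:
            return f.read()
```

A test would point `SnapshotFetcher` at a directory of saved pages and hand the returned bytes to whatever parsing code is under test, so the parser never needs to know whether the data came from the web or from disk.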

-- Asheesh.