
[OH-Dev] Status of scrapy-ification of bugimporters (and request for help)

Asheesh Laroia asheesh at asheesh.org
Wed Sep 5 19:21:19 UTC 2012


Hello, dear OH-Dev-ers,

I spent some time over Labor Day weekend refactoring the oh-bugimporters 
code to use scrapy, rather than our homebrew, not-quite-functioning async 
downloading framework.

It's here for now (note that I may eventually rebase this branch):
https://github.com/paulproteus/oh-bugimporters/tree/experimental/scrapy-ify-trac

There's a README.scrapy file in there that says how to run it.

Results
-------

It works: it crawls all the Trac instances with a very high success rate 
(about 50 unhandled exceptions against about 8,000 bugs downloaded), and 
it does so in just 2.5 hours!

The code is clean: every one of our methods now simply returns a 
combination of Requests and Items. The scrapy framework grabs the Item 
objects that get returned and saves them as JSON; the Request objects it 
queues up for further processing.

I find this superbly clean, and it will indeed let us remove lots and 
lots of code from oh-bugimporters and oh-mainline.
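
To make that concrete, here's a minimal sketch of the pattern. All the 
names below are illustrative, not the actual classes on the branch:

    import scrapy

    class ParsedBug(scrapy.Item):
        # Illustrative fields -- see the branch for the real item.
        title = scrapy.Field()
        canonical_bug_link = scrapy.Field()

    class TracSpider(scrapy.Spider):
        name = 'trac-sketch'
        start_urls = ['http://trac.example.org/query?format=csv']

        def parse(self, response):
            # Each Request we return gets queued by scrapy for download.
            for line in response.text.splitlines()[1:]:
                ticket_id = line.split(',')[0]
                yield scrapy.Request(
                    response.urljoin('/ticket/%s' % ticket_id),
                    callback=self.parse_bug)

        def parse_bug(self, response):
            # Each Item we return gets saved by the feed exporter as JSON.
            yield ParsedBug(
                title=response.css('h1::text').get(),
                canonical_bug_link=response.url)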

Thoughts about testing
----------------------

For manual testing, you just run the commands in README.scrapy. If you 
want to test on just one bug tracker that's failing, you make sure that's 
the only tracker in the list of trackers, and then re-run the scrapy 
command.
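
For instance, assuming the tracker list is a JSON file, trimming it down 
to one entry could look like this (the filenames and the 'tracker_name' 
field are hypothetical -- use whatever the real export contains):

    import json

    with open('all-trackers.json') as f:
        trackers = json.load(f)

    # Keep only the one tracker we want to re-test.
    suspect = [t for t in trackers if t.get('tracker_name') == 'Twisted']

    with open('one-tracker.json', 'w') as f:
        json.dump(suspect, f)

Then point the scrapy command from README.scrapy at one-tracker.json.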

For automated testing, we have some work to do: since every method now 
returns Items or Requests, a unit test should be able to hand a method a 
canned Response and assert on exactly what comes back, without touching 
the network. We just need a little scaffolding to build those canned 
Responses (see the fake downloader item further down).
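
Roughly, a test could then look like this (reusing the illustrative 
TracSpider and ParsedBug names from the sketch above, plus a 
hypothetical canned HTML file):

    from scrapy.http import HtmlResponse

    def test_parse_bug_yields_item():
        # Build a Response from canned bytes -- no network involved.
        with open('tests/sample-ticket.html', 'rb') as f:
            canned = f.read()
        response = HtmlResponse(url='http://trac.example.org/ticket/1',
                                body=canned, encoding='utf-8')
        results = list(TracSpider().parse_bug(response))
        assert any(isinstance(r, ParsedBug) for r in results)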

Things to be done: reasonably easy
----------------------------------

* Easy: Fixing up the patch set so that it doesn't have silly commit log 
messages like 'rofl'

* Easy: Fix up oh-mainline so it accepts this JSON data as input. (For 
example, I changed the serialization system so that we output 
datetime.datetime.isoformat() strings rather than custom datetime 
objects. Also, oh-mainline wants YAML files as input, but this generates 
JSON.) There's a sketch of the datetime round-trip after this list.

* Reasonably easy: Modify the code so that in case of a remote 404, 
oh-bugimporters sets a flag on items.ParsedBug called _deleted. Then 
oh-mainline knows that when it sees that flag, it should delete the 
corresponding Bug in the database. (Also sketched after this list.)

* Reasonably easy: The export of all the "TrackerModel" subclasses should 
include a list of current bug URLs, so that oh-bugimporters knows to 
download fresh data for those bugs.
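
To illustrate the datetime point in the second item above: oh-mainline 
will need to parse the isoformat() strings back into datetimes. A 
minimal sketch, assuming a hypothetical 'date_reported' key and strings 
without microseconds or timezone offsets:

    import json
    from datetime import datetime

    with open('bugs.json') as f:
        bugs = json.load(f)

    for bug in bugs:
        # isoformat() emits e.g. '2012-09-05T19:21:19', which strptime
        # parses back exactly (assuming no microseconds or tz offset).
        bug['date_reported'] = datetime.strptime(
            bug['date_reported'], '%Y-%m-%dT%H:%M:%S')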
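
And for the 404 item: scrapy normally filters out 404 responses, but a 
spider can opt in to receiving them via handle_httpstatus_list. A 
sketch, reusing the illustrative names from earlier (ParsedBug would 
also need a '_deleted' scrapy.Field declared):

    import scrapy

    class TracSpider(scrapy.Spider):
        name = 'trac-sketch'
        # Let 404s reach the callback instead of being dropped.
        handle_httpstatus_list = [404]

        def parse_bug(self, response):
            if response.status == 404:
                # Mark the bug so oh-mainline deletes its Bug row.
                yield ParsedBug(canonical_bug_link=response.url,
                                _deleted=True)
                return
            # ... normal parsing continues here ...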

Things to be done: Less easy
----------------------------

* Slightly harder: Make a trivial fake downloader function that takes a 
Request as input, plus a configuration dictionary mapping URLs to 
filenames on disk, and returns a Response object with that file's data. 
That way, we can keep using something like fakeGetPage to let the test 
suite run offline and predictably. (See the sketch after this list.)

* The other bug importers need to get ported as well. Right now, we only 
handle Trac instances. That's a large fraction of the bug importers, but 
there are GitHub, Roundup, Launchpad, and Google Code to port as well.

* Documentation: the new scrapy-based workflow needs proper docs; right 
now README.scrapy is all we have.
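
Here's the fake downloader promised in the first item above -- a 
minimal sketch using scrapy's HtmlResponse, where url2filename is the 
configuration dictionary:

    from scrapy.http import HtmlResponse

    def fake_download(request, url2filename):
        # Look the URL up in the dict, read that file off disk, and
        # wrap it in a Response -- offline and fully deterministic.
        with open(url2filename[request.url], 'rb') as f:
            body = f.read()
        return HtmlResponse(url=request.url, body=body,
                            request=request, encoding='utf-8')

Tests can then feed that Response straight into a parse method, which 
is exactly the role fakeGetPage played for the old Twisted code.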

Volunteers?
-----------

I'm finding myself swamped with prep stuff for Open Source Comes to 
Campus, but I would *so* love to see these problems addressed! You'd be a 
hero(ine).

If you're interested in this, a good first step is to get the code from 
the branch and make sure you can run it as per README.scrapy in the repo.

We could also make it a sprint this weekend in SF, if there's interest in 
that format.

What do you say? (:

-- 
-- Asheesh.
I'm often slow at email. Choose phone/IM for planning and discussion:
Phone: +1 (585) 506-8865 | GChat/Jabber: asheesh at asheesh.org
More options: http://asheesh.org/about/

