
[OH-Dev] Status of scrapy-ification of bugimporters (and request for help)

Adrian Ancona soonick5 at yahoo.com.mx
Wed Sep 5 20:54:43 UTC 2012


Hello Asheesh,

This change is so exciting! I am very interested in helping, but right 
now it is very hard for me to commit to a timeline. I will get into the 
IRC channel when I have some free time and work on whatever is 
available at the time.

Best regards,
Adrian Ancona

On 09/05/2012 02:21 PM, Asheesh Laroia wrote:
> Hello, dear OH-Dev-ers,
>
> I spent some time over Labor Day weekend refactoring the 
> oh-bugimporters code to use scrapy, rather than our homebrew, 
> not-quite-functioning async downloading framework.
>
> It's here for now: (note I may eventually rebase this branch) 
> https://github.com/paulproteus/oh-bugimporters/tree/experimental/scrapy-ify-trac
>
> There's a README.scrapy file in there that says how to run it.
>
> Results
> -------
>
> It works: It crawls all the Trac instances, with a very high success 
> rate (about 50 unhandled exceptions, and about 8000 bugs downloaded), 
> and it does so in just 2.5 hours!
>
> The code is clean: every one of our methods now simply returns a 
> combination of Requests and Items. The scrapy framework grabs the Item 
> objects that get returned and saves them as JSON; the Request objects 
> get queued up for further processing.
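To make the contract concrete, here is a minimal sketch of that pattern. It uses tiny stand-in classes rather than scrapy's real `Item`/`Request` types, and the function names (`parse_tracker`, `parse_bug`, `run`) are illustrative, not the actual names in the branch:

```python
import json


class Item(dict):
    """Stand-in for scrapy.Item: a bag of scraped bug fields."""


class Request:
    """Stand-in for scrapy.Request: a URL plus the callback that parses it."""
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback


def parse_tracker(bug_urls):
    """Yield one Request per bug URL; the engine queues each for download."""
    for url in bug_urls:
        yield Request(url, callback=parse_bug)


def parse_bug(url, data):
    """Yield an Item; the engine serializes every Item to JSON."""
    yield Item(url=url, title=data["summary"], status=data["status"])


def run(bug_urls, fetch):
    """Toy engine standing in for scrapy: drain the Request queue,
    collect the Items, and dump them as JSON."""
    items, queue = [], list(parse_tracker(bug_urls))
    while queue:
        request = queue.pop(0)
        for result in request.callback(request.url, fetch(request.url)):
            if isinstance(result, Request):
                queue.append(result)
            else:
                items.append(result)
    return json.dumps(items)
```

In the real branch the engine, `Request`, and `Item` all come from scrapy; the point here is only the "methods return Requests and Items" shape.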
>
> I find this totally superbly clean, and it will indeed let us remove 
> lots and lots of code from oh-bugimporters and oh-mainline.
>
> Thoughts about testing
> ----------------------
>
> For manual testing, you just run the commands in README.scrapy. If you 
> want to test on just one bug tracker that's failing, you make sure 
> that's the only tracker in the list of trackers, and then re-run the 
> scrapy command.
>
> For automated testing, we have some work to do: since every method 
> now returns Items and Requests, a test can call a parse method with a 
> canned Response and simply assert on what comes back, with no network 
> access needed.
>
> Things to be done: reasonably easy
> ----------------------------------
>
> * Easy: Fixing up the patch set so that it doesn't have silly commit 
> log messages like 'rofl'
>
> * Easy: Fix up oh-mainline so it accepts this JSON data as input (for 
> example, I changed the serialization system so that we output 
> datetime.datetime.isoformat() strings rather than custom datetime 
> objects). (Also, oh-mainline wants YAML files as input, but this 
> generates JSON.)
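The datetime change amounts to a `default` hook for `json.dumps` along these lines (a sketch, not the actual serializer in the branch; the field names are illustrative):

```python
import datetime
import json


def encode_datetimes(obj):
    """json.dumps hook: emit datetimes as ISO 8601 strings."""
    if isinstance(obj, (datetime.datetime, datetime.date)):
        return obj.isoformat()
    raise TypeError(f"not JSON serializable: {obj!r}")


bug = {
    "url": "http://example.com/ticket/1",
    "date_reported": datetime.datetime(2012, 9, 5, 20, 54, 43),
}
# datetime.datetime(2012, 9, 5, 20, 54, 43) serializes as "2012-09-05T20:54:43"
serialized = json.dumps(bug, default=encode_datetimes)
```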
>
> * Reasonably easy: Modify the code so that in case of a remote 404, 
> oh-bugimporters sets a flag on items.ParsedBug called _deleted. Then 
> oh-mainline knows that when it sees that, it should delete the 
> corresponding Bug in the database.
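Roughly, both halves of that handshake could look like this (a sketch: `items.ParsedBug` is replaced by a plain dict, and the field and function names other than `_deleted` are illustrative):

```python
def parse_bug_response(url, status_code, data=None):
    """Importer side: turn a downloaded response into a bug item.

    On a 404 the remote bug is gone, so mark the item _deleted
    instead of trying to parse it.
    """
    if status_code == 404:
        return {"canonical_bug_link": url, "_deleted": True}
    return {"canonical_bug_link": url, "title": data["summary"],
            "_deleted": False}


def apply_to_database(item, bugs_by_url):
    """oh-mainline side: delete the Bug whose item is marked _deleted,
    otherwise store the fresh data."""
    if item.get("_deleted"):
        bugs_by_url.pop(item["canonical_bug_link"], None)
    else:
        bugs_by_url[item["canonical_bug_link"]] = item
```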
>
> * Reasonably easy: The export of all the "TrackerModel" subclasses 
> should include a list of current bug URLs, so that oh-bugimporters 
> knows to download fresh data for those bugs.
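The exported tracker configuration might then look something like this (the field names are illustrative, not the actual TrackerModel schema):

```json
{
    "tracker_name": "Twisted",
    "bugimporter": "trac",
    "base_url": "http://twistedmatrix.com/trac/",
    "existing_bug_urls": [
        "http://twistedmatrix.com/trac/ticket/1234",
        "http://twistedmatrix.com/trac/ticket/5678"
    ]
}
```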
>
> Things to be done: Less easy
> ----------------------------
>
> * Slightly harder: Make a trivial fake downloader function that takes 
> a Request as input, and a configuration dictionary mapping URLs to 
> filenames on disk, and returns a Response object that has that data. 
> That way, we can continue to use something like fakeGetPage to let the 
> test suite run offline and predictably.
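A sketch of that fake downloader, using a stand-in `Response` class (the names `Response` and `fake_download` are illustrative; the real thing would mirror whatever response type the spiders consume):

```python
import tempfile


class Response:
    """Stand-in for a downloaded response: the URL plus the body."""
    def __init__(self, url, body):
        self.url = url
        self.body = body


def fake_download(url, url2filename):
    """Instead of hitting the network, look the URL up in a dict mapping
    URLs to filenames on disk and serve the file contents, so the test
    suite runs offline and deterministically."""
    with open(url2filename[url], "rb") as f:
        return Response(url, f.read())


# Demo: write a canned Trac page to disk and "download" it.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"<html>ticket 1</html>")
response = fake_download("http://example.com/ticket/1",
                         {"http://example.com/ticket/1": f.name})
```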
>
> * The other bug importers need to get ported as well. Right now, we 
> only handle Trac instances. That's a large fraction of the bug 
> importers, but there's github, roundup, launchpad, and Google Code to 
> port as well.
>
> * Documentation:
>
> Volunteers?
> -----------
>
> I'm finding myself swamped with prep stuff for Open Source Comes to 
> Campus, but I would *so* love to see these problems addressed! You'd 
> be a hero(ine).
>
> If you're interested in this, a good first step is to get the code 
> from the branch and make sure you can run it as per README.scrapy in 
> the repo.
>
> We could also make it a sprint this weekend in SF, if there's interest 
> in that format.
>
> What do you say? (:
>




More information about the Devel mailing list