[OH-Dev] Status of scrapy-ification of bugimporters (and request for help)
Adrian Ancona
soonick5 at yahoo.com.mx
Wed Sep 5 20:54:43 UTC 2012
Hello Asheesh,
This change is so exciting! I am very interested in helping, but right
now it is very hard for me to commit to a time-line. I will get into the
IRC channel when I have some free time and I will work on what is
available at the time.
Best regards,
Adrian Ancona
On 09/05/2012 02:21 PM, Asheesh Laroia wrote:
> Hello, dear OH-Dev-ers,
>
> I spent some time over Labor Day weekend refactoring the
> oh-bugimporters code to use scrapy, rather than our homebrew,
> not-quite-functioning async downloading framework.
>
> It's here for now: (note I may eventually rebase this branch)
> https://github.com/paulproteus/oh-bugimporters/tree/experimental/scrapy-ify-trac
>
> There's a README.scrapy file in there that says how to run it.
>
> Results
> -------
>
> It works: It crawls all the Trac instances, with a very high success
> rate (about 50 unhandled exceptions, and about 8000 bugs downloaded),
> and it does so in just 2.5 hours!
>
> The code is clean: every one of our methods now simply returns a
> combination of Requests and Items. The scrapy framework grabs the Item
> objects that get returned and saves them as JSON, and it queues the
> Request objects up for processing.
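> As a sketch of that pattern (the class definitions here are
> illustrative stand-ins, not the real oh-bugimporters or scrapy
> classes):

```python
# Minimal stand-ins for the framework's Request and Item types.
class Request:
    def __init__(self, url, callback):
        self.url = url
        self.callback = callback

class ParsedBug(dict):
    """An Item: a bag of parsed bug fields the framework saves as JSON."""

def handle_bug_list(urls):
    # The method never downloads anything itself; it just yields
    # Request objects, and the framework queues them for fetching.
    for url in urls:
        yield Request(url, callback=handle_single_bug)

def handle_single_bug(response_data):
    # Parsing one bug yields an Item, which the framework writes out.
    yield ParsedBug({'title': response_data['summary'],
                     'canonical_bug_link': response_data['url']})
```

> Because the methods are plain generators with no I/O of their own,
> a test can call one directly and look at what comes back.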
>
> I find this totally superbly clean, and it will indeed let us remove
> lots and lots of code from oh-bugimporters and oh-mainline.
>
> Thoughts about testing
> ----------------------
>
> For manual testing, you just run the commands in README.scrapy. If you
> want to test just one bug tracker that's failing, make sure it's the
> only tracker in the list of trackers, and then re-run the scrapy
> command.
>
> For automated testing, we have some work to do: since every method now
> returns an Item or a Request, tests can call a method directly and
> make assertions on what it returns, rather than stubbing out a network
> layer. Those tests still need to be written.
>
> Things to be done: reasonably easy
> ----------------------------------
>
> * Easy: Fixing up the patch set so that it doesn't have silly commit
> log messages like 'rofl'
>
> * Easy: Fix up oh-mainline so it accepts this JSON data as input (for
> example, I changed the serialization system so that we output
> datetime.datetime.isoformat() strings rather than custom datetime
> objects). (Also, oh-mainline wants YAML files as input, but this
> generates JSON.)
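> The datetime change amounts to something like this (a hypothetical
> sketch; the field names are illustrative, not the real schema):

```python
import datetime
import json

def serialize_bug(bug):
    # Emit ISO 8601 strings instead of datetime objects, so the result
    # round-trips through plain json.dumps without a custom encoder.
    out = dict(bug)
    for key, value in out.items():
        if isinstance(value, datetime.datetime):
            out[key] = value.isoformat()
    return out

bug = {'title': 'crash on startup',
       'date_reported': datetime.datetime(2012, 9, 5, 14, 21)}
print(json.dumps(serialize_bug(bug), sort_keys=True))
# -> {"date_reported": "2012-09-05T14:21:00", "title": "crash on startup"}
```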
>
> * Reasonably easy: Modify the code so that in case of a remote 404,
> oh-bugimporters sets a flag on items.ParsedBug called _deleted. Then
> oh-mainline knows that when it sees that, it should delete the
> corresponding Bug in the database.
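> A hypothetical sketch of that 404 handling (ParsedBug and the
> _deleted field are from the proposal above; the rest is illustrative):

```python
def handle_bug_response(status_code, url, parsed_fields=None):
    # Build the item as a plain dict for the sketch.
    bug = dict(parsed_fields or {})
    bug['canonical_bug_link'] = url
    if status_code == 404:
        # Signal to oh-mainline that the remote bug is gone, so it
        # should delete the corresponding Bug row in its database.
        bug['_deleted'] = True
    return bug
```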
>
> * Reasonably easy: The export of all the "TrackerModel" subclasses
> should include a list of current bug URLs, so that oh-bugimporters
> knows to download fresh data for those bugs.
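> That export could look roughly like this (field names are my
> assumption, not the actual oh-mainline TrackerModel schema):

```python
def export_tracker(tracker_name, base_url, bugs):
    # Alongside the tracker's configuration, include the URL of every
    # bug we currently have, so oh-bugimporters knows to re-fetch them.
    return {
        'tracker_name': tracker_name,
        'base_url': base_url,
        'existing_bug_urls': [bug['canonical_bug_link'] for bug in bugs],
    }
```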
>
> Things to be done: Less easy
> ----------------------------
>
> * Slightly harder: Make a trivial fake downloader function that takes
> a Request as input, and a configuration dictionary mapping URLs to
> filenames on disk, and returns a Response object that has that data.
> That way, we can continue to use something like fakeGetPage to let the
> test suite run offline and predictably.
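> A minimal sketch of such a fake downloader, in the spirit of
> fakeGetPage (FakeResponse is a stand-in, not the real scrapy
> Response class):

```python
class FakeResponse:
    """Stand-in for a downloader response: just a URL and a body."""
    def __init__(self, url, body):
        self.url = url
        self.body = body

def make_fake_downloader(url_to_filename):
    # Given a dict mapping URLs to filenames on disk, return a function
    # that "downloads" a URL by reading the corresponding file.
    def download(request_url):
        with open(url_to_filename[request_url], 'rb') as f:
            return FakeResponse(request_url, f.read())
    return download
```

> With the URL-to-file mapping checked into the test suite, the tests
> run offline and always see the same canned pages.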
>
> * The other bug importers need to get ported as well. Right now, we
> only handle Trac instances. That's a large fraction of the bug
> importers, but there's github, roundup, launchpad, and Google Code to
> port as well.
>
> * Documentation:
>
> Volunteers?
> -----------
>
> I'm finding myself swamped with prep stuff for Open Source Comes to
> Campus, but I would *so* love to see these problems addressed! You'd
> be a hero(ine).
>
> If you're interested in this, a good first step is to get the code
> from the branch and make sure you can run it as per README.scrapy in
> the repo.
>
> We could also make it a sprint this weekend in SF, if there's interest
> in that format.
>
> What do you say? (:
>