[OH-Dev] Status of scrapy-ification of bugimporters (and request for help)
Asheesh Laroia
asheesh at asheesh.org
Wed Sep 5 19:21:19 UTC 2012
Hello, dear OH-Dev-ers,
I spent some time over Labor Day weekend refactoring the oh-bugimporters
code to use scrapy, rather than our homebrew, not-quite-functioning async
downloading framework.
It's here for now: (note I may eventually rebase this branch)
https://github.com/paulproteus/oh-bugimporters/tree/experimental/scrapy-ify-trac
There's a README.scrapy file in there that says how to run it.
Results
-------
It works: it crawls all the Trac instances with a very high success rate
(roughly 50 unhandled exceptions against about 8,000 bugs downloaded), and
it does so in just 2.5 hours!
The code is clean: every one of our methods now simply returns a
combination of Requests and Items. The scrapy framework grabs the Item
objects that get returned and saves them as JSON; the Request objects it
queues up for further downloading.
I find this totally superbly clean, and it will indeed let us remove lots
and lots of code from oh-bugimporters and oh-mainline.
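To make that concrete, here's roughly what a callback looks like now.
(This is a simplified sketch: extract_bug_urls() and extract_title() are
made-up helpers, and the real ParsedBug has more fields than this.)

    from scrapy.http import Request
    from scrapy.item import Item, Field

    class ParsedBug(Item):
        # Illustrative subset of fields; the real item carries more.
        url = Field()
        title = Field()

    # Inside the spider class:
    def parse_bug_list(self, response):
        # Every bug we discover becomes a Request; scrapy queues it
        # up and later hands the downloaded page to parse_bug().
        for bug_url in extract_bug_urls(response):
            yield Request(bug_url, callback=self.parse_bug)

    def parse_bug(self, response):
        # Every parsed bug becomes an Item, which scrapy's feed
        # exporter writes out as JSON.
        yield ParsedBug(url=response.url,
                        title=extract_title(response))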
Thoughts about testing
----------------------
For manual testing, you just run the commands in README.scrapy. If you
want to test only one bug tracker that's failing, make sure it's the only
tracker in the list of trackers, and then re-run the scrapy command.
For automated testing, we have some work to do, but the new structure
helps: since every method returns an Item or a Request, a test can call a
parse method directly with a canned Response object and assert on whatever
comes back, with no network involved.
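A sketch of what such a test might look like (the sample file path is
invented, and 'spider' stands in for however we instantiate the Trac
importer):

    from scrapy.http import HtmlResponse

    def test_parse_bug():
        # Build a Response from canned HTML on disk: no network, no
        # Twisted reactor, just a plain function call.
        body = open('tests/sample-data/trac-bug.html').read()
        response = HtmlResponse(url='http://trac.example.org/ticket/1',
                                body=body, encoding='utf-8')
        results = list(spider.parse_bug(response))
        # Everything that comes back is an Item or a Request, so we
        # can make assertions on ordinary objects.
        assert results[0]['url'] == 'http://trac.example.org/ticket/1'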
Things to be done: reasonably easy
----------------------------------
* Easy: Fixing up the patch set so that it doesn't have silly commit log
messages like 'rofl'
* Easy: Fix up oh-mainline so it accepts this JSON data as input (for
example, I changed the serialization system so that we output
datetime.datetime.isoformat() strings rather than custom datetime
objects; there's a quick example after this list). (Also, oh-mainline
wants YAML files as input, but this generates JSON.)
* Reasonably easy: Modify the code so that, in case of a remote 404,
oh-bugimporters sets a flag on items.ParsedBug called _deleted. Then, when
oh-mainline sees that flag, it knows to delete the corresponding Bug in
the database. (There's a sketch of this after the list, too.)
* Reasonably easy: The export of all the "TrackerModel" subclasses should
include a list of current bug URLs, so that oh-bugimporters knows to
download fresh data for those bugs.
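On the datetime point above: isoformat() gives us plain strings that
survive a round-trip through JSON, for example:

    import datetime

    stamp = datetime.datetime(2012, 9, 5, 19, 21, 19)
    print(stamp.isoformat())  # prints '2012-09-05T19:21:19'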
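And here's the 404 idea as a sketch. The handle_httpstatus_list attribute
is real scrapy; the _deleted field (which ParsedBug would grow as a new
Field) and the surrounding method are the part we'd add:

    # Inside the Trac spider class:
    handle_httpstatus_list = [404]  # let 404s reach our callbacks

    def parse_bug(self, response):
        if response.status == 404:
            # The bug is gone upstream; flag it so oh-mainline can
            # delete the corresponding Bug when it imports this.
            yield ParsedBug(url=response.url, _deleted=True)
            return
        # ... normal parsing continues here ...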
Things to be done: Less easy
----------------------------
* Slightly harder: Make a trivial fake downloader function that takes a
Request as input, plus a configuration dictionary mapping URLs to
filenames on disk, and returns a Response object containing that data.
That way, we can keep using something like fakeGetPage to let the test
suite run offline and predictably. (There's a sketch after this list.)
* The other bug importers need to be ported as well. Right now, we only
handle Trac instances. That's a large fraction of the bug importers, but
there's GitHub, Roundup, Launchpad, and Google Code to port as well.
* Documentation: so far the new workflow is only described in
README.scrapy; it deserves real docs.
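For the fake downloader, something along these lines should do (untested
sketch):

    from scrapy.http import HtmlResponse

    def fake_download(request, url_to_filename):
        # Look up the canned page for this URL and wrap it in a
        # Response, as though it had come over the network.
        path = url_to_filename[request.url]
        body = open(path).read()
        return HtmlResponse(url=request.url, body=body,
                            encoding='utf-8', request=request)

Tests would then feed the returned Response straight into the same
callback the live crawl uses.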
Volunteers?
-----------
I'm finding myself swamped with prep stuff for Open Source Comes to
Campus, but I would *so* love to see these problems addressed! You'd be a
hero(ine).
If you're interested in this, a good first step is to get the code from
the branch and make sure you can run it as per README.scrapy in the repo.
We could also make it a sprint this weekend in SF, if there's interest in
that format.
What do you say? (:
--
-- Asheesh.
I'm often slow at email. Choose phone/IM for planning and discussion:
Phone: +1 (585) 506-8865 | GChat/Jabber: asheesh at asheesh.org
More options: http://asheesh.org/about/