This site is an archive; learn more about 8 years of OpenHatch.

[OH-Dev] Moving the bug importer code out of oh-mainline into "bugimporters"

Jack Grigg me at jackgrigg.com
Fri Oct 28 04:58:02 UTC 2011


-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Hey,

Given that I wrote basically all of this code, I am very happy to
provide insight into how it works ^_^

On 28/10/11 16:42, Asheesh Laroia wrote:
> At recent OH-Dev weekly meetings, we've talked about moving some of
> our data import/export code out of the "oh-mainline" repository 
> into a separate Python package.
> 
> You can see the start of that work here:
> 
> https://gitorious.org/openhatch/bugimporters/trees/master
> 
> That has all the code that powers the new, asynchronous bug
> importers. I think it will be faster and easier to increase the
> quality and testing and code-coverage of the new bug import code
> with it separated out.
> 
> For now, we don't use this "bugimporters" repo. It's an
> experiment. But I would really love to make that code work, and
> then delete the corresponding code from "oh-mainline".
> 
> Some further thoughts:
> 
> * There is a README file which explains how to set up a
> development environment, which should work pretty well on any
> Linuxy system.
> 
> * The big idea behind separating this out is that we should add
> way more automated testing to this all-important, error-prone part
> of the code, and make this code easier to maintain.
> 
> * It does asynchronous network I/O. A little about that:
> 
> A lot of web scraping/API-using code has the form:
> 
>>>> data = urllib2.urlopen("http://example.com/").read() 
>>>> do_some_processing_on(data)
> 
> Because we always separate the *downloading* from the *processing* 
> of network data, it is easy for us to add fake data that we've
> pre- downloaded as data for the test suite.
> 
> * I'd like to apply a code coverage tool to the bug import code,
> too -- http://nedbatchelder.com/code/coverage/ -- so we can see
> which parts are not tested
> 
> * Right now, it still has a few things like "import mysite..." at
> the top of files -- that's parts where it depends on the main
> OpenHatch site. It should not do that, and so any place where it
> does that is a bug. Wherever this code does that, we have to come
> up with a way to lose those dependencies.
> 
> * Most of the times, those imports are pulling in Django models
> because the bug import code wants to find out the answer to
> questions like, "Which bugs are stale?" so it can then go fetch
> those bugs.
> 
> * Instead, we could pass it e.g. a list of stale bugs so that it
> doesn't have to depend on the Django database-Python glue code that
> lives in "mysite".
> 

My feeling with this is that the best way is for the bugimporter to
define some base classes, and anyone that wants to use the bugimporter
module defines subclasses that contain the site-specific processing.
Or in other cases, pass extra callback methods in. The main thing is,
that bugimporters.base.BugImporter has no dependencies on mysite, and
it is where all the heavy lifting (or rather, the deferred
management). It is the subclasses of this that utilize mysite a bit,
but most of this can be factored out.

The other big classes to worry about are the BugParser classes e.g.
BugzillaBugParser. Each of these takes the obtained bug data (however
that particular tracker type gives it) and returns the specific data
dict required for our Bug model. This is very much not site-agnostic,
as it requires prior knowledge of how the Bug model is constructed
i.e. what information it is that we care about. So in my mind, the
BugParser classes will need to stay in the OH tree, and be passed into
the bugimporter package.

In fact, take the BugzillaBugImporter as an example. Here, the two
places where we have site-specific info are:

prepare_bug_urls()
This uses a Bug query to determine what Bugs are fresh (rather than
stale, because subtracting fresh from the query gives both stale AND
new bugs). In fact, while all the BugImporter subclasses have
differing versions of this method, they all use this particular query.
This query could be turned into a passed-in method that just outputs a
list of bug URLs that are fresh.

handle_bug_xml(bug_list_xml_string)
This is very site-specific, and is the only place the
BugzillaBugParser is used. So I would factor this out entirely, and
either require that we pass in this method so it can be added as a
callback, or override it somehow in a sub-subclass.

> I'm worried about a few classes of bugs in the bug import code.
> First, I fear that we might crash on some bug data that we download
> from bug trackers. This is what th sample HTML data is supposed to
> address.
> 

This is handled by the errbacks. I put in the groundwork to be able to
add errbacks easily, but never actually got many in there =P The only
one with a proper errback is the BugzillaBugImporter, because this was
prone to RequestEntityTooLarge and timeout errors. But it is trivial
to catch other errors here. From what I recall, because I add the
errback after the callback, the errback catches errors from the
callback as well as the original Twisted getPage, so the errback can
be a catchall.

> Another is that maybe when we enqueue work, the work gets lost in
> the queue somehow. The bugimporters code relies on the
> "reactor_manager" object to do all the enqueueing, so that's
> something that should be tested separately. I can get on that.
> 

This is not entirely true. The "reactor manager" does do all the
queueing for DIAs, but for the bugimporter it purely starts things off
and then keeps track of how many total running deferreds there are, so
it knows when to stop the reactor - hence, "reactor manager".
bugimporter.base.BugImporter is where everything happens (see above).

Also, work won't get lost, because I specifically engineered it to be
so (I recall spending quite some time planning the logic while
procrastinating in Leuven =P). There are checks all over the place
that see if there is work left, and if so then start queueing it
again. And likewise, BugImporter adds common callbacks to both the
callback and error streams to decrement running_deferreds, so the
reactor won't ever think there is more running than there actually is
(at least, not due to the bugimporter. DIAs on the other hand... ;).

Of course, the above assumes there are no bugs [sic] in the code. And
who would really trust me to be the judge of that? =D

> Someone on IRC popped in (I'll let her introduce herself), and I
> advised: If you're interested in looking into this, I would say the
> first place to start is to comb through the existing files and
> remove references to mysite.whatever, and see if you can come up
> with what the callbacks should be.
> 

If you are interested, then firstly, thank you for being interested in
a section of code that is rather undermanned! And secondly, good luck
- - this code is a large part of my work in OpenHatch, so while I worked
to improve it from it's humble beginnings, it may have acquired some
quirks related to the way I approach problems, and therefore take some
time to understand (as I said above, I am very happy to help with the
thinking, and the coding as well if I ever have more than negative
time =D ). Speaking of which, do respond with queries about the above
- - I cannot speak for its legibility or comprehendability ^_^

Cheers,
Jack

> Whew!
> 
> -- Asheesh. _______________________________________________ Devel
> mailing list Devel at lists.openhatch.org 
> http://lists.openhatch.org/mailman/listinfo/devel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQEcBAEBAgAGBQJOqjZSAAoJEI3j92WSe+b36m0H/3IvRmdAtPResLCQEV9Ncf2m
BypaS2XPuCjI3fFE4nqjD/26qC/DBB7lMIf25F4WTmkjW2Gz+83v1FCFxouswlyh
F/WlTtOpUBPfuGZtH5ck3se92BtfVjmdCdHIpxurLS+jPKgI1TeoVnXGcmOFZR1w
E7/4/LVubtiIvFfe5qFPiA6Wb7VVbVBZ0WI8MgoMsfvYODx6ZW4FJKnCjH3Ryzi/
JnTd1pA6w19fqI/YYcefYlC+1g8hlIgyXl7MaWWGXfG/HixcBCxFhW5xCjkdMN1o
jYbsWkHQCTQoKI0ISoEHX6HptjFP8wUGB/cZtyaqb6VPU5J/U5bwYlArSGvwaZU=
=OeRJ
-----END PGP SIGNATURE-----


More information about the Devel mailing list