[OH-Dev] Moving the bug importer code out of oh-mainline into "bugimporters"

Fri Oct 28 03:42:07 UTC 2011

At recent OH-Dev weekly meetings, we've talked about moving some
of our data import/export code out of the "oh-mainline" repository
into a separate Python package.

You can see the start of that work here:

https://gitorious.org/openhatch/bugimporters/trees/master

That has all the code that powers the new, asynchronous
bug importers. I think it will be faster and easier to improve
the quality, testing, and code coverage of the new bug import
code once it is separated out.

For now, we don't use this "bugimporters" repo. It's an experiment.
But I would really love to make that code work, and then delete the
corresponding code from "oh-mainline".

Some further thoughts:

* There is a README file which explains how to set up a development
  environment, which should work pretty well on any Linuxy system.

* The big idea behind separating this out is that we should add way
  more automated testing to this all-important, error-prone part of
  the code, and make this code easier to maintain.

* It does asynchronous network I/O. A little about that:

A lot of web scraping/API-using code has the form:

>>> data = urllib2.urlopen("http://example.com/").read()
>>> do_some_processing_on(data)

In the bugimporters code, we always separate the *downloading* from
the *processing* of network data, so it is easy for us to feed fake,
pre-downloaded data to the processing step in the test suite.
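
As a rough sketch of what that separation buys us (the function names
and sample data here are made up for illustration, not taken from the
bugimporters code), the processing step only ever sees text that has
already been downloaded, so a test can swap in a canned response:

```python
import json


def process_bug_data(raw_text):
    """Processing step: turn already-downloaded JSON text into bug titles.

    It never touches the network, so tests can call it directly.
    """
    return [bug["title"] for bug in json.loads(raw_text)]


def import_bugs(url, download):
    """Glue: `download` is any callable mapping a URL to response text."""
    return process_bug_data(download(url))


# Pre-downloaded sample data, standing in for a real tracker response.
SAMPLE_RESPONSE = '[{"title": "Crash on startup"}, {"title": "Typo in README"}]'


def fake_download(url):
    """Test double: returns the canned response instead of using the network."""
    return SAMPLE_RESPONSE


titles = import_bugs("http://example.com/bugs", fake_download)
```

In production the "download" argument would be whatever really fetches
the page; the test suite never needs a network connection.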

* I'd like to apply a code coverage tool to the bug import code, too
  -- http://nedbatchelder.com/code/coverage/ -- so we can see which parts
  are not tested.

* Right now, it still has a few things like "import mysite..." at the top of
  files. Those are the places where it depends on the main OpenHatch site.
  It should not do that, so any place where it does is a bug, and we have
  to come up with a way to lose those dependencies.

* Most of the time, those imports are pulling in Django models because the
  bug import code wants to find out the answer to questions like, "Which bugs
  are stale?" so it can then go fetch those bugs.

* Instead, we could pass it e.g. a list of stale bugs so that it doesn't have to
  depend on the Django database-Python glue code that lives in "mysite".
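
To make that last idea concrete, here is a minimal sketch, assuming a
made-up "BugImporter" class and "fetch_bug" callable (neither is the
real bugimporters API). The caller works out which bugs are stale and
hands over plain URLs, so nothing in this code imports from "mysite":

```python
class BugImporter(object):
    """Hypothetical importer that is told which bugs to refresh."""

    def __init__(self, fetch_bug):
        # fetch_bug: any callable mapping a bug URL to its raw data.
        self.fetch_bug = fetch_bug

    def refresh(self, stale_bug_urls):
        """Fetch each stale bug and return a dict of URL -> raw data."""
        return {url: self.fetch_bug(url) for url in stale_bug_urls}


def fake_fetch(url):
    """Test double standing in for a real network fetch."""
    return "data for " + url


# The main site would build this list from its own database query.
stale = ["http://example.com/bug/1", "http://example.com/bug/2"]
refreshed = BugImporter(fake_fetch).refresh(stale)
```

The question "Which bugs are stale?" stays on the OpenHatch site's
side, and the importer only ever sees the answer.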

I'm worried about a few classes of bugs in the bug import code. First, I fear that
we might crash on some bug data that we download from bug trackers. This is what the
sample HTML data is supposed to address.

Second, I fear that when we enqueue work, the work might get lost in the queue somehow.
The bugimporters code relies on the "reactor_manager" object to do all the enqueueing,
so that's something that should be tested separately. I can get on that.

Someone on IRC popped in (I'll let her introduce herself), and I advised: If you're
interested in looking into this, I would say the first place to start is to comb
through the existing files and remove references to mysite.whatever, and see if
you can come up with what the callbacks should be.

Whew!

-- Asheesh.