[OH-Dev] [issue793] oh-bugimporters should do per-domain backoff
Asheesh Laroia
bugs at openhatch.org
Tue Nov 20 16:04:43 UTC 2012
New submission from Asheesh Laroia <asheesh at asheesh.org>:
Some bug trackers (openhatch.org/bugs/ especially...) report HTTP 504 Gateway
Timeout if you request more than 1-2 bugs per second.
The way Scrapy handles this now is in the retry middleware
(http://doc.scrapy.org/en/0.12/topics/downloader-middleware.html#module-scrapy.contrib.downloadermiddleware.retry),
which re-queues the failed request but doesn't insist on any time delay before
retrying.
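(For the record, the stock middleware is only configurable in how many times and
on which HTTP codes it retries; there's no setting for how long to wait. Values
below are illustrative, not our actual settings:)

    RETRY_TIMES = 2                           # retries per failed request
    RETRY_HTTP_CODES = [500, 502, 503, 504]   # 504 is retried, just with no pause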
It'd be nice to have a custom RetryMiddleware that did per-domain backoff. (Note
that we're sort of abusing the Scrapy architecture; we're supposed to have one
"spider" class per domain, but instead we only have one.)
One way to do this is to provide a custom subclass of
scrapy.contrib.downloadermiddleware.retry.RetryMiddleware and then override the
_retry method.
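Here's a rough, untested sketch of the idea. Two caveats: it's written against a
newer Scrapy layout than the 0.12 docs above (the contrib module paths have since
moved), and it bumps the per-domain downloader slot's delay, which is the same
internal hook the AutoThrottle extension uses, so it's not public API and may
need adjusting for whichever Scrapy version we pin:

    # Untested sketch; assumes the newer scrapy.downloadermiddlewares.retry
    # module path and pokes the downloader slot's `delay` attribute, which
    # is an internal, version-dependent hook.
    from urllib.parse import urlparse

    from scrapy.downloadermiddlewares.retry import RetryMiddleware


    class BackoffRetryMiddleware(RetryMiddleware):
        """Retry failed requests with an exponential per-domain delay
        instead of re-queueing them immediately."""

        BASE_DELAY = 1.0   # seconds of delay on a domain's first retry
        MAX_DELAY = 60.0   # cap so one flaky domain can't stall the crawl

        @classmethod
        def from_crawler(cls, crawler):
            mw = super().from_crawler(crawler)
            mw.crawler = crawler  # keep a handle on the downloader
            return mw

        def _retry(self, request, reason, spider):
            # Let the stock logic decide whether to retry at all.
            retryreq = super()._retry(request, reason, spider)
            if retryreq is None:
                return None  # retries exhausted; give up as usual

            # Exponential backoff: 1s, 2s, 4s, ... capped at MAX_DELAY.
            retries = retryreq.meta.get('retry_times', 1)
            delay = min(self.BASE_DELAY * 2 ** (retries - 1), self.MAX_DELAY)

            # Downloader slots are keyed per domain by default, so raising
            # this slot's delay throttles every pending request to the host.
            slot_key = request.meta.get('download_slot',
                                        urlparse(request.url).netloc)
            slot = self.crawler.engine.downloader.slots.get(slot_key)
            if slot is not None and slot.delay < delay:
                slot.delay = delay

            return retryreq

To wire it in, you'd replace the stock middleware at its default priority in
settings.py (the module path here is made up; put the class wherever fits):

    DOWNLOADER_MIDDLEWARES = {
        'scrapy.downloadermiddlewares.retry.RetryMiddleware': None,
        'bugimporters.middleware.BackoffRetryMiddleware': 550,
    }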
That should let us more reliably crawl some of the sites that are quite finicky.
----------
messages: 3517
nosy: paulproteus
priority: wish
status: unread
title: oh-bugimporters should do per-domain backoff
__________________________________________
Roundup issue tracker <bugs at openhatch.org>
<https://openhatch.org/bugs/issue793>
__________________________________________