Temporizing User Agent rotation in Scrapy - python

I am writing a CrawlSpider using Scrapy, and I use a downloader middleware to rotate user agents for each request. What I would like to know is whether there is a way to temporize this; in other words, whether it is possible to tell the spider to change the User-Agent every X seconds. I thought that maybe the DOWNLOAD_DELAY setting would do the trick.

You might approach it a bit differently. Since you control the requests/sec crawling speed via CONCURRENT_REQUESTS, DOWNLOAD_DELAY and other relevant settings, you could simply count how many requests in a row go out with the same User-Agent header.
Something along these lines (based on scrapy-fake-useragent, not tested):
from fake_useragent import UserAgent


class RotateUserAgentMiddleware(object):
    def __init__(self, settings):
        # let's make it configurable
        self.rotate_user_agent_freq = settings.getint('ROTATE_USER_AGENT_FREQ')
        self.ua = UserAgent()
        self.request_count = 0
        self.current_user_agent = self.ua.random

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy instantiates downloader middlewares via from_crawler,
        # so the settings object has to be passed in here
        return cls(crawler.settings)

    def process_request(self, request, spider):
        if self.request_count >= self.rotate_user_agent_freq:
            # pick a new random User-Agent and reset the counter
            self.current_user_agent = self.ua.random
            self.request_count = 0
        else:
            self.request_count += 1
        request.headers.setdefault('User-Agent', self.current_user_agent)
This might not be particularly accurate, since there can also be retries and other things that could theoretically throw off the count - please test it.
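To hook this up you would also register the middleware and define the new setting. An untested sketch of a possible settings.py follows (the myproject.middlewares path, the priority and the numbers are placeholders for your project):
# settings.py (paths and values are examples, adjust to your project)
ROTATE_USER_AGENT_FREQ = 10   # change User-Agent every 10 requests

DOWNLOADER_MIDDLEWARES = {
    # disable the built-in User-Agent middleware so it doesn't compete with ours
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'myproject.middlewares.RotateUserAgentMiddleware': 400,
}

# The crawl speed controls the effective "time per User-Agent":
# with DOWNLOAD_DELAY = 1 and ROTATE_USER_AGENT_FREQ = 10,
# each User-Agent lives roughly 10 seconds.
DOWNLOAD_DELAY = 1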

Related

How to restart session after n amount of requests have been made

I have a script that tries to scrape data from a website. The website blocks any incoming requests after ~75 requests have been made to it. I found that resetting the session after 50 requests and sleeping for 30s seems to get around the blocking. Now I would like to subclass requests.Session and modify its behaviour so that it automatically resets the session when it needs to. Here is my code so far:
class Session(requests.Session):
    request_count_limit = 50

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_count = 0

    def get(self, url, **kwargs):
        if self.request_count == self.request_count_limit:
            self = Session.restart_session()
        response = super().get(url, **kwargs)
        self.request_count += 1
        return response

    @classmethod
    def restart_session(cls):
        print('Restarting Session, Sleeping For 20 seconds...')
        time.sleep(20)
        return cls()
However, the code above doesn't work. The reason is that although I am reassigning self, the object itself doesn't change, and so request_count doesn't change either. Any help would be appreciated.
Assigning to self just changes a local variable; it has absolutely no effect outside of the method. You could try implementing `__new__()` instead.
Look here: Python Class: overwrite `self`
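For what it's worth, here is an untested sketch of that idea: instead of reassigning self, reset the session's state in place once the limit is hit (the _restart name is made up here for illustration; the 20-second sleep just mirrors the original code):
import time
import requests


class Session(requests.Session):
    request_count_limit = 50

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.request_count = 0

    def get(self, url, **kwargs):
        if self.request_count >= self.request_count_limit:
            self._restart()
        response = super().get(url, **kwargs)
        self.request_count += 1
        return response

    def _restart(self):
        # Sleep, then re-run the requests.Session initialiser on *this* object,
        # which clears cookies and adapters without creating a new instance.
        print('Restarting Session, Sleeping For 20 seconds...')
        time.sleep(20)
        super().__init__()
        self.request_count = 0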

Scrapy with multiple Selenium instances (parallel)

I need to scrape many URLs with Selenium and Scrapy. To speed up the whole process, I'm trying to create a pool of shared Selenium instances. My idea is to have a set of parallel Selenium instances available to any Request that needs one, released when done.
I tried to create a middleware, but the problem is that the middleware works sequentially (I see all the drivers - I call them browsers - loading URLs one after another). I want all drivers to work in parallel.
import time

from selenium import webdriver
from scrapy.http import HtmlResponse


class ScrapySpiderDownloaderMiddleware(object):
    BROWSERS_COUNT = 10

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.free_browsers = set(
            [webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver')
             for x in range(self.BROWSERS_COUNT)])

    def get_free_browser(self):
        while True:
            try:
                return self.free_browsers.pop()
            except KeyError:
                time.sleep(0.1)

    def release_browser(self, browser):
        self.free_browsers.add(browser)

    def process_request(self, request, spider):
        browser = self.get_free_browser()

        browser.get(request.url)

        body = str.encode(browser.page_source)
        self.release_browser(browser)

        # Expose the driver via the "meta" attribute
        request.meta.update({'browser': browser})

        return HtmlResponse(
            browser.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )
I don't like solutions where you do:
driver.get(response.url)
in the parse method, because it causes redundant requests: every URL ends up being requested twice, which I need to avoid.
For example, this answer: https://stackoverflow.com/a/17979285/2607447
Do you know what to do?
I suggest you look towards Scrapy + Docker. You can run many instances at once.
As @Granitosaurus suggested, Splash is a good choice. I personally used scrapy-splash: Scrapy takes care of the parallel processing and Splash takes care of rendering the website, including JavaScript execution.
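For reference, an untested sketch of what the scrapy-splash wiring looks like, roughly following the project's README (the spider name and URL are placeholders; the middleware priorities are the README's defaults):
# settings.py
SPLASH_URL = 'http://localhost:8050'  # a Splash instance, e.g. run via Docker

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider
import scrapy
from scrapy_splash import SplashRequest


class RenderedSpider(scrapy.Spider):
    name = 'rendered'  # placeholder name

    def start_requests(self):
        # args={'wait': 1} gives Splash a second to finish JavaScript rendering
        yield SplashRequest('https://example.com', self.parse, args={'wait': 1})

    def parse(self, response):
        # response here contains the rendered HTML
        yield {'title': response.css('title::text').get()}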

Python Scrapy - common way to configure scrapers

The Python framework Scrapy works pretty well, but I cannot figure out how to configure spiders at runtime. It seems all configuration has to be done "statically", which is not handy. Awful design, or did I miss something?
For example, I have a spider that requires a difficult initialization routine.
I use my own scripts to obtain HTTP headers for crawling (cookies, user agent, etc.), as if they came from a logged-in user.
This takes one to two minutes. After that, these headers should be applied to all requests.
Right now I do this in the __init__ method of the spider. But I cannot set the User-Agent from the spider's constructor: custom_settings must be defined as a class variable. Therefore I have to use a middleware to set the user agent on each request, which is an ugly solution.
Do we have some common pattern to initialize spiders, some kind of spider factory? Something like this:
class SpiderConfigurator:
    def __init__(self):
        ...

    def configureSpider(self, spider, environment):
        ...
        spider.setMyCustomSettings(arg1, arg2)
        ...
        environment.setMyCustomSettings(argName1, argValue1)
        environment.setMyCustomSettings('User-Agent', 'my value')
Scrapy allows launching a crawl from a script: Run Scrapy from script. Thanks to @paultrmbrth for this hint.
But we cannot initialize the spider ourselves: we just pass the spider class to the Crawler instance, and the crawler instantiates the object. What we can do is pass parameters to our spider's constructor, something like this:
import os

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

os.chdir(scrapyDir)
projectSettings = get_project_settings()
crawlerProcess = CrawlerProcess(projectSettings)
crawlerProcess.crawl(SpiderCls,
                     argumentName1=argumentValue1,
                     argumentName2=argumentValue2)
The arguments argumentName1 and argumentName2 will be passed to the spider's constructor.
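For illustration, an untested sketch of the receiving side: a spider whose constructor accepts the pre-built headers and applies them to every request (the spider name, the headers argument and the URL are placeholders):
import scrapy


class ConfiguredSpider(scrapy.Spider):
    name = 'configured'  # placeholder name

    def __init__(self, headers=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # headers prepared by the external login routine described above
        self.prepared_headers = headers or {}

    def start_requests(self):
        for url in ['https://example.com']:  # placeholder URL
            yield scrapy.Request(url, headers=self.prepared_headers)


# passed through CrawlerProcess.crawl exactly as in the snippet above:
# crawlerProcess.crawl(ConfiguredSpider, headers=my_prepared_headers)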

How can I write tests for code using twisted.web.client.Agent and its subclasses?

I read the official tutorial on test-driven development, but it hasn't been very helpful in my case. I've written a small library that makes extensive use of twisted.web.client.Agent and its subclasses (BrowserLikeRedirectAgent, for instance), but I've been struggling in adapting the tutorial's code to my own test cases.
I had a look at twisted.web.test.test_web, but I don't understand how to make all the pieces fit together. For instance, I still have no idea how to get a Protocol object from an Agent, as per the official tutorial.
Can anybody show me how to write a simple test for some code that relies on Agent to GET and POST data? Any additional details or advice are most welcome...
Many thanks!
How about making life simpler (i.e. the code more readable) by using @inlineCallbacks?
In fact, I'd even go as far as to suggest staying away from using Deferreds directly, unless absolutely necessary for performance or in a specific use case, and instead always sticking to @inlineCallbacks. This way your code keeps looking like normal code while benefiting from non-blocking behaviour:
from twisted.internet import reactor
from twisted.web.client import Agent
from twisted.internet.defer import inlineCallbacks
from twisted.trial import unittest
from twisted.web.http_headers import Headers
from twisted.internet.error import DNSLookupError


class SomeTestCase(unittest.TestCase):

    @inlineCallbacks
    def test_smth(self):
        ag = Agent(reactor)
        response = yield ag.request(
            'GET', 'http://example.com/',
            Headers({'User-Agent': ['Twisted Web Client Example']}), None)
        self.assertEquals(response.code, 200)

    @inlineCallbacks
    def test_exception(self):
        ag = Agent(reactor)
        try:
            yield ag.request(
                'GET', 'http://exampleeee.com/',
                Headers({'User-Agent': ['Twisted Web Client Example']}), None)
        except DNSLookupError:
            pass
        else:
            self.fail()
Trial should take care of the rest, i.e. waiting on the Deferreds returned from the test functions (@inlineCallbacks-wrapped callables also "magically" return a Deferred). I strongly suggest reading more on @inlineCallbacks if you're not familiar with it yet.
P.S. there's also a Twisted "plugin" for nosetests that enables you to return Deferreds from your test functions and have nose wait until they are fired before exiting: http://nose.readthedocs.org/en/latest/api/twistedtools.html
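As a small untested sketch of how that nose helper is typically used (the timeout value and URL are illustrative):
from nose.twistedtools import deferred
from twisted.internet import reactor
from twisted.web.client import Agent


@deferred(timeout=5.0)
def test_get_example():
    # nose waits for the returned Deferred to fire before finishing the test
    agent = Agent(reactor)
    return agent.request('GET', 'http://example.com/')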
This is similar to what Mike said, but it attempts to test response handling. There are other ways of doing this, but I like this one. I also agree that testing things that wrap Agent isn't too helpful, and that keeping the logic in your protocol and testing the protocol is probably better anyway, but sometimes you just want to add some green ticks.
from twisted.internet import reactor
from twisted.internet.defer import Deferred, inlineCallbacks
from twisted.trial import unittest
from twisted.web.client import Agent


class MockResponse(object):

    def __init__(self, response_string):
        self.response_string = response_string

    def deliverBody(self, protocol):
        protocol.dataReceived(self.response_string)
        protocol.connectionLost(None)


class MockAgentDeliverStuff(Agent):

    def request(self, method, uri, headers=None, bodyProducer=None):
        d = Deferred()
        reactor.callLater(0, d.callback, MockResponse(response_body))
        return d


class MyWrapperTestCase(unittest.TestCase):

    def setUp(self):
        agent = MockAgentDeliverStuff(reactor)
        self.wrapper_object = MyWrapper(agent)

    @inlineCallbacks
    def test_something(self):
        response_object = yield self.wrapper_object("example.com")
        self.assertEqual(response_object, expected_object)
How about this? Run trial on the following. Basically you're just mocking away Agent and pretending it does as advertised, using FailingAgent to (in this case) fail all requests. If you actually want to inject data into the transport, that would take "more doing", I guess. But are you really testing your code then, or Agent's?
from twisted.web import client
from twisted.internet import reactor, defer
from twisted.trial import unittest


class BidnessLogik(object):

    def __init__(self, agent):
        self.agent = agent
        self.money = None

    def make_moneee_quik(self):
        d = self.agent.request('GET', 'http://no.traffic.plz')
        d.addCallback(self.made_the_money).addErrback(self.no_dice)
        return d

    def made_the_money(self, *args):
        ## print "Moneeyyyy!"
        self.money = True
        return 'money'

    def no_dice(self, fail):
        ## print "Better luck next time!!"
        self.money = False
        return 'no dice'


class FailingAgent(client.Agent):
    expected_uri = 'http://no.traffic.plz'
    expected_method = 'GET'
    reasons = ['No Reason']
    test = None

    def request(self, method, uri, **kw):
        if self.test:
            self.test.assertEqual(self.expected_uri, uri)
            self.test.assertEqual(self.expected_method, method)
            self.test.assertEqual([], kw.keys())
        return defer.fail(client.ResponseFailed(reasons=self.reasons,
                                                response=None))


class TestRequest(unittest.TestCase):

    def setUp(self):
        self.agent = FailingAgent(reactor)
        self.agent.test = self

    @defer.inlineCallbacks
    def test_foo(self):
        bid = BidnessLogik(self.agent)
        resp = yield bid.make_moneee_quik()
        self.assertEqual(resp, 'no dice')
        self.assertEqual(False, bid.money)

How to return some variable once in few seconds to client?

I'm learning python+tornado currently and was stopped with this problem:
I need to write some data to the client every few seconds (for example), even while using self.write(var).
I've tried:
time.sleep - it blocks
yield gen.Task(IOLoop.instance().add_timeout, time.time() + ...) - a great thing, but I still get the whole response only at the end of the timeout
.flush - for some reason it doesn't want to return the data to the client
PeriodicCallback - the browser window just keeps loading, as with the other methods above
I imagine my code like this:
class MaHandler(tornado.web.RequestHandler):

    @tornado.web.asynchronous
    @tornado.gen.engine
    def get(self):
        for x in xrange(10):
            self.write(x)
            time.sleep(5)  # yes, it's not working
That's all. Thanks for any help with this. I've been working on it for 4-5 days and really can't manage it by myself.
I still think it can't be done with the server side alone. This can be closed.
Use the PeriodicCallback class.
class MyHandler(tornado.web.RequestHandler):

    @tornado.web.asynchronous
    def get(self):
        self._pcb = tornado.ioloop.PeriodicCallback(self._cb, 1000)
        self._pcb.start()

    def _cb(self):
        self.write('Kapooya, Kapooya!')
        self.flush()

    def on_connection_close(self):
        self._pcb.stop()
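If you prefer the shape of the original loop, an untested alternative sketch is a coroutine with gen.sleep, so the handler never blocks the IOLoop (assumes Tornado 4.1+; the handler name is made up):
import tornado.web
from tornado import gen


class StreamHandler(tornado.web.RequestHandler):

    @gen.coroutine
    def get(self):
        for x in range(10):
            self.write(str(x))
            # flush pushes the chunk to the client immediately
            yield self.flush()
            # gen.sleep waits without blocking the IOLoop (unlike time.sleep)
            yield gen.sleep(5)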
