Scrapy with multiple Selenium instances (parallel) - python

I need to scrape many URLs with Selenium and Scrapy. To speed up the whole process, I'm trying to create a pool of shared Selenium instances. My idea is to have a set of parallel Selenium instances available to any Request that needs one, and released when it is done.
I tried to create a middleware, but the problem is that the middleware processes requests sequentially (I see all drivers, which I call browsers, loading URLs one after another). I want all drivers to work in parallel.
class ScrapySpiderDownloaderMiddleware(object):
    BROWSERS_COUNT = 10

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.free_browsers = set(
            [webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver') for x in range(self.BROWSERS_COUNT)])

    def get_free_browser(self):
        while True:
            try:
                return self.free_browsers.pop()
            except KeyError:
                time.sleep(0.1)

    def release_browser(self, browser):
        self.free_browsers.add(browser)

    def process_request(self, request, spider):
        browser = self.get_free_browser()

        browser.get(request.url)
        body = str.encode(browser.page_source)
        self.release_browser(browser)

        # Expose the driver via the "meta" attribute
        request.meta.update({'browser': browser})

        return HtmlResponse(
            browser.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )
I don't like solutions where you call:
driver.get(response.url)
in the parse method, because they cause redundant requests: every URL is requested twice, which I need to avoid. For example, this one: https://stackoverflow.com/a/17979285/2607447
Do you know what to do?

I suggest you look towards Scrapy + Docker: you can run many instances at once.

As @Granitosaurus suggested, Splash is a good choice. I personally used scrapy-splash: Scrapy takes care of parallel processing and Splash takes care of website rendering, including JavaScript execution.
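As a rough illustration of that setup (not from the original answer), here is a minimal scrapy-splash sketch. It assumes a Splash instance is already running locally, e.g. started with docker run -p 8050:8050 scrapinghub/splash, and the spider name and start URL are placeholders:

# settings.py
SPLASH_URL = 'http://localhost:8050'
DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# spider
import scrapy
from scrapy_splash import SplashRequest

class MySpider(scrapy.Spider):
    name = 'my_spider'  # placeholder name

    def start_requests(self):
        for url in ['https://example.com']:  # placeholder URLs
            # Splash renders the page (JavaScript included); Scrapy schedules
            # these requests concurrently according to CONCURRENT_REQUESTS.
            yield SplashRequest(url, self.parse, args={'wait': 0.5})

    def parse(self, response):
        yield {'title': response.css('title::text').get()}

Splash handles the rendering, so no Selenium pool or custom middleware is needed, and Scrapy keeps scheduling the requests concurrently.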

Related

Specifying local Chromedriver path for Dash application testing with webdriver-manager

I am trying to leverage Dash application testing as described here: https://dash.plotly.com/testing
However, I found no way of specifying the Chromedriver path for the webdriver-manager that dash testing uses under the hood.
I tried the test below, which calls webdriver-manager before reaching the test code:
def test_bsly001_falsy_child(dash_duo):
    app = import_app("my_app_path")
    dash_duo.start_server(app)
webdriver-manager then starts downloading the latest Chromedriver version. But due to company policy we cannot just download things from the internet; it is blocked by the firewall. We are supposed to use the Chromedriver that is already provided for us on the internal network.
I tried implementing a pytest fixture to set up the Chrome driver before the testing starts:
driver = webdriver.Chrome(executable_path="...")
But webdriver-manager does not accept this.
Do you know any ways of working around this? Any hints?
Any way of doing Dash testing without webdriver-manager?
Thanks.
I had a similar problem and ended up with another pytest fixture, dash_duo_bis, which uses a custom DashComposite(Browser) class inside conftest.py, where the _get_chrome method is overridden as follows:
class DashComposite(Browser):

    def __init__(self, server, **kwargs):
        super().__init__(**kwargs)
        self.server = server

    def get_webdriver(self):
        return self._get_chrome()

    def _get_chrome(self):
        return webdriver.Chrome(executable_path=r"MY_CHROME_EXECUTABLE_PATH")

    def start_server(self, app, **kwargs):
        """Start the local server with app."""
        # start server with app and pass Dash arguments
        self.server(app, **kwargs)

        # set the default server_url; it implicitly calls wait_for_page
        self.server_url = self.server.url
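For completeness, here is a minimal sketch of how such a dash_duo_bis fixture could be wired into conftest.py (my own illustration, not part of the original answer). It assumes the dash_thread_server fixture shipped with dash[testing]'s pytest plugin, and the exact Browser keyword arguments may differ between Dash versions:

# conftest.py (continued)
import pytest

@pytest.fixture
def dash_duo_bis(dash_thread_server, tmpdir):
    # Reuse the DashComposite subclass above so the locally provided
    # Chromedriver is used instead of webdriver-manager.
    with DashComposite(
        dash_thread_server,
        browser="Chrome",
        download_path=tmpdir.mkdir("download").strpath,
    ) as dc:
        yield dc

A test then takes dash_duo_bis instead of dash_duo as its fixture argument.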

What is the right approach to unittest this method in Python?

I have a scraping module in my application that uses Beautiful Soup and Selenium to get website info with this function:
def get_page(user: str) -> Optional[BeautifulSoup]:
    """Get a Beautiful Soup object that represents the user profile page in some website"""
    try:
        browser = webdriver.Chrome(options=options)
        wait = WebDriverWait(browser, 10)
        browser.get('https://somewebsite.com/' + user)
        wait.until(EC.presence_of_element_located((By.TAG_NAME, 'article')))
    except TimeoutException:
        print("User hasn't been found. Try another user.")
        return None
    return BeautifulSoup(browser.page_source, 'lxml')
I need to test this function in two ways:
if it is getting a page (the success case);
and if it is printing the warning and returning None when it's not getting any page (the failure case).
I tried to test it like this:
class ScrapeTests(unittest.TestCase):
    def test_get_page_success(self):
        """
        Test if get_page is getting a page
        """
        self.assertEqual(isinstance(sc.get_page('myusername'), BeautifulSoup), True)

    def test_get_page_not_found(self):
        """
        Test if get_page returns None when looking for a user
        that doesn't exist
        """
        self.assertEqual(sc.get_page('iwçl9239jaçklsdjf'), None)


if __name__ == '__main__':
    unittest.main()
Doing it like that makes the tests somewhat slow: get_page itself is slow in the success case, and in the failure case I'm forcing a timeout error by looking for a non-existent user. I have the impression that my approach to testing these functions is not the right one. Probably the best way to test it is to fake a response, so get_page won't need to connect to the server and ask for anything.
So I have two questions:
Is this "fake web response" idea the right approach to test this function?
If so, how can I achieve it for that function? Do I need to rewrite the get_page function so it can be "testable"?
EDIT:
I tried to create a test for get_page like this:
class ScrapeTests(TestCase):
    def setUp(self) -> None:
        self.driver = mock.patch(
            'scrape.webdriver.Chrome',
            autospec=True
        )
        self.driver.page_source.return_value = "<html><head></head><body><article>Yes</article></body></html>"
        self.driver.start()

    def tearDown(self) -> None:
        self.driver.stop()

    def test_get_page_success(self):
        """
        Test if get_page is getting a page
        """
        self.assertEqual(isinstance(sc.get_page('whatever'), BeautifulSoup), True)
The problem I'm facing is that the driver.page_source attribute is created only after the wait.until call. I need the wait.until because the Selenium browser has to wait for the JavaScript to create the article tags in the HTML so I can scrape them.
When I try to define a return value for page_source in setUp, I get an error: AttributeError: '_patch' object has no attribute 'page_source'
I tried lots of ways to mock webdriver attributes with mock.patch, but it seems very difficult with my limited knowledge. I think that maybe the best way to achieve what I want (testing the get_page function without connecting to a server) is to mock an entire web server connection. But this is just a guess.
I think you are on the right track with #1. Look into the mock library, which allows you to mock (fake) out functions, methods and classes and control the results of method calls. This will also remove the latency of the actual calls.
In my experience it is best to focus on testing the local logic and mock out any external dependencies. If you do that you will have a lot of small unit tests that together will test the majority of your code.
Based on your update, try:
self.driver = mock.patch.object(webdriver.Chrome, 'page_source', return_value="<html><head></head><body><article>Yes</article></body></html>")
If that doesn't work, then unfortunately I am out of ideas. It's possible the Selenium code is harder to mock.
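For reference, here is a minimal sketch of how both tests could be written by patching the Selenium objects inside the module under test (my own illustration; it assumes the scraping module is importable as scrape, as in the question, and that it imports webdriver, WebDriverWait and TimeoutException into its own namespace):

import unittest
from unittest import mock

from bs4 import BeautifulSoup
from selenium.common.exceptions import TimeoutException

import scrape as sc  # assumption: the module under test is called "scrape"


class ScrapeTests(unittest.TestCase):
    @mock.patch('scrape.WebDriverWait')
    @mock.patch('scrape.webdriver.Chrome')
    def test_get_page_success(self, mock_chrome, mock_wait):
        # No real browser: the mocked driver instance hands back canned HTML.
        mock_chrome.return_value.page_source = (
            "<html><head></head><body><article>Yes</article></body></html>")
        page = sc.get_page('whatever')
        self.assertIsInstance(page, BeautifulSoup)

    @mock.patch('scrape.WebDriverWait')
    @mock.patch('scrape.webdriver.Chrome')
    def test_get_page_not_found(self, mock_chrome, mock_wait):
        # Simulate the explicit wait timing out, so get_page should return None.
        mock_wait.return_value.until.side_effect = TimeoutException
        self.assertIsNone(sc.get_page('nonexistent-user'))


if __name__ == '__main__':
    unittest.main()

Neither test starts a real browser or touches the network, so they run in milliseconds.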

CrawlerRunner not Waiting?

I was trying to use the following function to wait for a crawler to finish and return all results. However, this function always returns immediately when called, while the crawler is still running. What am I missing here? Isn't join() supposed to wait?
def spider_results():
    runner = CrawlerRunner(get_project_settings())
    results = []

    def crawler_results(signal, sender, item, response, spider):
        results.append(item)

    dispatcher.connect(crawler_results, signal=signals.item_passed)

    runner.crawl(QuotesSpider)
    runner.join()
    return results
CrawlerRunner.crawl() and runner.join() do not block; they return Twisted Deferreds, which is why your function returns immediately while the crawl is still running. According to the Scrapy docs (common practices section), the CrawlerProcess class is recommended in this case: its start() method runs the Twisted reactor and blocks until the crawl is finished.
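A minimal sketch of that approach, using QuotesSpider from the question and collecting items via the item_scraped signal:

from scrapy import signals
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings


def spider_results():
    results = []

    def collect_item(item, response, spider):
        results.append(item)

    process = CrawlerProcess(get_project_settings())
    crawler = process.create_crawler(QuotesSpider)
    crawler.signals.connect(collect_item, signal=signals.item_scraped)
    process.crawl(crawler)
    process.start()  # blocks here until the crawl has finished
    return results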

Memoizing on some of a function's arguments in Python

Suppose I have a function that takes an object representing a web browser and uses it to grab a web page. I might be following that URL again at some point in the program, and I'd prefer to have the page cached rather than load it again. But the trick is, I might be using a different browser object at that point. I'm happy to get the value from the cache, but lru_cache (and most other memoization systems) will consider the two invocations different by virtue of the different browser arguments. I'm wondering if there's a nice way to use memoization while ignoring some of a function's arguments. What I've got below is neither nice nor reusable.
from functools import lru_cache

class Browser(object):
    """Pretend this is a Browser from mechanize"""
    count = 0

    def __init__(self):
        self.count = Browser.count
        Browser.count += 1
        print("Created browser #{}".format(self.count))

    def get_url(self, url):
        """Pretend we're actually doing something here"""
        print("...Browser #{} visiting {}".format(self.count, url))
        return url[::-1]

@lru_cache()
def _get_url(url):
    return _get_url.browser.get_url(url)

_get_url.browser = None

def get_url(browser, url):
    _get_url.browser = browser
    return _get_url(url)

for url in "www.python.org www.yahoo.com www.python.org".split():
    browser = Browser()  # Imagine that we periodically switch to a different Browser instance
    print("{} => {}".format(url, get_url(browser, url)))
Output:
Created browser #0
...Browser #0 visiting www.python.org
www.python.org => gro.nohtyp.www
Created browser #1
...Browser #1 visiting www.yahoo.com
www.yahoo.com => moc.oohay.www
Created browser #2
www.python.org => gro.nohtyp.www
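One reusable way to express the same idea (my own sketch, not part of the original post) is a small decorator that builds the cache key from every argument except the first one, i.e. the browser:

from functools import wraps

def memoize_ignoring_first_arg(func):
    """Memoize func, keying the cache on everything except its first argument."""
    cache = {}

    @wraps(func)
    def wrapper(first, *args, **kwargs):
        key = (args, tuple(sorted(kwargs.items())))
        if key not in cache:
            cache[key] = func(first, *args, **kwargs)
        return cache[key]

    return wrapper

@memoize_ignoring_first_arg
def get_url(browser, url):
    return browser.get_url(url)

With this, the second request for www.python.org is served from the cache regardless of which Browser instance is passed in.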

Temporizing User Agent rotation in Scrapy

I am writing a CrawlSpider using Scrapy, and I use a downloader middleware to rotate user agents for each request. What I would like to know is whether there is a way to temporize this; in other words, whether it is possible to tell the spider to change the User-Agent every X seconds. I thought that maybe using the DOWNLOAD_DELAY setting would do the trick.
You might approach it a bit differently. Since you have control over the requests/sec crawling speed via CONCURRENT_REQUESTS, DOWNLOAD_DELAY and other relevant settings, you might just count how many requests in a row would go with the same User-Agent header.
Something along these lines, based on scrapy-fake-useragent (not tested):
from fake_useragent import UserAgent

class RotateUserAgentMiddleware(object):
    def __init__(self, settings):
        # let's make it configurable
        self.rotate_user_agent_freq = settings.getint('ROTATE_USER_AGENT_FREQ')
        self.ua = UserAgent()
        self.request_count = 0
        self.current_user_agent = self.ua.random

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy creates the middleware through this hook; it is what passes
        # the settings object into __init__.
        return cls(crawler.settings)

    def process_request(self, request, spider):
        if self.request_count >= self.rotate_user_agent_freq:
            self.current_user_agent = self.ua.random
            self.request_count = 0
        else:
            self.request_count += 1

        request.headers.setdefault('User-Agent', self.current_user_agent)
This might not be particularly accurate, since there can also be retries and other things that theoretically throw off the count; please test it.
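To wire this up, the middleware also has to be enabled in the project settings; a sketch, where the module path, priority and rotation frequency are placeholders:

# settings.py
ROTATE_USER_AGENT_FREQ = 50  # change the User-Agent every 50 requests

DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RotateUserAgentMiddleware': 543,  # placeholder path/priority
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,  # disable the default
}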
