Suppose I have a function that takes an object representing a web browser and uses it to grab a web page. I might be fetching that URL again at some point in the program, and I'd prefer to have the page cached rather than loading it again. The trick is that I might be using a different browser object at that point. I'm happy to get the value from the cache, but lru_cache (and most other memoization systems) will consider the two invocations different by virtue of the different browser arguments. I'm wondering if there's a nice way to use memoization while ignoring some of the function's arguments. What I've got below is neither nice nor reusable.
from functools import lru_cache

class Browser(object):
    """Pretend this is a Browser from mechanize"""
    count = 0

    def __init__(self):
        self.count = Browser.count
        Browser.count += 1
        print("Created browser #{}".format(self.count))

    def get_url(self, url):
        """Pretend we're actually doing something here"""
        print("...Browser #{} visiting {}".format(self.count, url))
        return url[::-1]

@lru_cache()
def _get_url(url):
    return _get_url.browser.get_url(url)
_get_url.browser = None

def get_url(browser, url):
    _get_url.browser = browser
    return _get_url(url)

for url in "www.python.org www.yahoo.com www.python.org".split():
    browser = Browser()  # Imagine that we periodically switch to a different Browser instance
    print("{} => {}".format(url, get_url(browser, url)))
Output:
Created browser #0
...Browser #0 visiting www.python.org
www.python.org => gro.nohtyp.www
Created browser #1
...Browser #1 visiting www.yahoo.com
www.yahoo.com => moc.oohay.www
Created browser #2
www.python.org => gro.nohtyp.www
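One reusable way to get this behaviour (a sketch, not part of the original post) is a decorator that stashes the ignored first argument on the cached inner function, so that only the remaining arguments form the cache key:

```python
from functools import lru_cache

def memoize_ignoring_first(maxsize=None):
    """Cache results keyed on every argument except the first one."""
    def decorator(func):
        @lru_cache(maxsize=maxsize)
        def cached(*args):
            # cached.ignored holds the most recent first argument
            # (e.g. the browser); it is not part of the cache key
            return func(cached.ignored, *args)

        def wrapper(ignored, *args):
            cached.ignored = ignored
            return cached(*args)

        wrapper.cache_info = cached.cache_info
        return wrapper
    return decorator

@memoize_ignoring_first()
def get_url(browser, url):
    return browser.get_url(url)
```

With this, the third loop iteration in the example above would be served from the cache even though a brand-new Browser instance is passed in.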
I have a scraping module on my application that uses Beautiful Soup and Selenium to get website info with this function:
def get_page(user: str) -> Optional[BeautifulSoup]:
    """Get a Beautiful Soup object that represents the user profile page in some website"""
    try:
        browser = webdriver.Chrome(options=options)
        wait = WebDriverWait(browser, 10)
        browser.get('https://somewebsite.com/' + user)
        wait.until(EC.presence_of_element_located((By.TAG_NAME, 'article')))
    except TimeoutException:
        print("User hasn't been found. Try another user.")
        return None
    return BeautifulSoup(browser.page_source, 'lxml')
I need to test this function in two ways:
if it is getting a page (the success case);
and if it is printing the warning and returning None when it's not getting any page (the failure case).
I tried to test like this:
class ScrapeTests(unittest.TestCase):
    def test_get_page_success(self):
        """
        Test if get_page is getting a page
        """
        self.assertEqual(isinstance(sc.get_page('myusername'), BeautifulSoup), True)

    def test_get_page_not_found(self):
        """
        Test if get_page returns None when looking for a user
        that doesn't exist
        """
        self.assertEqual(sc.get_page('iwçl9239jaçklsdjf'), None)

if __name__ == '__main__':
    unittest.main()
Doing it like that makes the tests somewhat slower, as get_page itself is slow in the success case, and in the failure case, I'm forcing a timeout error looking for a non-existing user. I have the impression that my approach for testing these functions is not the right one. Probably the best way to test it is to fake a response, so get_page won't need to connect to the server and ask for anything.
So I have two questions:
Is this "fake web response" idea the right approach to test this function?
If so, how can I achieve it for that function? Do I need to rewrite the get_page function so it can be "testable"?
EDIT:
I tried to create a test to get_page like this:
class ScrapeTests(TestCase):
    def setUp(self) -> None:
        self.driver = mock.patch(
            'scrape.webdriver.Chrome',
            autospec=True
        )
        self.driver.page_source.return_value = "<html><head></head><body><article>Yes</article></body></html>"
        self.driver.start()

    def tearDown(self) -> None:
        self.driver.stop()

    def test_get_page_success(self):
        """
        Test if get_page is getting a page
        """
        self.assertEqual(isinstance(sc.get_page('whatever'), BeautifulSoup), True)
The problem I'm facing is that the driver.page_source attribute is created only after the wait.until call. I need the wait.until because I need the Selenium browser to wait for the JavaScript to create the article tags in the HTML so that I can scrape them.
When I try to define a return value for page source in setUp, I get an error: AttributeError: '_patch' object has no attribute 'page_source'
I tried lots of ways to mock webdriver attributes with mock.patch, but it seems very difficult given my limited knowledge. I think that maybe the best way to achieve what I want (testing the get_page function without needing to connect to a server) is to mock an entire web server connection. But this is just a guess.
I think you are on the right track with #1. Look into the mock library which allows you to mock (fake) out functions, methods and classes and control the results of method calls. This will also remove the latency of the actual calls.
In my experience it is best to focus on testing the local logic and mock out any external dependencies. If you do that you will have a lot of small unit tests that together will test the majority of your code.
Based on your update, try:
self.driver = mock.patch.object(webdriver.Chrome, 'page_source', return_value="<html><head></head><body><article>Yes</article></body></html>")
If that doesn't work, then unfortunately I am out of ideas. It's possible the Selenium code is harder to mock.
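To make the "fake response" idea concrete, here is a self-contained sketch of the pattern (the `Chrome` class and `get_page` below are simplified stand-ins, not the real Selenium API). The two key points: patch the class *where it is used*, and set `page_source` as a plain attribute on the mock instance, because it is an attribute rather than a method (which is why `return_value` didn't help):

```python
from unittest import mock

class Chrome:
    """Stand-in for selenium.webdriver.Chrome (illustration only)."""
    page_source = ""

    def get(self, url):
        raise RuntimeError("would hit the network")

def get_page(user):
    """Simplified stand-in for the real get_page (no wait/BeautifulSoup)."""
    browser = Chrome()
    browser.get('https://somewebsite.com/' + user)
    return browser.page_source

HTML = "<html><body><article>Yes</article></body></html>"

with mock.patch(f'{__name__}.Chrome') as mock_chrome:
    # page_source is an attribute: assign it on the instance that the
    # patched class returns, instead of configuring return_value
    mock_chrome.return_value.page_source = HTML
    assert get_page('whatever') == HTML  # no real browser, no network
```

In the real test the patch targets would be 'scrape.webdriver.Chrome' and 'scrape.WebDriverWait' (so wait.until becomes a harmless mock call), and you would then assert that sc.get_page(...) returns a BeautifulSoup instance.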
I need to scrape many urls with Selenium and Scrapy. To speed up the whole process, I'm trying to create a bunch of shared Selenium instances. My idea is to have a set of parallel Selenium instances available to any Request when needed, and released when done.
I tried to create a middleware, but the problem is that the middleware processes requests sequentially (I see all the drivers (I call them browsers) loading urls one after another). I want all the drivers to work in parallel.
class ScrapySpiderDownloaderMiddleware(object):
    BROWSERS_COUNT = 10

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.free_browsers = set(
            [webdriver.Chrome(executable_path=BASE_DIR + '/chromedriver') for x in range(self.BROWSERS_COUNT)])

    def get_free_browser(self):
        while True:
            try:
                return self.free_browsers.pop()
            except KeyError:
                time.sleep(0.1)

    def release_browser(self, browser):
        self.free_browsers.add(browser)

    def process_request(self, request, spider):
        browser = self.get_free_browser()
        browser.get(request.url)
        body = str.encode(browser.page_source)
        self.release_browser(browser)
        # Expose the driver via the "meta" attribute
        request.meta.update({'browser': browser})
        return HtmlResponse(
            browser.current_url,
            body=body,
            encoding='utf-8',
            request=request
        )
I don't like solutions where you do:
driver.get(response.url)
in the parse method, because it causes redundant requests: every URL is requested twice, which I need to avoid.
For example this https://stackoverflow.com/a/17979285/2607447
Do you know what to do?
I suggest you look towards Scrapy + Docker; you can run many instances at once.
As @Granitosaurus suggested, Splash is a good choice. I personally used Scrapy-Splash: Scrapy takes care of the parallel processing and Splash takes care of the website rendering, including JavaScript execution.
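For reference, wiring Splash into a Scrapy project is mostly configuration. This sketch follows the scrapy-splash README; the Splash URL and middleware priorities below are the documented defaults, and assume a Splash instance listening on localhost:8050:

```python
# settings.py (sketch)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'

# In the spider, request pages through Splash instead of driving Selenium:
# from scrapy_splash import SplashRequest
# yield SplashRequest(url, self.parse, args={'wait': 0.5})
```

Scrapy's scheduler then handles concurrency across requests, so no hand-rolled browser pool is needed.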
My aim is to provide to a web framework access to a Pyro daemon that has time-consuming tasks at the first loading. So far, I have managed to keep in memory (outside of the web app) a single instance of a class that takes care of the time-consuming loading at its initialization. I can also query it with my web app. The code for the daemon is:
@Pyro4.expose
@Pyro4.behavior(instance_mode='single')
class Store(object):
    def __init__(self):
        self._store = ...  # the expensive loading

    def query_store(self, query):
        return ...  # Useful query tool to expose to the web framework.
                    # Not time consuming, provided self._store is loaded.

with Pyro4.Daemon() as daemon:
    uri = daemon.register(Thing)
    with Pyro4.locateNS() as ns:
        ns.register('thing', uri)
    daemon.requestLoop()
The issue I am having is that although a single instance is created, it is only created at the first proxy query from the web app. This is normal behavior according to the doc, but not what I want, as the first query is still slow because of the initialization of Thing.
How can I make sure the instance is already created as soon as the daemon is started?
I was thinking of creating a proxy instance of Thing in the code of the daemon, but this is tricky because the event loop must be running.
EDIT
It turns out that daemon.register() can accept either a class or an object, which could be a solution. This is however not recommended in the doc (link above) and that feature apparently only exists for backwards compatibility.
Do whatever initialization you need outside of your Pyro code. Cache it somewhere. Use the instance_creator parameter of the @Pyro4.behavior decorator for maximum control over how and when an instance is created. You can even consider pre-creating server instances yourself and retrieving one from a pool if you so desire. Anyway, one possible way to do this is like so:
import Pyro4

def slow_initialization():
    print("initializing stuff...")
    import time
    time.sleep(4)
    print("stuff is initialized!")
    return {"initialized stuff": 42}

cached_initialized_stuff = slow_initialization()

def instance_creator(cls):
    print("(Pyro is asking for a server instance! Creating one!)")
    return cls(cached_initialized_stuff)

@Pyro4.behavior(instance_mode="percall", instance_creator=instance_creator)
class Server:
    def __init__(self, init_stuff):
        self.init_stuff = init_stuff

    @Pyro4.expose
    def work(self):
        print("server: init stuff is:", self.init_stuff)
        return self.init_stuff

Pyro4.Daemon.serveSimple({
    Server: "test.server"
})
But this complexity is not needed for your scenario. Just initialize the thing that takes a long time and cache it somewhere; instead of re-initializing it every time a new server object is created, refer to the cached pre-initialized result. Something like this:
import Pyro4

def slow_initialization():
    print("initializing stuff...")
    import time
    time.sleep(4)
    print("stuff is initialized!")
    return {"initialized stuff": 42}

cached_initialized_stuff = slow_initialization()

@Pyro4.behavior(instance_mode="percall")
class Server:
    def __init__(self):
        self.init_stuff = cached_initialized_stuff

    @Pyro4.expose
    def work(self):
        print("server: init stuff is:", self.init_stuff)
        return self.init_stuff

Pyro4.Daemon.serveSimple({
    Server: "test.server"
})
I'd like to set up a nice way to capture screenshots when some of our Robot Framework front-end tests fail. Of course I can add:
Run Keyword If Test Failed Capture Page Screenshot
to the Test Teardown, but considering I have huge and complex test suites with hundreds of tests and a nested structure, I would need to add this to so many teardowns that it seems ugly to me.
I've experimented a bit. I thought the way forward was to use listener. So I tried this:
class ExtendedSelenium(Selenium2Library):
    ROBOT_LISTENER_API_VERSION = 3

    def __init__(self, timeout=5.0, implicit_wait=0.0, run_on_failure='Capture Page Screenshot', screenshot_root_directory=None):
        super(ExtendedSelenium, self).__init__(timeout, implicit_wait, run_on_failure, screenshot_root_directory)
        self.ROBOT_LIBRARY_LISTENER = self

    def _end_test(self, data, result):
        if not result.passed:
            screenshot_name = data.id + data.name.replace(" ", "_") + ".png"
            try:
                self.capture_page_screenshot(screenshot_name)
            except:
                pass
It captures the screenshot but the picture is then not visible in the log. I'm able to display it in the test message, adding this step after capturing:
BuiltIn().set_test_message("*HTML* {} <br/> <img src={}>".format(result.message, screenshot_name))
But still, it isn't the best.
Then I tried a different approach with the Visitor interface (used with --prerunmodifier):
from robot.api import SuiteVisitor

class CaptureScreenshot(SuiteVisitor):
    def end_test(self, test):
        if test.status == 'FAIL':
            test.keywords.create('Capture Page Screenshot', type='teardown')
But it replaces any existing Test Teardown with the new one (containing only the 'Capture Page Screenshot' keyword). I thought I would be able to modify existing teardowns by adding the capture keyword, but I wasn't.
Is there any nice, clean, pythonic way to do this? Did I miss something?
Finally, I ended up with a library listener, as below. I have to agree with Bryan: the code isn't short and nice, but it fulfils the desired goal: a single point in the suite where screenshot capturing is defined.
As a big advantage I see the possibility to capture a screenshot for a failed setup; in some cases it helps us identify possible infrastructure problems.
Please note the part with ActionChains: it zooms out in the browser. Our front-end app uses a partial page scroll, and with this zoom we are able to see more content inside that scroll, which is really helpful for us. The result of the ActionChains differs for each browser, so this is truly a workaround.
#!/usr/bin/python
# -*- coding: utf-8 -*-
from Selenium2Library import Selenium2Library
from selenium.common.exceptions import StaleElementReferenceException, WebDriverException
import re

class ExtendedSelenium(Selenium2Library):
    """Robot Framework library extending the Robot Framework Selenium2Library library."""

    ROBOT_LISTENER_API_VERSION = 2
    DO_NOT_CAPTURE_KEYWORDS = ["Run Keyword And Ignore Error", "Run Keyword And Expect Error", "Run Keyword And Return Status", "Wait Until.*"]

    def __init__(self, timeout=5.0, implicit_wait=0.0, run_on_failure='', screenshot_root_directory=None):
        super(ExtendedSelenium, self).__init__(timeout, implicit_wait, run_on_failure, screenshot_root_directory)
        self.ROBOT_LIBRARY_LISTENER = self
        self._is_current_keyword_inside_teardown = False
        self._do_not_capture_parent_keywords_count = 0
        self._screenshot_was_captured = False

    def _start_test(self, name, attributes):
        """Reset the flags at the beginning of each test."""
        self._do_not_capture_parent_keywords_count = 0
        self._is_current_keyword_inside_teardown = False
        self._screenshot_was_captured = False

    def _start_keyword(self, name, attributes):
        """Set the teardown flag when a teardown keyword starts.
        If the keyword is one of the 'do not capture' keywords, increase the _do_not_capture_parent_keywords_count counter.
        """
        if attributes["type"] == "Teardown":
            self._is_current_keyword_inside_teardown = True
        if any(kw for kw in self.DO_NOT_CAPTURE_KEYWORDS if re.match(kw, attributes["kwname"])):
            self._do_not_capture_parent_keywords_count += 1

    def _end_keyword(self, name, attributes):
        """If the keyword is one of the 'do not capture' keywords, decrease the _do_not_capture_parent_keywords_count counter.
        Capture a screenshot if:
        - the keyword failed AND
        - the test is not in its teardown phase AND
        - the parent keyword isn't one of the 'do not capture' keywords.
        A RuntimeError is raised when no browser is open (back-end test); no screenshot is captured in that case.
        """
        if any(kw for kw in self.DO_NOT_CAPTURE_KEYWORDS if re.match(kw, attributes["kwname"])):
            self._do_not_capture_parent_keywords_count -= 1
        if attributes["status"] != "PASS" and not self._is_current_keyword_inside_teardown and self._do_not_capture_parent_keywords_count == 0 and not self._screenshot_was_captured:
            self._screenshot_was_captured = True
            try:
                self.capture_page_screenshot()
                # TODO refactor this so it is reusable and nice!
                from selenium.webdriver.common.action_chains import ActionChains
                from selenium.webdriver.common.keys import Keys
                ActionChains(super(ExtendedSelenium, self)._current_browser()).send_keys(Keys.CONTROL, Keys.SUBTRACT, Keys.NULL).perform()
                ActionChains(super(ExtendedSelenium, self)._current_browser()).send_keys(Keys.CONTROL, Keys.SUBTRACT, Keys.NULL).perform()
                self.capture_page_screenshot()
                ActionChains(super(ExtendedSelenium, self)._current_browser()).send_keys(Keys.CONTROL, '0', Keys.NULL).perform()
            except RuntimeError:
                pass
Any comments are welcome.
I wrapped everything in a try/except block for my tests. That way, if the test fails at any point, the screenshot call in the except block runs and I get a screenshot the moment before/after/...whatever.
try:
    class UnitTestRunner():
        ...
except:
    driver.save_screenshot('filepath')
I'm attempting to call test_check_footer_visible in another .py file.
origin.py has:
class FooterSection(unittest.TestCase):
    """
    """
    browser = None
    site_url = None

    def setUp(self):
        self.browser.open(self.site_url)
        self.browser.window_maximize()
        # make footer section
        self._footer = FooterPage(self.browser)

    def test_check_footer_visible(self):
        # press the page down
        self.browser.key_down_native("34")
        self.browser.key_down_native("34")
        self.browser.key_down_native("34")
        print "Page down"
        time.sleep(10)
Now in dest.py I need to call test_check_footer_visible(). How do I do that?
dest.py has the following format:
class TestPageErrorSection(unittest.TestCase):
    """
    """
    browser = None
    site_url = None

    def setUp(self):
        global SITE_URL, BROWSER
        if not BROWSER:
            BROWSER = self.browser
        if not SITE_URL:
            SITE_URL = self.site_url
        BROWSER.open(SITE_URL)
        BROWSER.window_maximize()
        print 'page not found 404 error'
        self._pageerror_section = ErrorSection(BROWSER)

    def _primary_navigation(self):
        # I want to call test_check_footer_visible here.
I tried everything in Call Class Method From Another Class
You can't (without doing some really shady things -- see the comments by @glglgl) -- at least not without changing your code somewhat. The reason is that test_check_footer_visible expects its first argument (self) to be an instance of FooterSection, which it isn't: it's an instance of TestPageErrorSection. You have (at least) a couple of options for the refactor.
1) Make TestPageErrorSection inherit from FooterSection and use the function directly (e.g. self.test_check_footer_visible()). This doesn't seem like a very good option to me, based on the class names.
2) Make test_check_footer_visible() a regular function. It can then be called from either class.
3) Create a 3rd class (e.g. SectionBase), put test_check_footer_visible on that class and have your other 2 classes inherit from it.
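Option 3 could be sketched like this (any names beyond those in the question are illustrative; note the shared helper deliberately drops the test_ prefix so the runner doesn't collect and run it on its own):

```python
import unittest

class SectionBase(unittest.TestCase):
    """Shared helpers for the page-section test classes."""

    def check_footer_visible(self):
        # the page-down / footer logic from FooterSection would go here;
        # no test_ prefix, so unittest will not pick it up as a test
        return True

class FooterSection(SectionBase):
    def test_check_footer_visible(self):
        self.assertTrue(self.check_footer_visible())

class TestPageErrorSection(SectionBase):
    def test_primary_navigation(self):
        # reuse the shared check from a different test class
        self.assertTrue(self.check_footer_visible())
```

Both classes can now call the shared check, and neither needs an instance of the other.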