Is it possible to parallelize selenium webdriver get_attribute calls in python?

I am running this code
from multiprocessing.pool import ThreadPool
from selenium import webdriver

driver = webdriver.Firefox()
driver.get(url)
elements = driver.find_elements_by_class_name("class-name")

pool = ThreadPool(4)
# "async" is a reserved keyword in Python 3.7+, so use a different name
async_results = [pool.apply_async(fn_which_calls_get_attribute, (element,))
                 for element in elements]
results = [result.get() for result in async_results]
which works fine for some of the results but throws a ResponseNotReady error for others. It runs as expected if I use "pool.apply" instead of the async version.
Is it a problem that I am making multiple calls to the selenium driver at once, and the error is because it cannot handle it? Or is something wrong with my parallelization?

Just a hint: a Selenium WebDriver session processes commands sequentially over a single connection, so it is not safe to issue concurrent commands to one driver instance. What you can do is create a separate driver instance per worker, for example one per core on a multi-core system.
This may not answer your question directly, but if you are trying to do something like this, it is better avoided.
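To make that concrete, here is a minimal sketch of the one-driver-per-worker idea, assuming the work can be split by URL rather than by element (WebElement objects cannot be shared across processes). The URLs, the class name, and the "href" attribute are placeholders:

from multiprocessing import Pool
from selenium import webdriver

def collect_attributes(url):
    # Each worker owns its own browser, so no two processes ever
    # talk to the same WebDriver session.
    driver = webdriver.Firefox()
    try:
        driver.get(url)
        elements = driver.find_elements_by_class_name("class-name")
        return [element.get_attribute("href") for element in elements]
    finally:
        driver.quit()

if __name__ == '__main__':
    urls = ["https://example.com/page1", "https://example.com/page2"]
    with Pool(2) as p:
        results = p.map(collect_attributes, urls)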


Equivalent of time.sleep() in Selenium

I realize this is a relatively simple question but I haven't found the answer yet.
I'm using driver.get() in a for loop that iterates through some urls. To help avoid my IP address getting blocked, I've implemented time.sleep(5) before the driver.get statement in the for loop.
Basically, I just want a wait period to make my scraping seem more natural.
I think time.sleep may be causing page crashes. What is the equivalent of time.sleep in selenium? From what I understand, implicitly_wait just sets the amount of time before throwing an exception, but I'm not sure that's what I want here? I want a specific amount of time for the driver to wait.
time.sleep()
The sleep() function from the time module suspends execution of the current thread for the given number of seconds.
WebDriver, on the other hand, is an out-of-process library that instructs the browser what to do, and the browser itself is asynchronous in nature, so WebDriver cannot track the active, real-time state of the HTML DOM. This gives rise to intermittent issues: Selenium and WebDriver usage is subject to race conditions between the browser and the user's instructions.
As of now, Selenium does not have a method identical to time.sleep(). However, there are two equivalent mechanisms at your disposal, which can be used depending on the conditions of your automated tests.
Implicit wait: In this case, WebDriver polls the DOM for a certain duration when trying to find any element. This can be useful when certain elements on the webpage are not available immediately and need some time to load.
def implicitly_wait(self, time_to_wait) -> None:
    """
    Sets a sticky timeout to implicitly wait for an element to be found,
    or a command to complete. This method only needs to be called one
    time per session. To set the timeout for calls to
    execute_async_script, see set_script_timeout.

    :Args:
     - time_to_wait: Amount of time to wait (in seconds)

    :Usage:
        ::

            driver.implicitly_wait(30)
    """
    self.execute(Command.SET_TIMEOUTS, {
        'implicit': int(float(time_to_wait) * 1000)})
Explicit wait: This type of wait allows your code to halt program execution, freezing the thread, until the condition you pass it resolves. Commonly used expected conditions include the following (a usage sketch follows the list):
presence_of_element_located()
visibility_of_element_located()
element_to_be_clickable()
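For illustration, here is a minimal sketch of an explicit wait built on one of these conditions; the URL and the "my-button" locator are assumptions:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://example.com")  # placeholder URL
# Blocks for up to 10 seconds, but returns as soon as the element is clickable.
button = WebDriverWait(driver, 10).until(
    EC.element_to_be_clickable((By.ID, "my-button")))
button.click()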
There is no dedicated method in Selenium for hardcoded pauses like Python's general-purpose time.sleep().
As you mentioned, there are implicitly_wait and Expected Conditions-based explicit WebDriverWait waits, but neither of these is a hardcoded pause.
Both implicitly_wait and WebDriverWait set a timeout: how long to poll for an element's presence or for a condition. As soon as the condition is fulfilled or the element is present, the program flow immediately continues to the next line of code.
So, if you want an unconditional pause, you have to use a general Python method that suspends the program / thread, like time.sleep().
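For example, a hardcoded pause between navigations, as in the question (the URL list is a placeholder):

import time
from selenium import webdriver

driver = webdriver.Firefox()
for url in ["https://example.com/page1", "https://example.com/page2"]:
    time.sleep(5)  # unconditional pause; nothing can shorten it
    driver.get(url)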

Selenium browser instance can be accessible from a different process?

What I am currently trying to do is the following. There are a number of changing values (js driven) in a website that I am monitoring and saving to a database using Selenium. The values are read through infinite loops, from elements found with selenium's find_element.
This works as intended with one process. However, when I try to multiprocess this (to monitor multiple values at the same time), there seems to be no way to do it without opening one separate browser for each process (unfeasible, since we are talking about close to 60 different elements).
The browser I open before multiprocessing seems to not be available from within the various processes. Even if I find the elements before the multiprocessing step, I cannot pass them to the process since the webelements can't be pickled.
Am I doing something wrong, is selenium not the tool for the job, or is there another way?
The code below doesn't actually work, it's just meant to show the structure of what I currently have as a "working version". What I need to get away from is opening the browser from within the function and have all my processes relying on a single browser.
import time
import datetime
import os
from selenium import webdriver
from multiprocessing import Pool

def sampling(value_ID):
    dir = os.path.dirname(__file__)
    driver = webdriver.Firefox(dir)
    driver.get("https://website.org")
    monitored_value = driver.find_element_by_xpath(value_ID)
    while True:
        print(monitored_value.text)
        time.sleep(0.1)

value_array = [1, 2, 3, 4, 5, 6]

if __name__ == '__main__':
    with Pool(6) as p:
        p.map(sampling, value_array)
You can check out Selenium's abstract event listeners if you want to capture changes in elements. By implementing a listener you can get rid of the infinite loops. Here is an example that I think can work for you.
from selenium.webdriver.support.events import AbstractEventListener, EventFiringWebDriver

class EventListeners(AbstractEventListener):
    def before_change_value_of(self, element, driver):
        # check if this is the element you are looking for
        # do your stuff
        print("element changed!")

driver_with_listeners = EventFiringWebDriver(driver, EventListeners())

# wait as much as you like while your listeners are working
driver_with_listeners.implicitly_wait(20000)
You can also check out this post for a more complete implementation.

Getting stuck executing infinite javascript loop in Python's Selenium chromedriver

I am trying to build a service where users can insert their Javascript code and it gets executed on a website of their choice. I use webdriver from python's selenium lib and chromedriver. The problem is that the python script gets stuck if user submits Javascript code with infinite loop.
The python script needs to process many tasks like: go to a website and execute some Javascript code. So, I can't afford to let it get stuck. Infinite loop in Javascript is known to cause a browser to freeze. But isn't there some way to set a timeout for webdriver's execute_script method? I would like to get back to python after a timeout and continue to run code after the execute_script command. Is this possible?
from selenium import webdriver

chromedriver = r"C:\chromedriver\chromedriver.exe"  # raw string so the backslashes are kept literally
driver = webdriver.Chrome(chromedriver)
driver.get("http://www.bulletproofpasswords.org/")  # Or any other website
driver.execute_script("while (1); // Javascript infinite loop causing freeze")
You could set a timeout for your driver.execute_script("while (1);") call. I have found another post that could solve this issue.
Basically, if you are on a Unix system, you could use signal to set a timeout for the driver.execute_script("while (1);") call.
Alternatively, you could run it in a separate process and end the process if it takes too long, using multiprocessing.Process. I'm including the example that was given in the other post:
import multiprocessing
import time

def bar():
    for i in range(100):
        print("Tick")
        time.sleep(1)

if __name__ == '__main__':
    # Start bar as a process
    p = multiprocessing.Process(target=bar)
    p.start()

    # Wait for 10 seconds or until process finishes
    p.join(10)

    # If the process is still active
    if p.is_alive():
        print("running... let's kill it...")
        # Terminate
        p.terminate()
        p.join()
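Applying the same pattern to the Selenium call itself might look like the sketch below. The 10-second budget is arbitrary, and terminating the child process can leave the browser running, so treat this as a sketch rather than production code:

import multiprocessing
from selenium import webdriver

def run_user_script():
    driver = webdriver.Chrome()
    try:
        driver.get("http://www.bulletproofpasswords.org/")
        driver.execute_script("while (1);")  # user-submitted code from the question
    finally:
        driver.quit()

if __name__ == '__main__':
    p = multiprocessing.Process(target=run_user_script)
    p.start()
    p.join(10)  # give the script 10 seconds to finish
    if p.is_alive():
        p.terminate()  # the orphaned browser may need separate cleanup
        p.join()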

Python Selenium send request and avoid "Waiting for (website) ...."

I am launching several requests on different tabs. While one tab loads I will iteratively go to other tabs and see whether they have loaded correctly. The mechanism works great except for one thing: a lot of time is wasted "Waiting for (website)..."
The way in which I go from one tab to the other is launching an exception whenever a key element that I have to find is missing. But, in order to check for this exception (and therefore to proceed on other tabs, as it should do) what happens is that I have to wait for the request to end (so for the message "Waiting for..." to disappear).
Would it be possible not to wait? That is, would it be possible to launch the request via browser.get(..) and then immediately change tab?
Yes, you can do that. You need to change the pageLoadStrategy of the driver. Below is an example for Firefox:
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver
cap = DesiredCapabilities.FIREFOX
cap["pageLoadStrategy"] = "none"
print(DesiredCapabilities.FIREFOX)
driver = webdriver.Firefox(capabilities=cap)
driver.get("http://tarunlalwani.com")
#execute code for tab 2
#execute code for tab 3
Nothing will wait now, and it is up to you to do all the waiting. You can also use "eager" instead of "none".
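Building on the snippet above, here is a minimal sketch of the tab idea from the question; the about:blank second tab and the handle ordering are assumptions:

# Open a second tab and start a slow load there without blocking the first.
driver.execute_script("window.open('about:blank', '_blank');")
first_tab, second_tab = driver.window_handles
driver.switch_to.window(second_tab)
driver.get("http://tarunlalwani.com")  # returns almost immediately under "none"
driver.switch_to.window(first_tab)     # keep working while tab 2 loads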

Multi-threading in selenium python

I am working on a project which needs a bit of automation and web scraping, for which I am using Selenium and BeautifulSoup (python2.7).
I want to open only one instance of a web browser and log in to a website. Keeping that session, I am trying to open new tabs which will be independently controlled by threads, each thread controlling a tab and performing its own task. How should I do it? An example code would be nice. Well, here's my code:
def threadFunc(driver, tabId):
    if tabId == 1:
        # open a new tab and do something in it
    elif tabId == 2:
        # open another new tab with some different link and perform some task
    ....  # other cases

class tabThreads(threading.Thread):
    def __init__(self, driver, tabId):
        threading.Thread.__init__(self)
        self.tabID = tabId
        self.driver = driver

    def run(self):
        print "Executing tab ", self.tabID
        threadFunc(self.driver, self.tabID)

def func():
    # Created a main window
    driver = webdriver.Firefox()
    driver.get("...someLink...")

    # This is the part where i am stuck, whether to create threads and send
    # them the same web-driver to stick with the current session by using the
    # javascript call "window.open('')" or use a separate for each tab to
    # operate on individual pages, but that will open a new browser instance
    # everytime a driver is created
    thread1 = tabThreads(driver, 1)
    thread2 = tabThreads(driver, 2)
    ......  # other threads
I am open to suggestions for using any other module, if needed
My understanding is that Selenium drivers are not thread-safe. In the WebDriver spec, the Thread Safety section is empty...which I take to mean they have not addressed the topic at all. https://www.w3.org/TR/2012/WD-webdriver-20120710/#thread-safety
So while you could share the driver reference with multiple threads and make calls to the driver from multiple threads, there is no guarantee that the driver will be able to handle multiple asynchronous calls correctly.
Instead, you must either synchronize calls from multiple threads to ensure one is completed before the next starts, or you should have just one thread making Selenium API calls...potentially handling commands from a queue that is filled by multiple other threads.
Also, see Can Selenium use multi threading in one browser?
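A minimal sketch of that queue approach, with one dedicated thread owning the driver while other threads only enqueue commands; all names and URLs here are illustrative:

import threading
import queue
from selenium import webdriver

commands = queue.Queue()

def driver_worker():
    # The only thread that ever touches the driver.
    driver = webdriver.Firefox()
    while True:
        task = commands.get()
        if task is None:  # sentinel: shut down
            break
        func, args = task
        func(driver, *args)
    driver.quit()

def visit(driver, url):
    driver.get(url)
    print("title:", driver.title)

worker = threading.Thread(target=driver_worker)
worker.start()

# Any number of threads can enqueue work; only driver_worker executes it.
commands.put((visit, ("https://example.com",)))
commands.put((visit, ("https://example.org",)))
commands.put(None)
worker.join()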
If you are using the script to automatically submit forms (simply put, doing GET and POST requests), I would recommend that you look at requests. You can easily capture POST requests from your browser (Network tab in the Developer pane of both Firefox and Chrome) and replay them. Something like:
import requests
from bs4 import BeautifulSoup

session = requests.session()
response = session.get('https://stackoverflow.com/')
soup = BeautifulSoup(response.text, 'html.parser')
and even POST data like:
postdata = {'username': 'John', 'password': password}  # password defined elsewhere
response = session.post('https://example.com', data=postdata, allow_redirects=True)
It can easily be threaded and is many times faster than using Selenium. The only problem is that there is no JavaScript support, so you need to handle forms the old-fashioned way.
EDIT:
Also take a look at ThreadPoolExecutor
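For instance, a minimal sketch of threading the requests approach with ThreadPoolExecutor; the URLs are placeholders:

import requests
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # One independent request per call keeps the threading simple.
    return requests.get(url).status_code

urls = ["https://example.com/a", "https://example.com/b", "https://example.com/c"]
with ThreadPoolExecutor(max_workers=3) as executor:
    print(list(executor.map(fetch, urls)))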
