I am working on a project which needs a bit of automation and web scraping, for which I am using Selenium and BeautifulSoup (Python 2.7).
I want to open only one instance of a web browser, log in to a website, and keep that session. I then want to open new tabs that are controlled independently by threads, each thread controlling one tab and performing its own task. How should I do it? Example code would be nice. Here's my code:
import threading
from selenium import webdriver

def threadFunc(driver, tabId):
    if tabId == 1:
        pass  # open a new tab and do something in it
    elif tabId == 2:
        pass  # open another new tab with a different link and perform some task
    # .... other cases

class tabThreads(threading.Thread):
    def __init__(self, driver, tabId):
        threading.Thread.__init__(self)
        self.tabID = tabId
        self.driver = driver

    def run(self):
        print "Executing tab ", self.tabID
        threadFunc(self.driver, self.tabID)

def func():
    # Create the main window
    driver = webdriver.Firefox()
    driver.get("...someLink...")
    # This is the part where I am stuck: should I create threads and pass
    # them the same web driver, staying in the current session by using the
    # JavaScript call "window.open('')", or use a separate driver for each
    # tab to operate on individual pages? The latter opens a new browser
    # instance every time a driver is created.
    thread1 = tabThreads(driver, 1)
    thread2 = tabThreads(driver, 2)
    # ...... other threads
I am open to suggestions for using any other module, if needed.
My understanding is that Selenium drivers are not thread-safe. In the WebDriver spec, the Thread Safety section is empty...which I take to mean they have not addressed the topic at all. https://www.w3.org/TR/2012/WD-webdriver-20120710/#thread-safety
So while you could share the driver reference with multiple threads and make calls to the driver from multiple threads, there is no guarantee that the driver will be able to handle multiple asynchronous calls correctly.
Instead, you must either synchronize calls from multiple threads to ensure one is completed before the next starts, or you should have just one thread making Selenium API calls...potentially handling commands from a queue that is filled by multiple other threads.
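The single-thread-plus-queue idea can be sketched roughly as follows. This is only an illustration under assumptions: DriverWorker, submit and stop are hypothetical names, not part of Selenium, and the driver is any object you would normally call directly. Other threads never touch the driver; they hand a function to the worker and read the result from a reply queue.

```python
import queue
import threading


class DriverWorker(threading.Thread):
    """Hypothetical helper: the only thread that ever calls into the driver."""

    def __init__(self, driver):
        super().__init__(daemon=True)
        self.driver = driver
        self.commands = queue.Queue()

    def submit(self, func, *args):
        """Callable from any thread; returns a queue that receives the result."""
        reply = queue.Queue(maxsize=1)
        self.commands.put((func, args, reply))
        return reply

    def stop(self):
        self.commands.put(None)  # sentinel that tells run() to exit

    def run(self):
        while True:
            item = self.commands.get()
            if item is None:
                break
            func, args, reply = item
            try:
                # Commands run one at a time, so driver calls never overlap
                reply.put(("ok", func(self.driver, *args)))
            except Exception as exc:
                # Report the failure to the caller instead of killing the worker
                reply.put(("error", exc))
```

A task thread would then do something like reply = worker.submit(lambda d, url: d.get(url), some_url) followed by reply.get(), so the calls are serialized without each thread needing its own lock.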
Also, see "Can Selenium use multi threading in one browser?"
If you are using the script to submit forms automatically (in other words, just making GET and POST requests), I would recommend looking at requests. You can easily capture POST requests from your browser (the Network tab in the developer tools of both Firefox and Chrome) and submit them yourself. Something like:
import requests
from bs4 import BeautifulSoup

session = requests.session()
response = session.get('https://stackoverflow.com/')
soup = BeautifulSoup(response.text, 'html.parser')
and even POST data like:
postdata = {'username': 'John', 'password': password}  # password is defined elsewhere
response = session.post('https://example.com', data=postdata, allow_redirects=True)
It can be easily threaded and is many times faster than using Selenium. The only drawback is that there is no JavaScript or form support, so you need to do it the old-fashioned way.
EDIT:
Also take a look at ThreadPoolExecutor
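A minimal sketch of what using ThreadPoolExecutor could look like. The names fetch_all and fake_fetch are made up for illustration; in practice the fetch function would wrap a real request such as session.get(url).text, and you should check for yourself whether sharing one session across threads is acceptable in your setup.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(fetch, urls, max_workers=4):
    """Call fetch(url) for every URL on a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for a real network request, used so the sketch runs offline
    return "<html>%s</html>" % url


pages = fetch_all(fake_fetch, ["https://example.com/a", "https://example.com/b"])
```

Because pool.map returns results in the order of the input, pages[0] belongs to the first URL even if the second request finishes earlier.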
Related
What I am currently trying to do is the following. There are a number of changing values (js driven) in a website that I am monitoring and saving to a database using Selenium. The values are read through infinite loops, from elements found with selenium's find_element.
This works as intended with one process. However, when I try to multiprocess this (to monitor multiple values at the same time), there seems to be no way to do it without opening one separate browser for each process (unfeasible, since we are talking about close to 60 different elements).
The browser I open before multiprocessing seems to not be available from within the various processes. Even if I find the elements before the multiprocessing step, I cannot pass them to the process since the webelements can't be pickled.
Am I doing something wrong, is selenium not the tool for the job, or is there another way?
The code below doesn't actually work, it's just meant to show the structure of what I currently have as a "working version". What I need to get away from is opening the browser from within the function and have all my processes relying on a single browser.
import time
import datetime
import os
from selenium import webdriver
from multiprocessing import Pool

def sampling(value_ID):
    dir = os.path.dirname(__file__)
    driver = webdriver.Firefox(dir)
    driver.get("https://website.org")
    # value_ID stands for the XPath of the element to monitor
    monitored_value = driver.find_element_by_xpath(value_ID)
    while True:
        print(monitored_value.text)
        time.sleep(0.1)

value_array = [1, 2, 3, 4, 5, 6]

if __name__ == '__main__':
    with Pool(6) as p:
        p.map(sampling, value_array)
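You can check out Selenium's abstract event listeners if you want to capture changes to elements. By implementing a listener you may be able to get rid of the infinite loops. Keep in mind that the hooks fire around commands issued through the wrapped driver, so test whether they catch the changes you care about. Here is an example that I think can work for you.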
from selenium.webdriver.support.events import EventFiringWebDriver, AbstractEventListener

class EventListeners(AbstractEventListener):
    def before_change_value_of(self, element, driver):
        # check if this is the element you are looking for
        # do your stuff
        print("element changed!")

driver_with_listeners = EventFiringWebDriver(driver, EventListeners())
# wait as much as you like while your listeners are working
driver_with_listeners.implicitly_wait(20000)
You can also check out this post for a more complete implementation.
I'm trying to open two websites in two tabs in my web browser. What actually happens is that two separate web browser windows are opened.
import webbrowser
webbrowser.open_new('https://www.msn.com')
webbrowser.open_new_tab('https://www.aol.com/')
The issue is likely that the browser hasn't finished opening by the time you ask for a new tab. The docs do state that if no browser is open, open_new_tab() acts as open_new(), which is why you are seeing two browsers.
I suggest putting a small delay between the calls:
import webbrowser
import time
webbrowser.open_new(url1)
time.sleep(1)
webbrowser.open_new_tab(url2)
Your other option is to poll the running processes and wait until the first instance of the browser appears before asking for a new tab.
I would like to have a single long-running browser session that is reused between separate runs of my script, so that I can avoid logging in every time the script runs. Using other answers, I have a working solution:
session_info = load_from_json()
options = webdriver.ChromeOptions()
driver = webdriver.Remote(
    command_executor=session_info["executor_url"],
    desired_capabilities={},
    options=options)
driver.session_id = session_info["session_id"]
This has the unwanted side effect of leaving an orphaned chrome-webdriver session lying around on top of the already existing browser session. I was wondering what I can do to avoid having an extra orphaned session.
Before loading a new session, try clearing both session storage and local storage:
driver.getSessionStorage().clear();
driver.getLocalStorage().clear();
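The two calls above are written in the style of the Java bindings. Since the question uses Python, one way to attempt the same thing is to run the equivalent JavaScript through execute_script. This is only a sketch: clear_web_storage is a hypothetical helper name, and whether clearing storage actually removes the orphaned session is not something this snippet verifies.

```python
def clear_web_storage(driver):
    """Clear sessionStorage and localStorage for the page currently loaded."""
    # execute_script runs the given JavaScript in the current page
    driver.execute_script(
        "window.sessionStorage.clear(); window.localStorage.clear();"
    )
```

Note that web storage is scoped per origin, so this only affects the site that is currently open in the driver.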
I am launching several requests on different tabs. While one tab loads I will iteratively go to other tabs and see whether they have loaded correctly. The mechanism works great except for one thing: a lot of time is wasted "Waiting for (website)..."
I move from one tab to the next by raising an exception whenever a key element that I need to find is missing. But before that exception can be raised (and the script can proceed to the other tabs, as it should), I have to wait for the request to finish, that is, for the "Waiting for..." message to disappear.
Would it be possible not to wait? That is, would it be possible to launch the request via browser.get(..) and then immediately change tab?
Yes, you can do that. You need to change the pageLoadStrategy of the driver. Below is an example for Firefox:
import time
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
from selenium import webdriver

cap = DesiredCapabilities.FIREFOX.copy()  # copy so the shared defaults are not modified
cap["pageLoadStrategy"] = "none"
print(cap)
driver = webdriver.Firefox(capabilities=cap)
driver.get("http://tarunlalwani.com")
# execute code for tab 2
# execute code for tab 3
With this setting the driver no longer waits for page loads, so it is up to you to do all the waiting. You can also use eager instead of none, which returns once the DOM is ready without waiting for images and stylesheets.
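Doing the waiting yourself could look roughly like the sketch below. It assumes the tabs are already open and uses the driver's window_handles and switch_to.window(); wait_for_tabs and is_ready are hypothetical names, and is_ready is whatever check fits your page (for example, looking for the key element).

```python
import time


def wait_for_tabs(driver, is_ready, timeout=30.0, poll=0.5):
    """Cycle through the open tabs until is_ready(driver) is true in each one.

    Returns the set of window handles that became ready before the timeout.
    """
    pending = list(driver.window_handles)
    ready = set()
    deadline = time.monotonic() + timeout
    while pending and time.monotonic() < deadline:
        for handle in list(pending):
            driver.switch_to.window(handle)  # move focus to this tab
            if is_ready(driver):
                ready.add(handle)
                pending.remove(handle)
        if pending:
            time.sleep(poll)  # short pause before the next round of checks
    return ready
```

One possible readiness check is lambda d: d.execute_script("return document.readyState") == "complete", though a check for the specific element you need is usually more useful.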
So this is what I'm using to open the browser:
import webbrowser
import time

url = "http://google.com"
delay = 5  # seconds; named "delay" so it does not shadow the time module

def TestBrowse(url, delay):
    webbrowser.open(url)
    time.sleep(delay)
I want a function or method following time.sleep that will refresh the tab that the function opens. This is a module I'm just getting familiar with, so I don't even know if there is a better module or solution for this (or if it is possible at all).
In fact, my main goal was to be able to close the tab, but I've been reading that there is no way to do that. If this is false, I would also love to know how to do it. I've experimented with using os.system to kill the browser, but os.system never seems to work inside a function (and it doesn't seem like a good idea anyway).
Maybe Selenium would be a better option for browser programming. It can be driven from Python scripts.
Also, you could try creating a wrapper web page with an embedded script that does the refreshing and/or exiting, though the browser might treat it as cross-site scripting and limit the functionality of the URL you are trying to access. In any case, to use that approach you would need to program in JavaScript rather than Python.
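As a rough illustration of the Selenium suggestion: a WebDriver object exposes refresh() and close(), which cover both the reload and the close-the-tab requests. The function name open_refresh_close is made up for this sketch, and the driver is assumed to have been created elsewhere (for example with webdriver.Firefox()).

```python
import time


def open_refresh_close(driver, url, delay=5):
    """Load url, wait, reload the page, wait again, then close the tab."""
    driver.get(url)
    time.sleep(delay)
    driver.refresh()  # reload the current page
    time.sleep(delay)
    driver.close()    # close the current window/tab
```

If the driver has only one window open, closing it typically ends the browser as well; driver.quit() is the call that shuts down the whole session explicitly.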