Get manually opened firefox window with Selenium and python [duplicate] - python

For some unknown reason, my browser opens test pages on my remote server very slowly. So I am wondering whether I can reconnect to the browser after quitting the script without executing webdriver.quit(), which would leave the browser open. It would probably take some kind of hook or webdriver handle.
I have looked through the Selenium API docs but didn't find any such function.
I'm using Chrome 62, x64, Windows 7, Selenium 3.8.0.
I'd appreciate an answer either way, whether or not this can be done.

No, you can't reconnect to the previous Web Browsing Session after you quit the script. Even if you are able to extract the Session ID, Cookies and other session attributes from the previous Browsing Context, you still won't be able to pass those attributes as a hook to the WebDriver.
A cleaner way is to call webdriver.quit() and then spawn a new Browsing Context.
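A minimal sketch of that recommended pattern (the page URL is just a placeholder):

from selenium import webdriver

# End the old session cleanly instead of trying to re-attach to it later
driver = webdriver.Chrome()
driver.get('https://www.example.com/')
# ... run the tests ...
driver.quit()  # deletes the session and stops the driver service

# ... and spawn a fresh Browsing Context for the next run
driver = webdriver.Chrome()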
Deep Dive
There have been a lot of discussions and attempts around reconnecting WebDriver to an existing running Browsing Context. In the discussion Allow webdriver to attach to a running browser, Simon Stewart (the creator of WebDriver) clearly mentioned:
Reconnecting to an existing Browsing Context is a browser-specific feature, hence it can't be implemented in a generic way.
With internet-explorer, it's possible to iterate over the open windows in the OS and find the right IE process to attach to.
firefox and google-chrome need to be started in a specific mode and configuration, which effectively means that just attaching to a running instance isn't technically possible.
tl;dr
webdriver.firefox.useExisting not implemented

Yes, that's actually quite easy to do.
A selenium <-> webdriver session is represented by a connection url and a session_id; you just reconnect to an existing one.
Disclaimer - the approach uses selenium internal properties ("private", in a way), which may change in new releases; you'd better not use it for production code; and it's better not to use it against a remote SE (your own hub, or a provider like BrowserStack/Sauce Labs), because of a caveat/resource drainage explained at the end.
When a webdriver instance is initiated, you need to get the aforementioned properties; sample:
from selenium import webdriver
driver = webdriver.Chrome()
driver.get('https://www.google.com/')
# now Google is opened, the browser is fully functional; print the two properties
# command_executor._url (it's "private", not for a direct usage), and session_id
print(f'driver.command_executor._url: {driver.command_executor._url}')
print(f'driver.session_id: {driver.session_id}')
With those two properties now known, another instance can connect; the "trick" is to initiate a Remote driver, and provide the _url above - thus it will connect to that running selenium process:
driver2 = webdriver.Remote(command_executor=the_known_url)
# when the started selenium is a local one, the url is in the form 'http://127.0.0.1:62526'
When that is run, you'll see a new browser window being opened.
That's because upon initiating the driver, the selenium library automatically starts a new session for it - and now you have 1 webdriver process with 2 sessions (browser instances).
If you navigate to a url, you'll see it is executed in that new browser instance, not in the one left over from the previous start - which is not the desired behavior.
At this point, two things need to be done - a) close the current SE session ("the new one"), and b) switch this instance to the previous session:
if driver2.session_id != the_known_session_id:   # this is pretty much guaranteed to be the case
    driver2.close()   # this closes the session's window - it is currently the only one, thus the session itself will be auto-killed, yet:
    driver2.quit()    # for remote connections (like ours), this deletes the session, but does not stop the SE server

# take over the session that's already running
driver2.session_id = the_known_session_id

# do something with the now hijacked session:
driver2.get('https://www.bing.com/')
And that is it - you're now connected to the previous/already-existing session, with all its properties (cookies, LocalStorage, etc.).
By the way, you do not have to provide desired_capabilities when initiating the new remote driver - those are stored and inherited from the existing session you took over.
Caveat - having a SE process running can lead to some resource drainage in the system.
Whenever one is started and then not closed - as in the first piece of code - it will stay there until you manually kill it. By this I mean - on Windows, for example - you'll see a "chromedriver.exe" process that you have to terminate manually once you are done with it. It cannot be closed by a driver that has connected to it as to a remote selenium process.
The reason - whenever you initiate a local browser instance and then call its quit() method, it has two parts: the first is to delete the session from the Selenium instance (which is what's done in the second code piece up there), and the other is to stop the local service (chromedriver/geckodriver) - which generally works ok.
The thing is, for Remote sessions the second part is missing - your local machine cannot control a remote process; that's the job of the remote's hub. So that second part is literally a pass python statement - a no-op.
If you start too many selenium services on a remote hub and don't have control over it, that'll lead to resource drainage on that server. Cloud providers like BrowserStack take measures against this - they close services with no activity for the last 60s, etc. - yet this is something you don't want to do.
And as for local SE services - just don't forget to occasionally clean up the OS from orphaned selenium drivers you forgot about :)
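For that local cleanup, a minimal sketch (assuming the optional psutil package is installed; the process names are the usual local driver binaries):

import psutil

# Kill any orphaned chromedriver/geckodriver services left over from earlier runs
for proc in psutil.process_iter(['name']):
    name = proc.info['name'] or ''
    if name.startswith(('chromedriver', 'geckodriver')):
        proc.kill()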

OK, after mixing various solutions shared on here and tweaking, I have this working now as below. The script will use a previously left-open chrome window if present - the remote connection is perfectly able to kill the browser if needed, and the code functions just fine.
I would love a way to automate getting the session_id and url of the previously active session without having to write them out to a file during the previous session for pick-up...
This is my first post on here, so apologies for breaking any norms.
# Imports assumed by this snippet (not shown in the original post)
from selenium import webdriver
from selenium.webdriver.remote.webdriver import WebDriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

myoptions = webdriver.ChromeOptions()  # add your own Chrome options here

# Set manually - read/write from a file for automation
session_id = "e0137cd71ab49b111f0151c756625d31"
executor_url = "http://localhost:50491"

def attach_to_session(executor_url, session_id):
    original_execute = WebDriver.execute

    def new_command_execute(self, command, params=None):
        if command == "newSession":
            # Mock the response so no new session gets created
            return {'success': 0, 'value': None, 'sessionId': session_id}
        else:
            return original_execute(self, command, params)

    # Patch the function before creating the driver object
    WebDriver.execute = new_command_execute
    driver = webdriver.Remote(command_executor=executor_url, desired_capabilities={})
    driver.session_id = session_id

    # Replace the patched function with the original function
    WebDriver.execute = original_execute
    return driver

remote_session = 0  # (unused in this snippet)

# Try to connect to the last opened session - if that fails, open a new window
try:
    driver = attach_to_session(executor_url, session_id)
    driver.current_url  # raises if the old session is no longer alive
    print(" Driver has an active window, we have connected to it and are running here now:")
    print(" Chrome session ID", session_id)
    print(" executor_url", executor_url)
except Exception:
    print("No driver window open - making a new one")
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=myoptions)
    session_id = driver.session_id
    executor_url = driver.command_executor._url
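To avoid hard-coding those two values, here is a rough sketch of the read/write-from-a-file idea mentioned in the comment above (the session.json file name is hypothetical):

import json
import os

SESSION_FILE = 'session.json'  # hypothetical file kept next to the script

def save_session(driver, path=SESSION_FILE):
    # Persist the executor URL and session id so a later run can re-attach
    with open(path, 'w') as f:
        json.dump({'executor_url': driver.command_executor._url,
                   'session_id': driver.session_id}, f)

def load_session(path=SESSION_FILE):
    # Return (executor_url, session_id) from the previous run, or (None, None)
    if not os.path.exists(path):
        return None, None
    with open(path) as f:
        data = json.load(f)
    return data['executor_url'], data['session_id']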

Without getting into why you think that leaving a browser window open will solve the slowness problem: you don't really need a handle to do that. Just keep running the tests without closing the session - in other words, without calling driver.quit(), as you have mentioned yourself. The question here, though, is whether you are using a framework that comes with its own runner, like Cucumber?
In any case, you must have some "setup" and "cleanup" code. So what you need to do is ensure during the "cleanup" phase that the browser is back to its initial state (a minimal sketch of such a cleanup follows the list below). That means:
Blank page is displayed
Cookies are erased for the session
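A minimal sketch of such a cleanup step, using only standard Selenium calls (how it is wired into your runner's teardown is up to you):

def cleanup(driver):
    # Return the still-running browser to a near-initial state between tests
    driver.delete_all_cookies()   # erase the cookies for the session
    driver.get('about:blank')     # leave a blank page displayed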

Related

Python Selenium Scraping on Local but not on VPS

I'm building a Python web scraper using Selenium. I've been testing it on my local machine, where it finishes the "scraping" stage of my program; then I have it process the data into a cloud database.
This is an image of what the logs look like when I scrape on my local computer.
However, on a VPS (Ubuntu), the scraping reaches 100% but never actually finishes and then enters limbo (the program runs forever). I have timeouts on my ChromeDriver with logs whenever it times out on a website, so that shouldn't be the issue.
These are my options:
option = webdriver.ChromeOptions()
option.add_argument("--no-sandbox")
# option.add_argument("--disable-extensions")
# option.add_argument("--disable-setuid-sandbox")
# option.add_argument('--disable-application-cache')
# option.add_argument("enable-automation")
# option.add_argument("--disable-browser-side-navigation")
# option.add_argument("start-maximized")
option.add_argument("--headless")
option.add_argument("--disable-dev-sh-usage")
option.add_argument("--disable-gpu")
# option.add_argument("--blink-settings=imagesEnabled=false")
I took off the commented-out options to test whether it would make a difference. I feel like it did speed up the scraping process on the VPS but the scraping still does not finish.
How do I find out the issue, and what could it be?
Thanks.

MaxRetryError while web scraping workaround - Python, Selenium

I am having a really hard time trying to figure out how to web-scrape while making multiple requests to the same website. I have to scrape 3000 products from a website. That implies making various requests to that server (for example searching for the product, clicking on it, going back to the home page) 3000 times.
I should state that I am using Selenium. If I only launch one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on my webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found out it does so because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried 100, 200 and 300 secs), but when I do so I get a MaxRetryError because of the too many requests made to that url using the same session.
I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I am only able to make one more search before the exception is thrown again, and so on).
I am wondering if there is any workaround for this kind of issue.
It might be using another library, a way of changing the IP or session dynamically, or something like that.
P.S. I would rather keep working with Selenium if possible.
This error is normally raised if the server detects a high request rate from your client.
As you mentioned, the server bans your IP from making further requests, so you can get around that by using some available technologies. Look into Zalenium, and also see here for some other possible ways.
Another possible (but tedious) way is to use a number of browser instances to make the calls; for example, an answer from here illustrates that.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

urlArr = ['https://link1', 'https://link2', '...']

for url in urlArr:
    chrome_options = Options()
    browser = webdriver.Chrome(
        executable_path='C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe',
        options=chrome_options)
    browser.get(url)
    # your task
    browser.close()  # will close only the current chrome window
    browser.quit()   # should close all of the open windows and end the session

Is it possible to attach an existing browser session not initiated by selenium to selenium driver [duplicate]


python - Speeding up Chrome Webdriver in Selenium

I am making a simple bot with selenium that will like, comment and message people at certain intervals.
I am using chrome web driver:
browser = webdriver.Chrome()
Also, I am on an x64 linux system. The distro is Ubuntu 15.04 and I am running with python3 from the terminal.
This works well and all, but it's pretty slow. I know that as my code progresses, testing the app will become a pain. I've looked into this already and know it may have something to do with the proxy settings.
I am clueless when it comes to this type of stuff.
I fiddled with my system settings and changed my proxy settings to not require a connection, but nothing changed.
I notice when the driver loads, I see 'Establishing secure connection' for a few seconds in the browser window. I feel this is a culprit.
Also, 'establishing host' shows up multiple times. I'd say it takes about 5-8 seconds just to get a page.
login_url = 'http://www.skout.com/login'
browser.get(login_url)
In what ways can I speed up chrome driver, and is it proxy settings? It could definitely be something else.
Thanks for your time.
Chrome webdriver can be clunky and a bit slow to initialize, as it spawns a fresh instance every time you call the Webdriver object.
If speed is of the utmost importance I might recommend investing some time into looking at a headless alternative such as PhantomJS. This can save a significant amount of time if you are running multiple tests or instances of your application.
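A minimal sketch of that swap (it assumes a phantomjs binary is installed and on PATH; note that PhantomJS support was deprecated in later Selenium releases, where headless Chrome is the usual replacement):

from selenium import webdriver

# Headless PhantomJS instead of a full Chrome window
browser = webdriver.PhantomJS()
browser.get('http://www.skout.com/login')
print(browser.title)  # quick sanity check that the page loaded
browser.quit()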

IE11 stuck at initial start page for webdriver server only when connection to the URL is slow

For the IE webdriver, it opens the IE browser and starts to load the localhost page, but then stops (i.e. the page never actually loads). When the browser stops loading, it shows the message 'Initial start page for webdriver server'. The problem is that this does not occur every time I execute the test case, making it difficult to identify the cause of the issue. What I have noticed is that when the issue occurs, the URL takes ~25 secs to load manually on the same machine; when the issue does not occur, the URL loads within 3 secs.
All security settings are the same (Protected Mode enabled across all zones)
Enhanced Protected Mode is disabled
IE version 11
The URL is added as a trusted site.
Any clue why it does not load the URL sometimes?
I would try disabling IE native events. Sorry that I cannot provide the Python syntax right away; the following is C#, which should be fairly easy to convert (a rough Python equivalent follows the snippet).
var ieOptions = new InternetExplorerOptions { EnableNativeEvents = false };
ieOptions.EnsureCleanSession = true;
driver = new InternetExplorerDriver(ieOptions);
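A rough Python equivalent, as a sketch (assuming the option names exposed by Selenium's Python IE bindings, native_events and ensure_clean_session):

from selenium import webdriver
from selenium.webdriver.ie.options import Options as IeOptions

ie_options = IeOptions()
ie_options.native_events = False        # disable native events
ie_options.ensure_clean_session = True  # clear the cache/cookies on startup
driver = webdriver.Ie(options=ie_options)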
Use a remote driver with a desired capability (pageLoadStrategy); a sketch follows the quoted release notes below.
Release notes from seleniumhq.org. Note that we had to use version 2.46 of the jar, IEDriverServer.exe and the Python client driver in order to have things work correctly. It is unclear why 2.45 does not work given the release notes below.
v2.45.0.2
Updates to JavaScript automation atoms.
Added pageLoadStrategy to the IE driver. Setting a capability named pageLoadStrategy when creating a session with the IE driver will now change the wait behavior when navigating to a new page. The valid values are:
"normal" - waits for document.readyState to be 'complete'. This is the default, and is the same behavior as all previous versions of the IE driver.
"eager" - aborts the wait when document.readyState is 'interactive' instead of waiting for 'complete'.
"none" - aborts the wait immediately, without waiting for any of the page to load.
Setting the capability to an invalid value will result in use of the "normal" page load strategy.
It hasn't been updated for a while, but recently I had a very similar issue - IEDriverServer would eventually open the page under test, but in most cases it just got stuck on the initial page of WebDriver.
What I found to be the root cause (in my case) was the startup setting of IE. I had 'Start with tabs from the last session' enabled; when I changed it back to 'Start with home page', the driver started to work like a charm, opening the page under test in 100% of tries.
