How to clean up all Selenium Firefox Processes - python

I've created a web scraper with Python (3.6) and Selenium, using the Firefox web driver. I've set up a cron job to run this scraper every few minutes, and it all seems to be working, except that over time (a few days), the memory on my Ubuntu VPS (8 GB RAM, Ubuntu 18.04.4) fills up and it crashes.
When I check htop, I can see lots (as in, hundreds) of Firefox processes like "/usr/lib/firefox -marionette" and "/usr/lib/firefox -contentproc", each taking up about 3 or 4 MB of memory.
I've put
browser.stop_client()
browser.close()
browser.quit()
in every function that uses the web driver, but I suspect the script is sometimes leaving web drivers open when it hits an error, and not closing them properly, and these Firefox processes just accumulate until my system crashes.
I'm working on finding the root cause of this, but in the meantime, is there a quick way I can kill/clean up all these processes?
e.g. a cronjob that kills all matching processes (older than 10 minutes)?
Thanks.

I suspect the script is sometimes leaving web drivers open when it hits an error, and not closing them properly
This is most likely the issue. We fix it by using try/except/finally blocks.
from selenium import webdriver

browser = webdriver.Firefox()
try:
    pass  # your scraping code here
except Exception as e:
    print(e)  # log or print the error
finally:
    browser.close()
    browser.quit()  # quit() ends the driver session and closes all windows
And if you still face the same issue, you can force-kill the driver as per this answer, or this answer for Ubuntu.
import os
os.system("taskkill /im geckodriver.exe /f")  # Windows; on Ubuntu use e.g. "pkill geckodriver"

Related

Properly starting/stopping Selenium standalone server

I am using the Selenium standalone server for a remote web driver. One thing I am trying to figure out is how to start/stop it effectively. On their documentation, it says
"the caller is expected to terminate each session properly, calling either Selenium#stop() or WebDriver#quit."
What I am trying to do is figure out how to programmatically close the server, but is that even necessary? In other words, would it be okay to have the server up and running at all times, but to just close the session after each use with something like driver.quit()? Therefore when I'm not using it the server would be up but there would be no sessions.
While using the Selenium standalone server as a Remote WebDriver, you need to invoke the quit() method at the end to terminate each session properly.
As per best practices, you should invoke the quit() method within tearDown() {}. Invoking quit() DELETEs the current browsing session by sending the "quit" command with {"flags":["eForceQuit"]} and finally sending a GET request to the /shutdown endpoint. Here is an example:
1503397488598 webdriver::server DEBUG -> DELETE /session/8e457516-3335-4d3b-9140-53fb52aa8b74
1503397488607 geckodriver::marionette TRACE -> 37:[0,4,"quit",{"flags":["eForceQuit"]}]
1503397488821 webdriver::server DEBUG -> GET /shutdown
So on invoking the quit() method, the web browser session and the WebDriver instance get killed completely.
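As a sketch of that tearDown() pattern in Python, assuming the standalone server is listening on its default port 4444:

import unittest
from selenium import webdriver

class ExampleTest(unittest.TestCase):
    def setUp(self):
        # open a Remote WebDriver session against the standalone server
        self.driver = webdriver.Remote(
            command_executor='http://127.0.0.1:4444/wd/hub',
            desired_capabilities={'browserName': 'firefox'})

    def test_title(self):
        self.driver.get('https://www.python.org')
        self.assertIn('Python', self.driver.title)

    def tearDown(self):
        # quit() terminates this session on the server; the server itself stays up
        self.driver.quit()

if __name__ == '__main__':
    unittest.main()

This matches the usage you describe: the server stays up at all times, and only the sessions are created and quit around each use.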
References
You can find a couple of relevant detailed discussions in:
Selenium : How to stop geckodriver process impacting PC memory, without calling driver.quit()?
PhantomJS web driver stays in memory
You were right. Use Selenium's driver.quit(), as it properly closes all browser windows and ends the driver's session/process. Especially the latter is what you want, because you most certainly run the script headless.
I have a Selenium script running on a Raspberry Pi (hourly cron job, headless). That script calls driver.quit() at the end of each iteration. When I do a ps -A (to list all active processes under Unix), no active Selenium/Python processes are shown anymore.
Hope that answers your question!

(Selenium) Running many firefox browser with less performance [duplicate]

I am using selenium with Firefox to automate some tasks on Instagram. It basically goes back and forth between user profiles and notifications page and does tasks based on what it finds.
It has one infinite loop that makes sure the task keeps going. I have a sleep() call every few steps, but the memory usage keeps increasing. I have something like this in Python:
while True:
    expected_conditions()
    ...doTask()
    driver.back()
    expected_conditions()
    ...doAnotherTask()
    driver.forward()
    expected_conditions()
I never close the driver because that would slow the program down a lot, as it has a lot of queries to process. Is there any way to keep the memory usage from increasing over time without closing or quitting the driver?
EDIT: Added explicit conditions but that did not help either. I am using headless mode of Firefox.
Well, this is a serious problem I'd been going through for some days, but I have found a solution. You can add some flags to optimize your memory usage.
# Note: these are Chromium-style flags; some may not be recognized by Firefox
from selenium.webdriver.firefox.options import Options

options = Options()
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument('--no-sandbox')
options.add_argument('--disable-application-cache')
options.add_argument('--disable-gpu')
options.add_argument("--disable-dev-shm-usage")
These are the flags I added. Before I added the flags, RAM usage kept increasing; once it crossed 4 GB (on my 8 GB machine), the machine got stuck. After I added these flags, memory usage didn't cross 500 MB. And as DebanjanB answers, if you are running a for or while loop, try putting a few seconds' sleep after each execution to give the system time to kill the unused thread.
To start with, Selenium has very little control over the amount of RAM used by Firefox. As you mentioned, the browser client (i.e. Firefox) goes back and forth between user profiles and the notifications page on Instagram and does tasks based on what it finds, which is too broad for a single use case. So the first and foremost task would be to break up the infinite loop pertaining to your use case into smaller tests.
time.sleep()
Inducing time.sleep() virtually puts a blanket over the underlying issue. However, while using Selenium and WebDriver to execute tests through your automation framework, using time.sleep() without any specific condition defeats the purpose of automation and should be avoided at all costs. As per the documentation:
time.sleep(secs) suspends the execution of the current thread for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time. The actual suspension time may be less than that requested because any caught signal will terminate the sleep() following execution of that signal’s catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.
You can find a detailed discussion in How to sleep webdriver in python for milliseconds
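By contrast, an explicit wait blocks only until a condition is met. A minimal sketch (the target page and locator are just illustrative assumptions):

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://www.instagram.com/")
# wait up to 10 seconds for the element to appear, instead of sleeping blindly
element = WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.TAG_NAME, "body"))
)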
Analysis
There have been previous instances where Firefox consumed about 80% of the RAM.
However, as per this discussion, some users feel that the more memory is used the better, because it means you don't have RAM wasted; Firefox uses RAM to make its processes faster, since application data is transferred much faster in RAM.
Solution
You can implement either/all of the generic/specific steps as follows:
Upgrade Selenium to current levels Version 3.141.59.
Upgrade GeckoDriver to GeckoDriver v0.24.0 level.
Upgrade Firefox version to Firefox v65.0.2 levels.
Clean your Project Workspace through your IDE and Rebuild your project with required dependencies only.
If your base Web Client version is too old, then uninstall it and install a recent GA and released version of Web Client.
Some extensions allow you to block such unnecessary content, as an example:
uBlock Origin allows you to hide ads on websites.
NoScript allows you to selectively enable and disable all scripts running on websites.
To open the Firefox client with an extension, you can download the extension (i.e. the XPI file) from https://addons.mozilla.org and use the add_extension(extension='webdriver.xpi') method to add the extension to a FirefoxProfile as follows:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.add_extension(extension='extension_name.xpi')
driver = webdriver.Firefox(firefox_profile=profile, executable_path=r'C:\path\to\geckodriver.exe')
If your tests don't require CSS, you can disable it following this discussion; a sketch is shown below.
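A minimal sketch of one way to do this via a profile preference; the value 2 (block stylesheets) is a commonly cited setting worth verifying on your Firefox version:

from selenium import webdriver

profile = webdriver.FirefoxProfile()
# 2 = block stylesheets; 1 = allow (the default)
profile.set_preference('permissions.default.stylesheet', 2)
driver = webdriver.Firefox(firefox_profile=profile)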
Use Explicit Waits or Implicit Waits.
Use driver.quit() to close all the browser windows and terminate the WebDriver session, because if you do not use quit() at the end of the program, the WebDriver session will not be closed properly and the files will not be cleared from memory. This may result in memory leak errors.
Creating a new Firefox profile and using it every time you run test cases in Firefox will eventually increase execution performance. Without doing so, a new profile is created on every run and caching information accumulates there; and if driver.quit() somehow does not get called before a failure, we end up with yet another new profile, with its cached information, consuming memory each time.
// ------------ Creating a new firefox profile -------------------
1. If Firefox is open, close Firefox.
2. Press Windows +R on the keyboard. A Run dialog will open.
3. In the Run dialog box, type in firefox.exe -P
Note: You can use -P or -ProfileManager (either one should work).
4. Click OK.
5. Create a new profile and set its location to the RAM drive.
// ----------- Associating the Firefox profile (Java) -------------------
ProfilesIni profile = new ProfilesIni();
FirefoxProfile myprofile = profile.getProfile("automation_profile");
WebDriver driver = new FirefoxDriver(myprofile);
Please share the execution performance with the community if you plan to implement this approach.
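For Python users, a rough equivalent of the Java snippet above, assuming a profile named automation_profile was already created at the (hypothetical) path below:

from selenium import webdriver

profile_path = '/path/to/automation_profile'  # adjust to your profile's location
profile = webdriver.FirefoxProfile(profile_path)
driver = webdriver.Firefox(firefox_profile=profile)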
There is no fix for that as of now.
I suggest you use the driver.close() approach, as sketched below.
I was also struggling with the RAM issue, and what I did was count the loop iterations; when the count reached a certain number (for me it was 200), I called driver.close(), then started the driver back up and reset the count.
This way I did not need to close the driver every time the loop executed, which also has less effect on performance.
Try this. Maybe it will help in your case too.
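A minimal sketch of that periodic-restart pattern, where work_items and process() are hypothetical placeholders for your own loop body (the answer calls driver.close(); quit() is used on restart here since it also ends the driver process):

from selenium import webdriver

RESTART_EVERY = 200  # iterations between driver restarts (the answer's value)

driver = webdriver.Firefox()
count = 0
for item in work_items:        # placeholder iterable
    process(driver, item)      # placeholder per-item task
    count += 1
    if count >= RESTART_EVERY:
        driver.quit()                 # release the old browser and its memory
        driver = webdriver.Firefox()  # start a fresh instance
        count = 0
driver.quit()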

Is it possible to reduce memory RAM consumption when using Selenium GeckoDriver and Firefox

I use Selenium and the Firefox web driver with Python to scrape data from a website.
But the code needs to access this website more than 10k times, and doing so consumes a lot of RAM.
Usually, by the time the script has accessed the site 2,500 times, it is already consuming 4 GB or more of RAM and it stops working.
Is it possible to reduce RAM consumption without closing the browser session?
I ask because when I start the script I need to log in to the site manually (two-factor authentication; that code is not shown below), and if I close the browser session I will need to log in again.
for itemLista in lista:
    driver.get("https://mytest.site.com/query/option?opt="+str(itemLista))
    isActivated = driver.find_element_by_xpath('//div/table//tr[2]//td[1]')
    activationDate = driver.find_element_by_xpath('//div/table//tr[2]//td[2]')
    print(str(isActivated.text))
    print(str(activationDate.text))
    indice += 1
    print("numero: "+str(indice))
    file2.write(itemLista+" "+str(isActivated.text)+" "+str(activationDate.text)+"\n")

# close file
file2.close()
I discovered how to avoid the memory leak.
I just use
time.sleep(2)
after
file2.write(itemLista+" "+str(isActivated.text)+" "+str(activationDate.text)+"\n")
Now Firefox works without consuming lots of RAM.
It is just perfect.
I don't know exactly why it stopped consuming so much memory, but I think memory use was growing because Firefox didn't have time to finish each driver.get request.
It is not clear from your question what the list items within lista contain, so the actual URL/website can't be checked.
However, it may not be possible to reduce RAM consumption while accessing the website more than 10k times in a row with the approach you have adopted.
Solution
As you mentioned, when the script accesses the site 2,500 times or so it already consumes 4 GB or more of RAM and stops working. So you may introduce a counter to access the site, say, 2,000 times per loop and then reinitialize the WebDriver and web browser afresh, after invoking driver.quit() within the tearDown(){} method, to close and destroy the existing WebDriver and web client instances gracefully:
driver.quit()  # Python
You can find a detailed discussion in PhantomJS web driver stays in memory
In case the GeckoDriver and Firefox processes are still not destroyed and removed, you may need to kill the processes from the task list.
Python solution (cross-platform):

import psutil

PROCNAME = "geckodriver"  # or chromedriver or iedriverserver

for proc in psutil.process_iter():
    # check whether the process name matches
    if proc.name() == PROCNAME:
        proc.kill()
You can find a detailed discussion in Selenium : How to stop geckodriver process impacting PC memory, without calling driver.quit()?
As mentioned in my comment, only open and write to your file on each iteration instead of keeping it open in memory:
# remove the line file2 = open(...) from your code
for itemLista in lista:
    driver.get("https://mytest.site.com/query/option?opt="+str(itemLista))
    isActivated = driver.find_element_by_xpath('//div/table//tr[2]//td[1]')
    activationDate = driver.find_element_by_xpath('//div/table//tr[2]//td[2]')
    print(str(isActivated.text))
    print(str(activationDate.text))
    indice += 1
    print("numero: "+str(indice))
    # "a" appends on each iteration; mode "w" would overwrite the file every time
    with open("your file path here", "a") as file2:
        file2.write(itemLista+" "+str(isActivated.text)+" "+str(activationDate.text)+"\n")
While Selenium is quite a memory-hungry beast, it doesn't necessarily murder your RAM with each growing iteration. However, the growing open buffer of file2 does take up RAM the more you write to it; only when it is closed does it release the virtual memory and write to physical storage.

How can I tell ChromeDriver to wait longer for Chrome to launch before giving up?

Background
I'm using Selenium and Python to automate display and navigation of a website in Chromium on Ubuntu MATE 16.04 on a Raspberry Pi 3. (Think unattended digital signage.) This combination was working great until today when the newest version of Chromium (with matching ChromeDriver) installed via automatic updates.
Because Chromium needed to perform some upgrade housekeeping tasks the next time it started up, it took a little longer than usual. Keep in mind that this is on a Raspberry Pi, so I/O is severely bottlenecked by the SD card. Unfortunately, it took long enough that my Python script failed because the ChromeDriver gave up on Chromium ever starting:
Traceback (most recent call last):
  File "call-tracker-start", line 15, in <module>
    browser = webdriver.Chrome(executable_path=chromedriver_path, options=chrome_options)
  File "/home/pi/.local/lib/python3.5/site-packages/selenium/webdriver/chrome/webdriver.py", line 75, in __init__
    desired_capabilities=desired_capabilities)
  File "/home/pi/.local/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 154, in __init__
    self.start_session(desired_capabilities, browser_profile)
  File "/home/pi/.local/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 243, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
  File "/home/pi/.local/lib/python3.5/site-packages/selenium/webdriver/remote/webdriver.py", line 312, in execute
    self.error_handler.check_response(response)
  File "/home/pi/.local/lib/python3.5/site-packages/selenium/webdriver/remote/errorhandler.py", line 242, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: chrome not reachable
  (Driver info: chromedriver=2.35 (0),platform=Linux 4.4.38-v7+ armv7l)
Of course, when the script dies after throwing this exception, the Chromium instance is killed before it can finish its housekeeping, which means that next time it has to start over, so it takes just as long as the last time and fails just as hard.
If I then manually intervene and run Chromium as a normal user, I just... wait... a minute... or two, for Chromium to finish its upgrade housekeeping, then it opens its browser window, and then I cleanly quit the application. Now that the housekeeping is done, Chromium starts up the next time at a more normal speed, so all of the sudden my Python script runs without any error because the ChromeDriver sees Chromium finish launching within its accepted timeout window.
Everything will likely be fine until the next automatic update comes down, and then this same problem will happen all over again. I don't want to have to manually intervene after every update, nor do I want to disable automatic updates.
The root of the question
How can I tell ChromeDriver not to give up so quickly on launching Chromium?
I looked for some sort of timeout value that I could set, but I couldn't find any in the ChromeDriver or Selenium for Python documentation.
Interestingly, there is a timeout argument that can be passed to the Firefox WebDriver, as shown in the Selenium for Python API documentation:
timeout – Time to wait for Firefox to launch when using the extension connection.
This parameter is also listed for the Internet Explorer WebDriver, but it's notably absent in the Chrome WebDriver API documentation.
I also wouldn't mind passing something directly to ChromeDriver via service_args, but I couldn't find any relevant options in the ChromeDriver docs.
Update: found root cause of post-upgrade slowness
After struggling with finding a way to reproduce this problem in order to test solutions, I was able to pinpoint the reason Chromium takes forever to launch after an upgrade.
It seems that, as part of its post-upgrade housekeeping, Chromium rebuilds the user's font cache. This is a CPU & I/O intensive process that is especially hard on a Raspberry Pi and its SD card, hence the extreme 2.5 minute launch time whenever the font cache has to be rebuilt.
The problem can be reproduced by purposely deleting the font cache, which forces a rebuild:
pi@rpi-dev1:~$ killall chromium-browser
pi@rpi-dev1:~$ time chromium-browser --headless --disable-gpu --dump-dom 'about:blank'
[0405/132706.970822:ERROR:gpu_process_transport_factory.cc(1019)] Lost UI shared context.
<html><head></head><body></body></html>

real    0m0.708s
user    0m0.340s
sys     0m0.200s

pi@rpi-dev1:~$ rm -Rf ~/.cache/fontconfig
pi@rpi-dev1:~$ time chromium-browser --headless --disable-gpu --dump-dom 'about:blank'
[0405/132720.917590:ERROR:gpu_process_transport_factory.cc(1019)] Lost UI shared context.
<html><head></head><body></body></html>

real    2m9.449s
user    2m8.670s
sys     0m0.590s
You are right; there is no option to explicitly set the timeout of the initial driver creation. I would recommend visiting their git page HERE and creating a new issue. It also has the links for the direct ChromeDriver site in case you want to file a bug there. Currently, there is no option to set the timeout that I could find.
You could try something like this in the meantime though:
import webbrowser
from selenium import webdriver
from selenium.common.exceptions import WebDriverException

try:
    driver = webdriver.Chrome()
except WebDriverException:
    webbrowser.open_new('http://www.Google.com')
    # Let this try and get Chrome open, then go back and use webdriver
Here is the documentation on webbrowser:
https://docs.python.org/3/library/webbrowser.html
As per your question, without your code trial it would be tough to analyze the reason behind the error you are seeing:
selenium.common.exceptions.WebDriverException: Message: chrome not reachable
Perhaps more details about the versions of the binaries you are using would have helped us in some way.
Factually, asking ChromeDriver to wait longer for Chrome to launch before giving up won't help us, as the default configuration of ChromeDriver takes care of the optimum needs.
However, WebDriverException: Message: chrome not reachable is a pretty common issue when the binary versions are incompatible. You can find a detailed discussion of this issue at org.openqa.selenium.WebDriverException: chrome not reachable - when attempting to start a new session
The bad news
It turns out that not only is there no timeout option for Selenium to pass to ChromeDriver, but short of recompiling your own custom ChromeDriver, there is currently no way to change this value programmatically whatsoever. Sadly, looking at the source code shows that Google has hard-coded a timeout value of 60 seconds!
from chromium /src/chrome/test/chromedriver/chrome_launcher.cc#208:
std::unique_ptr<DevToolsHttpClient> client(new DevToolsHttpClient(
    address, context_getter, socket_factory, std::move(device_metrics),
    std::move(window_types), capabilities->page_load_strategy));
base::TimeTicks deadline =
    base::TimeTicks::Now() + base::TimeDelta::FromSeconds(60);
Status status = client->Init(deadline - base::TimeTicks::Now());
Until this code is changed to allow custom deadlines, the only option is a workaround.
The workaround
I ended up taking an approach that "primed" Chromium before having Selenium call ChromeDriver. This gets the one-time, post-upgrade slow start out of the way before ChromeDriver ever begins its countdown. The answer @PixelEinstein gave helped lead me down the right path, but this solution differs in two ways:
The call to open standalone Chromium here is blocking, while webbrowser.open_new() is not.
Standalone Chromium is always launched before ChromeDriver whether it is needed or not. I did this because waiting one minute for ChromeDriver to timeout, then waiting another 2.5 minutes for Chromium to start, then trying ChromeDriver again created a total delay of just over 3.5 minutes. Launching Chromium as the first action brings the total wait time down to about 2.5 minutes, as you skip the initial ChromeDriver timeout. On occasions when the long startup time doesn't occur, then this "double loading" of Chromium is negligible, as the whole process finishes in a matter of seconds.
Here's the code snippet:
#!/usr/bin/env python3
import subprocess
from selenium import webdriver
some_site = 'http://www.google.com'
chromedriver_path = '/usr/lib/chromium-browser/chromedriver'
# Block until Chromium finishes launching and self-terminates
subprocess.run(['chromium-browser', '--headless', '--disable-gpu', '--dump-dom', 'about:blank'])
browser = webdriver.Chrome(executable_path=chromedriver_path)
browser.get(some_site)
# Continue on with your Selenium business...
Before instantiating a webdriver.Chrome() object, this waits for Chromium to finish its post-upgrade housekeeping no matter how long it takes. Chromium is launched in headless mode where --dump-dom is a one-shot operation that writes the requested web page (in this case about:blank) to stdout, which is ignored. Chromium self-terminates after completing the operation, which then returns from the subprocess.run() call, unblocking program flow. After that, it's safe to let ChromeDriver start its countdown, as Chromium will launch in a matter of seconds.

Runaway memory usage with Selenium using PhantomJS

I have written a script in Python that iterates over a long list of webpages and gathers data, using Selenium and PhantomJS as the webdriver (since I'm running it on a remote terminal machine running Linux, and needed to use a headless browser). For short jobs, e.g. where it has to iterate over a few pages, there are no issues. However, for longer jobs, where it has to iterate through a longer list of pages, I see the memory usage increase dramatically over time, each time a new page is loaded. Eventually after about 20 odd pages the script is killed due to memory overflow.
Here is how I initialize my browser -
from selenium import webdriver
url = 'http://someurl.com/'
browser = webdriver.PhantomJS()
browser.get(url)
The page has next buttons and I iterate through the pages by finding the xpath for the 'Next >' button -
next_xpath = "//*[contains(text(), 'Next >')]"
next_link = browser.find_element_by_xpath(next_xpath)
next_link.click()
I have tried clearing cookies and cache for the PhantomJS browser in the following ways -
browser.get('javascript:localStorage.clear();')
browser.get('javascript:sessionStorage.clear();')
browser.delete_all_cookies()
However none of these has had any impact on memory usage. When I use the Firefox driver, on my local machine it works without any issues, though it should be noted that my local machine has much more memory than the remote server.
My apologies if any crucial information is missing. Please feel free to let me know how I can make my question more comprehensive.
If a headless browser is working for you, I would like to suggest a solution that helped me solve a similar memory issue.
Use AWS Lambda as your server with the parallel-execution library xdist, and bingo, you would never face a memory issue, as Lambda is a managed service. Make sure to upload your captured data to S3 and clean up the temp directory on Lambda. (I have already implemented this in one of my projects and it works like a charm.)
