I have written a Python script that iterates over a long list of webpages and gathers data, using Selenium with PhantomJS as the webdriver (I'm running it on a remote Linux machine and need a headless browser). For short jobs, where it only has to iterate over a few pages, there are no issues. For longer jobs, however, memory usage increases dramatically each time a new page is loaded, and after about 20 pages the script is killed for running out of memory.
Here is how I initialize my browser -
from selenium import webdriver
url = 'http://someurl.com/'
browser = webdriver.PhantomJS()
browser.get(url)
The page has next buttons and I iterate through the pages by finding the xpath for the 'Next >' button -
next_xpath = "//*[contains(text(), 'Next >')]"
next_link = browser.find_element_by_xpath(next_xpath)
next_link.click()
I have tried clearing cookies and cache for the PhantomJS browser in the following ways -
browser.get('javascript:localStorage.clear();')
browser.get('javascript:sessionStorage.clear();')
browser.delete_all_cookies()
However none of these has had any impact on memory usage. When I use the Firefox driver, on my local machine it works without any issues, though it should be noted that my local machine has much more memory than the remote server.
My apologies if any crucial information is missing. Please feel free to let me know how I can make my question more comprehensive.
If a headless browser works for you, I would like to suggest a solution that helped me solve a similar memory issue.
Use AWS Lambda as your server together with the parallel execution library xdist. You should then no longer run into the memory issue, as Lambda is a managed service. Make sure to upload your captured data to S3 and clean up the temp directory on Lambda. (I have already implemented this in one of my projects and it works like a charm.)
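The temp directory cleanup mentioned above could look roughly like this. It is only a sketch: the function name `clear_temp_dir` is made up for illustration, it assumes your scraper writes its files under a single directory (on Lambda that would normally be /tmp), and the S3 upload is left out because it depends on your bucket setup.

```python
import shutil
import tempfile
from pathlib import Path


def clear_temp_dir(tmp_dir):
    """Delete everything inside tmp_dir but keep the directory itself.

    Returns the number of top-level entries that were removed.
    """
    removed = 0
    for entry in Path(tmp_dir).iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry, ignore_errors=True)  # browser profiles, caches
        else:
            entry.unlink()  # plain files and symlinks
        removed += 1
    return removed


if __name__ == "__main__":
    # Demonstrate on a throwaway directory instead of the real /tmp.
    demo = Path(tempfile.mkdtemp())
    (demo / "page1.html").write_text("captured data")
    (demo / "profile").mkdir()
    print(clear_temp_dir(demo), "entries removed")
    demo.rmdir()
```

Calling it at the end of each invocation, after the upload has finished, keeps leftovers from one run from piling up when the execution environment is reused.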
I'm trying to develop memory-efficient scripts. I believe using a single Firefox instance (window) would speed up script execution and reduce memory use, making Selenium scripts in Python more efficient (by allowing more scripts to run simultaneously). It would also make browser sessions more organized and easier to use.
I have tried using --connect-existing argument to the options variable
options.add_argument("--connect-existing")
driver = webdriver.Firefox(options=options)
but still saw a new window open.
Tried these solutions: https://stackoverflow.com/a/37964479
https://stackoverflow.com/a/73964036/20356446
But they fail to run as well.
I'm trying to automate logging in and scraping data using Selenium in Python, with chromedriver as the driver. I want to do this with 5 accounts, and I'm still figuring out the best method.
For now, I put the code in a Python file and create a batch file to run it, so I can open multiple batch files with multiple accounts.
The problem is that CPU usage is too high, so I can only run 3 accounts.
The troubleshooting I have done so far is changing the options to make Chrome headless using this code:
self.options = webdriver.ChromeOptions()
self.options.headless = True
self.options.add_argument(f'user-agent={user_agent}')
self.options.add_argument("--window-size=1920,1080")
self.options.add_argument('--ignore-certificate-errors')
self.options.add_argument('--allow-running-insecure-content')
self.options.add_argument("--disable-extensions")
self.options.add_argument("--proxy-server='direct://'")
self.options.add_argument("--proxy-bypass-list=*")
self.options.add_argument("--start-maximized")
self.options.add_argument('--disable-gpu')
self.options.add_argument('--disable-dev-shm-usage')
self.options.add_argument("--FontRenderHinting[none]")
self.options.add_argument('--no-sandbox')
self.options.add_argument('log-level=3')
self.driver = webdriver.Chrome(executable_path="chromedriver.exe", options=self.options)
But I still can't reach my target (5 accounts).
Is there anything I can do to achieve it?
Thanks.
I have noticed (by watching the processes in htop while tests are running) that Chrome's crash reporting takes up a huge amount of CPU and memory.
Since you are using Chrome Headless, I've found that adding this reduces the CPU usage by about 20% for me:
--disable-crash-reporter
This switch only takes effect when you are running headless.
This might speed things up for you!
My settings are currently as follows, and they reduce CPU usage by about 20% (though the time saving is only marginal):
self.options.add_argument("--no-sandbox")
self.options.add_argument("--disable-dev-shm-usage")
self.options.add_argument("--disable-renderer-backgrounding")
self.options.add_argument("--disable-background-timer-throttling")
self.options.add_argument("--disable-backgrounding-occluded-windows")
self.options.add_argument("--disable-client-side-phishing-detection")
self.options.add_argument("--disable-crash-reporter")
self.options.add_argument("--disable-oopr-debug-crash-dump")
self.options.add_argument("--no-crash-upload")
self.options.add_argument("--disable-gpu")
self.options.add_argument("--disable-extensions")
self.options.add_argument("--disable-low-res-tiling")
self.options.add_argument("--log-level=3")
self.options.add_argument("--silent")
I found this to be a pretty good list (full list I think) of command line switches with explanations:
https://peter.sh/experiments/chromium-command-line-switches/
Some additional things you can turn off are also mentioned here:
https://github.com/GoogleChrome/chrome-launcher/blob/main/docs/chrome-flags-for-tools.md
I hope this helps someone
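If you reuse these switches across several scripts, it may be easier to keep them in one list and apply them in a loop. Below is a minimal sketch of that idea. The names `LOW_OVERHEAD_FLAGS` and `apply_flags` are made up for illustration, and the list holds only a subset of the switches from the settings above; the helper just assumes it is handed an object with an `add_argument()` method, such as Selenium's `ChromeOptions`.

```python
# A subset of the switches listed above; extend it as needed.
LOW_OVERHEAD_FLAGS = [
    "--no-sandbox",
    "--disable-dev-shm-usage",
    "--disable-renderer-backgrounding",
    "--disable-background-timer-throttling",
    "--disable-crash-reporter",
    "--no-crash-upload",
    "--disable-gpu",
    "--disable-extensions",
    "--log-level=3",
]


def apply_flags(options, flags=LOW_OVERHEAD_FLAGS):
    """Add each flag once to any object that exposes add_argument()."""
    seen = set()
    for flag in flags:
        if flag not in seen:  # skip accidental duplicates
            options.add_argument(flag)
            seen.add(flag)
    return options


# Typical use with Selenium (not executed here, requires selenium installed):
#   from selenium import webdriver
#   options = apply_flags(webdriver.ChromeOptions())
#   driver = webdriver.Chrome(options=options)
```

Keeping the list in one place also makes it simple to remove a switch again if it turns out to make no measurable difference on your machine.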
self.options.add_argument('--headless')
This will make the driver run way faster than before.
I am using selenium with Firefox to automate some tasks on Instagram. It basically goes back and forth between user profiles and notifications page and does tasks based on what it finds.
It has one infinite loop that makes sure that the task keeps on going. I have sleep() function every few steps but the memory usage keeps increasing. I have something like this in Python:
while True:
    expected_conditions()
    ...doTask()
    driver.back()
    expected_conditions()
    ...doAnotherTask()
    driver.forward()
    expected_conditions()
I never close the driver because that would slow down the program a lot, as it has a lot of queries to process. Is there any way to keep the memory usage from increasing over time without closing or quitting the driver?
EDIT: Added explicit conditions but that did not help either. I am using headless mode of Firefox.
Well, this is a serious problem I had been struggling with for some days, but I have found a solution. You can add some flags to optimize your memory usage.
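# Assumption: these are Chromium switches, so Chrome's Options class is used here.
from selenium.webdriver.chrome.options import Options
options = Options()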
options.add_argument("start-maximized")
options.add_argument("disable-infobars")
options.add_argument("--disable-extensions")
options.add_argument('--no-sandbox')
options.add_argument('--disable-application-cache')
options.add_argument('--disable-gpu')
options.add_argument("--disable-dev-shm-usage")
These are the flags I added. Before I added them, RAM usage kept increasing, and once it crossed 4 GB (my machine has 8 GB) the machine froze. After I added these flags, memory usage didn't exceed 500 MB. And, as DebanjanB's answer discusses, if you are running a for loop or while loop, try adding a few seconds of sleep after each iteration; it gives some time to kill the unused threads.
To start with, Selenium has very little control over the amount of RAM used by Firefox. As you mentioned, the browser client (Mozilla Firefox) goes back and forth between user profiles and the notifications page on Instagram and does tasks based on what it finds, which is too broad for a single use case. So the first and foremost task would be to break up the infinite loop for your use case into smaller tests.
time.sleep()
Inducing time.sleep() virtually puts a blanket over the underlying issue. However while using Selenium and WebDriver to execute tests through your Automation Framework, using time.sleep() without any specific condition defeats the purpose of automation and should be avoided at any cost. As per the documentation:
time.sleep(secs) suspends the execution of the current thread for the given number of seconds. The argument may be a floating point number to indicate a more precise sleep time. The actual suspension time may be less than that requested because any caught signal will terminate the sleep() following execution of that signal’s catching routine. Also, the suspension time may be longer than requested by an arbitrary amount because of the scheduling of other activity in the system.
You can find a detailed discussion in How to sleep webdriver in python for milliseconds
Analysis
There were previous instances when Firefox consumed about 80% of the RAM.
However, as per this discussion, some users feel that the more memory is used the better, because it means no RAM is wasted. Firefox uses RAM to make its processes faster, since application data is accessed much faster in RAM.
Solution
You can implement either/all of the generic/specific steps as follows:
Upgrade Selenium to current levels Version 3.141.59.
Upgrade GeckoDriver to GeckoDriver v0.24.0 level.
Upgrade Firefox version to Firefox v65.0.2 levels.
Clean your Project Workspace through your IDE and Rebuild your project with required dependencies only.
If your base Web Client version is too old, then uninstall it and install a recent GA and released version of Web Client.
Some extensions allow you to block such unnecessary content, as an example:
uBlock Origin allows you to hide ads on websites.
NoScript allows you to selectively enable and disable all scripts running on websites.
To open the Firefox client with an extension you can download the extension i.e. the XPI file from https://addons.mozilla.org and use the add_extension(extension='webdriver.xpi') method to add the extension in a FirefoxProfile as follows:
from selenium import webdriver
profile = webdriver.FirefoxProfile()
profile.add_extension(extension='extension_name.xpi')
driver = webdriver.Firefox(firefox_profile=profile, executable_path=r'C:\path\to\geckodriver.exe')
If your tests don't require CSS, you can disable it by following this discussion.
Use Explicit Waits or Implicit Waits.
Use driver.quit() to close all the browser windows and terminate the WebDriver session because if you do not use quit() at the end of the program, the WebDriver session will not be closed properly and the files will not be cleared off memory. And this may result in memory leak errors.
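One way to make sure quit() is always reached, even when a test fails halfway, is to wrap the driver in a context manager. The sketch below is an illustration only: `managed_driver` is a made-up helper name, and it assumes the driver is created by a zero-argument callable such as `webdriver.Firefox`.

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(factory):
    """Create a driver with factory() and always quit() it afterwards."""
    driver = factory()
    try:
        yield driver
    finally:
        # Runs on normal exit and on exceptions, so the browser
        # process and its temporary profile are always released.
        driver.quit()


# Typical use with Selenium (not executed here, requires selenium installed):
#   from selenium import webdriver
#   with managed_driver(webdriver.Firefox) as driver:
#       driver.get("http://someurl.com/")
```

The try/finally is the important part; the context manager just keeps it out of the way of the test logic.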
Creating a new Firefox profile and reusing it every time you run test cases in Firefox should eventually improve execution performance. Without it, a new profile is created on every run and caching happens there; if driver.quit() somehow does not get called before a failure, you end up with new profiles holding cached information that keep consuming memory.
// ------------ Creating a new firefox profile -------------------
1. If Firefox is open, close Firefox.
2. Press Windows +R on the keyboard. A Run dialog will open.
3. In the Run dialog box, type in firefox.exe -P
Note: You can use -P or -ProfileManager(either one should work).
4. Click OK.
5. Create a new profile and set its location to the RAM Drive.
// ----------- Associating Firefox profile -------------------
ProfilesIni profile = new ProfilesIni();
FirefoxProfile myprofile = profile.getProfile("automation_profile");
WebDriver driver = new FirefoxDriver(myprofile);
Please share the execution performance with the community if you plan to implement it this way.
There is no fix for that as of now.
I suggest you use the driver.close() approach.
I was also struggling with the RAM issue. What I did was count the number of loop iterations, and when the count reached a certain number (for me it was 200) I called driver.close(), started the driver again, and reset the count.
This way I did not need to close the driver on every iteration, so it has less effect on performance too.
Try this. Maybe it will help in your case too.
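The counting approach described above could be sketched like this. The function name `process_with_recycling` and its parameters are made up for illustration. Note one deliberate difference: the sketch calls quit() rather than close(), on the assumption that you want the whole browser process to go away and not just the current window.

```python
def process_with_recycling(items, make_driver, handle, recycle_every=200):
    """Call handle(driver, item) for each item, replacing the driver
    with a fresh one after every `recycle_every` items."""
    driver = make_driver()
    done = 0
    try:
        for item in items:
            if done and done % recycle_every == 0:
                driver.quit()           # release the old browser's memory
                driver = None           # avoid a second quit() if restart fails
                driver = make_driver()  # start a fresh session
            handle(driver, item)
            done += 1
    finally:
        if driver is not None:
            driver.quit()
    return done


# Typical use with Selenium (not executed here, requires selenium installed):
#   from selenium import webdriver
#   process_with_recycling(urls, webdriver.Firefox,
#                          lambda d, url: d.get(url), recycle_every=200)
```

Any login state or cookies are lost on each restart, so `make_driver` would need to restore whatever session setup your task depends on.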
I am using selenium to run chrome headless with the following command:
system "LC_ALL=C google-chrome --headless --enable-logging --hide-scrollbars --remote-debugging-port=#{debug_port} --remote-debugging-address=0.0.0.0 --disable-gpu --no-sandbox --ignore-certificate-errors &"
However, it appears that Chrome headless is consuming too much memory and CPU. Does anyone know how we can limit the CPU/memory usage of Chrome headless, or if there is some workaround?
Thanks in advance.
There has been a lot of discussion about the unpredictable CPU and memory consumption of Chrome Headless sessions.
As per the discussion Building headless for minimum cpu+mem usage, the CPU + memory usage can be optimized by:
Using either a custom proxy or C++ ProtocolHandlers you could return stub 1x1 pixel images or even block them entirely.
Chromium Team is working on adding a programmatic control over when frames are produced. Currently headless chrome is still trying to render at 60 fps which is rather wasteful. Many pages do need a few frames (maybe 10-20 fps) to render properly (due to usage of requestAnimationFrame and animation triggers) but we expect there are a lot of CPU savings to be had here.
MemoryInfra should help you determine which component is the biggest consumer of memory in your setup.
Example usage:
$ headless_shell --remote-debugging-port=9222 --trace-startup=*,disabled-by-default-memory-infra http://www.chromium.org
Chromium is always going to use as many resources as are available to it. If you want to effectively limit its utilization, you should look into using cgroups.
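As a rough illustration of the cgroups suggestion, one option on systemd-based Linux systems is to launch Chrome through `systemd-run`, which places the process in a transient scope with resource limits. The sketch below only builds the command line. The helper name `cgroup_limited_command` and the limit values are assumptions for illustration, and whether the limits actually take effect depends on cgroup v2 and controller delegation on your machine (you may need to drop `--user` and run it as root).

```python
import shlex


def cgroup_limited_command(command, memory_max="1G", cpu_quota="50%"):
    """Prefix `command` so systemd runs it in a transient scope
    with a memory ceiling and a CPU quota."""
    return [
        "systemd-run", "--user", "--scope",
        "-p", f"MemoryMax={memory_max}",
        "-p", f"CPUQuota={cpu_quota}",
        *command,
    ]


if __name__ == "__main__":
    chrome = ["google-chrome", "--headless", "--disable-gpu",
              "--remote-debugging-port=9222"]
    # Print the command; pass the list to subprocess.Popen to actually run it.
    print(shlex.join(cgroup_limited_command(chrome)))
```

With a memory ceiling in place, Chrome is killed by the kernel when it exceeds the limit instead of taking the whole machine down, so the calling script should be prepared to restart it.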
With those points in mind, here are some common best practices to adopt when running headless browsers in a production environment:
Fig: Volatile resource usage of Headless Chrome
Don't run a headless browser:
By all accounts, if at all possible, just don't run a headless browser. Headless browsers are unpredictable and hungry. Almost everything you can do with a browser (save for interpolating and running JavaScript) can be done with simple Linux tools. There are libraries that offer elegant Node APIs for fetching data via HTTP requests and scraping, if that's your end goal.
Don't run a headless browser when you don't need to:
Some users attempt to keep the browser open, even when not in use, so that it's always available for connections. While this might be a good strategy to help expedite session launch, it'll only end in misery after a few hours. This is largely because browsers like to cache stuff and slowly eat more memory. Any time you're not actively using the browser, close it!
Parallelize with browsers, not pages:
Since you should only run a browser when absolutely necessary, the next best practice is to run only one session through each browser. While you actually might save some overhead by parallelizing work through pages, if one page crashes it can bring down the entire browser with it. On top of that, each page isn't guaranteed to be totally clean (cookies and storage might bleed through).
page.waitForNavigation:
One of the most common issues observed is an action that triggers a page load, followed by the sudden loss of your script's execution. This is because actions that trigger a page load can often cause subsequent work to get swallowed. To get around this issue, you will generally have to invoke the page-loading action and immediately wait for the next page load.
Use docker to contain it all:
Chrome needs a lot of dependencies to run properly. Even after all of that is in place, there are things like fonts and phantom processes to worry about, so it's ideal to use some sort of container to contain it. Docker is almost custom-built for this task, as you can limit the amount of resources available and sandbox it. You can write your own Dockerfile for this.
And to avoid running into zombie processes (which commonly happen with Chrome), you'll want to use something like dumb-init to properly start-up.
Two different runtimes:
There can be two JavaScript runtimes going on (Node and the browser). This is great for the purposes of shareability, but it comes at the cost of confusion since some page methods will require you to explicitly pass in references (versus doing so with closures or hoisting).
As an example, while using page.evaluate deep down in the bowels of the protocol, this literally stringifies the function and passes it into Chrome, so things like closures and hoisting won't work at all. If you need to pass some references or values into an evaluate call, simply append them as arguments which get properly handled.
Reference: Observations running 2 million headless sessions
Consider using Docker. It has well-documented features for limiting the usage of system resources such as memory and CPU. The good news is that it's pretty easy to build a Docker image with headless Chrome (on top of X11) inside it.
There are lots of out-of-the-box solutions for that; check this one out: https://hub.docker.com/r/justinribeiro/chrome-headless/
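To give an idea of what the resource limits look like in practice, here is a small sketch that builds a `docker run` command with a memory and CPU cap. The helper name `docker_run_command`, the limit values, and the published port 9222 are assumptions for illustration; the image may need further options, so check its README before relying on this.

```python
import shlex


def docker_run_command(image, memory="512m", cpus="1.0", port=9222):
    """Build a `docker run` command that caps memory and CPU for the container."""
    return [
        "docker", "run", "--rm",
        "--init",                # run an init process that reaps zombie processes
        f"--memory={memory}",    # hard memory limit for the container
        f"--cpus={cpus}",        # number of CPUs the container may use
        "-p", f"{port}:{port}",  # expose the remote debugging port (assumed)
        image,
    ]


if __name__ == "__main__":
    # Print the command; pass the list to subprocess.run to actually start it.
    print(shlex.join(docker_run_command("justinribeiro/chrome-headless")))
```

The `--init` flag serves a similar purpose to the dumb-init suggestion in the other answer, without having to add it to the image yourself.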