Speeding up Chrome Webdriver in Selenium

I am making a simple bot with Selenium that will like, comment, and message people at certain intervals.
I am using chrome web driver:
browser = webdriver.Chrome()
Also, I am on an x64 Linux system; the distro is Ubuntu 15.04 and I am running with python3 from the terminal.
This works, but it's pretty slow, and I know that as my code progresses, testing the app will become a pain. I've looked into this already and suspect it may have something to do with the proxy settings.
I am clueless when it comes to this type of stuff.
I fiddled with my system settings and changed my proxy settings to not require a connection, but nothing changed.
When the driver loads, I see 'Establishing secure connection' for a few seconds in the browser window; I suspect this is a culprit.
'Establishing host' also shows up multiple times. It takes about 5-8 seconds just to get a page.
login_url = 'http://www.skout.com/login'
browser.get(login_url)
In what ways can I speed up the Chrome driver, and is it the proxy settings? It could definitely be something else.
Thanks for your time.

Chrome webdriver can be clunky and a bit slow to initialize, as it spawns a fresh browser instance every time you construct the Webdriver object.
If speed is of the utmost importance, I would recommend investing some time in looking at a headless alternative such as PhantomJS. This can save a significant amount of time if you are running multiple tests or instances of your application.
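As an illustration, a minimal headless run might look like the sketch below. Note this uses Chrome's own headless mode rather than PhantomJS (a substitution on my part; newer Selenium releases have dropped PhantomJS support, so headless Chrome is the usual modern substitute):

```python
from selenium import webdriver

# Headless mode skips rendering a visible browser window, which is where
# much of the start-up and page-display time goes.
options = webdriver.ChromeOptions()
options.add_argument('--headless')

browser = webdriver.Chrome(options=options)
browser.get('http://www.skout.com/login')
browser.quit()
```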

Related

Python Selenium Scraping on Local but not on VPS

I'm writing a Python web scraper using Selenium. I've been testing it on my local machine, where it finishes the "scraping" stage of my program and then processes the data into a cloud database.
However, on a VPS (Ubuntu), the scraping reaches up to 100% but never actually finishes, and the program then enters limbo (running forever). I have timeouts on my ChromeDriver, with logging whenever it times out on a website, so that shouldn't be the issue.
These are my options:
option = webdriver.ChromeOptions()
option.add_argument("--no-sandbox")
# option.add_argument("--disable-extensions")
# option.add_argument("--disable-setuid-sandbox")
# option.add_argument('--disable-application-cache')
# option.add_argument("enable-automation")
# option.add_argument("--disable-browser-side-navigation")
# option.add_argument("start-maximized")
option.add_argument("--headless")
option.add_argument("--disable-dev-shm-usage")
option.add_argument("--disable-gpu")
# option.add_argument("--blink-settings=imagesEnabled=false")
I removed the commented-out options to test whether it would make a difference. It did seem to speed up the scraping process on the VPS, but the scraping still does not finish.
How do I find out what the issue is, and what could it be?
Thanks.

MaxRetryError while web scraping workaround - Python, Selenium

I am having a really hard time figuring out how to web scrape while making multiple requests to the same website. I have to scrape 3000 products from a website, which implies making various requests to that server (for example, searching for the product, clicking on it, going back to the home page) 3000 times.
Note that I am using Selenium. If I only launch one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on the webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found that this happens because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried 100, 200 and 300 secs), but then I get a MaxRetryError because of too many requests to that URL using the same session.
I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I am only able to make one more search before the exception is thrown again, and so on).
I am wondering if there is any workaround for this kind of issue: perhaps using another library, a way of changing the IP or session dynamically, or something like that.
P.S. I would rather keep working with selenium if possible.
This error is normally raised when the server detects a high request rate from your client.
As you mentioned, the server bans your IP from making further requests, so you can get around that with some available technologies; look into Zalenium, and also see here for some other possible approaches.
Another possible (but tedious) way is to use a number of separate browser instances to make the calls; for example, an answer from here illustrates that.
urlArr = ['https://link1', 'https://link2', '...']

for url in urlArr:
    chrome_options = Options()
    browser = webdriver.Chrome(executable_path='C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe', options=chrome_options)
    browser.get(url)
    # your task
    browser.close()  # will close only the current chrome window
browser.quit()  # should close all of the open windows
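If you would rather keep a single loop, the restart-per-batch idea can be sketched like this. Note this is a hypothetical helper, not a Selenium API: `make_driver` is any zero-argument factory returning a driver-like object exposing `quit()` (e.g. `lambda: webdriver.Firefox()`), and `process` does the per-product work:

```python
import time

def scrape_in_batches(items, make_driver, process, batch_size=100, cooldown=0):
    """Scrape items in fixed-size batches, recreating the browser between
    batches so its memory use cannot grow across thousands of searches."""
    results = []
    for start in range(0, len(items), batch_size):
        driver = make_driver()              # fresh browser session per batch
        try:
            for item in items[start:start + batch_size]:
                results.append(process(driver, item))
        finally:
            driver.quit()                   # always release the old session
        if cooldown and start + batch_size < len(items):
            time.sleep(cooldown)            # optional pause between sessions
    return results
```

Restarting per batch (rather than on a timer) keeps each session short-lived, and the optional cooldown spaces the sessions out so the server is less likely to flag the request rate.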

Selenium headless browser webdriver [Errno 104] Connection reset by peer

I am trying to scrape data from the URLs below, but Selenium fails on driver.get(url). Sometimes the error is [Errno 104] Connection reset by peer, sometimes [Errno 111] Connection refused. On rare days it works just fine, and on my Mac with a real browser the same spider works fine every single time, so this isn't related to my spider.
I have tried many solutions, like waiting for selectors on the page, implicit waits, using selenium-requests to pass proper request headers, etc., but nothing seems to work.
http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals
I am using Python, Selenium and a headless Firefox webdriver to achieve this. The OS is CentOS 6.5.
Note: I have many AJAX-heavy pages that get scraped successfully; some are below.
http://www.infibeam.com/deal-of-the-day.html, http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals
Already spent many days trying to debug the issue with no luck. Any help would be appreciated.
After days of wrestling with this issue, I finally found the cause; I am writing it up here for the benefit of the community. The headless browser was failing due to a lack of RAM on the server, and the strange error messages from the webdriver were a real pain.
The server had been running for 60 days straight without a reboot; rebooting it did the trick. After increasing the swap to 3 times its previous size, I have not faced the issue for the past few days. I also scheduled a task to clean up page-cache memory (http://www.yourownlinux.com/2013/10/how-to-free-up-release-unused-cached-memory-in-linux.html).
Found this question while looking for similar error.
Looks like it's a Selenium 3.8.1 and 3.9.0 bug.
https://github.com/SeleniumHQ/selenium/issues/5296
Downgrading to 3.8.0 solves this problem.
I have been using Selenium and chromedriver (python3) for scraping purposes for some time now. With the latest Google Chrome update I had to deal with two issues.
1) Error on webdriver launch:
Solution: I had to add the "--no-sandbox" argument.
chrome_options.add_argument('--no-sandbox')
2) [Errno 104] Connection reset by peer:
Solution: there seems to be a problem with sockets and HTTP requests; either the webpage content is too big or you don't give the page enough time to load. At least that's what I thought.
I set the maximum page load time to 60 seconds and it seems to be working fine.
driver.set_page_load_timeout(60)
I added a small delay between webdriver initialisations, which also seems to help.
time.sleep(0.5)
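Put together, a driver set-up using all three of these adjustments might look like the sketch below (assuming a current Selenium Python client; the 60-second timeout and 0.5-second delay are just the values that worked above):

```python
import time

from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--no-sandbox')   # fixes the launch error from issue 1

driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(60)              # cap page loads at 60 seconds

# ... scraping work with driver ...
driver.quit()

time.sleep(0.5)  # small delay before initialising the next webdriver
```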

IE11 stuck at initial start page for webdriver server only when connection to the URL is slow

The IE webdriver opens the IE browser and begins to load the local host, then stops (i.e. it never starts loading the URL). When the browser stops loading it shows the message 'Initial start page for webdriver server'. The problem is that this does not occur every time I execute the test case, making it difficult to identify the cause of the issue. What I have noticed is that when this issue occurs, the URL takes ~25 secs to load manually on the same machine; when the issue does not occur, the URL loads within 3 secs.
All security settings are the same (Protected Mode enabled across all zones)
Enhanced Protected Mode is disabled
IE version 11
The URL is added as a trusted site.
Any clue why it does not load the URL sometimes?
I would try disabling IE native events. Sorry that I cannot provide you the Python syntax right away; the following is C#, which should be fairly easy to convert.
var ieOptions = new InternetExplorerOptions { EnableNativeEvents = false };
ieOptions.EnsureCleanSession = true;
driver = new InternetExplorerDriver(ieOptions);
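For reference, a rough Python equivalent of the C# above might be (a sketch assuming a Selenium Python client recent enough to ship `selenium.webdriver.ie.options`; verify the attribute names against your version):

```python
from selenium import webdriver
from selenium.webdriver.ie.options import Options as IeOptions

ie_options = IeOptions()
ie_options.native_events = False        # EnableNativeEvents = false
ie_options.ensure_clean_session = True  # EnsureCleanSession = true

driver = webdriver.Ie(options=ie_options)
```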
Use a remote driver with the desired capability (pageLoadStrategy).
Release notes from seleniumhq.org are below. Note that we had to use version 2.46 for the jar, IEDriverServer.exe and the Python client driver in order to have things work correctly; it is unclear why 2.45 does not work given the release notes below.
v2.45.0.2
Updates to JavaScript automation atoms.
Added pageLoadStrategy to IE driver. Setting a capability named pageLoadStrategy when creating a session with the IE driver will now change the wait behavior when navigating to a new page. The valid values are:
"normal" - Waits for document.readyState to be 'complete'. This is the default, and is the same behavior as all previous versions of the IE driver.
"eager" - Will abort the wait when document.readyState is 'interactive' instead of waiting for 'complete'.
"none" - Will abort the wait immediately, without waiting for any of the page to load.
Setting the capability to an invalid value will result in use of the "normal" page load strategy.
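In the Python client the same capability can be requested when the session is created; a sketch (assuming a modern client that exposes `page_load_strategy` on the options object; the 2.46-era client passed it through the capabilities dict instead):

```python
from selenium import webdriver
from selenium.webdriver.ie.options import Options as IeOptions

ie_options = IeOptions()
ie_options.page_load_strategy = 'eager'  # 'normal' (default), 'eager' or 'none'

driver = webdriver.Ie(options=ie_options)
```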
This question hasn't been updated for a while, but recently I had a very similar issue: IEDriverServer would eventually open the page under test, but in most cases it just got stuck on the initial page of WebDriver.
What I found to be the root cause (in my case) was the startup setting of IE. I had 'Start with tabs from the last session' enabled; when I changed it back to 'Start with home page', the driver started to work like a charm, opening the page under test in 100% of tries.

Firefox 14 on Ubuntu gets stuck connecting

I have a custom application written in Python on Ubuntu. It's a bit hairy to unwind all the pieces to get to a reduced question (I will post more if I get there), but I have a few things to enumerate. After trial and error, I have narrowed this problem down to just Firefox 14.
Things were fine on Firefox 13; then Firefox 14 was rolled out on Ubuntu, and stuff broke. (This is not uncommon, but I can't find this problem referenced anywhere yet.)
We go to a page in our web service and reload 10 or so times, and then the reload hangs, spinning with 'Connecting' in the status bar.
Connections in Firefox are being consumed by XHRs; increasing the max connection setting in Firefox works around the issue. Basically we open an XHR that in Chrome I can't even see, but in Firefox it shows with a spinner in Firebug. That XHR seems to stay open across page reloads, and it eventually consumes the open connections to the site.
After a couple of minutes or so, a connection frees up and the load goes through.
Has anyone seen this? Is there a proper way to release the connection? None of the other browsers we tried have this problem.
Thanks!
I have many tests in my Rails application that worked fine before I updated to Firefox 14.0.1. After that, the Firefox browser opens and just hangs there. I had to switch to Chrome (I downloaded the driver from Google). If it is of any help, this is how I initialize the driver in Ruby:
@driver = Selenium::WebDriver.for :chrome, :switches => %w[--ignore-certificate-errors --disable-popup-blocking --disable-translate]
Upgrading to Firefox 15 beta solved the problem. If I find anything in the FF release notes, I'll update this answer.
There is now a Firefox bug to track this issue.
