Selenium headless browser webdriver [Errno 104] Connection reset by peer - python

I am trying to scrape data from the URLs below, but Selenium fails on driver.get(url). Sometimes the error is [Errno 104] Connection reset by peer, sometimes [Errno 111] Connection refused. On rare days it works just fine, and on my Mac with a real browser the same spider works fine every single time, so this isn't related to my spider.
I have tried many solutions like waiting for selectors on the page, implicit waits, using selenium-requests to pass proper request headers, etc., but nothing seems to work.
http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals
I am using Python, Selenium and a headless Firefox webdriver to achieve this. The OS is CentOS 6.5.
Note: I have many AJAX-heavy pages that get scraped successfully; some are below.
http://www.infibeam.com/deal-of-the-day.html, http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals
I have already spent many days trying to debug the issue with no luck. Any help would be appreciated.

After days of wrestling with this issue, I finally found the cause. Writing it here for the benefit of the community. The headless browser was failing due to lack of RAM on the server; the strange error messages from webdriver were a real pain.
The server had been running for 60 days straight without a reboot; rebooting it did the trick. After increasing the swap to 3 times its previous size, I have not faced the issue for the past few days. I also scheduled a task to clean up the page cache (http://www.yourownlinux.com/2013/10/how-to-free-up-release-unused-cached-memory-in-linux.html).
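Since the root cause here was memory pressure, it can help to check free RAM and swap before launching the headless browser. Below is a minimal sketch of such a check; it assumes the psutil package is installed, and the 500 MB threshold is just an illustrative number, not something from the original setup.
import psutil

mem = psutil.virtual_memory()
swap = psutil.swap_memory()
print("available RAM: %d MB, free swap: %d MB" % (mem.available // 2**20, swap.free // 2**20))
if mem.available < 500 * 2**20:  # illustrative threshold, tune for your server
    raise RuntimeError("Not enough free RAM to safely start a headless browser")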

Found this question while looking for a similar error.
Looks like it's a Selenium 3.8.1 and 3.9.0 bug.
https://github.com/SeleniumHQ/selenium/issues/5296
Downgrading to 3.8.0 solves this problem.
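A quick, hedged sketch for checking which Selenium version you are actually running before deciding whether the downgrade applies to you:
import selenium

print(selenium.__version__)  # if this prints 3.8.1 or 3.9.0, try: pip install selenium==3.8.0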

I have been using Selenium and chromedriver (python3) for scraping purposes for some time now. With the latest Google Chrome update I had to deal with two issues.
1) Error on webdriver launch:
Solution: I had to add "no-sandbox" argument.
chrome_options.add_argument('--no-sandbox')
2) [Errno 104] Connection reset by peer:
Solution: There seems to be a problem with sockets and HTTP requests. Either the webpage content is too big or you don't give the page enough time to load; at least that's what I thought.
I set the maximum page load time to 60 seconds and it seems to be working fine.
driver.set_page_load_timeout(60)
I also added a small delay between webdriver initialisations, which seems to help.
time.sleep(0.5)
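Putting the pieces above together, here is a minimal sketch of that setup; the URL is a placeholder and the timings are just the values mentioned above, not anything definitive.
import time
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')  # fix for the launch error
driver = webdriver.Chrome(options=chrome_options)
driver.set_page_load_timeout(60)  # give heavy pages up to 60 seconds to load
try:
    driver.get('https://example.com')  # placeholder URL
finally:
    driver.quit()
time.sleep(0.5)  # short pause before the next webdriver is initialised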

Related

Failed to establish a new connection: [Errno 111] Connection refused

I am trying to get data from Reuters and have the code below, but I think that, due to continuous requests, I got blocked from scraping more data. Is there a way to resolve this? I am using Google Colab. Although there are a lot of similar questions, they are all unanswered, so I would really appreciate some help with this. Thanks!
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
from selenium import webdriver
import time
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.reuters.com/companies/AAPL.O")
links=[]
news=[]
i=0
try:
    while True:
        items = driver.find_elements_by_xpath("//div[@class='item']")
        driver.execute_script("arguments[0].scrollIntoView(true);", items[i])
        if items[i].find_element_by_tag_name("time").get_attribute("innerText") == "a year ago":
            break
        links.append(items[i].find_element_by_tag_name("a").get_attribute("href"))
        i += 1
        time.sleep(.5)
except:
    pass
# links
for link in links:
    driver.get(link)  # open each article before collecting its paragraphs
    paragraphs = driver.find_elements_by_xpath("//div[contains(@class,'Article__container')]/div/div/div[2]/p")
    news.append(" ".join(p.get_attribute("innerText") for p in paragraphs))  # one text blob per article
driver.quit()
import pandas as pd
df = pd.DataFrame({'x': links, 'y': news})
df
Here's a generic answer.
The following is a list of things to keep in mind when scraping a website to prevent detection (a small sketch combining a few of them follows the list):
1) Adding User-Agent headers - Many websites do not allow access if valid headers are not passed, and the user-agent header is a very important one.
Example: chrome_options.add_argument("user-agent=Mozilla/5.0")
2) Setting window-size when going headless - Websites are often able to detect when a headless browser is hitting their server; a common workaround is to add a window-size argument to your scripts.
Example: chrome_options.add_argument("--window-size=1920,1080")
3) Mimicking human behavior - Avoid clicking or navigating through the website at very fast rates. Use timely waits to make your behavior more human-like.
4) Using random waits - This is a continuation of the previous point: people often keep constant delays between actions, and even that can lead to detection. Randomize them as well.
5) User-Agent rotation - Try changing your user agent from time to time when scraping a website.
6) IP rotation (using proxies) - Some websites ban individual IPs, or even whole geographical areas, from accessing their site if they are detected as a scraper. Rotating your IP might trick the server into believing that the requests are coming from different devices. IP rotation combined with User-Agent rotation can be very effective.
Note: please don't use freely available proxies; they have a very low success rate and hardly work. Use premium proxy services.
7) Using external libraries - There are a lot of cases where all of the above might not work, when the website has a very good bot-detection mechanism. At that point, you might as well try the undetected_chromedriver library. It has come in handy a few times.
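As promised above, here is a minimal sketch combining points 1, 2, 3 and 4 (user-agent, window size, and randomized waits). The user-agent string, URL and delay bounds are purely illustrative, not values from the original answer.
import random
import time
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--window-size=1920,1080')  # point 2: explicit window size
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)')  # point 1: illustrative UA

driver = webdriver.Chrome(options=chrome_options)
try:
    driver.get('https://example.com')  # placeholder URL
    time.sleep(random.uniform(2, 5))  # points 3 and 4: randomized, human-like delay
    # ... interact with the page here ...
finally:
    driver.quit()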

Python Selenium script only works in the first execution (ERR_CONNECTION_CLOSED)

I'm trying to scrape a website that contains judicial information from my country (Colombia). I have a Python script that uses Selenium to open the website and later insert a process number:
from selenium import webdriver

pathDriver = 'yourpathdriver'
driver = webdriver.Chrome(executable_path=pathDriver)
url = 'https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion'
driver.get(url)
However, the script only works the first time it is executed; in later executions I get this error:
selenium.common.exceptions.WebDriverException: Message: unknown error: net::ERR_CONNECTION_CLOSED
I have to wait about 30 minutes to try the script again, but the result is the same: it only works the first time.
I've tried to open the browser with the --incognito flag but this doesn't work. Also, I've tried to find a way to send request headers with Selenium but it seems this feature is not supported.
I am using Windows 10 and ChromeDriver.
Is there any Selenium tip to overcome this issue?
Thanks
When I have seen this error, it was a network issue (site not accessible from internal company network). To confirm or exclude this, try to run the tests from a computer outside your company, for example, your home computer. Here are more suggestions, but some of them are advanced (dangerous) and you should execute them only if you know what you are doing.
Additionally, the site takes more than 20 seconds to load on my computer, and in the console I see the error:
GET https://consultaprocesos.ramajudicial.gov.co/js/chunk-3b114a7f.921eecf3.js net::ERR_CONNECTION_TIMED_OUT
However, this does not seem to cause the observed behavior.
Another possible reason could be an outdated browser/WebDriver or incorrect disposal (quit()) of the driver. If the issue is not reproduced manually (opening the site without Selenium), you can try with another WebDriver. You are using Chrome, so try with Firefox.
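A minimal sketch of the "clean disposal plus trying another WebDriver" suggestion; it assumes geckodriver is installed and on PATH, and uses the URL from the question.
from selenium import webdriver

driver = webdriver.Firefox()
try:
    driver.get('https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion')
    # ... interact with the page ...
finally:
    driver.quit()  # always release the browser session, even if the script fails mid-run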

MaxRetryError while web scraping workaround - Python, Selenium

I am having a really hard time figuring out how to web scrape while making multiple requests to the same website. I have to scrape 3000 products from a website. That implies making various requests to that server (for example searching for the product, clicking on it, going back to the home page) 3000 times.
Note that I am using Selenium. If I only launch one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on my webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found out it does so because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried with 100, 200 and 300 secs), but when I do so I get the MaxRetryError because of too many requests to that URL using the same session.
I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I am only able to make one more search and then the exception is thrown again, and so on).
I am wondering if there is any workaround for this kind of issue.
It might be using another library, a way for changing IP or session dynamically or something like that.
P.S. I would rather keep working with selenium if possible.
This error is normally raised when the server detects a high request rate from your client.
As you mentioned, the server bans your IP from making further requests so you can get around that by using some available technologies. Look into Zalenium and also see here for some other possible ways.
Another possible (but tedious) way is to use a number of browser instances to make the calls; for example, an answer from here illustrates that.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

urlArr = ['https://link1', 'https://link2', '...']
for url in urlArr:
    chrome_options = Options()
    chromedriver = webdriver.Chrome(executable_path='C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe', options=chrome_options)
    with chromedriver as browser:
        browser.get(url)
        # your task
    chromedriver.close()  # will close only the current Chrome window
    browser.quit()  # should close all of the open windows

python - Speeding up Chrome Webdriver in Selenium

I am making a simple bot with Selenium that will like, comment and message people at certain intervals.
I am using chrome web driver:
browser = webdriver.Chrome()
Also, I am on an x64 Linux system; the distro is Ubuntu 15.04 and I am running it with python3 from the terminal.
This works well and all, but it's pretty slow. I know that as my code progresses, testing the app will become a pain. I've looked into this already and know it may have something to do with the proxy settings.
I am clueless when it comes to this type of stuff.
I fiddled with my system settings and changed my proxy settings to not require a connection, but nothing changed.
I notice that when the driver loads, I see 'Establishing secure connection' for a few seconds in the browser window. I feel this is the culprit.
Also, 'establishing host' shows up multiple times. I'd say it takes about 5-8 seconds just to get a page.
login_url = 'http://www.skout.com/login'
browser.get(login_url)
In what ways can I speed up chrome driver, and is it proxy settings? It could definitely be something else.
Thanks for your time.
Chrome webdriver can be clunky and a bit slow to initialize as it is spawning a fresh instance every time you call the Webdriver object.
If speed is of the utmost importance I might recommend investing some time into looking at a headless alternative such as PhantomJS. This can save a significant amount of time if you are running multiple tests or instances of your application.
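A minimal sketch of swapping in PhantomJS for the example above; it assumes the phantomjs binary is installed and on PATH, and a Selenium release that still ships the PhantomJS driver (it has since been deprecated in favour of headless Chrome/Firefox).
from selenium import webdriver

browser = webdriver.PhantomJS()  # headless, so no browser window is drawn
browser.get('http://www.skout.com/login')
# ... same interactions as with the Chrome driver ...
browser.quit()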

Firefox 14 on Ubuntu gets stuck connecting

I have a custom application written in Python on Ubuntu. It's a bit hairy to unwind all the pieces to get to a reduced question to ask (I will post more if I get there), but I have a few things to enumerate. After trial and error, I have narrowed this problem down to just Firefox 14.
Things were fine on Firefox 13; Firefox 14 was updated on Ubuntu, and stuff broke. (This is not uncommon, but I can't find this problem referenced anywhere yet.)
We go to a page in our web service and reload it, 10 or so times, and then the reload hangs, spinning with "Connecting" in the status bar.
Connections in Firefox are getting consumed by XHRs. Increasing the max-connection setting in Firefox works around the issue. Basically we open an XHR that in Chrome I can't even see, but in Firefox it shows up with a spinner in Firebug. That XHR seems to stay open across page reloads, and eventually consumes the open connections to the site.
After a couple minutes or so, a connection frees up and the load goes through.
Has anyone seen this? Is there a proper way to release the connection? All other browsers tried are not having this problem.
Thanks!
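For reference, the max-connection workaround described in the question can also be applied from a Selenium script via a Firefox profile preference. A hedged sketch: the preference name network.http.max-persistent-connections-per-server is a standard Firefox setting, but the value 10 is just an illustrative bump over the default.
from selenium import webdriver

profile = webdriver.FirefoxProfile()
profile.set_preference("network.http.max-persistent-connections-per-server", 10)  # Firefox default is 6
driver = webdriver.Firefox(firefox_profile=profile)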
I have many tests in my Rails application that worked OK before I updated to Firefox 14.01. After that, the Firefox browser opens and just hangs there. I had to switch to Chrome (downloaded the driver from Google). If it's of any help, this is how I initialize the driver in Ruby:
@driver = Selenium::WebDriver.for :chrome, :switches => %w[--ignore-certificate-errors --disable-popup-blocking --disable-translate]
Upgrading to Firefox 15 beta has solved the problem. If I find anything in the FF release notes, I'll update the answer.
There is now a Firefox bug to track this issue.
