I am having a really hard time figuring out how to scrape a website when it requires making multiple requests to the same server. I have to scrape 3000 products from one site, which implies repeating a sequence of requests (searching for the product, clicking on it, going back to the home page) 3000 times.
I should mention that I am using Selenium. If I only launch one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on my webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found out that this happens because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried 100, 200 and 300 secs), but then I get a MaxRetryError because of too many requests to that URL from the same session.
I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I am only able to make one more search before the exception is thrown again, and so on).
I am wondering if there is any workaround for this kind of issue.
It might involve using another library, a way of changing the IP or session dynamically, or something like that.
P.S. I would rather keep working with selenium if possible.
This error is normally raised when the server detects a high request rate from your client.
As you mentioned, the server bans your IP from making further requests so you can get around that by using some available technologies. Look into Zalenium and also see here for some other possible ways.
Another possible (but tedious) way is to use a number of browser instances to make the call, for example, an answer from here illustrates that.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

urlArr = ['https://link1', 'https://link2', '...']

for url in urlArr:
    chrome_options = Options()
    chromedriver = webdriver.Chrome(
        executable_path='C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe',
        options=chrome_options)
    with chromedriver as browser:
        browser.get(url)
        # your task
    chromedriver.close()  # will close only the current Chrome window
    browser.quit()        # should close all of the open windows
Related
I am trying to get data from Reuters and have the code below, but I think that due to the continuous requests I got blocked from scraping more data. Is there a way to resolve this? I am using Google Colab. Although there are a lot of similar questions, they are all unanswered, so I would really appreciate some help with this. Thanks!
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
from selenium import webdriver
import time
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.reuters.com/companies/AAPL.O")
links = []
news = []
i = 0
try:
    while True:
        items = driver.find_elements_by_xpath("//div[@class='item']")
        driver.execute_script("arguments[0].scrollIntoView(true);", items[i])
        if items[i].find_element_by_tag_name("time").get_attribute("innerText") == "a year ago":
            break
        links.append(items[i].find_element_by_tag_name("a").get_attribute("href"))
        i += 1
        time.sleep(.5)
except Exception:
    pass
#links
for link in links:
    driver.get(link)
    paragraphs = driver.find_elements_by_xpath("//div[contains(@class,'Article__container')]/div/div/div[2]/p")
    news.append(" ".join(para.get_attribute("innerText") for para in paragraphs))
driver.quit()
import pandas as pd
df = pd.DataFrame({'x': links, 'y': news})
df
Here's a generic answer.
Following is a list of things to keep in mind when scraping a website, to prevent detection:
1) Adding User-Agent headers- Many websites do not allow access to their website if valid headers are not passed, and user-agent header is a very important one.
Example:- chrome_options.add_argument("user-agent=Mozilla/5.0")
2) Setting window-size when going headless- Websites are often able to detect when a headless browser is being used; a common workaround is to add a window-size argument to your scripts.
Example:- chrome_options.add_argument("--window-size=1920,1080")
3) Mimicking human behavior- Avoid clicking or navigating through the website at very fast rates. Use timely waits to make your behavior more human-like.
4) Using random waits- This is a continuation of the previous point: people often keep constant delays between actions, but even that can lead to detection. Randomize them as well.
5) User-Agent rotation- Try changing your user agent from time to time when scraping a website. (Read More)
6) IP-rotation (Using proxies)- Some websites ban individual IPs, or even entire geographical areas, from accessing their sites if they are detected as scrapers. Rotating your IP might trick the server into believing that the requests are coming from different devices. IP-rotation combined with User-Agent rotation can be very effective.
Note:- Please don't use freely available proxies; they have a very low success rate and hardly work. Use premium proxy services.
7) Using external libraries- There are a lot of cases where all of the above methods might not work, when the website has a very good bot-detection mechanism. At that point, you might as well try the undetected_chromedriver library. It has come in handy a few times.
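Points 4) and 5) can be sketched in plain Python. The user-agent strings below are illustrative placeholders (not exact real browser UAs), and the delay bounds are arbitrary choices:

```python
import random

# Illustrative pool of user-agent strings (placeholders, not exact real UAs)
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)",
    "Mozilla/5.0 (X11; Linux x86_64)",
]

def pick_user_agent():
    """Rotate user agents by picking one at random for each session."""
    return random.choice(USER_AGENTS)

def human_delay(base=2.0, jitter=3.0):
    """Return a randomized wait in seconds instead of a constant delay."""
    return base + random.uniform(0, jitter)

# Before each action you would call time.sleep(human_delay()), and pass
# the rotated agent via chrome_options.add_argument("user-agent=" + pick_user_agent())
```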
I'm trying to scrape a website that contains judicial information for my country (Colombia). I have a Python script that uses Selenium to open the website and then insert a process number:
from selenium import webdriver

pathDriver = 'yourpathdriver'
driver = webdriver.Chrome(executable_path=pathDriver)
url = 'https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion'
driver.get(url)
However, the script only works the first time it is executed; in later executions I get this error:
selenium.common.exceptions.WebDriverException: Message: unknown error: net::ERR_CONNECTION_CLOSED
I have to wait about 30 minutes before trying the script again, but the result is the same: it only works the first time.
I've tried to open the browser with the --incognito flag but this doesn't work. Also, I've tried to find a way to send request headers with Selenium but it seems this feature is not supported.
I am using Windows 10 and ChromeDriver.
Is there any Selenium tip to overcome this issue?
Thanks
When I have seen this error, it was a network issue (site not accessible from internal company network). To confirm or exclude this, try to run the tests from a computer outside your company, for example, your home computer. Here are more suggestions, but some of them are advanced (dangerous) and you should execute them only if you know what you are doing.
Additionally, the site takes more than 20 seconds to load on my computer, and in the console I see the error:
GET https://consultaprocesos.ramajudicial.gov.co/js/chunk-3b114a7f.921eecf3.js net::ERR_CONNECTION_TIMED_OUT
However, this does not seem to cause the observed behavior.
Another possible reason could be an outdated browser/WebDriver or incorrect disposal (quit()) of the driver. If the issue cannot be reproduced manually (opening the site without Selenium), you can try another WebDriver. You are using Chrome, so try Firefox.
I am making a simple bot with selenium that will like, comment and message people on certain intervals.
I am using chrome web driver:
browser = webdriver.Chrome()
Also, I am on a x64 linux system. Distro is ubuntu 15.04 and am running with python3 from terminal.
and this works fine and all, but it's pretty slow. I know that as my code progresses, testing the app will become a pain. I've looked into this already and know it may have something to do with the proxy settings.
I am clueless when it comes to this type of stuff.
I fiddled with my system settings and changed my proxy settings to not require a connection, but nothing changed.
I notice when the driver loads, I see 'Establishing secure connection' for a few seconds in the browser window. I feel this is a culprit.
Also, 'establishing host' shows up multiple times. I'd say it takes about 5-8 seconds just to get a page.
login_url = 'http://www.skout.com/login'
browser.get(login_url)
In what ways can I speed up chrome driver, and is it proxy settings? It could definitely be something else.
Thanks for your time.
Chrome webdriver can be clunky and a bit slow to initialize, as it spawns a fresh browser instance every time you instantiate the WebDriver object.
If speed is of the utmost importance I might recommend investing some time into looking at a headless alternative such as PhantomJS. This can save a significant amount of time if you are running multiple tests or instances of your application.
I am trying to scrape data from the URLs below, but Selenium fails on driver.get(url). Sometimes the error is [Errno 104] Connection reset by peer, sometimes [Errno 111] Connection refused. On rare days it works just fine, and on my Mac with a real browser the same spider works fine every single time, so this isn't related to my spider.
I have tried many solutions, like waiting for selectors on the page, implicit waits, using selenium-requests to pass proper request headers, etc. But nothing seems to work.
http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals
I am using Python, Selenium and a headless Firefox webdriver to achieve this. The OS is CentOS 6.5.
Note: I have many AJAX-heavy pages that get scraped successfully; some are below.
http://www.infibeam.com/deal-of-the-day.html, http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals
Already spent many days trying to debug the issue with no luck. Any help would be appreciated.
After days of wrestling with this issue, I finally found the cause. I am writing it here for the benefit of the community. The headless browser was failing due to a lack of RAM on the server; the strange error messages from the webdriver were a real pain.
The server had been running for 60 days straight without a reboot, and rebooting it did the trick. After tripling the swap space, I have not faced the issue for the past few days. I also scheduled a task to clean up page-file caches (http://www.yourownlinux.com/2013/10/how-to-free-up-release-unused-cached-memory-in-linux.html).
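The linked article's approach essentially drops the kernel's page cache via /proc. As a sketch, the scheduled task could be a cron entry like the following (requires root, and the hourly schedule is an arbitrary choice):

```
# /etc/crontab entry: flush filesystem buffers, then drop page caches hourly
0 * * * * root sync && echo 3 > /proc/sys/vm/drop_caches
```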
Found this question while looking for similar error.
Looks like it's a Selenium 3.8.1 and 3.9.0 bug.
https://github.com/SeleniumHQ/selenium/issues/5296
Downgrading to 3.8.0 solves this problem.
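If you need the downgrade to stick across environments, the version can be pinned, e.g. in requirements.txt (version number taken from this answer):

```
selenium==3.8.0
```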
I have been using Selenium and chromedriver (python3) for scraping purposes for some time now. With the latest Google Chrome update I had to deal with two issues.
1) Error on webdriver launch:
Solution: I had to add the "--no-sandbox" argument.
chrome_options.add_argument('--no-sandbox')
2) [Errno 104] Connection reset by peer:
Solution: There seems to be a problem with sockets and HTTP requests. Either the webpage content is too big or you don't give the page enough time to load; at least that's what I thought.
I set the maximum page load time to 60 seconds and it seems to be working fine.
driver.set_page_load_timeout(60)
I added a small delay between webdriver initialisations, which also seems to help.
time.sleep(0.5)
For the IE webdriver, it opens the IE browser but then starts to load localhost and stops (i.e. it never starts loading the actual URL). When the browser stops loading, it shows the message 'Initial start page for webdriver server'. The problem is that this does not occur every time I execute the test case, making it difficult to identify the cause of the issue. What I have noticed is that when this issue occurs, the URL takes ~25 secs to load manually on the same machine; when the issue does not occur, the URL loads within 3 secs.
All security settings are the same (Protected Mode enabled across all zones)
Enhanced Protected Mode is disabled
IE version 11
the URL is added as a trusted site.
Any clue why it does not load the URL sometimes?
I would try disabling IE native events. And sorry that I cannot provide you the Python syntax right away; the following is C#, which should be fairly easy to convert.
var ieOptions = new InternetExplorerOptions { EnableNativeEvents = false };
ieOptions.EnsureCleanSession = true;
driver = new InternetExplorerDriver(ieOptions);
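For reference, a rough Python equivalent would pass the corresponding IE capabilities. This is an untested sketch, and the commented-out webdriver.Ie call is only indicative:

```python
# IE driver capability names corresponding to the C# options above
caps = {
    "nativeEvents": False,          # EnableNativeEvents = false
    "ie.ensureCleanSession": True,  # EnsureCleanSession = true
}

# With Selenium installed, this would be passed roughly as:
# driver = webdriver.Ie(capabilities=caps)
```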
Use a remote driver with the desired capability (pageLoadStrategy).
Release notes from seleniumhq.org. Note that we had to use version 2.46 for the jar, IEDriverServer.exe and the Python client driver in order to have things work correctly. It is unclear why 2.45 does not work, given the release notes below.
v2.45.0.2
Updates to JavaScript automation atoms.
Added pageLoadStrategy to IE driver. Setting a capability named
pageLoadStrategy when creating a session with the IE driver will now change
the wait behavior when navigating to a new page. The valid values are:
"normal" - Waits for document.readyState to be 'complete'. This is the
default, and is the same behavior as all previous versions of
the IE driver.
"eager" - Will abort the wait when document.readyState is
'interactive' instead of waiting for 'complete'.
"none" - Will abort the wait immediately, without waiting for any of
the page to load.
Setting the capability to an invalid value will result in use of the
"normal" page load strategy.
This thread hasn't been updated for a while, but recently I had a very similar issue: IEDriverServer would eventually open the page under test, but in most cases it just got stuck on the initial page of WebDriver.
What I found to be the root cause (in my case) was a startup setting of IE. I had 'Start with tabs from the last session' enabled; when I changed it back to 'Start with home page', the driver started working like a charm, opening the page under test in 100% of tries.