Failed to establish a new connection: [Errno 111] Connection refused - python

I am trying to get data from Reuters and have the code below. But I think that, due to the continuous requests, I got blocked from scraping more data. Is there a way to resolve this? I am using Google Colab. Although there are a lot of similar questions, they are all unanswered, so I would really appreciate some help with this. Thanks!
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
from selenium import webdriver
import time

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.reuters.com/companies/AAPL.O")

links = []
news = []
i = 0
try:
    while True:
        news = driver.find_elements_by_xpath("//div[@class='item']")
        driver.execute_script("arguments[0].scrollIntoView(true);", news[i])
        if news[i].find_element_by_tag_name("time").get_attribute("innerText") == "a year ago":
            break
        links.append(news[i].find_element_by_tag_name("a").get_attribute("href"))
        i += 1
        time.sleep(.5)
except:
    pass
driver.quit()

#links
for link in links:
    paragraphs = driver.find_elements_by_xpath("//div[contains(@class,'Article__container')]/div/div/div[2]/p")
    for para in paragraphs:
        news.append(para.get_attribute("innerText"))

import pandas as pd
df = pd.DataFrame({'x':links, 'y':news})
df
Full error stacktrace:

Here's a generic answer.
Following is a list of things to keep in mind when scraping a website to avoid detection:
1) Adding User-Agent headers - Many websites refuse requests that don't carry valid headers, and the user-agent header is a particularly important one.
Example: chrome_options.add_argument("user-agent=Mozilla/5.0")
2) Setting a window size when going headless - Websites can often detect that a headless browser is hitting their server; a common workaround is to add a window-size argument to your script.
Example: chrome_options.add_argument("--window-size=1920,1080")
3) Mimicking human behavior - Avoid clicking or navigating through the website at very fast rates. Use waits to make your behavior more human-like.
4) Using random waits - This is a continuation of the previous point: people often keep constant delays between actions, and even that can lead to detection. Randomize the delays as well (a combined sketch of points 1-4 follows this list).
5) User-Agent rotation - Try changing your user agent from time to time when scraping a website.
6) IP rotation (using proxies) - Some websites ban individual IPs, or even entire geographical areas, once they are detected as scrapers. Rotating your IP can make the server believe the requests are coming from different devices. IP rotation combined with User-Agent rotation can be very effective.
Note: Please don't use freely available proxies; they have a very low success rate and hardly work. Use a premium proxy service.
7) Using external libraries - There are plenty of cases where none of the above works because the website has a very good bot-detection mechanism. At that point you might as well try the undetected_chromedriver library. It has come in handy a few times.
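For example, a minimal sketch combining points 1-4, mirroring the Colab/chromedriver setup from the question (the user-agent string and the wait bounds are just illustrative values):
import random
import time
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
# 1) a real-looking user-agent header
chrome_options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) '
                            'AppleWebKit/537.36 (KHTML, like Gecko) Chrome/110.0.0.0 Safari/537.36')
# 2) an explicit window size, since headless Chrome defaults to a small viewport
chrome_options.add_argument('--window-size=1920,1080')

driver = webdriver.Chrome('chromedriver', chrome_options=chrome_options)
driver.get("https://www.reuters.com/companies/AAPL.O")

# 3) and 4) human-like, randomized pauses between actions
time.sleep(random.uniform(1.0, 3.0))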

Related

How to retrieve YouTube startup delay and rebuffering events from a client-side Chrome browser?

I am looking for a way to accurately measure YouTube startup delay and rebuffering events from the Chrome web browser. Ideally I would like to use a Selenium-based Python script so I can repeat the experiment for a large number of YouTube videos. Below is the piece of code I started with; however, it doesn't return any meaningful figures, so I must be missing something. Am I on the right track? Any help would be appreciated.
driver = webdriver.Chrome(service=s, options=options)
driver.get("https://www.youtube.com/<example-video-id>")
sleep(5)
player_status = driver.execute_script("return document.getElementById('movie_player').getPlayerState()")
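One rough way to turn this into an actual startup figure is to poll the player state instead of reading it once. The sketch below assumes the movie_player element exposes getPlayerState() as in the snippet above (1 = playing, 3 = buffering, -1 = unstarted) and that the video autoplays; the video id, timeout and polling interval are placeholders.
import time
from selenium import webdriver

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://www.youtube.com/watch?v=<example-video-id>")  # placeholder video id

get_state = ("var p = document.getElementById('movie_player');"
             " return p && p.getPlayerState ? p.getPlayerState() : -1;")
start = time.time()
startup_delay = None
while time.time() - start < 60:               # give up after 60 s
    state = driver.execute_script(get_state)
    if state == 1:                            # 1 = playing
        startup_delay = time.time() - start
        break
    time.sleep(0.1)                           # poll every 100 ms
print("startup delay (s):", startup_delay)
driver.quit()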

Can't bypass cloudflare with python cloudscraper

I ran into a Cloudflare issue when I tried to parse the website.
I have this code:
import cloudscraper
url = "https://author.today"
scraper = cloudscraper.create_scraper()
print(scraper.post(url).status_code)
This code gives me:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
I searched for a workaround but couldn't find any solution. If you visit the website via a browser you can see:
Checking your browser before accessing author.today.
Is there any way to bypass Cloudflare in my case?
Install httpx:
pip3 install httpx[http2]
Define an HTTP/2 client:
client = httpx.Client(http2=True)
Make the request:
response = client.get("https://author.today")
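Put together, a minimal complete version of the above (just adding the import and printing the status code, like the cloudscraper snippet does) would be:
import httpx

client = httpx.Client(http2=True)
response = client.get("https://author.today")
print(response.status_code)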
Cheers!
Although it does not seem to work for this site, sometimes adding some parameters when initializing the scraper helps:
import cloudscraper

url = "https://author.today"
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)
print(scraper.post(url).status_code)
import cfscrape
from fake_useragent import UserAgent

ua = UserAgent()
s = cfscrape.create_scraper()
k = s.post("https://author.today", headers={"User-Agent": f"{ua.random}"})
print(k)
I'd try to create a Playwright scraper that mimics a real user; this works for me most of the time, you just need to find the right settings (they can vary from website to website).
Otherwise, if the website has a native app, try to figure out how the app behaves and then mimic it.
I can suggest the following workflow to "try" to avoid the Cloudflare WAF/bot mitigation:
don't cycle user agents, proxies or weird tunnels to surf
don't use fixed IP addresses; leased lines such as xDSL, home links and 4G/LTE are better
try to appear as mobile rather than desktop/tablet
try to reproduce pointer movements faithfully, i.e. record your mouse moves and replay them 1:1 while scraping (yes, you need JS enabled and a headless browser able to pass itself off as a "common" one; a rough sketch follows this list)
don't cycle through different Cloudflare-protected entities, otherwise the attacking IP will be greylisted within a minute (i.e. build your own blacklist of targets and never touch those entities, or you will land on the CF blacklist in no time)
try to reproduce real-life navigation in all aspects, including errors, waits and more
check the IP you used after every scrape against popular blacklists, otherwise bad errors will shortly appear (crowdsec is a good starting point)
the usual scrape is a googlebot-style scrape, and a single regex WAF rule on Cloudflare will block 99.99% of those attempts, so avoid faking Google and try to be LESS evil instead (e.g. ask the webmasters for APIs or a data export, if any exist)
Source: I have used Cloudflare with hundreds of domains and thousands of records (Enterprise) since the beginning of the company.
That way you will be closer to the point (and you will help them improve overall security).
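For the pointer-movement point above, a rough sketch of replaying pre-recorded mouse moves with Selenium's ActionChains; the target URL, offsets and delays are made-up placeholders, not a real recording:
import random
import time
from selenium import webdriver
from selenium.webdriver import ActionChains

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get("https://example.com")  # placeholder target

# (dx, dy) offsets captured from a real browsing session would go here
recorded_moves = [(120, 80), (35, -10), (-60, 40), (15, 95)]

actions = ActionChains(driver)
for dx, dy in recorded_moves:
    actions.move_by_offset(dx, dy)
    actions.pause(random.uniform(0.1, 0.4))  # uneven, human-like timing
actions.perform()

time.sleep(random.uniform(1, 3))  # linger like a real visitor would
driver.quit()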
I used this line:
scraper = cloudscraper.create_scraper(browser={'browser': 'chrome','platform': 'windows','mobile': False})
and then used the httpx package after that:
with httpx.Client() as s:
    # remaining code
And I was able to get past the cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version error.

Python Selenium script only works on the first execution (ERR_CONNECTION_CLOSED)

I'm trying to scrape a website that contains judicial information from my country (Colombia). I have a Python script that uses Selenium to open the website and later insert a case number:
pathDriver = 'yourpathdriver'
driver = webdriver.Chrome(executable_path=pathDriver)
url = 'https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion'
driver.get(url)
However, the script only works the first time it is executed; on later executions I get this error:
selenium.common.exceptions.WebDriverException: Message: unknown error: net::ERR_CONNECTION_CLOSED
I have to wait about 30 minutes to try the script again, but the result is the same: it only works the first time.
I've tried to open the browser with the --incognito flag but this doesn't work. Also, I've tried to find a way to send request headers with Selenium but it seems this feature is not supported.
I am using Windows 10 and ChromeDriver.
Is there any Selenium tip to overcome this issue?
Thanks
When I have seen this error it was a network issue (the site was not accessible from the internal company network). To confirm or rule this out, try running the tests from a computer outside your company, for example your home computer. There are more suggestions here, but some of them are advanced (dangerous) and you should apply them only if you know what you are doing.
Additionally, the site takes more than 20 seconds to load on my computer, and in the console I see this error:
GET https://consultaprocesos.ramajudicial.gov.co/js/chunk-3b114a7f.921eecf3.js net::ERR_CONNECTION_TIMED_OUT
However, this does not seem to be the cause of the observed behavior.
Another possible reason could be an outdated browser/WebDriver or incorrect disposal (quit()) of the driver. If the issue cannot be reproduced manually (opening the site without Selenium), you can try another WebDriver. You are using Chrome, so try Firefox.
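A minimal way to swap in Firefox for that check, assuming geckodriver is installed (the driver path is a placeholder):
from selenium import webdriver

pathDriver = 'yourpathgeckodriver'  # placeholder path to geckodriver
driver = webdriver.Firefox(executable_path=pathDriver)

url = 'https://consultaprocesos.ramajudicial.gov.co/Procesos/NumeroRadicacion'
driver.get(url)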

MaxRetryError while web scraping workaround - Python, Selenium

I am having a really hard time figuring out how to web scrape while making multiple requests to the same website. I have to scrape 3000 products from a website. That implies making several requests to that server (for example searching for the product, clicking on it, going back to the home page) 3000 times.
I should state that I am using Selenium. If I only launch one instance of my Firefox webdriver I don't get a MaxRetryError, but as the search goes on the webdriver gets slower and slower, and when the program reaches about half of the searches it stops responding. I looked it up on some forums and found out that it does so because of browser memory issues. So I tried quitting and re-instantiating the webdriver every n seconds (I tried 100, 200 and 300 secs), but when I do so I get the MaxRetryError because of too many requests to that URL using the same session.
I then tried making the program sleep for a minute when the exception occurs, but that hasn't worked (I am only able to make one more search and then the exception is thrown again, and so on).
I am wondering if there is any workaround for this kind of issue.
It might be another library, a way to change the IP or session dynamically, or something like that.
P.S. I would rather keep working with Selenium if possible.
This error is normally raised when the server detects a high request rate from your client.
As you mentioned, the server bans your IP from making further requests, so you can get around that with some of the available technologies. Look into Zalenium, and also see here for some other possible approaches.
Another possible (but tedious) way is to use a number of browser instances to make the calls; for example, an answer from here illustrates that:
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

urlArr = ['https://link1', 'https://link2', '...']
for url in urlArr:
    chrome_options = Options()
    chromedriver = webdriver.Chrome(executable_path='C:/Users/andre/Downloads/chromedriver_win32/chromedriver.exe', options=chrome_options)
    with chromedriver as browser:
        browser.get(url)
        # your task
        chromedriver.close()  # will close only the current chrome window
        browser.quit()  # should close all of the open windows

Selenium headless browser webdriver [Errno 104] Connection reset by peer

I am trying to scrape data from the URLs below, but Selenium fails on driver.get(url). Sometimes the error is [Errno 104] Connection reset by peer, sometimes [Errno 111] Connection refused. On rare days it works just fine, and on my Mac with a real browser the same spider works fine every single time. So this isn't related to my spider.
I have tried many solutions like waiting for selectors on the page, implicit waits, and using selenium-requests to pass proper request headers, but nothing seems to work.
http://www.snapdeal.com/offers/deal-of-the-day
https://paytm.com/shop/g/paytm-home/exclusive-discount-deals
I am using Python, Selenium and a headless Firefox webdriver to achieve this. The OS is CentOS 6.5.
Note: I have many AJAX-heavy pages that get scraped successfully; some are below.
http://www.infibeam.com/deal-of-the-day.html, http://www.amazon.in/gp/goldbox/ref=nav_topnav_deals
I have already spent many days trying to debug this issue with no luck. Any help would be appreciated.
After days of juggling with this issue I finally found the cause. Writing it here for the benefit of the community. The headless browser was failing due to lack of RAM on the server; the strange error messages from the webdriver were a real pain.
The server had been running for 60 days straight without a reboot; rebooting it did the trick. After increasing the swap threefold, I have not faced the issue for the past few days. I also scheduled a task to clean up the page file caches (http://www.yourownlinux.com/2013/10/how-to-free-up-release-unused-cached-memory-in-linux.html).
Found this question while looking for a similar error.
Looks like it's a Selenium 3.8.1 and 3.9.0 bug:
https://github.com/SeleniumHQ/selenium/issues/5296
Downgrading to 3.8.0 solves this problem.
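With pip, pinning that version would be:
pip install selenium==3.8.0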
I have been using Selenium and chromedriver (Python 3) for scraping purposes for some time now. With the latest Google Chrome update I had to deal with two issues:
1) Error on webdriver launch:
Solution: I had to add the "no-sandbox" argument.
chrome_options.add_argument('--no-sandbox')
2) [Errno 104] Connection reset by peer:
Solution: There seems to be a problem with sockets and HTTP requests. Either the webpage content is too big or the page isn't given enough time to load. At least that's what I thought.
I set the maximum page load time to 60 seconds and it seems to be working fine.
driver.set_page_load_timeout(60)
I added a small delay between webdriver initialisations, which also seems to help.
time.sleep(0.5)
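Putting both fixes from this answer together, a minimal sketch; the example URL comes from the question, and the timeout and delay values are just illustrative:
import time
from selenium import webdriver

urls = ["http://www.snapdeal.com/offers/deal-of-the-day"]  # example target from the question

for url in urls:
    chrome_options = webdriver.ChromeOptions()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--no-sandbox')   # fix for the launch error
    driver = webdriver.Chrome(options=chrome_options)  # assumes chromedriver is on PATH
    driver.set_page_load_timeout(60)              # allow slow pages up to 60 seconds
    try:
        driver.get(url)
        # ... scrape the page here ...
    finally:
        driver.quit()
    time.sleep(0.5)                               # small delay before the next webdriver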
