I faced a Cloudflare issue when I tried to parse the website. I have this code:
import cloudscraper
url = "https://author.today"
scraper = cloudscraper.create_scraper()
print(scraper.post(url).status_code)
This code prints:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
I searched for a workaround, but couldn't find any solution. If you visit the website in a browser, you can see
Checking your browser before accessing author.today.
Is there any solution to bypass Cloudflare in my case?
Install httpx:
pip3 install httpx[http2]
Define an HTTP/2 client:
import httpx

client = httpx.Client(http2=True)
Make a request:
response = client.get("https://author.today")
Cheers!
Although it does not seem to work for this site, sometimes adding some parameters when initializing the scraper helps:
import cloudscraper
url = "https://author.today"
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)
print(scraper.post(url).status_code)
import cfscrape
from fake_useragent import UserAgent
ua = UserAgent()
s = cfscrape.create_scraper()
k = s.post("https://author.today", headers={"User-Agent": f"{ua.random}"})
print(k)
I'd try to create a Playwright scraper that mimics a real user; this works for me most of the time, you just need to find the right settings (they can vary from website to website). A rough sketch is below.
Otherwise, if the website has a native App, try to figure out how the App behaves and then mimic it.
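As a rough illustration of the Playwright approach (not a definitive recipe), a minimal sketch might look like this; the viewport, locale and user agent are made-up illustrative values you would tune per site:
# Minimal sketch: a Playwright context tuned to look like a regular user.
# The concrete values (viewport, locale, user agent) are illustrative only.
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # headful is usually harder to detect than headless
    context = browser.new_context(
        viewport={"width": 1366, "height": 768},
        locale="en-US",
        user_agent=("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                    "AppleWebKit/537.36 (KHTML, like Gecko) "
                    "Chrome/120.0.0.0 Safari/537.36"),
    )
    page = context.new_page()
    page.goto("https://author.today", wait_until="networkidle")
    print(page.title())
    browser.close()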
I can suggest the following workflow to "try" to avoid Cloudflare WAF/bot mitigation:
don't cycle user agents, proxies or weird tunnels to surf
don't use fixed IP addresses; prefer connections like xDSL, home links and 4G/LTE
try to appear as mobile instead of a desktop/tablet
try to reproduce pointer movements like a real user, i.e. record your mouse moves and replay them 1:1 while scraping (yes, you need JS enabled and a headless browser able to pass as a "common" one; see the sketch at the end of this answer)
don't cycle through different Cloudflare-protected sites, otherwise your IP will be greylisted in a minute (i.e. build your own blacklist of targets and never touch them, or you will end up on the CF blacklist in no time)
try to reproduce real-life navigation in all aspects, including errors, waits and more
check the IP you used after every scrape against popular blacklists, otherwise nasty errors will soon appear (CrowdSec is a good starting point)
the usual scrape poses as a Googlebot scrape, and a single regex WAF rule on Cloudflare will then block 99.99% of those attempts, so avoid faking Google and try to be less evil instead (e.g. ask webmasters for APIs or data exports, if any).
Source: I use Cloudflare with hundreds of domains and thousands of records (Enterprise) from the beginning of the company.
That way you will be closer to the point (and you will help them increase their overall security).
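To illustrate the pointer-replay point above, here is a rough Playwright sketch; the recorded coordinates and timings are invented for illustration, and in practice you would capture them from a real browsing session and play them back 1:1:
# Rough sketch of replaying pre-recorded pointer movements.
# The coordinate/timing list below is made-up data; record a real session
# and replay it 1:1 in practice.
import time
from playwright.sync_api import sync_playwright

recorded_moves = [  # (x, y, seconds to wait before the move)
    (120, 340, 0.15), (260, 410, 0.08), (430, 500, 0.22), (610, 480, 0.30),
]

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)
    page = browser.new_page()
    page.goto("https://example.com")
    for x, y, delay in recorded_moves:
        time.sleep(delay)                # keep the original human pacing
        page.mouse.move(x, y, steps=10)  # interpolate so the move isn't a single jump
    page.mouse.wheel(0, 600)             # a natural-looking scroll
    browser.close()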
I used this line:
scraper = cloudscraper.create_scraper(browser={'browser': 'chrome','platform': 'windows','mobile': False})
and then used httpx package after that
with httpx.Client() as s:
    # remaining code
And I was able to get past the cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version error.
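The answer above does not show how the two pieces are wired together; one plausible reading is to let cloudscraper prepare its browser-like headers and cookies and then reuse them inside an httpx client, roughly like the sketch below. This is only an interpretation, not the poster's exact code, and whether it actually gets past the challenge will depend on the site.
# Hedged sketch: reuse the headers/cookies prepared by cloudscraper in an
# httpx client. One interpretation of the combination described above.
import cloudscraper
import httpx

scraper = cloudscraper.create_scraper(
    browser={'browser': 'chrome', 'platform': 'windows', 'mobile': False}
)
url = "https://author.today"

with httpx.Client(http2=True, headers=dict(scraper.headers),
                  cookies=scraper.cookies) as s:
    response = s.get(url)
    print(response.status_code)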
Related
I am trying to get data from Reuters and have the code as below. But I think due to continuous requests, I got blocked from scraping more data. Is there a way to resolve this? I am using Google Colab. Although there are a lot of similar questions, they are all unanswered. So would really appreciate if I could get some help with this. Thanks!
!pip install selenium
!apt-get update
!apt install chromium-chromedriver
from selenium import webdriver
import time
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--no-sandbox')
chrome_options.add_argument('--disable-dev-shm-usage')
driver = webdriver.Chrome('chromedriver',chrome_options=chrome_options)
driver.maximize_window()
driver.implicitly_wait(10)
driver.get("https://www.reuters.com/companies/AAPL.O")
links=[]
news=[]
i=0
try:
    while True:
        news = driver.find_elements_by_xpath("//div[@class='item']")
        driver.execute_script("arguments[0].scrollIntoView(true);", news[i])
        if news[i].find_element_by_tag_name("time").get_attribute("innerText") == "a year ago":
            break
        links.append(news[i].find_element_by_tag_name("a").get_attribute("href"))
        i += 1
        time.sleep(.5)
except:
    pass
driver.quit()
#links
for link in links:
    paragraphs = driver.find_elements_by_xpath("//div[contains(@class,'Article__container')]/div/div/div[2]/p")
    for para in paragraphs:
        news.append(para.get_attribute("innerText"))
import pandas as pd
df = pd.DataFrame({'x':links, 'y':news})
df
Full error stacktrace:
Here's a generic answer.
Following is a list of things to keep in mind when scraping a website to prevent detection:
1) Adding User-Agent headers- Many websites do not allow access if valid headers are not passed, and the user-agent header is a very important one.
Example:- chrome_options.add_argument("user-agent=Mozilla/5.0")
2) Setting window-size when going headless- Websites are often able to detect when a headless browser is hitting them; a common workaround is to add a window-size argument to your scripts.
Example:- chrome_options.add_argument("--window-size=1920,1080")
3) Mimicking human behavior- Avoid clicking or navigating through the website at very fast rates. Use timely waits to make your behavior more human-like.
4) Using random waits- This is a continuation of the previous point: people often keep constant delays between actions, and even that can lead to detection. Randomize the delays as well.
5) User-Agent rotation- Try changing your user agent from time to time when scraping a website.
6) IP rotation (using proxies)- Some websites ban individual IPs, or even entire geographical areas, from accessing their sites if they are detected as a scraper. Rotating your IP might trick the server into believing that the requests are coming from different devices. IP rotation combined with User-Agent rotation can be very effective.
Note:- Please don't use any freely available proxies, they have very low success rate, and hardly work. Use premium proxy services.
7) Using external libraries- There are a lot of cases where all the above methods might not work, because the website has a very good bot-detection mechanism. In that case, you might as well try the undetected_chromedriver library. It has come in handy a few times; a minimal sketch combining several of these points follows.
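Putting a few of these points together, a minimal sketch with undetected_chromedriver might look like this; the target URL and the wait range are placeholders:
# Minimal sketch combining a fixed user agent, an explicit window size,
# random human-like pauses, and undetected_chromedriver.
import random
import time

import undetected_chromedriver as uc

options = uc.ChromeOptions()
options.add_argument("user-agent=Mozilla/5.0")   # point 1
options.add_argument("--window-size=1920,1080")  # point 2

driver = uc.Chrome(options=options)              # point 7
try:
    driver.get("https://example.com")            # placeholder URL
    time.sleep(random.uniform(2, 5))             # points 3 and 4
    print(driver.title)
finally:
    driver.quit()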
This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used this code. Which works fine with URLs like e.g. https://www.google.co.in, or https://www.pythonprogramming.net
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1 which I need for scraping data, it shows
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req
proxy = req.ProxyHandler({'http': r'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please guide me on what the issue is here, as I am not able to understand it.
Also while searching for the above error, I read something about absolute URLs. Is that related to it?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getAddrInfo) fails. Hence the error.
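If you want to see what your own host resolves the name to (so you can compare it with what the proxy reports), a quick check with the standard library looks like this:
# Check what the local resolver returns for the host name; if this works
# locally but the proxied request still fails with getaddrinfo, it is the
# proxy's resolver that cannot resolve the name.
import socket

try:
    infos = socket.getaddrinfo("www.genecards.org", 443)
    print(sorted({info[4][0] for info in infos}))
except socket.gaierror as exc:
    print("local resolution failed too:", exc)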
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by those sites that, among other things - how shall I put it - take a dim view of being scraped:
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring for a consultation license. Scraping a web site that does not care to provide its information in an easier format (or is unable to, or hasn't yet come around to it) is one thing - stealing that information is quite different.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Does it respond? The problem seems related to security on the server side.
I am using Python (2.7) and Selenium (3.4.3) to drive Firefox (52.2.0 ESR) via geckodriver (0.19.0) to automate a process on a CentOS 7 machine.
I need totally unattended operation of this automation with user credentials passed through; no storage allowed and no breaking in.
One piece of drama is being caused by the fact that the internal website required for the process is within an Active Directory domain while the machine running my automation is not. I have no need to validate the user, only pass the credentials to the website in such a way as to not require human interaction or for the person to be a local user on the machine.
I have tried various permutations of:
[protocol]://[user,pass]#[url]
driver.switch_to_alert() + send_keys
It seems some of those only work on IE, something I have no access to.
I have checked for libraries to handle this and all to no avail.
I can add libraries to python and I have sudo access to the machine - can't touch authentication, so AD integration is not possible.
How can I give this AD website the credentials of an arbitrary user such that no local storage of their credentials happens and no user interaction is required?
Thank you
EDIT
I think something like a proxy which could authenticate the user then retain that authentication for selenium to do its thing ...
Is there a simple LDAP/AD proxy available?
EDIT 2
Perhaps a very simple way of stating this is that I want to pass user credentials and prevent the authentication popup from happening.
Solution Found:
I needed to use a browser extension.
My solution has been built for Chromium, but it should port almost unchanged to Firefox and maybe Edge.
First up, you need 2 APIs to be available for your browser:
webRequest.onAuthRequired - Chrome & Firefox
runtime.nativeMessaging - Chrome & Firefox
While both browser APIs are very similar, they do have some significant differences - such as Chrome's implementation lacking Promises.
If you set up your Native Messaging Host to send a properly-formed JSON string, you need only poll it once. This means you can use a single call to runtime.sendNativeMessage() and be assured that your credentials are parseable. Pun intended.
Next, we need to look at how we're supposed to handle the webRequest.onAuthRequired event.
Since I'm working in Chromium, I need to use the promise-less Chrome API.
chrome.webRequest.onAuthRequired.addListener(
    callbackFunctionHere,
    {urls: [targetUrls]},
    ['asyncBlocking']  // --> this line is important, too. Very.
);
The Change:
I'll be calling my function provideCredentials because I'm a big stealy-stealer and used an example from this source. Look for the asynchronous version.
The example code fetches the credentials from storage.local ...
chrome.storage.local.get(null, gotCredentials);
We don't want that. Nope.
We want to get the credentials from a single call to sendNativeMessage so we'll change that one line.
chrome.runtime.sendNativeMessage(hostName, { text: "Ready" }, gotCredentials);
That's all it takes. Seriously. As long as your Host plays nice, this is the big secret. I won't even tell you how long it took me to find it!
Links:
My questions with helpful links:
Here - Workaround for Authenticating against Active Directory
Here - Also has some working code for a functional NM Host
Here - Some enlightening material on promises
So this turns out to be a non-trivial problem.
I haven't implemented the solution, yet, but I know how to get there...
Passing values to an extension is the first step - this can be done in both Chrome and Firefox. Check your browser version to make sure the required API, nativeMessaging, actually exists in it. I have had to switch to Chromium for this reason.
Alternatively, one can use the storage API to put values in browser storage first. [edit: I did not go this way for security concerns]
Next is to use the onAuthRequired event from the webRequest API . Setup a listener on the event and pass in the values you need.
Caveats: I have built everything right up to the extension itself for the nativeMessaging API solution and there's still a problem with getting the script to recognise the data. This is almost certainly my JavaScript skills clashing with the arcane knowledge required to make these APIs make much sense ...
I have yet to attempt the storage method as it's less secure (in my mind) but it does seem to be simpler.
I'm running Selenium 2.0b4dev on Selenium Grid in Ubuntu 10.04, using Python code to write test cases. I've been having trouble with getting basic HTTP authentication to a specific site working, and with a quick google search found that my problem could be solved with the addition of the line self.selenium.add_custom_request_header("Authorization", "Basic %s" % _encoded) (with a proper line break in the middle to conform to PEP 8, of course.)
Unfortunately, apparently also through my search I found in order for that line of code to work I need to configure my browser (whichever one I'm using to run the test cases on the grid) to treat Selenium's (automatically running, apparently?) proxy server as a proxy for that browser to use. But apparently I need to modify the profile of Firefox (or IE)'s launcher to automatically use that proxy, since the whole point of these Selenium Grid test cases is that they aren't supposed to require user intervention, and I have little to no idea how to do that. I've just been using the "ant launch-hub" and "ant launch-remote-control" and then running python programs on the hub that import selenium and unittest.
If anyone could help, that would be just fantastic.
I wrote up an article on how to do this in Ruby. It links to a complementary article on testing self-signed certificates and gives you the set of flags you need to launch Selenium with.
http://mogotest.com/blog/2010/06/23/how-to-perform-basic-auth-in-selenium
To pass args through from grid to the underlying RC server, you need to use something like:
ant -DseleniumArgs="-trustAllSSLCertificates" launch-remote-control
Re: browsers... Firefox will auto-enable the proxy mode stuff if you pass trustAllSSLCertificates now. Otherwise you need to use *firefoxproxy. IE requires the use of *iexploreproxy or a custom HTA launcher that configures the proxy (the article links to one we open-sourced, but it would need to be updated to work with 2.0 beta 4).
I've got mechanize setup and working with python. I am adding support for using a proxy, but how do I check that I am actually using the proxy?
Here is some code I am using:
ip = 'some proxy ip address'
br.set_proxies({"http://": ip} )
I started to wonder whether it was working because, just to do some testing, I typed in:
ip = 'asdfasdf'
and it didn't throw an error. So how do I check whether it is really using the proxy IP address I pass in, or my computer's own IP address? Is there a way to return info on your IP in mechanize?
Maybe like this?
br = mechanize.Browser()
br.set_proxies({"http": '127.0.0.1:80'})
You need to turn on debugging for more information:
br.set_debug_http(True)
br.set_debug_redirects(True)
I am not sure how to handle this issue with mechanize, but you could read the following link, which explains how to do it without mechanize (but still in Python):
Proxy Check in python
The simple solution provided at the above-mentioned link could be easily adapted to your needs.
Thus, instead of the line:
print "Connection error! (Check proxy)"
you could replace it with
SucceededYesNo="NO"
and instead of
print "All was fine"
just replace it with
SucceededYesNo="YES"
Now, you have a variable available for further processing.
I am, however, afraid this will not cover the case where the target web page is down, because the same error might occur for two different reasons (so one would not know whether a NO outcome comes from a non-working proxy server or from a bad web page). Still, it could be a solution: what about first checking a known-working web page, e.g. www.google.com, with the above-mentioned code? That way you eliminate one cause, and only the other remains.
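As a more direct way to answer the original question ("is mechanize really using my proxy?"), you can fetch an IP-echo service through mechanize and compare the address it reports with your own; httpbin.org/ip is used here purely as an example endpoint:
# Sketch: verify the proxy is in use by asking an IP-echo service which
# address it sees. httpbin.org/ip is just an example endpoint; any service
# that returns the caller's IP will do.
import mechanize

proxy_ip = "1.2.3.4:8080"   # placeholder proxy address

br = mechanize.Browser()
br.set_proxies({"http": proxy_ip, "https": proxy_ip})
br.set_handle_robots(False)

response = br.open("http://httpbin.org/ip")
print(response.read())      # should show the proxy's address, not yours
Note that set_proxies itself does not validate the value, which is why a bogus string like 'asdfasdf' does not raise immediately; the failure only shows up when a request is actually made.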