python mechanize proxy question

I've got mechanize set up and working with Python. I am adding support for using a proxy, but how do I check that I am actually using the proxy?
Here is some code I am using:
ip = 'some proxy ip address'
br.set_proxies({"http://": ip} )
I started to wonder whether it was working at all, because as a test I typed in:
ip = 'asdfasdf'
and it didn't throw an error. So how do I check whether requests actually go out through the proxy IP address I pass in, rather than my computer's own IP? Is there a way to return information about your IP in mechanize?

Maybe like this?
br = mechanize.Browser()
br.set_proxies({"http": '127.0.0.1:80'})
You need to enable debugging for more information:
br.set_debug_http(True)
br.set_debug_redirects(True)
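To verify which address the remote end actually sees, you can also fetch a page that echoes the caller's IP through the proxied browser. A minimal sketch (httpbin.org/ip is just one example of such an echo service, not something specific to mechanize):

import mechanize

br = mechanize.Browser()
br.set_proxies({"http": "127.0.0.1:80"})  # your proxy address here
br.set_debug_http(True)
br.set_debug_redirects(True)

response = br.open("http://httpbin.org/ip")
print(response.read())  # should show the proxy's IP, not your machine's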

I am not sure how to handle this with mechanize, but you could read the following link, which explains how to do it without mechanize (but still in Python):
Proxy Check in python
The simple solution provided at the above-mentioned link could be easily adapted to your needs.
Thus, instead of the line:
print "Connection error! (Check proxy)"
you could write
SucceededYesNo="NO"
and instead of
print "All was fine"
you could write
SucceededYesNo="YES"
Now, you have a variable available for further processing.
I am, however, afraid this will not cover the case where the target web page itself is down, because the same error can arise from either cause (so a NO outcome would not tell you whether the proxy is broken or the page is bad). Still, it could be a solution: check a page that is known to work, e.g. www.google.com, with the code above. That eliminates one cause and leaves the other.
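For example, the adapted check could end up looking roughly like this (a sketch only; the linked code is not reproduced here, and the proxy address is a placeholder):

import urllib2

proxy_handler = urllib2.ProxyHandler({'http': 'http://some-proxy-ip:8080'})  # placeholder proxy
opener = urllib2.build_opener(proxy_handler)

try:
    opener.open('http://www.google.com', timeout=10)  # a page known to be up
    SucceededYesNo = "YES"
except Exception:
    SucceededYesNo = "NO"  # proxy not working (or the page is down)

print(SucceededYesNo)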


Can't bypass cloudflare with python cloudscraper

I ran into a Cloudflare issue when I tried to scrape a website.
I have this code:
import cloudscraper
url = "https://author.today"
scraper = cloudscraper.create_scraper()
print(scraper.post(url).status_code)
Running this code gives me:
cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version.
I searched for a workaround but couldn't find a solution. If you visit the website in a browser, you see:
Checking your browser before accessing author.today.
Is there any way to bypass Cloudflare in my case?
Install httpx:
pip3 install httpx[http2]
Define an HTTP/2 client:
import httpx
client = httpx.Client(http2=True)
Make the request:
response = client.get("https://author.today")
print(response.status_code)
Cheers!
Although it does not seem to work for this site, sometimes adding some parameters when initializing the scraper helps:
import cloudscraper
url = "https://author.today"
scraper = cloudscraper.create_scraper(
    browser={
        'browser': 'chrome',
        'platform': 'android',
        'desktop': False
    }
)
print(scraper.post(url).status_code)
Another option is cfscrape with a randomized User-Agent header:
import cfscrape
from fake_useragent import UserAgent

ua = UserAgent()
s = cfscrape.create_scraper()
k = s.post("https://author.today", headers={"User-Agent": f"{ua.random}"})
print(k)  # prints the Response object, e.g. <Response [200]>
I'd try to create a Playwright scraper that mimics a real user; this works for me most of the time, you just need to find the right settings (they can vary from website to website).
Otherwise, if the website has a native App, try to figure out how the App behaves and then mimic it.
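A minimal sketch of that approach (the settings shown are only a starting point and usually need tuning per site; the URL is the one from the question):

from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch(headless=False)  # a visible browser is harder to fingerprint
    page = browser.new_page()
    page.goto("https://author.today")
    page.wait_for_timeout(5000)  # give any challenge time to complete
    print(page.content()[:500])
    browser.close()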
I can suggest the following workflow to "try" to avoid Cloudflare's WAF/bot mitigation:
don't cycle user agents, proxies or weird tunnels while browsing
don't use fixed IP addresses; lines like xDSL, home links and 4G/LTE are better
try to appear as a mobile device instead of a desktop/tablet
try to reproduce realistic pointer movements, i.e. record your mouse moves and replay them 1:1 while scraping (yes, you need JS enabled and a headless browser able to pass itself off as a "common" one)
don't cycle against different Cloudflare-protected entities, otherwise the originating IP will be greylisted in a minute (i.e. build your own blacklist of targets and never touch those entities, or you will land on the CF blacklist)
try to reproduce real-life navigation in all aspects, including errors, waits and more
check the IP you used against popular blacklists after every scrape, otherwise nasty errors will soon appear (CrowdSec is a good starting point)
the usual scrape pretends to be Googlebot; a single regex WAF rule on Cloudflare will block 99.99% of those attempts, so avoid faking Google and try to be LESS evil instead (e.g. ask webmasters for APIs or a data export, if any)
Source: I have used Cloudflare with hundreds of domains and thousands of records (Enterprise) since the beginning of the company.
That way you will be closer to the point (and you will help them increase the overall security).
I used this line:
scraper = cloudscraper.create_scraper(browser={'browser': 'chrome','platform': 'windows','mobile': False})
and then used the httpx package after that:
with httpx.Client() as s:
    # ... remaining code ...
And I was able to bypass the cloudscraper.exceptions.CloudflareChallengeError: Detected a Cloudflare version 2 challenge, This feature is not available in the opensource (free) version. issue.
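One possible reading of that combination (an assumption on my part, since the remaining code is not shown): create the scraper only to borrow its browser-like headers, then do the actual requests with httpx.

import cloudscraper
import httpx

scraper = cloudscraper.create_scraper(
    browser={'browser': 'chrome', 'platform': 'windows', 'mobile': False}
)

# Hypothetical glue code: reuse the scraper's browser-like headers in an httpx client.
with httpx.Client(headers=dict(scraper.headers)) as s:
    r = s.get("https://author.today")
    print(r.status_code)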

make any internet-accessing python code work (proxy + custom .crt)

The situation
Unless the following is done, all outgoing HTTP or HTTPS requests made with Python end in a WinError 10054 connection reset or an SSL bad handshake error:
set the HTTP_PROXY and HTTPS_PROXY environment variables, or their counterparts
anything that needs SSL verification must be verified against a custom .crt file
For example, assuming the .crt file is in place, the first two requests below get me a 200 OK:
import os
import requests

os.environ['HTTP_PROXY'] = 'some_appropriate_address'   # placeholder
os.environ['HTTPS_PROXY'] = 'some_appropriate_address'  # placeholder
requests.get('http://www.google.com', verify=r"C:\the_file.crt")  # 200 OK
requests.get('http://httpbin.org', verify=False)  # 200 OK, but unsafe
requests.get('http://httpbin.org')  # SSL bad handshake error
The Problem
There is this massive jumble of pre-written code I have (heavily utilizing urllib3 and requests, and possibly other internet-accessing libraries), and I have to make it work under the conditions outlined above.
Sure, I could write verify=r'C:\the_file.crt' for every requests.get(), but that quickly gets hairy, and the code may also be using libraries other than requests. So I am looking for a global setting (an environment variable, etc.) I can alter so that everything works (a 200 OK upon a GET request to a server, whether or not the code is written with requests).
Also, if there is no such way, I would like an explanation as to why.
What I tried (am trying)
Maybe editing the .condarc file (via conda config) is a solution. I tried, to no avail: Python gives me an "SSL verification failed" error, even though the code snippet above gave me a 200 OK. To my knowledge, this does not fit neatly with many of the situations previously discussed on Stack Overflow.
By the way, setting ssl_verify to false does not solve the problem either; I still get a bad handshake error for some reason.
I am using Win 10, Python 3.7.4 (Anaconda).
Update
I have edited the question to prevent future misunderstandings about the content of this question. A few answers below are a reiteration of what was written here from the start.
The current answers are not entirely satisfactory either, as they only seem to address the case where I am using requests or urllib3.
You should be able to get any Python code that uses the requests module (which builds on urllib3) to work behind a proxy, without modifying the code itself, by setting the following environment variables in Windows.
http_proxy http://[<user>:<pwd>@]<http_host>:<http_port>
https_proxy http://[<user>:<pwd>@]<https_host>:<https_port>
requests_ca_bundle <path_to_ca_bundle.crt>
curl_ca_bundle <path_to_ca_bundle.crt>
You can set environment variables by doing the following:
Press Windows-Key + R, enter sysdm.cpl ,3 (mind the space before the comma) and press Enter
Click the Environment variables button
In either of the fields (User variables or System variables), add the four variables
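If you prefer to set them per process rather than system-wide, the same thing can be done from Python before any request is made. A sketch (the proxy address and certificate path are placeholders):

import os
import requests

os.environ['HTTP_PROXY'] = 'http://user:pwd@proxyhost:8080'   # placeholder address
os.environ['HTTPS_PROXY'] = 'http://user:pwd@proxyhost:8080'  # placeholder address
os.environ['REQUESTS_CA_BUNDLE'] = r'C:\path_to_ca_bundle.crt'
os.environ['CURL_CA_BUNDLE'] = r'C:\path_to_ca_bundle.crt'

# requests (and code built on it) now picks these up without per-call
# proxies= or verify= arguments.
print(requests.get('https://www.google.com').status_code)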
According to the Requests documentation:
https://requests.readthedocs.io/en/master/user/advanced/#proxies
you can use a proxy in this way:
proxies = {
    'http': 'http://10.10.1.10:3128',
    'https': 'http://10.10.1.10:1080',
}
requests.get('http://example.org', proxies=proxies)
Then, depending on whether you want to use a .crt or a .pem:
requests.get('https://kennethreitz.com', cert=('/path/server.crt', '/path/key'))
requests.get('https://kennethreitz.org', cert='/path/client.pem')
https://2.python-requests.org//en/v1.0.4/user/advanced/
You are trying to make HTTPS requests to an external URL, and you need to provide the proper certificate files for verification. You are trying to make these configurations inside each component, but I would suggest making them globally and system-wide, so that none of the components needs to provide certificates or deal with SSL verification itself.
I am awful at Windows-related networking configuration, but I would suggest you look at Proxifier; I am fairly sure you can configure an SSL proxy there with the proper certificates.

Python web scraping : urllib.error.URLError: urlopen error [Errno 11001] getaddrinfo failed

This is the first time I am trying to use Python for Web scraping. I have to extract some information from a website. I work in an institution, so I am using a proxy for Internet access.
I have used the code below, which works fine with URLs such as https://www.google.co.in or https://www.pythonprogramming.net.
But when I use this URL: http://www.genecards.org/cgi-bin/carddisp.pl?gene=APOA1, which I need for scraping data, it shows:
urllib.error.URLError: <urlopen error [Errno 11001] getaddrinfo failed>
Here is my code.
import urllib.request as req
proxy = req.ProxyHandler({'http': r'http://username:password@url:3128'})
auth = req.HTTPBasicAuthHandler()
opener = req.build_opener(proxy, auth, req.HTTPHandler)
req.install_opener(opener)
conn = req.urlopen('https://www.google.co.in')
return_str = conn.read()
print(return_str)
Please help me understand what the issue is here.
Also, while searching for the above error, I read something about absolute URLs. Is that related?
The problem is that your proxy server, and your own host, seem to use two different DNS resolvers, or two resolvers updated at different instants in time.
So when you pass www.genecards.org, the proxy does not know that address, and the attempt to get address information (getaddrinfo) fails. Hence the error.
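A quick way to confirm this from the Python side is to ask your own resolver directly (a minimal sketch):

import socket

try:
    print(socket.getaddrinfo('www.genecards.org', 80))  # your own resolver knows the name
except socket.gaierror as e:
    print('local resolution failed too:', e)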
The problem is quite a bit more awkward than that, though. GeneCards.org is an alias for an Incapsula DNS host:
$ host www.genecards.org
www.genecards.org is an alias for 6hevx.x.incapdns.net.
And that machine is itself a proxy, hiding the real GeneCards site behind (so you might use http://192.230.83.165/ as an address, and it would never work).
This kind of merry-go-round is used by those sites that, among other things - how shall I put it - take a dim view of being scraped:
So yes, you could try several things to make scraping work. Chances are that they will only work for a short time, before being shut down harder and harder. So in the best scenario, you would be forced to continuously update your scraping code. Which can, and will, break down whenever it's most inconvenient to you.
This is no accident: it is intentional on GeneCards' part, and clearly covered in their terms of service:
Misuse of the Services
7.2 LifeMap may restrict, suspend or terminate the account of any Registered Users who abuses or misuses the GeneCards Suite Products. Misuse of the GeneCards Suite Products includes scraping, spidering and/or crawling GeneCards Suite Products; creating multiple or false profiles...
I suggest you take a different approach - try enquiring about a consultation license. Scraping a web site that does not care (or is unable, or hasn't yet come around) to providing its information in an easier format is one thing - stealing that information is quite another.
Also, note that you're connecting to a Squid proxy that in all probability is logging the username you're using. Any scraping made through that proxy would immediately be traced back to that user, in the event that LifeMap files a complaint for unauthorized scraping.
Try to ping url:3128 from your terminal. Does it respond? The problem seems to be related to security on the server side.

Inexplicable urllib2 problem between virtualenvs

I have some test code (as a part of a webapp) that uses urllib2 to perform an operation I would usually perform via a browser:
Log in to a remote website
Move to another page
Perform a POST by filling in a form
I've created 4 separate, clean virtualenvs (with --no-site-packages) on 3 different machines, all with different versions of Python but the exact same packages (via a pip requirements file), and the code only works in the two virtualenvs on my local development machine (2.6.1 and 2.7.2); it won't work on either of my production VPSs.
In the failing cases, I can log in successfully, move to the correct page but when I submit the form, the remote server replies telling me that there has been an error - it's an application server error page ('we couldn't complete your request') and not a webserver error.
because I can successfully log in and maneuver to a second page, this doesn't seem to be a session or a cookie problem - it's particular to the final POST
because I can perform the operation on a particular machine with the EXACT same headers and data, this doesn't seem to be a problem with what I am requesting/posting
because I am trying the code on two separate VPS rented from different companies, this doesn't seem to be a problem with the VPS physical environment
because the code works on 2 different python versions, I can't imagine it being an incompatibility problem
I'm completely lost at this stage as to why this wouldn't work. I've even 'turned-it-off-and-turn-it-on-again' because I just can't see what the problem could be.
I think it has to be something to do with the final POST coming from a VPS that the remote server doesn't like, but I can't figure out what that could be. I feel like there is something going on under the hood of urllib2 that is causing the remote server to dislike the request.
EDIT
I've installed the exact same Python version (2.6.1) on the VPS as on my working local copy and it doesn't work remotely, so it must be something to do with originating from a VPS. How could this affect the HTTP request? Is it something lower level?
You might try setting the debuglevel=1 for urllib2 and see what it comes up with:
import urllib2
h=urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(h)
...
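With the debugging handler installed globally, every subsequent request prints its raw headers, which should show whether the final POST differs between the working and failing environments. A short usage sketch (the URL is a placeholder):

import urllib2

handler = urllib2.HTTPHandler(debuglevel=1)
opener = urllib2.build_opener(handler)
urllib2.install_opener(opener)  # make every urllib2.urlopen() use it
urllib2.urlopen('http://example.com/the-form-page')  # request/response headers are printed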
This is a total shot in the dark, but are your VPSs 64-bit and your home computer 32-bit, or vice versa? Maybe a difference in default sizes or accuracies of something could be freaking out the server.
Barring that, can you try to find out any information on the software stack the web server is using?
I had similar issues with urllib2 (working with Zimbra's REST API); in the end I switched to pycurl with success.
PS
for operations of login/navigate/post, I usually find Mechanize useful and easier to use. Maybe you can give it a shot.
Well, it looks like I know what made the problem go away, but I'm not 100% sure of the reason for it.
I simply had to make the server wait (time.sleep()) after it sent the 2nd request (Move to another page) before doing the 3rd request (Perform a POST by filling in a form).
I don't know if it's because of a condition on the 3rd-party server, or some sort of odd issue with urllib2. The reason it seemed to work on my development machine is presumably that it was slower than the server at running the code?
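In code, the change amounted to something like the following sketch (the URLs, form data and sleep duration here are placeholders, not the original values):

import time
import urllib2

opener = urllib2.build_opener(urllib2.HTTPCookieProcessor())

opener.open('http://example.com/login', 'username=me&password=secret')  # 1. log in
opener.open('http://example.com/account')                               # 2. move to another page

time.sleep(2)  # give the remote application server a moment before the final POST

opener.open('http://example.com/form', 'field=value')                   # 3. perform the POST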

How can I use TOR as a proxy?

I'm trying to use Tor as a generic proxy but it fails.
Right now I'm trying with Python, but I'm pretty sure it would be the same with any other language. I can connect to other proxies with Python, so I get how it "should" be done.
I found a list of Tor entry nodes and tried:
import httplib

h = httplib.HTTPConnection("one entry node", 80)
h.connect()
h.request("GET", "www.google.com")
resp = h.getresponse()
page = resp.read()
Unfortunately that doesn't work; I get redirected to a 404 page.
I'm just not sure what I'm doing wrong. Probably the entry nodes cannot be connected to just like that. I'm searching for how to do this properly, but I can't find any documentation on how to program applications against Tor.
Edit:
Ditch the Tor entry-node list; I don't know why I thought I needed it.
The "entry node" is your own machine, once you've installed the (Windows) Vidalia client and Privoxy (all bundled as one):
httplib.HTTPConnection("one entry node", 80)
becomes
httplib.HTTPConnection("127.0.0.1", 8118)
and voilà, everything is routed through Tor.
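Putting it together, the whole request then looks something like this (a sketch, assuming the bundled Privoxy is listening on its default port 8118):

import httplib

h = httplib.HTTPConnection("127.0.0.1", 8118)  # local Privoxy, which forwards to Tor
h.connect()
h.request("GET", "http://www.google.com")  # note the full URL, including the scheme
resp = h.getresponse()
page = resp.read()
h.close()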
First, make sure you are using the correct node location and port. Most proxies use ports other than 80. Second, specify the protocol by using a complete URL in your request string.
Under normal circumstances, your code should work if it looks something like this:
import httplib

h = httplib.HTTPConnection("138.45.68.134", 8080)
h.connect()
h.request("GET", "http://www.google.com")
resp = h.getresponse()
page = resp.read()
h.close()
You can also use socket as an alternative but that's another issue and it's even more complicated than the one above.
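For completeness, the socket route would look roughly like this (a sketch, assuming the third-party PySocks package and Tor's default SOCKS port 9050 on localhost):

import socket
import socks  # pip install PySocks
import urllib2

socks.set_default_proxy(socks.SOCKS5, "127.0.0.1", 9050)
socket.socket = socks.socksocket  # route every new socket through Tor

print(urllib2.urlopen("http://www.google.com").read()[:200])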
Hope that helps! :-)
