I am trying to scrape a website using requests in Python.
import requests
url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
# set the headers as if we were a browser
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
This is working fine when I use my personal wifi. However, when I connect to my company's VPN, I get the following error.
ConnectionError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/23013220/max-retries-exceeded-with-url (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
Now, I need this to work over my company's VPN because I need to access a website that is reachable only through it. How can I resolve this?
In my case, the problem was related to IPv6.
Our VPN used split tunneling, and it seems the VPN configuration does not support IPv6.
So for example this would hang forever:
requests.get('https://pokeapi.co/api/v2/pokemon')
But if you add a timeout, the request succeeds:
requests.get('https://pokeapi.co/api/v2/pokemon', timeout=1)
But not all machines had this problem, so I compared the output of the following on two different machines:
import socket
for line in socket.getaddrinfo('pokeapi.co', 443):
    print(line)
The working one only returned IPv4 addresses. The non-working machine returned both IPv4 and IPv6 addresses.
So my theory is that, with the timeout specified, Python fails quickly over IPv6 and then moves on to IPv4, where the request succeeds.
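To make the comparison between machines easier to read, the getaddrinfo output can be grouped by address family. This is a minimal sketch using only the standard library; the helper name addresses_by_family and its resolver parameter are illustrative and not part of the original answer.

```python
import socket


def addresses_by_family(host, port, resolver=socket.getaddrinfo):
    """Group the addresses resolved for host:port into IPv4 and IPv6 lists.

    `resolver` is a hypothetical hook (it defaults to socket.getaddrinfo) so
    the grouping logic can be exercised without touching the network.
    """
    grouped = {"IPv4": [], "IPv6": []}
    for family, _type, _proto, _canonname, sockaddr in resolver(host, port):
        address = sockaddr[0]  # the IP address comes first for both families
        if family == socket.AF_INET and address not in grouped["IPv4"]:
            grouped["IPv4"].append(address)
        elif family == socket.AF_INET6 and address not in grouped["IPv6"]:
            grouped["IPv6"].append(address)
    return grouped
```

Running addresses_by_family('pokeapi.co', 443) on each machine should show at a glance whether IPv6 addresses are being returned on the one that hangs.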
Ultimately we resolved this by disabling IPv6 on the machine:
networksetup -setv6off "Wi-Fi"
But I assume that this could instead be resolved through VPN configuration.
How about trying like this:
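import requests
from fake_useragent import UserAgent  # assumption: UserAgent is the fake-useragent package's class
url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
ua = UserAgent()
headers = {"User-Agent": ua.random}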
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
It seems to be caused by a difference in the User-Agent settings.
Try setting trust_env to None on the session:
s.trust_env = None
The requests documentation describes this attribute as: "Trust environment settings for proxy configuration, default authentication and similar."
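For context, requests only reads proxy settings from the environment when trust_env is truthy, so both False and None switch that lookup off. To see which proxy settings are visible on a given machine (for example, with and without the VPN connected), the standard library's urllib.request.getproxies() can be inspected. A small sketch; describe_proxies is a hypothetical helper name.

```python
from urllib.request import getproxies


def describe_proxies(proxies=None):
    """Return one line per proxy setting the standard library can detect."""
    if proxies is None:
        # Environment variables first, then OS-level settings where supported.
        proxies = getproxies()
    if not proxies:
        return ["no proxy settings detected"]
    return [f"{scheme}: {url}" for scheme, url in sorted(proxies.items())]
```

Printing the result of describe_proxies() before and after connecting to the VPN may show whether the company network expects traffic to go through a proxy.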
Or you can disable proxies for a particular domain:
import os
os.environ['NO_PROXY'] = 'stackoverflow.com'
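To check which hosts a given NO_PROXY value would cover, the standard library's matching rule can be called directly. This is a sketch that relies on urllib.request.proxy_bypass_environment, an undocumented helper; requests applies its own, similar matching, so treat the result as an approximation. The name is_bypassed is illustrative.

```python
from urllib.request import proxy_bypass_environment


def is_bypassed(host, no_proxy):
    """Report whether `host` matches a NO_PROXY-style, comma-separated value."""
    # Passing the mapping explicitly avoids reading or changing os.environ.
    return bool(proxy_bypass_environment(host, proxies={"no": no_proxy}))
```

For example, is_bypassed('meta.stackoverflow.com', 'stackoverflow.com') reports a match because subdomains are covered by the suffix rule.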
In my organization, I have to run my program under a VPN for different geographic locations, so we have multiple proxy configurations.
I found it simpler to use a package called PyPAC to get my proxy details automatically:
from pypac import PACSession
from requests.auth import HTTPProxyAuth
session = PACSession()
# when the username and password is required
# session = PACSession(proxy_auth=HTTPProxyAuth(name, password))
r = session.get('http://example.org')
How this works:
The package locates the PAC file configured by the organization. This file contains the proxy configuration details.
Related
I have a Flask web app running a Justdial scraper. In my code, I request multiple pages of the Justdial site, parse them with bs4 to extract the data, and write it to an Excel sheet. I use requests.Session() for the requests.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/78.0.3904.108 Safari/537.36"})
url=f"{entry}/page-{page_number}"
session.verify = False
r = session.get(url).text
Then this "r" is passed into the bs4 module and the extraction process takes place.
When I run this code on localhost, the program works fine: the data is extracted and the values are stored in the Excel file. But when I host it as a web app on Heroku and run the same process, I do not get the desired output. No errors are raised in the try/except block, and the Excel file comes out empty.
I tried using urllib, requests.get() and also requests.get(url, verify=False), but the same problem exists.
This warning pops up when I run the program on localhost:
/home/disciple/.local/lib/python3.8/site-packages/urllib3/connectionpool.py:846: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
warnings.warn((
I am currently trying to scrape a website using a proxy. I chose Luminati as my proxy provider.
I created a zone with a Data Center using Shared IPs.
Then I installed Luminati Proxy Manager on my local machine, set up a proxy port using default config.
import requests
ip_json = requests.get('http://lumtest.com/myip.json', proxies={"http":"http://localhost:24000/",
"https":"http://localhost:24000/"}).json()
proxy = "https://" + ip_json['ip'] + ':' + str(ip_json['asn']['asnum'])
proxies ={"https": proxy , "http": proxy }
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get('http://google.com/', headers=headers, proxies=proxies)
However, each time I get
ProxyError: HTTPSConnectionPool(host='x.x.x.x', port=x): Max retries exceeded with url: http://google.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x11d3b5358>: Failed to establish a new connection: [Errno 61] Connection refused',)))
I tried using URLs with both http and https, but nothing changed. I also tried giving the proxy address an https scheme, but still nothing.
Has anybody encountered this error and resolved it? I would really appreciate your help.
Thank you.
First, Luminati blocks Google by default and only allows specific use cases with the SERP zone.
Second, is Google the only domain you target? Try lumtest.io/myip.json
The Proxy Manager should show you error codes in the logs. Are there any clues there?
Try contacting Luminati live chat support from the control panel, you may have limitations on your account.
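One further reading of the "Connection refused" error is that the second request goes to a proxy address assembled from the exit IP and ASN number returned by lumtest.com, rather than to the local Proxy Manager port. A minimal sketch, assuming the Proxy Manager is still listening on localhost:24000 as in the question; the helper name local_proxies is hypothetical.

```python
def local_proxies(host="localhost", port=24000):
    """Build a requests-style proxies mapping that points at a local proxy port."""
    proxy_url = f"http://{host}:{port}"
    # As in the question's first request, one HTTP endpoint serves both schemes.
    return {"http": proxy_url, "https": proxy_url}
```

The result could then be passed as proxies=local_proxies() to requests.get() in place of the proxies dict built from the JSON response.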
I'm trying to scrape with BeautifulSoup, using Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', {'class': 'info'})
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean
Googling for "errno 10060" yields quite a few results. Basically, it's a low-level network error (it's not HTTP-specific; you can have the same issue with any kind of network connection), whose canonical description is:
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
IOW, your system failed to connect to the host. This might come from a lot of reasons, either temporary (like your internet connection is down) or not (like a proxy - if you are behind a proxy - blocking access to this host, etc), or quite simply (as is the case here) the host blocking your requests.
The first thing to do when you have such an error is to check your internet connection, then try to get the url in your browser. If you can get it in your browser then it's most often the host blocking you, most often based on your client's "user-agent" header (the client here is requests), and specifying a "standard" user-agent header as explained in newbie's answer should solve the problem (and it does in this case, or at least it did for me).
NB : to set the user agent:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
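To separate a network-level failure from a host that rejects the client, it can help to try a bare TCP connection first. This is a minimal standard-library sketch; the helper name tcp_check and its return strings are illustrative only.

```python
import socket


def tcp_check(host, port, timeout=5.0):
    """Try a plain TCP connection and summarise the outcome as a short string."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except (socket.timeout, TimeoutError):
        return "timeout"  # no answer in time, as with WSAETIMEDOUT
    except ConnectionRefusedError:
        return "refused"  # the target actively rejected the connection
    except OSError as exc:
        return f"error: {exc}"  # DNS failures and other socket errors
```

If tcp_check('www.accuweather.com', 443) reports "open" while requests still fails, blocking based on the User-Agent header becomes the more likely explanation.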
The problem does not come from the code, but from the website.
If you add a User-Agent field to the request headers, the request will look like it comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data=requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
I am using the python shell to test requests together with proxy servers.
After reading documentation (http://docs.python-requests.org/en/master/user/advanced/) and a few stackoverflow threads I am doing the following:
import requests
s = requests.session()
proxies = {'http': 'http://90.178.216.202:3128'}
s.proxies.update(proxies)
req = s.get('http://jsonip.com')
After this, if I print req.text, I get this:
u'{"ip":"my current IP (not the proxy server IP I have inserted before)","about":"/about", ......}'
Can you please explain why I'm getting my computer's IP address and not the proxy server's IP address?
Did I go wrong somewhere or am I expecting the wrong thing to happen here?
I am new to requests + proxy servers so I would like to make sure I am understanding this.
UPDATE
I also have this in my code:
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.0; WOW64; rv:24.0) Gecko/20100101 Firefox/24.0'}
s.headers.update(headers)
Thanks
Vittorio
The site ( http://jsonip.com ) broadcasts an 'Upgrade-Insecure-Requests' header. This means that your request gets redirected to https://jsonip.com, so requests doesn't use a proxy because you don't have an https proxy in your proxies dict.
So, all you have to do is add an https proxy to proxies, e.g.:
proxies = {'http':'http://90.178.216.202:3128', 'https':'https://90.178.216.202:3128'}
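The reason the scheme matters is that the proxy is looked up by the scheme of the URL actually being fetched. The sketch below is a simplified standard-library stand-in for that lookup (requests ships its own version as requests.utils.select_proxy); pick_proxy is an illustrative name, not a requests function.

```python
from urllib.parse import urlparse


def pick_proxy(url, proxies):
    """Return the proxy URL that would apply to `url`, or None if none matches."""
    parts = urlparse(url)
    candidates = [
        f"{parts.scheme}://{parts.hostname}",  # most specific: scheme plus host
        parts.scheme,                          # e.g. "http" or "https"
        f"all://{parts.hostname}",
        "all",
    ]
    for key in candidates:
        if key in proxies:
            return proxies[key]
    return None
```

With only an 'http' entry in the dict, pick_proxy('https://jsonip.com', proxies) returns None, which matches the behaviour described above: the https request goes out directly.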
Instead of doing this, pass a User-Agent header:
requests.post(url='http://abc.com', headers={'User-Agent': 'Mozilla/5.0'})
You need to change your get request so that the proxies are used.
Something like this:
req = s.get('http://jsonip.com', proxies=proxies)
I'm trying to determine high anonymity proxies. Also called private/elite proxies. From a forum I've read this:
High anonymity Servers don't send HTTP_X_FORWARDED_FOR, HTTP_VIA and
HTTP_PROXY_CONNECTION variables. Host doesn't even know you are using
proxy server and of course it doesn't know your IP address.
A highly anonymous proxy will display the following information:
REMOTE_ADDR = Proxy's IP address
HTTP_VIA = blank
HTTP_X_FORWARDED_FOR = blank
So, how can I check for these headers in Python, so that I can rule a proxy out as a high-anonymity (HA) proxy? I have tried to retrieve the headers for 20-30 proxies using the requests package, and also with urllib, the built-in http.client, and urllib2. But I never saw these headers, so I must be doing something wrong...
This is the code I've used to test with requests:
import requests
proxies = {'http': 'http://176.100.108.214:3128'}
header = {'user-agent':'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.360',}
s = requests.session()
s.proxies = proxies
r = s.get('http://www.python.org', headers=header)
print(r.status_code)
print(r.request.headers)
print(r.headers)
It sounds like the forum post you're referring to is talking about the headers seen by the server on your proxied request, not the headers seen by the client on the proxied response.
Since you're testing with www.python.org as the server, the only way to see the headers it receives would be to have access to their logs. Which you don't.
But there's a simple solution: run your own HTTP server, make requests against that, and then you can see what it receives. (If you're behind a firewall or NAT that the proxy you're testing won't be able to connect to, you may have to get a free hosted server somewhere; if not, you can just run it on your machine.)
If you have no idea how to set up and configure a web server, Python comes with one of its own. Just run this script with Python 3.2+ (on your own machine, or an Amazon EC2 free instance, or whatever):
from http.server import HTTPServer, SimpleHTTPRequestHandler
class HeaderDumper(SimpleHTTPRequestHandler):
    def do_GET(self):
        try:
            return super().do_GET()
        finally:
            print(self.headers)
server = HTTPServer(("", 8123), HeaderDumper)
server.serve_forever()
Then run that script with python3 in the shell.
Then just run your client script, with http://my.host.ip:8123 instead of http://www.python.org, and look at what the script dumps to the server's shell.
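Once the server prints the headers it received, the check from the forum post can be automated. The CGI-style names HTTP_VIA, HTTP_X_FORWARDED_FOR and HTTP_PROXY_CONNECTION correspond to the request headers Via, X-Forwarded-For and Proxy-Connection. A minimal sketch; revealing_headers is a hypothetical helper, not part of the answer above.

```python
REVEALING_HEADERS = ("Via", "X-Forwarded-For", "Proxy-Connection")


def revealing_headers(headers):
    """Return the proxy-revealing headers present in a mapping of request headers."""
    # Header names are case-insensitive, so compare on lowercased keys.
    present = {name.lower(): value for name, value in headers.items()}
    return {
        name: present[name.lower()]
        for name in REVEALING_HEADERS
        if present.get(name.lower())
    }
```

Inside do_GET, calling revealing_headers(self.headers) should return an empty dict for a proxy that behaves as high-anonymity in the sense quoted in the question.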