I am currently trying to scrape a website through a proxy, and I chose Luminati as the proxy provider.
I created a Data Center zone using shared IPs.
Then I installed the Luminati Proxy Manager on my local machine and set up a proxy port using the default config.
import requests

# Ask the local Proxy Manager (default port 24000) which exit IP it uses
ip_json = requests.get('http://lumtest.com/myip.json',
                       proxies={"http": "http://localhost:24000/",
                                "https": "http://localhost:24000/"}).json()

# Build a proxy address from the reported exit IP; note that this uses the
# ASN number as the port
proxy = "https://" + ip_json['ip'] + ':' + str(ip_json['asn']['asnum'])
proxies = {"https": proxy, "http": proxy}

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
response = requests.get('http://google.com/', headers=headers, proxies=proxies)
However, each time I get
ProxyError: HTTPSConnectionPool(host='x.x.x.x', port=x): Max retries exceeded with url: http://google.com/ (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x11d3b5358>: Failed to establish a new connection: [Errno 61] Connection refused',)))
I tried URLs with both http and https, but nothing changed. I also tried specifying the proxy address with https, still nothing.
Has anybody encountered this error and resolved it? I would really appreciate your help.
Thank you.
First, Luminati blocks Google by default and only allows specific use cases with the SERP zone.
Second, is Google the only domain you target? Try lumtest.com/myip.json.
The Proxy Manager should show you error codes in its logs; are there any clues there?
Try contacting Luminati's live chat support from the control panel; you may have limitations on your account.
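As a quick sanity check, you can also route the real request through the Proxy Manager port itself instead of through the exit IP it reports. A minimal sketch, assuming the default port 24000 from the question:

import requests

# Route the request through the local Proxy Manager port (default 24000),
# not through the exit IP/ASN that lumtest reports.
proxies = {"http": "http://localhost:24000", "https": "http://localhost:24000"}
print(requests.get('http://lumtest.com/myip.json', proxies=proxies, timeout=10).json())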
I tested three Python versions: 3.10.2, 3.9.9, and 3.8.10, on different machines and even one online compiler. In all of them I did the following:
import requests
requests.get(url, proxies=proxies, headers=headers)
Testing in each parameter:
url:
"https://www.icanhazip.com"
"http://www.icanhazip.com"
proxies:
{"https": "223.241.0.250:3000", "http": "223.241.0.250:3000"}
{"https": "223.241.0.250:3000", "http": "223.241.0.250:3000"}
headers:
{'User-Agent': 'Chrome'}
{"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36"}
In all of them except 3.10.2 I got this error:
ProxyError: HTTPSConnectionPool(host='www.icanhazip.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x05694F40>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond')))
On 3.10.2 I got:
InvalidURL: Proxy URL had no scheme, should start with http:// or https://
But when I tried to pass the proxies like this:
{"https": "223.241.0.250:3000", "http": "223.241.0.250:3000"}
it didn't work; it just showed my normal IP.
What am I missing? A normal request works just fine, but when I add the proxies it just doesn't. This code was working fine a while back and now it outputs this error, and I can't figure out why.
Try adding a scheme to the proxy address:
{"https": "http://223.241.0.250:3000", "http": "http://223.241.0.250:3000"}
{"https": "http://223.241.0.250:3000", "http": "http://223.241.0.250:3000"}
I have deployed an AWS EC2 instance to use as a proxy. I have edited the security policies and allowed my machine to access the server. I am using port 22 for SSH and port 4444 for the proxy. For some reason, I still cannot start a session using the proxy.
The code:
import requests
session = requests.Session()
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'
headers = {'user-agent' : user_agent}
proxies = {
    'http': 'socks5h://ec2-ip-address-here.us-east-2.compute.amazonaws.com:4444',
    'https': 'socks5h://ec2-ip-address-here.us-east-2.compute.amazonaws.com:4444',
}
print(session.get('https://www.ipchicken.com/', headers=headers, proxies=proxies).content)
The error:
requests.exceptions.ConnectionError: SOCKSHTTPSConnectionPool(host='www.ipchicken.com', port=443): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.contrib.socks.SOCKSHTTPSConnection object at 0x107a09048>: Failed to establish a new connection: [Errno 61] Connection refused'))
I'm not sure what I am doing wrong. I followed this video https://www.youtube.com/watch?v=HOL2eg0g0Ng for setting up the server. Thanks to all of those who reply in advance.
You need to be using socks5h:// for your http and https proxies.
I get this error on macOS when using socks5://.
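Two other things worth checking, hedged since they depend on how the tunnel from that video was set up: requests only speaks SOCKS with the PySocks extra installed, and if the proxy is an SSH dynamic tunnel created with ssh -D, it listens on your local machine rather than on the EC2 host:

# Sketch assuming the proxy is an SSH dynamic tunnel started from your own
# machine, e.g.:
#   ssh -i key.pem -N -D 4444 ec2-user@ec2-ip-address-here.us-east-2.compute.amazonaws.com
# ssh -D binds on localhost by default, so requests must point there, not at
# the EC2 hostname. SOCKS support also needs: pip install "requests[socks]"
import requests

proxies = {
    'http': 'socks5h://localhost:4444',
    'https': 'socks5h://localhost:4444',
}
print(requests.get('https://www.ipchicken.com/', proxies=proxies, timeout=10).status_code)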
I am trying to scrape a website using requests in Python.
url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
# set the headers like we are a browser,
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36'}
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
This is working fine when I use my personal wifi. However, when I connect to my company's VPN, I get the following error.
ConnectionError: HTTPSConnectionPool(host='stackoverflow.com', port=443): Max retries exceeded with url: /questions/23013220/max-retries-exceeded-with-url (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x...>: Failed to establish a new connection: [WinError 10061] No connection could be made because the target machine actively refused it',))
Now, I need this to work over my company's VPN, because the website I need to access only works there. How do I resolve this?
In my case, the problem was related to IPv6.
Our VPN used split tunneling, and it seems the VPN configuration does not support IPv6.
So for example this would hang forever:
requests.get('https://pokeapi.co/api/v2/pokemon')
But if you add a timeout, the request succeeds:
requests.get('https://pokeapi.co/api/v2/pokemon', timeout=1)
But not all machines had this problem, so I compared the output of the following on two different machines:
import socket
for line in socket.getaddrinfo('pokeapi.co', 443):
    print(line)
The working one only returned IPv4 addresses. The non-working machine returned both IPv4 and IPv6 addresses.
So with the timeout specified, my theory is that Python fails quickly on IPv6 and then moves to IPv4, where the request succeeds.
Ultimately we resolved this by disabling IPv6 on the machine:
networksetup -setv6off "Wi-Fi"
But I assume that this could instead be resolved through VPN configuration.
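If you cannot disable IPv6 system-wide, one way to test the IPv6 theory from Python alone is to force urllib3 (which requests uses under the hood) to resolve only IPv4 addresses. This is a diagnostic sketch that monkeypatches an internal urllib3 helper, not a supported API:

import socket

import requests
import urllib3.util.connection as urllib3_conn

# allowed_gai_family() decides which address families urllib3 may ask
# getaddrinfo for; forcing AF_INET skips IPv6 entirely.
urllib3_conn.allowed_gai_family = lambda: socket.AF_INET

print(requests.get('https://pokeapi.co/api/v2/pokemon', timeout=5).status_code)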
How about trying it like this:
# UserAgent comes from the fake_useragent package
from fake_useragent import UserAgent
import requests

url = "https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url"
ua = UserAgent()
headers = {"User-Agent": ua.random}
# download the homepage
s = requests.Session()
s.trust_env = False
response = s.get(url, headers=headers)
It seems to be caused by a difference in the UserAgent() settings.
Try setting trust_env to None:
trust_env = None  # trust environment settings for proxy configuration, default authentication and similar
Or you can disable proxies for a particular domain:
import os
os.environ['NO_PROXY'] = 'stackoverflow.com'
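Put together, a minimal sketch of the NO_PROXY approach (the variable must be set before the request is made):

import os
import requests

# Requests honours the NO_PROXY environment variable (when trust_env is on),
# so this bypasses any configured proxy for stackoverflow.com only.
os.environ['NO_PROXY'] = 'stackoverflow.com'

response = requests.get('https://stackoverflow.com/questions/23013220/max-retries-exceeded-with-url')
print(response.status_code)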
In my organization, I have to run my program under a VPN for different geo locations, so we have multiple proxy configurations.
I found it simpler to use a package called PyPAC to get my proxy details automatically:
from pypac import PACSession
from requests.auth import HTTPProxyAuth
session = PACSession()
# when the username and password is required
# session = PACSession(proxy_auth=HTTPProxyAuth(name, password))
r = session.get('http://example.org')
How does this work: the package locates the PAC file configured by the organization; this file consists of the proxy configuration details.
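For example, you can inspect what the discovered PAC file resolves for a given URL. This sketch assumes your machine is actually on a network that publishes a PAC file; otherwise get_pac() returns None:

from pypac import get_pac

# get_pac() discovers the PAC file via the OS/WPAD settings.
pac = get_pac()
if pac:
    print(pac.find_proxy_for_url('http://example.org', 'example.org'))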
I'm trying to search a page using BeautifulSoup with Anaconda for Python 3.6.
I am trying to scrape accuweather.com to find the weather in Tel Aviv.
This is my code:
from bs4 import BeautifulSoup
import requests

data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854")
soup = BeautifulSoup(data.text, "html.parser")
soup.find('div', {'class': 'info'})
I get this error:
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', OSError("(10060, 'WSAETIMEDOUT')",))
What can I do and what does this error mean?
What does this error mean?
Googling for "errno 10060" yields quite a few results. Basically, it's a low-level network error (it's not HTTP specific; you can have the same issue with any kind of network connection), whose canonical description is
A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
In other words, your system failed to connect to the host. This can happen for many reasons, either temporary (your internet connection is down) or not (a proxy, if you are behind one, blocking access to this host, etc.), or, as is the case here, quite simply the host blocking your requests.
The first thing to do when you get such an error is to check your internet connection, then try to fetch the URL in your browser. If you can get it in your browser, the host is most likely blocking you, often based on your client's "User-Agent" header (the client here being requests); specifying a "standard" user-agent header, as explained in newbie's answer, should solve the problem (it did for me in this case).
NB: to set the user agent:
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
The problem does not come from the code, but from the website.
If you add a User-Agent field to the request headers, it will look like the request comes from a browser.
Example:
from bs4 import BeautifulSoup
import requests
headers = {
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.106 Safari/537.36',
}
data = requests.get("https://www.accuweather.com/he/il/tel-aviv/215854/weather-forecast/215854", headers=headers)
I'm doing a three-month internship in a computer science laboratory (LIRIS). My internship supervisor asked me to retrieve some data from meilleurs-agents.com. This is a real estate website, and I would like to retrieve the price per square meter for each city. My program is in Python, and I am trying to send multiple requests to get the data, but it doesn't work because of a proxy error:
HTTPConnectionPool(host='XXXXXX', port=XXXX): Max retries exceeded with url: "..." (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<urllib3.connection.HTTPConnection object at 0x000000000B304320>: Failed to establish a new connection: [Errno 11001] getaddrinfo failed',)))
A preview of my code :
headers = requests.utils.default_headers()
headers.update({
    'User-Agent': 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7'
})

for city, postal_code in zip(cities, postal_codes):
    url = 'https://www.meilleursagents.com/prix-immobilier/' + city + '-' + postal_code + '/'
    PROXY = {'https': 'XX.XXX.X.XXX:XXXX'}
    try:
        response = requests.get(url, timeout=10, proxies=PROXY)
    except Exception as e:
        print(e)
If I remove the proxy, my request works, but the HTML contains a message like "you seem to be a bot, so your request hasn't been completed", so I can't get the prices... but I really need this data.
I hope my problem is clear and that someone can help me :)
Thanks, Nelly
PS: Sorry for my English, I'm a French student :D
Try changing the User-Agent header and cookies for your requests.
Another workaround is to add some delay between requests:
import time
time.sleep(1)  # try different delay values
This will of course slow down your script, but it may help to avoid too-many-requests errors.
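Putting both suggestions together, a rough sketch; the second User-Agent string is just an illustrative alternative, and cities/postal_codes are assumed to be defined as in the question:

import time
import requests

# Rotate between a couple of User-Agent strings and pause between requests.
user_agents = [
    'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.0.7) Gecko/2009021910 Firefox/3.0.7',
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36',
]

for i, (city, postal_code) in enumerate(zip(cities, postal_codes)):
    url = 'https://www.meilleursagents.com/prix-immobilier/' + city + '-' + postal_code + '/'
    headers = {'User-Agent': user_agents[i % len(user_agents)]}
    response = requests.get(url, timeout=10, headers=headers)
    time.sleep(1)  # try different delay values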