I'm trying to get the content of App Store > Business:
import requests
from lxml import html
page = requests.get("https://itunes.apple.com/in/genre/ios-business/id6000?mt=8")
tree = html.fromstring(page.text)
flist = []
plist = []
for i in range(0, 100):
app = tree.xpath("//div[#class='column first']/ul/li/a/#href")
ap = app[0]
page1 = requests.get(ap)
When I try the range with (0,2) it works, but when I put the range in 100s it shows this error:
Traceback (most recent call last):
File "/home/preetham/Desktop/eg.py", line 17, in <module>
page1 = requests.get(ap)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 55, in get
return request('get', url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 44, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 383, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 486, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 378, in send
raise ConnectionError(e)
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='itunes.apple.com', port=443): Max retries exceeded with url: /in/app/adobe-reader/id469337564?mt=8 (Caused by <class 'socket.gaierror'>: [Errno -2] Name or service not known)
Just use requests features:
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
session = requests.Session()
retry = Retry(connect=3, backoff_factor=0.5)
adapter = HTTPAdapter(max_retries=retry)
session.mount('http://', adapter)
session.mount('https://', adapter)
session.get(url)
This will GET the URL and retry 3 times in case of requests.exceptions.ConnectionError. backoff_factor will help to apply delays between attempts to avoid failing again in case of periodic request quota.
Take a look at urllib3.util.retry.Retry, it has many options to simplify retries.
What happened here is that itunes server refuses your connection (you're sending too many requests from same ip address in short period of time)
Max retries exceeded with url: /in/app/adobe-reader/id469337564?mt=8
error trace is misleading it should be something like "No connection could be made because the target machine actively refused it".
There is an issue at about python.requests lib at Github, check it out here
To overcome this issue (not so much an issue as it is misleading debug trace) you should catch connection related exceptions like so:
try:
page1 = requests.get(ap)
except requests.exceptions.ConnectionError:
r.status_code = "Connection refused"
Another way to overcome this problem is if you use enough time gap to send requests to server this can be achieved by sleep(timeinsec) function in python (don't forget to import sleep)
from time import sleep
All in all requests is awesome python lib, hope that solves your problem.
Just do this,
Paste the following code in place of page = requests.get(url):
import time
page = ''
while page == '':
try:
page = requests.get(url)
break
except:
print("Connection refused by the server..")
print("Let me sleep for 5 seconds")
print("ZZzzzz...")
time.sleep(5)
print("Was a nice sleep, now let me continue...")
continue
You're welcome :)
I got similar problem but the following code worked for me.
url = <some REST url>
page = requests.get(url, verify=False)
"verify=False" disables SSL verification. Try and catch can be added as usual.
pip install pyopenssl seemed to solve it for me.
https://github.com/requests/requests/issues/4246
Specifying the proxy in a corporate environment solved it for me.
page = requests.get("http://www.google.com:80", proxies={"http": "http://111.233.225.166:1234"})
The full error is:
requests.exceptions.ConnectionError: HTTPSConnectionPool(host='www.google.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond'))
It is always good to implement exception handling. It does not only help to avoid unexpected exit of script but can also help to log errors and info notification. When using Python requests I prefer to catch exceptions like this:
try:
res = requests.get(adress,timeout=30)
except requests.ConnectionError as e:
print("OOPS!! Connection Error. Make sure you are connected to Internet. Technical Details given below.\n")
print(str(e))
renewIPadress()
continue
except requests.Timeout as e:
print("OOPS!! Timeout Error")
print(str(e))
renewIPadress()
continue
except requests.RequestException as e:
print("OOPS!! General Error")
print(str(e))
renewIPadress()
continue
except KeyboardInterrupt:
print("Someone closed the program")
Here renewIPadress() is a user define function which can change the IP address if it get blocked. You can go without this function.
Adding my own experience for those who are experiencing this in the future. My specific error was
Failed to establish a new connection: [Errno 8] nodename nor servname provided, or not known'
It turns out that this was actually because I had reach the maximum number of open files on my system. It had nothing to do with failed connections, or even a DNS error as indicated.
When I was writing a selenium browser test script, I encountered this error when calling driver.quit() before a usage of a JS api call.Remember that quiting webdriver is last thing to do!
i wasn't able to make it work on windows even after installing pyopenssl and trying various python versions (while it worked fine on mac), so i switched to urllib and it works on python 3.6 (from python .org) and 3.7 (anaconda)
import urllib
from urllib.request import urlopen
html = urlopen("http://pythonscraping.com/pages/page1.html")
contents = html.read()
print(contents)
just import time
and add :
time.sleep(6)
somewhere in the for loop, to avoid sending too many request to the server in a short time.
the number 6 means: 6 seconds.
keep testing numbers starting from 1, until you reach the minimum seconds that will help to avoid the problem.
It could be network config issue also. So, for that u need to re-config ur network confgurations.
for Ubuntu :
sudo vim /etc/network/interfaces
add 8.8.8.8 in dns-nameserver and save it.
reset ur network : /etc/init.d/networking restart
Now try..
Adding my own experience :
r = requests.get(download_url)
when I tried to download a file specified in the url.
The error was
HTTPSConnectionPool(host, port=443): Max retries exceeded with url (Caused by SSLError(SSLError("bad handshake: Error([('SSL routines', 'tls_process_server_certificate', 'certificate verify failed')])")))
I corrected it by adding verify = False in the function as follows :
r = requests.get(download_url + filename)
open(filename, 'wb').write(r.content)
Check your network connection. I had this and the VM did not have a proper network connection.
I had the same error when I run the route in the browser, but in postman, it works fine. It issue with mine was that, there was no / after the route before the query string.
127.0.0.1:5000/api/v1/search/?location=Madina raise the error and removing / after the search worked for me.
This happens when you send too many requests to the public IP address of https://itunes.apple.com. It as you can see caused due to some reason which does not allow/block access to the public IP address mapping with https://itunes.apple.com. One better solution is the following python script which calculates the public IP address of any domain and creates that mapping to the /etc/hosts file.
import re
import socket
import subprocess
from typing import Tuple
ENDPOINT = 'https://anydomainname.example.com/'
ENDPOINT = 'https://itunes.apple.com/'
def get_public_ip() -> Tuple[str, str, str]:
"""
Command to get public_ip address of host machine and endpoint domain
Returns
-------
my_public_ip : str
Ip address string of host machine.
end_point_ip_address : str
Ip address of endpoint domain host.
end_point_domain : str
domain name of endpoint.
"""
# bash_command = """host myip.opendns.com resolver1.opendns.com | \
# grep "myip.opendns.com has" | awk '{print $4}'"""
# bash_command = """curl ifconfig.co"""
# bash_command = """curl ifconfig.me"""
bash_command = """ curl icanhazip.com"""
my_public_ip = subprocess.getoutput(bash_command)
my_public_ip = re.compile("[0-9.]{4,}").findall(my_public_ip)[0]
end_point_domain = (
ENDPOINT.replace("https://", "")
.replace("http://", "")
.replace("/", "")
)
end_point_ip_address = socket.gethostbyname(end_point_domain)
return my_public_ip, end_point_ip_address, end_point_domain
def set_etc_host(ip_address: str, domain: str) -> str:
"""
A function to write mapping of ip_address and domain name in /etc/hosts.
Ref: https://stackoverflow.com/questions/38302867/how-to-update-etc-hosts-file-in-docker-image-during-docker-build
Parameters
----------
ip_address : str
IP address of the domain.
domain : str
domain name of endpoint.
Returns
-------
str
Message to identify success or failure of the operation.
"""
bash_command = """echo "{} {}" >> /etc/hosts""".format(ip_address, domain)
output = subprocess.getoutput(bash_command)
return output
if __name__ == "__main__":
my_public_ip, end_point_ip_address, end_point_domain = get_public_ip()
output = set_etc_host(ip_address=end_point_ip_address, domain=end_point_domain)
print("My public IP address:", my_public_ip)
print("ENDPOINT public IP address:", end_point_ip_address)
print("ENDPOINT Domain Name:", end_point_domain )
print("Command output:", output)
You can call the above script before running your desired function :)
My situation is rather special. I tried the answers above, none of them worked. I suddenly thought whether it has something to do with my Internet proxy? You know, I'm in mainland China, and I can't access sites like google without an internet proxy. Then I turned off my Internet proxy and the problem was solved.
In my case, I am deploying some docker containers inside the python script and then calling one of the deployed services. Error is fixed when I add some delay before calling the service. I think it needs time to get ready to accept connections.
from time import sleep
#deploy containers
#get URL of the container
sleep(5)
response = requests.get(url,verify=False)
print(response.json())
First I ran the run.py file and then I ran the unit_test.py file, it works for me
Add headers for this request.
headers={
'Referer': 'https://itunes.apple.com',
'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'
}
requests.get(ap, headers=headers)
I am coding a test with Gauge and I encountered this error as well, it was because I was trying to request an internal URL without activating VPN.
Related
I want to crawl a given list by the Top-1-Million from Alexa, to check which website still offers acces via http:// an do not redirect to https://.
If the webpage does not redirect to a https:// Domain, it should be written into a csv file.
The Problem occurs, when I am adding a bunch of multiple URLs. Than I get two errors:
ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1056
or
requests.exceptions.ConnectionError: HTTPConnectionPool(host='17ok.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11001] getaddrinfo failed')
I have tried the opportunities mentioned in the following threads and documentation:
https://2.python-requests.org//en/latest/user/advanced/#ssl-cert-verification
Edit: the sample url: https://requestb.in raises a 404 error actually, probably does not exist even more (?)
Python Requests throwing SSLError
Python Requests: NewConnectionError
requests.exceptions.SSLError: HTTPSConnectionPool: (Caused by SSLError(SSLError(336445449, '[SSL] PEM lib (_ssl.c:3816)')))
and some other delivered solutions.
The option to set verify=False helps, when using it for few URLs, but not when using a List > 10 URLs, the program brakes. I tried my program on a Win10 machine as well as on Ubuntu 16.04.
As expected, its the same issue. I also tried the option using Sessions and installed the certificate library as sugested.
If I am just calling three pages like 'http://www.example.com', 'https://www.github.com' and 'http://www.python.org', its not a big deal and the delivered solutions. The Headache starts, when using a bunch of URLs from the Alexa List.
Here is my code, which is working, when using it for only 3-4 urls:
import requests
from requests.utils import urlparse
urls = ['http://www.example.com',
'http://bloomberg.com',
'http://github.com',
'https://requestbin.fullcontact.com/']
with open('G:\\Request_HEADER_Suite/dummy/http.csv', 'w') as f:
for url in urls:
r = requests.get(url, stream=True, verify=False)
parsed_url = urlparse(r.url)
print("URL: ", url)
print("Redirected to: ", r.url)
print("Status Code: ", r.status_code)
print("Scheme: ", parsed_url.scheme)
if parsed_url.scheme == 'http':
f.write(url + '\n')
I expect to crawl at least a list with 100 URLs. The code should write URLs which are accessible by http:// and do not redirect to https:// into a csv file or complementary database and ignore all URLs with https://.
Because it is working for few URLs, I would expectd a stable opportunity for a larger scan.
But 2 errors araise and break the program. Is it worthy to try a workaround using pytest? Any other suggestions? Thanks in advance.
EDIT:
This is a list, which will raise errors. Only for clarification, this list from a study based on the Alexa-Top-1-Million.
urls = ['http://www.example.com',
'http://bloomberg.com',
'http://github.com',
'https://requestbin.fullcontact.com/',
'http://51sole.com',
'http://58.com',
'http://9gag.com',
'http://abs-cbn.com',
'http://academia.edu',
'http://accuweather.com',
'http://addroplet.com',
'http://addthis.com',
'http://adf.ly',
'http://adhoc2.net',
'http://adobe.com',
'http://1688.com',
'http://17ok.com',
'http://17track.net',
'http://1and1.com',
'http://1tv.ru',
'http://2ch.net',
'http://360.cn',
'http://39.net',
'http://4chan.org',
'http://4pda.ru']
I double checked, the last time the errors starts with the url 17.ok.com. But I have also tried different lists with urls. Thanks for your support.
I am using the Python Requests module (v. 2.19.1) with Python 3.4.3, calling a function on a remote server that generates a .csv file for download. In general, it works perfectly. There is one particular file that takes >6 minutes to complete, and no matter what I set the timeout parameter to, I get an error after exactly 5 minutes trying to generate that file.
import requests
s = requests.Session()
authPayload = {'UserName': 'myloginname','Password': 'password'}
loginURL = 'https://myremoteserver.com/login/authenticate'
login = s.post(loginURL, data=authPayload)
backupURL = 'https://myremoteserver.com/directory/jsp/Backup.jsp'
payload = {'command': fileCommand}
headers = {'Connection': 'keep-alive'}
post = s.post(backupURL, data=payload, headers=headers, timeout=None)
This times out after exactly 5 minutes with the error:
File "/usr/lib/python3/dist-packages/requests/adapters.py", line 330, in send
timeout=timeout
File "/usr/lib/python3/dist-packages/urllib3/connectionpool.py", line 612, in urlopen
raise MaxRetryError(self, url, e)
urllib3.exceptions.MaxRetryError: > HTTPSConnectionPool(host='myremoteserver.com', port=443): Max retries exceeded with url: /directory/jsp/Backup.jsp (Caused by < class 'http.client.BadStatusLine'>: '')
If I set timeout to something much smaller, say, 5 seconds, I get a error that makes perfect sense:
urllib3.exceptions.ReadTimeoutError:
HTTPSConnectionPool(host='myremoteserver.com', port=443): Read
timed out. (read timeout=5)
If I run the process from a browser, it works fine, so it doesn't seem like it's the remote server closing the connection, or a firewall or something in-between closing the connection.
Posted at the request of the OP -- my comments on the original question pointed to a related SO problem
The clue to the problem lies in the http.client.BadStatusLine error.
Take a look at the following related SO Q & A that discusses the impact of proxy servers on HTTP requests and responses.
I'm trying to use the Python requests library to send an android .apk file to a API service. I've successfully used requests and this file type to submit to another service but I keep getting a:
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='REDACTED', port=443): Max retries exceeded with url: /upload/app (Caused by : [WinError 10054] An existing connection was forcibly closed by the remote host)",),)
This is the code responsible:
url = "https://website"
files = {'file': open(app, 'rb')}
headers = {'user':'value', 'pass':'value'}
try:
response = requests.post(url, files=files, headers=headers)
jsonResponse = json.loads(response.text)
if 'error' in jsonResponse:
logger.error(jsonResponse['error'])
except Exception as e:
logger.error("Exception when trying to upload app to host")
The response line is throwing the above mentioned exception. I've used these exact same parameters using the Chrome Postman extension to replicate the POST request and it works perfectly. I've used the exact same format of file to upload to another RESTful service as well. The only difference between this request and the one that works is that this one has custom headers attached in order to verify the POST. The API doesn't stipulate this as authentication in the sense of needing to be encoded and the examples both in HTTP and cURL define these values as headers or -H.
Any help would be most appreciated!
So this was indeed a certificates issue. In my case I was able to stay internal to my company and connect to another URL, but the requests library, which is quite amazing, has information on certs at: http://docs.python-requests.org/en/latest/user/advanced/?highlight=certs
For all intents and purposes this is answered but perhaps it will be useful to someone in posterity.
I'm trying to follow the suggestions from this thread, but its not working. At this point, I just wanted to perform a very simple test to make sure that I was actually getting the data returned from the site I'm trying to open. For my simple test I was trying to open the Yahoo weather api. I've verified that typing in this address in a web browser does indeed return data. I've tried both of these code snippets and neither one is working.
import urllib
params = urllib.urlencode({'w': 2482950})
f = urllib.urlopen("http://weather.yahooapis.com/forecastrss?%s" % params)
print f.read()
This example came straight from the Python website. This dies with:
IOError: [Errno socket error] [Errno 110] Connection timed out
I also tried using httplib like this:
import httplib
conn = httplib.HTTPConnection(host='weather.yahooapis.com', timeout=10)
req = '/forecastrss?w=2482950'
try:
conn.request('GET',req)
except:
print "Didn't work"
content = conn.getresponse().read()
print content
Trying this gives me the following error:
self.fp = sock.makefile('rb', 0)
AttributeError: 'NoneType' object has no attribute 'makefile'
It appears that I'm not ever making the connection to the remote host. Any ideas what I'm doing wrong?
The problem was indeed the firewall. I added the following code to get it to work:
proxy_support = urllib2.ProxyHandler({"http":"http://<proxy>:<port>"})
opener = urllib2.build_opener(proxy_support)
urllib2.install_opener(opener)
html = urllib2.urlopen(url).read()
print html
I have a pile of tasks to automate within cPanel. There is a cPanel API described at http://videos.cpanel.net/cpanel-api-automation/ but I tried what I thought was easier for me...
Based on an answer from skyronic at How do I send a HTTP POST value to a (PHP) page using Python? I tried
import urllib, urllib2, ssl
url = 'https://mysite.com:2083/login'
user_agent = 'Mozilla/5.0 meridia (Windows NT 5.1; U; en)'
values = {'name':cpaneluser,
'pass':cpanelpw}
headers = {'User-Agent':user_agent}
data = urllib.urlencode(values)
req = urllib2.Request(url,data,headers)
response = urllib2.urlopen(req)
page = response.read()
The call to urlopen() is raising NameError: global name 'HTTPSConnectionV3' is not defined.
So then based on http://bugs.python.org/issue11220 I tried preceding the code above with
import httplib
class HTTPSConnectionV3(httplib.HTTPSConnection):
def __init__(self, *args, **kwargs):
httplib.HTTPSConnection.__init__(self, *args, **kwargs)
def connect(self):
sock = socket.create_connection((self.host, self.port), self.timeout)
if self._tunnel_host:
self.sock = sock
self._tunnel()
try:
self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file, \
ssl_version=ssl.PROTOCOL_SSLv3)
except ssl.SSLError, e:
print("Trying SSLv3.")
self.sock = ssl.wrap_socket(sock, self.key_file, self.cert_file, \
ssl_version=ssl.PROTOCOL_SSLv23)
class HTTPSHandlerV3(urllib2.HTTPSHandler):
def https_open(self, req):
return self.do_open(HTTPSConnectionV3, req)
urllib2.install_opener(urllib2.build_opener(HTTPSHandlerV3()))
This does print the "Trying SSLv3" and raises URLError: <urlopen error [Errno 1] _ssl.c:504: error:140770FC:SSL routines:SSL23_GET_SERVER_HELLO:unknown protocol>
And finally that led me to https://github.com/kennethreitz/requests/issues/606 where gregakespret who say he solved a similar problem using a solution from Senthil Kuaran at http://bugs.python.org/issue11220 :
https_sslv3_handler = urllib.request.HTTPSHandler(context=ssl.SSLContext(ssl.PROTOCOL_SSLv3))
opener = urllib.request.build_opener(https_sslv3_handler)
urllib.request.install_opener(opener)
But that raises AttributeError: 'module' object has no attribute 'request'. And indeed help(urllib) doesn't include any mention of request, and import urllib.request results in No module named request.
I'm using Python 2.7.3 within the Enthought Canopy distribution. The cPanel site is using a self-signed certificate, which I mention since it'sa an irregularity that would trip up a regular browser, though I gather that urllib and urllib2 don't actually authenticate the certificate anyway.
Thank you for reading, more so if you have a suggestion or can help me understand the problem.
I would use the requests library.
I'm the OP and it's been a while since I posted this. I've solved other related tasks (POST to use Instructure's Canvas API) using the requests library and found that code that had worked with urllib/urllib2 is now much shorter and sweeter.
Someone just upvoted this question, causing me to see that noone had answered it. My answer isn't much of one to my OP, but it is the direction I'd advise, having solved related problems since the post.
As for this question, I solved the problem by scripting in BASH on the server that had cPanel running. It was just a matter of identifying which cPanel scripts to call. But I did not get it running through the cPanel Web API.