urllib.error.URLError urlopen error [Errno 54] Connection reset by peer
I got this error when trying to fetch notino.com. I guess the site uses some clever way to block screen scrapers. I tried adding a header and a cookie, but that didn't work.
from urllib.request import urlopen
url = "https://www.notino.com"
html = urlopen(url)
An auto-bot detection mechanism is most likely dropping your connection. You should provide a User-Agent header to fake a browser visit; this worked for me:
>>> import requests
>>> url = "https://www.notino.com"
>>> response = requests.get(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'})
>>> response.status_code
200
This example uses the requests module rather than urllib.
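If you'd rather keep the urllib.request approach from the question, the same header trick should work there too (a sketch, untested against this particular site):
from urllib.request import Request, urlopen

url = "https://www.notino.com"
# Any realistic browser User-Agent string should do here.
req = Request(url, headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/55.0.2883.95 Safari/537.36'})
html = urlopen(req).read()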
My code is below:
import urllib.request
import urllib.parse
from lxml import etree
HEADERS = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
# url is the base search URL and word is the query term, both defined earlier in the script
url = url + urllib.parse.quote(word)
print('search: ', word)
req = urllib.request.Request(url=url, headers=HEADERS, method='GET')
response = urllib.request.urlopen(req)
text = response.read().decode('utf-8')
This works fine, but after about 400 requests, I got this error:
urllib.error.URLError: <urlopen error [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1076)>
What might be the cause of this?
The connection is being rejected during the TLS handshake for some reason. It could be something transient (in which case a retry would work), or it may be due to a mismatch in TLS protocol versions or ciphers. It could also be for other reasons that aren't immediately obvious, like a block list or a broken server.
Adding some detection to your code that retries a few times then skips is probably a good general approach. Something like:
for i in range(3):
    try:
        response = urllib.request.urlopen(req)  # CONNECT: the call that may fail
    except urllib.error.URLError:
        continue  # this attempt failed; try again
    break  # success; stop retrying
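A more concrete sketch of that pattern, wrapped around the urllib call from the question (the retry count and the growing pause are my own arbitrary choices, not something the original code specifies):
import time
import urllib.error
import urllib.request

def fetch_with_retries(req, attempts=3, delay=1.0):
    # Try the request a few times; return None so the caller can skip this URL.
    for attempt in range(attempts):
        try:
            with urllib.request.urlopen(req) as response:
                return response.read().decode('utf-8')
        except urllib.error.URLError:
            time.sleep(delay * (attempt + 1))  # brief, growing pause between tries
    return None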
If you want to understand exactly why this particular URL is failing, the easiest solution is to download a copy of Wireshark and see what is happening on the wire. There will likely be a TLS error, and possibly an alert code and message that give more information.
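If you'd rather stay in Python, the standard ssl module can reproduce just the handshake, which often surfaces the same alert (a sketch; www.example.com is a placeholder for the failing host):
import socket
import ssl

host = 'www.example.com'  # substitute the failing host
ctx = ssl.create_default_context()
try:
    with socket.create_connection((host, 443), timeout=10) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            # Handshake succeeded: show the negotiated protocol and cipher.
            print(tls.version(), tls.cipher())
except ssl.SSLError as exc:
    print('handshake failed:', exc)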
I am downloading PDFs with Python, using the requests library. The following code works for some of the PDF documents, but it throws an error for a few of them.
from pathlib import Path
import requests
filename = Path('c:/temp.pdf')
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
response = requests.get(url, verify=False)
filename.write_bytes(response.content)
The following is the exact response (response.content); however, I can download the same document in the Chrome browser without any error:
b'<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n<H1>Access Denied</H1>\n \nYou don\'t have permission to access "http://www.rolls-royce.com/%7e/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf" on this server.<P>\nReference #18.36ad4d68.1562842755.6294c42\n</BODY>\n</HTML>\n'
Is there any way to get around this?
You get 403 Forbidden because requests sends a User-Agent: python-requests/2.19.1 header by default, and the server denies your request.
You can get the correct value for this header from your browser and everything will be fine.
For example:
import requests
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.169 YaBrowser/19.6.1.153 Yowser/2.5 Safari/537.36'}
url = 'https://www.rolls-royce.com/~/media/Files/R/Rolls-Royce/documents/investors/annual-reports/rr-full%20annual%20report--tcm92-55530.pdf'
r = requests.get(url, headers=headers)
print(r.status_code) # 200
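To go a step further and actually save the PDF (a sketch building on the answer's code above; the 8192-byte chunk size is an arbitrary choice):
from pathlib import Path

filename = Path('c:/temp.pdf')
with requests.get(url, headers=headers, stream=True) as r:
    r.raise_for_status()  # raise on 4xx/5xx instead of writing an error page to disk
    with filename.open('wb') as f:
        for chunk in r.iter_content(chunk_size=8192):
            f.write(chunk)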
I am trying to download an image from a URL with Python requests.session, also adding a user-agent, but I'm still getting a 403 Forbidden error. Please help.
My code:
import requests
from bs4 import BeautifulSoup
import pandas as pd
s = requests.session()
headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36'}
#logged in! cookies saved for future requests.
r=s.get("https://example.com/homepage",headers=headers)
#cookies sent automatically!
soup=BeautifulSoup(r.text,'html.parser')
te=s.get("https://www.example.com/"+soup.find(class_='yes').find('a').get('href'),headers=headers).text
s.get('https://img.example.com/exampleimgcode.jpg', stream=True, headers=headers)
Out[]: <Response [403]>
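One guess, since the real domain is masked as example.com here: image hosts often check the Referer header in addition to the User-Agent, so it may help to send the page the image was linked from (a sketch; the referring URL is hypothetical):
# Hypothetical: reuse the session's cookies and add a Referer pointing at the
# page that embeds the image.
img_headers = dict(headers)
img_headers['Referer'] = 'https://www.example.com/homepage'  # hypothetical referring page
r_img = s.get('https://img.example.com/exampleimgcode.jpg', stream=True, headers=img_headers)
print(r_img.status_code)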
I have trouble fetching the zomato.com website using Python and the requests library.
import requests
r = requests.get('https://www.zomato.com/san-antonio')
print(r.status_code)
I run this script and get no response. I'm guessing that the problem is HTTPS, but I tried it with some other HTTPS websites and it worked like a charm, and 200 was printed to the console.
Am I missing something here?
You'll need to pretend you're coming from an actual browser:
import requests
r = requests.get('https://www.zomato.com/san-antonio', headers={"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/47.0.2526.106 Safari/537.36"})
print(r.status_code)
# returns: 200
I want to open a URL using urllib.request.urlopen('someurl'):
with urllib.request.urlopen('someurl') as url:
    b = url.read()
I keep getting the following error:
urllib.error.HTTPError: HTTP Error 403: Forbidden
I understand the error to be due to the site not letting Python access it, to stop bots from wasting its network resources, which is understandable. I went searching and found that you need to change the user agent for urllib. However, all the guides and solutions I found for changing the user agent were written for urllib2, and I am using Python 3, so those solutions don't work.
How can I fix this problem with python 3?
From the Python docs:
import urllib.request
req = urllib.request.Request(
    url,
    data=None,
    headers={
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.47 Safari/537.36'
    }
)
f = urllib.request.urlopen(req)
print(f.read().decode('utf-8'))
Or, more compactly:
from urllib.request import urlopen, Request
urlopen(Request(url, headers={'User-Agent': 'Mozilla'}))
I just answered a similar question here: https://stackoverflow.com/a/43501438/206820
In case you not only want to open the URL but also want to download the resource (say, a PDF file), you can use the code below:
from urllib.request import ProxyHandler, build_opener, install_opener, urlretrieve

# proxy = ProxyHandler({'http': 'http://192.168.1.31:8888'})
proxy = ProxyHandler({})
opener = build_opener(proxy)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_4) AppleWebKit/603.1.30 (KHTML, like Gecko) Version/10.1 Safari/603.1.30')]
install_opener(opener)
# file_url and file_name are assumed to be defined by the caller
result = urlretrieve(url=file_url, filename=file_name)
The reason I added the proxy is so I could monitor the traffic in Charles.
The host site rejection is coming from the OWASP ModSecurity Core Rules for Apache mod-security. Rule 900002 has a list of "bad" user agents, and one of them is "python-urllib2". That's why requests with the default user agent fail.
Unfortunately, if you use Python's robotparser module
(https://docs.python.org/3.5/library/urllib.robotparser.html?highlight=robotparser#module-urllib.robotparser),
it uses the default Python user agent, and there's no parameter to change that. If robotparser's attempt to read robots.txt is refused (not just URL-not-found), it then treats all URLs from that site as disallowed.
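One workaround (my own sketch, not part of the original answer): fetch robots.txt yourself with a browser-like User-Agent and hand the lines to RobotFileParser.parse() instead of calling read(); the host here is a placeholder:
import urllib.request
import urllib.robotparser

# Fetch robots.txt with a custom User-Agent, bypassing robotparser's own reader.
req = urllib.request.Request(
    'https://www.example.com/robots.txt',
    headers={'User-Agent': 'Mozilla/5.0'},
)
with urllib.request.urlopen(req) as resp:
    lines = resp.read().decode('utf-8').splitlines()

rp = urllib.robotparser.RobotFileParser()
rp.parse(lines)
print(rp.can_fetch('Mozilla/5.0', 'https://www.example.com/some/page'))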