I'm trying to learn web scraping in Python with the requests-html package. First, I render a main page and pull out all the necessary links; that works just fine. Then I iterate over all the links and render the specific subpage for each link. Two iterations succeed, but on the third I get an error that I am unable to solve.
Here is my code:
# import HTMLSession from requests_html
from requests_html import HTMLSession
# create an HTML Session object
session = HTMLSession()
# Use the object above to connect to needed webpage
baseurl = 'http://www.möbelfreude.de/'
resp = session.get(baseurl+'alle-boxspringbetten')
# Run JavaScript code on webpage
resp.html.render()
links = resp.html.find('a.image-wrapper.text-center')
for link in links:
    print('Rendering... {}'.format(link.attrs['href']))
    r = session.get(baseurl + link.attrs['href'])
    r.html.render()
    print('Completed rendering... {}'.format(link.attrs['href']))
    # do stuff
Error:
Completed rendering... bett/boxspringbett-bea
Rendering... bett/boxspringbett-valina
Completed rendering... bett/boxspringbett-valina
Rendering... bett/boxspringbett-benno-anthrazit
Traceback (most recent call last):
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 603, in urlopen
chunked=chunked)
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 387, in _make_request
six.raise_from(e, None)
File "<string>", line 2, in raise_from
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\site-packages\urllib3\connectionpool.py", line 383, in _make_request
httplib_response = conn.getresponse()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 1336, in getresponse
response.begin()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 306, in begin
version, status, reason = self._read_status()
File "C:\Users\pasca\AppData\Local\Programs\Python\Python37-32\lib\http\client.py", line 275, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
During handling of the above exception, another exception occurred:
The error is due to the connection being closed by the server, and may be caused by some configuration on the server side.
Have you tried scraping the site and appending the links to a list, then requesting each link individually to find the specific page that is causing the issue?
Using dev mode in Chrome, under the Network tab, can help identify the necessary headers for requests that require them.
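A minimal sketch of that approach, assuming the same page structure and selector as in the question; the User-Agent value is a placeholder to be replaced with whatever Chrome's Network tab shows the site actually requires:
# Sketch only: collect hrefs first, then render each subpage inside try/except
# so the link that triggers the disconnect is easy to identify.
from requests_html import HTMLSession

session = HTMLSession()
baseurl = 'http://www.möbelfreude.de/'
headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder; copy real headers from the Network tab

resp = session.get(baseurl + 'alle-boxspringbetten', headers=headers)
resp.html.render()
hrefs = [a.attrs['href'] for a in resp.html.find('a.image-wrapper.text-center')]

for href in hrefs:
    try:
        r = session.get(baseurl + href, headers=headers)
        r.html.render()
        print('Completed rendering... {}'.format(href))
    except Exception as exc:
        print('Failed on {} -> {}'.format(href, exc))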
Related
I'm using the requests library (which uses urllib3 and the Python http module under the hood) to upload a file from a Python script.
My backend starts by inspecting the headers of the request, and if they don't comply with the needed prerequisites, it stops the request right away and responds with a valid 400 response.
This behavior works fine in Postman or with curl; i.e. the client is able to parse the 400 response even though it hasn't completed the upload and the server answers prematurely.
However, when doing the same in Python with requests/urllib3, the library is unable to process the backend response:
Traceback (most recent call last):
File "C:\Users\Neumann\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\urllib3\connectionpool.py", line 670, in urlopen
httplib_response = self._make_request(
File "C:\Users\Neumann\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.8_qbz5n2kfra8p0\LocalCache\local-packages\Python38\site-packages\urllib3\connectionpool.py", line 392, in _make_request
conn.request(method, url, **httplib_request_kw)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1255, in request
self._send_request(method, url, body, headers, encode_chunked)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1301, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1250, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 1049, in _send_output
self.send(chunk)
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.8_3.8.1776.0_x64__qbz5n2kfra8p0\lib\http\client.py", line 971, in send
self.sock.sendall(data)
ConnectionResetError: [WinError 10054] An existing connection was forcibly closed by the remote host
Because the server answers before the transfer is complete, the client mistakenly considers that the connection has been aborted, even though the server DOES return a valid response.
Is there a way to avoid this and parse the response nonetheless?
Steps to reproduce the issue:
Download minIO: https://min.io/download#/
Run minIO:
export MINIO_ACCESS_KEY=<access_key>
export MINIO_SECRET_KEY=<secret_key>
.\minio.exe server <data folder>
Run the following script:
import os
import sys
import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

def fatal(msg):
    print(msg)
    sys.exit(1)

def upload_file():
    mp_encoder = MultipartEncoder(fields={'file': (open('E:/Downloads/kek.mp3', 'rb'))})
    headers = { "Authorization": "invalid" }
    print('Uploading file with headers : ' + str(headers))
    upload_endpoint = 'http://localhost:9000/mybucket/myobject'
    try:
        r = requests.put(upload_endpoint, headers=headers, data=mp_encoder, verify=False)
    except requests.exceptions.ConnectionError as e:
        print(e.status)
        for property, value in vars(e).items():
            print(property, ":", value)
        fatal(str(e))
    if r.status_code != 201:
        for property, value in vars(r).items():
            print(property, ":", value)
        fatal('Error while uploading file. Status ' + str(r.status_code))
    print('Upload successfully completed')

if __name__ == "__main__":
    upload_file()
If you change the request line to this, it will work (i.e. the server returns 400 and the client is able to parse it):
r = requests.put(upload_endpoint, headers=headers, data='a string', verify=False)
EDIT: I updated the traceback and changed the question title to reflect the fact that it's neither requests' nor urllib3's fault, but rather the Python http module that is used by both of them.
This problem should be fixed in urllib3 v1.26.0. What version are you running?
The problem is that the server closes the connection after it responds with the 400, so the socket is already closed when urllib3 tries to keep sending data to it. So the client isn't really mistakenly thinking that the connection has been closed; it just mishandles that situation.
Your example code works fine on my machine with urllib3==1.26.0. But I notice that you get a different exception on your Windows machine, so it might be that the fix doesn't work there. In that case, I would just catch the exception and file a bug report with the maintainers of urllib3.
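A minimal sketch of the two suggestions above, reusing the endpoint and headers from the question: print the installed urllib3 version, and catch the connection error around the PUT so the script reports it instead of crashing.
import requests
import urllib3

print(urllib3.__version__)  # the relevant fix landed in urllib3 1.26.0

upload_endpoint = 'http://localhost:9000/mybucket/myobject'  # endpoint from the question
headers = {"Authorization": "invalid"}

try:
    r = requests.put(upload_endpoint, headers=headers, data=b'payload', verify=False)
    print(r.status_code, r.text)
except (requests.exceptions.ConnectionError, ConnectionResetError) as exc:
    print('Connection dropped before the response could be read:', exc)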
You should try requests.get instead of put and then check if it works. Please find my sample code below.
import requests

Flag = 0  # failure counter
try:
    output = requests.get('http://' + <ipaddress>)
    status = output.status_code
    print(status)
    if status == 200:
        print("PASS: HTTP is successful")
    else:
        raise RuntimeError("FAIL: HTTP is not successful")
except:
    Flag = Flag + 1
    print("FAIL: HTTP is not successful")
This code of mine works fine with Python 3.
All I want to do is scrape some data about earthquakes from a website. In fact, I just want Python to be able to extract data from URLs. For some reason, even the simplest code, which only opens a URL and uses '.readlines()', is met with a wall of errors. It doesn't seem to understand the 'urlopen' command, nor much of anything else.
I don't know what to even try, because I can't parse the errors it's giving me. I was hoping, before I had to do something drastic like re-download Python, that someone would have an answer for me.
import urllib.request

def urltest():
    url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
    f = urllib.request.urlopen(url)
    allLines = f.readlines()
    f.close()
    line = allLines[0].decode()
    print(line)
This is the code I've used to simply test it. The URL goes to a website which holds a .csv file, which Python should easily acquire and read through.
If anyone wants, I can post the entire wall of errors that this code returns. There look to be at least 6 different ones, but this is the final line that it spits back:
urllib.error.URLError: <urlopen error unknown url type: https>
Looking through the urllib.request module, it loads a collection of handlers. We can see this code snippet in urllib/request.py:
if hasattr(http.client, "HTTPSConnection"):
    default_classes.append(HTTPSHandler)
skip = set()
for klass in default_classes:
    for check in handlers:
        if isinstance(check, type):
            if issubclass(check, klass):
                skip.add(klass)
        elif isinstance(check, klass):
            skip.add(klass)
for klass in skip:
    default_classes.remove(klass)

for klass in default_classes:
    opener.add_handler(klass())
So the https handler class is only loaded if http.client has the attribute HTTPSConnection. If we look in http/client.py we can see the following code for setting this attribute:
try:
    import ssl
except ImportError:
    pass
else:
    class HTTPSConnection(HTTPConnection):
        "This class allows communication via SSL."
        default_port = HTTPS_PORT
So the HTTPSConnection class is only created if the ssl module can be imported successfully. If your system doesn't have the ssl module, then http.client won't define the HTTPSConnection class, which in turn means the attribute is not added, and as such urllib won't load a handler for https.
The code you provided worked on my system. I added the following code before it to make my system unable to locate the ssl module:
# load then remove the ssl module from the system
import sys
import ssl
del ssl
sys.modules['ssl'] = None

import urllib.request

def urltest():
    url = "http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv"
    f = urllib.request.urlopen(url)
    allLines = f.readlines()
    f.close()
    line = allLines[0].decode()
    print(line)

urltest()
Doing this, I get the same error you were getting:
C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\python.exe C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py
Traceback (most recent call last):
File "C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py", line 19, in <module>
urltest()
File "C:/Users/cd00119621/PycharmProjects/ideas/stackoverflow.py", line 13, in urltest
f = urllib.request.urlopen(url)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 222, in urlopen
return opener.open(url, data, timeout)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 531, in open
response = meth(req, response)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 641, in http_response
'http', request, response, code, msg, hdrs)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 563, in error
result = self._call_chain(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 755, in http_error_302
return self.parent.open(new, timeout=req.timeout)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 525, in open
response = self._open(req, data)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 548, in _open
'unknown_open', req)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 503, in _call_chain
result = func(*args)
File "C:\Users\cd00119621\AppData\Local\Programs\Python\Python37\lib\urllib\request.py", line 1387, in unknown_open
raise URLError('unknown url type: %s' % type)
urllib.error.URLError: <urlopen error unknown url type: https>
So I suspect you have installed Python without SSL configured. You should be able to verify this easily by trying to import ssl from the Python command line. If you get an error like
>>> import ssl
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
ModuleNotFoundError: No module named 'ssl'
then that will be the cause of your issues. You would have to either reinstall Python with SSL configured or somehow build the ssl module from source.
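A quick way to check whether a given interpreter has working SSL support, using only the standard library; if the import succeeds it also prints the OpenSSL version the interpreter was built against:
import ssl
print(ssl.OPENSSL_VERSION)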
It looks like the problem is a network (DNS/proxy/firewall) issue.
https://github.com/pbugnion/gmaps/issues/245
You can use Pandas:
import pandas as pd
data = pd.read_csv('http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/all_day.csv')
print(data)
I am trying to implement proxies in my web crawler. Without proxies, my code has no problem connecting to the website; however, when I try to add proxies, it suddenly won't connect! Nobody on python-requests seems to have posted about this problem, so I'm hoping you all can help me!
Background info: I'm using a Mac and Anaconda's Python 3.4 inside a virtual environment.
Here is my code that works without proxies:
import requests
from bs4 import BeautifulSoup

proxyDict = {'http': 'http://10.10.1.10:3128'}

def pmc_spider(max_pages, pmid):
    start = 1
    titles_list = []
    url_list = []
    url_keys = []
    while start <= max_pages:
        url = 'http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/'+str(pmid)+'/citedby/?page='+str(start)
        req = requests.get(url)  # this works
        plain_text = req.text
        soup = BeautifulSoup(plain_text, "lxml")
        for items in soup.findAll('div', {'class': 'title'}):
            title = items.get_text()
            titles_list.append(title)
            for link in items.findAll('a'):
                urlkey = link.get('href')
                url_keys.append(urlkey)  # url = base + key
                url = "http://www.ncbi.nlm.nih.gov"+str(urlkey)
                url_list.append(url)
        start += 1
    return titles_list, url_list, authors_list
Based on other posts I'm looking at, I should just be able to replace this:
req = requests.get(url)
with this:
req = requests.get(url, proxies=proxyDict, timeout=2)
But this doesn't work! :( If I run it with this line of code, the terminal gives me a timeout error:
socket.timeout: timed out
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 578, in urlopen
chunked=chunked)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 362, in _make_request
conn.request(method, url, **httplib_request_kw)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1137, in request
self._send_request(method, url, body, headers)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1182, in _send_request
self.endheaders(body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 1133, in endheaders
self._send_output(message_body)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 963, in _send_output
self.send(msg)
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/http/client.py", line 898, in send
self.connect()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 167, in connect
conn = self._new_conn()
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connection.py", line 147, in _new_conn
(self.host, self.timeout))
requests.packages.urllib3.exceptions.ConnectTimeoutError: (<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)')
And then I get a few of these printed in the terminal with different traces but the same error:
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/adapters.py", line 403, in send
timeout=timeout
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/connectionpool.py", line 623, in urlopen
_stacktrace=sys.exc_info()[2])
File "/Users/hclent/anaconda3/envs/py34/lib/python3.4/site-packages/requests/packages/urllib3/util/retry.py", line 281, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
requests.packages.urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='10.10.1.10', port=3128): Max retries exceeded with url: http://www.ncbi.nlm.nih.gov/pmc/articles/pmid/18269575/citedby/?page=1 (Caused by ConnectTimeoutError(<requests.packages.urllib3.connection.HTTPConnection object at 0x1052665f8>, 'Connection to 10.10.1.10 timed out. (connect timeout=2)'))
Why would the addition of proxies to my code suddenly cause me to time out? I tried it on several random URLs and had the same thing happen, so it seems to be a problem with the proxies rather than with my code. However, I'm at the point where I MUST use proxies now, so I need to get to the root of this and fix it. I've also tried several different IP addresses for the proxy from a VPN that I use, so I know the IP addresses are valid.
I appreciate your help so much! Thank you!
It looks like you'll need to use an HTTP or HTTPS proxy that will actually respond to requests.
The 10.10.1.10:3128 in your code seems to be from the examples in the requests documentation.
Taking a proxy from the list at http://proxylist.hidemyass.com/search-1291967 (which may not be the best source), your proxyDict should look like this: {'http' : 'http://209.242.141.60:8080'}
Testing this on the command line seems to work fine:
>>> proxies = {'http' : 'http://209.242.141.60:8080'}
>>> requests.get('http://google.com', proxies=proxies)
<Response [200]>
The following Python code usually gives me the HTML of the website specified in the url variable, and urlopen works correctly. The problem: with one particular website, I am now no longer able to make this code work.
from urllib.request import urlopen
url = "http://www.somewebsite.com"
html = urlopen(url)
print(html.read())
The error is the following:
File "C:/script.py", line 4, in <module>
html = urlopen(url)
[...]
File "C:\Python\Python35\lib\http\client.py", line 251, in _read_status
raise RemoteDisconnected("Remote end closed connection without"
http.client.RemoteDisconnected: Remote end closed connection without response
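A common workaround for this kind of RemoteDisconnected error, not confirmed for this particular site, is to send browser-like headers with the request, since some servers drop clients that look like scripts. A minimal sketch using the placeholder URL from the question:
from urllib.request import Request, urlopen

url = "http://www.somewebsite.com"  # placeholder URL from the question
req = Request(url, headers={"User-Agent": "Mozilla/5.0"})  # pretend to be a browser
with urlopen(req) as resp:
    print(resp.read()[:500])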
I'm trying to fetch some data from http://m.finnkino.fi/events/now_showing, but at the moment I'm failing badly because I'm not even able to load the page source with python.
At the moment I'm using following code:
req = urllib2.urlopen(URL,None,2.5)
page = req.read()
print page
Here is the traceback for the timeout error:
Traceback (most recent call last):
File "user/src/finnkinoParser.py", line 26, in <module>
main()
File "user/src/finnkinoParser.py", line 13, in main
getNowPlayingMovies()
File "user/src/finnkinoParser.py", line 17, in getNowPlayingMovies
req = urllib2.urlopen(baseURL,None,2.5)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 124, in urlopen
return _opener.open(url, data, timeout)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 383, in open
response = self._open(req, data)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 401, in _open
'_open', req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 361, in _call_chain
result = func(*args)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1130, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/urllib2.py", line 1105, in do_open
raise URLError(err)
urllib2.URLError: <urlopen error timed out>
If I browse to the URL with my browser, it works fine. So could someone tell me what makes that site so different that urllib2 is unable to load the page? I suppose it has something to do with the site being aimed at mobile users. With "regular" sites urllib2 works fine. Are there any other kinds of sites for which the basic urlopen(URL) doesn't work?
Thanks for the help.
The following snippet works fine.
import httplib
headers = {"User-Agent": "Mozilla/5.0"}
conn = httplib.HTTPConnection("m.finnkino.fi")
conn.request("GET", "/events/now_showing", "", headers)
response = conn.getresponse()
print response.status, response.reason
data = response.read()
print data
conn.close()
It seems their server validates several request variables. After some testing, here is the conclusion:
The HTTP protocol version must be HTTP/1.1.
If the request headers include a Connection header, its value should be keep-alive.
The request headers must include a User-Agent header, whatever its value.
In urllib2, the Connection header in HTTPHandler is set to Close by default (L1127 in urllib2.py). You can use urlgrabber or another HTTP handler that supports HTTP/1.1 and keep-alive.
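For comparison, a minimal Python 3 sketch of the same request using the requests library, which speaks HTTP/1.1 with keep-alive by default, so only the User-Agent needs to be supplied:
import requests

# requests sends HTTP/1.1 and reuses connections (keep-alive) by default
headers = {"User-Agent": "Mozilla/5.0"}
resp = requests.get("http://m.finnkino.fi/events/now_showing", headers=headers, timeout=10)
print(resp.status_code, resp.reason)
print(resp.text[:300])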