Python requests chokes on a UTF-8 filename - python

I'm trying to upload a file using the requests library to submit a POST.
This works fine:
theFile = { 'LUuploadFile': ("linea.ipa", open(path_to_file, 'rb'), 'application/octet-stream') }
request = requests.post(url, files=theFile)
This throws an error:
theFile = { 'LUuploadFile': ("línea.ipa", open(path_to_file, 'rb'), 'application/octet-stream') }
request = requests.post(url, files=theFile)
The error is very odd:
( <class 'requests.exceptions.ConnectionError'>,
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='fupload.apperian.com', port=443):
Max retries exceeded with url: /upload?transactionID=...
(Caused by <class 'socket.error'>: [Errno 32] Broken pipe)",),),
<traceback object at 0x100a8e3f8>)
It's not the server; it accepts the filename if I use curl:
curl --form "LUuploadFile=@línea.ipa" http://...

This means that something on that particular server doesn't implement Content-Disposition parsing correctly (according to RFC 5987). I can't be more specific than that, since there are many "moving parts" in a web application server (for example, you might be using nginx + FastCGI + PHP) and any one (or all :)) of them might be broken. You might find this SO thread, as well as this page, useful; it approaches the issue from the other side (downloading a file with a UTF-8 name) but boils down to the same problem (parsing the Content-Disposition header).
For what it's worth, requests is doing the "correct" thing (according to the standard), but there isn't much it can do if some component on the server doesn't follow the standard (and the culprit might not even be the server; for example, there might be a proxy you're passing through that is causing the issue).
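If the server side can't be fixed, one pragmatic workaround (my own sketch, not something the thread verifies against this particular server) is to advertise an ASCII-only filename in the multipart tuple; the file contents are unaffected by the name you send:
import unicodedata
import requests

def ascii_filename(name):
    # Drop accents/non-ASCII so the Content-Disposition filename stays plain ASCII
    return unicodedata.normalize('NFKD', name).encode('ascii', 'ignore').decode('ascii')

# url and path_to_file are the same variables as in the question
theFile = { 'LUuploadFile': (ascii_filename("línea.ipa"),
                             open(path_to_file, 'rb'),
                             'application/octet-stream') }
request = requests.post(url, files=theFile)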

Related

Python requests: NewConnectionError, urllib3, using cert and verify attributes

The program I am developing involves posting documents to a bank's DMS server. They have provided me the server certificate in .cer format, which I pass to the verify variable in my code. They also provided a client ID and password which I have to embed in the header itself. I generated a self-signed client certificate and private key and gave them the client certificate in .cer format along with the public key. In the code I give the paths of the client certificate and private key in the cert tuple.
Upon executing the code, I am getting this error:
HTTPSConnectionPool(host='apimuat.xxxbank.com', port=9095): Max retries exceeded with url: /doc-mgmt/v1/uploadDoc (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7fb01bd8a160>: Failed to establish a new connection: [Errno 60] Operation timed out'))
File "/Users/fpl_mayank/Documents/FPL/python-virtual-env/uploadDocApi/server.py", line 164, in main
result = requests.post(url,
File "/Users/fpl_mayank/Documents/FPL/python-virtual-env/uploadDocApi/server.py", line 189, in <module>
main()
I have tested it with 'https://postman-echo.com/post' without specifying cert and verify, just to check whether my request goes through or not, and it works fine there.
This is the code snippet where I am using the request functions:
url = 'https://apimuat.xxxbank.com:9095/doc-mgmt/v1/uploadDoc'
headers = {"Content-Type": "application/json",
           "client_id": "af197b22539647fba4db8b971b43e38",
           "client_secret": "c1AA406e24074d8887954472C78a924"}
data = req
result = requests.post(url,
                       data=data,
                       headers=headers,
                       cert=('/Users/fpl_mayank/Documents/FPL/python-virtual-env/uploadDocApi/keystore/dms_csr_certificate_self.cer',
                             '/Users/fpl_mayank/Documents/FPL/python-virtual-env/uploadDocApi/keystore/dms_private_key.key'),
                       verify='/Users/fpl_mayank/Documents/FPL/python-virtual-env/uploadDocApi/truststore/APIM-UAT.cer')
res = result.json()
The API doc mentioned that 2-way SSL authentication would be implemented between client and server. I have also made a virtual env for this program. Please help. I am the first one in my company to write an API client in Python, so the only way to get my issue resolved is through good ol' Stack Overflow.
So I solved this. I don't know exactly which change fixed it, but when working with APIs, make sure the endpoint's IP is whitelisted from your network as required, and the same goes for their side. Also, I was sending a formatted JSON request with indentation and spaces, so make sure to keep the JSON on one line.
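For what it's worth, the working call ends up looking roughly like this (a sketch: the payload field names are hypothetical, url and headers are the ones defined above, and the cert/verify arguments point at the same certificate, key, and truststore files as above, abbreviated here to placeholder names):
import json
import requests

payload = {"documentType": "KYC", "fileName": "statement.pdf"}  # hypothetical fields
# Compact separators keep the whole JSON body on a single line with no indentation
data = json.dumps(payload, separators=(',', ':'))

result = requests.post(url,
                       data=data,
                       headers=headers,
                       cert=(client_cert_path, client_key_path),  # client certificate + private key (2-way SSL)
                       verify=server_ca_path,                     # server certificate / CA bundle
                       timeout=30)
res = result.json()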

SSL and NewConnectionError

I want to crawl a list from the Alexa Top 1 Million to check which websites still offer access via http:// and do not redirect to https://.
If a webpage does not redirect to an https:// domain, it should be written into a CSV file.
The problem occurs when I add a larger batch of URLs. Then I get two errors:
ssl.SSLError: [SSL: SSLV3_ALERT_HANDSHAKE_FAILURE] sslv3 alert handshake failure (_ssl.c:1056
or
requests.exceptions.ConnectionError: HTTPConnectionPool(host='17ok.com', port=80): Max retries exceeded with url: / (Caused by NewConnectionError(': Failed to establish a new connection: [Errno 11001] getaddrinfo failed')
I have tried the approaches mentioned in the following threads and documentation:
https://2.python-requests.org//en/latest/user/advanced/#ssl-cert-verification
Edit: the sample URL https://requestb.in actually raises a 404 error now; it probably does not exist any more (?)
Python Requests throwing SSLError
Python Requests: NewConnectionError
requests.exceptions.SSLError: HTTPSConnectionPool: (Caused by SSLError(SSLError(336445449, '[SSL] PEM lib (_ssl.c:3816)')))
and some other suggested solutions.
The option to set verify=False helps when using it for a few URLs, but not when using a list of more than 10 URLs; the program breaks. I tried my program on a Win10 machine as well as on Ubuntu 16.04.
As expected, it's the same issue. I also tried the option of using Sessions and installed the certificates library as suggested.
If I am just calling three pages like 'http://www.example.com', 'https://www.github.com' and 'http://www.python.org', it's not a big deal and the suggested solutions work. The headache starts when using a bunch of URLs from the Alexa list.
Here is my code, which works when I use it for only 3-4 URLs:
import requests
from requests.utils import urlparse

urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/']

with open('G:\\Request_HEADER_Suite/dummy/http.csv', 'w') as f:
    for url in urls:
        r = requests.get(url, stream=True, verify=False)
        parsed_url = urlparse(r.url)
        print("URL: ", url)
        print("Redirected to: ", r.url)
        print("Status Code: ", r.status_code)
        print("Scheme: ", parsed_url.scheme)
        if parsed_url.scheme == 'http':
            f.write(url + '\n')
I expect to crawl a list of at least 100 URLs. The code should write URLs that are accessible via http:// and do not redirect to https:// into a CSV file or complementary database, and ignore all https:// URLs.
Because it works for a few URLs, I would expect a stable approach for a larger scan.
But the two errors arise and break the program. Is it worth trying a workaround using pytest? Any other suggestions? Thanks in advance.
EDIT:
This is a list which will raise the errors. Just for clarification, this list is from a study based on the Alexa Top 1 Million.
urls = ['http://www.example.com',
        'http://bloomberg.com',
        'http://github.com',
        'https://requestbin.fullcontact.com/',
        'http://51sole.com',
        'http://58.com',
        'http://9gag.com',
        'http://abs-cbn.com',
        'http://academia.edu',
        'http://accuweather.com',
        'http://addroplet.com',
        'http://addthis.com',
        'http://adf.ly',
        'http://adhoc2.net',
        'http://adobe.com',
        'http://1688.com',
        'http://17ok.com',
        'http://17track.net',
        'http://1and1.com',
        'http://1tv.ru',
        'http://2ch.net',
        'http://360.cn',
        'http://39.net',
        'http://4chan.org',
        'http://4pda.ru']
I double-checked; last time, the errors started with the URL 17ok.com. But I have also tried different lists of URLs. Thanks for your support.
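One way to keep a single unreachable or handshake-failing host from aborting the whole crawl (a sketch of my own, not an answer from the thread) is to catch the per-URL exceptions and move on:
import requests
from requests.utils import urlparse

# urls is the list from the question; the CSV path is shortened here
with open('http.csv', 'w') as f:
    for url in urls:
        try:
            r = requests.get(url, stream=True, verify=False, timeout=10)
        except requests.exceptions.RequestException as e:
            print("Skipping", url, "->", e)
            continue
        if urlparse(r.url).scheme == 'http':
            f.write(url + '\n')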

Error -3 while decompressing data: incorrect header check - urllib2

I am checking the URL status with this code:
h = httplib2.Http()
hdr = {'User-Agent': 'Mozilla/5.0'}
resp = h.request("http://" + url, headers=hdr)
if int(resp[0]['status']) < 400:
    return 'ok'
else:
    return 'bad'
and getting
Error -3 while decompressing data: incorrect header check
The URL I am checking is:
http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528
The Exception Location is:
Exception Location: C:\Python27\lib\site-packages\httplib2-0.9-py2.7.egg\httplib2\__init__.py in _decompressContent, line 403
try:
    encoding = response.get('content-encoding', None)
    if encoding in ['gzip', 'deflate']:
        if encoding == 'gzip':
            content = gzip.GzipFile(fileobj=StringIO.StringIO(new_content)).read()
        if encoding == 'deflate':
            content = zlib.decompress(content)  ##<---- error line
        response['content-length'] = str(len(content))
        # Record the historical presence of the encoding in a way the won't interfere.
        response['-content-encoding'] = response['content-encoding']
        del response['content-encoding']
except IOError:
    content = ""
The HTTP status is 200, which is OK for my case, but I am getting this error.
I actually need only the HTTP status; why is it reading the whole content?
You may have any number of reasons for choosing httplib2, but it's far easier to get the status code of an HTTP request using the Python requests module. Install it with the following command:
$ pip install requests
See an extremely simple example below.
In [1]: import requests as rq
In [2]: url = "http://www.sueddeutsche.de/wirtschaft/deutschlands-innovationsangst-wir-neobiedermeier-1.2117528"
In [3]: r = rq.get(url)
In [4]: r
Out[4]: <Response [200]>
Unless you have a considerable constraint that needs httplib2 explicitly, this solves your problem.
This may be a bug (or just uncommon design decision) in httplib2. I don't get this problem with urllib2 or httplib in the 2.x stdlib, or urllib.request or http.client in the 3.x stdlib, or the third-party libraries requests, urllib3, or pycurl.
So, is there a reason you need to use this particular library?
If so:
I actually need only http status, why is it reading the whole content?
Well, most HTTP libraries are going to read and parse the whole content, or at least the headers, before returning control. That way they can respond to simple requests about the headers or chunked encoding or MIME envelope or whatever without any delay.
Also, many of them automate things like 100 continue, 302 redirect, various kinds of auth, etc., and there's no way they could do that if they didn't read ahead. In particular, according to the description for httplib2, handling these things automatically is one of the main reasons you should use it in the first place.
Also, the first TCP read is nearly always going to include the headers anyway, so why not read them?
This means that if the headers are invalid, you may get an exception immediately. They may still provide a way to get the status code (or the raw headers, or other information) anyway.
As a side note, if you only want the HTTP status, you should probably send a HEAD request rather than a GET. Unless you're writing and testing a server, you can almost always rely on the fact that, as the RFC says, the status and headers should be identical to what you'd get with GET. In fact, that would almost certainly solve things in this case—if there is no body to decompress, the fact that httplib2 has gotten confused into thinking the body is gzipped when it isn't won't matter anyway.
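With httplib2 itself, the HTTP method is just the second argument to request(), so the HEAD variant of the original check might look like this (a sketch along the lines suggested above):
import httplib2

h = httplib2.Http()
hdr = {'User-Agent': 'Mozilla/5.0'}
# HEAD returns the same status and headers as GET, but there is no body to decompress
resp, content = h.request("http://" + url, "HEAD", headers=hdr)
if int(resp['status']) < 400:
    return 'ok'
else:
    return 'bad'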

Python requests: sending file via POST returns ConnectionError

I'm trying to use the Python requests library to send an Android .apk file to an API service. I've successfully used requests with this file type to submit to another service, but I keep getting:
ConnectionError(MaxRetryError("HTTPSConnectionPool(host='REDACTED', port=443): Max retries exceeded with url: /upload/app (Caused by : [WinError 10054] An existing connection was forcibly closed by the remote host)",),)
This is the code responsible:
url = "https://website"
files = {'file': open(app, 'rb')}
headers = {'user':'value', 'pass':'value'}
try:
response = requests.post(url, files=files, headers=headers)
jsonResponse = json.loads(response.text)
if 'error' in jsonResponse:
logger.error(jsonResponse['error'])
except Exception as e:
logger.error("Exception when trying to upload app to host")
The response line is throwing the above-mentioned exception. I've used these exact same parameters with the Chrome Postman extension to replicate the POST request and it works perfectly. I've used the exact same format of file to upload to another RESTful service as well. The only difference between this request and the one that works is that this one has custom headers attached in order to verify the POST. The API doesn't stipulate this as authentication in the sense of needing to be encoded, and the examples in both HTTP and cURL define these values as headers or -H.
Any help would be most appreciated!
So this was indeed a certificates issue. In my case I was able to stay internal to my company and connect to another URL, but the requests library, which is quite amazing, has information on certs at: http://docs.python-requests.org/en/latest/user/advanced/?highlight=certs
For all intents and purposes this is answered but perhaps it will be useful to someone in posterity.
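A common fix in this situation is to point requests at a CA bundle that can validate the internal server instead of the default bundle; a minimal sketch (the bundle path is a placeholder):
import requests

files = {'file': open(app, 'rb')}
headers = {'user': 'value', 'pass': 'value'}
# verify takes a path to a PEM bundle containing the internal/self-signed CA
response = requests.post(url, files=files, headers=headers,
                         verify='/path/to/internal-ca-bundle.pem')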

Posting only part of a file with Python's poster.encode

Using the poster.encode module, this works when I post a whole file to Solr:
f = open(filePath, 'rb')
datagen, headers = multipart_encode({'file': f})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
However, for some files I'd like to send only the first n bytes of the file. I thought this would work:
f = open(filePath, 'rb')
mp = MultipartParam('file', fileobj=f, filesize=f)
datagen, headers = multipart_encode({'file': mp})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
...but I get a timed-out request (and the odd thing is that I then have to restart Apache before requests to my web2py app work again). I get an 'HTTP 400 content missing' error from urlopen() when I leave off the filesize argument. Am I just using MultipartParam incorrectly?
(The point of all this is that I'm using Solr to extract text content and metadata from files. For video and audio files, I'd like to get away with sending just the first 100-300k or so, as presumably the relevant data's all in the file headers.)
The reason you're having trouble is that MIME encoding introduces sentinels in the POST; if you don't specify the file size, you have to use chunked transfer encoding so that the web server knows when to stop reading the file. But that's the other problem: if you stop sending a MIME-encoded POST to a server mid-stream, it'll just sit there waiting for the block to finish. Chunked transfer encoding and mixed-multipart MIME encoding are both dead serious when it comes to message segment sizes.
If you only want to send 100-300k of data, then only read that much; then every POST you make to the server will terminate at the byte you want and the byte the web server is expecting.
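Concretely, that approach could look like this (a sketch assuming poster's MultipartParam accepts the value/filename/filesize keyword arguments; filePath and SOLR_BASE_URL come from the question, and the filename is a placeholder):
from poster.encode import multipart_encode, MultipartParam
import urllib2

N = 300 * 1024  # send only the first ~300 KB

f = open(filePath, 'rb')
head = f.read(N)
f.close()

# filesize matches the bytes actually sent, so the server sees a complete, correctly sized part
mp = MultipartParam('file', value=head, filename='video.mp4', filesize=len(head))
datagen, headers = multipart_encode([mp])
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers)
extracted = urllib2.urlopen(request).read()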
