HPE_UNEXPECTED_CONTENT_LENGTH error when setting Content-Length to BytesIO length - Python

I'm moving a Python Pyramid project from Python 2 to 3. I was using ReportLab to generate PDF files and send them to the front end. According to their examples I need to use io.BytesIO(), where previously it was StringIO().
Now, when I use the generated document's length to set the Content-Length in my response, I get an HPE_UNEXPECTED_CONTENT_LENGTH error.
import io

from pyramid.response import Response
from reportlab.platypus import SimpleDocTemplate

pdf = io.BytesIO()
doc = SimpleDocTemplate(pdf)
doc.build(story)  # 'story' is the list of ReportLab flowables built elsewhere
pdfcontent = pdf.getvalue()
pdf.close()
response = Response(content_type='application/pdf', body=pdfcontent)
response.headers.add("Content-Length", str(len(pdfcontent)))
If I don't set the Content-Length header, the download works fine, but I would rather not leave it blank.

I'm not sure about your particular example and error, but I'm pretty sure that when you provide the response body bytes like this, Pyramid sends the Content-Length header itself. There is no need to set it manually: it already has the bytes and therefore knows their size. Since headers.add() appends rather than replaces, you likely end up with two Content-Length headers, which is probably what the parser is complaining about.
You should check the response headers (using your browser's developer tools or a command-line tool like curl or HTTPie).
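As a quick sanity check, here is a minimal sketch (assuming Pyramid's standard Response, which is built on WebOb) showing that Content-Length is derived from the body automatically:

from pyramid.response import Response

pdfcontent = b"%PDF-..."  # hypothetical stand-in for the generated PDF bytes
response = Response(content_type='application/pdf', body=pdfcontent)

# WebOb-based responses compute Content-Length from the body they were
# given, so this should already match without touching the header:
print(response.content_length == len(pdfcontent))  # True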

Related

How to separate files of a multipart/form-data response

I connect via Python to a web interface, and I get the response back as files in multipart/form-data format.
I only know this format from browsers submitting forms, but here the server sends its response in this format.
How can I extract the original files from this response without the interface's metadata?
I have a few examples here with one, two, three and four parts.
I have no idea how to solve this and would appreciate your help.
I uploaded the files to zippyshare, because they are too big to paste here.
http://www22.zippyshare.com/v/EEXIbj79/file.html
http://www22.zippyshare.com/v/eXF62wpq/file.html
http://www22.zippyshare.com/v/sSi9crCT/file.html
http://www22.zippyshare.com/v/RiXF57WD/file.html
Thank you in advance
For http/https this works for me:
import requests

r = requests.get(URL, headers={'Connection': 'close'})
# Pull the boundary string out of the Content-Type header,
# e.g. 'multipart/form-data; boundary=abc123'
boundary = r.headers["Content-Type"].split(";", 1)[1].strip().replace("boundary=", "", 1)
# Split the raw body on the boundary to get the individual parts
comps = r.content.split(boundary.encode())
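Splitting on the raw boundary leaves the per-part headers and the leading and trailing "--" markers in place, so a more robust alternative is to let the standard library's email parser handle the multipart structure. A rough sketch, assuming r is the requests response from above:

import email
from email import policy

# Reconstruct a MIME document the email parser understands:
# the original Content-Type header followed by the raw body.
raw = b"Content-Type: " + r.headers["Content-Type"].encode() + b"\r\n\r\n" + r.content
msg = email.message_from_bytes(raw, policy=policy.default)

for part in msg.iter_parts():
    filename = part.get_filename()            # from Content-Disposition, if any
    data = part.get_payload(decode=True)      # the decoded file bytes
    if filename:
        with open(filename, "wb") as f:
            f.write(data)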

Image download MIME type validation with Python requests

I use the requests library in Python to download a large number of image files via HTTP. I wrap the received content in a BytesIO object and then use Pillow to save this raw content as a JPEG file.
import os

import requests
from PIL import Image
from io import BytesIO

rsp = requests.get(imageurl)
content_type_received = rsp.headers['Content-Type']  # MIME type
binarycontent = BytesIO(rsp.content)
if content_type_received.startswith('image'):  # image/jpeg, image/png etc.
    i = Image.open(binarycontent)
    outfilename = os.path.join(outfolder, 'myimg' + '.jpg')
    with open(outfilename, 'wb') as f:
        f.write(rsp.content)
rsp.close()
What is the potential security risk of this code? (I am not sure how much we can trust the server saying mime type in the response header is really what the server says it is?) Is there a better way to write a secure download routine?
The potential security risk of your code depends on how much you trust the server you're contacting.
If you're sure that the server will never try to fool you with some malicious content, then you're relatively safe to use that piece of code.
Otherwise, check the content type yourself.
The biggest potential risk might be to unknowingly save an executable rather than an image.
A smaller one might be to store a different kind of content that may crash PIL or another component in your application.
Keep in mind that the server is free to choose whatever value it wants for any response headers, including the content-type.
If you have any reason to believe the server you're contacting might not be honest about it, you shouldn't trust request headers.
If you want a more reliable way to determine the content type of the content you received, I suggest you take a look at python-magic, a wrapper for libmagic.
This library will help you determine yourself the content type, so you don't have to "trust" the server you're downloading from.
import magic

# ...
content = BytesIO(rsp.content)
# Inspect the first bytes of the stream to determine the real MIME type
mime = magic.from_buffer(content.read(1024), mime=True)
if mime.startswith('image'):
    content.seek(0)  # Reset the stream position because you read from it
    # ...
python-magic is very well documented, so I recommend you have a look at their README if you consider using it.

Python | HTTP - can't get the correct MIME type

I am building a web crawler using urllib3. Example code:
from urllib3 import PoolManager
pool = PoolManager()
response = pool.request("GET", url)
mime_type = response.getheader("content-type")
I have stumbled upon a few links to document files such as docx and epub, and the MIME type I'm getting from the server is text/plain. It is important to me to get the correct MIME type.
Example to a problematic url:
http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx
Right now the logic of getting file's mime type is getting it from the server and if not available trying to get the file's extension.
How come Firefox is not getting confused by these kind of urls and let the user download the file right away? How does it know that this file is not plain text? How can i get the correct mime type?
I haven't read the Firefox source code, but I would guess that Firefox either tries to guess the file type based on the URL, refuses to render content inline if it has a specific Content-Type and is larger than some maximum size, or perhaps even inspects some of the file contents to figure out what it is based on a magic number at the start.
You can use the Python mimetypes module in the standard library to guess what the filetype is based on the URL:
import mimetypes
url = "http://lsa.mcgill.ca/pubdocs/files/advancedcommonlawobligations/523-gold_advancedcommonlawobligations_-2013.docx"
mime_type, encoding = mimetypes.guess_type(url)
In this case, mime_type is "application/vnd.openxmlformats-officedocument.wordprocessingml.document", which is probably what you want.
Unfortunately, text/plain is the right MIME type for your response; as the relevant documentation puts it:
For text documents without specific subtype, text/plain should be used.
I tested your URL in Chrome and saw the behaviour you described for Firefox as well: Chrome downloaded the file instead of opening it, even with the Content-Type header being text/plain.
This means that those browsers use more than just this header to determine whether they should download or open the file, which may include their own ability to parse it.
That said, you can't rely on the Content-Type header if you want to determine the real MIME type of whatever comes in the request's response. Maybe an alternative is to temporarily store the response's file and determine its MIME type afterwards.
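You don't necessarily need to write the file to disk first. As a sketch, assuming the python-magic package mentioned in an earlier answer and the urllib3 pool from the question, you can sniff the MIME type directly from the downloaded bytes:

import magic
from urllib3 import PoolManager

pool = PoolManager()
response = pool.request("GET", url)  # url as in the question

# Determine the MIME type from the content itself instead of
# trusting the server's Content-Type header.
real_mime = magic.from_buffer(response.data[:2048], mime=True)
print(real_mime)  # e.g. 'application/vnd.openxmlformats-...' for a real docx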

Given a URL, how to encode a file's contents as base64 with Python / Django?

I am building a Django-based website, and am having trouble figuring out a decent way to email some larger PDFs and such to my users.
The files in question never touch our servers; they're handled on a CDN. So, my starting point is with the unique URLs for the files, not with the files themselves. It would be nice to find a solution that doesn't involve saving the files locally.
In order for me to be able to send the email in the way I want (with the PDF/DOCX/whatever attached to it), I need to be able to encode the attachment as a base-64 string.
I would prefer not to save the file to our server; I would also prefer not to read a response object in chunks and write it plainly to a file on our server, then encode that file.
That said, given a direct url to a file is there a way to stream the response and encode it in base64 as it comes in?
I have been reading about Django's StreamingHttpResponse and FileWrapper and feel like I am close, but I'm not able to put it together just yet.
Edit: the snippet below is working for now, but I'm worried about memory usage - how well would something like this scale?
import base64
import requests

req = requests.get('url')
encoded = base64.b64encode(req.content)
Thanks to beetea, I am comfortable implementing the simple snippet above as the solution to this issue.
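To address the memory concern in the edit: base64 encodes every 3 input bytes as 4 output characters, so you can encode the download incrementally as long as you only ever encode whole 3-byte groups. A rough sketch of that idea, where the url argument is a placeholder:

import base64
import requests

def stream_base64(url, chunk_size=64 * 1024):
    """Yield base64-encoded pieces of the response without buffering it all."""
    buf = b""
    with requests.get(url, stream=True) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=chunk_size):
            buf += chunk
            usable = len(buf) - (len(buf) % 3)  # only encode whole 3-byte groups
            if usable:
                yield base64.b64encode(buf[:usable])
                buf = buf[usable:]
    if buf:
        yield base64.b64encode(buf)  # the final group may carry '=' padding

# Usage: consume the pieces as they arrive, e.g. append them to the
# email attachment instead of holding the whole encoding in memory.
for piece in stream_base64('url'):
    ...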

Using Python to download a document that's not explicitly referenced in a URL

I wrote a web crawler in Python 2.6 using the Bing API that searches for certain documents and then downloads them for classification later. I've been using string methods and urllib.urlretrieve() to download results whose URL ends in .pdf, .ps etc., but I run into trouble when the document is 'hidden' behind a URL like:
http://www.oecd.org/officialdocuments/displaydocument/?cote=STD/CSTAT/WPNA(2008)25&docLanguage=En
So, two questions. Is there a way in general to tell if a URL has a pdf/doc etc. file that it's linking to if it's not doing so explicitly (e.g. www.domain.com/file.pdf)? Is there a way to get Python to snag that file?
Edit:
Thanks for the replies, several of which suggest downloading the file to see if it's of the correct type. Only problem is... I don't know how to do that (see question #2, above). urlretrieve(<above url>) gives only an html file with an href containing that same url.
There's no way to tell from the URL what it's going to give you. Even if it ends in .pdf it could still give you HTML or anything it likes.
You could do a HEAD request and look at the content-type, which, if the server isn't lying to you, will tell you if it's a PDF.
Alternatively you can download it and then work out whether what you got is a PDF.
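A sketch of the HEAD approach, using the requests library for brevity rather than the Python 2.6 urllib from the question:

import requests

resp = requests.head(url, allow_redirects=True)
content_type = resp.headers.get("Content-Type", "")
if content_type.startswith("application/pdf"):
    # Only now pay for the full download
    pdf_bytes = requests.get(url).content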
In this case, what you refer to as "a document that's not explicitly referenced in a URL" seems to be what is known as a "redirect". Basically, the server tells you that you have to get the document at another URL. Normally, python's urllib will automatically follow these redirects, so that you end up with the right file. (and - as others have already mentioned - you can check the response's mime-type header to see if it's a pdf).
However, the server in question is doing something strange here. You request the url, and it redirects you to another url. You request the other url, and it redirects you again... to the same url! And again... And again... At some point, urllib decides that this is enough already, and will stop following the redirect, to avoid getting caught in an endless loop.
So how come you are able to get the pdf when you use your browser? Because apparently, the server will only serve the pdf if you have cookies enabled. (why? you have to ask the people responsible for the server...) If you don't have the cookie, it will just keep redirecting you forever.
(check the urllib2 and cookielib modules to get support for cookies)
At least, that is what I think is causing the problem. I haven't actually tried doing it with cookies yet. It could also be that the server does not "want" to serve the pdf because it detects you are not using a "normal" browser (in which case you would probably need to fiddle with the User-Agent header), but that would be a strange way of doing it. So my guess is that it is using a "session cookie" somewhere, and if you haven't got one yet, it keeps trying to redirect.
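As a sketch of the cookie idea, here is the modern equivalent with the requests library, whose Session object keeps cookies between requests (the question's Python 2.6 urllib2/cookielib combination works the same way in principle):

import requests

session = requests.Session()  # keeps cookies across the redirect chain
resp = session.get(url, allow_redirects=True)

if resp.headers.get("Content-Type", "").startswith("application/pdf"):
    with open("document.pdf", "wb") as f:
        f.write(resp.content)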
As has been said, there is no way to tell the content type from the URL. But if you don't mind fetching the headers for every URL, you can do this:
import urllib

obj = urllib.urlopen(URL)
headers = obj.info()
if headers['Content-Type'].find('pdf') != -1:
    # we have a pdf file, download the whole thing
    ...
This way you won't have to download each URL, just its headers. It's still not exactly saving network traffic, but you won't get better than that.
Also, you should use proper MIME type comparison instead of my crude find('pdf').
No. It is impossible to tell what kind of resource is referenced by a URL just by looking at it. It is totally up to the server to decide what it gives you when you request a certain URL.
Check the mimetype with the info() method on the object returned by urllib.urlopen(). This might not be 100% accurate; it really depends on what the site returns as a Content-Type header. If it's well behaved it'll return the proper mime type.
A PDF should return application/pdf, but that may not be the case.
Otherwise you might just have to download it and try it.
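One cheap way to "download it and try it" is to check the file's magic number: PDF files start with the bytes %PDF-. A sketch with the requests library:

import requests

resp = requests.get(url)
if resp.content[:5] == b"%PDF-":
    # The body really is a PDF, whatever the headers claimed
    with open("downloaded.pdf", "wb") as f:
        f.write(resp.content)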
You can't see it from the url directly. You could try to download only the headers of the HTTP response and look for the Content-Type header. However, you have to trust the server on this: it could respond with a wrong Content-Type header not matching the data provided in the body.
To detect the file type in Python 3.x in a web app, given a URL to a file that might have no extension or a fake one, you should install python-magic, using
pip3 install python-magic
For Mac OS X, you should also install libmagic using
brew install libmagic
Code snippet
import magic
from urllib.request import Request, urlopen

url = "http://...url to the file ..."
request = Request(url)
response = urlopen(request)
# Read just the start of the file; pass mime=True to get a MIME type
# such as 'application/pdf' rather than a textual description.
mime_type = magic.from_buffer(response.read(2048), mime=True)
print(mime_type)
