ReSTful Flask file upload with request.stream - python

I am attempting to create a simple Flask endpoint for uploading files via POST or PUT. I want the filename in the URL, and then to (after the request headers) just stream the raw file data in the request.
I also need to be able to upload files slightly larger than 2GB, and I need to do this without storing the entire file in memory. At first, this seemed simple enough:
@application.route("/upload/<filename>", methods=['POST', 'PUT'])
def upload(filename):
    # Authorization and sanity checks skipped.
    filename = secure_filename(filename)
    fileFullPath = os.path.join(application.config['UPLOAD_FOLDER'], filename)
    with open(fileFullPath, 'wb') as f:
        copyfileobj(request.stream, f)
    return jsonify({'filename': filename})
With a multipart/formdata upload, I can simply call .save() on the file.
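(For comparison, that multipart variant looks roughly like this; the route name and the 'file' form field name are just placeholders I'm assuming here.)
@application.route("/upload-form/<filename>", methods=['POST'])
def upload_form(filename):
    # Hypothetical multipart/form-data version; 'file' is an assumed field name.
    f = request.files['file']
    filename = secure_filename(filename)
    f.save(os.path.join(application.config['UPLOAD_FOLDER'], filename))
    return jsonify({'filename': filename})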
However, any file I upload via the request.stream route seems to have a different checksum (well, sha256sum, on the server than on the source). When uploading a standard text file, newlines seem to be getting stripped. Binary files seem to be getting mangled in other strange ways.
I am sending Content-Type: application/octet-stream when uploading to try to make Flask treat all uploads as binary. Is request.stream (a proxy to wsgi.input) opened as non-binary? I can't seem to figure that out from the Flask code. How can I stream the request data, in raw binary format, to a file on disk?
I'm open to hacks; this is for a test project (so I'm also not interested in hearing how sending this as formdata would be better, or how this isn't a good way to upload files, etc.)
I am testing this via:
curl -H 'Content-Type: application/octet-stream' -H 'Authorization: ...' -X PUT --data @/path/to/test/file.name https://test.example.com/upload/file.name

Related

Download doesn't start in web browser by flask.Response

In Flask (micro web framework), we have a view as:
@app.route('/download/<id>/<resolution>/<extension>/')
def download_by_id(id, resolution=None, extension=None):
    stream = youtube.stream_url(id, resolution, extension)
    binary = requests.get(stream['url'], stream=True)
    return flask.Response(
        binary,
        headers={'Content-Disposition': 'attachment; '
                 'filename=' + stream['filename']})
In the template we have a link such as "Download 240p Video", and when it's clicked, it should start downloading that video.
The issue is:
It works fine in some browsers where no download manager like IDM is installed, but IDM fails to download it. IDM just hangs at http://example.com/download/adkdsk457jds/240p/mp4/
The same happens with Firefox's own download manager. Firefox just downloads a plain .html page and not the actual video.
However, the video downloads successfully in Chrome when no IDM or other download manager is installed.
Please help and advise why it's not working. Do I need to change something in the code?
You haven't included any response information, including the content type; you need to copy over a little more information about the original response to communicate what type of response you are returning. Otherwise defaults are used (dictated either by the HTTP standard or by Flask).
Specifically, at the very least you want to copy across the content type, length, and the transfer encoding:
headers = {
    'Content-Disposition': 'attachment; filename=' + stream['filename']
}
for header in ('content-type', 'content-length', 'transfer-encoding'):
    if header in binary.headers:
        headers[header] = binary.headers[header]
return flask.Response(binary.raw, headers=headers)
Note that I'm passing binary.raw, the underlying raw file object; passing binary itself should work too, but using .raw has the added advantage that any compression applied by YouTube is retained.
Some download managers may try to use an HTTP range request to grab a download in parallel, even when the server is not advertising that it supports such requests. You should probably respond with a 406 Not Acceptable response (requesting byte ranges when not supported is an Accept-* violation). You'll need to log what headers the download manager sends to be sure if this is the case.
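A minimal sketch of that idea, assuming you just want to refuse range requests in the view above rather than support them:
from flask import request

# At the top of download_by_id(), before fetching the upstream video:
if 'Range' in request.headers:
    # We don't serve partial content; tell the download manager so.
    return flask.Response('Range requests not supported', status=406)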
Add 'Content-Type': 'application/octet-stream' to headers
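That is, something along these lines, building on the headers dict from the first answer:
headers = {
    'Content-Disposition': 'attachment; filename=' + stream['filename'],
    'Content-Type': 'application/octet-stream',
}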

Python file upload from url using requests library

I want to upload a file to a URL. The file I want to upload is not on my computer, but I have the URL of the file. I want to upload it using the requests library. So, I want to do something like this:
url = 'http://httpbin.org/post'
files = {'file': open('report.xls', 'rb')}
r = requests.post(url, files=files)
But the only difference is that the file report.xls comes from some URL and is not on my computer.
The only way to do this is to download the body of the URL so you can upload it.
The problem is that a form that takes a file is expecting the body of the file in the HTTP POST. Someone could write a form that takes a URL instead, and does the fetching on its own… but that would be a different form and request than the one that takes a file (or, maybe, the same form, with an optional file and an optional URL).
You don't have to download it and save it to a file, of course. You can just download it into memory:
urlsrc = 'http://example.com/source'
rsrc = requests.get(urlsrc)
urldst = 'http://example.com/dest'
rdst = requests.post(urldst, files={'file': rsrc.content})
Of course, in some cases you might also want to forward along the filename, or some other headers, like the Content-Type. Or, for huge files, you might want to stream from one server to the other without downloading and then uploading the whole file at once. You'll have to do any such things manually, but almost everything is easy with requests, and explained well in the docs.*
* Well, that last example isn't quite easy… you have to get the raw socket-wrappers off the requests and read and write, and make sure you don't deadlock, and so on…
There is an example in the documentation that may suit you. A file-like object can be used as a stream input for a POST request. Combine this with a stream response for your GET (passing stream=True), or one of the other options documented here.
This allows you to do a POST from another GET without buffering the entire payload locally. In the worst case, you may have to write a file-like class as "glue code", allowing you to pass your glue object to the POST that in turn reads from the GET response.
(This is similar to a documented technique using the Node.js request module.)
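A rough sketch of the technique described above, assuming the receiving endpoint accepts a plain (non-multipart) body; both URLs here are placeholders:
import requests

src = requests.get('http://example.com/source', stream=True)
src.raise_for_status()

# src.raw is a file-like object; requests reads from it in chunks and
# sends a chunked request body instead of buffering the whole payload.
dst = requests.post(
    'http://example.com/dest',
    data=src.raw,
    headers={'Content-Type': src.headers.get('Content-Type',
                                             'application/octet-stream')},
)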
import requests

img_url = "http://...."
res_src = requests.get(img_url)

payload = {}
files = [
    ('files', ('image_name.jpg', res_src.content, 'image/jpeg'))
]
headers = {"token": "******-*****-****-***-******"}

response = requests.request("POST", url, headers=headers, data=payload, files=files)
print(response.text)
The above code works for me.

Download file, parse it and serve in Flask

I'm taking my first steps with Flask. I can successfully receive a file from a client and give it back with the code from here:
http://flask.pocoo.org/docs/patterns/fileuploads/
But how do I change it (e.g. line after line) and then serve it to the client?
I can get the string with read() after:
if file and allowed_file(file.filename):
and then process it. So the question really is: how do I serve an output string as a file?
I don't want to save it on the server's HDD at all (neither the original version nor the changed one).
You can use make_response to create the response for your string and add Content-Disposition: attachment; filename=anyNameHere.txt to it before returning it:
#app.route("/transform-file", methods=["POST"])
def transform():
# Check for valid file and assign it to `inbound_file`
data = inbound_file.read()
data = data.replace("A", "Z")
response = make_response(data)
response.headers["Content-Disposition"] = "attachment; filename=outbound.txt"
return response
See also: The docs on streaming content
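If the uploaded file may be large, a streamed variant along these lines might be preferable (a sketch only; it assumes `inbound_file` was obtained as above and that the byte-for-byte replacement is safe to apply chunk by chunk):
from flask import Response, stream_with_context

@app.route("/transform-file-streamed", methods=["POST"])
def transform_streamed():
    # Check for valid file and assign it to `inbound_file`, as above.
    def generate():
        while True:
            chunk = inbound_file.read(8192)
            if not chunk:
                break
            yield chunk.replace(b"A", b"Z")
    headers = {"Content-Disposition": "attachment; filename=outbound.txt"}
    return Response(stream_with_context(generate()), headers=headers)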

Posting only part of a file with Python's poster.encode

Using the poster.encode module, this works when I post a whole file to Solr:
f = open(filePath, 'rb')
datagen, headers = multipart_encode({'file': f})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
However, for some files I'd like to send only the first n bytes of the files. I thought this would work:
f = open(filePath, 'rb')
mp = MultipartParam('file', fileobj=f, filesize=f)
datagen, headers = multipart_encode({'file': mp})
# use wt=json because it's more convenient to navigate
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers) # assumes solrPath ends in '/'
extracted = urllib2.urlopen(request).read()
...but I get a timed-out request (and the odd thing is that I then have to restart Apache before requests to my web2py app work again). I get an 'HTTP 400 content missing' error from urlopen() when I leave off the filesize argument. Am I just using MultipartParam incorrectly?
(The point of all this is that I'm using Solr to extract text content and metadata from files. For video and audio files, I'd like to get away with sending just the first 100-300k or so, as presumably the relevant data's all in the file headers.)
The reason you're having trouble is that MIME encoding introduces sentinels in the POST; if you don't specify the file size, you have to do chunked transfer encoding so that the web server knows when to stop reading the file. But that's the other problem: if you stop sending a MIME-encoded POST to a server mid-stream, it'll just sit there waiting for the block to finish. Chunked transfer encoding and mixed-multipart MIME encoding are both dead serious when it comes to message segment sizes.
If you only want to send 100-300k of data, then only read that much; that way every POST you make will terminate exactly where you want it to and where the web server is expecting it to.
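In other words, read the prefix yourself and hand poster a buffer of exactly that size. A sketch, untested against Solr; the 300k cutoff is arbitrary:
from StringIO import StringIO

n = 300 * 1024  # arbitrary cutoff
f = open(filePath, 'rb')
head = f.read(n)  # read only the first n bytes
f.close()

mp = MultipartParam('file', fileobj=StringIO(head), filesize=len(head))
datagen, headers = multipart_encode({'file': mp})
request = urllib2.Request(SOLR_BASE_URL + 'update/extract?extractOnly=true&extractFormat=text&indent=true&wt=json', datagen, headers)
extracted = urllib2.urlopen(request).read()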

S3 Backup Memory Usage in Python

I currently use WebFaction for my hosting with the basic package that gives us 80MB of RAM. This is more than adequate for our needs at the moment, apart from our backups. We do our own backups to S3 once a day.
The backup process is this: dump the database, tar.gz all the files into one backup named with the correct date of the backup, upload to S3 using the python library provided by Amazon.
Unfortunately, it appears (although I don't know this for certain) that either my code for reading the file or the S3 code is loading the entire file in to memory. As the file is approximately 320MB (for today's backup) it is using about 320MB just for the backup. This causes WebFaction to quit all our processes meaning the backup doesn't happen and our site goes down.
So this is the question: Is there any way to not load the whole file in to memory, or are there any other python S3 libraries that are much better with RAM usage. Ideally it needs to be about 60MB at the most! If this can't be done, how can I split the file and upload separate parts?
Thanks for your help.
This is the section of code (in my backup script) that caused the processes to be quit:
filedata = open(filename, 'rb').read()
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(filedata), {'x-amz-acl': 'public-read', 'Content-Type': content_type})
It's a little late but I had to solve the same problem so here's my answer.
Short answer: in Python 2.6+, yes! This is because httplib supports file-like objects as of v2.6. So all you need is...
fileobj = open(filename, 'rb')
content_type = mimetypes.guess_type(filename)[0]
if not content_type:
    content_type = 'text/plain'
print 'Uploading to S3...'
response = connection.put(BUCKET_NAME, 'daily/%s' % filename, S3.S3Object(fileobj), {'x-amz-acl': 'public-read', 'Content-Type': content_type})
Long answer...
The S3.py library uses python's httplib to do its connection.put() HTTP requests. You can see in the source that it just passes the data argument to the httplib connection.
From S3.py...
def _make_request(self, method, bucket='', key='', query_args={}, headers={}, data='', metadata={}):
    ...
    if (is_secure):
        connection = httplib.HTTPSConnection(host)
    else:
        connection = httplib.HTTPConnection(host)
    final_headers = merge_meta(headers, metadata);
    # add auth header
    self._add_aws_auth_header(final_headers, method, bucket, key, query_args)
    connection.request(method, path, data, final_headers)  # <-- IMPORTANT PART
    resp = connection.getresponse()
    if resp.status < 300 or resp.status >= 400:
        return resp
    # handle redirect
    location = resp.getheader('location')
    if not location:
        return resp
    ...
If we take a look at the python httplib documentation we can see that...
HTTPConnection.request(method, url[, body[, headers]])
This will send a request to the server using the HTTP request method method and the selector url. If the body argument is present, it should be a string of data to send after the headers are finished. Alternatively, it may be an open file object, in which case the contents of the file is sent; this file object should support fileno() and read() methods. The header Content-Length is automatically set to the correct value. The headers argument should be a mapping of extra HTTP headers to send with the request.
Changed in version 2.6: body can be a file object.
Don't read the whole file into your filedata variable. You could use a loop, read ~60 MB at a time, and submit each chunk to Amazon.
backup = open(filename, 'rb')
while True:
    part_of_file = backup.read(60000000)  # not exactly 60 MB....
    if not part_of_file:
        break
    response = connection.put()  # submit part_of_file here to amazon
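A slightly fuller sketch of that idea, assuming each chunk is stored under its own key (the .partN naming scheme is made up):
backup = open(filename, 'rb')
part = 0
while True:
    part_of_file = backup.read(60 * 1024 * 1024)  # ~60 MB per chunk
    if not part_of_file:
        break
    part += 1
    # One S3 object per chunk, e.g. daily/backup.tar.gz.part1, .part2, ...
    connection.put(BUCKET_NAME, 'daily/%s.part%d' % (filename, part),
                   S3.S3Object(part_of_file),
                   {'x-amz-acl': 'public-read', 'Content-Type': 'application/octet-stream'})
backup.close()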
