Django api stream csv file and upload to s3 in a thread

I have a Django APIView that generates a CSV file stream and returns it using StreamingHttpResponse. I also want to upload the generated file to S3 in a separate thread, because I do not want the response to stall until the upload is complete.
My idea was to save the stream in a TemporaryFile object and use that to trigger the S3 upload in a separate thread once the stream is complete. With this approach I cannot use a context manager, because the file is closed automatically when the context manager exits. I had to resort to closing the file manually once the upload completed. I would like to avoid this manual closing. How do I solve this?
Here is the view:
class CsvExportView(APIView):
    def get(self, request):
        response = StreamingHttpResponse(
            streaming_content=get_file_stream(),
            content_type='text/csv'
        )
        response['Content-Disposition'] = 'attachment; filename="{}"'.format('some_name')
        return response
Here is the get_file_stream() method:
import threading
from tempfile import TemporaryFile

class TempFileManager(object):
    def __init__(self):
        self._file = TemporaryFile()

    def get_file(self):
        # rewind so the upload reads from the start of the file
        self._file.seek(0)
        return self._file

    def write_to_file(self, data):
        self._file.write(data)

    def close_file(self):
        self._file.close()

def get_file_stream():
    file_mng = TempFileManager()
    # get_csv_stream method is a generator that yields csv file content
    # by calling external api and complex data processing
    for data in get_csv_stream():
        # keep writing to temp file
        file_mng.write_to_file(data)
        # yield chunks to browser
        yield data
    # once all data is received
    # spawn a thread and do the upload
    # complete request lifecycle
    t = threading.Thread(target=upload_file_to_s3, args=(file_mng, ))
    t.start()

def upload_file_to_s3(manager):
    # some code to upload file
    upload(manager.get_file())
    # remember to close file
    manager.close_file()
The above approach works fine but seems very cluttered and error-prone. This would be much easier if a context manager were used, like this:
def get_file_stream():
    with TemporaryFile() as tmp:
        for data in get_csv_stream():
            # keep writing to temp file
            tmp.write(data)
            yield data
        upload_file_to_s3(tmp)
But I do not want the user to wait until the file is uploaded to S3. And once threading is involved, the context manager might close the file before the upload even starts.
Does anyone know a better way to handle this situation?
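One way to keep the automatic cleanup without blocking the response is to hand ownership of the temporary file to the upload thread and let that thread close it with its own with block. The following is a minimal sketch under stated assumptions, not a definitive implementation: stream_and_upload, upload_and_close and the upload callable are hypothetical names, with upload standing in for whatever actually sends a file object to S3.

```python
import threading
from tempfile import TemporaryFile


def upload_and_close(tmp, upload):
    """Runs in the worker thread; the with block closes tmp even if upload fails."""
    with tmp:
        tmp.seek(0)  # rewind so the upload reads from the beginning
        upload(tmp)  # hypothetical callable that sends the file object to S3


def stream_and_upload(chunks, upload):
    """Yield each chunk to the client while spooling a copy to a temp file."""
    tmp = TemporaryFile()
    try:
        for data in chunks:
            tmp.write(data)
            yield data
    except BaseException:
        # Also covers the client disconnecting (GeneratorExit): nothing to upload.
        tmp.close()
        raise
    # The stream is complete: the worker thread now owns and closes the file.
    threading.Thread(target=upload_and_close, args=(tmp, upload)).start()
```

In the view, this generator would take the place of get_file_stream(), for example streaming_content=stream_and_upload(get_csv_stream(), upload). The thread is not a daemon, so a normal interpreter shutdown waits for the upload to finish.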

Related

Flask file threading 'ValueError: I/O operation on closed file'

I am building an API with Flask. When a user uploads a file, the API should upload it asynchronously to an AWS S3 bucket. This is done in a separate thread, while the user receives a confirmation that the request was received. (In this case the files are considered low value, so it does not really matter if the thread fails.) I use the Flask request to store the file in a variable, which I pass to the threaded function that handles the file. However, when I try to access the file in this handler, I get a 'ValueError: I/O operation on closed file'. What am I doing wrong, and how can I solve it? For simplicity, both functions are stripped down to the core, without error handling.
Flask app route
@app.route('/file/upload', methods=['POST'])
def post_upload_file():
    return operator(receive_file)
Called function
def receive_file():
    if 'file' in request.files and request.files['file'] != '':
        x_file = request.files['file']
        if allowed_file(x_file.filename):
            fileid = shortuuid.uuid()
            extension = x_file.filename.rsplit('.', 1)[1].lower()
            newfilename = fileid + '.' + extension
            Thread(target = handle_file, args=(fileid, x_file, newfilename)).start()
            return {'fileid': fileid}
Threaded handler
def handle_file(fileid, x_file, name=None, bucket='XXX'):
    name = "file/"+name
    s3_client = boto3.client('s3')
    s3_client.upload_fileobj(x_file, bucket, name)

The problem was that the file no longer existed by the time the threaded handle_file function started, because the request in the main thread had already finished and its file had been closed.
Besides, passing large files around in the memory of the API server was probably not the best idea. I solved the problem by storing the incoming file on the server and passing its location to handle_file, which reads the file from storage and uploads it to the bucket.
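The fix described above can be sketched with the standard library alone. This is an illustration under assumptions rather than the poster's actual code: save_upload_to_disk, start_background_upload and the upload callable are hypothetical names, and upload stands in for something like a boto3 client's upload_fileobj bound to a bucket and key.

```python
import os
import shutil
import tempfile
import threading


def save_upload_to_disk(stream):
    """Copy the request's file stream to a temp file that outlives the request."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        shutil.copyfileobj(stream, tmp)
        return tmp.name


def handle_file(path, upload):
    """Runs in the worker thread: reopen the saved file, upload it, delete it."""
    try:
        with open(path, "rb") as f:
            upload(f)  # hypothetical callable that sends the file object to S3
    finally:
        os.remove(path)


def start_background_upload(stream, upload):
    """Save the stream first, then hand only the path to the thread."""
    path = save_upload_to_disk(stream)
    thread = threading.Thread(target=handle_file, args=(path, upload))
    thread.start()
    return thread
```

Because the thread receives a path instead of the request's file object, it no longer matters that the original stream is closed once the response has been sent.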

Delete file when file download is complete on Python x Django [duplicate]

I'm using the following django/python code to stream a file to the browser:
wrapper = FileWrapper(file(path))
response = HttpResponse(wrapper, content_type='text/plain')
response['Content-Length'] = os.path.getsize(path)
return response
Is there a way to delete the file after the response is returned, for example with a callback function?
I could just set up a cron job to delete all tmp files, but it would be neater if I could stream files and delete them within the same request.

You can use a NamedTemporaryFile:
from django.core.files.temp import NamedTemporaryFile
def send_file(request):
    newfile = NamedTemporaryFile(suffix='.txt')
    # save your data to newfile.name
    wrapper = FileWrapper(newfile)
    response = HttpResponse(wrapper, content_type=mime_type)
    response['Content-Disposition'] = 'attachment; filename=%s' % os.path.basename(newfile.name)
    response['Content-Length'] = os.path.getsize(newfile.name)
    return response
The temporary file is deleted once the newfile object is closed or garbage-collected.
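The delete-on-close behaviour can be checked outside Django with the standard library's tempfile.NamedTemporaryFile. This small sketch only demonstrates that behaviour; the variable names are illustrative.

```python
import os
from tempfile import NamedTemporaryFile

# With the default delete=True, the file is removed as soon as it is closed.
tmp = NamedTemporaryFile(suffix=".txt")
path = tmp.name
tmp.write(b"hello")
tmp.flush()

exists_before = os.path.exists(path)  # the file is on disk while it is open
tmp.close()
exists_after = os.path.exists(path)   # closing the file deleted it
```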

For future reference:
I just had a case in which I couldn't use temp files for downloads.
But I still needed to delete the files afterwards, so here is how I did it (I really didn't want to rely on cron jobs or celery or wossnames; it's a very small system and I wanted it to stay that way).
def plug_cleaning_into_stream(stream, filename):
    try:
        closer = getattr(stream, 'close')
        # define a new function that still uses the old one
        def new_closer():
            closer()
            os.remove(filename)
            # any cleaning you need added as well
        # substitute it for the old close() function
        setattr(stream, 'close', new_closer)
    except:
        raise
Then I just took the stream used for the response and plugged the cleanup into it.
def send_file(request, filename):
    with io.open(filename, 'rb') as ready_file:
        plug_cleaning_into_stream(ready_file, filename)
        response = HttpResponse(ready_file.read(), content_type='application/force-download')
        # here all the rest of the header settings
        # ...
        return response
I know this is quick and dirty, but it works. I doubt it would hold up on a server with thousands of requests a second, but that's not my case here (a few dozen a minute at most).
EDIT: I forgot to mention that I was dealing with very, very big files that could not fit in memory during the download. That is why I am using a BufferedReader (which is what is underneath io.open()).
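A different way to get the same effect, shown here only as a hedged sketch and not as the poster's method, is a generator that yields the file in chunks and removes it in a finally block. The name iter_file_then_delete and the chunk size are assumptions made for illustration.

```python
import os


def iter_file_then_delete(path, chunk_size=8192):
    """Yield the file's bytes in chunks, then delete the file.

    The finally block runs when the generator is exhausted or closed after
    it has started. A generator that is never iterated never reaches it,
    so the file would stay on disk in that case.
    """
    try:
        with open(path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk
    finally:
        os.remove(path)
```

A generator like this could be passed to a streaming response so that only one chunk is held in memory at a time.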

Mostly, we use periodic cron jobs for this.
Django already has one cron job to clean up lost sessions. And you're already running it, right?
See http://docs.djangoproject.com/en/dev/topics/http/sessions/#clearing-the-session-table
You want another command just like this one, in your application, that cleans up old files.
See this http://docs.djangoproject.com/en/dev/howto/custom-management-commands/
Also, you may not really be sending this file from Django. Sometimes you can get better performance by creating the file in a directory used by Apache and redirecting to a URL so the file can be served by Apache for you. Sometimes this is faster. It doesn't handle the cleanup any better, however.
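The periodic cleanup suggested above could look roughly like the following, for instance as the body of a custom management command run from cron. This is a sketch under assumptions: remove_old_files, its parameters and the age threshold are hypothetical and not part of Django.

```python
import os
import time


def remove_old_files(directory, max_age_seconds, now=None):
    """Delete regular files in directory last modified more than max_age_seconds ago.

    Returns the sorted names of the files that were removed.
    """
    now = time.time() if now is None else now
    with os.scandir(directory) as it:
        entries = list(it)  # snapshot the listing before deleting anything
    removed = []
    for entry in entries:
        if entry.is_file() and now - entry.stat().st_mtime > max_age_seconds:
            os.remove(entry.path)
            removed.append(entry.name)
    return sorted(removed)
```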

One way would be to add a view to delete this file and call it from the client side using an asynchronous call (XMLHttpRequest). A variant of this would involve reporting back from the client on success so that the server can mark this file for deletion and have a periodic job clean it up.

This is just using the regular Python approach (very simple example):
# something generates a file at filepath
import os
# open file
with open(filepath, "rb") as fid:
    filedata = fid.read()
# remove the file
os.remove(filepath)
# make response
response = HttpResponse(filedata, content_type="text/plain")
return response

Python 3.7, Django 2.2.5
from tempfile import NamedTemporaryFile
from django.http import HttpResponse
with NamedTemporaryFile(suffix='.csv', mode='r+', encoding='utf8') as f:
    f.write('\uFEFF') # BOM
    f.write('sth you want')
    # ref: https://docs.python.org/3/library/tempfile.html#examples
    f.seek(0)
    data=f.read()
response = HttpResponse(data, content_type="text/plain")
response['Content-Disposition'] = 'inline; filename=export.csv'

Python - Upload a in-memory file (generated by API calls) in FTP by chunks

I need to upload a file through FTP and SFTP in Python, but with some unusual constraints:
The file MUST NOT be written to disk.
The file is generated by calling an API and writing the JSON response to the file.
There are multiple calls to the API. It is not possible to retrieve the whole result in a single call.
I cannot build the full result in a string variable by making all the calls and appending each response until the whole file is in memory. The file could be huge and memory is constrained. Each chunk should be sent and its memory released.
Here is some sample code showing what I would like to do:
def chunks_generator():
    range_list = range(0, 4000, 100)
    for i in range_list:
        data_chunk = requests.get(url=someurl, params={'offset':i, 'limit':100})
        # .content is the response body as bytes
        yield data_chunk.content
def upload_file():
    chunks = chunks_generator()
    for chunk in chunks:
        data_chunk= chunk
        chunk_io = io.BytesIO(data_chunk)
        ftp = FTP(self.host)
        ftp.login(user=self.username, passwd=self.password)
        ftp.cwd(self.remote_path)
        ftp.storbinary("STOR " + "myfilename.json", chunk_io)
I want only one file with all the chunks appended.
What already works is holding the whole file in memory and sending it at once, like this:
string_io = io.BytesIO(all_chunks_together_in_one_string)
ftp = FTP(self.host)
ftp.login(user=self.username, passwd=self.password)
ftp.cwd(self.remote_path)
ftp.storbinary("STOR " + "myfilename.json", string_io )
Bonus
I need this with ftplib, but I will need it with Paramiko as well for SFTP. If other libraries would handle this better, I am open to them.
What if I need to zip the file? Can I zip each chunk and send one zipped chunk at a time?

You can implement a file-like class whose .read(blocksize) method retrieves data from the requests object.
Something like this (untested):
class ChunksGenerator:
    i = 0
    requests = None
    def __init__(self, requests):
        self.requests = requests
    def read(self, blocksize):
        # TODO: somehow detect end-of-file and return an empty bytes object in that case
        buf = self.requests.get(
            url=someurl, params={'offset':self.i, 'limit':blocksize})
        self.i += blocksize
        # return the response body as bytes
        return buf.content
generator = ChunksGenerator(requests)
ftp.storbinary("STOR " + "myfilename.json", generator)
With Paramiko, you can use the same class with the SFTPClient.putfo method.
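The same idea can also be written as a generic adapter that turns any iterator of byte chunks into an object with a read(size) method, so end-of-file falls out naturally when the iterator is exhausted. This is a minimal sketch; the class name IterStream is an assumption, and it relies only on ftplib's storbinary calling read(blocksize) until it gets an empty bytes object.

```python
class IterStream:
    """Wrap an iterator of bytes chunks in a minimal read(size) interface."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._buffer = b""

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything that is left (defeats the memory saving).
            data = self._buffer + b"".join(self._chunks)
            self._buffer = b""
            return data
        # Pull chunks only until the buffer can satisfy this request.
        while len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data  # b"" signals end-of-file to the caller
```

With the generator from the question, usage would be along the lines of ftp.storbinary("STOR myfilename.json", IterStream(chunks_generator())), provided the generator yields bytes. At most one block plus one API chunk is held in memory at a time.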

Best way to create a download link for a file in Flask?

In my project, when a user clicks a link, an AJAX request sends the information required to create a CSV. The CSV takes a long time to generate and so I want to be able to include a download link for the generated CSV in the AJAX response. Is this possible?
Most of the answers I've seen return the CSV in the following way:
return Response(
    csv,
    mimetype="text/csv",
    headers={"Content-disposition":
             "attachment; filename=myplot.csv"})
However, I don't think this is compatible with the AJAX response I'm sending with:
return render_json(200, {'data': params})
Ideally, I'd like to be able to send the download link in the params dict. But I'm also not sure if this is secure. How is this problem typically solved?

I think one solution may be the concurrent.futures library (part of the standard library in Python 3; on Python 2, pip install futures). The first endpoint can queue up the task and send the file name back, and another endpoint can then be used to retrieve the file. I also included gzip because it might be a good idea if you are sending larger files. I think more robust solutions use Celery or RabbitMQ or something along those lines. However, this is a simple solution that should accomplish what you are asking for.
from flask import Flask, jsonify, Response
from uuid import uuid4
from concurrent.futures import ThreadPoolExecutor
import time
import os
import gzip
app = Flask(__name__)
# Global variables used by the thread executor, and the thread executor itself
NUM_THREADS = 5
EXECUTOR = ThreadPoolExecutor(NUM_THREADS)
OUTPUT_DIR = os.path.dirname(os.path.abspath(__file__))
# this is your long running processing function
# takes in your arguments from the /queue-task endpoint
def a_long_running_task(*args):
    time_to_wait, output_file_name = int(args[0][0]), args[0][1]
    output_string = 'sleeping for {0} seconds. File: {1}'.format(time_to_wait, output_file_name)
    print(output_string)
    time.sleep(time_to_wait)
    filename = os.path.join(OUTPUT_DIR, output_file_name)
    # here we are writing to a gzipped file to save space and decrease size of file to be sent on network
    with gzip.open(filename, 'wb') as f:
        # a gzip file opened in binary mode expects bytes, so encode the text
        f.write(output_string.encode('utf-8'))
    print('finished writing {0} after {1} seconds'.format(output_file_name, time_to_wait))
# This is a route that starts the task and then gives them the file name for reference
@app.route('/queue-task/<wait>')
def queue_task(wait):
    output_file_name = str(uuid4()) + '.csv'
    EXECUTOR.submit(a_long_running_task, [wait, output_file_name])
    return jsonify({'filename': output_file_name})
# this takes the file name and returns if exists, otherwise notifies it is not yet done
@app.route('/getfile/<name>')
def get_output_file(name):
    file_name = os.path.join(OUTPUT_DIR, name)
    if not os.path.isfile(file_name):
        return jsonify({"message": "still processing"})
    # read without gzip.open to keep it compressed
    with open(file_name, 'rb') as f:
        resp = Response(f.read())
    # set headers to tell encoding and to send as an attachment
    resp.headers["Content-Encoding"] = 'gzip'
    resp.headers["Content-Disposition"] = "attachment; filename={0}".format(name)
    resp.headers["Content-type"] = "text/csv"
    return resp
if __name__ == '__main__':
    app.run()
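One caveat with checking os.path.isfile is that the file exists as soon as it is opened for writing, so a request that arrives mid-write could receive a partial file. A possible refinement, sketched here with hypothetical names (TASKS, queue_task, task_status) and without the Flask routes, is to keep the Future returned by submit and ask it whether the task has finished.

```python
from concurrent.futures import ThreadPoolExecutor
from uuid import uuid4

EXECUTOR = ThreadPoolExecutor(2)
TASKS = {}  # task id -> Future; in-memory only, so it is lost on restart


def queue_task(func, *args):
    """Submit func(*args) to the pool and return an id for later lookup."""
    task_id = str(uuid4())
    TASKS[task_id] = EXECUTOR.submit(func, *args)
    return task_id


def task_status(task_id):
    """Report the state of a task without blocking on it."""
    future = TASKS.get(task_id)
    if future is None:
        return "unknown"
    if not future.done():
        return "still processing"
    if future.exception() is not None:
        return "failed"
    return "done"
```

The retrieval endpoint would then serve the file only when task_status reports "done". This dictionary lives in a single process, so it would not work unchanged behind several worker processes.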

Flask - Handling Form File & Upload to AWS S3 without Saving to File

I am using a Flask app to receive a multipart/form-data request with an uploaded file (a video, in this example).
I don't want to save the file in the local directory because this app will be running on a server, and saving it will slow things down.
I am trying to use the file object created by the Flask request.files[''] method, but it doesn't seem to be working.
Here is that portion of the code:
@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    form = request.form
    video_file = request.files['video_data']
    if video_file:
        s3 = boto3.client('s3')
        s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
    return json.dumps('DynamoDB failure')
This returns an error:
TypeError: must be encoded string without NULL bytes, not str
on the line:
s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
I did get this to work by first saving the file and then accessing that saved file, so it's not an issue with catching the request file. This works:
video_file.save(form['video_id']+".mp4")
s3.upload_file(form['video_id']+".mp4", S3_BUCKET, form['video_id']+".mp4")
What would be the best method to handle this file data in memory and pass it to the s3.upload_file() method? I am using the boto3 methods here, and I am only finding examples with the filename used in the first parameter, so I'm not sure how to process this correctly using the file in memory. Thanks!

First you need to be able to access the raw data sent to Flask. This is not as easy as it seems, since you're reading a form. To read the raw stream you can use flask.request.stream, which behaves similarly to StringIO. The trick here is that you cannot call request.form or request.files, because accessing those attributes will load the whole stream into memory or into a file.
You'll need some extra work to extract the right part of the stream (which unfortunately I cannot help you with because it depends on how your form is made, but I'll let you experiment with this).
Finally you can use the set_contents_from_file function from boto, since upload_file does not seem to deal with file-like objects (StringIO and such).
Example code:
import boto
import boto.s3.connection
from boto.s3.key import Key
@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    # form = request.form <- Don't do that
    # video_file = request.files['video_data'] <- Don't do that either
    video_file_and_metadata = request.stream # This is a file-like object which does not only contain your video file
    # This is what you need to implement
    video_title, video_stream = extract_title_stream(video_file_and_metadata)
    # Then, upload to the bucket (using a boto connection, not a boto3 client)
    conn = boto.connect_s3()
    bucket = conn.create_bucket(bucket_name, location=boto.s3.connection.Location.DEFAULT)
    k = Key(bucket)
    k.key = video_title
    k.set_contents_from_file(video_stream)
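For readers on boto3, the client method upload_fileobj accepts a binary file-like object, whereas upload_file expects a filename, which is why passing video_file.read() to it fails. The sketch below is a hedged illustration: upload_stream_to_s3 is a hypothetical helper, and the client is passed in as a parameter so the function itself does not need AWS credentials.

```python
def upload_stream_to_s3(s3_client, fileobj, bucket, key):
    """Upload a binary file-like object to S3 without saving it locally first.

    s3_client is expected to be a boto3 S3 client, for example
    boto3.client('s3'); it is a parameter here so it can be swapped out.
    """
    s3_client.upload_fileobj(fileobj, bucket, key)


# Hypothetical usage inside the Flask view (needs boto3 and AWS credentials):
#   s3 = boto3.client('s3')
#   upload_stream_to_s3(s3, request.files['video_data'].stream, S3_BUCKET, 'video.mp4')
```

Note that by the time request.files is accessed, the upload may already have been spooled to memory or to a temporary file, as the answer above points out.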
