Flask file threading 'ValueError: I/O operation on closed file' - python

I am building an API with Flask. When a user uploads a file, it should asynchronously be uploaded to an AWS S3 bucket. This is done by threading that task, while the user receives a confirmation that the request was received. (In this case the files are considered low value, so it does not really matter if the thread explodes.) I use the Flask request to save the file to a variable, which I pass to the threaded function that handles the file. However, when trying to access the file in this handler, I get a 'ValueError: I/O operation on closed file'. What am I doing wrong and how can I solve this? For simplicity, both functions are stripped down to the core without error handling.
Flask app route
@app.route('/file/upload', methods=['POST'])
def post_upload_file():
    return operator(receive_file)
Called function
def receive_file():
    if 'file' in request.files and request.files['file'] != '':
        x_file = request.files['file']
        if allowed_file(x_file.filename):
            fileid = shortuuid.uuid()
            extension = x_file.filename.rsplit('.', 1)[1].lower()
            newfilename = fileid + '.' + extension
            Thread(target=handle_file, args=(fileid, x_file, newfilename)).start()
            return {'fileid': fileid}
Threaded handler
def handle_file(fileid, x_file, name=None, bucket='XXX'):
    name = "file/" + name
    s3_client = boto3.client('s3')
    s3_client.upload_fileobj(x_file, bucket, name)

The problem was that the uploaded file no longer existed by the time the threaded handle_file function started, because that function runs in a separate thread and the request (and its temporary file) had already been cleaned up by the main thread.
Besides, it was probably not the best idea to pass large files around in the memory of the API server anyway. I solved the problem by storing the incoming file on the server's disk, fetching its location and passing that path to the handle_file function, which retrieves the file from storage and uploads it to the bucket.
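A minimal sketch of that approach (the staging directory, the use of x_file.save(), and the adjusted handle_file signature are assumptions for illustration, not the original code):
import os
from threading import Thread

import boto3
import shortuuid
from flask import request

UPLOAD_DIR = '/tmp/uploads'  # assumed local staging directory

def receive_file():
    x_file = request.files['file']
    fileid = shortuuid.uuid()
    extension = x_file.filename.rsplit('.', 1)[1].lower()
    newfilename = fileid + '.' + extension
    # Persist the upload to disk while the request context is still alive,
    # then hand only the file path to the background thread.
    filepath = os.path.join(UPLOAD_DIR, newfilename)
    x_file.save(filepath)
    Thread(target=handle_file, args=(fileid, filepath, newfilename)).start()
    return {'fileid': fileid}

def handle_file(fileid, filepath, name, bucket='XXX'):
    s3_client = boto3.client('s3')
    s3_client.upload_file(filepath, bucket, "file/" + name)
    os.remove(filepath)  # clean up the staged copy once the upload is done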

Related

Error uploading file to google cloud storage

How should the files on my server be uploaded to google cloud storage?
The code I have tried is given below; however, it throws a type error saying the expected type is not bytes for:
blob.upload_from_file(file.file.read())
even though upload_from_file requires a binary type.
@app.post("/file/")
async def create_upload_file(files: List[UploadFile] = File(...)):
    storage_client = storage.Client.from_service_account_json(path.json)
    bucket_name = 'data'
    try:
        bucket = storage_client.create_bucket(bucket_name)
    except Exception:
        bucket = storage_client.get_bucket(bucket_name)
    for file in files:
        destination_file_name = f'{file.filename}'
        new_data = models.Data(
            path=destination_file_name
        )
        try:
            blob = bucket.blob(destination_file_name)
            blob.upload_from_file(file.file.read())
        except Exception:
            raise HTTPException(
                status_code=500,
                detail="File upload failed"
            )
Option 1
As per the documentation, upload_from_file() supports a file-like object; hence, you could use the .file attribute of UploadFile (which represents a SpooledTemporaryFile instance). For example:
blob.upload_from_file(file.file)
Option 2
You could read the contents of the file and pass them to upload_from_string(), which supports data in bytes or string format. For instance:
blob.upload_from_string(file.file.read())
or, since you defined your endpoint with async def (see this answer for def vs async def):
contents = await file.read()
blob.upload_from_string(contents)
Option 3
For the sake of completeness, upload_from_filename() expects a filename which represents the path to the file. Hence, the No such file or directory error was thrown when you passed file.filename (as mentioned in your comment), as this is not a path to the file. To use that method (as a last resort), you should save the file contents to a NamedTemporaryFile, which "has a visible name in the file system" that "can be used to open the file", and once you are done with it, delete it. Example:
from tempfile import NamedTemporaryFile
import os

contents = file.file.read()
temp = NamedTemporaryFile(delete=False)
try:
    with temp as f:
        f.write(contents)
    blob.upload_from_filename(temp.name)
except Exception:
    return {"message": "There was an error uploading the file"}
finally:
    # temp.close()  # the `with` statement above takes care of closing the file
    os.remove(temp.name)
Note 1:
If you are uploading a rather large file to Google Cloud Storage that may require some time to completely upload, and have encountered a timeout error, please consider increasing the amount of time to wait for the server response, by changing the timeout value, which—as shown in upload_from_file() documentation, as well as all other methods described earlier—by default is set to timeout=60 seconds. To change that, use e.g., blob.upload_from_file(file.file, timeout=180), or you could also set timeout=None (meaning that it will wait until the connection is closed).
Note 2:
Since all the above methods from the google-cloud-storage package perform blocking I/O operations—as can be seen in the source code here, here and here—if you have decided to define your create_upload_file endpoint with async def instead of def (have a look at this answer for more details on def vs async def), you should rather run the "upload file" function in a separate thread to ensure that the main thread (where coroutines are run) does not get blocked. You can do that using Starlette's run_in_threadpool, which is also used by FastAPI internally (see here as well). For example:
await run_in_threadpool(blob.upload_from_file, file.file)
Alternatively, you can use asyncio's loop.run_in_executor, as described in this answer and demonstrated in this sample snippet too.
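A minimal sketch of that alternative (the surrounding endpoint and the blob/file names are assumed from the examples above):
import asyncio

# Run the blocking upload in the default ThreadPoolExecutor so the event
# loop stays responsive while google-cloud-storage performs its I/O.
loop = asyncio.get_running_loop()
await loop.run_in_executor(None, blob.upload_from_file, file.file)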
As for Option 3, where you need to open a NamedTemporaryFile and write the contents to it, you can do that using the aiofiles library, as demonstrated in Option 2 of this answer, that is, using:
async with aiofiles.tempfile.NamedTemporaryFile("wb", delete=False) as temp:
    contents = await file.read()
    await temp.write(contents)
    # ...
and again, run the "upload file" function in an external threadpool:
await run_in_threadpool(blob.upload_from_filename, temp.name)
Finally, have a look at the answers here and here on how to enclose the I/O operations in try-except-finally blocks, so that you can catch any possible exceptions, as well as close the UploadFile object properly. UploadFile is a temporary file that is deleted from the filesystem when it is closed. To find out where your system keeps the temporary files, see this answer. Note: Starlette, as described here, uses a SpooledTemporaryFile with 1MB max_size, meaning that the data is spooled in memory until the file size exceeds 1MB, at which point the contents are written to the temporary directory. Hence, you will only see the file you uploaded showing up in the temp directory, if it is larger than 1MB and if .close() has not yet been called.
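A minimal sketch of that pattern, as a hypothetical single-file endpoint (bucket and the imports are assumed from the earlier examples, not the original code):
@app.post("/file/")
async def create_upload_file(file: UploadFile = File(...)):
    try:
        blob = bucket.blob(file.filename)
        # Offload the blocking upload to a worker thread, as discussed above.
        await run_in_threadpool(blob.upload_from_file, file.file)
    except Exception:
        raise HTTPException(status_code=500, detail="File upload failed")
    finally:
        # Close the UploadFile so its SpooledTemporaryFile is cleaned up.
        await file.close()
    return {"filename": file.filename}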

Django api stream csv file and upload to s3 in a thread

I have a Django APIView that generates a CSV file stream and returns it using StreamingHttpResponse. I also want to upload the generated file to S3 in a separate thread, as I do not want the response to stall until the upload is completed.
I thought of saving the stream to a TemporaryFile object and using that to trigger the S3 upload in a separate thread once the stream is complete. With this approach I cannot use a context manager, as the file is automatically closed when the context manager exits, so I had to resort to closing the file manually once the upload completed. I would like to avoid this manual file closing. How do I solve this?
Here is the view:
class CsvExportView(APIView):
    def get(self, request):
        response = StreamingHttpResponse(
            streaming_content=get_file_stream(),
            content_type='text/csv'
        )
        response['Content-Disposition'] = 'attachment; filename="{}"'.format('some_name')
        return response
Here is the get_file_stream() method:
import threading
from tempfile import TemporaryFile

class TempFileManager(object):
    def __init__(self):
        self._file = TemporaryFile()

    def get_file(self):
        return self._file

    def write_to_file(self, data):
        self._file.write(data)

    def close_file(self):
        self._file.close()

def get_file_stream():
    file_mng = TempFileManager()
    # get_csv_stream is a generator that yields csv file content
    # by calling an external api and doing complex data processing
    for data in get_csv_stream():
        # keep writing to the temp file
        file_mng.write_to_file(data)
        # yield chunks to the browser
        yield data
    # once all data is received,
    # spawn a thread, do the upload and
    # complete the request lifecycle
    t = threading.Thread(target=upload_file_to_s3, args=(file_mng,))
    t.start()

def upload_file_to_s3(manager):
    # some code to upload the file
    upload(manager.get_file())
    # remember to close the file
    manager.close_file()
The above approach works fine but seems very cluttered and error-prone. This would be much easier if a context manager were used, like this:
def get_file_stream():
    with TemporaryFile() as tmp:
        for data in get_csv_stream():
            # keep writing to temp file
            tmp.write(data)
            yield data
        upload_file_to_s3(tmp)
But I do not want the user to wait until the file is uploaded to S3, so threading is involved, and the context manager might close the file even before the upload process starts.
Does anyone know a better way to handle this situation?
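One possible arrangement, sketched under the assumption that the uploading thread may own the file: hand the temporary file to the thread and let it close the file through contextlib.closing, so the generator never calls close() by hand (upload() below is a stand-in for the actual S3 upload helper):
import threading
from contextlib import closing
from tempfile import TemporaryFile

def get_file_stream():
    tmp = TemporaryFile()
    for data in get_csv_stream():
        tmp.write(data)
        yield data
    # The upload thread now owns the file and is responsible for closing it.
    threading.Thread(target=upload_file_to_s3, args=(tmp,)).start()

def upload_file_to_s3(fileobj):
    # closing() guarantees the file is closed even if the upload fails.
    with closing(fileobj):
        fileobj.seek(0)
        upload(fileobj)  # stand-in for the real S3 upload call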

In Flask, how can I send a temporary file and delete it after upload is finished? [duplicate]

I have a Flask view that generates data and saves it as a CSV file with Pandas, then displays the data. A second view serves the generated file. I want to remove the file after it is downloaded. My current code raises a permission error, maybe because after_request deletes the file before it is served with send_from_directory. How can I delete a file after serving it?
def process_data(data):
    tempname = str(uuid4()) + '.csv'
    data['text'].to_csv('samo/static/temp/{}'.format(tempname))
    return file

@projects.route('/getcsv/<file>')
def getcsv(file):
    @after_this_request
    def cleanup(response):
        os.remove('samo/static/temp/' + file)
        return response

    return send_from_directory(directory=cwd + '/samo/static/temp/', filename=file, as_attachment=True)
after_request runs after the view returns but before the response is sent. Sending a file may use a streaming response; if you delete it before it's read fully you can run into errors.
This is mostly an issue on Windows; other platforms can mark a file deleted and keep it around until it is no longer being accessed. However, it may still be useful to only delete the file once you're sure it's been sent, regardless of platform.
Read the file into memory and serve it, so that it's not being read when you delete it later. If the file is too big to read into memory, use a generator to serve it and then delete it.
@app.route('/download_and_remove/<filename>')
def download_and_remove(filename):
    path = os.path.join(current_app.instance_path, filename)

    def generate():
        with open(path) as f:
            yield from f
        os.remove(path)

    r = current_app.response_class(generate(), mimetype='text/csv')
    r.headers.set('Content-Disposition', 'attachment', filename='data.csv')
    return r
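For the small-file case, a minimal sketch of reading the file into memory before deleting it (the route name is made up here, and download_name is the Flask 2.x keyword; older releases call it attachment_filename):
import os
from io import BytesIO
from flask import current_app, send_file

@app.route('/download_small_and_remove/<filename>')
def download_small_and_remove(filename):
    path = os.path.join(current_app.instance_path, filename)
    # Load the whole file into memory so the on-disk copy is no longer
    # needed once the response is built.
    with open(path, 'rb') as f:
        data = BytesIO(f.read())
    os.remove(path)
    return send_file(data, mimetype='text/csv', as_attachment=True,
                     download_name='data.csv')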

Flask - Handling Form File & Upload to AWS S3 without Saving to File

I am using a Flask app to receive a multipart/form-data request with an uploaded file (a video, in this example).
I don't want to save the file in the local directory because this app will be running on a server, and saving it will slow things down.
I am trying to use the file object created by the Flask request.files[''] method, but it doesn't seem to be working.
Here is that portion of the code:
@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    form = request.form
    video_file = request.files['video_data']
    if video_file:
        s3 = boto3.client('s3')
        s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
    return json.dumps('DynamoDB failure')
This returns an error:
TypeError: must be encoded string without NULL bytes, not str
on the line:
s3.upload_file(video_file.read(), S3_BUCKET, 'video.mp4')
I did get this to work by first saving the file and then accessing that saved file, so it's not an issue with catching the request file. This works:
video_file.save(form['video_id']+".mp4")
s3.upload_file(form['video_id']+".mp4", S3_BUCKET, form['video_id']+".mp4")
What would be the best method to handle this file data in memory and pass it to the s3.upload_file() method? I am using the boto3 methods here, and I am only finding examples with the filename used in the first parameter, so I'm not sure how to process this correctly using the file in memory. Thanks!
First you need to be able to access the raw data sent to Flask. This is not as easy as it seems, since you're reading a form. To be able to read the raw stream you can use flask.request.stream, which behaves similarly to StringIO. The trick here is that you cannot call request.form or request.files, because accessing those attributes will load the whole stream into memory or into a file.
You'll need some extra work to extract the right part of the stream (which unfortunately I cannot help you with because it depends on how your form is made, but I'll let you experiment with this).
Finally you can use the set_contents_from_file function from boto, since upload_file does not seem to deal with file-like objects (StringIO and such).
Example code:
import boto.s3.connection
from boto.s3.key import Key

@bp.route('/video_upload', methods=['POST'])
def VideoUploadHandler():
    # form = request.form  <- Don't do that
    # video_file = request.files['video_data']  <- Don't do that either
    video_file_and_metadata = request.stream  # a file-like object which does not only contain your video file

    # This is what you need to implement
    video_title, video_stream = extract_title_stream(video_file_and_metadata)

    # Then, upload to the bucket (with boto, not boto3, since Key comes from boto)
    conn = boto.connect_s3()
    bucket = conn.create_bucket(bucket_name, location=boto.s3.connection.Location.DEFAULT)
    k = Key(bucket)
    k.key = video_title
    k.set_contents_from_file(video_stream)
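For completeness, boto3's upload_fileobj also accepts file-like objects, so a sketch along these lines should work as well (assuming the video_stream and video_title extracted above, and the S3_BUCKET name from the question):
import boto3

s3 = boto3.client('s3')
# upload_fileobj accepts any file-like object with a read() method,
# so the extracted stream can be sent straight to S3.
s3.upload_fileobj(video_stream, S3_BUCKET, video_title)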
