My Flask application will allow the upload of large files (up to 100 MB) to my server. I was wondering how Flask manages the chunked file if the client decides to stop the upload halfway. I read the documentation about File Upload but wasn't able to find that mentioned.
Does Flask automatically delete the file? How can it know that the user won't retry it? Or do I have to manually delete the aborted files in the temporary folder?
Werkzeug (the library that Flask uses for many tasks including this one) uses a tempfile.TemporaryFile object to receive the WSGI file stream when uploading. The object automatically manages the open file.
The file is immediately deleted on disk: there is no entry in the directory table anymore, but the process retains an open file handle.
When the TemporaryFile object is cleared (no references remain, usually because the request ended), the file object is closed and the operating system frees the disk space used.
As such, the file data is deleted when a request is aborted.
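A quick way to see this behaviour for yourself, independent of Flask, is a minimal sketch with tempfile.TemporaryFile directly:

import tempfile

# A file created by TemporaryFile has no directory entry (on POSIX it is
# unlinked immediately), yet the data is fully usable through the handle.
with tempfile.TemporaryFile() as tmp:
    tmp.write(b"chunked upload data")
    tmp.seek(0)
    print(tmp.read())  # b'chunked upload data'
# Leaving the block closes the handle and the OS frees the disk space,
# which is what happens when the request (and its file object) goes away.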
Flask does not handle the case where a user uploads the file again; there is no standard way to handle that anyway. You'd have to come up with your own solution there.
My application is keeping watch on a set of folders where users can upload files. When a file upload is finished I have to process it, but I don't know how to detect that a file has not finished uploading.
Any way to detect if a file is not released yet by the FTP server?
There's no generic solution to this problem.
Some FTP servers lock the file while it is still being uploaded, preventing you from accessing it. For example, the IIS FTP server does that. Most other FTP servers do not. See my answer to Prevent file from being accessed as it's being uploaded.
There are some common workarounds to the problem (originally posted in SFTP file lock mechanism, but relevant for the FTP too):
You can have the client upload a "done" file once the upload finishes. Make your automated system wait for the "done" file to appear.
You can have a dedicated "upload" folder and have the client (atomically) move the uploaded file to a "done" folder. Make your automated system look to the "done" folder only.
Have a file naming convention for files being uploaded (".filepart") and have the client (atomically) rename the file after upload to its final name. Make your automated system ignore the ".filepart" files; a client-side sketch of this follows after the list.
See (my) article Locking files while uploading / Upload to temporary file name for an example of implementing this approach.
Also, some FTP servers have this functionality built-in. For example ProFTPD with its HiddenStores directive.
A gross hack is to periodically check the file attributes (size and time) and consider the upload finished if the attributes have not changed for some time interval.
You can also make use of the fact that some file formats have a clear end-of-file marker (like XML or ZIP), so you can tell that the file is still incomplete.
Some FTP servers allow you to configure a hook to be called, when an upload is finished. You can make use of that. For example ProFTPD has a mod_exec module (see the ExecOnCommand directive).
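If the client side is under your control, the ".filepart" rename workaround above is easy to script. A minimal sketch with the standard-library ftplib (server address, credentials and file names are placeholders):

from ftplib import FTP

# Upload under a temporary name, then atomically rename to the final name.
# The watcher on the server side simply ignores "*.filepart" files.
with FTP("ftp.example.com", "user", "password") as ftp:
    with open("report.csv", "rb") as fh:
        ftp.storbinary("STOR report.csv.filepart", fh)
    ftp.rename("report.csv.filepart", "report.csv")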
I use ftputil to implement this work-around:
connect to ftp server
list all files of the directory
call stat() on each file
wait N seconds
For each file, call stat() again. If the result is different, skip this file, since it was modified during the last N seconds.
If the stat() result is unchanged, download the file.
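A rough sketch of those steps with ftputil (host, credentials and the wait interval are placeholders):

import time
import ftputil

def attrs(host, name):
    # Only size and modification time matter for "has the file changed?"
    st = host.stat(name)
    return (st.st_size, st.st_mtime)

with ftputil.FTPHost("ftp.example.com", "user", "password") as host:
    names = host.listdir(host.curdir)
    before = {name: attrs(host, name) for name in names}
    time.sleep(60)  # wait N seconds
    for name in names:
        if attrs(host, name) != before[name]:
            continue  # changed recently: still being uploaded, skip for now
        host.download(name, name)  # unchanged: assume the upload finished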
This whole ftp-fetching is old and obsolete technology. I hope that the customer will use a modern http API the next time :-)
If you are reading files of particular extensions, then use WinSCP for the file transfer. It will create a temporary file with the extension .filepart and rename it to the actual file name once it has fully transferred the file.
I hope it will help someone.
This is a classic problem with FTP transfers. The only mostly reliable method I've found is to send a file, then send a second short "marker" file just to tell the recipient the transfer of the first is complete. You can use a file naming convention and just check for existence of the second file.
You might get fancy and make the content of the second file a checksum of the first file. Then you could verify the first file. (You don't have the problem with the second file because you just wait until file size = checksum size).
And of course this only works if you can get the sender to send a second file.
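A small sketch of the checksum-marker idea (the file names and the choice of MD5 are assumptions, not part of the answer above):

import hashlib
from pathlib import Path

def write_marker(path):
    # Sender: after data.csv is transferred, also transfer data.csv.md5.
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    Path(path + ".md5").write_text(digest)

def transfer_complete(path):
    # Recipient: the transfer is done once the marker exists and matches.
    marker = Path(path + ".md5")
    if not marker.exists():
        return False
    digest = hashlib.md5(Path(path).read_bytes()).hexdigest()
    return digest == marker.read_text().strip()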
One CSV file is uploaded to Cloud Storage every day around 0200 hrs, but sometimes, due to a job failure or a system crash, the upload happens very late. So I want to create a Cloud Function that can trigger my Python BigQuery load script whenever the file is uploaded to the storage bucket.
file_name : seller_data_{date}
bucket name : sale_bucket/
The question lacks enough description of the desired use case and of any issues the OP has faced. However, here are a few possible approaches that you might choose from, depending on the use case.
The simple way: Cloud Functions with Storage trigger.
This is probably the simplest and most efficient way of running a Python function whenever a file gets uploaded to your bucket.
The most basic tutorial is this.
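A minimal sketch of such a function (Python runtime, background trigger on the object-finalize event); the dataset/table name and the load options are assumptions:

from google.cloud import bigquery

def load_seller_data(event, context):
    # 'event' carries the metadata of the object that triggered the function.
    uri = f"gs://{event['bucket']}/{event['name']}"
    client = bigquery.Client()
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    client.load_table_from_uri(uri, "my_dataset.seller_data", job_config=job_config).result()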
The hard way: App Engine with a few tricks.
Have a basic Flask application hosted on GAE (Standard or Flex), with an endpoint specifically meant to check whether the file exists, download the object, manipulate it and then do something with it.
This route can act as a custom HTTP-triggered function: the request could come from a simple curl call, a visit from the browser, a Pub/Sub event, or even another Cloud Function.
Once it receives a GET (or POST) request, it downloads the object into the /tmp dir, processes it and then does something with it.
The small benefit of GAE over Cloud Functions is that you can set a minimum of one instance to stay alive at all times, which means you will not have cold starts or risk the request timing out before the job is done.
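Roughly, the GAE variant could look like this (a sketch only; the bucket, the object name and the actual processing are placeholders):

from flask import Flask
from google.cloud import storage

app = Flask(__name__)

@app.route("/check", methods=["GET", "POST"])
def check_and_process():
    # Hypothetical object name; the question's files follow seller_data_{date}.
    blob = storage.Client().bucket("sale_bucket").get_blob("seller_data_2021-06-22")
    if blob is None:
        return "file not uploaded yet", 404
    blob.download_to_filename("/tmp/seller_data.csv")
    # ... process the file / kick off the BigQuery load here ...
    return "processed", 200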
The brutal/overkill way: Cloud Run.
Similar approach to App Engine, but with Cloud Run you'll also need to work with a Dockerfile, keep in mind that Cloud Run will scale down to zero when there's no usage, and deal with other minor things that apply to building any application on Cloud Run.
########################################
For all the above approaches, some additional things you might want to achieve are the same:
a) Downloading the object and doing some processing on it:
You will have to download it to the /tmp directory, as that is the directory both GAE and CF use to store temporary files. Cloud Run is a bit different here, but let's not go deep into it as it's overkill by itself.
However, keep in mind that if your file is large you might cause high memory usage.
And ALWAYS clean that directory after you have finished with the file. Also, when opening a file, always use with open(...) as ..., since it also makes sure you don't keep files open.
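For example, a hedged sketch of a) with the google-cloud-storage client (the names are placeholders):

import os
from google.cloud import storage

def handle_object(bucket_name, blob_name):
    local_path = os.path.join("/tmp", os.path.basename(blob_name))
    storage.Client().bucket(bucket_name).blob(blob_name).download_to_filename(local_path)
    try:
        with open(local_path) as fh:   # "with" guarantees the file gets closed
            for line in fh:
                pass                   # process line by line to keep memory low
    finally:
        os.remove(local_path)          # ALWAYS clean /tmp when done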
b) Downloading the latest object in the bucket:
This is a bit tricky and it needs some extra custom code. There are many ways to achieve it, but the one I use (always paying close attention to memory usage, though) is this: upon creating the object I upload to the bucket, I take the current time and use a regex to turn it into a name like results_22_6.
What happens now is that once I list the objects from my other script, they are already listed in ascending order, so the last element in the list is the latest object.
So basically what I do then is check whether the filename I have in /tmp is the same as the name of the last object in the bucket (object[list.length]). If yes, do nothing; if not, delete the old one and download the latest one from the bucket.
This might not be optimal, but for me it's kinda preferable.
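A simpler variant of b), if you prefer to rely on the objects' creation time rather than on a naming convention (a sketch; the bucket name comes from the question, the local path is an assumption):

import os
from google.cloud import storage

client = storage.Client()
blobs = list(client.list_blobs("sale_bucket"))     # bucket from the question
latest = max(blobs, key=lambda b: b.time_created)  # newest object wins
latest.download_to_filename(os.path.join("/tmp", os.path.basename(latest.name)))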
I am getting a memory error while uploading a CSV file of around 650 MB with a shape of (10882101, 6).
How can I upload such a file into Postgres using the Django framework?
You haven't shared many details (error logs, which Python package you are using, etc.).
You might like to read Most efficient way to parse a large .csv in python?, https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d
How I would do it from the Django framework:
I would use Celery to run the job as a background process, since waiting for the file to be uploaded and processed completely before returning a response might cause an HTTP timeout.
Celery quickstart with Django
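A bare-bones sketch of such a Celery task (the model, the field mapping and the batch size are assumptions, not part of the question):

import csv

from celery import shared_task

@shared_task
def import_csv(path):
    from myapp.models import Record   # hypothetical model with matching fields
    batch = []
    with open(path, newline="") as fh:
        for row in csv.DictReader(fh):
            batch.append(Record(**row))
            if len(batch) >= 5000:
                Record.objects.bulk_create(batch)   # avoids one INSERT per row
                batch = []
    if batch:
        Record.objects.bulk_create(batch)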
We have a React application communicating with a Django backend. Whenever the React application wants to upload a file to the backend, we send a form request with one field being the handle of the file being uploaded. The field is received on the Django side as an InMemoryUploadedFile, which is an object with some chunks, which can be processed for example like this:
import logging

logger = logging.getLogger(__name__)

def save_uploaded_file(uploaded_file, handle):
    """
    Saves the uploaded file using the given file handle.
    We walk the chunks to avoid reading the whole file in memory.
    """
    for chunk in uploaded_file.chunks():
        handle.write(chunk)
    handle.flush()
    logger.debug(f'Saved file {uploaded_file.name} with length {uploaded_file.size}')
Now, I am creating some testing framework using requests to drive our API. I am trying to emulate this mechanism, but strangely enough, requests insists on reading from the open handle before sending the request. I am doing:
requests.post(url, data, headers=headers, **kwargs)
with:
data = {'content': open('myfile', 'rb'), ...}
Note that I am not reading from the file, I am just opening it. But requests insists on reading from it, and sends the data embedded, which has several problems:
it can be huge
by being binary data, it corrupts the request
it is not what my application expects
I do not want this: I want requests simply to "stream" that file, not to read it. There is a files parameter, but that will create a multipart with the file embedded in the request, which is again not what I want. I want all fields in the data to be passed in the request, and the content field to be streamed. I know this is possible because:
the browser does it
Postman does it
the django test client does it
How can I force requests to stream a particular file in the data?
This is probably no longer relevant, but I will share some information that I found in the documentation.
By default, if an uploaded file is smaller than 2.5 megabytes, Django
will hold the entire contents of the upload in memory. This means that
saving the file involves only a read from memory and a write to disk
and thus is very fast. However, if an uploaded file is too large,
Django will write the uploaded file to a temporary file stored in your
system’s temporary directory.
This way, there is no need to create a streaming file upload. Rather, the solution might be to handle (read) the uploaded file using a buffer.
I want to create a minimal webpage where concurrent users can upload a file and I can process the file (which is expected to take some hours) and email back to the user later on.
Since I am hosting this on AWS, I was thinking of invoking some background process once I receive the file, so that even if the user closes the browser window, the processing keeps taking place and I am able to send the results after a few hours, all through some pre-written scripts.
Can you please help me with the logistics of how should I do this?
Here's how it might look (hosting-agnostic):
A user uploads a file on the web server
The file is saved in a storage that can be accessed later by the background jobs
Some metadata (location in the storage, user's email etc) about the file is saved in a DB/message broker
Background jobs watching the DB/message broker pick up the metadata, start handling the file (this is why it needs to be accessible to them, see point 2) and notify the user
More specifically, in case of python/django + aws you might use the following stack:
Let's assume you're using Python + Django
You can save the uploaded files in a private AWS S3 bucket
Some metadata might be saved in the DB, or you could use Celery + AWS SQS, or AWS SQS directly, or bring up something like RabbitMQ or Redis (+ pub/sub)
Have Python code handling the job - this depends on what you opt for in point 3. The only requirement is that it can pull data from your S3 bucket. After the job is done, notify the user via AWS SES
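For example, a hedged sketch of the processing job with Celery and boto3 (the bucket, addresses and the processing step are placeholders):

import boto3
from celery import shared_task

@shared_task
def process_upload(bucket, key, user_email):
    s3 = boto3.client("s3")
    s3.download_file(bucket, key, "/tmp/input.dat")
    # ... hours-long processing of /tmp/input.dat goes here ...
    boto3.client("ses").send_email(
        Source="noreply@example.com",
        Destination={"ToAddresses": [user_email]},
        Message={
            "Subject": {"Data": "Your file has been processed"},
            "Body": {"Text": {"Data": "The results are ready."}},
        },
    )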
The simplest single-server setup that doesn't require any intermediate components:
Your Python script that simply saves the file in a folder and gives it a name like someuser#yahoo.com-f9619ff-8b86-d011-b42d-00cf4fc964ff
A cron job looking for any files in this folder, handling the files it finds and notifying the user. Note that if you need multiple background jobs running in parallel, you'll need to complicate the scheme slightly to avoid race conditions (e.g. rename the file being processed so that only a single job handles it), as sketched below
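A rough sketch of that single-server variant (the folder path and the two helpers are placeholders):

from pathlib import Path

UPLOAD_DIR = Path("/var/uploads")            # folder the upload script writes to

def run_long_processing(path):
    pass                                      # placeholder for the hours-long job

def send_results(email, path):
    pass                                      # placeholder: e.g. SES/SMTP notification

for f in sorted(UPLOAD_DIR.iterdir()):
    if f.suffix == ".processing":             # already claimed by another run
        continue
    claimed = f.with_name(f.name + ".processing")
    f.rename(claimed)                         # atomic rename so only one job takes it
    user_email = f.name.split("-", 1)[0]      # name starts with the user's address
    run_long_processing(claimed)
    send_results(user_email, claimed)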
In a prod app you'll likely need something in between depending on your needs