We have a React application communicating with a Django backend. Whenever the React application wants to upload a file to the backend, we send a form request with one field being the handle of the file being uploaded. The field is received on the Django side as an InMemoryUploadedFile, an object that exposes chunks and can be processed, for example, like this:
def save_uploaded_file(uploaded_file, handle):
    """
    Saves the uploaded file using the given file handle.
    We walk the chunks to avoid reading the whole file into memory.
    """
    for chunk in uploaded_file.chunks():
        handle.write(chunk)
    handle.flush()
    logger.debug(f'Saved file {uploaded_file.name} with length {uploaded_file.size}')
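For context, the Django view that receives the upload looks roughly like this (the field name 'content' is what we send; the destination path is just illustrative):

from django.http import JsonResponse

def upload_view(request):
    # 'content' is the form field sent by the frontend; the file arrives in
    # request.FILES as an InMemoryUploadedFile (or TemporaryUploadedFile).
    uploaded_file = request.FILES['content']
    with open(f'/tmp/{uploaded_file.name}', 'wb') as handle:
        save_uploaded_file(uploaded_file, handle)
    return JsonResponse({'name': uploaded_file.name, 'size': uploaded_file.size})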
Now, I am creating some testing framework using requests to drive our API. I am trying to emulate this mechanism, but strangely enough, requests insists on reading from the open handle before sending the request. I am doing:
requests.post(url, data, headers=headers, **kwargs)
with:
data = {'content': open('myfile', 'rb'), ...}
Note that I am not reading from the file, I am just opening it. But requests insists on reading from it, and sends the data embedded, which has several problems:
it can be huge
by being binary data, it corrupts the request
it is not what my application expects
I do not want this: I want requests simply to "stream" that file, not to read it. There is a files parameter, but that will create a multipart with the file embedded in the request, which is again not what I want. I want all fields in the data to be passed in the request, and the content field to be streamed. I know this is possible because:
the browser does it
Postman does it
the django test client does it
How can I force requests to stream a particular file in the data?
This is probably no longer relevant, but I will share some information that I found in the documentation.
By default, if an uploaded file is smaller than 2.5 megabytes, Django
will hold the entire contents of the upload in memory. This means that
saving the file involves only a read from memory and a write to disk
and thus is very fast. However, if an uploaded file is too large,
Django will write the uploaded file to a temporary file stored in your
system’s temporary directory.
This way, there is no need to create a streaming file upload. Rather, the solution might be to handle (read) the uploaded file using a buffer.
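That said, if you do want the client side to stream the upload rather than buffer it, one option (an addition on my part, not something from the question's code) is requests-toolbelt's MultipartEncoder. It builds the same multipart/form-data request a browser would send, but streams the file from the open handle instead of reading it all into memory:

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

url = 'https://example.com/api/upload/'  # placeholder URL

# Field names mirror the question; the extra field is just an example.
encoder = MultipartEncoder(fields={
    'description': 'some other field',
    'content': ('myfile', open('myfile', 'rb'), 'application/octet-stream'),
})
response = requests.post(
    url,
    data=encoder,
    headers={'Content-Type': encoder.content_type},
)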
Related
We need to move our video file storage to AWS S3. The old location is a CDN, so I only have a URL for each file (1000+ files, > 1 TB total file size). Running an upload tool directly on the storage server is not an option.
I already created a tool that downloads each file, uploads it to the S3 bucket and updates the DB record with the new HTTP URL, and it works perfectly, except that it takes forever.
Downloading a file takes some time (considering each file is close to a gigabyte) and uploading it takes even longer.
Is it possible to upload the video file directly from cdn to S3, so I could reduce processing time into half? Something like reading chunk of file and then putting it to S3 while reading next chunk.
Currently I use System.Net.WebClient to download the file and AWSSDK to upload.
PS: I have no problem with internet speed; I run the app on a server with a 1 Gbit network connection.
No, there isn't a way to direct S3 to fetch a resource, on your behalf, from a non-S3 URL and save it in a bucket.
The only "fetch"-like operation S3 supports is the PUT/COPY operation, where S3 supports fetching an object from one bucket and storing it in another bucket (or the same bucket), even across regions, even across accounts, as long as you have a user with sufficient permission for the necessary operations on both ends of the transaction. In that one case, S3 handles all the data transfer, internally.
Otherwise, the only way to take a remote object and store it in S3 is to download the resource and then upload it to S3 -- however, there's nothing preventing you from doing both things at the same time.
To do that, you'll need to write some code, using presumably either asynchronous I/O or threads, so that you can simultaneously be receiving a stream of downloaded data and uploading it, probably in symmetric chunks, using S3's Multipart Upload capability, which allows you to write individual chunks (minimum 5MB each) which, with a final request, S3 will validate and consolidate into a single object of up to 5TB. Multipart upload supports parallel upload of chunks, and allows your code to retry any failed chunks without restarting the whole job, since the individual chunks don't have to be uploaded or received by S3 in linear order.
If the origin supports HTTP range requests, you wouldn't necessarily even need to receive a "stream," you could discover the size of the object and then GET chunks by range and multipart-upload them. Do this operation with threads or async I/O handling multiple ranges in parallel, and you will likely be able to copy an entire object faster than you can download it in a single monolithic download, depending on the factors limiting your download speed.
I've achieved aggregate speeds in the range of 45 to 75 Mbits/sec while uploading multi-gigabyte files into S3 from outside of AWS using this technique.
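The question uses .NET, but the idea translates directly. As a rough illustration (not the asker's code), a Python/boto3 sketch of the single-stream variant could look like this; the bucket, key and part size are placeholders:

import boto3
import requests

def stream_url_to_s3(source_url, bucket, key, part_size=8 * 1024 * 1024):
    """Copy source_url into s3://bucket/key without buffering the whole
    object in memory or on disk; every part except the last must be >= 5 MB."""
    s3 = boto3.client('s3')
    upload = s3.create_multipart_upload(Bucket=bucket, Key=key)
    parts = []
    try:
        with requests.get(source_url, stream=True) as resp:
            resp.raise_for_status()
            part_number = 1
            while True:
                chunk = resp.raw.read(part_size)
                if not chunk:
                    break
                result = s3.upload_part(
                    Bucket=bucket, Key=key, UploadId=upload['UploadId'],
                    PartNumber=part_number, Body=chunk)
                parts.append({'ETag': result['ETag'], 'PartNumber': part_number})
                part_number += 1
        s3.complete_multipart_upload(
            Bucket=bucket, Key=key, UploadId=upload['UploadId'],
            MultipartUpload={'Parts': parts})
    except Exception:
        # Clean up the partial upload so S3 doesn't keep billing for the parts.
        s3.abort_multipart_upload(Bucket=bucket, Key=key,
                                  UploadId=upload['UploadId'])
        raise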
This has been answered by me in this question, here's the gist:
object = Aws::S3::Object.new(bucket_name: 'target-bucket', key: 'target-key')
object.upload_stream do |write_stream|
  IO.copy_stream(URI.open('http://example.com/file.ext'), write_stream)
end
This is no 'direct' pull into S3, though. At least it doesn't download each file and then upload it serially; it streams 'through' the client. If you run the above on an EC2 instance in the same region as your bucket, I believe this is as 'direct' as it gets, and as fast as a direct pull would ever be.
If a proxy (Node/Express) is suitable for you, then the portions of code at these two routes could be combined to do a GET/POST fetch chain, retrieving the file and then re-posting the response body to your destination S3 bucket.
Step one produces response.body.
Step two: set the stream in the second snippet to the response from the GET operation in the first, and you will upload the stream (arrayBuffer) from the first fetch to the destination bucket.
I am getting a memory error while uploading a CSV file of around 650 MB with a shape of (10882101, 6).
How can I upload such a file into Postgres using the Django framework?
You haven't shared many details (error logs, which Python package you are using, etc.).
You might like to read Most efficient way to parse a large .csv in python? and https://medium.com/casual-inference/the-most-time-efficient-ways-to-import-csv-data-in-python-cc159b44063d
How I would do it from the Django framework:
I would use Celery to run the job as a background process, as waiting for the file to be uploaded completely before returning a response might give an HTTP timeout.
Celery quickstart with Django
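A rough sketch of that background job (the model, column mapping and batch size here are assumptions, not taken from the question):

import csv
from celery import shared_task
from myapp.models import Measurement  # hypothetical model for the CSV rows

@shared_task
def import_csv(path, batch_size=5000):
    """Read the CSV in batches so the 650 MB file never sits in memory at once."""
    batch = []
    with open(path, newline='') as f:
        for row in csv.DictReader(f):
            batch.append(Measurement(**row))  # assumes columns match model fields
            if len(batch) >= batch_size:
                Measurement.objects.bulk_create(batch)
                batch = []
    if batch:
        Measurement.objects.bulk_create(batch)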
My Flask application will allow the upload of large files (up to 100 MB) to my server. I was wondering how Flask manages the partially uploaded file if the client decides to stop the upload halfway through. I read the documentation about file uploads but wasn't able to find that mentioned.
Does Flask automatically delete the file? How can it know that the user won't retry it? Or do I have to manually delete the aborted files in the temporary folder?
Werkzeug (the library that Flask uses for many tasks including this one) uses a tempfile.TemporaryFile object to receive the WSGI file stream when uploading. The object automatically manages the open file.
The file is immediately deleted from the filesystem; there is no entry in the directory table anymore, but the process retains an open file handle.
When the TemporaryFile object is cleared (no references remain, usually because the request ended), the file object is closed and the operating system clears the disk space used.
As such, the file data is deleted when a request is aborted.
Flask does not handle the case where a user uploads the file again; there is no standard way to handle that anyway. You'd have to come up with your own solution there.
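You can see the underlying behaviour with the standard library alone:

import tempfile

# TemporaryFile has no directory entry (on POSIX it is unlinked as soon as
# it is created), so the data vanishes once the handle is closed.
with tempfile.TemporaryFile() as tmp:
    tmp.write(b'partial upload data')
    tmp.seek(0)
    print(tmp.read())  # the data is still readable through the handle
# leaving the block closes the file object and the OS reclaims the disk space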
I am using Django as a REST server. I am supposed to receive a POST that contains JSON that I should parse. The client is a Salesforce server that is gzipping the request.
To get the request inflated, I use this in VHost:
SetInputFilter DEFLATE
Almost everything looks fine, but when I read request.body or request.read(16000) (the input is pretty small), I always see a truncated body (5 characters are missing).
Any suggestions where to start debugging?
Technically the WSGI specification doesn't support the concept of mutating input filters as middleware, or even within an underlying web server.
The specific issue is that mutating input filters will change the amount of request content, but will not change the CONTENT_LENGTH value in the WSGI environ dictionary.
The WSGI specification says that a valid WSGI application is only allowed to read up to CONTENT_LENGTH bytes from the request content. As a consequence, in the case of compressed request content, where the final request size will end up being greater than what CONTENT_LENGTH specifies, a web framework is likely to truncate the request input before all data is read.
You can find some details about this issue in:
http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html
Although changes in the specification were pushed for, nothing ever happened.
To work around the problem, you would need to implement a WSGI middleware that wraps the Django application. If it detects, by way of the headers passed, that the original content had been compressed but Apache has already decompressed it, it would read all request content until it reaches the end-of-stream marker, ignoring CONTENT_LENGTH, before even passing the request to Django. Having done that, it could then change CONTENT_LENGTH and substitute wsgi.input with a replacement stream that returns the already-read content.
Because the content size could be quite large and of unknown size, reading it all into memory would not necessarily be a good idea. You therefore would likely want to read it in a block at a time and write it out to a temporary file. The wsgi.input would then be replaced with an open file handle on the temporary file and CONTENT_LENGTH replaced with the final size of the file.
If you search properly on the mod_wsgi archives on Google Groups, you should find prior discussions on this and perhaps even some example code.
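In the meantime, a minimal sketch of such a middleware might look like this (how you detect that Apache decompressed the body is deployment-specific, so the header check below is only a placeholder):

import tempfile

class FixedContentLengthMiddleware:
    """Re-read the already decompressed request body to a temporary file,
    then hand the wrapped application a corrected CONTENT_LENGTH and wsgi.input."""

    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        if environ.get('HTTP_CONTENT_ENCODING') == 'gzip':  # placeholder check
            spool = tempfile.TemporaryFile()
            stream = environ['wsgi.input']
            while True:
                block = stream.read(64 * 1024)  # ignore CONTENT_LENGTH, read to EOF
                if not block:
                    break
                spool.write(block)
            size = spool.tell()
            spool.seek(0)
            environ['wsgi.input'] = spool
            environ['CONTENT_LENGTH'] = str(size)
        return self.application(environ, start_response)

You would then wrap the Django application with it in wsgi.py, for example application = FixedContentLengthMiddleware(get_wsgi_application()).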
Is there a way to check the size of the incoming POST in Pyramid, without saving the file to disk and using the os module?
You should be able to check request.content_length. WSGI does not support streaming the request body, so the content length must be specified. If you ever access request.body, request.params or request.POST, it will read the content and save it to disk.
The best way to handle this, however, is as close to the client as possible. Meaning if you are running behind a proxy of any sort, have that proxy reject requests that are too large. Once it gets to Python, something else may have already stored the request to disk.
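As a minimal sketch of the in-application check (the route name, field name and size limit are arbitrary examples):

from pyramid.httpexceptions import HTTPRequestEntityTooLarge
from pyramid.response import Response
from pyramid.view import view_config

MAX_BODY = 10 * 1024 * 1024  # 10 MB, arbitrary example limit

@view_config(route_name='upload', request_method='POST')
def upload(request):
    # content_length comes straight from the request header; nothing is read yet.
    if request.content_length is None or request.content_length > MAX_BODY:
        raise HTTPRequestEntityTooLarge()
    # Only now touch request.POST, which reads (and may spool) the body.
    upload_field = request.POST['content']
    return Response('received ' + upload_field.filename)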