Is there a way to check the size of the incoming POST in Pyramid, without saving the file to disk and using the os module?
You should be able to check request.content_length. WSGI does not support streaming the request body, so the content length must be specified up front. If you ever access request.body, request.params, or request.POST, the content will be read and saved to disk.
The best way to handle this, however, is as close to the client as possible: if you are running behind a proxy of any sort, have that proxy reject requests that are too large. By the time the request reaches Python, something else may already have stored it to disk.
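For illustration, a minimal sketch of that check in a Pyramid view; the route name, renderer, and 10 MB limit are assumptions, not something from the question:

from pyramid.httpexceptions import HTTPRequestEntityTooLarge
from pyramid.view import view_config

MAX_BODY_BYTES = 10 * 1024 * 1024  # hypothetical limit

@view_config(route_name='upload', renderer='json', request_method='POST')
def upload(request):
    # content_length is parsed from the Content-Length header (None if absent).
    if request.content_length is None or request.content_length > MAX_BODY_BYTES:
        raise HTTPRequestEntityTooLarge()
    # Only now touch request.POST / request.body, which reads (and may spool)
    # the content.
    return {'ok': True}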
We have a React application communicating with a Django backend. Whenever the React application wants to upload a file to the backend, we send a form request with one field being the handle of the file being uploaded. The field is received on the Django side as an
InMemoryUploadedFile, which is an object exposing chunks that can be processed, for example, like this:
def save_uploaded_file(uploaded_file, handle):
    """
    Saves the uploaded file using the given file handle.
    We walk the chunks to avoid reading the whole file in memory
    """
    for chunk in uploaded_file.chunks():
        handle.write(chunk)
    handle.flush()
    logger.debug(f'Saved file {uploaded_file.name} with length {uploaded_file.size}')
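For context, a sketch of how a view might call save_uploaded_file; the 'content' field name matches the client code further down, while the target path and response are assumptions:

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def upload(request):
    # request.FILES['content'] is an InMemoryUploadedFile for small uploads
    # or a TemporaryUploadedFile for large ones; chunks() works for both.
    uploaded_file = request.FILES['content']
    with open('/tmp/' + uploaded_file.name, 'wb') as handle:
        save_uploaded_file(uploaded_file, handle)
    return JsonResponse({'name': uploaded_file.name, 'size': uploaded_file.size})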
Now, I am creating some testing framework using requests to drive our API. I am trying to emulate this mechanism, but strangely enough, requests insists on reading from the open handle before sending the request. I am doing:
requests.post(url, data, headers=headers, **kwargs)
with:
data = {'content': open('myfile', 'rb'), ...}
Note that I am not reading from the file; I am just opening it. But requests insists on reading from it and sends the file contents embedded in the request, which has several problems:
it can be huge
by being binary data, it corrupts the request
it is not what my application expects
I do not want this: I want requests simply to "stream" that file, not to read it. There is a files parameter, but that will create a multipart with the file embedded in the request, which is again not what I want. I want all fields in the data to be passed in the request, and the content field to be streamed. I know this is possible because:
the browser does it
Postman does it
the django test client does it
How can I force requests to stream a particular file in the data?
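For what it's worth, a sketch of one way to do this with the requests-toolbelt library (an assumption; it is not mentioned anywhere above): its MultipartEncoder streams a multipart body and reads the file lazily while sending, instead of buffering it:

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

url = 'http://localhost:8000/upload'  # placeholder
encoder = MultipartEncoder(
    fields={
        'description': 'some other form field',  # placeholder extra field
        # (filename, file object, content type) -- read in chunks while sending
        'content': ('myfile', open('myfile', 'rb'), 'application/octet-stream'),
    }
)
response = requests.post(
    url,
    data=encoder,
    headers={'Content-Type': encoder.content_type},
)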
This is probably no longer relevant, but I will share some information that I found in the documentation.
By default, if an uploaded file is smaller than 2.5 megabytes, Django will hold the entire contents of the upload in memory. This means that saving the file involves only a read from memory and a write to disk and thus is very fast. However, if an uploaded file is too large, Django will write the uploaded file to a temporary file stored in your system’s temporary directory.
This way, there is no need to create a streaming file upload. Rather, the solution might be to handle (read) the uploaded file using a buffer.
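For reference, the 2.5 MB threshold quoted above corresponds to Django's FILE_UPLOAD_MAX_MEMORY_SIZE setting; a settings.py sketch (the values shown are Django's defaults, adjust as needed):

# Uploads smaller than this stay in memory as InMemoryUploadedFile;
# larger ones are spooled to disk as TemporaryUploadedFile.
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440  # 2.5 MB, the default
FILE_UPLOAD_TEMP_DIR = None            # None means the system temp directory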
I am trying to upload a blob to Azure Blob Storage with the Python SDK. I want to pass the MD5 hash for validation on the server side after upload.
Here's the code:
blob_service.put_block_blob_from_path(
    container_name='container_name',
    blob_name='upload_dir/'+object_name,
    file_path=object_name,
    content_md5=object_md5Hash
)
But I get this error:
AzureHttpError: The MD5 value specified in the request did not match with the MD5 value calculated by the server.
The file is ~200 MB and the error is thrown instantly; the file is not uploaded at all. So I suspect that it may be comparing the supplied hash with the hash of the first chunk or something.
Any ideas?
This is sort of an SDK bug, in that we should throw a better error message rather than hitting the service, but validating the content of a large upload that has to be chunked simply doesn't work. x_ms_blob_content_md5 will store the MD5 but the service will not validate it; that is something you could do on download, though. content_md5 is validated by the server against the body of a particular request, but since a chunked blob is sent in more than one request it will never match.
So, if the blob is small enough (below BLOB_MAX_DATA_SIZE) to be put in a single request, content_md5 will work fine. Otherwise I'd simply recommend using HTTPS and storing MD5 in x_ms_blob_content_md5 if you think you might want to download with HTTP and validate it on download. HTTPS already provides validation for things like bit flips on the wire so using it for upload/download will do a lot. If you can't upload/download with HTTPS for one reason or another you can consider chunking the blob yourself using the put block and put block list APIs.
FYI: In future versions we do intend to add automatic MD5 calculation for both single put and chunked operations in the library itself which will fully solve this. For the next version, we will add an improved error message if content_md5 is specified for a chunked download.
I reviewed the source code of the put_block_blob_from_path function in the Azure Blob Storage SDK. The behaviour is explained in the function's docstring; please see the excerpt below and refer to https://github.com/Azure/azure-storage-python/blob/master/azure/storage/blob/blobservice.py.
content_md5:
Optional. An MD5 hash of the blob content. This hash is used to
verify the integrity of the blob during transport. When this header
is specified, the storage service checks the hash that has arrived
with the one that was sent. If the two hashes do not match, the
operation will fail with error code 400 (Bad Request).
I think there are two things going on here.
Bug in SDK - I believe you have discovered a bug in the SDK. I looked at the source code for this function on GitHub and found that when a large blob is uploaded in chunks, the SDK first tries to create an empty block blob (with block blobs, this is not required). When it creates the empty block blob, it does not send any data, but you're setting content_md5, so the service compares the content_md5 you sent with the MD5 of the empty content, and because they don't match you get an error.
To fix the issue in the interim, please modify the source code in blobservice.py and comment out the following lines of code:
self.put_blob(
    container_name,
    blob_name,
    None,
    'BlockBlob',
    content_encoding,
    content_language,
    content_md5,
    cache_control,
    x_ms_blob_content_type,
    x_ms_blob_content_encoding,
    x_ms_blob_content_language,
    x_ms_blob_content_md5,
    x_ms_blob_cache_control,
    x_ms_meta_name_values,
    x_ms_lease_id,
)
I have created a new issue on Github for this: https://github.com/Azure/azure-storage-python/issues/99.
Incorrect Usage - I noticed that you're passing the MD5 hash of the file in the content_md5 parameter. This will not work for you. You should actually pass the MD5 hash in the x_ms_blob_content_md5 parameter, so your call should be:
blob_service.put_block_blob_from_path(
    container_name='container_name',
    blob_name='upload_dir/'+object_name,
    file_path=object_name,
    x_ms_blob_content_md5=object_md5Hash
)
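For completeness, a sketch of how object_md5Hash could be computed; Azure's MD5 headers expect the Base64-encoded digest rather than the hex string, and the chunk size here is arbitrary:

import base64
import hashlib

def compute_md5_b64(file_path, chunk_size=4 * 1024 * 1024):
    """Return the Base64-encoded MD5 digest of a file, read in chunks."""
    md5 = hashlib.md5()
    with open(file_path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            md5.update(chunk)
    return base64.b64encode(md5.digest()).decode('ascii')

object_md5Hash = compute_md5_b64(object_name)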
The webApp I'm currently developing requires large JSON files to be requested by the client, built on the server using Python, and sent back to the client. The solution is implemented via CGI and is working correctly in every way.
At this stage I'm just employing various techniques to minimize the size of the resulting JSON objects sent back to the client, which are around 5-10 MB (without going into detail, this is more or less fixed and cannot be lazy loaded in any way).
The host we're using doesn't support mod_deflate or mod_gzip, so we can't configure Apache via .htaccess to gzip content automatically on the server; however, I figure the client will still be able to receive and decode gzipped content as long as the Content-Encoding header is set correctly.
What I was wondering is what the best way to achieve this is. Gzipping something in Python is trivial; I already know how to do that, but the problem is:
How do I compress the data in such a way that printing it to the output stream to send via CGI is both compressed and readable to the client?
The files have to be created on the fly, based upon input data, so storing premade and prezipped files is not an option, and they have to be received via xhr in the webApp.
My initial experiments with compressing the JSON string with gzip and io.StringIO, then printing it to the output stream, caused it to be printed in Python's normal bytes representation, e.g. b'\n\x91\x8c\xbc\xd4\xc6\xd2\x19\x98\x14x\x0f1q!\xdc|C\xae\xe0 and such, which bloated the request to twice its normal size...
I was wondering if someone could point me in the right direction here with how I could accomplish this, if it is indeed possible.
I hope I've articulated my problem correctly.
Thank you.
I guess you use print() (which first converts its argument to a string before sending it to stdout) or sys.stdout.write() (which only accepts str objects).
To write directly on stdout, you can use sys.stdout.buffer, a file-like object that supports bytes objects:
import sys
import gzip
s = 'foo'*100
sys.stdout.buffer.write(gzip.compress(s.encode()))
Which gives valid gzip data:
$ python3 foo.py | gunzip
foofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoofoo
Thanks for the answers Valentin and Phillip!
I managed to solve the issue, both of you contributed to the final answer. Turns out it was a combination of things.
Here's the final code that works:
response = json.JSONEncoder().encode(loadData)
sys.stdout.write('Content-type: application/octet-stream\n')
sys.stdout.write('Content-Encoding: gzip\n\n')
sys.stdout.flush()
sys.stdout.buffer.write(gzip.compress(response.encode()))
After switching over to sys.stdout instead of print for writing the headers, and flushing the stream, the response was read correctly. Which is pretty curious... Always something more to learn.
Thanks again!
I am using Django as a REST server. I am supposed to get a POST that contains JSON that I should parse. The client is a Salesforce server that gzips the request.
To get the request inflated, I use this in VHost:
SetInputFilter DEFLATE
Almost everything looks fine, but when I read request.body or request.read(16000) (the input is pretty small), I always see a truncated body (5 characters are missing).
Any suggestions where to start debugging?
Technically the WSGI specification doesn't support the concept of mutating input filters as middleware, or even within an underlying web server.
The specific issue is that mutating input filters will change the amount of request content, but will not change the CONTENT_LENGTH value in the WSGI environ dictionary.
The WSGI specification says that a valid WSGI application is only allowed to read up to CONTENT_LENGTH bytes from the request content. As a consequence, in the case of compressed request content, where the final request size will end up being greater than what CONTENT_LENGTH specifies, a web framework is likely to truncate the request input before all data is read.
You can find some details about this issue in:
http://blog.dscpl.com.au/2009/10/details-on-wsgi-10-amendmentsclarificat.html
Although changes in the specification were pushed for, nothing ever happened.
To work around the problem, you would need to implement a WSGI middleware wrapped around the Django application. If it detects, by way of the headers passed, that the original content had been compressed, but you know Apache decompressed it, the middleware would read all request content until it reaches the end-of-stream marker, ignoring CONTENT_LENGTH, before even passing the request to Django. Having done that, it could then change CONTENT_LENGTH and substitute wsgi.input with a replacement stream that returns the already-read content.
Because the content could be quite large and of unknown size, reading it all into memory would not necessarily be a good idea. You would therefore likely want to read it a block at a time and write it out to a temporary file. wsgi.input would then be replaced with an open file handle on the temporary file and CONTENT_LENGTH replaced with the final size of the file.
If you search properly on the mod_wsgi archives on Google Groups, you should find prior discussions on this and perhaps even some example code.
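In the meantime, a minimal sketch of the idea; the header check, chunk size, and spool threshold are assumptions rather than code from mod_wsgi or Django:

import tempfile

class DecompressedInputMiddleware(object):
    def __init__(self, application):
        self.application = application

    def __call__(self, environ, start_response):
        # Apache's DEFLATE input filter has already inflated the body, but
        # CONTENT_LENGTH still describes the compressed size.
        if environ.get('HTTP_CONTENT_ENCODING') == 'gzip':
            source = environ['wsgi.input']
            spool = tempfile.SpooledTemporaryFile(max_size=1024 * 1024)
            length = 0
            while True:
                chunk = source.read(65536)
                if not chunk:  # end-of-stream marker
                    break
                spool.write(chunk)
                length += len(chunk)
            spool.seek(0)
            environ['wsgi.input'] = spool
            environ['CONTENT_LENGTH'] = str(length)
            environ.pop('HTTP_CONTENT_ENCODING', None)
        return self.application(environ, start_response)

# e.g. application = DecompressedInputMiddleware(get_wsgi_application())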
I have an endpoint in my Flask application that accepts large amounts of data as the request content. I would like to ensure that Flask never attempts to process this body, regardless of its content type, and that I can always read it with the Request.stream interface.
This applies only to a couple of endpoints, not my entire application.
How can I configure this?
The Werkzeug Request object relies heavily on properties, and anything that touches the request data is lazily evaluated and cached; e.g. only when you actually access the .form attribute does any parsing take place, with the result cached.
In other words, don't touch .files, .form, .get_data(), etc., and nothing will be pulled into memory; you can read the raw body from request.stream instead.
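A minimal sketch of such an endpoint, reading the raw body incrementally through request.stream; the route and chunk size are illustrative:

from flask import Flask, request

app = Flask(__name__)

@app.route('/ingest', methods=['POST'])
def ingest():
    total = 0
    while True:
        chunk = request.stream.read(64 * 1024)
        if not chunk:
            break
        total += len(chunk)  # process each chunk instead of buffering the body
    # Don't touch request.form, request.files or request.get_data() here,
    # since accessing them would parse/buffer the body.
    return {'received_bytes': total}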