python post large files to django

I am trying to find the best (most efficient) way to post large files from a Python application to a Django server.
If I rely on raw_post_data on the Django side, then all the content needs to be in RAM before I can read it, which doesn't seem efficient at all if the received file is hundreds of megabytes.
Is it better to use the file upload methods Django has, i.e. a multipart/form-data POST?
Or is there something better?
Laurent

I think only files smaller than 2.5 MB are kept in memory; any file larger than 2.5 MB is streamed to a temporary file in the temp directory.
reference:
http://simonwillison.net/2008/Jul/1/uploads/ and here http://docs.djangoproject.com/en/dev/topics/http/file-uploads/
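To make that streaming behaviour concrete, here is a minimal sketch of a Django view that consumes the upload in chunks instead of reading it all at once. The view name, the 'file' form field and the destination path are assumptions for illustration, not something from the posts above.

from django.http import HttpResponse

def upload_view(request):
    # request.FILES gives an UploadedFile; iterating over chunks() means the
    # whole upload never has to sit in memory at once (chunks are 64 KB by default).
    uploaded = request.FILES['file']
    with open('/tmp/' + uploaded.name, 'wb') as destination:
        for chunk in uploaded.chunks():
            destination.write(chunk)
    return HttpResponse('ok')

Whether the upload lands in memory or in a temporary file before it reaches the view is decided by Django's upload handlers (see the FILE_UPLOAD_MAX_MEMORY_SIZE discussion further down).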

If you really want to optimize it and don't want Django to suffer while the bytes are being streamed in (and thus occupying one of the Django threads), you can use the nginx upload module
(see also this blog post)

Related

Uploading large files with Django: How should one go about doing this?

I have to upload >= 20 GBs of data with Django.
Should I break the file into chunks and then upload it with some kind of checksum to maintain integrity, or does Django do that implicitly?
Would it be better to use FTP instead of regular HTTP for such large files?
Django uses so-called upload handlers to handle uploaded files, and has a related setting called FILE_UPLOAD_MAX_MEMORY_SIZE (default value 2.5 MB). Files smaller than this threshold are handled in memory; larger files are streamed into a temporary file on disk. I haven't yet tried uploading files larger than about 1 GB, but I would expect you can just use Django without problems.
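For reference, a sketch of the relevant settings; the values and the temp-dir path are illustrative, but the two handler classes are the stock Django ones.

# settings.py -- the upload-related knobs mentioned above (values are examples)
FILE_UPLOAD_MAX_MEMORY_SIZE = 2621440             # 2.5 MB default: above this, spill to disk
FILE_UPLOAD_TEMP_DIR = '/var/tmp/django_uploads'  # example path; defaults to the system temp dir
FILE_UPLOAD_HANDLERS = [
    'django.core.files.uploadhandler.MemoryFileUploadHandler',     # small files, kept in RAM
    'django.core.files.uploadhandler.TemporaryFileUploadHandler',  # large files, streamed to disk
]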

python make huge file persist in memory

I have a Python script that needs to read a huge file into a variable and then search through it and perform other operations.
The problem is that the web server calls this script multiple times, and every time I see a latency of around 8 seconds while the file loads.
Is it possible to make the file persist in memory so that it can be accessed faster at later times?
I know I could run the script as a service using supervisor, but I can't do that for this.
Any other suggestions, please?
PS: I am already using var = pickle.load(open(file))
You should take a look at http://docs.h5py.org/en/latest/. It allows you to perform various operations on huge files; it's what NASA uses.
Not an easy problem. I assume you can do nothing about the fact that your web server calls your application multiple times. In that case I see two solutions:
(1) Write TWO separate applications. The first application, A, loads the large file and then just sits there, waiting for the other application to access the data. "A" provides access as required, so it's basically a sort of custom server. The second application, B, is the one that gets called multiple times by the web server. On each call, it extracts the necessary data from A using some form of interprocess communication; this ought to be relatively fast. The Python standard library offers some tools for interprocess communication (socket, http server), but they are rather low-level. Alternatives are almost certainly going to be operating-system dependent. (A minimal sketch of this idea follows after solution (2).)
(2) Perhaps you can pre-digest or pre-analyze the large file, writing out a more compact file that can be loaded quickly. A similar idea is suggested by tdelaney in his comment (some sort of database arrangement).
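Here is a minimal sketch of solution (1) using the standard library's multiprocessing.connection module; the port, authkey, file name and query protocol are all assumptions made for illustration.

# data_server.py -- application "A": load the big pickle once, answer queries.
import pickle
from multiprocessing.connection import Listener

data = pickle.load(open('huge_file.pkl', 'rb'))   # pay the ~8-second load cost once

with Listener(('localhost', 6000), authkey=b'change-me') as listener:
    while True:
        with listener.accept() as conn:
            key = conn.recv()             # whatever query the web scripts need to send
            conn.send(data.get(key))      # assumes the pickled data behaves like a dict

# client.py -- application "B": called by the web server, asks "A" for data.
from multiprocessing.connection import Client

with Client(('localhost', 6000), authkey=b'change-me') as conn:
    conn.send('some_key')
    result = conn.recv()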
You are talking about memory-caching a large array, essentially…?
There are three fairly viable options for large arrays:
use memory-mapped arrays
use h5py or pytables as a back-end
use an array caching-aware package like klepto or joblib.
Memory-mapped arrays index the array on file as if it were in memory.
h5py or pytables give you fast access to arrays on disk, and can also avoid loading the entire array into memory. klepto and joblib can store arrays as a collection of "database" entries (typically a directory tree of files on disk), so you can load portions of the array into memory easily. Each has a different use case, so the best choice for you depends on what you want to do. (I'm the klepto author, and it can use SQL database tables as a backend instead of files.)
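As a quick illustration of the first two options, a sketch using numpy.memmap and h5py; the file names, dtype, shape and dataset name are placeholders.

import numpy as np
import h5py

# 1) Memory-mapped array: the OS pages in only the parts you actually index.
big = np.memmap('big_array.dat', dtype='float64', mode='r', shape=(100000000,))
window = big[5000000:5001000]        # touches ~8 KB of an ~800 MB array

# 2) HDF5 via h5py: slicing a dataset reads just that slice from disk.
with h5py.File('big_array.h5', 'r') as f:
    dset = f['data']                 # assumes a dataset named 'data' exists
    window2 = dset[5000000:5001000]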

Python Azure blob storage: upload a file bigger than 64 MB

From the sample code, I can upload 64 MB without any problem:
myblob = open(r'task1.txt', 'r').read()
blob_service.put_blob('mycontainer', 'myblob', myblob, x_ms_blob_type='BlockBlob')
What if I want to upload a bigger file?
Thank you
I ran into the same problem a few days ago, and was lucky enough to find this. It breaks up the file into chunks and uploads it for you.
I hope this helps. Cheers!
I'm not a Python programmer, but here are a few extra tips I can offer (my stuff is all in C):
Use HTTP PUT operations (comp=block option) for as many blocks (4 MB each) as required for your file, and then a final Put Block List (comp=blocklist option) that coalesces the blocks. If your block uploads fail or you need to abort, the cleanup for deleting the partial set of previously uploaded blocks is a DELETE command on the blob you are trying to create, but this appears to be supported only by the 2013-08-15 version (someone from Azure support should confirm this).
If you need to add meta information, an additional PUT operation (with the comp=metadata option) is what I do when using the Block List method. There might be a more efficient way to tag the meta information without requiring an additional PUT, but I'm not aware of one.
This is a good question. Unfortunately I don't see a real implementation for uploading arbitrarily large files, so from what I can tell there is much more work to do on the Python SDK, unless I am missing something really crucial.
The sample code provided in the documentation indeed uses just a single text file and uploads it at once. There is no code yet (from what I see in the SDK source) to support uploading larger files.
So, to work with blobs from Python, you need to understand how Azure Blob Storage works. Start here.
Then take a quick look at the REST API documentation for the Put Blob operation. The remarks mention:
The maximum upload size for a block blob is 64 MB. If your blob is larger than 64 MB, you must upload it as a set of blocks. For more information, see the Put Block (REST API) and Put Block List (REST API) operations. It's not necessary to call Put Blob if you upload the blob as a set of blocks.
The good news is that put_block and put_block_list are implemented in the Python SDK, but no sample is provided for how to use them. What you have to do is manually split your file into chunks (blocks) of up to 4 MB each, then use the put_block(self, container_name, blob_name, block, blockid, content_md5=None, x_ms_lease_id=None) function from the Python SDK to upload the blocks. You could even upload the blocks in parallel. Do not forget, however, that at the end you also have to call put_block_list(self, container_name, blob_name, block_list, content_md5=None, x_ms_blob_cache_control=None...) to commit all the uploaded blocks.
Unfortunately I'm not enough of a Python expert to help you further, but at least this gives you a good picture of the situation.
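To make that concrete, a hedged sketch of a chunked upload based on the put_block / put_block_list signatures quoted above; the account, container, blob and file names are placeholders, and the import path and block-ID handling may differ between SDK versions.

import base64
from azure.storage import BlobService   # older SDK layout; adjust for your version

CHUNK_SIZE = 4 * 1024 * 1024             # 4 MB, the per-block limit mentioned above
blob_service = BlobService(account_name='myaccount', account_key='mykey')

block_ids = []
with open('bigfile.bin', 'rb') as f:
    index = 0
    while True:
        chunk = f.read(CHUNK_SIZE)
        if not chunk:
            break
        # The REST API wants block IDs that are base64-encoded and of equal length.
        block_id = base64.b64encode('block-{0:08d}'.format(index).encode()).decode()
        blob_service.put_block('mycontainer', 'myblob', chunk, block_id)
        block_ids.append(block_id)
        index += 1

# Commit the uploaded blocks so they become the blob's content.
blob_service.put_block_list('mycontainer', 'myblob', block_ids)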

How can I build and stream a zip/tar archive file by file to the client?

I understand that it's possible to build zip/tar archives "dynamically" when sending them to the browser; one sends the headers then compresses each file and streams those parts to the browser, which can aid in building large archives when server memory is limited.
Is this achievable over WSGI?
In Werkzeug at least, the documentation says
Response can be any kind of iterable or string. If it’s a string it’s considered being an iterable with one item which is the string passed.
so if you can build a generator or some other kind of iterator to serve up the data in chunks, it will get concatenated together and served as one file. (Note: you'll probably also want to pass the direct_passthrough flag to the Response object to do it through Werkzeug.)
If you can't use Werkzeug, you can probably get started by investigating how Werkzeug does it.
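A minimal sketch of the generator-Response idea with Werkzeug. It is not a real zip builder: the archive_members() helper just streams raw file contents chunk by chunk, standing in for whatever streaming zip/tar writer you plug in (e.g. the ZipStream project mentioned in the next answer); the file list is a placeholder.

from werkzeug.wrappers import Response

def archive_members(paths, chunk_size=64 * 1024):
    # Placeholder generator: yield bytes as they are produced so the whole
    # archive never has to be assembled in memory.
    for path in paths:
        with open(path, 'rb') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                yield chunk

def application(environ, start_response):
    paths = ['file1.bin', 'file2.bin']   # hypothetical file list
    response = Response(archive_members(paths),
                        mimetype='application/octet-stream',
                        direct_passthrough=True)
    return response(environ, start_response)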
See also the similar question "Create and stream a large archive without storing it in memory or on disk"; the asker eventually turned to https://github.com/SpiderOak/ZipStream.

storing uploaded photos and documents - filesystem vs database blob

My specific situation
Property management web site where users can upload photos and lease documents. For every apartment unit there might be 4 photos, so there won't be an overwhelming number of photos in the system.
For photos, there will be thumbnails of each.
My question
My #1 priority is performance. For the end user, I want to load pages and show the image as fast as possible.
Should I store the images inside the database, or file system, or doesn't matter? Do I need to be caching anything?
Thanks in advance!
While there are exceptions to everything, the general case is that storing images in the file system is your best bet. You can easily provide caching services to the images, you don't need to worry about additional code to handle image processing, and you can easily do maintenance on the images if needed through standard image editing methods.
It sounds like your business model fits nicely into this scenario.
File system. No contest.
The data has to go through a lot more layers when you store it in the db.
Edit on caching:
If you want to cache the file while the user uploads it to ensure the operation finishes as soon as possible, dumping it straight to disk (i.e. file system) is about as quick as it gets. As long as the files aren't too big and you don't have too many concurrent users, you can 'cache' the file in memory, return to the user, then save to disk. To be honest, I wouldn't bother.
If you are making the files available on the web after they have been uploaded and want to cache to improve performance, the file system is still the best option. You'll get caching for free (you may have to adjust a setting or two) from your web server. You won't get this if the files are in the database.
After all that, it sounds like you should never store files in the database. That's not the case; you just need a good reason to do so.
Definitely store your images on the filesystem. One concern that folks don't consider enough when weighing these options is bloat; cramming images as binary blobs into your database is a really quick way to bloat your DB. With a large database come higher hardware requirements, more difficult replication and backup requirements, etc. Sticking your images on a filesystem means you can back them up and replicate them easily and simply with many existing tools. Storage space is also far easier to increase on a filesystem than in a database.
A comment on Sheepy's answer:
In general, storing files in SQL is better when the file size is less than 256 kilobytes, and worse when it is greater than 1 megabyte. Between 256 and 1024 kilobytes it depends on several factors. Read this to learn more about the reasons to use SQL or the file system.
A DB might be faster than a filesystem for some operations, but loading a well-identified chunk of data hundreds of KB in size is not one of them.
Also, a good frontend web server (like nginx) is way faster than any webapp layer you'd have to write to read the blob from the DB. In some tests nginx is roughly on par with memcached for raw data serving of medium-sized files (like big HTML pages or medium-sized images).
Go with the FS. No contest.
Maybe on a slight tangent, but in this video from the MySQL Conference, the presenter talks about how the website SmugMug uses MySQL and various other technologies for superior performance. I think the video builds upon some of the answers posted here, but also suggests ways of improving website performance outside the scope of the DB.
