Uploading large files with Django: How should one go about doing this? - python

I have to upload >= 20 GB of data with Django.
Should I break the file into chunks and upload it with some kind of checksum to maintain integrity, or does Django do this implicitly?
Would it be better to use FTP instead of regular HTTP for such large files?

Django uses so-called Upload Handlers to process uploaded files, and has a related setting called FILE_UPLOAD_MAX_MEMORY_SIZE (default value of 2.5 MB). Files smaller than this threshold are handled in memory; larger files are streamed into a temporary file on disk. I haven't yet tried uploading files larger than about 1 GB, but I would expect you can just use Django without problems.
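A minimal sketch of the chunk-plus-checksum idea on the Django side, assuming the client sends a "sha256" form field alongside the file; the field names and destination directory are made up for illustration:

import hashlib
import os

from django.http import HttpResponse, HttpResponseBadRequest
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def upload(request):
    uploaded = request.FILES["file"]           # already spooled to disk if larger than FILE_UPLOAD_MAX_MEMORY_SIZE
    expected = request.POST.get("sha256", "")  # checksum the client computed before sending (assumed field name)

    digest = hashlib.sha256()
    dest_path = os.path.join("/data/uploads", os.path.basename(uploaded.name))  # assumed destination
    with open(dest_path, "wb") as dest:
        for chunk in uploaded.chunks():        # iterate in small chunks instead of reading everything at once
            digest.update(chunk)
            dest.write(chunk)

    if expected and digest.hexdigest() != expected:
        return HttpResponseBadRequest("checksum mismatch")
    return HttpResponse("ok")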

Related

Identify new files in FTP and write them to AWS S3

I'm currently using ftplib in Python to get some files and write them to S3.
The approach I'm using is to use with open as shown below:
with open(filename, 'wb') as fp:
    ftp.retrbinary('RETR ' + filename, fp.write)
to download files from FTP server and save them in a temporary folder, then upload them to S3.
I wonder if this is best practice, because the shortcoming of this approach is:
if there are many large files, I can download them, upload them to S3, and then delete them from the temp folder,
but if I run this script once a day I have to download everything again. How can I check whether a file has already been downloaded and exists in S3, so that the script only processes newly added files on the FTP server?
Hope this makes sense, would be great if anyone has an example or something, many thanks.
Cache the fact that you processed a given file path in persistent storage (say, a SQLite database). If a file may change after you processed it, you may be able to detect that by also caching the timestamp from FTP.dir() and/or the size from FTP.size(filename). If that doesn't work, you can also cache a checksum (say, SHA-256) of the file, but then you have to download the file again to recalculate the checksum and see whether it changed. S3 may support a conditional upload (ETag), in which case you would calculate the ETag of the file and upload it with that header set, ideally together with an 'Expect: 100-continue' header, to find out whether S3 already has the file before you upload the data.
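A minimal sketch of the caching idea: remember the path and size of every FTP file already copied to S3 in a local SQLite database, and only transfer entries that are new or changed. The host, credentials, bucket and table names below are placeholders.

import sqlite3
from ftplib import FTP

import boto3

ftp = FTP("ftp.example.com")           # placeholder host/credentials
ftp.login("user", "password")
s3 = boto3.client("s3")

db = sqlite3.connect("processed.db")
db.execute("CREATE TABLE IF NOT EXISTS seen (path TEXT PRIMARY KEY, size INTEGER)")

ftp.voidcmd("TYPE I")                  # binary mode, so SIZE reports byte counts
for name in ftp.nlst():
    size = ftp.size(name)              # needs server support for the SIZE command
    row = db.execute("SELECT size FROM seen WHERE path = ?", (name,)).fetchone()
    if row and row[0] == size:
        continue                       # already uploaded and unchanged (as far as size can tell)

    local_path = "/tmp/" + name
    with open(local_path, "wb") as fp:
        ftp.retrbinary("RETR " + name, fp.write)
    s3.upload_file(local_path, "my-bucket", name)

    db.execute("INSERT OR REPLACE INTO seen (path, size) VALUES (?, ?)", (name, size))
    db.commit()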

How can a SQLite database with more entries be compressed to a smaller filesize?

At first, some context. Currently I'm running some Python scripts which collect data from various sources. Since I expect to get a lot of data, I'm a bit worried about how well my machine can handle big files, so I keep track of how the database evolves. At the current stage it should be no problem, but I noticed that my main database (sqlite3) is not changing in size at all. After some research I found that the file size might stay the same if the database had more entries before (source), which most likely happened in the test stage of my scripts.
I'm backing up my database every day at midnight and noticed that the size of the compressed zip file is getting smaller every day. I'm using a shell script for the backup:
zip -r /backup/$(date +\%Y-\%m-\%d).zip /data
The directory /data contains a few other small files, which should not have been modified between any of the backups.
Why is the file size of the compressed ZIP getting smaller?
If the database file is not growing when you add data, it means that SQLite is reusing free space: database pages which contained rows that were later deleted. These pages are not erased, only marked as free. SQLite does not care about their contents (and will eventually overwrite them), but zip still archives everything, including the stale page contents.
It is quite possible that the data you add compresses better than the unused data it overwrites, which is why the ZIP keeps getting smaller.
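If you actually want the database file on disk to shrink after deletions, rather than just reuse the free pages, SQLite needs an explicit VACUUM (or auto_vacuum enabled when the database is created). A small sketch; "collect.db" is a placeholder name:

import sqlite3

con = sqlite3.connect("collect.db")
con.execute("VACUUM")      # rebuilds the database file and drops the free pages
con.close()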

Manipulating and creating S3 files within Django/python when local system files are needed

I'm using django-storages to store media files in an S3 bucket. However, I am occasionally converting or otherwise fiddling with the file to create new files, and this fiddling has to use files actually on my server (most of the conversion happens using process calls). When done I'd like to save the files back to S3.
In an ideal world, I would not have to do any changes to my functions when moving from local to S3. However, I'm unsure how I would do this considering that I have to create these intermediate local files to fiddle with, and then at the end know that the resultant file (which would also be stored on the local machine) needs to then be copied over to S3.
The best I can come up with is a pair of context managers, one for the source file and one for the destination file. The source one would create a temporary file, copy the contents of the source file into it, and let that copy be used and manipulated. The destination one would take the final desired destination path on S3 and create a temporary local file; on exit it would create a key in the S3 bucket, copy over the contents of the temporary file, and delete it.
But this seems pretty complicated to me. It also requires me to wrap every single function that manipulates these files in two "with" clauses.
The only other solution I can think of is switching over to utilities that only deal with file-like objects rather than filenames, but this means I can't do subprocess calls.
Take a look at the built-in file storage API - this is exactly the use-case for it.
If you're using django-storages and uploading to S3, then you should have a line in your settings module that looks like
DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage'
When you're developing locally and don't want to upload your media files to S3, in your local settings module, just leave it out so it defaults to django.core.files.storage.FileSystemStorage.
In your application code, for the media files that will get saved to S3 when you move from local development to staging, instantiate a Storage object using the class returned from the get_storage_class function, and use this object to manipulate the file. For the temporary local files you're "fiddling" with, don't use this Storage object (i.e. use Python's built-in file-handling functions) unless it's a file you're going to want to save to S3.
When you're ready to start saving stuff on S3, all you have to do is set DEFAULT_FILE_STORAGE = 'storages.backends.s3boto.S3BotoStorage' again, and your code will work without any other tweaks. When that setting is not set, those media files will get saved to the local filesystem under MEDIA_ROOT, again without any need to change your application logic.
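A sketch of the "copy down, fiddle, copy back" flow using the storage API, so the same code runs whether DEFAULT_FILE_STORAGE points at the local filesystem or S3. The convert command and the file names are assumptions for illustration.

import subprocess
import tempfile

from django.core.files import File
from django.core.files.storage import default_storage

def convert_and_save(source_name, dest_name):
    with default_storage.open(source_name, "rb") as src, \
            tempfile.NamedTemporaryFile(suffix=".in") as local_in, \
            tempfile.NamedTemporaryFile(suffix=".out") as local_out:
        # Copy the source out of whatever storage backend is configured.
        for chunk in src.chunks():
            local_in.write(chunk)
        local_in.flush()

        # Tools that insist on real filesystem paths can now work on the temp files.
        subprocess.check_call(["convert", local_in.name, local_out.name])  # placeholder command

        # Push the result back through the storage API (local FS or S3 alike).
        local_out.seek(0)
        default_storage.save(dest_name, File(local_out))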

Where does Flask store uploaded files?

Where does Flask store uploaded files before the application code has a chance to save the file? Unless I've missed something it doesn't appear to be showing up in the /tmp directory, which is what I'd have expected, and obviously it's not showing up in the directory I've specified in app.config['UPLOAD_DIRECTORY']. It's not storing it in memory, is it?
Did you check the documentation? It seems pretty clear:
So how exactly does Flask handle uploads? Well, it will store them in the webserver's memory if the files are reasonably small, otherwise in a temporary location (as returned by tempfile.gettempdir())
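You can verify both halves of that yourself: the spool location is whatever tempfile.gettempdir() reports, and the file only ends up in your own directory once you call save() on it. A small sketch; the UPLOAD_DIRECTORY value is a placeholder:

import os
import tempfile

from flask import Flask, request

app = Flask(__name__)
app.config["UPLOAD_DIRECTORY"] = "/srv/uploads"   # placeholder path

@app.route("/upload", methods=["POST"])
def upload():
    f = request.files["file"]                     # werkzeug FileStorage, spooled to memory or gettempdir()
    # (in real code, sanitize f.filename, e.g. with werkzeug's secure_filename)
    f.save(os.path.join(app.config["UPLOAD_DIRECTORY"], f.filename))
    return "saved; temp spool dir is %s" % tempfile.gettempdir()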

python post large files to django

I am trying to find the best (most efficient) way to post large files from a Python application to a Django server.
If I rely on raw_post_data on the Django side, then all the content needs to be in RAM before I can read it, which doesn't seem efficient at all if the received file is hundreds of megabytes.
Is it better to use the file upload methods Django has, i.e. a multipart/form-data POST?
Or maybe something better?
Laurent
I think only files smaller than 2.5 MB are kept in memory; any file larger than 2.5 MB is streamed to a temporary file in the temp directory.
reference:
http://simonwillison.net/2008/Jul/1/uploads/ and here http://docs.djangoproject.com/en/dev/topics/http/file-uploads/
If you really want to optimize it, and don't want Django to suffer while the bytes are being streamed (thus occupying one of the Django threads), you can use the nginx upload module
(see also this blog post)
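On the client side, here is a sketch of a multipart POST that streams the file instead of building the whole request body in RAM, assuming the requests and requests_toolbelt packages are acceptable (plain requests assembles the complete multipart body in memory):

import requests
from requests_toolbelt.multipart.encoder import MultipartEncoder

encoder = MultipartEncoder(fields={
    "file": ("big.bin", open("big.bin", "rb"), "application/octet-stream"),
})
response = requests.post(
    "http://example.com/upload/",                  # assumed Django endpoint
    data=encoder,                                  # streamed chunk by chunk
    headers={"Content-Type": encoder.content_type},
)
print(response.status_code)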
