S3 and Filepicker.io - Multi-file, zipped download on the fly - python

I am using Ink (Filepicker.io) to perform multi-file uploads and it is working brilliantly.
However, a quick look around the internet shows that multi-file downloads are more complicated. While I know it is possible to spin up an EC2 instance to zip the files on the fly, this would entail some wait time on the user's part, and the newly created archive would not be available on my CloudFront distribution immediately.
Has anybody done this before, and what are the practical UX implications - is the wait time significant enough to negatively affect the user experience?
The obvious solution would be to create the zipped files ahead of time, but this would result in some (unnecessary?) redundancy.
What is the best way to avoid redundant storage while reducing wait times for on-the-fly folder compression?
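For context, the server-side on-the-fly approach I have in mind would look roughly like the sketch below (the URLs and file names are placeholders, not my real setup):

import io
import urllib.request
import zipfile

def build_zip(files):
    # Fetch each file over HTTP and stream it into an in-memory ZIP archive.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, url in files.items():
            with urllib.request.urlopen(url) as response:
                archive.writestr(name, response.read())
    buffer.seek(0)
    return buffer  # hand this to the web framework as the response body

# Placeholder CloudFront URLs, for illustration only.
zipped = build_zip({
    "photo1.jpg": "https://example.cloudfront.net/photo1.jpg",
    "photo2.jpg": "https://example.cloudfront.net/photo2.jpg",
})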

You can create the ZIP archive on the client side using JavaScript. Check out:
http://stuk.github.io/jszip/

Related

how to efficiently rename a lot of blobs in GCS

Let's say that on Google Cloud Storage I have a bucket, bucket1, and inside this bucket I have thousands of blobs that I want to rename in this way:
Original blob:
bucket1/subfolder1/subfolder2/data_filename.csv
to: bucket1/subfolder1/subfolder2/data_filename/data_filename_backup.csv
subfolder1, subfolder2, and data_filename.csv can all have different names, but every blob should be renamed following the pattern above.
What is the most efficient way to do this? Can I use Python for that?
You can use whatever programming language you want for which Google offers an SDK for working with Cloud Storage. There is not going to be much of an advantage to any particular language you choose.
There is not really an "efficient" way of doing this. What you will end up doing in your code is pretty standard:
List the objects that you want to rename.
Iterate that list.
For each object, change the name.
You will get better performance overall if you run the code in a Google Cloud Shell or other Google Cloud compute environment in the same region as your bucket.
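For illustration, a minimal sketch of those three steps with the google-cloud-storage client might look like this (the renaming logic assumes exactly the layout described in the question):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket1")

# 1. List the objects, 2. iterate, 3. "rename" each one (copy + delete, see below).
for blob in client.list_blobs("bucket1"):
    if "/" not in blob.name:
        continue                      # skip anything not inside a subfolder
    path, filename = blob.name.rsplit("/", 1)
    if "." not in filename:
        continue                      # skip blobs without an extension
    stem, ext = filename.rsplit(".", 1)
    new_name = f"{path}/{stem}/{stem}_backup.{ext}"
    bucket.copy_blob(blob, bucket, new_name)  # copy to the new name...
    blob.delete()                             # ...then delete the original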
If you have a lot of renames to perform, I recommend doing them concurrently (use several threads rather than renaming sequentially).
Indeed, you have to know how Cloud Storage works: a rename operation doesn't exist. If you look into the Python library you can see what is actually done: a copy followed by a delete.
The copy can take time if your files are large; the delete is pretty fast. But in both cases it's an API call, and each call takes time (about 50 ms if you are in the same region).
If you can perform 200 or 500 operations concurrently, you will significantly reduce the processing time. It's easier in Go or Node, but you can do the same in Python, for example with asyncio or a thread pool.
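As a rough sketch of the concurrent variant (using a thread pool rather than await, since the standard google-cloud-storage client is synchronous; the worker count is just a guess to tune):

from concurrent.futures import ThreadPoolExecutor
from google.cloud import storage

client = storage.Client()
bucket = client.bucket("bucket1")

def rename(blob):
    # "Rename" = copy the object to its new name, then delete the original.
    path, filename = blob.name.rsplit("/", 1)
    stem, ext = filename.rsplit(".", 1)
    bucket.copy_blob(blob, bucket, f"{path}/{stem}/{stem}_backup.{ext}")
    blob.delete()

blobs = [
    b for b in client.list_blobs("bucket1")
    if "/" in b.name and "." in b.name.rsplit("/", 1)[-1]
]
with ThreadPoolExecutor(max_workers=50) as pool:
    list(pool.map(rename, blobs))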

Treating a directory like a file in Python

We have a tool which is designed to allow vendors to deliver files to a company and update their database. These files (generally of predetermined types) go through our web-based transport system, a new record is created in the db for each one, and the files are moved into a new structure when delivered.
We have a new request from a client to use this tool to pass through entire directories without parsing every record. Imagine the client makes digital cars: this tool allows the delivery of the digital nuts and bolts and tracks each part, but they also want to deliver a directory with all of the assets that went into creating a digital bolt, without adding each asset as a new record.
The issue is that the original code doesn't have a nice way to handle these passthrough folders, and it would require a lot of rewriting to make it work. We'd obviously need to create a new function around the directory walk that pulls out each folder matching the passthrough pattern and handles it separately. The problem is that the tools which do the transport, db entry, and delivery all expect files, not folders.
My thinking: what if we could treat that entire folder as a file? That way the current file-level tools don't need to be modified; we'd just need to add the "conversion" step. After generating the manifest, what if we used a library to turn the folder into a "file", send that, and then turn it back into a "folder" after ingest. The most obvious way to do that is a ZIP file - and the current delivery tool does handle ZIPs - but compression is slow, some of these deliveries are very large, and if something goes wrong during transport the entire ZIP transfer fails.
Is there a method which we can use which doesn't necessarily compress the files but just somehow otherwise can treat a directory and all its contents like a file, so the rest of the code doesn't need to be rewritten? Or something else I'm missing entirely?
Thanks!
You could use tar files. Python has great support for them (the tarfile module in the standard library), and it is customary in *nix environments to use them as backup files. For compression you could use gzip (also supported by the standard library and great for streaming).
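A minimal sketch with the standard library's tarfile module (the paths are placeholders): pack the directory without compression on the way out, unpack it after ingest.

import tarfile

# Pack: treat the whole directory as a single file (no compression; use "w:gz" for gzip).
with tarfile.open("delivery.tar", "w") as tar:
    tar.add("bolt_assets/", arcname="bolt_assets")

# Unpack after ingest.
with tarfile.open("delivery.tar", "r") as tar:
    tar.extractall(path="ingested/")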

Python Unit Testing with PyTables and HDF5

What is a proper way to do Unit Testing with file IO, especially if it involves PyTables and HDF5?
My application revolves around storing and retrieving Python data in HDF5 files. So far I simply write the HDF5 files in the unit tests myself and load them for comparison. The problem is that I cannot be sure that whoever else runs the tests has permission to actually write files to disk. (This probably gets even worse when I want to use automated test frameworks like Jenkins, but I haven't checked that yet.)
What is a proper way to handle these situations? Is it best practice to create a /tmp/ folder at a particular place where write access is very likely to be granted? If so, where is that? Or is there an easy and straight forward way to mock PyTables writing and reading?
Thanks a lot!
How about using the module "tempfile" to create the files?
http://docs.python.org/2/library/tempfile.html
I don't know if it's guaranteed to work on all platforms but I bet it does work on most common ones. It would certainly be better practice than hardcoding "/tmp" as the destination.
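For example, a rough sketch combining tempfile with PyTables (the array is just a placeholder payload):

import os
import tempfile

import tables

# Write the test file into a temporary directory, which is cleaned up automatically.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "test.h5")
    with tables.open_file(path, mode="w") as h5file:
        h5file.create_array(h5file.root, "data", [1, 2, 3])
    with tables.open_file(path, mode="r") as h5file:
        assert list(h5file.root.data[:]) == [1, 2, 3]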
Another way would be to create an HDF5 database in memory so that no file I/O is required.
http://pytables.github.io/cookbook/inmemory_hdf5_files.html
I obtained that link by googling "hdf5 in memory" so I can't say for sure how well it works.
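Based on that cookbook page, it would look roughly like this (untested sketch):

import tables

# The CORE driver keeps the whole file in memory;
# driver_core_backing_store=0 means nothing is written to disk.
h5file = tables.open_file("in_memory.h5", "w",
                          driver="H5FD_CORE",
                          driver_core_backing_store=0)
h5file.create_array(h5file.root, "data", [1, 2, 3])
print(h5file.root.data[:])
h5file.close()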
I think the best practice would be writing all test cases to run against both an in-memory database and a tempfile database. This way, even if one of the above techniques fails for the user, the rest of the tests will still run. Also you can separately identify whether bugs are related to file-writing or something internal to the database.
Fundamentally, HDF5 and PyTables are I/O libraries. They provide an API for file system manipulation. Therefore, if you really want to test PyTables / HDF5 you have to hit the file system. There is no way around this. If a user does not have write access on a system, they cannot run the tests, or at least they cannot run realistic tests.
You can use the in-memory file driver to do testing. This is useful for speeding up most tests and for testing higher-level functionality. However, even if you go this route you should still have a few tests which actually write out real files. If those fail, you know that something is wrong.
Normally, people create the temporary h5 files in the tests directory. But if you are truly worried about the user not having write access to this dir, you should use tempfile.gettempdir() to find their environment's correct /tmp dir. Note that this is cross-platform so should work everywhere. Put the h5 files that you create there and remember to delete them afterwards!

Python Azure blob storage upload file bigger than 64 MB

From the sample code, I can upload 64MB, without any problem:
myblob = open(r'task1.txt', 'r').read()
blob_service.put_blob('mycontainer', 'myblob', myblob, x_ms_blob_type='BlockBlob')
What if I want to upload a bigger file?
Thank you
I ran into the same problem a few days ago, and was lucky enough to find this. It breaks up the file into chunks and uploads it for you.
I hope this helps. Cheers!
I'm not a Python programmer, but here are a few extra tips I can offer (my stuff is all in C):
Use HTTP PUT operations (the comp=block option) for as many blocks (4 MB each) as required for your file, and then a final Put Block List (the comp=blocklist option) that coalesces the blocks. If your block uploads fail or you need to abort, the cleanup is a DELETE request for the blob you were creating, which removes the partial set of blocks previously uploaded; this appears to be supported by the 2013-08-15 version only (someone from Azure support should confirm this).
If you need to add meta information, an additional PUT operation (with the comp=metadata option) is what I do when using the block list method. There might be a more efficient way to tag the meta information without requiring an additional PUT, but I'm not aware of it.
This is a good question. Unfortunately I don't see a real implementation for uploading arbitrarily large files, so from what I can tell there is much more work to do on the Python SDK, unless I am missing something really crucial.
The sample code provided in the documentation indeed uses just a single text file and uploads it in one go. From what I see in the SDK source code, there is no code yet that supports uploading larger files.
So, to work with blobs from Python, you need to understand how Azure Blob Storage works. Start here.
Then take a quick look at the REST API documentation for the Put Blob operation. It is mentioned in the remarks:
The maximum upload size for a block blob is 64 MB. If your blob is larger than 64 MB, you must upload it as a set of blocks. For more information, see the Put Block (REST API) and Put Block List (REST API) operations. It's not necessary to call Put Blob if you upload the blob as a set of blocks.
The good news is that Put Block and Put Block List are implemented in the Python SDK, but no sample is provided showing how to use them. What you have to do is manually split your file into chunks (blocks) of up to 4 MB each, and then use the put_block(self, container_name, blob_name, block, blockid, content_md5=None, x_ms_lease_id=None) function from the Python SDK to upload the blocks. Ideally you will upload the blocks in parallel. Do not forget, however, that you also have to execute put_block_list(self, container_name, blob_name, block_list, content_md5=None, x_ms_blob_cache_control=None, ...) at the end to commit all the uploaded blocks.
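To make that concrete, here is a rough sketch built only on the two signatures quoted above (the 4 MB chunk size matches the docs, but the block-ID scheme and the manual base64 encoding are my assumptions; double-check against the SDK source):

import base64

CHUNK_SIZE = 4 * 1024 * 1024  # stay at or under the 4 MB block limit

def upload_in_blocks(blob_service, container_name, blob_name, path):
    # Split the file into blocks, upload each block, then commit the block list.
    block_ids = []
    index = 0
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(CHUNK_SIZE)
            if not chunk:
                break
            # All block IDs in a blob must have the same length; a padded counter is one convention.
            block_id = base64.b64encode('block{0:08d}'.format(index).encode()).decode()
            blob_service.put_block(container_name, blob_name, chunk, block_id)
            block_ids.append(block_id)
            index += 1
    # Commit the blocks in order to assemble the final blob.
    blob_service.put_block_list(container_name, blob_name, block_ids)

# e.g. with the blob_service from the question:
# upload_in_blocks(blob_service, 'mycontainer', 'myblob', 'bigfile.dat')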
Unfortunately I'm not enough of a Python expert to help you further, but at least this gives you a good picture of the situation.

storing uploaded photos and documents - filesystem vs database blob

My specific situation
Property management web site where users can upload photos and lease documents. For every apartment unit there might be 4 photos, so there won't be an overwhelming number of photos in the system.
For photos, there will be thumbnails of each.
My question
My #1 priority is performance. For the end user, I want to load pages and show the image as fast as possible.
Should I store the images inside the database, or file system, or doesn't matter? Do I need to be caching anything?
Thanks in advance!
While there are exceptions to everything, the general case is that storing images in the file system is your best bet. You can easily provide caching services to the images, you don't need to worry about additional code to handle image processing, and you can easily do maintenance on the images if needed through standard image editing methods.
It sounds like your business model fits nicely into this scenario.
File system. No contest.
The data has to go through a lot more layers when you store it in the db.
Edit on caching:
If you want to cache the file while the user uploads it to ensure the operation finishes as soon as possible, dumping it straight to disk (i.e. file system) is about as quick as it gets. As long as the files aren't too big and you don't have too many concurrent users, you can 'cache' the file in memory, return to the user, then save to disk. To be honest, I wouldn't bother.
If you are making the files available on the web after they have been uploaded and want caching to improve performance, the file system is still the best option. You'll get caching for free (you may have to adjust a setting or two) from your web server. You won't get this if the files are in the database.
After all that it sounds like you should never store files in the database. Not the case, you just need a good reason to do so.
Definitely store your images on the filesystem. One concern that folks don't consider enough when considering these types of things is bloat; cramming images as binary blobs into your database is a really quick way to bloat your DB way up. With a large database comes higher hardware requirements, more difficult replication and backup requirements, etc. Sticking your images on a filesystem means you can back them up / replicate them with many existing tools easily and simply. Storage space is far easier to increase on filesystem than in database, as well.
A comment on Sheepy's answer:
In general, storing files in SQL is better when the file size is less than 256 KB, and worse when it is greater than 1 MB. Between 256 KB and 1 MB it depends on several factors. Read this to learn more about the reasons to use SQL or the file system.
A DB might be faster than a filesystem for some operations, but loading a well-identified chunk of data a few hundred KB in size is not one of them.
Also, a good frontend webserver (like nginx) is way faster than any webapp layer you'd have to write to read the blob from the DB. In some tests nginx is roughly on par with memcached for raw serving of medium-sized files (like big HTML pages or medium-sized images).
Go FS. No contest.
Maybe on a slight tangent, but in this video from the MySQL Conference, the presenter talks about how the website SmugMug uses MySQL and various other technologies for superior performance. I think the video builds upon some of the answers posted here, but it also suggests ways of improving website performance outside the scope of the DB.
