Guarantee of data integrity with Python Boto and Amazon Glacier concurrent.ConcurrentUploader?

I am using Python and Boto in a script to copy several files from my local disks, turn them into .tar files and upload to AWS Glacier.
I based my script on:
http://www.withoutthesarcasm.com/using-amazon-glacier-for-personal-backups/#highlighter_243847
Which uses the concurrent.ConcurrentUploader
I am just curious: how sure can I be that the data is all in Glacier after successfully getting an ID back? Does the ConcurrentUploader do any kind of hash checking to ensure all the bits arrived?
I want to remove the files from my local disk but fear I should be implementing some kind of hash check... I am hoping this is happening under the hood. I have tried retrieving a couple of archives and was able to un-tar them successfully. Just trying to be very cautious.
Does anyone know if there is checking under the hood that all pieces of the transfer were successfully uploaded? If not, does anyone have any python example code of how to implement an upload with hash checking?
Many thanks!
Boto Concurrent Uploader Docs:
http://docs.pythonboto.org/en/latest/ref/glacier.html#boto.glacier.concurrent.ConcurrentUploader
UPDATE:
Looking at the actual Boto code (https://github.com/boto/boto/blob/develop/boto/glacier/concurrent.py), line 132 appears to show that the hashes are computed automatically, but I am unclear what the
[None] * total_parts
means. Does this mean that the hashes are indeed calculated or is this left to the user to implement?

Glacier itself is designed to try to make it impossible for any application to complete a multipart upload without an assurance of data integrity.
http://docs.aws.amazon.com/amazonglacier/latest/dev/api-multipart-complete-upload.html
The API call that returns the archive id is sent with the "tree hash" (a SHA-256 of the SHA-256 hashes of each MiB of the uploaded content, combined pairwise up a tree into a single root hash) and the total bytes uploaded. If these don't match what was actually uploaded in each part (each of which was, meanwhile, also being validated against its own SHA-256 hash and sub-tree-hash as it was uploaded), then the "complete multipart upload" operation will fail.
It should be virtually impossible by the design of the Glacier API for an application to "successfully" upload a file that isn't intact and yet return an archive id.
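If you ever want to double-check an archive yourself, the tree hash described above is straightforward to compute independently. This is a minimal sketch of that calculation (the function name is just illustrative; boto has its own internal implementation):

import hashlib

MIB = 1024 * 1024

def glacier_tree_hash(data):
    # Hash each 1 MiB chunk of the payload individually ...
    chunks = [data[i:i + MIB] for i in range(0, len(data), MIB)] or [b'']
    hashes = [hashlib.sha256(chunk).digest() for chunk in chunks]
    # ... then repeatedly combine adjacent pairs until a single root hash remains.
    while len(hashes) > 1:
        next_level = []
        for i in range(0, len(hashes), 2):
            if i + 1 < len(hashes):
                next_level.append(hashlib.sha256(hashes[i] + hashes[i + 1]).digest())
            else:
                # An odd hash left at the end of a level is carried up unchanged.
                next_level.append(hashes[i])
        hashes = next_level
    return hashes[0].hex()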

Yes, the concurrent uploader does compute a tree hash of each part and a linear hash of the entire uploaded payload. The line:
[None] * total_parts
just creates a list containing total_parts None values. Later, the None values are replaced by the appropriate tree hash for each particular part. Then, finally, the list of tree hash values is used to compute the final linear hash of the entire upload.
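As a quick illustration of that idiom (not the boto source itself):

total_parts = 4
tree_hashes = [None] * total_parts   # [None, None, None, None]: one placeholder slot per part
# when the part at index 2 finishes uploading, its slot gets filled in:
tree_hashes[2] = b'<sha256 tree hash of part 3>'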
So, there are a lot of integrity checks happening as required by the Glacier API.

Related

How can one use the StorageStreamDownloader to stream download from a blob and stream upload to a different blob?

I believe I have a very simple requirement for which a solution has befuddled me. I am new to the azure-python-sdk and have had little success with its new blob streaming functionality.
Some context
I have used the Java SDK for several years now. Each CloudBlockBlob object has a BlobInputStream and a BlobOutputStream object. When a BlobInputStream is opened, one can invoke its many functions (most notably its read() function) to retrieve data in a true-streaming fashion. A BlobOutputStream, once retrieved, has a write(byte[] data) function where one can continuously write data as frequently as they want until the close() function is invoked. So, it was very easy for me to:
Get a CloudBlockBlob object, open its BlobInputStream and essentially get back an InputStream that was 'tied' to the CloudBlockBlob. It usually maintained 4MB of data - at least, that's what I understood. When some amount of data is read from its buffer, the same amount of new data is introduced, so it always has approximately 4MB of new data (until all data is retrieved).
Perform some operations on that data.
Retrieve the CloudBlockBlob object that I am uploading to, get its BlobOutputStream, and write to it the data I performed the operations on.
A good example of this is if I wanted to compress a file. I had a GzipStreamReader class that would accept a BlobInputStream and a BlobOutputStream. It would read data from the BlobInputStream and, whenever it had compressed some amount of data, write to the BlobOutputStream. It could call write() as many times as it wished; when it finished reading all the data, it would close both Input and Output streams, and all was good.
Now for Python
Now, the Python SDK is a little different, and obviously for good reason; the io module works differently than Java's InputStream and OutputStream classes (which the Blob{Input/Output}Stream classes inherit from). I have been struggling to understand how streaming truly works in Azure's Python SDK. To start out, I am just trying to see how the StorageStreamDownloader class works. It seems like the StorageStreamDownloader is what holds the 'connection' to the BlockBlob object I am reading data from. If I want to put the data in a stream, I would make a new io.BytesIO() and pass that stream to the StorageStreamDownloader's readinto method.
For uploads, I would call the BlobClient's upload method. The upload method accepts a data parameter that is of type Union[Iterable[AnyStr], IO[AnyStr]].
I don't want to go into too much detail about what I understand, because what I understand and what I have done have gotten me nowhere. I am suspicious that I am expecting something that only the Java SDK offers. But, overall, here are the problems I am having:
When I call download_blob, I get back a StorageStreamDownloader with all the data in the blob. Some investigation has shown that I can use the offset and length parameters to download only the amount of data I want. Perhaps I can call it once with download_blob(offset=0, length=4MB), process the data I get back, then call download_blob(offset=4MB, length=4MB), process that data, and so on. This is unfavorable. The other thing I could do is utilize the max_chunk_get_size parameter for the BlobClient and turn on the validate_content flag (set it to True) so that the StorageStreamDownloader only downloads 4 MB at a time. But this all results in several problems: that's not really streaming from a stream object, and I'll still have to call download and readinto several times. And fine, I would do that, if it weren't for the second problem:
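For what it's worth, here is a minimal sketch of those two download approaches with azure-storage-blob 12.x; the account, container, and blob names are placeholders:

import io
from azure.storage.blob import BlobClient

blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                         container_name='<container>',
                         blob_name='<blob>',
                         credential='<account key>')

# Option 1: pull the whole download into an in-memory stream via readinto
buffer = io.BytesIO()
blob_client.download_blob().readinto(buffer)

# Option 2: ranged downloads, 4 MB at a time
chunk_size = 4 * 1024 * 1024
total_size = blob_client.get_blob_properties().size
offset = 0
while offset < total_size:
    chunk = blob_client.download_blob(offset=offset, length=chunk_size).readall()
    # ... process chunk ...
    offset += len(chunk)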
How the heck do I stream an upload? The upload can take a stream. But if the stream doesn't auto-update itself, then I can only upload once, because all the blobs I deal with must be BlockBlobs. The docs for the upload_blob function say that I can provide an overwrite parameter that does:
keyword bool overwrite: Whether the blob to be uploaded should overwrite the current data. If True, upload_blob will overwrite the existing data. If set to False, the operation will fail with ResourceExistsError. The exception to the above is with Append blob types: if set to False and the data already exists, an error will not be raised and the data will be appended to the existing blob. If set overwrite=True, then the existing append blob will be deleted, and a new one created. Defaults to False.
And this makes sense because BlockBlobs, once written to, cannot be written to again. So AFAIK, you can't 'stream' an upload. If I can't have a stream object that is directly tied to the blob, or holds all the data, then the upload() function will terminate as soon as it finishes, right?
Okay. I am certain I am missing something important. I am also somewhat ignorant when it comes to the io module in Python. Though I have developed in Python for a long time, I never really had to deal with that module too closely. I am sure I am missing something, because this functionality is very basic and exists in all the other azure SDKs I know about.
To recap
Everything I said above can honestly be ignored, and only this portion read; I am just trying to show I've done some due diligence. I want to know how to stream data from a blob, process the data I get in a stream, then upload that data. I cannot be receiving all the data in a blob at once. Blobs are likely to be over 1GB and all that pretty stuff. I would honestly love some example code that shows:
Retrieving some data from a blob (the data received in one call should not be more than 10MB) in a stream.
Compressing the data in that stream.
Upload the data to a blob.
This should work for blobs of all sizes; whether it's 1 MB or 10 MB or 10 GB should not matter. Step 2 can be anything, really; it can also be nothing. Just as long as data is being downloaded, inserted into a stream, then uploaded, that would be great. Of course, the other extremely important constraint is that the data per 'download' shouldn't be more than 10 MB.
I hope this makes sense! I just want to stream data. This shouldn't be that hard.
Edit:
Some people may want to close this and claim the question is a duplicate. I have forgotten to include something very important: I am currently using the newest, most up-to-date azure-sdk version. My azure-storage-blob package's version is 12.5.0. There have been other questions similar to what I have asked, but for severely outdated versions. I have searched for other answers, but haven't found any for 12+ versions.
If you want to download an Azure blob in chunks, process every chunk of data and upload every chunk to an Azure blob, please refer to the following code
import uuid
from azure.storage.blob import BlobClient, BlobBlock

key = '<account key>'
source_blob_client = BlobClient(account_url='https://andyprivate.blob.core.windows.net',
                                container_name='',
                                blob_name='',
                                credential=key,
                                max_chunk_get_size=4*1024*1024,   # the size of each chunk is 4 MB
                                max_single_get_size=4*1024*1024)
des_blob_client = BlobClient(account_url='https://<account name>.blob.core.windows.net',
                             container_name='',
                             blob_name='',
                             credential=key)
stream = source_blob_client.download_blob()
block_list = []
# read the data in chunks
for chunk in stream.chunks():
    # process your data here (for example, compress it; see the gzip sketch below)
    # then use the Put Block REST API to upload the chunk to azure storage
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=<the data after you process>)
    block_list.append(BlobBlock(block_id=blk_id))
# use the Put Block List REST API to commit the uploaded blocks and make up one blob
des_blob_client.commit_block_list(block_list)
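If the per-chunk processing is compression, as in the gzip scenario from the question, one way to do it is to feed each chunk through a streaming gzip compressor before staging the block. This sketch reuses the stream and des_blob_client objects defined above; zlib with wbits=31 produces a single continuous gzip stream incrementally:

import uuid
import zlib
from azure.storage.blob import BlobBlock

compressor = zlib.compressobj(wbits=31)   # wbits=31 selects the gzip container format
block_list = []
for chunk in stream.chunks():
    compressed = compressor.compress(chunk)
    if compressed:                        # the compressor may buffer small inputs
        blk_id = str(uuid.uuid4())
        des_blob_client.stage_block(block_id=blk_id, data=compressed)
        block_list.append(BlobBlock(block_id=blk_id))
tail = compressor.flush()                 # flush whatever is still buffered
if tail:
    blk_id = str(uuid.uuid4())
    des_blob_client.stage_block(block_id=blk_id, data=tail)
    block_list.append(BlobBlock(block_id=blk_id))
des_blob_client.commit_block_list(block_list)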
Besides, if you just want to copy one blob from one storage location to another, you can directly use the method start_copy_from_url

How to get s3 metadata for all keys in a bucket via boto3

I want to fetch all metadata for a bucket with a prefix via Boto. There are a few SO questions that imply this isn't possible via the AWS API. So, two questions:
Is there a good reason this shouldn't be possible via the AWS API?
Although I can't find one in the docs, is there a convenience method for this in Boto?
I'm currently doing this using multithreading, but that seems like overkill, and I'd really rather avoid it if at all possible.
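Roughly, what I'm doing per key is something like the following, minus the threading (the bucket name and prefix are placeholders):

import boto3

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')

all_metadata = {}
for page in paginator.paginate(Bucket='my-bucket', Prefix='my/prefix/'):
    for obj in page.get('Contents', []):
        # one HEAD request per key; this is the part that gets slow without threads
        head = s3.head_object(Bucket='my-bucket', Key=obj['Key'])
        all_metadata[obj['Key']] = head['Metadata']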
While there isn't a way to do this directly through boto, you could add an inventory configuration on the bucket(s) which generates a daily CSV / ORC file with all file metadata.
Once this has been generated you can then process the output rather than multithreading or any other method that requires a huge number of requests.
See: put_bucket_inventory_configuration
It's worth noting that it can take up to 48 hours for the first inventory to be generated.
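A minimal sketch of setting that up with boto3 might look like this; the bucket names, destination ARN, and configuration Id below are placeholders:

import boto3

s3 = boto3.client('s3')
s3.put_bucket_inventory_configuration(
    Bucket='my-source-bucket',                # the bucket to inventory
    Id='daily-metadata-inventory',
    InventoryConfiguration={
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::my-inventory-bucket',   # where the CSV/ORC files land
                'Format': 'CSV',
                'Prefix': 'inventory',
            }
        },
        'IsEnabled': True,
        'Id': 'daily-metadata-inventory',
        'IncludedObjectVersions': 'Current',
        'Schedule': {'Frequency': 'Daily'},
        'OptionalFields': ['Size', 'LastModifiedDate', 'StorageClass', 'ETag'],
    },
)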

Python Azure blob storage upload file bigger then 64 MB

From the sample code, I can upload 64 MB without any problem:
myblob = open(r'task1.txt', 'r').read()
blob_service.put_blob('mycontainer', 'myblob', myblob, x_ms_blob_type='BlockBlob')
What if I want to upload a bigger file?
Thank you
I ran into the same problem a few days ago, and was lucky enough to find this. It breaks up the file into chunks and uploads it for you.
I hope this helps. Cheers!
I'm not a Python programmer. But a few extra tips I can offer (my stuff is all in C):
Use HTTP PUT operations (comp=block option) for as many Blocks (4 MB each) as required for your file, and then use a final Put Block List (comp=blocklist option) that coalesces the Blocks. If your Block uploads fail or you need to abort, the cleanup for deleting the partial set of previously uploaded Blocks is a DELETE command for the file you are looking to create, but this appears to be supported only by the 2013-08-15 version (someone from Azure support should confirm this).
If you need to add Meta information, an additional PUT operation (with the comp=metadata prefix) is what I do when using the Block List method. There might be a more efficient way to tag the meta information without requiring an additional PUT, but I'm not aware of it.
This is a good question. Unfortunately, I don't see a real implementation for uploading arbitrarily large files. So, from what I can see, there is much more work to do on the Python SDK, unless I am missing something really crucial.
The sample code provided in the documentation indeed uses just a single text file and uploads it at once. There is no real code yet implemented (from what I see in the SDK source code) to support uploads of larger files.
So, for you, to work with Blobs from Python you need to understand how Azure Blob Storage works. Start here.
Then take a quick look at the REST API documentation for the Put Blob operation. It is mentioned in the remarks:
The maximum upload size for a block blob is 64 MB. If your blob is larger than 64 MB, you must upload it as a set of blocks. For more information, see the Put Block (REST API) and Put Block List (REST API) operations. It's not necessary to call Put Blob if you upload the blob as a set of blocks.
The good news is that PutBlock and PutBlockList are implemented in the Python SDK, but no sample is provided for how to use them. What you have to do is manually split your file into chunks (blocks) of up to 4 MB each, and then use the put_block(self, container_name, blob_name, block, blockid, content_md5=None, x_ms_lease_id=None): function from the Python SDK to upload the blocks. You can also upload the blocks in parallel. Do not forget, however, that at the end you also have to execute put_block_list(self, container_name, blob_name, block_list, content_md5=None, x_ms_blob_cache_control=None... to commit all the uploaded blocks.
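For example, the split-and-commit flow could look roughly like this, reusing the blob_service object from the question. This is an untested sketch based only on the signatures quoted above, and the helper name is made up:

import base64

MAX_BLOCK_SIZE = 4 * 1024 * 1024   # 4 MB per block, per the REST API limit quoted above

def upload_in_blocks(blob_service, container_name, blob_name, file_path):
    block_ids = []
    with open(file_path, 'rb') as f:
        index = 0
        while True:
            data = f.read(MAX_BLOCK_SIZE)
            if not data:
                break
            # the REST API expects base64-encoded block ids of equal length
            block_id = base64.b64encode('block-{0:08d}'.format(index).encode()).decode()
            blob_service.put_block(container_name, blob_name, data, block_id)
            block_ids.append(block_id)
            index += 1
    # commit the uploaded blocks, in order, to form the final blob
    blob_service.put_block_list(container_name, blob_name, block_ids)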
Unfortunately, I'm not enough of a Python expert to help you further, but at least I've given you a good picture of the situation.

What's a good strategy for generating unique hash keys for a very large collection of images in Python?

I have a listing of millions of files and am uploading them to Amazon's S3. I need to create unique keys for each of the images. I'd rather not use MD5 because it requires scanning the entire file, which can be slow. Additionally, there could be duplicate images, which is allowed in our application. Any suggestion for quickly generating an almost-guaranteed-unique key? Preferably 32 characters, alphanumeric (can be case sensitive). Thanks!
I would not call this a hash, since that implies generating a unique value based on the file contents.
Instead, UUIDs might be what you're after.
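For example (the value shown is just a sample of random output):

import uuid

key = uuid.uuid4().hex   # e.g. '9f1c2b8a4e0d4c9aa1b2c3d4e5f60718' (32 hex characters)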
Just use an MD5 hash on the actual FILE after it's been uploaded and stored.
http://docs.python.org/library/md5.html
Apply the hash to a database table, or however you're storing it.
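If you do compute MD5 on the stored file, hashing it in chunks keeps memory usage flat; for example:

import hashlib

def file_md5(path, chunk_size=1024 * 1024):
    digest = hashlib.md5()
    with open(path, 'rb') as f:
        # read the file a chunk at a time rather than loading it all into memory
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()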
MD5 of datetime.now() (at the time of file upload) will be OK, imho.

Caching system for dynamically created files?

I have a web server that is dynamically creating various reports in several formats (pdf and doc files). The files require a fair amount of CPU to generate, and it is fairly common to have situations where two people are creating the same report with the same input.
Inputs:
raw data input as a string (equations, numbers, and lists of words), arbitrary length; almost 99% will be less than about 200 words
the version of the report creation tool
When a user attempts to generate a report, I would like to check to see if a file already exists with the given input, and if so return a link to the file. If the file doesn't already exist, then I would like to generate it as needed.
What solutions are already out there? I've cached simple HTTP requests before, but the keys were extremely simple (usually database IDs).
If I have to do this myself, what is the best way? The input can be several hundred words, and I was wondering how I should go about transforming the strings into keys sent to the cache.
//entire input, uses too much memory, one to one mapping
cache['one two three four five six seven eight nine ten eleven...']
//short keys
cache['one two'] => 5 results, then I must narrow these down even more
Is this something that should be done in a database, or is it better done within the web app code (Python in my case)?
Thank you, everyone.
This is what Apache is for.
Create a directory that will have the reports.
Configure Apache to serve files from that directory.
If the report exists, redirect to a URL that Apache will serve.
Otherwise, the report doesn't exist, so create it. Then redirect to a URL that Apache will serve.
There's no "hashing". You have a key ("a string (equations, numbers, and lists of words), arbitrary length, almost 99% will be less than about 200 words") and a value, which is a file. Don't waste time on a hash. You just have a long key.
You can compress this key somewhat by making a "slug" out of it: remove punctuation, replace spaces with _, that kind of thing.
You should create an internal surrogate key which is a simple integer.
You're simply translating a long key to a "report" which either exists as a file or will be created as a file.
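A minimal sketch of that idea; the paths and the generate callback are placeholders, and the surrogate integer key would live in whatever table maps slug to report:

import os
import re

REPORT_DIR = '/var/reports'   # the directory Apache is configured to serve

def slugify(raw_input, tool_version):
    # collapse the raw input plus the tool version into a filesystem-friendly key
    text = '{0}_v{1}'.format(raw_input, tool_version)
    return re.sub(r'[^a-z0-9]+', '_', text.lower()).strip('_')

def get_or_create_report(raw_input, tool_version, generate, fmt='pdf'):
    path = os.path.join(REPORT_DIR, '{0}.{1}'.format(slugify(raw_input, tool_version), fmt))
    if not os.path.exists(path):
        generate(raw_input, path)   # only pay the CPU cost on a cache miss
    return path                     # redirect the user to the URL Apache serves for this file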
The usual thing is to use a reverse proxy like Squid or Varnish
