Uploading image file to Google Bucket in Python

I am attempting to create a function in Python that takes a filename and an image object and uploads the image to a Google Cloud Storage bucket. The bucket is already created and I have all the credentials in environment variables, but I'm confused about the whole process.
Currently I have the following setup:
class ImageStorage:
    bucket_name = os.getenv('STORAGE_BUCKET_NAME')
    project_name = os.getenv('STORAGE_BUCKET_PROJECT_ID')
    client = storage.Client(project=project_name)
    bucket = client.get_bucket(bucket_name)

    def save_image(self, filename, image):
        blob = self.bucket.blob(filename)
        blob.upload_from_file(image)
But once I run this, I get the error:
total bytes could not be determined. Please pass an explicit size, or supply a chunk size for a streaming transfer.
I'm not sure how I can provide the byte size of this image object. Do I first need to create a file locally from the image object and then upload that?

As per the GitHub issue, you should provide the chunk_size parameter for a streaming upload.
blob = self.bucket.blob(filename, chunk_size=262144) # 256KB
blob.upload_from_file(image)
chunk_size (int) – The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
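Alternatively, the error message's other suggestion, passing an explicit size, also works if the stream can be measured first. A minimal sketch, assuming image is a seekable file-like object and self.bucket is set up as in the question:

import os

def save_image(self, filename, image):
    # Measure the stream so the client can do a single (non-chunked) upload.
    image.seek(0, os.SEEK_END)
    total_bytes = image.tell()
    image.seek(0)
    blob = self.bucket.blob(filename)
    blob.upload_from_file(image, size=total_bytes)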

Related

Efficient upload of large amount of images to Azure Storage in Python

I need to find the optimal way to upload a large number of images (up to a few thousand) of size ~6MB per image on average. Our service is written in Python.
We have the following flow:
There is a service that has a single BlobServiceClient created. We are using CertificateCredentials to authenticate.
The service runs in a container on Linux and is written in Python.
The service receives a message that has 6 to 9 images as numpy ndarrays plus a JSON metadata object for each.
Every time we get a message, we send all the image files plus the JSON files to storage using a ThreadPoolExecutor with max_threads = 20.
We are NOT using the async version of the library.
Trimmed-down and simplified code looks like this (it will not run as-is, it's just an illustration; azurestorageclient is our wrapper around the Azure Python SDK. It has a single BlobServiceClient instance that we use to create containers and upload blobs):
def _upload_file(self,
                 blob_name: str,
                 data: bytes,
                 blob_type: BlobType,
                 length=None):
    blob_client = self._upload_container.get_blob_client(blob_name)
    return blob_client.upload_blob(data, length=len(data), blob_type=BlobType.BlockBlob)

def _upload(self, executor: ThreadPoolExecutor, storage_client: AzureStorageClient,
            image: ndarray, metadata: str) -> (Future, Future):
    DEFAULT_LOGGER.info(f"Uploading image blob: {img_blob_name} ...")
    img_upload_future = executor.submit(
        self.upload_file,
        blob_name=img_blob_name, byte_array=image.tobytes(),
        content_type="image/jpeg",
        overwrite=True,
    )
    DEFAULT_LOGGER.info(f"Uploading JSON blob: {metadata_blob_name} ...")
    metadata_upload_future = executor.submit(
        self.upload_file,
        blob_name=metadata_blob_name, byte_array=metadata_json_bytes,
        content_type="application/json",
        overwrite=True,
    )
    return img_upload_future, metadata_upload_future

def send(storage_client: AzureStorageClient,
         image_data: Dict[metadata, ndarray]):
    with ThreadPoolExecutor(max_workers=_THREAD_SEND_MAX_WORKERS) as executor:
        upload_futures = {
            image_metadata: _upload(
                executor=executor,
                storage_client=storage_client,
                image=image,
                metadata=metadata
            )
            for metadata, image in image_data.items()
        }
We observe very poor performance from this service when uploading files over a slow network with big signal-strength fluctuations.
We are now trying to find and measure different options for improving performance:
We will store the files to HDD first and then upload them in bigger chunks from time to time
We think that uploading a single big file should perform better (e.g. 100 files in a zip/tar file)
We think that reducing the number of parallel jobs when the connection is bad should also help
We are considering using AzCopy instead of Python
Does anyone have other suggestions or nice Python code samples for working in such scenarios? Or should we perhaps change the service used to upload the data, for example using ssh to connect to a VM and upload the files that way? (I doubt it will be faster, but I've received such suggestions.)
Mike
Given your situation, I suggest you zip a batch of files into one big file and upload that big file in chunks. To upload the file in chunks, you can use the methods BlobClient.stage_block and BlobClient.commit_block_list.
For example
import uuid

from azure.storage.blob import BlobBlock

# blob_client is assumed to be an existing BlobClient for the target blob
block_list = []
chunk_size = 1024
with open('csvfile.csv', 'rb') as f:
    while True:
        read_data = f.read(chunk_size)
        if not read_data:
            break  # done
        blk_id = str(uuid.uuid4())
        blob_client.stage_block(block_id=blk_id, data=read_data)
        block_list.append(BlobBlock(block_id=blk_id))
blob_client.commit_block_list(block_list)
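For the zip part of that suggestion, here is a minimal sketch using only the standard library and azure-storage-blob; the container name, blob name, and images dict are placeholders, not from the question:

import io
import zipfile

from azure.storage.blob import BlobServiceClient

def upload_images_as_zip(service: BlobServiceClient, container_name: str,
                         zip_blob_name: str, images: dict):
    # Pack {filename: image_bytes} into one in-memory zip; JPEGs are already
    # compressed, so ZIP_STORED avoids wasting CPU on recompression.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_STORED) as archive:
        for name, data in images.items():
            archive.writestr(name, data)
    buffer.seek(0)
    blob_client = service.get_blob_client(container=container_name, blob=zip_blob_name)
    blob_client.upload_blob(buffer, overwrite=True)

If the resulting archive is large, it can then be uploaded in chunks with stage_block/commit_block_list as shown above.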

Save an image from a Generator object - Python

I made an API call to convert an image into a thumbnail version of itself, and that returned a Generator object. But I don't know how to save that object as an image on my local machine.
The API documentation says that a "successful response contains the thumbnail image binary", but I don't know how to access it. Do I need to convert the binary into a string or list and then turn that into an image using the Image class from PIL?
I don't know the best way to do this. I know generators are just iterators that save state, but that doesn't mean much to me when the thing inside is image data and I need to get it out so that I end up with a saved image in my local folder.
Here is my code:
computervision_client = ComputerVisionClient(endpoint, CognitiveServicesCredentials(subscription_key))

# Get a local image
local_image_path_thumb = "resources\\objects.jpg"
local_image_thumb = open(local_image_path_thumb, "rb")

print("Generating thumbnail from a local image...")
# Call the API with a local image, set the width/height if desired (pixels)
# Returns a Generator object, a thumbnail image binary.
thumb_local = computervision_client.generate_thumbnail_in_stream(100, 100, local_image_thumb, True)
# Save the thumbnail to your local root folder of this project.
# Save to here, somehow: "\\resources\\thumb_local.jpg"
print("Thumbnail saved to local folder.")
Here is the API documentation for the function generate_thumbnail_in_stream.
with open("output_file.png", "wb") as fp:
    # The generator yields chunks of the thumbnail's binary data; write them out in order.
    for chunk in thumb_local:
        fp.write(chunk)
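If you then want to work with the saved thumbnail in PIL, as the question suggests, a short follow-up sketch (assuming Pillow is installed):

from PIL import Image

# PIL detects the format from the file contents, regardless of the extension used above.
thumbnail = Image.open("output_file.png")
print(thumbnail.size)  # e.g. (100, 100), matching the requested width/height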

Can I download from Google storage blobs into a VM as an n-d array?

I have a Google Compute Engine VM and am trying to grab data from my Cloud Storage bucket, which is stored as a blob, and turn it into a numpy array with the same shape it had when stored.
Currently the only way I can get this working is by downloading to a file and then loading it into a numpy array, which seems sub-optimal.
I have tried downloading directly as a string and converting it into a numpy array, but the dimensions are not maintained (they are flattened).
I could move all the files to the VM instead, but I would rather read them 'on the fly' if possible.
Current code:
def __getitem__(self, index):
    index = int(self.indexes[int(index)])
    blob = bucket.blob(self.data_path + 'case_' + str(index) + '_volume.npy')
    blob.download_to_filename('im.npy')
    image = np.load('im.npy')
    return image
If you have enough RAM to store the entire file in memory (while it is also loaded into numpy), you can do the read into a BytesIO object, seek back to the beginning of the buffer, then hand it to numpy.load(). Adapt this as necessary to your particular function:
import io
import numpy as np
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('my-bucket')
blob = bucket.blob('my-file.npy')

with io.BytesIO() as in_memory_file:
    blob.download_to_file(in_memory_file)
    in_memory_file.seek(0)
    image = np.load(in_memory_file)
    # then, for example:
    print(image)
At least for now there doesn't appear to be a way to actually stream the read out of GCS without writing the necessary client library yourself.
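Adapted to the __getitem__ method from the question, a sketch might look like this (assuming bucket, self.indexes, and self.data_path are set up as in the original code; download_as_bytes is called download_as_string in older versions of google-cloud-storage):

import io
import numpy as np

def __getitem__(self, index):
    index = int(self.indexes[int(index)])
    blob = bucket.blob(self.data_path + 'case_' + str(index) + '_volume.npy')
    # Download into memory and let numpy read from a seekable buffer,
    # which preserves the array's original shape and dtype.
    buffer = io.BytesIO(blob.download_as_bytes())
    return np.load(buffer)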

Python Flask- how can I read the size of an image before uploading it to Amazon s3

This question might be fairly straightforward, if you have some experience with Python Flask, Boto3, Pillow (a.k.a. PIL).
I'm attempting to receive an incoming image from a client (only allowing .jpg, .jpeg, .tif) and I'd like to read the dimensions of the image before uploading it to Amazon S3 using Boto3.
The code is fairly straightforward:
file = request.files['file']
# produces an instance of FileStorage
asset = models.Asset(file, AssetType.profile_img, donor.id)
# a model managed by the ORM
img = Image.open(BytesIO(file.stream.read()))
# produces a PIL Image object
size = img.size
# read the size of the Image object
asset.width = size[0]
asset.height = size[1]
# set the size to the ORM
response = s3.Object('my-bucket', asset.s3_key()).put(Body=file)
# upload to S3
Here's the catch: I can either (A) read the image OR (B) upload it to S3, but I can't do both. Literally, commenting out one or the other produces the desired operation, but not both in combination.
I've narrowed it down to the upload. My belief is that somewhere along the line the file.stream.read() operation is causing an issue with the Boto3 upload, but I can't figure it out. Can you?
Thanks in advance.
You're close - changing the byte source for S3 should do it. The first file.stream.read() consumes the underlying stream, so when Body=file is later handed to S3 there is nothing left to read; keep the bytes you already read in a buffer and upload that instead. Roughly, something like this:
file = request.files['file']
# produces an instance of FileStorage
asset = models.Asset(file, AssetType.profile_img, donor.id)
# a model managed by the ORM
image_bytes = BytesIO(file.stream.read())
# save bytes in a buffer
img = Image.open(image_bytes)
# produces a PIL Image object
size = img.size
# read the size of the Image object
asset.width = size[0]
asset.height = size[1]
# set the size to the ORM
image_bytes.seek(0)
response = s3.Object('my-bucket', asset.s3_key()).put(Body=image_bytes)
# upload to S3
Note the call to seek and the use of BytesIO in the call to S3. I can't overstate how useful BytesIO and StringIO are for doing this sort of thing!
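As an aside, a variation that avoids copying the upload into a separate buffer is to rewind the original stream after PIL has read the header. A sketch, assuming the same request, s3, and asset objects as in the question:

from PIL import Image

file = request.files['file']             # werkzeug FileStorage
img = Image.open(file.stream)            # PIL only needs the header to report the size
asset.width, asset.height = img.size
file.stream.seek(0)                      # rewind so S3 receives the full image bytes
response = s3.Object('my-bucket', asset.s3_key()).put(Body=file.stream.read())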

How to upload a file stored in memory (using Python's tempfile) to S3?

I want to upload some video clips to Amazon S3. These videos are generated as intermediate results. So, I prefer to store these small video clips in memory (around 400~500 KB), then upload each of them to S3.
After uploading, the temporary files can be removed from memory. Hence, I want to use tempfile. But there are errors in the following code. Please take a look and tell me how to do this correctly.
@contextmanager
def s3upload(key):
    with tempfile.SpooledTemporaryFile(max_size=1021*1000) as buffer:
        yield buffer
        buffer.seek(0)
        # key.send_file(buffer)
        k.set_contents_from_file(buffer)
        k.set_acl('public-read')

conn = boto.connect_s3()
b = conn.get_bucket('cc_test_s3')
k = Key(b)
k.key = '1.flv'
mime = mimetypes.guess_type('1.flv')[0]

with s3upload(k) as out:
    out.write('1.flv')
Output:
The size of the uploaded file is 5 KB, which is much less than the actual size of 1.flv (~400 KB).
I would recommend you use s3fuse, which will basically mount your S3 bucket on your local drive; then you can directly save the files as if you were saving to a local directory. For reference, you can look at s3fuse on Google Code.
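Alternatively, if you want to keep the in-memory approach from the question: out.write('1.flv') writes the literal filename string into the buffer rather than the clip's bytes, which would explain the tiny upload. A sketch of the corrected flow with the legacy boto library, using the same names as the question:

import boto
import tempfile
from boto.s3.key import Key
from contextlib import contextmanager

@contextmanager
def s3upload(key):
    # The spooled file stays in memory until max_size is exceeded, then spills to disk.
    with tempfile.SpooledTemporaryFile(max_size=1024 * 1000) as buffer:
        yield buffer
        buffer.seek(0)                    # rewind before reading for upload
        key.set_contents_from_file(buffer)
        key.set_acl('public-read')

conn = boto.connect_s3()
bucket = conn.get_bucket('cc_test_s3')
key = Key(bucket)
key.key = '1.flv'

with s3upload(key) as out:
    with open('1.flv', 'rb') as source:   # write the clip's bytes, not its filename
        out.write(source.read())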
