I started experimenting with Amazon Rekognition to compare faces from a Lambda function. The intended flow is: a user uploads an image, S3 sends an event that triggers a Lambda, and the Lambda fetches the two relevant images from the bucket and compares the faces. However, I can't get Lambda to read the images from their S3 URIs, so for now I had to create a test event that names the two images in S3. Does anyone know a way to get from the S3 URI to something Lambda can use to compare the faces?
This is my test event:
{
  "sourceImage": "source.jpg",
  "targetImage": "target.jpg"
}
This is the main program
import json
import boto3

s3 = boto3.resource('s3')

def lambda_handler(event, context):
    print(event)
    dump = json.loads(json.dumps(event))
    sourceImage = dump['sourceImage']
    targetImage = dump['targetImage']
    bucket = 'your_name'
    client = boto3.client('rekognition')
    faceComparison = client.compare_faces(
        SourceImage={'S3Object': {'Bucket': bucket, 'Name': str(sourceImage)}},
        TargetImage={'S3Object': {'Bucket': bucket, 'Name': str(targetImage)}}
    )
    res = {
        "faceRecognition": faceComparison
    }
    return res
You cannot use a URI to access objects on S3 unless it is a public object/bucket. There are two alternatives you can use:
You can use the download_fileobj method from boto3 to download the object into a file-like object (e.g. io.BytesIO) and pass that to your function.
You can use the download_file method to download the file to the /tmp directory in Lambda and then give the path of that file to your function.
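For example, a minimal sketch of the second option applied to the Lambda above (the bucket name and the test-event keys are placeholders taken from the question):

import boto3

s3 = boto3.client('s3')
rekognition = boto3.client('rekognition')

def lambda_handler(event, context):
    # keys come from the hand-written test event shown in the question;
    # with a real S3 trigger you would read them from event['Records'] instead
    bucket = 'your_name'
    source_key = event['sourceImage']
    target_key = event['targetImage']

    # alternative 2: download both objects to Lambda's writable /tmp directory
    s3.download_file(bucket, source_key, '/tmp/source.jpg')
    s3.download_file(bucket, target_key, '/tmp/target.jpg')

    with open('/tmp/source.jpg', 'rb') as src, open('/tmp/target.jpg', 'rb') as tgt:
        return rekognition.compare_faces(
            SourceImage={'Bytes': src.read()},
            TargetImage={'Bytes': tgt.read()},
        )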
If you want a Lambda function that reads objects from Amazon S3, use the Amazon S3 API to read the object bytes and do NOT use an object URI.
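For example, a rough sketch (bucket and key names are placeholders) that reads the bytes with get_object and hands them straight to Rekognition:

import boto3

s3 = boto3.client('s3')
rekognition = boto3.client('rekognition')

def compare_from_s3(bucket, source_key, target_key):
    # read the raw object bytes through the S3 API instead of via an object URI
    source_bytes = s3.get_object(Bucket=bucket, Key=source_key)['Body'].read()
    target_bytes = s3.get_object(Bucket=bucket, Key=target_key)['Body'].read()
    return rekognition.compare_faces(
        SourceImage={'Bytes': source_bytes},
        TargetImage={'Bytes': target_bytes},
    )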
This AWS tutorial performs a very similar use case:
The Lambda function reads all images in a given Amazon S3 bucket. For each object in the bucket, it passes the image to the Amazon Rekognition service to detect PPE information. The results are stored as records in an Amazon DynamoDB table and then emailed to a recipient.
So instead of comparing faces, it detects PPE gear. It's implemented in Java, but you can port it to your programming language. It will point you in the right direction:
Creating an AWS Lambda function that detects images with Personal Protective Equipment
I have a bunch of videos in my S3 bucket and I want to convert their format using Python, but currently I'm stuck on one issue. My Python script for fetching all objects in the bucket is below.
import boto3

s3 = boto3.client('s3',
                  region_name=S3_REGION,
                  aws_access_key_id=S3_ACCESS_KEY_ID,
                  aws_secret_access_key=S3_ACCESS_SECRET_KEY)

result = s3.list_objects(Bucket=bucket_name, Prefix='videos/')
for o in result.get('Contents'):
    data = s3.get_object(Bucket=bucket_name, Key=o.get('Key'))
For the conversion of the video format I have used the MoviePy library, which converts the video to mp4:
import moviepy.editor as moviepy
clip = moviepy.VideoFileClip("video-529.webm")
clip.write_videofile("converted-recording.mp4")
But the problem with this library is that it only works with a file; you cannot pass an S3 object as a file, so I don't know how to overcome this issue. Does anyone have a better idea or a way to resolve this?
You are correct. Libraries require the video file to be on the 'local disk', so you should use download_file() instead of get_object().
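A rough sketch of that approach, reusing the listing code from the question (the bucket name, local /tmp paths, and the output prefix are placeholders):

import boto3
import moviepy.editor as moviepy

s3 = boto3.client('s3')          # same client/credentials setup as in the question
bucket_name = 'your-bucket'      # placeholder

result = s3.list_objects(Bucket=bucket_name, Prefix='videos/')
for o in result.get('Contents', []):
    key = o['Key']                               # e.g. videos/video-529.webm
    if key.endswith('/'):                        # skip folder placeholder objects
        continue
    local_in = '/tmp/' + key.split('/')[-1]      # download target on local disk
    local_out = local_in.rsplit('.', 1)[0] + '.mp4'

    # download_file() puts the object on local disk so MoviePy can open it as a file
    s3.download_file(bucket_name, key, local_in)

    clip = moviepy.VideoFileClip(local_in)
    clip.write_videofile(local_out)

    # optionally push the converted file back to the bucket
    s3.upload_file(local_out, bucket_name, 'converted/' + local_out.split('/')[-1])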
Alternatively, you could use Amazon Elastic Transcoder to transcode the file 'as a service' rather than doing it in your own code. (Charges apply, based on video length.)
I need to find the optimal way to upload a large number of images (up to a few thousand) of size ~6MB per image on average. Our service is written in Python.
We have the following flow:
There is a service that has a single BlobServiceClient created. We are using CertificateCredentials to authenticate
The service runs in a container on Linux and is written in Python
The service receives a message that has 6 to 9 images as NumPy ndarrays plus a JSON metadata object for each
Every time we get a message, we send all the image files plus the JSON files to storage using a ThreadPoolExecutor with max_workers = 20
We are NOT using the async version of the library
Trimmed and simplified, the code looks like this (the snippet below will not run, it is just an illustration; azurestorageclient is our wrapper around the Azure Python SDK, which holds a single BlobServiceClient instance that we use to create containers and upload blobs):
def _upload_file(self,
                 blob_name: str,
                 data: bytes,
                 blob_type: BlobType,
                 length=None):
    blob_client = self._upload_container.get_blob_client(blob_name)
    return blob_client.upload_blob(data, length=len(data), blob_type=BlobType.BlockBlob)

def _upload(self, executor: ThreadPoolExecutor, storage_client: AzureStorageClient,
            image: ndarray, metadata: str) -> (Future, Future):
    DEFAULT_LOGGER.info(f"Uploading image blob: {img_blob_name} ...")
    img_upload_future = executor.submit(
        self.upload_file,
        blob_name=img_blob_name, byte_array=image.tobytes(),
        content_type="image/jpeg",
        overwrite=True,
    )
    DEFAULT_LOGGER.info(f"Uploading JSON blob: {metadata_blob_name} ...")
    metadata_upload_future = executor.submit(
        self.upload_file,
        blob_name=metadata_blob_name, byte_array=metadata_json_bytes,
        content_type="application/json",
        overwrite=True,
    )
    return img_upload_future, metadata_upload_future

def send(storage_client: AzureStorageClient,
         image_data: Dict[metadata, ndarray]):
    with ThreadPoolExecutor(max_workers=_THREAD_SEND_MAX_WORKERS) as executor:
        upload_futures = {
            image_metadata: _upload(
                executor=executor,
                storage_client=storage_client,
                image=image,
                metadata=metadata
            )
            for metadata, image in image_data.items()
        }
We observe very bad performance from this service when uploading files over a slow network with large signal-strength fluctuations.
We are now trying to find and measure different options to improve performance:
We will store files to HDD first and then upload them in bigger batches from time to time
We think that uploading a single big file should perform better (e.g. 100 files into a zip/tar file)
We think that reducing the number of parallel jobs when the connection is bad should also help
We are considering using AzCopy instead of Python
Does anyone have other suggestions or nice Python code samples for working in such scenarios? Or maybe we should change the service used to upload the data, for example use SSH to connect to a VM and upload the files that way? (I doubt it would be faster, but I have received such suggestions.)
Mike
Given your situation, I suggest you zip a number of files into one big file and upload that big file in chunks. To upload a file in chunks, you can use the methods BlobClient.stage_block and BlobClient.commit_block_list.
For example:
import uuid

from azure.storage.blob import BlobBlock

# blob_client is an existing BlobClient for the target blob
block_list = []
chunk_size = 1024
with open('csvfile.csv', 'rb') as f:
    while True:
        read_data = f.read(chunk_size)
        if not read_data:
            break  # done
        blk_id = str(uuid.uuid4())
        blob_client.stage_block(block_id=blk_id, data=read_data)
        block_list.append(BlobBlock(block_id=blk_id))

blob_client.commit_block_list(block_list)
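To cover the "zip several files into one big file" part for your case, here is an untested sketch (the block size, blob naming, and the {name: ndarray} input are assumptions on my part, adapt them to your wrapper):

import io
import uuid
import zipfile

import numpy as np
from azure.storage.blob import BlobBlock

def upload_images_as_zip(blob_client, images):
    # Pack several in-memory ndarrays into one zip archive, then upload it in blocks.
    # blob_client is an existing BlobClient for the target blob; images is a
    # {name: ndarray} dict -- both are assumptions used only for illustration.
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode='w', compression=zipfile.ZIP_DEFLATED) as archive:
        for name, image in images.items():
            entry = io.BytesIO()
            np.save(entry, image)               # keeps dtype/shape, unlike raw tobytes()
            archive.writestr(f"{name}.npy", entry.getvalue())
    buffer.seek(0)

    block_list = []
    chunk_size = 4 * 1024 * 1024                # 4 MiB blocks; tune for your network
    while True:
        chunk = buffer.read(chunk_size)
        if not chunk:
            break
        blk_id = str(uuid.uuid4())
        blob_client.stage_block(block_id=blk_id, data=chunk)
        block_list.append(BlobBlock(block_id=blk_id))
    blob_client.commit_block_list(block_list)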
I botched a Firebase cloud function and accidentally created 1.9 million images stored in gs://myapp.appspot.com//tmp/. That double slash is accurate--the server was writing to /tmp/, which I guess results in the path mentioned above.
I'm now wanting to delete those files (they're all nonsense). I tried using the Python wrapper like so:
export GOOGLE_APPLICATION_CREDENTIALS="../secrets/service-account.json"
Then:
from google.cloud import storage
storage_client = storage.Client()
bucket = storage_client.get_bucket('tmp')
blobs = bucket.list_blobs(bucket='tmp', prefix='')
for blob in blobs:
    print(' * deleting', blob)
    blob.delete()
But this throws:
google.api_core.exceptions.Forbidden: 403 GET https://storage.googleapis.com/storage/v1/b/tmp?projection=noAcl: firebase-adminsdk-yr6f8@myapp.iam.gserviceaccount.com does not have storage.buckets.get access to tmp.
Does anyone know how to allow the admin credentials to delete from /tmp/? Any pointers would be hugely helpful!
I was able to reproduce this problem with gsutil command:
gsutil cp ~/<example-file> gs://<my-project-name>.appspot.com//tmp/
First of all, in my Firebase console I am able to delete the whole folder with one click; I am not sure whether you have considered that.
Anyway, if you want to do it with the API, I have found the following solution.
I think (comparing with my test) the bucket name should be: myapp.appspot.com
If you print the blobs in Python you will get something like this: <Blob: <my-project-name>.appspot.com, /tmp/<example-file>, 1584700014023619>
The second value is the name property of the blob object. I noticed that in this situation the blob names start with /tmp/.
Code that works on my side is:
from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('myapp.appspot.com')
blobs = bucket.list_blobs()
for blob in blobs:
    if blob.name.startswith('/tmp/'):
        print(' * deleting', blob)
        blob.delete()
I don't think it's a very elegant solution, but for a one-time fix it may be good enough.
I hope it will help!
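One possible speed-up for 1.9 million objects (not part of the solution above, so treat it as an untested sketch): list only the misplaced objects by prefix and delete them in batches instead of one request per blob. The batch size of 100 matches the Cloud Storage batch-request limit, which is my assumption here.

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.get_bucket('myapp.appspot.com')

# list only the misplaced objects; their names literally start with '/tmp/'
to_delete = list(bucket.list_blobs(prefix='/tmp/'))

# delete in batches of 100 rather than issuing one request per blob
for i in range(0, len(to_delete), 100):
    bucket.delete_blobs(to_delete[i:i + 100])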
I am attempting to create a function in Python to which I pass a filename and an image object, which I want to upload to a Google Cloud Storage bucket. The bucket is already created and I have all the credentials in an environment variable, but I'm confused about the whole process.
Currently I have the following setup:
import os

from google.cloud import storage

class ImageStorage:
    bucket_name = os.getenv('STORAGE_BUCKET_NAME')
    project_name = os.getenv('STORAGE_BUCKET_PROJECT_ID')
    client = storage.Client(project=project_name)
    bucket = client.get_bucket(bucket_name)

    def save_image(self, filename, image):
        blob = self.bucket.blob(filename)
        blob.upload_from_file(image)
But once I run this, I get the error:
total bytes could not be determined. Please pass an explicit size, or supply a chunk size for a streaming transfer.
I'm not sure how I can provide the byte size of this image object. Do I first need to create a file locally from the image object and then upload that file?
As per the GitHub issue, you should provide the chunk_size parameter for a streaming upload.
blob = self.bucket.blob(filename, chunk_size=262144) # 256KB
blob.upload_from_file(image)
chunk_size (int) – The size of a chunk of data whenever iterating (in bytes). This must be a multiple of 256 KB per the API specification.
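If the image is already fully in memory, an alternative (my assumption, not from the linked issue) is to read the bytes yourself and use upload_from_string, which knows the total size up front:

data = image.read()  # assuming `image` is a readable file-like object
blob = self.bucket.blob(filename)
blob.upload_from_string(data, content_type='image/jpeg')  # content type is a guess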
(cross posted to boto-users)
Given an image ID, how can I delete it using boto?
You use the deregister() API.
There are a few ways of getting the image id (e.g. you can list all images and search their properties, etc.)
Here is a code fragment which will delete one of your existing AMIs (assuming it's in the EU region)
import boto.ec2

connection = boto.ec2.connect_to_region('eu-west-1',
                                        aws_access_key_id='yourkey',
                                        aws_secret_access_key='yoursecret',
                                        proxy=yourProxy,
                                        proxy_port=yourProxyPort)
# This is a way of fetching the image object for an AMI, when you know the AMI id
# Since we specify a single image (using the AMI id) we get a list containing a single image
# You could add error checking and so forth ... but you get the idea
images = connection.get_all_images(image_ids=['ami-cf86xxxx'])
images[0].deregister()
(edit): and in fact having looked at the online documentation for 2.0, there is another way.
Having determined the image ID, you can use the deregister_image(image_id) method of boto.ec2.connection ... which amounts to the same thing I guess.
With newer boto (Tested with 2.38.0), you can run:
ec2_conn = boto.ec2.connect_to_region('xx-xxxx-x')
ec2_conn.deregister_image('ami-xxxxxxx')
or
ec2_conn.deregister_image('ami-xxxxxxx', delete_snapshot=True)
The first will delete the AMI; the second will also delete the attached EBS snapshot.
For Boto2, see katriel's answer. Here, I am assuming you are using Boto3.
If you have the AMI (an object of class boto3.resources.factory.ec2.Image), you can call its deregister function. For example, to delete an AMI with a given ID, you can use:
import boto3
ec2 = boto3.resource('ec2')
ami_id = 'ami-1b932174'
ami = list(ec2.images.filter(ImageIds=[ami_id]).all())[0]
ami.deregister(DryRun=True)
If you have the necessary permissions, you should see a "Request would have succeeded, but DryRun flag is set" exception. To actually delete the AMI, leave out DryRun and use:
ami.deregister() # WARNING: This will really delete the AMI
This blog post elaborates on how to delete AMIs and snapshots with Boto3.
This script deregisters the AMI and the snapshots associated with it. Make sure you have the right privileges to run it.
Inputs: please pass the region and the AMI id(s) as inputs.
import boto3
import sys

def main(region, images):
    ec2 = boto3.client('ec2', region_name=region)
    snapshots = ec2.describe_snapshots(MaxResults=1000, OwnerIds=['self'])['Snapshots']
    # loop through list of image IDs
    for image in images:
        print("====================\nderegistering {image}\n====================".format(image=image))
        # remove DryRun=True to actually deregister and delete
        amiResponse = ec2.deregister_image(DryRun=True, ImageId=image)
        for snapshot in snapshots:
            if snapshot['Description'].find(image) > 0:
                snap = ec2.delete_snapshot(SnapshotId=snapshot['SnapshotId'], DryRun=True)
                print("Deleting snapshot {snapshot} \n".format(snapshot=snapshot['SnapshotId']))

if __name__ == '__main__':
    # region is the first argument, comma-separated AMI ids the second
    main(sys.argv[1], sys.argv[2].split(','))
Using the EC2.Image resource you can simply call deregister():
Example:
import boto3

ec2res = boto3.resource('ec2')
for i in ec2res.images.filter(Owners=['self']):
    print("Name: {}\t Id: {}\tState: {}\n".format(i.name, i.id, i.state))
    i.deregister()
See this for using different filters:
What are valid values documented for ec2.images.filter command?
See also: https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/ec2.html#EC2.Image.deregister