I want to create a pool with a function that calls the boto3 API, using a different bucket name for each thread. My function is:
# bucket_name, source_img, target_img, s3 (S3 resource), s3_client and reko (Rekognition client)
# are module-level globals in the original script.
def face_reko(source_data, target_data):
    # Empty the bucket before uploading the new pair of images
    bucket = s3.Bucket(bucket_name)
    for key in bucket.objects.all():
        key.delete()
    s3_client.put_object(Bucket=bucket_name, Key=target_img, Body=target_data)
    s3_client.put_object(Bucket=bucket_name, Key=source_img, Body=source_data)
    response = reko.compare_faces(
        SourceImage={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': source_img
            }
        },
        TargetImage={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': target_img
            }
        }
    )
    if len(response['FaceMatches']) > 0:
        return True
    else:
        return False
So basically it deletes everything in the bucket, uploads two new images, then uses the Rekognition API to compare the two images. Since I can't store the same image key twice in the same bucket, I'd like to create a bucket for each thread and then pass the bucket name to the function as an argument instead of using the bucket_name constant.
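For illustration, one way to do what the question asks is to pass the bucket name into the function and give each process its own bucket. A minimal sketch, assuming face_reko is extended to take the bucket name as a third parameter and the buckets already exist; the placeholder bucket names and the chunking helper are not part of the original code:

from multiprocessing import Pool

BUCKETS = ['reko-bucket-0', 'reko-bucket-1', 'reko-bucket-2', 'reko-bucket-3']  # placeholder names, one per process

def face_reko_chunk(pairs, bucket_name):
    # Each process works against its own bucket, so uploads never collide.
    return [face_reko(source_data, target_data, bucket_name)
            for source_data, target_data in pairs]

def run_pool(chunks):
    # chunks: the (source_data, target_data) pairs split into len(BUCKETS) sub-lists
    with Pool(len(BUCKETS)) as pool:
        return pool.starmap(face_reko_chunk, zip(chunks, BUCKETS))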
So finally I've been able to find a way to work around my issue. Instead of mapping my function to a pool, I've created a Worker class defined like this:
from multiprocessing import Pool  # the OP refers to processes, so multiprocessing.Pool is assumed

class Worker():
    def __init__(self, proc_number, splited_list):
        self.pool = Pool(proc_number)
        self.proc_number = proc_number
        self.splited_list = splited_list

    def callback(self, result):
        # As soon as one process reports a match, stop the whole pool
        if result:
            self.pool.terminate()

    def do_job(self):
        for i in range(self.proc_number):
            self.pool.apply_async(face_reko,
                                  args=(self.splited_list[i], source_data, i),
                                  callback=self.callback)
        self.pool.close()
        self.pool.join()
So the Worker object is constructed with a number of processes and a list of lists (splited_list is my main list split into proc_number sub-lists). Then, when the do_job function is called, the pool starts processes with an id i that can be used inside the face_reko function. The pool stops as soon as face_reko returns True. To start the Worker's pool, just create a Worker and call its do_job function like so:
w = Worker(proc_number=proc_number, splited_list=splited_list)
w.do_job()
Hoping it'll help someone else!
This is the same as this question, but I also want to limit the depth returned.
Currently, all answers return all the objects after the specified prefix. I want to see just what's in the current hierarchy level.
Current code that returns everything:
self._session = boto3.Session(
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
self._s3 = self._session.resource("s3")
bucket = self._s3.Bucket(bucket_name)
detections_contents = bucket.objects.filter(Prefix=prefix)
for object_summary in detections_contents:
    print(object_summary.key)
How can I see only the files and folders directly under prefix? How can I go n levels deep?
I could parse everything locally, but that is clearly not what I am looking for here.
There is no definitive way to do this with list objects without getting all the objects under the prefix.
But there is a way using S3 Select, which uses a SQL-like query format to go n levels deep and to get file contents as well as object keys.
If you are fine with writing SQL, then use this.
Reference doc
import boto3
import json
s3 = boto3.client('s3')
bucket_name = 'my-bucket'
prefix = 'my-directory/subdirectory/'
input_serialization = {
    'CompressionType': 'NONE',
    'JSON': {
        'Type': 'LINES'
    }
}
output_serialization = {
    'JSON': {}
}
# Set the SQL expression to select the key field for all objects in the subdirectory
expression = 'SELECT s.key FROM S3Object s WHERE s.key LIKE \'' + prefix + '%\''
response = s3.select_object_content(
    Bucket=bucket_name,
    ExpressionType='SQL',
    Expression=expression,
    InputSerialization=input_serialization,
    OutputSerialization=output_serialization
)
# The response will contain a Payload field with the selected data
payload = response['Payload']
for event in payload:
    if 'Records' in event:
        records = event['Records']['Payload']
        data = json.loads(records.decode('utf-8'))
        # The data will be a list of objects, each with a "key" field representing the file name
        for item in data:
            print(item['key'])
There is no built-in way with the boto3 or S3 APIs to do this. You'll need some version of processing each level and asking in turn for a list of objects at that level:
import boto3

s3 = boto3.client('s3')
bucket_name = 'my-bucket'  # placeholder: the bucket to list
max_depth = 2
paginator = s3.get_paginator('list_objects_v2')

# Track all prefixes to show with a list
common_prefixes = [(0, "")]

while len(common_prefixes) > 0:
    # Pull out the next prefix to show
    current_depth, current_prefix = common_prefixes.pop(0)

    # Loop through all of the items using a paginator to handle common prefixes with more
    # than a thousand items
    for page in paginator.paginate(Bucket=bucket_name, Prefix=current_prefix, Delimiter='/'):
        for cur in page.get("CommonPrefixes", []):
            # Show each common prefix, here just use a format like the AWS CLI does
            print(" " * 27 + f"PRE {cur['Prefix']}")
            if current_depth < max_depth:
                # This is below the max depth we want to show, so
                # add it to the list to be shown
                common_prefixes.append((current_depth + 1, cur['Prefix']))
        for cur in page.get("Contents", []):
            # Show each item sharing this common prefix using a format like the AWS CLI
            print(f"{cur['LastModified'].strftime('%Y-%m-%d %H:%M:%S')}{cur['Size']:11d} {cur['Key']}")
I need to create a Celery group task and wait until it has finished, but the docs are not clear to me on how to achieve this.
This is my current state:
def import_media(request):
    keys = []
    for obj in s3_resource.Bucket(env.str('S3_BUCKET')).objects.all():
        if obj.key.endswith(('.m4v', '.mp4', '.m4a', '.mp3')):
            keys.append(obj.key)
    for key in keys:
        url = s3_client.generate_presigned_url(
            ClientMethod='get_object',
            Params={'Bucket': env.str('S3_BUCKET'), 'Key': key},
            ExpiresIn=86400,
        )
        if not Files.objects.filter(descriptor=strip_descriptor_url_scheme(url)).exists():
            extract_descriptor.apply_async(kwargs={"descriptor": str(url)})
    return None
Now I need to create a new task inside the group for every URL I have. How can I do that?
I now managed to get my flow working like this:
@require_http_methods(["GET"])
def import_media(request):
    keys = []
    urls = []
    for obj in s3_resource.Bucket(env.str('S3_BUCKET')).objects.all():
        if obj.key.endswith(('.m4v', '.mp4', '.m4a', '.mp3')):
            keys.append(obj.key)
    for key in keys:
        url = s3_client.generate_presigned_url(
            ClientMethod='get_object',
            Params={'Bucket': env.str('S3_BUCKET'), 'Key': key},
            ExpiresIn=86400,
        )
        if not Files.objects.filter(descriptor=strip_descriptor_url_scheme(url)).exists():
            new_file = Files.objects.create(descriptor=strip_descriptor_url_scheme(url))
            new_file.save()
            urls.append(url)
    workflow = (
        group([extract_descriptor.s(url) for url in urls]).delay()
    )
    workflow.get(timeout=None, interval=0.5)
    print("hello - Further processing here")
    return None
Any suggestions to optimize this? At least now it's working nicely!
Thanks in advance.
https://docs.celeryproject.org/en/latest/userguide/canvas.html#groups
A group runs all of its tasks regardless of whether any of them fail, whereas a chain only runs the next task if the previous one succeeds. What you could do, rather than calling apply_async each time in the for loop, is use the signature method, which binds the arguments but doesn't execute the task until you're ready.
from celery import group
...
all_urls = []
for key in keys:
    url = s3_client.generate_presigned_url(
        ClientMethod='get_object',
        Params={'Bucket': env.str('S3_BUCKET'), 'Key': key},
        ExpiresIn=86400,
    )
    if not Files.objects.filter(descriptor=strip_descriptor_url_scheme(url)).exists():
        all_urls.append(url)

g = group(extract_descriptor.s(descriptor=str(url)) for url in all_urls)  # create the group
result = g()  # you may need to call g.apply_async(); this executes all tasks in the group
result.ready()  # have all subtasks completed?
result.successful()  # were all subtasks successful?
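To actually block until the whole group has finished (which is what the question asks for), the GroupResult returned when the group is executed can be waited on. A minimal sketch, assuming a result backend (e.g. Redis or RPC) is configured for Celery:

# Wait for every task in the group to finish; requires a configured result backend.
all_results = result.get(timeout=None, interval=0.5)
print(all_results)  # one return value per subtask, in the order the signatures were added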
Is it possible to use upload_fileobj() with a Python generator, like:
def example():
    gen = (range(3))
    for i in gen:
        yield BytesIO(i)
Then, using boto3, I want to write everything to the same file (the same object key):
for i in example():
    client.upload_fileobj(i)
Thanks a lot for your response
It sounds like you are trying to save all the bytes yielded by the generator to the same object on S3. If so, you can do that with a multipart upload. However, each part must be at least 5 MB (so presumably your generator would need to yield chunks of at least 5 MB).
import boto3

client = boto3.client("s3")

my_bucket = "examplebucket"
my_key = "largeobject"

# Initiate the multipart upload
response = client.create_multipart_upload(
    Bucket=my_bucket,
    Key=my_key,
)

# Record the UploadId
upload_id = response["UploadId"]

# Upload parts and keep track of them for completion (part numbers must start at 1)
parts = []
for i, chunk in enumerate(example(), start=1):
    response = client.upload_part(
        Body=chunk,
        Bucket=my_bucket,
        Key=my_key,
        PartNumber=i,
        UploadId=upload_id,
    )
    parts.append({
        "ETag": response["ETag"],
        "PartNumber": i
    })

# Complete the multipart upload
client.complete_multipart_upload(
    Bucket=my_bucket,
    Key=my_key,
    MultipartUpload={
        'Parts': parts
    },
    UploadId=upload_id
)
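Note that the generator in the question yields much smaller pieces than 5 MB. One way to deal with that is to re-group its output into parts of at least 5 MB before feeding them to upload_part. A minimal sketch; the batch_chunks helper is not part of boto3 and assumes the source yields raw bytes rather than BytesIO objects:

import io

MIN_PART_SIZE = 5 * 1024 * 1024  # S3 requires at least 5 MB for every part except the last

def batch_chunks(chunk_iter, min_size=MIN_PART_SIZE):
    # Re-group arbitrarily small byte chunks into parts of at least min_size bytes.
    buffer = io.BytesIO()
    for chunk in chunk_iter:
        buffer.write(chunk)
        if buffer.tell() >= min_size:
            yield buffer.getvalue()
            buffer = io.BytesIO()
    if buffer.tell():
        yield buffer.getvalue()  # final (possibly smaller) part

# Then iterate over batch_chunks(your_generator()) instead of example() in the loop above.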
I have a Beam pipeline that queries BigQuery and then upload results to BigTable. I'd like to scale out my BigTable instance (from 1 to 10 nodes) before my pipeline starts and then scale back down (from 10 to 1 node) after the results are loaded in to BigTable. Is there any mechanism to do this with Beam?
I'd essentially like to either have two separate transforms, one at the beginning of the pipeline and one at the end, that scale the nodes up and down respectively, or have a DoFn that only triggers setup() and teardown() on one worker.
I've attempted to use the setup() and teardown() DoFn lifecycle methods. But these functions get executed once per worker (and I use hundreds of workers), so the pipeline will attempt to scale BigTable up and down multiple times (and hit the instance and cluster write quotas for the day). So that doesn't really work for my use case. In any case, here's a snippet of a BigTableWriteFn I've been experimenting with:
class _BigTableWriteFn(beam.DoFn):
    def __init__(self, project_id, instance_id, table_id, cluster_id, node_count):
        beam.DoFn.__init__(self)
        self.beam_options = {
            'project_id': project_id,
            'instance_id': instance_id,
            'table_id': table_id,
            'cluster_id': cluster_id,
            'node_count': node_count
        }
        self.table = None
        self.initial_node_count = None
        self.batcher = None
        self.written = Metrics.counter(self.__class__, 'Written Row')

    def setup(self):
        client = Client(project=self.beam_options['project_id'].get(), admin=True)
        instance = client.instance(self.beam_options['instance_id'].get())
        node_count = self.beam_options['node_count'].get()
        cluster = instance.cluster(self.beam_options['cluster_id'].get())
        self.initial_node_count = cluster.serve_nodes
        # I realize this logic is flawed since cluster.serve_nodes will change after the first
        # setup() call, but I first thought setup() and teardown() ran once for the whole transform...
        if node_count != self.initial_node_count:
            cluster.serve_nodes = node_count
            cluster.update()

    ## other lifecycle methods in between, but they aren't important to the question

    def teardown(self):
        client = Client(project=self.beam_options['project_id'].get(), admin=True)
        instance = client.instance(self.beam_options['instance_id'].get())
        cluster = instance.cluster(self.beam_options['cluster_id'].get())
        # Same flawed logic as in setup() above
        if cluster.serve_nodes != self.initial_node_count:
            cluster.serve_nodes = self.initial_node_count
            cluster.update()
I'm also using RuntimeValueProvider parameters for the Bigtable ids (project_id, instance_id, cluster_id, etc.), so I feel that whatever type of transform I use to scale, I'll need to use a DoFn.
Any help would be much appreciated!
So I came up with a hacky approach, but it works.
During setup() of my WriteFn I get the cluster's serve_nodes count (this will obviously change after the first worker calls setup()) and scale out the cluster if it's not at the desired count. In the process() function I yield this count. I then do a beam.CombineGlobally and find the Smallest(1) of those counts. I then pass this to another DoFn that scales the cluster down to that minimal count.
Here are some code snippets of what I'm doing.
class _BigTableWriteFn(beam.DoFn):
    """Creates the connector and calls add_row on the batcher for each
    row in the Beam pipeline.
    """
    def __init__(self, project_id, instance_id, table_id, cluster_id, node_count):
        """Constructor of the Write connector of Bigtable
        Args:
            project_id(str): GCP Project to write the Rows to
            instance_id(str): GCP Instance to write the Rows to
            table_id(str): GCP Table to write the `DirectRows` to
            cluster_id(str): GCP Cluster to scale
            node_count(int): Number of nodes to scale to before writing
        """
        beam.DoFn.__init__(self)
        self.beam_options = {
            'project_id': project_id,
            'instance_id': instance_id,
            'table_id': table_id,
            'cluster_id': cluster_id,
            'node_count': node_count
        }
        self.table = None
        self.current_node_count = None
        self.batcher = None
        self.written = Metrics.counter(self.__class__, 'Written Row')

    def __getstate__(self):
        return self.beam_options

    def __setstate__(self, options):
        self.beam_options = options
        self.table = None
        self.current_node_count = None
        self.batcher = None
        self.written = Metrics.counter(self.__class__, 'Written Row')

    def setup(self):
        client = Client(project=self.beam_options['project_id'].get(), admin=True)
        instance = client.instance(self.beam_options['instance_id'].get())
        cluster = instance.cluster(self.beam_options['cluster_id'].get())
        cluster.reload()
        desired_node_count = self.beam_options['node_count'].get()
        self.current_node_count = cluster.serve_nodes
        if desired_node_count != self.current_node_count:
            cluster.serve_nodes = desired_node_count
            cluster.update()

    def start_bundle(self):
        if self.table is None:
            client = Client(project=self.beam_options['project_id'].get())
            instance = client.instance(self.beam_options['instance_id'].get())
            self.table = instance.table(self.beam_options['table_id'].get())
        self.batcher = self.table.mutations_batcher()

    def process(self, row):
        self.written.inc()
        # You need to set the timestamp in the cells of this row object.
        # When we do a retry we will be mutating the same object, but with this
        # we are going to set our cells with new values.
        # Example:
        # direct_row.set_cell('cf1',
        #                     'field1',
        #                     'value1',
        #                     timestamp=datetime.datetime.now())
        self.batcher.mutate(row)
        # Return the initial node count so we can find the minimum value and scale down BigTable later
        if self.current_node_count:
            yield self.current_node_count

    def finish_bundle(self):
        self.batcher.flush()
        self.batcher = None
class _BigTableScaleNodes(beam.DoFn):
    def __init__(self, project_id, instance_id, cluster_id):
        """Constructor of the Scale connector of Bigtable
        Args:
            project_id(str): GCP Project of the cluster
            instance_id(str): GCP Instance of the cluster
            cluster_id(str): GCP Cluster to scale
        """
        beam.DoFn.__init__(self)
        self.beam_options = {
            'project_id': project_id,
            'instance_id': instance_id,
            'cluster_id': cluster_id,
        }
        self.cluster = None

    def setup(self):
        if self.cluster is None:
            client = Client(project=self.beam_options['project_id'].get(), admin=True)
            instance = client.instance(self.beam_options['instance_id'].get())
            self.cluster = instance.cluster(self.beam_options['cluster_id'].get())

    def process(self, min_node_counts):
        if len(min_node_counts) > 0 and self.cluster.serve_nodes != min_node_counts[0]:
            self.cluster.serve_nodes = min_node_counts[0]
            self.cluster.update()
def run():
    custom_options = PipelineOptions().view_as(CustomOptions)
    pipeline_options = PipelineOptions()

    p = beam.Pipeline(options=pipeline_options)
    (p
     | 'Query BigQuery' >> beam.io.Read(beam.io.BigQuerySource(query=QUERY, use_standard_sql=True))
     | 'Map Query Results to BigTable Rows' >> beam.Map(to_direct_rows)
     | 'Write BigTable Rows' >> beam.ParDo(_BigTableWriteFn(
         custom_options.bigtable_project_id,
         custom_options.bigtable_instance_id,
         custom_options.bigtable_table_id,
         custom_options.bigtable_cluster_id,
         custom_options.bigtable_node_count))
     | 'Find Global Min Node Count' >> beam.CombineGlobally(beam.combiners.Smallest(1))
     | 'Scale Down BigTable' >> beam.ParDo(_BigTableScaleNodes(
         custom_options.bigtable_project_id,
         custom_options.bigtable_instance_id,
         custom_options.bigtable_cluster_id))
    )

    result = p.run()
    result.wait_until_finish()
If you are running the Dataflow job not as a template but as a jar in a VM or pod, then you can do this before the pipeline starts and after it finishes by executing bash commands from Java. Refer to this: https://stackoverflow.com/a/26830876/6849682
Command to execute:
gcloud bigtable clusters update CLUSTER_ID --instance=INSTANCE_ID --num-nodes=NUM_NODES
But if you are running it as a template, the template file won't capture anything other than what happens between pipeline start and end.
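Since the rest of this thread uses the Python SDK, here is a minimal sketch of the same idea in Python, assuming the pipeline is launched from a script (not a template), gcloud is installed and authenticated on the machine running it, and the instance and cluster names are placeholders; p is the pipeline object built earlier:

import subprocess

def scale_bigtable(instance_id, cluster_id, num_nodes):
    # Shell out to gcloud to resize the cluster before/after the pipeline runs.
    subprocess.run(
        ["gcloud", "bigtable", "clusters", "update", cluster_id,
         f"--instance={instance_id}", f"--num-nodes={num_nodes}"],
        check=True,
    )

scale_bigtable("my-instance", "my-cluster", 10)  # scale out before writing
result = p.run()
result.wait_until_finish()
scale_bigtable("my-instance", "my-cluster", 1)   # scale back down afterwards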
I want to create a Python script where I can pass arguments/inputs to specify the instance type and later attach an extra EBS volume (if needed).
ec2 = boto3.resource('ec2', 'us-east-1')

hddSize = input('Enter HDD Size if you want extra space ')
instType = input('Enter the instance type ')

def createInstance():
    ec2.create_instances(
        ImageId=AMI,
        InstanceType=instType,
        SubnetId='subnet-31d3ad3',
        DisableApiTermination=True,
        SecurityGroupIds=['sg-sa4q36fc'],
        KeyName='key'
    )
    return instanceID  ## I know this does nothing

def createEBS():
    ebsVol = ec2.Volume(
        id=instanceID,
        volume_type='gp2',
        size=hddSize
    )
Now, can ec2.create_instances() return the ID, or do I have to iterate over reservations?
Or do I do an ec2.create(instance_id) / return instance_id? The documentation isn't specifically clear here.
In boto3, create_instances returns a list, so to get the instance id that was created in the request, the following works:
ec2_client = boto3.resource('ec2','us-east-1')
response = ec2_client.create_instances(ImageId='ami-12345', MinCount=1, MaxCount=1)
instance_id = response[0].instance_id
The docs state that the call to create_instances()
https://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
returns list(ec2.Instance), so you should be able to get the instance ID(s) from the id property of the object(s) in the list.
You can do the following:
def createInstance():
    instances = ec2.create_instances(
        ImageId=AMI,
        InstanceType=instType,
        SubnetId='subnet-31d3ad3',
        DisableApiTermination=True,
        SecurityGroupIds=['sg-sa4q36fc'],
        KeyName='key'
    )
    # create_instances returns a list of ec2.Instance objects
    return instances[0].instance_id
Actually, create_instances returns a list of ec2.Instance objects, so you need to take the first element to get at the instance itself.
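The question also asks about attaching an extra EBS volume afterwards. Here is a minimal sketch of one way to do that with the resource API once you have the instance id; the region, device name and volume size are placeholders, and the attach_extra_ebs helper is not part of the question's original code:

import boto3

ec2 = boto3.resource('ec2', region_name='us-east-1')
ec2_client = boto3.client('ec2', region_name='us-east-1')

def attach_extra_ebs(instance_id, size_gb, device='/dev/sdf'):
    instance = ec2.Instance(instance_id)
    instance.wait_until_running()                # wait so the instance can accept the attachment
    az = instance.placement['AvailabilityZone']  # the volume must be created in the same AZ
    volume = ec2.create_volume(AvailabilityZone=az, Size=size_gb, VolumeType='gp2')
    ec2_client.get_waiter('volume_available').wait(VolumeIds=[volume.id])
    volume.attach_to_instance(InstanceId=instance_id, Device=device)
    return volume.id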