Celery how to create a task group in a for loop - python

I need to create a Celery group task and wait until it has finished, but the docs are not clear to me on how to achieve this.
This is my current state:
def import_media(request):
    keys = []
    for obj in s3_resource.Bucket(env.str('S3_BUCKET')).objects.all():
        if obj.key.endswith(('.m4v', '.mp4', '.m4a', '.mp3')):
            keys.append(obj.key)
    for key in keys:
        url = s3_client.generate_presigned_url(
            ClientMethod='get_object',
            Params={'Bucket': env.str('S3_BUCKET'), 'Key': key},
            ExpiresIn=86400,
        )
        if not Files.objects.filter(descriptor=strip_descriptor_url_scheme(url)).exists():
            extract_descriptor.apply_async(kwargs={"descriptor": str(url)})
    return None
Now I need to create a new task inside the group for every URL I have. How can I do that?
I now managed to get my flow working like this:
#require_http_methods(("GET"))
def import_media(request):
    keys = []
    urls = []
    for obj in s3_resource.Bucket(env.str('S3_BUCKET')).objects.all():
        if obj.key.endswith(('.m4v', '.mp4', '.m4a', '.mp3')):
            keys.append(obj.key)
    for key in keys:
        url = s3_client.generate_presigned_url(
            ClientMethod='get_object',
            Params={'Bucket': env.str('S3_BUCKET'), 'Key': key},
            ExpiresIn=86400,
        )
        if not Files.objects.filter(descriptor=strip_descriptor_url_scheme(url)).exists():
            new_file = Files.objects.create(descriptor=strip_descriptor_url_scheme(url))
            new_file.save()
            urls.append(url)
    workflow = (
        group([extract_descriptor.s(url) for url in urls]).delay()
    )
    workflow.get(timeout=None, interval=0.5)
    print("hello - Further processing here")
    return None
Any suggestions to optimize this? At least now it's working nicely!
Thanks in advance.

https://docs.celeryproject.org/en/latest/userguide/canvas.html#groups
A group runs all tasks regardless of whether any of them fail, while a chain runs the next task only if the previous one succeeds. Rather than calling apply_async each time in the for loop, you can use the signature method, which applies the args but doesn't execute the task until you're ready.
from celery import group
...
all_urls = []
for key in keys:
    url = s3_client.generate_presigned_url(
        ClientMethod='get_object',
        Params={'Bucket': env.str('S3_BUCKET'), 'Key': key},
        ExpiresIn=86400,
    )
    if not Files.objects.filter(descriptor=strip_descriptor_url_scheme(url)).exists():
        all_urls.append(url)

# create the group; each signature holds its args but does not run yet
g = group(extract_descriptor.s(descriptor=str(url)) for url in all_urls)

result = g()  # you may need to call g.apply_async(), but this executes all tasks in the group
result.ready()       # have all subtasks completed?
result.successful()  # were all subtasks successful?
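If the "further processing" step should only run once every URL has been handled, a chord (a group plus a callback) avoids blocking the web request on result.get(). A minimal sketch, assuming a hypothetical post_process task defined elsewhere:

from celery import chord

# post_process is a hypothetical @app.task that receives the list of group results.
callback = post_process.s()
chord(extract_descriptor.s(descriptor=str(url)) for url in all_urls)(callback)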

Related

Use list items in variable in python requests url

I am trying to make a call to an API and then grab event_ids from the data. I then want to use those event IDs as variables in another request, parse that data, and loop back to make another request using the next event ID, for all the IDs.
So far I have the following:
def nba_odds():
    url = "https://xxxxx.com.au/sports/summary/basketball?api_key=xxxxx"
    response = requests.get(url)
    data = response.json()
    event_ids = []
    for event in data['Events']:
        if event['Country'] == 'USA' and event['League'] == 'NBA':
            event_ids.append(event['EventID'])
    # print(event_ids)
    game_url = f'https://xxxxx.com.au/sports/detail/{event_ids}?api_key=xxxxx'
    game_response = requests.get(game_url)
    game_data = game_response.json()
    print(game_url)
That gives me the result below in the terminal:
https://xxxxx.com.au/sports/detail/['dbx-1425135', 'dbx-1425133', 'dbx-1425134', 'dbx-1425136', 'dbx-1425137', 'dbx-1425138', 'dbx-1425139', 'dbx-1425140', 'anyvsany-nba01-1670043600000000000', 'dbx-1425141', 'dbx-1425142', 'dbx-1425143', 'dbx-1425144', 'dbx-1425145', 'dbx-1425148', 'dbx-1425149', 'dbx-1425147', 'dbx-1425146', 'dbx-1425150', 'e95270f6-661b-46dc-80b9-cd1af75d38fb', '0c989be7-0802-4683-8bb2-d26569e6dcf9']?api_key=779ac51a-2fff-4ad6-8a3e-6a245a0a4cbb
The URL format should instead look like:
https://xxxx.com.au/sports/detail/dbx-1425135
If anyone can point me in the right direction it would be appreciated.
thanks.
You need to loop over the event IDs and call the API with one event_id at a time, since it does not support multiple event IDs, like:
all_events_response = []
for event_id in event_ids:
    game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
    game_response = requests.get(game_url)
    game_data = game_response.json()
    all_events_response.append(game_data)
    print(game_url)
You can then find the list of JSON responses in all_events_response.
event_ids is an entire list of event IDs. You make a single URL with the full list converted to its string representation (['dbx-1425135', 'dbx-1425133', ...]). But it looks like you want to get information on each event in turn. To do that, put the second request inside the loop so that it runs for every event you find interesting.
def nba_odds():
    url = "https://xxxxx.com.au/sports/summary/basketball?api_key=xxxxx"
    response = requests.get(url)
    data = response.json()
    event_ids = []
    for event in data['Events']:
        if event['Country'] == 'USA' and event['League'] == 'NBA':
            event_id = event['EventID']
            # print(event_id)
            game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
            game_response = requests.get(game_url)
            game_data = game_response.json()
            # do something with game_data - it will be overwritten
            # on the next round of the loop
            print(game_url)
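If you need all of the responses once the loop has finished, collect them in a list rather than letting game_data be overwritten. A small sketch of the same loop, accumulating results (same data and URL pattern as above):

games = []
for event in data['Events']:
    if event['Country'] == 'USA' and event['League'] == 'NBA':
        event_id = event['EventID']
        game_url = f'https://xxxxx.com.au/sports/detail/{event_id}?api_key=xxxxx'
        game_response = requests.get(game_url)
        games.append(game_response.json())
# games now holds one parsed JSON payload per NBA event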

SSM Describe Instance Information can only accept 1 value in the filter

I'm trying to list all of my SSM instances (both EC2 and Managed instances), but it seems that I can't do it all in one filter?
I'm using the paginator function to get the information on my instances, and then use this filter:
paginator = ssm_client.get_paginator('describe_instance_information')
response_iterator = paginator.paginate(
    Filters=[
        {
            'Key': 'ResourceType',
            'Values': ['ManagedInstance'],
        },
    ],
    PaginationConfig={
        # 'MaxItems': 100,
    }
)
This filter only gets the ManagedInstance list.
'Key': 'ResourceType',
'Values': ['ManagedInstance'],
Since the Values value accepts a list, it seems dumb that it can't take more than one value. If I use something like this:
'Key': 'ResourceType',
'Values': ['ManagedInstance', 'EC2Instance'],
Then I would get this error:
botocore.errorfactory.InvalidInstanceInformationFilterValue: An error occurred (InvalidInstanceInformationFilterValue) when calling the DescribeInstanceInformation operation: ResourceType filter may contain only one value.
Later in my script, I'm looping over that response_iterator variable. I'm not sure what my workaround should be if I want to loop over all of my instances (both EC2 and Managed).
My loop looks something like this:
for item in response_iterator:
    for instance in item['InstanceInformationList']:
        if instance.get('PingStatus') == 'Online':
            instanceName = instance.get('ComputerName')
            # etc
What is my best option to bypass this restriction?
Also, is this a boto3 limitation or is it coming from the AWS SDK? I haven't been able to figure it out.
Edit:
So one possible solution was the following:
import boto3

ssm_client = boto3.client('ssm')
ec2_client = boto3.client('ec2')

combined = []
rtypes = ['ManagedInstance', 'EC2Instance']

for rtype in rtypes:
    paginator = ssm_client.get_paginator('describe_instance_information')
    response_iterator = paginator.paginate(
        Filters=[
            {
                'Key': 'ResourceType',
                'Values': [rtype],
            },
        ],
        PaginationConfig={
            # 'MaxItems': 10,
        }
    )
    combined.append(list(response_iterator))

for item in combined:
    for instance in item[0]['InstanceInformationList']:
        if instance.get('PingStatus') == 'Online':
            instanceName = instance.get('ComputerName')
            print(instanceName)
It seems like it only prints 10 instances per rtype, which makes me think the paginator doesn't do its magic here. It's as if I'm using the regular boto3 describe_instance_information function, which in fact only returns the first page of the SSM instances.
You can run the paginator twice and create a list with the results, then iterate over the results. I only have one EC2 instance running, so validating the nested if statement at the end is difficult to do, but this should get you started.
combined = []
rtypes = ['ManagedInstance', 'EC2Instance']

for rtype in rtypes:
    paginator = ssm_client.get_paginator('describe_instance_information')
    response_iterator = paginator.paginate(
        Filters=[
            {
                'Key': 'ResourceType',
                'Values': [rtype],
            },
        ],
        PaginationConfig={
            # 'MaxItems': 100,
        }
    )
    combined.append(list(response_iterator))

print(combined)

for r in combined:
    # print(r)
    if len(r[0]['InstanceInformationList']) > 0:
        # print(r[0]['InstanceInformationList'])
        for instance in r[0]['InstanceInformationList']:
            if instance.get('PingStatus') == 'Online':
                instanceName = instance.get('ComputerName')
                print(instanceName)
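Note that r[0] only ever looks at the first page returned for each resource type, which is likely why only a handful of instances print per rtype. A small variation of the final loop that walks every page instead:

for pages in combined:               # one entry per resource type
    for page in pages:               # every page the paginator returned
        for instance in page['InstanceInformationList']:
            if instance.get('PingStatus') == 'Online':
                print(instance.get('ComputerName'))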
Looks like I found a workaround. Instead of using the built-in paginator, I found another method of creating the pages yourself:
import boto3

ssm_client = boto3.client('ssm')
ec2_client = boto3.client('ec2')

def fetch_instance_pages():
    token = ''
    while True:
        # for i in range(1):  # Number of pages - 10 instances per page
        page = ssm_client.describe_instance_information(NextToken=token)
        yield page
        token = page.get('NextToken')
        if not token:
            break

def fetch_instances():
    for page in fetch_instance_pages():
        for instance in page['InstanceInformationList']:
            yield instance

counter = 0
for instance in fetch_instances():
    if instance.get('PingStatus') == 'Online':
        counter += 1
        instanceName = instance.get('ComputerName')
        print(counter, instanceName)
Obviously you don't need the counter; it's just there for convenience when you look at the printed output.
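For reference, an alternative sketch that keeps the ResourceType filter but still walks every page, by running the paginator once per type and chaining the two generators (assuming the same ssm_client as above):

import itertools

def iter_instances(resource_type):
    # Walk every page for one ResourceType value, yielding instances as we go.
    paginator = ssm_client.get_paginator('describe_instance_information')
    pages = paginator.paginate(
        Filters=[{'Key': 'ResourceType', 'Values': [resource_type]}]
    )
    for page in pages:
        for instance in page['InstanceInformationList']:
            yield instance

for instance in itertools.chain(iter_instances('ManagedInstance'),
                                iter_instances('EC2Instance')):
    if instance.get('PingStatus') == 'Online':
        print(instance.get('ComputerName'))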

Python multiprocessing with different constant for each thread

I want to create a pool with a function calling the boto3 API, using a different bucket name for each thread.
My function is:
def face_reko(source_data, target_data):
    bucket = s3.Bucket(bucket_name)
    for key in bucket.objects.all():
        key.delete()
    s3_client.put_object(Bucket=bucket_name, Key=target_img, Body=target_data)
    s3_client.put_object(Bucket=bucket_name, Key=source_img, Body=source_data)
    response = reko.compare_faces(
        SourceImage={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': source_img
            }
        },
        TargetImage={
            'S3Object': {
                'Bucket': bucket_name,
                'Name': target_img
            }
        }
    )
    if len(response['FaceMatches']) > 0:
        return True
    else:
        return False
So basically it deletes everything in the bucket, uploads 2 new images, then uses the Rekognition API to compare the 2 images. Since I can't create the same image twice in the same bucket, I'd like to create a bucket for each thread and pass a constant to the function for the bucket name instead of the bucket_name const.
So finally I've been able to find a way to work around my issue. Instead of mapping my function to a pool, I've created a Worker class defined like this:
class Worker():
    def __init__(self, proc_number, splited_list):
        self.pool = Pool(proc_number)
        self.proc_number = proc_number
        self.splited_list = splited_list

    def callback(self, result):
        if result:
            self.pool.terminate()

    def do_job(self):
        for i in range(self.proc_number):
            self.pool.apply_async(face_reko, args=(self.splited_list[i], source_data, i), callback=self.callback)
        self.pool.close()
        self.pool.join()
So the Worker object is constructed with a number of processes and a list of lists (splited_list is my main list split into number_of_proc parts). Then, when the do_job function is called, the pool starts by creating processes with an id i that can be used in the face_reko function. The pool stops when face_reko returns True. To start the Worker's pool, just create a Worker and call the do_job function like so:
w = Worker(proc_number=proc_number, splited_list=splited_list)
w.do_job()
Hoping it'll help someone else!
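For reference, the bucket name can also be passed as an ordinary argument instead of a module-level constant, so each worker gets its own bucket. A minimal sketch, assuming hypothetical bucket_names (one pre-created bucket per worker) and job_data (one (source_data, target_data) pair per comparison):

from multiprocessing import Pool

def face_reko(source_data, target_data, bucket_name):
    # same body as the original function, with bucket_name now an argument
    ...

# Pair each job with its own bucket and fan the work out over the pool.
jobs = [(src, tgt, bucket) for (src, tgt), bucket in zip(job_data, bucket_names)]

with Pool(processes=len(bucket_names)) as pool:
    results = pool.starmap(face_reko, jobs)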

List out auto scaling group names with a specific application tag using boto3

I was trying to fetch Auto Scaling groups with the Application tag value 'CCC'.
The list is as below:
gweb    prd-dcc-eap-w2
gweb    prd-dcc-emc
gweb    prd-dcc-ems
CCC     dev-ccc-wer
CCC     dev-ccc-gbg
CCC     dev-ccc-wer
The script I coded below gives output which includes one ASG without the CCC tag.
#!/usr/bin/python
import boto3

client = boto3.client('autoscaling', region_name='us-west-2')
response = client.describe_auto_scaling_groups()
ccc_asg = []
all_asg = response['AutoScalingGroups']
for i in range(len(all_asg)):
    all_tags = all_asg[i]['Tags']
    for j in range(len(all_tags)):
        if all_tags[j]['Key'] == 'Name':
            asg_name = all_tags[j]['Value']
            # print asg_name
        if all_tags[j]['Key'] == 'Application':
            app = all_tags[j]['Value']
            # print app
        if all_tags[j]['Value'] == 'CCC':
            ccc_asg.append(asg_name)
print ccc_asg
The output I am getting is as below:
['prd-dcc-ein-w2', 'dev-ccc-hap', 'dev-ccc-wfd', 'dev-ccc-sdf']
Whereas 'prd-dcc-ein-w2' is an ASG with a different tag, 'gweb', and the last one (dev-ccc-msp-agt-asg) in the CCC ASG list is missing. I need the output as below:
dev-ccc-hap-sdf
dev-ccc-hap-gfh
dev-ccc-hap-tyu
dev-ccc-mso-hjk
Am I missing something?
In boto3 you can use Paginators with JMESPath filtering to do this very effectively and in a more concise way.
From boto3 docs:
JMESPath is a query language for JSON that can be used directly on
paginated results. You can filter results client-side using JMESPath
expressions that are applied to each page of results through the
search method of a PageIterator.
When filtering with JMESPath expressions, each page of results that is
yielded by the paginator is mapped through the JMESPath expression. If
a JMESPath expression returns a single value that is not an array,
that value is yielded directly. If the result of applying the JMESPath
expression to a page of results is a list, then each value of the list
is yielded individually (essentially implementing a flat map).
Here is how it looks in Python code, with the mentioned CCP value for the Application tag of the Auto Scaling group:
import boto3

client = boto3.client('autoscaling')
paginator = client.get_paginator('describe_auto_scaling_groups')
page_iterator = paginator.paginate(
    PaginationConfig={'PageSize': 100}
)
filtered_asgs = page_iterator.search(
    'AutoScalingGroups[] | [?contains(Tags[?Key==`{}`].Value, `{}`)]'.format(
        'Application', 'CCP')
)
for asg in filtered_asgs:
    print asg['AutoScalingGroupName']
Elaborating on Michal Gasek's answer, here's an option that filters ASGs based on a dict of tag:value pairs.
def get_asg_name_from_tags(tags):
    asg_name = None
    client = boto3.client('autoscaling')
    while True:
        paginator = client.get_paginator('describe_auto_scaling_groups')
        page_iterator = paginator.paginate(
            PaginationConfig={'PageSize': 100}
        )
        filter = 'AutoScalingGroups[]'
        for tag in tags:
            filter = ('{} | [?contains(Tags[?Key==`{}`].Value, `{}`)]'.format(filter, tag, tags[tag]))
        filtered_asgs = page_iterator.search(filter)
        asg = filtered_asgs.next()
        asg_name = asg['AutoScalingGroupName']
        try:
            asgX = filtered_asgs.next()
            asgX_name = asgX['AutoScalingGroupName']
            raise AssertionError('multiple ASG\'s found for {} = {},{}'
                                 .format(tags, asg_name, asgX_name))
        except StopIteration:
            break
    return asg_name
eg:
asg_name = get_asg_name_from_tags({'Env':env, 'Application':'app'})
It expects there to be only one result and checks this by trying to use next() to get another. The StopIteration is the "good" case, which then breaks out of the paginator loop.
I got it working with the script below.
#!/usr/bin/python
import boto3

client = boto3.client('autoscaling', region_name='us-west-2')
response = client.describe_auto_scaling_groups()
ccp_asg = []
all_asg = response['AutoScalingGroups']
for i in range(len(all_asg)):
    all_tags = all_asg[i]['Tags']
    app = False
    asg_name = ''
    for j in range(len(all_tags)):
        if 'Application' in all_tags[j]['Key'] and all_tags[j]['Value'] in ('CCP'):
            app = True
        if app:
            if 'Name' in all_tags[j]['Key']:
                asg_name = all_tags[j]['Value']
                ccp_asg.append(asg_name)
print ccp_asg
Feel free to ask if you have any doubts.
The right way to do this isn't via describe_auto_scaling_groups at all but via describe_tags, which will allow you to make the filtering happen on the server side.
You can construct a filter that asks for the tag key Application with any of a number of values:
Filters=[
    {
        'Name': 'key',
        'Values': [
            'Application',
        ]
    },
    {
        'Name': 'value',
        'Values': [
            'CCC',
        ]
    },
],
And then your results (in Tags in the response) are all the times when a matching tag is applied to an autoscaling group. You will have to make the call multiple times, passing back NextToken every time there is one, to go through all the pages of results.
Each result includes an ASG ID that the matching tag is applied to. Once you have all the ASG IDs you are interested in, then you can call describe_auto_scaling_groups to get their names.
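A rough sketch of that flow, collecting the ResourceId of each matching tag (for Auto Scaling groups this should be the group name) and letting a paginator handle NextToken:

import boto3

client = boto3.client('autoscaling')

# Gather the name of every ASG that carries Application=CCC.
asg_names = set()
paginator = client.get_paginator('describe_tags')
for page in paginator.paginate(
    Filters=[
        {'Name': 'key', 'Values': ['Application']},
        {'Name': 'value', 'Values': ['CCC']},
    ]
):
    for tag in page['Tags']:
        asg_names.add(tag['ResourceId'])

print(sorted(asg_names))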
Yet another solution, in my opinion simple enough to extend:
client = boto3.client('autoscaling')
search_tags = {"environment": "stage"}
filtered_asgs = []
response = client.describe_auto_scaling_groups()

for group in response['AutoScalingGroups']:
    flattened_tags = {
        tag_info['Key']: tag_info['Value']
        for tag_info in group['Tags']
    }
    if search_tags.items() <= flattened_tags.items():
        filtered_asgs.append(group)

print(filtered_asgs)

context for using `yield` keyword in python

I have the following program to scrape data from a website. I want to improve the code below by using a generator with yield instead of calling generate_url and callme multiple times sequentially. The purpose of this exercise is to properly understand yield and the context in which it can be used.
import requests
import shutil

start_date = '03-03-1997'
end_date = '10-04-2015'
yf_base_url = 'http://real-chart.finance.yahoo.com/table.csv?s=%5E'
index_list = ['BSESN', 'NSEI']

def generate_url(index, start_date, end_date):
    s_day = start_date.split('-')[0]
    s_month = start_date.split('-')[1]
    s_year = start_date.split('-')[2]
    e_day = end_date.split('-')[0]
    e_month = end_date.split('-')[1]
    e_year = end_date.split('-')[2]
    if (index == 'BSESN') or (index == 'NSEI'):
        url = yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(s_day, s_month, s_year, e_day, e_month, e_year)
        return url

def callme(url, index):
    print('URL {}'.format(url))
    r = requests.get(url, verify=False, stream=True)
    if r.status_code != 200:
        print "Failure!!"
        exit()
    else:
        r.raw.decode_content = True
        with open(index + "file.csv", 'wb') as f:
            shutil.copyfileobj(r.raw, f)
        print "Success"

if __name__ == '__main__':
    url = generate_url(index_list[0], start_date, end_date)
    callme(url, index_list[0])
    url = generate_url(index_list[1], start_date, end_date)
    callme(url, index_list[1])
There are multiple options. You could use yield to iterate over URLs, or over request objects.
If your index_list were long, I would suggest yielding URLs, because then you could use multiprocessing.Pool to map a function that does a request and saves the output over these URLs. That would execute them in parallel, potentially making it a lot faster (assuming that you have enough network bandwidth, and that Yahoo Finance doesn't throttle connections).
import multiprocessing

yf = ('http://real-chart.finance.yahoo.com/table.csv?s=%5E'
      '{}&a={}&b={}&c={}&d={}&e={}&f={}')
index_list = ['BSESN', 'NSEI']

def genurl(symbols, start_date, end_date):
    # assemble the URLs
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for s in symbols:
        url = yf.format(s, s_day, s_month, s_year, e_day, e_month, e_year)
        yield url

def download(url):
    # Do the request, save the file
    ...

p = multiprocessing.Pool()
rv = p.map(download, genurl(index_list, '03-03-1997', '10-04-2015'))
If I understand you correctly, what you want to know is how to change the code so that you can replace the last part with:
if __name__ == '__main__':
    for url in generate_url(index_list, start_date, end_date):
        callme(url, index)
If this is correct, you need to change generate_url, but not callme. Changing generate_url is rather mechanical. Make the first parameter index_list instead of index, wrap the function body in a for index in index_list loop, and change return url to yield url.
You don't need to change callme, because you never want to say something like for call in callme(...). You won't do anything with it other than a normal function call.
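Concretely, that mechanical change might look roughly like this (same URL-building logic as the original, just looped over index_list and yielded):

def generate_url(index_list, start_date, end_date):
    s_day, s_month, s_year = start_date.split('-')
    e_day, e_month, e_year = end_date.split('-')
    for index in index_list:
        if index in ('BSESN', 'NSEI'):
            # yield one URL per index instead of returning a single one
            yield yf_base_url + index + '&a={}&b={}&c={}&d={}&e={}&f={}'.format(
                s_day, s_month, s_year, e_day, e_month, e_year)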
