Django, Celery with Recursion and Twitter API

I'm working with Django 1.4 and Celery 3.0 (RabbitMQ) to build an assemblage of tasks for sourcing and caching queries to the Twitter API 1.1. One thing I am trying to implement is a chain of tasks, the last of which makes a recursive call to the task two nodes back, based on the responses so far and the data in the most recently retrieved response. Concretely, this allows the app to traverse a user timeline (up to 3200 tweets), taking into account that any given request can yield at most 200 tweets (a limitation of the Twitter API).
Key components of my tasks.py can be seen below, but before pasting, I'll show the chain I'm calling from my Python shell (which will ultimately be launched via user inputs in the final web app). Given:
>>> request = dict(twitter_user_id='#1010101010101#',
                   total_requested=1000,
                   max_id=random.getrandbits(128))  # e.g. an arbitrarily large number
I call:
>>> res = (twitter_getter.s(request) |
           pre_get_tweets_for_user_id.s() |
           get_tweets_for_user_id.s() |
           timeline_recursor.s()).apply_async()
The critical thing is that timeline_recursor can initiate a variable number of get_tweets_for_user_id subtasks. When timeline_recursor is in its base case, it should return a response dict as defined here:
@task(rate_limit=None)
def timeline_recursor(request):
    previous_tweets = request.get('previous_tweets', None)  # if it's the first time through, this will be None
    if not previous_tweets:
        previous_tweets = []  # so we initiate to empty array
    tweets = request.get('tweets', None)
    twitter_user_id = request['twitter_user_id']
    previous_max_id = request['previous_max_id']
    total_requested = request['total_requested']
    pulled_in = request['pulled_in']
    remaining_requested = total_requested - pulled_in
    if previous_max_id:
        remaining_requested += 1  # this is because cursored results will always have one overlapping id
    else:
        previous_max_id = random.getrandbits(128)  # for first time through loop
    new_max_id = min([tweet['id'] for tweet in tweets])
    test = lambda x, y: x < y

    if remaining_requested < 0:  # because we overshoot by requesting batches of 200
        remaining_requested = 0

    if tweets:
        previous_tweets.extend(tweets)

    if tweets and remaining_requested and (pulled_in > 1) and test(new_max_id, previous_max_id):
        request = dict(user_pk=user_pk,
                       twitter_user_id=twitter_user_id,
                       max_id=new_max_id,
                       total_requested=remaining_requested,
                       tweets=previous_tweets)
        # problem happens in this part of the logic???
        response = (twitter_getter_config.s(request) |
                    get_tweets_for_user_id.s() |
                    timeline_recursor.s()).apply_async()
    else:
        # if in base case, combine all tweets pulled in thus far and send back
        # as "tweets" -- to be saved in db or otherwise consumed
        response = dict(twitter_user_id=twitter_user_id,
                        total_requested=total_requested,
                        tweets=previous_tweets)
    return response
My expected response for res.result is therefore a dictionary comprised of a twitter user id, a requested number of tweets, and the set of tweets pulled in across successive calls.
However, all is not well in recursive task land. When I run the chain identified above and check res.status right after initiating the chain, it indicates "SUCCESS", even though in the log view of my Celery worker I can see that the chained recursive calls to the Twitter API are being made as expected, with the correct parameters. I can also run res.result immediately, even as the chained tasks are still executing; it yields an AsyncResult instance id. Even after the recursively chained tasks have finished running, res.result remains an AsyncResult id.
On the other hand, I can access my full set of tweets by going to res.result.result.result.result['tweets']. I can deduce that each of the chained subtasks is indeed occurring; I just don't understand why res.result doesn't have the expected result. The recursive returns that should happen when timeline_recursor hits its base case don't appear to be propagating as expected.
Any thoughts on what can be done? Recursion in Celery can get quite powerful, but to me at least, it's not totally apparent how we should be thinking of recursion and recursive functions that utilize Celery and how this affects the logic of return statements in chained tasks.
Happy to clarify as needed, and thanks in advance for any advice.

What does apply_async return (as in, what type of object)?
I don't know Celery, but in Twisted and many other async frameworks, a call like that returns immediately (usually True, or perhaps an object that can track state) because the tasks are deferred into the queue.
Again, not knowing Celery, I would guess that this is happening:
you are: defining response immediately as the async deferred task, but then trying to act on it as if the results have already come in
you want to be: defining a callback routine to run on the results and return a value once the task has completed
Looking at the Celery docs, apply_async accepts callbacks via link, and I couldn't find any example of someone trying to capture a return value from it.
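A minimal sketch of that callback idea (save_tweets is a hypothetical task, not something from the question): apply_async returns an AsyncResult immediately, and a signature passed via link runs with the task's return value once the task finishes.
from celery import task

@task
def save_tweets(response):
    # response is whatever timeline_recursor returned (e.g. the final dict of tweets);
    # persist it or hand it off here instead of trying to read it from res.result
    pass

# res is an AsyncResult right away, long before any tweets have been fetched
res = timeline_recursor.apply_async((request,), link=save_tweets.s())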

generate dynamic task using hooks without running them in backend

I have a simple DAG that takes its arguments (like the SQL and the subject) from a MySQL db.
Then I have a function that creates a report and sends it to a particular email.
Here is a code snippet.
def s_report(k, **kwargs):
    body_sql = list2[k][4]
    request1 = "({})".format(body_sql)
    dwh_hook = SnowflakeHook(snowflake_conn_id="snowflake_conn")
    df1 = dwh_hook.get_pandas_df(request1)
    df2 = df1.to_html()
    body_Text = list2[k][3]
    html_content = f"""HI Team, Please find report<br><br>
    {df2} <br> </br>
    <b>Thank you!</b><br>
    """
    return EmailOperator(task_id="send_email_snowflake{}".format(k), to=list2[k][1],
                         subject=f"{list2[k][2]}", html_content=html_content, dag=dag)

for j in range(len(list)):
    mysql_list >> [s_report(j)] >> end_operator
The s_report tasks are getting generated dynamically, but the real problem is that the hook keeps submitting queries in the backend; even while the DAG is stopped, it is still submitting queries.
I can use PythonOperator, but then the tasks are not generated dynamically.
A couple of things:
By looking at your code, in particular the lines:
for j in range(len(list)):
    mysql_list >> [s_report(j)] >> end_operator
we can determine that if your first task, mysql_list, succeeds, then the tasks downstream of it, namely the s_report calls, should begin executing. You have precisely len(list) of them. Within each s_report call there is exactly one dwh_hook.get_pandas_df(request1) call, so your DAG should be making len(list) calls of this type, provided the mysql_list task succeeds.
As for the mismatch you see in your Snowflake logs, I can't advise you here; I'd need more details. Keep in mind that get_pandas_df might have a retry mechanism (i.e. if it cannot reach Snowflake, retry), which might explain why your Snowflake logs show a bunch of requests.
If your DAG finishes successfully (i.e. the end_operator task finishes successfully), you are correct: there should be no requests in your Snowflake logs that came in after the DAG ended.
If you want more insight as to how your DAG interacts with your Snowflake resource, I'd suggest having a single s_report task like so:
mysql_list >> [ s_report(0)] >> end_operator
and see the behaviour in the logs.
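Written out, that isolated setup would be something like the following sketch (dag, list2, mysql_list and end_operator are assumed to be defined exactly as in the original DAG file):
single_report = s_report(0)
mysql_list >> [single_report] >> end_operator
With only one report task wired in, any extra activity in the Snowflake logs is much easier to attribute.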

Python3.8 Asyncio - Return Results from List of Dictionaries

I am really struggling to figure out how to use asyncio to return a bunch of results from a bunch of AWS Lambda calls. Here is my example.
My team owns a bunch of AWS accounts. For the sake of time, I want to invoke an AWS Lambda function asynchronously for each account to process its information and return the results. I'm trying to understand how I can send a whole bunch of accounts quickly rather than doing it one at a time. Here is my example code.
def call_lambda(acct):
    aws_lambda = boto3.client('lambda', region_name='us-east-2')
    aws_payload = json.dumps(acct)
    response = aws_lambda.invoke(
        FunctionName='MyLambdaName',
        Payload=aws_payload,
    )
    return json.loads(response['Payload'].read())

def main():
    scan_time = datetime.datetime.utcnow()
    accounts = []
    scan_data = []
    account_data = account_parser()
    for account_info in account_data:
        account_info['scan_time'] = scan_time
    for account in account_data:
        scan_data.append(call_lambda(account))
I am struggling to figure out how to do this in an asyncio style. I originally managed to pull it off using concurrent.futures.ThreadPoolExecutor, but I ran into some performance issues. Here is what I had:
executor = concurrent.futures.ThreadPoolExecutor(max_workers=50)
sg_data = executor.map(call_lambda, account_data)
So this worked, but not well, and I was told to use asyncio instead. I read the following articles but I am still lost as to how to make this work. I know AWS Lambda itself is asynchronous and should work fine without a coroutine.
The tl;dr is that I want to kick off call_lambda(acct) for every single dict in my list (account_data is a list of dictionaries) and then return all the results in one big list of dicts again. (This eventually gets written to a CSV; company policy issues are why it's not going into a database.)
I have read the following, still confused...
https://stackabuse.com/python-async-await-tutorial/
Lambda invocations are synchronous (RequestResponse) by default, so you have to set the InvocationType to Event. By doing that, though, you don't get a response back to collate the account info you want; hence the desire to use async or something similar.
response = client.invoke(
    FunctionName='string',
    InvocationType='Event'|'RequestResponse'|'DryRun',
    LogType='None'|'Tail',
    ClientContext='string',
    Payload=b'bytes'|file,
    Qualifier='string'
)
I've not implemented async in Lambda, but as an alternate solution: if you can create an S3 file and have each invocation of FunctionName='MyLambdaName' just update that file, you might get what you need that way.
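For what it's worth, the fan-out described in the question is commonly done by pushing the existing blocking call_lambda onto a thread pool from asyncio and gathering the results. A minimal sketch, assuming call_lambda and account_parser as defined in the question:
import asyncio

async def scan_all(account_data):
    loop = asyncio.get_running_loop()
    # run_in_executor(None, ...) uses asyncio's default ThreadPoolExecutor,
    # so each blocking boto3 invoke runs in its own worker thread
    futures = [loop.run_in_executor(None, call_lambda, acct) for acct in account_data]
    return await asyncio.gather(*futures)

scan_data = asyncio.run(scan_all(account_parser()))
asyncio.gather preserves input order, so scan_data lines up with account_data and can be written straight out to the CSV.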

Where does a dynamodb2 batch begin and end?

I am trying to move my Python code from dynamodb to dynamodb2 to get access to the global secondary index capability. One concept that is a lot less clear to me in ddb2 than in ddb is that of a batch. Here's one version of my new code, which was basically modified from my original ddb code:
item_pIds = []
batch = table.batch_write()
count = 0
while True:
    m = inq.read()
    count = count + 1
    mStr = json.dumps(m)
    pid = m['primaryId']
    if pid in item_pIds:
        print "pid=%d already exists in the batch, ignoring" % pid
        continue
    item_pIds.append(pid)
    sid = m['secondaryId']
    item_data = {"primaryId": pid, "secondaryId": sid, "message": mStr}
    batch.put_item(data=item_data)
    if count >= 25:
        batch = table.batch_write()
        count = 0
        item_pIds = []
So what I am doing here is getting (JSON) messages from a queue. Each message has a primaryId and a secondaryId. The secondaryId is not unique, in that I might get several messages at about the same time that share one. The primaryId is sort of unique: if I get a set of messages at about the same time that have the same primaryId, it's bad. However, from time to time, say once in a few hours, I may get a message that needs to override an existing message with the same primaryId. So this seems to align well with the statement from the dynamodb2 documentation page, similar to that of ddb:
DynamoDB’s maximum batch size is 25 items per request. If you attempt to put/delete more than that, the context manager will batch as many as it can up to that number, then flush them to DynamoDB and continue batching as more calls come in.
However, what I noticed is that a large chunk of messages that I get through the queue never make it to the database. That is, when I try to retrieve them later, they are not there. So I was told that a better way of handling batch writes is by doing something like this:
with table.batch_write() as batch:
    while True:
        m = inq.read()
        mStr = json.dumps(m)
        pid = m['primaryId']
        sid = m['secondaryId']
        item_data = {"primaryId": pid, "secondaryId": sid, "message": mStr}
        batch.put_item(data=item_data)
That is, I only call batch_write() once, similar to how I would open a file only once and then write into it continuously. But in this case, I don't understand what the "rule of 25 max" means. When does a batch start and end? And how do I check for duplicate primaryIds? That is, remembering all messages that I ever received through the queue is not realistic, since (i) I have too many of them (the system runs 24/7) and (ii) as I stated before, occasional repeated ids are OK.
Sorry for the long message.
A batch will start whenever the request is sent and end when the last request in the batch is completed.
As with any RESTful API, every request comes with a cost, meaning how many resources it will take to complete said request. With batch_write() in DynamoDB2, the requests are wrapped into a group and queued for processing, which reduces the cost since they are no longer individual requests.
batch_write() returns a context manager that handles the individual requests, and what you get back slightly resembles a Table object, but it only has the put_item and delete_item methods.
DynamoDB's max batch size is 25, just like you've read. From the comments in the source code:
DynamoDB's maximum batch size is 25 items per request. If you attempt to put/delete more than that, the context manager will batch as many as it can up to that number, then flush them to DynamoDB & continue batching as more calls come in.
You can also read about migrating, batches in particular, from DynamoDB to DynamoDB2 here.
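Tying that back to the duplicate-primaryId concern in the question, a rough sketch (same boto dynamodb2 Table/batch_write API as above; the deque size is an arbitrary choice, not anything DynamoDB requires) is to keep the single context manager and only remember recently seen ids:
from collections import deque

recent_pids = deque(maxlen=1000)  # remember only the most recently seen primaryIds

with table.batch_write() as batch:
    while True:
        m = inq.read()
        pid = m['primaryId']
        if pid in recent_pids:
            continue  # drop near-simultaneous duplicates; older ids eventually fall out of the deque
        recent_pids.append(pid)
        batch.put_item(data={"primaryId": pid,
                             "secondaryId": m['secondaryId'],
                             "message": json.dumps(m)})
        # the context manager flushes to DynamoDB on its own in chunks of up to 25 items,
        # so there is no per-25 bookkeeping to do here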

get_rate_status "remaining hits" does not decrease when I make GET calls anymore

I have a script written to iterate through a list of Twitter userIDs and save the lists of follower_ids to a file. I have used it several times with no problems. For long lists, I have added this piece of code to check the rate limit before every GET request and sleep if I'm about to be rate-limited:
rate_limit_json = api.rate_limit_status()
remaining_hits = rate_limit_json["remaining_hits"]
print 'you have', remaining_hits, 'API calls remaining until next hour'

if remaining_hits < 2:
    dtcode = datetime.utcnow()
    unixtime = calendar.timegm(dtcode.utctimetuple())
    sleeptime = rate_limit_json['reset_time_in_seconds'] - unixtime + 10
    print 'waiting ', sleeptime, 'seconds'
    time.sleep(sleeptime)
else:
    pass
I have this OAuth blurb set up at the top of the script:
auth = tweepy.OAuthHandler('xxxx', 'xxxx')
auth.set_access_token('xxxx', 'xxxxx')
api = tweepy.API(auth)
The call I'm repeatedly making is:
follower_cursors = tweepy.Cursor(api.followers_ids)
So now, no matter how many calls I make, my "remaining hits" stays at 150 until Twitter returns a "Rate limit exceeded. Clients may not make more than 350 requests per hour."
It seems like the rate limit checker is reporting the unauthenticated rate limit for my IP address (150), while my calls are counting against my app's rate limit (350).
How can I adjust my rate limit checker to actually check the app's rate limit again?
Repeatedly creating a Cursor instance doesn't use up any API quota, as it doesn't actually make a request to Twitter. A Cursor is just a Tweepy wrapper around the different pagination methods. Calling the next or prev methods, or iterating over it, causes actual API calls.
Twitter no longer has a concept of unauthenticated API calls (the v1.1 API always requires authentication). I think the problem lies with your use of Cursor. Try creating just one instance and calling the next method repeatedly (make sure to catch a StopIteration exception, as next is a generator).
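Concretely, that single-Cursor approach would look something like this sketch (twitter_user_id and the output handling are placeholders):
cursor = tweepy.Cursor(api.followers_ids, user_id=twitter_user_id)
follower_ids = []
for page in cursor.pages():  # each pages() step is one real API call
    follower_ids.extend(page)
Iterating with a for loop also takes care of the StopIteration mentioned above; you only need to catch it yourself when calling next() by hand.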

Find out whether celery task exists

Is it possible to find out whether a task with a certain task id exists? When I try to get the status, I will always get pending.
>>> AsyncResult('...').status
'PENDING'
I want to know whether a given task id is a real celery task id and not a random string. I want different results depending on whether there is a valid task for a certain id.
There may have been a valid task in the past with the same id but the results may have been deleted from the backend.
Celery does not write a state when the task is sent, this is partly an optimization (see the documentation).
If you really need it, it's simple to add:
from celery import current_app
# `after_task_publish` is available in celery 3.1+
# for older versions use the deprecated `task_sent` signal
from celery.signals import after_task_publish

# when using celery versions older than 4.0, use body instead of headers
@after_task_publish.connect
def update_sent_state(sender=None, headers=None, **kwargs):
    # the task may not exist if sent using `send_task` which
    # sends tasks by name, so fall back to the default result backend
    # if that is the case.
    task = current_app.tasks.get(sender)
    backend = task.backend if task else current_app.backend

    backend.store_result(headers['id'], None, "SENT")
Then you can test for the PENDING state to detect that a task has not (seemingly)
been sent:
>>> result.state != "PENDING"
AsyncResult.state returns PENDING in case of unknown task ids.
PENDING
Task is waiting for execution or unknown. Any task id that is not known is implied to be in the pending state.
http://docs.celeryproject.org/en/latest/userguide/tasks.html#pending
You can provide custom task ids if you need to distinguish unknown ids from existing ones:
>>> from tasks import add
>>> from celery.utils import uuid
>>> r = add.apply_async(args=[1, 2], task_id="celery-task-id-"+uuid())
>>> id = r.task_id
>>> id
'celery-task-id-b774c3f9-5280-4ebe-a770-14a6977090cd'
>>> if not "blubb".startswith("celery-task-id-"): print "Unknown task id"
...
Unknown task id
>>> if not id.startswith("celery-task-id-"): print "Unknown task id"
...
Right now I'm using the following scheme:
Get the task id.
Set a memcache key like 'task_%s' % task.id with the message 'Started'.
Pass the task id to the client.
Now the client can monitor the task status (the task writes its status messages to memcache).
When the task is ready, it sets the memcache key's message to 'Ready'.
When the client sees the task is ready, it starts a special task that deletes the key from memcache and does the necessary cleanup actions.
You need to call .get() on the AsyncResult object you create to actually fetch the result from the backend.
See the Celery FAQ.
To further clarify on my answer.
Any string is technically a valid ID, there is no way to validate the task ID. The only way to find out if a task exists is to ask the backend if it knows about it and to do that you must use .get().
This introduces the problem that .get() blocks when the backend doesn't have any information about the task ID you supplied, this is by design to allow you to start a task and then wait for its completion.
In the case of the original question I'm going to assume that the OP wants to get the state of a previously completed task. To do that you can pass a very small timeout and catch timeout errors:
from celery.exceptions import TimeoutError

result = AsyncResult('blubb')
try:
    # fetch the result from the backend; your backend must be fast enough
    # to return results within 100ms (0.1 seconds)
    result.get(timeout=0.1)
except TimeoutError:
    result = None

if result:
    print "Result exists; state=%s" % (result.state,)
else:
    print "Result does not exist"
It should go without saying that this only works if your backend is storing results; if it's not, there's no way to know if a task ID is valid or not, because nothing is keeping a record of them.
Even more clarification.
What you want to do cannot be accomplished using the AMQP backend because it does not store results, it forwards them.
My suggestion would be to switch to a database backend so that the results are in a database that you can query outside of the existing celery modules. If no tasks exist in the result database you can assume the ID is invalid.
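For reference, switching to a database-backed result store is mostly configuration. A sketch, with the broker URL, app name and file path as placeholders:
from celery import Celery

# any SQLAlchemy URL with the 'db+' prefix selects the database result backend
app = Celery('proj',
             broker='amqp://guest@localhost//',
             backend='db+sqlite:///celery_results.sqlite')
With that in place, results live in ordinary tables (celery_taskmeta by default), so a task id with no row there was never recorded by this backend.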
So I have this idea:
import project.celery_tasks as tasks

def task_exist(task_id):
    found = False
    # tasks is my imported task module from celery
    # it is located under /project/project, where the settings.py file is located
    i = tasks.app.control.inspect()

    s = i.scheduled()
    for e in s:
        if task_id in s[e]:
            found = True
            break

    a = i.active()
    if not found:
        for e in a:
            if task_id in a[e]:
                found = True
                break

    r = i.reserved()
    if not found:
        for e in r:
            if task_id in r[e]:
                found = True
                break

    # if checking the status returns pending, yet we found it in any queues, it means it exists
    # if it returns pending, yet we didn't find it on any of the queues, it doesn't exist
    return found
According to https://docs.celeryproject.org/en/stable/userguide/monitoring.html, the different types of queue inspections are: active, scheduled, reserved, revoked, registered, stats, and query_task, so pick and choose as you please.
And there might be a better way to go about checking the queues for their tasks, but this should work for me, for now.
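Of the inspection types above, query_task is the one that takes task ids directly; on reasonably recent Celery versions, something like this sketch reports, per worker, the state of each id the workers know about (reusing the tasks.app object from the function above; the id is just an example):
i = tasks.app.control.inspect()
# returns a mapping of worker name -> {task_id: [state, request info]}
print(i.query_task('b774c3f9-5280-4ebe-a770-14a6977090cd'))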
Try
AsyncResult('blubb').state
That may work. It should return something different for an unknown id. Please correct me if I'm wrong.
if built_in_status_check(task_id) == 'pending':
    if registry_exists(task_id):
        print 'Pending'
    else:
        print 'Task does not exist'
