Celery task PENDING - python

I basically want to be able to create a chord consisting of a group of chains. Beyond the fact that I can't seem to make that work, all of the sub-chains have to complete before the chord callback is fired.
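For reference, the kind of construct I mean looks roughly like this (a minimal sketch; the app settings and the fetch/process/summarize tasks are placeholders, not my real code):
from celery import Celery, chain, chord, group

app = Celery('sketch', broker='amqp://', backend='cache+memcached://127.0.0.1:11211/')

@app.task
def fetch(i):
    return i

@app.task
def process(x):
    return x * 2

@app.task
def summarize(results):
    return sum(results)

# a chord whose header is a group of chains: the callback only fires
# once every chain in the group has finished
header = group(chain(fetch.s(i), process.s()) for i in range(10))
result = chord(header)(summarize.s())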
So my thought was to create a while loop like:
data = [foo.delay(i) for i in bar]
complete = {}
L = len(data)
cnt = 0
while cnt != L:
    for i in data:
        ID = i.task_id
        try:
            complete[ID]
        except KeyError:
            if i.status == 'SUCCESS':
                complete[ID] = run_hourly.delay(i.result)
                cnt += 1
    if cnt >= L:
        return complete.values()
So that when a task was ready it could be acted on without having to wait on other tasks to be complete.
The problem I'm having is that the status of some tasks never gets past the 'PENDING' state.
All tasks will reach the 'SUCCESS' state if I add a time.sleep(x) line to the for loop, but with a large number of subtasks in data that solution becomes grossly inefficient.
I'm using memcached as my results backend and rabbitmq as the broker. My guess is that the speed of the for loop that iterates over data and calls attributes of its tasks creates a race condition that breaks the connection to celery's messaging, which leaves these zombie tasks stuck in the 'PENDING' state. But then again I could be completely wrong, and it certainly wouldn't be the first time.
My questions
Why is time.sleep(foo) needed to avoid a perpetually PENDING task when iterating over a list of just launched tasks?
When a celery task is performing a loop, is it blocking? When I try to shut down a worker that is stuck in an infinite loop I am unable to do so, and have to manually find the Python process running the worker and kill it. If I leave the worker running, the Python process eventually starts consuming several gigabytes of memory, growing seemingly without bound.
Any insight on this matter would be appreciated. I'm also open to suggestions on ways to avoid the while loop entirely.
I appreciate your time. Thank you.

My chord consisting of a group of chains was being constructed and executed from within a celery task, which will create problems if you need to access the results of those tasks. Below is a summary of what I was trying to do, what I ended up doing, and what I think I learned in the process, so maybe it can help someone else.
--common_tasks.py--
from cel_test.celery import app
@app.task(ignore_result=False)
def qry(sql, writing=False):
    import psycopg2
    conn = psycopg2.connect(dbname='space_test', user='foo', host='localhost')
    cur = conn.cursor()
    cur.execute(sql)
    if writing == True:
        cur.close()
        conn.commit()
        conn.close()
        return
    a = cur.fetchall()
    cur.close()
    conn.close()
    return a
--foo_tasks.py--
from __future__ import absolute_import
from celery import chain, chord, group
import celery
from cel_test.celery import app
from weather.parse_weather import parse_weather
from cel_test.common_tasks import qry
import cPickle as PKL
import csv
import requests
@app.task(ignore_result=False)
def write_the_csv(data, file_name, sql):
    with open(file_name, 'wb') as fp:
        a = csv.writer(fp, delimiter=',')
        for i in data:
            a.writerows(i)
    qry.delay(sql, True)
    return True

@app.task(ignore_result=False)
def idx_point_qry(DIR='/bar/foo/'):
    with open(''.join([DIR, 'thing.pkl']), 'rb') as f:
        a = PKL.load(f)
    return [(i, a[i]) for i in a]

@app.task(ignore_result=False)
def load_page(url):
    page = requests.get(url, **{'timeout': 30})
    if page.status_code == 200:
        return page.json()

@app.task(ignore_result=False)
def run_hourly(page, info):
    return parse_weather(page, info).hourly()

@app.task(ignore_result=False)
def pool_loop(info):
    data = []
    for iz in info:
        a = chain(load_page.s(url, iz), run_hourly.s())()
        data.append(a)
    return data

@app.task(ignore_result=False)
def run_update(file_name, sql, writing=True):
    chain(idx_point_qry.s(), pool_loop.s(file_name, sql), write_the_csv.s(writing, sql))()
    return True
--- separate.py file ---
from foo_tasks import *
def update_hourly_weather():
    fn = '/weather/csv_data/hourly_weather.csv'
    run_update.delay(fn, "SELECT * from hourly_weather();")
    return True

update_hourly_weather()
I tried 30 or so combinations of .py files listed above along with several combinations of the code existing in them. I tried chords, groups, chains, launching tasks from different tasks, combining tasks.
A few combinations did end up working, but I had to call .get() directly on data inside the write_the_csv task, and celery was throwing a warning that in 4.0 calling get() within a task would raise an error, so I thought I ought not to be doing that.
In essence my problem was (and still is) poor design of the tasks and the flow between them, which led me to the problem of having to synchronize tasks from within other tasks.
The while loop I proposed in my question was an attempt to asynchronously launch tasks when another task's status had become COMPLETE, then launch another task when that one had completed, and so on, as opposed to synchronously letting celery do so through a call to chord or chain. What I seemed to find (and I'm not sure I'm right about this) was that from within a celery task you don't have access to the scope needed to discern such things. The following is a quote from the state section of the celery documentation on tasks: “asserting the world is the responsibility of the task”.
I was launching tasks that the calling task had no idea existed, and as such it was unable to know anything about them.
My solution was then to load and iterate through the .pkl file and launch a group of chains with a callback synchronously. I basically replaced the pool_loop task with the code below and launched the tasks synchronously instead of asynchronously.
--the_end.py--
from foo_tasks import *
from common_tasks import *
def update_hourly_weather():
    fn = '/weather/csv_data/hourly_weather.csv'
    dd = []
    idx = idx_point_qry()
    for i in idx:
        dd.append(chain(load_page.s(i), run_hourly.s(i)))
    vv = chord(dd)(write_the_csv.s(fn, "SELECT * from hourly_weather();"))
    return

update_hourly_weather()


Multithreading / Multiprocessing with a for-loop in Python3

I have this task which is sort of I/O bound and CPU bound at the same time.
Basically I am getting a list of queries from a user, googling each of them (via the custom-search API), storing each query's results in a .txt file, and storing all results in a results.txt file.
I was thinking that maybe parallelism might be an advantage here.
My whole task is wrapped in an object which has 2 member fields which I am supposed to use across all threads/processes (a list and a dictionary).
Therefore, when I use multiprocessing I get weird results (I assume that it is because of my shared resources).
i.e:
class MyObject(object):
    _my_list = []
    _my_dict = {}
_my_dict contains key:value pairs of "query_name": list().
_my_list is a list of queries to search on google. It is safe to assume that it is not written to.
For each query: I search it on google, grab the top results and store them in _my_dict.
I want to do this in parallel. I thought that threading might be good, but it seems to slow the work down..
Here is how I attempted to do it (this is the method which does the entire job per query):
def _do_job(self, query):
    """ search the query on google (via http)
    save results on a .txt file locally. """
And this is the method which is supposed to execute all jobs for all queries in parallel:
def find_articles(self):
    p = Pool(processes=len(self._my_list))
    p.map_async(self._do_job, self._my_list)
    p.close()
    p.join()
    self._create_final_log()
The above execution does not work, I get corrupted results...
When I use multithreading however, the results are fine, but very slow:
def find_articles(self):
    thread_pool = []
    for vendor in self._vendors_list:
        self._search_validate_cache(vendor)
        thread = threading.Thread(target=self._search_validate_cache, args=(vendor,))
        thread_pool.append(thread)
        thread.start()
    for thread in thread_pool:
        thread.join()
    self._create_final_log()
Any help would be appreciated, thanks!
I have encountered this while doing similar projects in the past (multiprocessing doesn't work efficiently, single-threaded is too slow, and starting a thread per query gets bottlenecked). I found that an efficient way to complete a task like this is to create a thread pool with a limited number of threads. Logically, the fastest way to complete the task is to use as many network resources as possible without a bottleneck, which is why the number of threads actively making requests at any one time is capped.
In your case, cycling through a list of queries with a thread pool and a callback function would be a quick and easy way to go through all the data. Obviously, there are a lot of factors that affect this, such as network speed and finding the correct thread pool size to avoid a bottleneck, but overall I've found this to work well.
import threading

class MultiThread:
    def __init__(self, func, list_data, thread_cap=10):
        """
        Parameters
        ----------
        func : function
            Callback function to multi-thread
        thread_cap : int
            Maximum number of threads available in the pool
        list_data : list
            List of data items to process
        """
        self.func = func
        self.thread_cap = thread_cap
        self.thread_pool = []
        self.current_index = -1
        self.total_index = len(list_data) - 1
        self.complete = False
        self.list_data = list_data

    def start(self):
        for _ in range(self.thread_cap):
            thread = threading.Thread(target=self._wrapper)
            self.thread_pool += [thread]
            thread.start()

    def _wrapper(self):
        while not self.complete:
            if self.current_index < self.total_index:
                self.current_index += 1
                self.func(self.list_data[self.current_index])
            else:
                self.complete = True

    def wait_on_completion(self):
        for thread in self.thread_pool:
            thread.join()
import requests  # , time

_my_dict = {}
base_url = "https://www.google.com/search?q="
s = requests.sessions.session()

def example_callback_func(query):
    global _my_dict
    # code to grab data here
    r = s.get(base_url + query)
    _my_dict[query] = r.text  # whatever parsed results
    print(r, query)

# start_time = time.time()
_my_list = ["examplequery" + str(n) for n in range(100)]
mt = MultiThread(example_callback_func, _my_list, thread_cap=30)
mt.start()
mt.wait_on_completion()
# output queries to file
# print("Time:{:2f}".format(time.time() - start_time))
You could also open the file and write out whatever you need as you go, or output the data at the end. Obviously, my replica here isn't exactly what you need, but it's a solid boilerplate with a lightweight class I made that will greatly reduce the time it takes. It uses a thread pool to call a callback function that takes a single parameter (the query).
In my test here, it completed cycling 100 queries in ~2 seconds. I could definitely play with the thread cap and get the timings lower before I find the bottleneck.
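For comparison, the standard library's concurrent.futures.ThreadPoolExecutor gives you the same capped thread pool without a hand-rolled class; here's a minimal sketch of the equivalent, reusing the illustrative query list and URL from the example above:
from concurrent.futures import ThreadPoolExecutor, as_completed
import requests

base_url = "https://www.google.com/search?q="
queries = ["examplequery" + str(n) for n in range(100)]
results = {}

def fetch(query):
    # one HTTP request per query; the pool caps how many run at once
    return requests.get(base_url + query, timeout=30).text

with ThreadPoolExecutor(max_workers=30) as pool:
    futures = {pool.submit(fetch, q): q for q in queries}
    for future in as_completed(futures):
        query = futures[future]
        results[query] = future.result()  # re-raises if the request failed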

Is there any way to understand when all tasks are finished?

Let's say I add 100 push tasks (as group 1) to my tasks-queue. Then I add another 200 tasks (as group 2) to the same queue. How can I understand if all tasks of group 1 are finished?
Looks like QueueStatistics will not help here. tag works only with pull queues.
And I can not have separate queues (since I may have hundreds of groups).
I would probably solve it by using a sharded counter in the datastore, like @mgilson said, and decorate my deferred functions to run a callback when the tasks are done running.
I think something like this is what you are looking for, if you include the code at https://cloud.google.com/appengine/articles/sharding_counters?hl=en and write a decrement function to complement the increment one.
import logging
import random
import time

from google.appengine.ext import deferred

def done_work():
    logging.info('work done!')

def worker(callback=None):
    def fst(f):
        def snd(*args, **kwargs):
            key = kwargs['shard_key']
            del kwargs['shard_key']
            retval = f(*args, **kwargs)
            decrement(key)
            if get_count(key) == 0:
                callback()
            return retval
        return snd
    return fst

def func(n):
    # do some work
    time.sleep(random.randint(1, 10) / 10.0)
    logging.info('task #{:d}'.format(n))

def make_some_tasks():
    func_with_callback = worker(callback=done_work)(func)
    key = random.randint(0, 1000)
    for n in xrange(0, 100):
        increment(key)
        deferred.defer(func_with_callback, n, shard_key=key)
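The linked article only ships increment() and get_count(), so here is a rough sketch of what a complementary decrement() could look like; the shard model below is a simplified stand-in for the article's code, not its exact model names:
import random

from google.appengine.ext import ndb

NUM_SHARDS = 20
SHARD_KEY_TEMPLATE = 'shard-{}-{:d}'  # illustrative, mirrors the article's scheme

class CounterShard(ndb.Model):
    """One shard of a sharded counter (simplified stand-in)."""
    count = ndb.IntegerProperty(default=0)

@ndb.transactional
def decrement(name):
    """Pick one shard at random and decrement it by one.

    Individual shards may go negative; only the sum across shards matters.
    """
    index = random.randint(0, NUM_SHARDS - 1)
    shard_key = SHARD_KEY_TEMPLATE.format(name, index)
    shard = CounterShard.get_by_id(shard_key)
    if shard is None:
        shard = CounterShard(id=shard_key)
    shard.count -= 1
    shard.put()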
Tasks are not guaranteed to run only once, occasionally even successfully executed tasks may be repeated. Here's such an example: GAE deferred task retried due to "instance unavailable" despite having already succeeded.
Because of this using a counter incremented at task enqueueing and decremented at task completion wouldn't work - it would be decremented twice in such a duplicate execution case, throwing the whole computation off.
The only reliable way of keeping track of task completion (that I can think of) is to independently track each individual enqueued task. You can do that using the task names (either specified or auto-assigned after successful enqueueing) - they are unique for a given queue. Task names to be tracked can be kept in task lists persisted in the datastore, for example.
Note: this is just the theoretical answer I got to when I asked myself the same question, I didn't get to actually test it.
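A rough sketch of what that per-task tracking could look like with an ndb entity (the model and helper names here are made up for illustration):
from google.appengine.ext import ndb

class TaskGroup(ndb.Model):
    """Tracks the names of tasks that have not reported completion yet."""
    pending_task_names = ndb.StringProperty(repeated=True)

@ndb.transactional
def mark_task_done(group_id, task_name):
    """Remove a finished task from its group; return True once the group is done.

    Safe against duplicate task executions: removing a name twice is a no-op.
    """
    group = TaskGroup.get_by_id(group_id)  # assumes the group was created at enqueue time
    if task_name in group.pending_task_names:
        group.pending_task_names.remove(task_name)
        group.put()
    return not group.pending_task_names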

Process Celery Tasks results in arrival order

I am new to celery. I have a "celery server", like the code below, which returns results after some time depending on the calculation. I have emulated this behaviour in the simple program below with the sleep function. What I want is to process the early-returning results before the "heavy" results. I have written a simple program, see the snippet below, which intentionally creates the "heavy load" task as the first call.
Note that the subsequent calls create "lighter" tasks, so the celery server returns them earlier. Therefore I want to process the returned results in the order they arrive at the client. Right now (see the client code) it waits until the heavy task has returned.
But with the examples from the celery docs, I am supposed to wait for results by checking the id, or poll for them (which is dumb, because the celery client somehow has to check the id of the "first" arrived result anyway, I guess).
How can I process the results of celery in the order they arrive at the client? I don't want to poll in an endless loop on "result.ready()" as this IMHO defeats the point of async processing.
Found no solution in the docs. What I want to do is "get the first arrived result and its id", compare this to my "result.id" (did I send the task?) and then process accordingly.
#
# Name this code "tasks.py" and run it with:
# celery worker -A tasks --loglevel=info
#
from celery import Celery
import time

app = Celery('tasks', backend='amqp', broker='amqp://guest:guest@127.0.0.1:5672/%2F')

@app.task()
def add(x, y):
    print("x=%s y=%s" % (x, y))
    time.sleep(x)
    return x + y
The second program, the client: this works like the celery docs describe, however celery has already completed tasks 0, 1, 2 (and the client should therefore already be working on them).
#!/usr/bin/python3
from tasks import add

results = []
max = 4
for i in range(0, max):
    print(max - (i + 1))
    result = add.delay(max - (i + 1), 0)
    results.append(result)

print("")

for i in range(0, max):
    result = results[i].get(timeout=10)
    print(result)
Result: (the last 4 numbers should appear in arrival order which would be 0,1,2,3 )
3
2
1
0
3
2
1
0
You should implement a callback rather than looping through the results in the order that they were sent to the queue:
http://celery.readthedocs.org/en/latest/userguide/calling.html#linking-callbacks-errbacks
In tasks.py:
@app.task()
def process_add(result):
    print(result)
In client.py:
from tasks import add, process_add

results = []
max = 4
for i in range(0, max):
    print(max - (i + 1))
    add.apply_async((max - (i + 1), 0), link=process_add.s())

Show a progress bar for my multithreaded process

I have a simple Flask web app that makes many HTTP requests to an external service when a user pushes a button. On the client side I have an angularjs app.
The server side of the code looks like this (using multiprocessing.dummy):
worker = MyWorkerClass()
pool = Pool(processes=10)
result_objs = [pool.apply_async(worker.do_work, (q,))
for q in queries]
pool.close() # Close pool
pool.join() # Wait for all task to finish
errors = not all(obj.successful() for obj in result_objs)
# extract result only from successful task
items = [obj.get() for obj in result_objs if obj.successful()]
As you can see I'm using apply_async because I want to later inspect each task and extract from them the result only if the task didn't raise any exception.
I understood that in order to show a progress bar on client side, I need to publish somewhere the number of completed tasks so I made a simple view like this:
@app.route('/api/v1.0/progress', methods=['GET'])
def view_progress():
    return jsonify(dict(progress=session['progress']))
That will show the content of a session variable. Now, during the process, I need to update that variable with the number of completed tasks (the total number of tasks to complete is fixed and known).
Any ideas about how to do that? Am I working in the right direction?
I have seen similar questions on SO like this one but I'm not able to adapt the answer to my case.
Thank you.
For interprocess communication you can use a multiprocessing.Queue, and your workers can put_nowait tuples with progress information on it while doing their work. Your main process can update whatever your view_progress is reading until all results are ready.
A bit like in this example usage of a Queue, with a few adjustments:
In the writers (workers) I'd use put_nowait instead of put, because working is more important than waiting to report that you are working (but perhaps you judge otherwise and decide that informing the user is part of the task and should never be skipped).
The example just puts strings on the queue; I'd use collections.namedtuple for more structured messages. On tasks with many steps, this lets you raise the resolution of your progress report and report more to the user.
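A minimal sketch of that pattern (the Progress fields and the worker body are just placeholders):
import multiprocessing
from collections import namedtuple

# one structured progress message per completed step
Progress = namedtuple('Progress', ['task_id', 'step', 'total_steps'])

def worker(task_id, progress_queue):
    total = 3
    for step in range(1, total + 1):
        # ... one chunk of the real work would go here ...
        # put_nowait: never let progress reporting block the actual work
        progress_queue.put_nowait(Progress(task_id, step, total))

if __name__ == '__main__':
    q = multiprocessing.Queue()
    workers = [multiprocessing.Process(target=worker, args=(i, q)) for i in range(4)]
    for p in workers:
        p.start()
    received = 0
    while received < 4 * 3:      # the main process drains the queue and could
        msg = q.get()            # update whatever view_progress reads
        received += 1
        print(msg)
    for p in workers:
        p.join()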
In general the approach you are taking is okay, I do it in a similar way.
To calculate the progress you can use an auxiliary function that counts the completed tasks:
def get_progress(result_objs):
    done = 0
    errors = 0
    for r in result_objs:
        if r.ready():
            done += 1
            if not r.successful():
                errors += 1
    return (done, errors)
Note that as a bonus this function returns how many of the "done" tasks ended in errors.
The big problem is for the /api/v1.0/progress route to find the array of AsyncResult objects.
Unfortunately AsyncResult objects cannot be serialized to a session, so that option is out. If your application supports a single set of async tasks at a time then you can just store this array as a global variable. If you need to support multiple clients, each with a different set of async tasks, then you will need to figure out a strategy to keep client session data on the server.
I implemented the single client solution as a quick test. My view functions are as follows:
results = None

@app.route('/')
def index():
    global results
    results = [pool.apply_async(do_work) for n in range(20)]
    return render_template('index.html')

@app.route('/api/v1.0/progress')
def progress():
    global results
    total = len(results)
    done, errored = get_progress(results)
    return jsonify({'total': total, 'done': done, 'errored': errored})
I hope this helps!
I think you should be able to update the number of completed tasks using multiprocessing.Value and multiprocessing.Lock.
In your main code, use:
processes=multiprocessing.Value('i', 10)
lock=multiprocessing.Lock()
And then, when you call worker.dowork, pass a lock object and the value to it:
worker.dowork(lock, processes)
In your worker.dowork code, decrease "processes" by one when the code is finished:
lock.acquire()
processes.value-=1
lock.release()
Now, "processes.value" should be accessible from your main code, and be equal to the number of remaining processes. Make sure you acquire the lock before acessing processes.value, and release the lock afterwards

Assistance with Python multithreading

Currently, I have a list of URLs to grab contents from, and I am doing it serially. I would like to change it to grabbing them in parallel. This is pseudocode. I would like to ask: is the design sound? I understand that .start() starts the thread, however, my database is not updated. Do I need to use q.get()? Thanks.
import threading
import Queue
q = Queue.Queue()
def do_database(url):
    """ grab url then input to database """
    webdata = grab_url(url)
    try:
        insert_data_into_database(webdata)
    except:
        ....
    else:
        <do I need to do anything with the queue after each db operation is done?>

def put_queue(q, url):
    q.put(do_database(url))

for myfiles in currentdir:
    url = myfiles + some_other_string
    t = threading.Thread(target=put_queue, args=(q, url))
    t.daemon = True
    t.start()
It's odd that you're putting stuff into q but never taking anything out of q. What is the purpose of q? In addition, since do_database() doesn't return anything, sure looks like the only thing q.put(do_database(url)) does is put None into q.
The usual way these things work, a description of work to do is added to a queue, and then a fixed number of threads take turns pulling things off the queue. You probably don't want to create an unbounded number of threads ;-)
Here's a pretty complete - but untested - sketch:
import threading
import Queue
NUM_THREADS = 5 # whatever
q = Queue.Queue()
END_OF_DATA = object() # a unique object
class Worker(threading.Thread):
    def run(self):
        while True:
            url = q.get()
            if url is END_OF_DATA:
                break
            webdata = grab_url(url)
            try:
                # Does your database support concurrent updates
                # from multiple threads? If not, need to put
                # this in a "with some_global_mutex:" block.
                insert_data_into_database(webdata)
            except:
                # ....
                pass

threads = [Worker() for _ in range(NUM_THREADS)]
for t in threads:
    t.start()

for myfiles in currentdir:
    url = myfiles + some_other_string
    q.put(url)

# Give each thread an END_OF_DATA marker.
for _ in range(NUM_THREADS):
    q.put(END_OF_DATA)

# Shut down cleanly. `daemon` is way overused.
for t in threads:
    t.join()
You should do this with asynchronous programming rather than threads. Threading in Python is problematic (see: Global Interpreter Lock), and anyway you're not trying to achieve multicore performance here. You just need a way to multiplex potentially long-running I/O. For that you can use a single thread and an event-driven library such as Twisted.
Twisted comes with HTTP functionality, so you can issue many concurrent requests and react (by populating your database) when results come in. Be aware that this model of programming may take a little getting used to, but it will give you good performance if the number of requests you're making is not astronomical (i.e. if you can get it all done on one machine, which it seems is your intention).
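A minimal sketch of that event-driven pattern with Twisted's Agent (the URLs and the database call are placeholders):
from twisted.internet import reactor, defer
from twisted.web.client import Agent, readBody

agent = Agent(reactor)

def fetch(url):
    d = agent.request(b'GET', url)
    d.addCallback(readBody)          # fires with the response body (bytes)
    return d

def store(body, url):
    # insert_data_into_database(body) would go here
    print(url, len(body))

def main(urls):
    work = []
    for url in urls:
        d = fetch(url)
        d.addCallback(store, url)    # populate the database as results come in
        work.append(d)
    done = defer.DeferredList(work, consumeErrors=True)
    done.addCallback(lambda _: reactor.stop())

urls = [b'http://example.com/', b'http://example.org/']
reactor.callWhenRunning(main, urls)
reactor.run()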
For the DB, you have to commit before your changes become effective, but committing on every insert is not optimal; committing after bulk changes gives much better performance.
As for parallelism, Python isn't born for this. For your use case, I suppose using Python with gevent would be a painless solution.
Here is a much more efficient pseudo implementation FYI:
import gevent
from gevent.monkey import patch_all
patch_all() # to use with urllib, etc
from gevent.queue import Queue
def web_worker(q, url):
grab_something
q.push(result)
def db_worker(q):
buf = []
while True:
buf.append(q.get())
if len(buf) > 20:
insert_stuff_in_buf_to_db
db_commit
buf = []
def run(urls):
q = Queue()
gevent.spawn(db_worker, q)
for url in urls:
gevent.spawn(web_worker, q, url)
run(urls)
Plus, since this implementation is totally single-threaded, you can safely manipulate shared data between workers, like the queue, the db connection, global variables, etc.
