Could anyone please advise on how to achieve the scenario below?
2 queues: destination queue, response queue
a thread picks a task up from the destination queue
finds out it needs more details
submits a new task to the destination queue
waits for its request to be processed and the result to appear in the response queue
or
monitors the response queue for the response to its task, but does not actually pick up any other response, so that those stay available to the other threads waiting for other responses?
Thank you
If a thread waits for the completion of a specific task, i.e. it shouldn't pick up any completed task except the one it submitted, you can use locks to wait for the task:
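def run(self):
    # get a task, do something, put a new task
    newTask.waitFor()
    ...

class Task:
    ...
    def waitFor(self):
        # blocks until complete() or failedToComplete() releases the lock;
        # this assumes the lock was already acquired once when the task was created
        self._lock.acquire()

    def complete(self):
        self._lock.release()

    def failedToComplete(self, err):
        self._error = err
        self._lock.release()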
This helps to avoid time.sleep() calls while monitoring the response queue. Handling of task completion errors should be considered here as well.
But this is an uncommon approach. Is it some specific algorithm where the thread that puts a new task has to wait for it? Even so, you can implement that logic in the Task class rather than in the thread that processes it. And why does the thread pick a task from the destination queue and put a new task back into the same queue?
If you have n steps of processing, you can use n queues for them. A group of threads serves the first queue: each thread gets a task, processes it, and puts the result (a new task) into the next queue. The group of final response-handler threads gets a response and sends it back to the client. The tasks encapsulate the details concerning themselves, so the threads don't need to distinguish one task from another, and there is no need to wait for a particular task.
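The staged layout described above could be sketched roughly as follows. This is only an illustration under simplifying assumptions: one thread per stage instead of a group, and the names stage, run_pipeline and STOP are made up for the example.

```python
import queue
import threading

STOP = object()  # hypothetical sentinel that shuts the pipeline down


def stage(func, inbox, outbox):
    """Take tasks from inbox, process them, pass the results on to outbox."""
    while True:
        task = inbox.get()
        if task is STOP:
            outbox.put(STOP)  # tell the next stage to stop as well
            break
        outbox.put(func(task))


def run_pipeline(items, funcs):
    """Push items through one queue per processing step and collect the results."""
    queues = [queue.Queue() for _ in range(len(funcs) + 1)]
    threads = [
        threading.Thread(target=stage, args=(func, queues[i], queues[i + 1]))
        for i, func in enumerate(funcs)
    ]
    for t in threads:
        t.start()
    for item in items:
        queues[0].put(item)
    queues[0].put(STOP)

    results = []
    while True:
        result = queues[-1].get()  # the "response handler" end of the pipeline
        if result is STOP:
            break
        results.append(result)
    for t in threads:
        t.join()
    return results
```

No thread here waits for a particular task: each one only knows its input queue and its output queue. With several threads per stage, each stage would need one sentinel per thread.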
My understanding of how a ThreadPoolExecutor works is that when I call #submit, tasks are assigned to threads until all available threads are busy, at which point the executor puts the tasks in a queue awaiting a thread becoming available.
The behavior I want is to block when there is not a thread available, to wait until one becomes available and then only submit my task.
The background is that my tasks are coming from a queue, and I only want to pull messages off my queue when there are threads available to work on these messages.
In an ideal world, I'd be able to provide an option to #submit to tell it to block if a thread is not available, rather than putting them in a queue.
However, that option does not exist. So what I'm looking at is something like:
with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    while True:
        wait_for_available_thread(executor)
        message = pull_from_queue()
        executor.submit(do_work_for_message, message)
And I'm not sure of the cleanest implementation of wait_for_available_thread.
Honestly, I'm surprised this isn't actually in concurrent.futures, as I would have thought the pattern of pulling from a queue and submitting to a thread pool executor would be relatively common.
One approach might be to keep track of your currently running threads via a set of Futures:
import concurrent.futures
import time

active_threads = set()

def pop_future(future):
    # set.pop() takes no argument; discard() removes this specific future
    active_threads.discard(future)

with concurrent.futures.ThreadPoolExecutor(max_workers=CONCURRENCY) as executor:
    while True:
        while len(active_threads) >= CONCURRENCY:
            time.sleep(0.1)  # or whatever
        message = pull_from_queue()
        future = executor.submit(do_work_for_message, message)
        active_threads.add(future)
        future.add_done_callback(pop_future)
A more sophisticated approach might be to have the done_callback be the thing that triggers a queue pull, rather than polling and blocking, but then you need to fall back to polling the queue if the workers manage to get ahead of it.
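Another way to get a blocking submit without polling is to guard the executor with a semaphore that has one slot per worker. The sketch below is not part of concurrent.futures; BlockingExecutor is a hypothetical wrapper name, and it assumes that releasing the slot from the future's done callback is good enough for your use case.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class BlockingExecutor:
    """Hypothetical wrapper: submit() blocks while every worker is busy."""

    def __init__(self, max_workers):
        self._executor = ThreadPoolExecutor(max_workers=max_workers)
        self._slots = threading.BoundedSemaphore(max_workers)

    def submit(self, fn, *args, **kwargs):
        self._slots.acquire()  # wait here until a slot is free
        try:
            future = self._executor.submit(fn, *args, **kwargs)
        except BaseException:
            self._slots.release()  # nothing was submitted, give the slot back
            raise
        # Free the slot when the task finishes (successfully or not).
        future.add_done_callback(lambda _: self._slots.release())
        return future

    def shutdown(self, wait=True):
        self._executor.shutdown(wait=wait)
```

With this, the loop from the question becomes `message = pull_from_queue()` followed by `executor.submit(do_work_for_message, message)`, except that the message is pulled before the wait happens. To pull only after a slot is free, the acquire step would have to be exposed as a separate method.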
In the book "Mastering Concurrency in Python", chapter 6 "Working with Processes in Python", section "Message passing between several workers", example 7, there is an implementation of a task queue. Here is its code: https://github.com/PacktPublishing/Mastering-Concurrency-in-Python/blob/master/Chapter06/example7.py
The author states the problem with this example:
Everything seems to be working, but if we look closely at the messages our processes have printed out, we will notice that most of the tasks were executed by either Consumer-2 or Consumer-3, and that Consumer-4 executed only one task while Consumer-1 failed to execute any. What happened here?
I tried to understand the explanation the author gives for the problem, and it looks wrong to me:
Essentially, when one of our consumers—let's say Consumer-3—finished executing a task, it tried to look for another task to execute immediately after. Most of the time, it would get priority over other consumers, since it was already being run by the main program. So while Consumer-2 and Consumer-3 were constantly finishing their tasks' executions and picking up other tasks to execute, Consumer-4 was only able to "squeeze" itself in once, and Consumer-1 failed to do this altogether.
To address this issue, a technique has been developed, to stop consumers from immediately taking the next item from the task queue, called poison pill. The idea is that, after setting up the real tasks in the task queue, we also add in dummy tasks that contain "stop" values and that will have the current consumer hold and allow other consumers to get the next item in the task queue first; hence the name "poison pill."
The problem seems to be different: since the consumers are started before the queue is filled with tasks, most of them exit without processing anything, because they find the queue empty. Adding poison pills helps because it eliminates the race between the empty() and get() calls, not because it lowers the priority of consumers. Poison pills do not come into play until the queue has no more real tasks, which is why they cannot influence which consumer takes a given task.
Moreover, there seems to be a bug in this example: if another consumer takes the last task from the queue between our consumer's calls to empty() and get(), our consumer will block indefinitely on get(), which is what actually happens on my laptop.
Can anyone confirm this, please?
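To illustrate the point about the empty()/get() race, here is a minimal sketch of a consumer loop that never calls empty() and relies only on a sentinel to stop. It is not the book's code: it uses threads and queue.Queue instead of processes to keep the example short, and the names consumer, run and STOP are made up.

```python
import queue
import threading

STOP = None  # the "poison pill": one is queued per consumer


def consumer(name, tasks, results):
    while True:
        item = tasks.get()  # always block on get(); never check empty() first
        if item is STOP:
            break           # the pill only ends this consumer's loop
        results.put((name, item * item))


def run(n_tasks=10, n_consumers=4):
    tasks = queue.Queue()
    results = queue.Queue()
    workers = [
        threading.Thread(target=consumer, args=(f"Consumer-{i + 1}", tasks, results))
        for i in range(n_consumers)
    ]
    # Consumers start before the queue is filled, as in the example discussed
    # above; this is harmless here because they block instead of exiting.
    for w in workers:
        w.start()
    for n in range(n_tasks):
        tasks.put(n)
    for _ in workers:
        tasks.put(STOP)
    for w in workers:
        w.join()
    return sorted(results.get()[1] for _ in range(n_tasks))
```

Which consumer handles which task still depends on scheduling. The pills sit behind all real tasks in the queue, so they only control termination.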
I am currently building a Python app which should trigger functions at given timestamps entered by the user (not entered in chronological order).
I ran into a problem: I don't want my program to busy-wait while checking whether a new timestamp has been entered that must be added to the timer queue, but I also don't want to create a whole bunch of threads, one for every new timestamp, whose single purpose is waiting for that timestamp.
What I thought of is putting it all together in one thread and doing something like an interruptible sleep, but I can't think of another way besides this:
while timer_not_depleted:
    sleep(1)
    if something_happened:
        break
which is essentially busy-waiting.
So, any suggestions on how to implement an interruptible sleep?
Your intuition of using threads is correct. The following master-worker construction can work:
The master thread spawns a worker thread that waits for "jobs";
The two threads share a Queue - whenever a new job needs to be scheduled, the master thread puts a job specification into the queue;
Meanwhile, the worker thread does the following:
Maintain a separate list of future jobs to run and keep track of how long to keep sleeping until the next job runs;
Continue listening to new jobs by calling Queue.get(block=True, timeout=<time-to-next-job>);
In this case, if no new job is scheduled before the timeout expires, Queue.get raises the queue.Empty exception; at that point the worker thread should run the scheduled function and go back to polling. If a new job is scheduled in the meantime, Queue.get returns it, so you can update the timeout value and go back to waiting.
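A minimal sketch of such a worker might look like this. It assumes jobs arrive as (timestamp, function) pairs with timestamps comparable to time.time(); timer_worker and the STOP sentinel are names invented for the example.

```python
import heapq
import itertools
import queue
import time

STOP = object()  # hypothetical sentinel telling the worker to exit


def timer_worker(jobs):
    """Run callables at their timestamps; `jobs` carries (timestamp, func) pairs."""
    pending = []               # min-heap ordered by timestamp
    order = itertools.count()  # tie-breaker so functions are never compared
    while True:
        # Sleep until the earliest job is due, or indefinitely if none is scheduled.
        timeout = max(0.0, pending[0][0] - time.time()) if pending else None
        try:
            job = jobs.get(block=True, timeout=timeout)
        except queue.Empty:
            # Timed out with nothing new: the earliest job is due now.
            _, _, func = heapq.heappop(pending)
            func()
            continue
        if job is STOP:
            break
        run_at, func = job
        heapq.heappush(pending, (run_at, next(order), func))
```

The master thread would start it with `threading.Thread(target=timer_worker, args=(jobs,))` and schedule work with `jobs.put((timestamp, func))`. Every put wakes the worker, which recomputes the timeout for the new earliest job.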
I'd like to suggest select.
Call it with a timeout equal to the delay until the nearest event (a heap queue is a good data structure for maintaining a queue of future timestamps) and pass a socket (as an item in the rlist argument) on which your program listens for updates from the user.
The select call returns when the socket has incoming data or when the timeout has expired.
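As a rough sketch of this idea, the helper below waits on a socket until either data arrives or the next event is due. The function name wait_for_update and the 1024-byte read size are assumptions made for the example; how updates are encoded on the socket is left open.

```python
import select
import socket
import time


def wait_for_update(sock, next_event_at):
    """Block until `sock` is readable or `next_event_at` (a time.time() value) passes.

    Returns the received bytes, or None if the timeout expired first.
    """
    timeout = max(0.0, next_event_at - time.time())
    readable, _, _ = select.select([sock], [], [], timeout)
    if readable:
        return sock.recv(1024)
    return None
```

A return value of None means the nearest timestamp has been reached and its function should run. Otherwise the received bytes describe a new timestamp to push onto the heap before waiting again. For a quick experiment, `socket.socketpair()` gives two connected sockets inside one process.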
My Celery task performs time-consuming calculations on an entity stored in the database. The workflow is: get the information from the database, compile it into a serializable object, save the object. Other tasks do other calculations (like rendering images) on the loaded object.
But serialization is time-consuming, so I'd like to have one task per entity that runs for a while, holds the serialized object in memory, and processes client requests delivered through a messaging queue (Redis pub/sub). If there are no requests for a while, the task exits. After that, if a client needs some job done, it starts another task, which loads the object, processes it, and stays around for a while for other jobs. At startup, this task should check that it is the only worker on this particular entity, to avoid collisions. So what is the best strategy for checking whether another task is already running for this entity?
1) The first idea is to send a message to a channel associated with the entity and wait for a response. Bad idea: the target task can be busy with calculations, and waiting for a response with a timeout is just wasting time.
2) Storing the Celery task ID in the database is even worse: the task can be killed while the record stays, so we would still need to make sure that the target task is alive.
3) The third idea is to inspect the workers for running tasks and check their state for the entity ID (which the task would provide at startup). It also seems that collisions can still happen, e.g. if several tasks are scheduled but not running yet.
For now I think idea 1 is the best, with a modification like this: on startup the task sends a message with its startup time to the entity channel, but then immediately starts working without waiting for a response. Later it checks the message queue, and if someone has responded, the two compare timestamps and the task with the later timestamp quits. This seems rather complicated; is there a better solution?
The final solution is to start a supervisor thread in the task, which replies to 'discover' messages from competing tasks.
So the workflow is like this:
The task starts, then subscribes to the Redis pub/sub channel for the entity ID
The task sends a 'discover' message to the channel
The task waits a little
The task looks for a 'reply' among the incoming messages in the channel and exits if it finds one
The task starts the supervisor thread, which answers every incoming 'discover' message with a 'reply'
This works fine except when several tasks start simultaneously, e.g. after a worker restart. To avoid this, the subscription process needs to be made atomic, using a Redis lock:
from redis import StrictRedis

class RedisChannel:
    def __init__(self, channel_id):
        self.channel_id = channel_id
        self.redis = StrictRedis()
        self.channel = self.redis.pubsub()
        # hold a Redis lock so that only one task subscribes at a time
        with self.redis.lock(channel_id):
            self.channel.subscribe(channel_id)
I have to spawn Celery tasks that belong to some namespace (for example, a user ID).
So I spawn them with
scrapper_start.apply_async((request.user.id,), queue=account.Account_username)
app.control.add_consumer(account.Account_username, reply=True)
The tasks are also spawned recursively, from other tasks.
Now I have to check whether the tasks of a queue are still executing. I tried checking the list length in Redis, but it returns the true number only before Celery starts executing.
How can I solve this problem? I only need to check whether the queue or consumer is still executing or is already empty. Thanks
If you just want to inspect the queue, you can do this from a Python shell.
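# newer Celery versions expose inspection through the app instance;
# older ones used: from celery.task.control import inspect
i = app.control.inspect()  # `app` is the Celery app from the question
i.active()  # get a list of active tasks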
In addition to checking which tasks are currently executing, you can also do the following.
i.registered()  # get a list of registered tasks
i.scheduled()  # get a list of tasks waiting (scheduled with an ETA or countdown)
i.reserved()  # tasks that have been received but are still waiting to be executed
This kind of manual inspection is good if you want to check once in a while.
If you want to monitor the tasks continuously, you can use Flower, which provides a web interface for monitoring workers.