How to create redis workers dynamically without blocking the main thread? - python

I want to have a queue/worker management tool that allows adding new queues and registering jobs to those queues, with workers spawned to handle those jobs.
I have this code so far:
from redis import Redis
from rq import Queue, Retry, Worker


class WorkerPool:  # TODO: find a better name
    def __init__(self):
        self._queues = {}
        self._workers = []
        self._redis_conn = Redis()

    def _get_queue(self, name):
        try:
            return self._queues[name]
        except KeyError:
            new_queue = Queue(name, connection=self._redis_conn)
            self._queues[name] = new_queue
            new_worker = Worker([new_queue], connection=self._redis_conn, name=name)
            new_worker.work()  # Blocking :(
            return new_queue

    def add_job(self, queue, func, *func_args):
        q = self._get_queue(queue)
        job = q.enqueue(func, *func_args, retry=Retry(max=3))
        return job
As can be seen, the work() function blocks execution, while I want it to run in the background. I guess I could just create another thread here and call work() from that thread while the main thread returns the job; however, this seems a bit awkward to me. Is there a built-in Redis (or other known module) solution for this?
PS, better names for my class are welcome :)
This is my take on multiprocessing it (threading won't work because the RQ worker installs signal handlers, which Python only allows from the main thread):
import multiprocessing as mp

from redis import Redis
from rq import Queue, Retry, Worker


class WorkerPool:  # TODO: find a better name
    def __init__(self):
        self._queues = {}
        self._worker_procs = []
        self._redis_conn = Redis()

    def __del__(self):
        for proc in self._worker_procs:
            proc.kill()

    def _get_queue(self, name):
        try:
            return self._queues[name]
        except KeyError:
            new_queue = Queue(name, connection=self._redis_conn)
            self._queues[name] = new_queue
            new_worker = Worker([new_queue], connection=self._redis_conn, name=name)
            worker_process = mp.Process(target=new_worker.work)
            worker_process.start()
            self._worker_procs.append(worker_process)
            return new_queue

    def add_job(self, queue, func, *func_args):
        q = self._get_queue(queue)
        job = q.enqueue(func, *func_args, retry=Retry(max=3))
        return job
Not sure how good this is, but it seems to do what I want for now
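One refinement I'm considering (just a sketch, not yet tested against RQ): an explicit close() method on the class above that terminates and joins the worker processes, rather than relying on __del__, which isn't guaranteed to run:
    # Sketch only: explicit shutdown for the WorkerPool above, instead of __del__.
    # Assumes the same self._worker_procs list; the method name is my own.
    def close(self):
        for proc in self._worker_procs:
            proc.terminate()      # ask each worker process to stop
        for proc in self._worker_procs:
            proc.join(timeout=5)  # give it a few seconds to exit
        self._worker_procs.clear()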

If you only need small-scale multiprocessing, tied to one main process, all running on one machine, take a look at the multiprocessing module and the concurrent.futures module and their Pool and ProcessPoolExecutor objects. Unless you have specific requirements, it's probably better to use Pool or ProcessPoolExecutor rather than starting up Process objects manually. (In that case Redis may or may not be overkill.)
If your needs are larger-scale, with workers across multiple machines, there's a whole category of software for running these; RabbitMQ is a widely known example, but it's just one of several, each with its own strengths and weaknesses. Each of the cloud providers (if you're in the cloud) also has its own offering for this functionality. You probably want to read up on the features of several of the off-the-shelf solutions, decide which one is a good match, then set that up.
That said, I have in the past implemented a custom Redis-based queueing system; sometimes you really do need something not provided by any of the existing solutions. In that situation, the design will be heavily influenced by what features you do need. In my case, it was fine-grained priorities...
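Going back to the small-scale option, here's a minimal sketch of what that could look like with concurrent.futures; the job function and its arguments are invented for illustration.
from concurrent.futures import ProcessPoolExecutor

def handle_job(item):
    # Stand-in for real work; replace with your own job function.
    return item * 2

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=4) as executor:
        futures = [executor.submit(handle_job, i) for i in range(10)]
        print([f.result() for f in futures])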

Related

python design pattern queue with workers

I'm currently working on a project that involves three components:
an observer that checks for changes in a directory, a worker, and a command-line interface.
What I want to achieve is:
The observer, when a change happens, sends a string to the worker (adds a job to the worker's queue).
The worker has a queue of jobs and continuously works on its queue.
Now I want to be able to run a Python script to check the status of the worker (number of active jobs, errors, and so on).
I don't know how to achieve this in Python, in terms of which components to use and how to link the three of them.
I thought of a singleton worker where the observer adds a job to a queue, but 1) I was not able to write working code and 2) how can I fit the checker in?
Another solution I thought of is multiple child processes spawned from a parent that holds the queue, but I'm a bit lost...
Thanks for any advice.
I'd use some kind of observer or publish-subscribe pattern. For the former you can use, for example, the Python version of ReactiveX. But for a more basic example, let's stay with the Python standard library. Parts of your program can subscribe to the worker and receive updates from the process, for example via queues.
import itertools as it
from queue import Queue
from threading import Thread
import time


class Observable(Thread):
    def __init__(self):
        super().__init__()
        self._observers = []

    def notify(self, msg):
        for obs in self._observers:
            obs.put(msg)

    def subscribe(self, obs):
        self._observers.append(obs)


class Observer(Thread):
    def __init__(self):
        super().__init__()
        self.updates = Queue()


class Watcher(Observable):
    def run(self):
        for i in it.count():
            self.notify(i)
            time.sleep(1)


class Worker(Observable, Observer):
    def run(self):
        while True:
            task = self.updates.get()
            self.notify((str(task), 'start'))
            time.sleep(1)
            self.notify((str(task), 'stop'))


class Supervisor(Observer):
    def __init__(self):
        super().__init__()
        self._statuses = {}

    def run(self):
        while True:
            status = self.updates.get()
            print(status)
            self._statuses[status[0]] = status[1]
            # Do something based on status updates.
            if status[1] == 'stop':
                del self._statuses[status[0]]


watcher = Watcher()
worker = Worker()
supervisor = Supervisor()

watcher.subscribe(worker.updates)
worker.subscribe(supervisor.updates)

supervisor.start()
worker.start()
watcher.start()
However, many variations are possible, and you can check which of the various patterns suits your case best.

How to use "with" with a list of objects

Suppose I have a class that spawns a thread and implements .__enter__ and .__exit__ so I can use it like this:
with SomeAsyncTask(params) as t:
    # do stuff with `t`
    t.thread.start()
    t.thread.join()
.__exit__ might perform certain actions for clean-up purposes (e.g. removing temp files).
That all works fine until I have a list of SomeAsyncTasks that I would like to start at once.
list_of_async_task_params = [params1, params2, ...]
How should I use with on the list? I'm hoping for something like this:
with [SomeAsyncTask(params) for params in list_of_async_task_params] as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
I think contextlib.ExitStack is exactly what you're looking for. It's a way of combining an indeterminate number of context managers into a single one safely (so that an exception while entering one context manager won't cause it to skip exiting the ones it's already entered successfully).
The example from the docs is pretty instructive:
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    # All opened files will automatically be closed at the end of
    # the with statement, even if attempts to open files later
    # in the list raise an exception
This can be adapted to your "hoped for" code pretty easily:
import contextlib

with contextlib.ExitStack() as stack:
    tasks = [stack.enter_context(SomeAsyncTask(params))
             for params in list_of_async_task_params]
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
Note: Somehow I missed the fact that your Thread subclass was also a context manager itself, so the code below doesn't make that assumption. Nevertheless, it might be helpful when using more "generic" kinds of threads (where using something like contextlib.ExitStack wouldn't be an option).
Your question is a little light on details, so I made some up, but this might be close to what you want. It defines an AsyncTaskListContextManager class that has the necessary __enter__() and __exit__() methods required to support the context manager protocol (and associated with statements).
import threading
from time import sleep


class SomeAsyncTask(threading.Thread):
    def __init__(self, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name
        self.status_lock = threading.Lock()
        self.running = False

    def run(self):
        with self.status_lock:
            self.running = True
        while True:
            with self.status_lock:
                if not self.running:
                    break
            print('task {!r} running'.format(self.name))
            sleep(.1)
        print('task {!r} stopped'.format(self.name))

    def stop(self):
        with self.status_lock:
            self.running = False


class AsyncTaskListContextManager:
    def __init__(self, params_list):
        self.threads = [SomeAsyncTask(params) for params in params_list]

    def __enter__(self):
        for thread in self.threads:
            thread.start()
        return self

    def __exit__(self, *args):
        for thread in self.threads:
            if thread.is_alive():
                thread.stop()
                thread.join()  # wait for it to terminate
        return None  # allows exceptions to be processed normally


params = ['Fee', 'Fie', 'Foe']

with AsyncTaskListContextManager(params) as task_list:
    for _ in range(5):
        sleep(1)
    print('leaving task list context')

print('end-of-script')
Output:
task 'Fee' running
task 'Fie' running
task 'Foe' running
task 'Foe' running
task 'Fee' running
task 'Fie' running
... etc
task 'Fie' running
task 'Fee' running
task 'Foe' running
leaving task list context
task 'Foe' stopped
task 'Fie' stopped
task 'Fee' stopped
end-of-script
@martineau's answer should work. Here's a more generic method that should work for other cases. Note that exceptions are not handled in __exit__(): if one __exit__() call fails, the rest won't be called. A generic solution would probably collect them into an aggregate exception and let you handle it. Another corner case is when your second manager's __enter__() method throws an exception; the first manager's __exit__() will not be called in that case.
class list_context_manager:
    def __init__(self, managers):
        self.managers = managers

    def __enter__(self):
        for m in self.managers:
            m.__enter__()
        return self.managers

    def __exit__(self, *exc_info):
        for m in self.managers:
            m.__exit__(*exc_info)
It can then be used like in your question:
with list_context_manager([SomeAsyncTask(params) for params in list_of_async_task_params]) as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()

Self-joining thread pool: where's my race condition?

Since I use a similar pattern in my work a lot, I decided to write a class that abstracts very simple worker concurrency via job queue / threading. I know there are already things out there that solve this, but I also wanted to use this as an opportunity to hone my multithreading skills.
The main challenge I've given myself is that I want this to be able to let processes finish, even if they are not explicitly blocked by Queue.join(). "A process finishing" is defined by the input function returning a value (or None). The way I have attempted to accomplish this is by having each job create its own results queue rq, which is then checked by _wait_for_results in a non-daemon thread, which blocks the automatic exit of all other daemonized threads until rq is filled by the worker in add_to_queue.
Here is the full class:
from queue import Queue
from threading import Thread


class EasyPool(object):
    def __init__(self, concurrency, always_finish=True):
        def add_to_queue(q):
            while True:
                func_data, rq = q.get()
                func, args, kwargs = func_data
                if not args:
                    args = []
                if not kwargs:
                    kwargs = {}
                result = func(*args, **kwargs)
                rq.put(result)
                q.task_done()

        self.rqs = []
        self.always_finish = always_finish
        self.q = Queue(maxsize=0)
        self.workers = []
        for i in range(concurrency):
            worker = Thread(target=add_to_queue, args=(self.q,))
            self.workers.append(worker)
            worker.setDaemon(True)
            worker.start()

    def _wait_for_results(self, rq):
        rq.not_empty.acquire()
        rq.not_empty.wait()
        rq.not_empty.notify()
        rq.not_empty.release()

    def add_job(self, func, *args, **kwargs):
        rq = Queue()
        if self.always_finish:
            blocker = Thread(target=self._wait_for_results, args=(rq,))
            blocker.setDaemon(False)
            blocker.start()
        to_add = []
        [to_add.append(i) if i else to_add.append(None) for i in [func, args, kwargs]]
        self.q.put((to_add, rq))
        return rq.get
When a job is created via the .add_job instance method, it immediately returns a promise-like object, which is a reference to the .get method of the results queue. The problem I'm facing is that there seems to be a race condition between this .get and the _wait_for_results method. I think the answer probably involves a Lock or a Condition, but I'm not really sure. Any help is much appreciated :)

Python interprocess communication with idle processes

I have an idle background process to process data in a queue, which I've implemented in the following way. The data passed in this example is just an integer, but I will be passing lists with up to 1000 integers and putting up to 100 lists on the queue per sec. Is this the correct approach, or should I be looking at more elaborate RPC and server methods?
import multiprocessing
import Queue
import time


class MyProcess(multiprocessing.Process):
    def __init__(self, queue, cmds):
        multiprocessing.Process.__init__(self)
        self.q = queue
        self.cmds = cmds

    def run(self):
        exit_flag = False
        while True:
            try:
                obj = self.q.get(False)
                print obj
            except Queue.Empty:
                if exit_flag:
                    break
                else:
                    pass
            if not exit_flag and self.cmds.poll():
                cmd = self.cmds.recv()
                if cmd == -1:
                    exit_flag = True
            time.sleep(.01)


if __name__ == '__main__':
    queue = multiprocessing.Queue()
    proc2main, main2proc = multiprocessing.Pipe(duplex=False)
    p = MyProcess(queue, proc2main)
    p.start()
    for i in range(5):
        queue.put(i)
    main2proc.send(-1)
    proc2main.close()
    main2proc.close()
    # Wait for the worker to finish
    queue.close()
    queue.join_thread()
    p.join()
It depends on how long it will take to process the data. I can't tell because I don't have a sample of the data, but in general it is better to move to more elaborate RPC and server methods when you need things like load balancing, guaranteed uptime, or scalability. Just remember that these things will add complexity, which may make your application harder to deploy, debug, and maintain. It will also increase the latency of processing each task (which might or might not be a concern to you).
I would test it with some sample data, and determine if you need the scalability that multiple servers provide.

Python Process Pool non-daemonic?

Would it be possible to create a python Pool that is non-daemonic? I want a pool to be able to call a function that has another pool inside.
I want this because daemon processes cannot create processes. Specifically, it will cause the error:
AssertionError: daemonic processes are not allowed to have children
For example, consider the scenario where function_a has a pool which runs function_b which has a pool which runs function_c. This function chain will fail, because function_b is being run in a daemon process, and daemon processes cannot create processes.
The multiprocessing.pool.Pool class creates the worker processes in its __init__ method, makes them daemonic and starts them, and it is not possible to re-set their daemon attribute to False before they are started (and afterwards it's not allowed anymore). But you can create your own sub-class of multiprocessing.pool.Pool (multiprocessing.Pool is just a wrapper function) and substitute your own multiprocessing.Process sub-class, which is always non-daemonic, to be used for the worker processes.
Here's a full example of how to do this. The important parts are the two classes NoDaemonProcess and MyPool at the top and to call pool.close() and pool.join() on your MyPool instance at the end.
#!/usr/bin/env python
# -*- coding: UTF-8 -*-

import multiprocessing
# We must import this explicitly, it is not imported by the top-level
# multiprocessing module.
import multiprocessing.pool
import time

from random import randint


class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    def _get_daemon(self):
        return False
    def _set_daemon(self, value):
        pass
    daemon = property(_get_daemon, _set_daemon)

# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class MyPool(multiprocessing.pool.Pool):
    Process = NoDaemonProcess

def sleepawhile(t):
    print("Sleeping %i seconds..." % t)
    time.sleep(t)
    return t

def work(num_procs):
    print("Creating %i (daemon) workers and jobs in child." % num_procs)
    pool = multiprocessing.Pool(num_procs)

    result = pool.map(sleepawhile,
                      [randint(1, 5) for x in range(num_procs)])

    # The following is not really needed, since the (daemon) workers of the
    # child's pool are killed when the child is terminated, but it's good
    # practice to cleanup after ourselves anyway.
    pool.close()
    pool.join()
    return result

def test():
    print("Creating 5 (non-daemon) workers and jobs in main process.")
    pool = MyPool(5)

    result = pool.map(work, [randint(1, 5) for x in range(5)])

    pool.close()
    pool.join()
    print(result)

if __name__ == '__main__':
    test()
I needed to employ a non-daemonic pool in Python 3.7 and ended up adapting the code posted in the accepted answer. Below is the snippet that creates the non-daemonic pool:
import multiprocessing.pool


class NoDaemonProcess(multiprocessing.Process):
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, value):
        pass


class NoDaemonContext(type(multiprocessing.get_context())):
    Process = NoDaemonProcess

# We sub-class multiprocessing.pool.Pool instead of multiprocessing.Pool
# because the latter is only a wrapper function, not a proper class.
class NestablePool(multiprocessing.pool.Pool):
    def __init__(self, *args, **kwargs):
        kwargs['context'] = NoDaemonContext()
        super(NestablePool, self).__init__(*args, **kwargs)
As the current implementation of multiprocessing has been extensively refactored to be based on contexts, we need to provide a NoDaemonContext class that has our NoDaemonProcess as attribute. NestablePool will then use that context instead of the default one.
That said, I should warn that there are at least two caveats to this approach:
It still depends on implementation details of the multiprocessing package, and could therefore break at any time.
There are valid reasons why multiprocessing made it so hard to use non-daemonic processes, many of which are explained here. The most compelling in my opinion is:
As for allowing children threads to spawn off children of its own using
subprocess runs the risk of creating a little army of zombie
'grandchildren' if either the parent or child threads terminate before
the subprocess completes and returns.
As of Python 3.8, concurrent.futures.ProcessPoolExecutor doesn't have this limitation. It can have a nested process pool with no problem at all:
from concurrent.futures import ProcessPoolExecutor as Pool
from itertools import repeat
from multiprocessing import current_process
import time


def pid():
    return current_process().pid


def _square(i):  # Runs in inner_pool
    square = i ** 2
    time.sleep(i / 10)
    print(f'{pid()=} {i=} {square=}')
    return square


def _sum_squares(i, j):  # Runs in outer_pool
    with Pool(max_workers=2) as inner_pool:
        squares = inner_pool.map(_square, (i, j))
        sum_squares = sum(squares)
    time.sleep(sum_squares ** .5)
    print(f'{pid()=}, {i=}, {j=} {sum_squares=}')
    return sum_squares


def main():
    with Pool(max_workers=3) as outer_pool:
        for sum_squares in outer_pool.map(_sum_squares, range(5), repeat(3)):
            print(f'{pid()=} {sum_squares=}')


if __name__ == "__main__":
    main()
The above demonstration code was tested with Python 3.8.
A limitation of ProcessPoolExecutor, however, is that it doesn't have maxtasksperchild. If you need this, consider the answer by Massimiliano instead.
Credit: answer by jfs
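If maxtasksperchild is what you need, a rough (untested) sketch of combining it with the NestablePool defined in the earlier answer could look like this, since multiprocessing.pool.Pool already accepts that keyword argument; `work` here is the nested-pool function from the accepted answer.
# Sketch only: NestablePool and work are the names defined in the answers above.
pool = NestablePool(processes=4, maxtasksperchild=10)
try:
    # work() creates a daemonic child pool of its own, as in the accepted answer.
    results = pool.map(work, [2, 3, 4])
finally:
    pool.close()
    pool.join()
print(results)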
The multiprocessing module has a nice interface to use pools with processes or threads. Depending on your current use case, you might consider using multiprocessing.pool.ThreadPool for your outer Pool, which will result in threads (which are allowed to spawn processes from within) as opposed to processes.
It might be limited by the GIL, but in my particular case (I tested both), the startup time of the processes from the outer Pool, as created in the accepted answer, far outweighed the cost of the ThreadPool solution.
It's really easy to swap Processes for Threads. Read more about how to use a ThreadPool solution here or here.
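A minimal sketch of that shape, with invented function names: a ThreadPool on the outside, and each thread creating its own ordinary (daemonic) process pool.
import multiprocessing
from multiprocessing.pool import ThreadPool

def inner(x):
    return x * x

def outer(n):
    # Threads may start processes, so creating a Pool here is allowed.
    with multiprocessing.Pool(2) as pool:
        return sum(pool.map(inner, range(n)))

if __name__ == '__main__':
    with ThreadPool(3) as outer_pool:
        print(outer_pool.map(outer, [3, 4, 5]))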
On some Python versions, replacing the standard Pool with a custom one can raise the error: AssertionError: group argument must be None for now.
Here is a solution that can help:
import multiprocessing
import multiprocessing.pool


class NoDaemonProcess(multiprocessing.Process):
    # make 'daemon' attribute always return False
    @property
    def daemon(self):
        return False

    @daemon.setter
    def daemon(self, val):
        pass


class NoDaemonProcessPool(multiprocessing.pool.Pool):
    def Process(self, *args, **kwds):
        proc = super(NoDaemonProcessPool, self).Process(*args, **kwds)
        proc.__class__ = NoDaemonProcess
        return proc
I have seen people dealing with this issue by using celery's fork of multiprocessing called billiard (multiprocessing pool extensions), which allows daemonic processes to spawn children. The workaround is to simply replace the multiprocessing module with:
import billiard as multiprocessing
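For illustration, the swap could look roughly like the snippet below. I haven't verified this exact code against a particular billiard version; it simply mirrors the nested-pool shape from the answers above with the import replaced.
import billiard as multiprocessing  # drop-in replacement for the stdlib module

def inner(x):
    return x * x

def outer(n):
    # With billiard, creating this child pool inside a pool worker is allowed.
    pool = multiprocessing.Pool(2)
    try:
        return pool.map(inner, range(n))
    finally:
        pool.close()
        pool.join()

if __name__ == '__main__':
    top = multiprocessing.Pool(2)
    try:
        print(top.map(outer, [2, 3]))
    finally:
        top.close()
        top.join()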
The issue I encountered was in trying to import globals between modules, causing the ProcessPool() line to get evaluated multiple times.
globals.py
from processing import Manager, Lock
from pathos.multiprocessing import ProcessPool
from pathos.threading import ThreadPool


class SingletonMeta(type):
    def __new__(cls, name, bases, dict):
        dict['__deepcopy__'] = dict['__copy__'] = lambda self, *args: self
        return super(SingletonMeta, cls).__new__(cls, name, bases, dict)

    def __init__(cls, name, bases, dict):
        super(SingletonMeta, cls).__init__(name, bases, dict)
        cls.instance = None

    def __call__(cls, *args, **kw):
        if cls.instance is None:
            cls.instance = super(SingletonMeta, cls).__call__(*args, **kw)
        return cls.instance

    def __deepcopy__(self, item):
        return item.__class__.instance


class Globals(object):
    __metaclass__ = SingletonMeta
    """
    This class is a workaround to the bug: AssertionError: daemonic processes are not allowed to have children
    The root cause is that importing this file from different modules causes this file to be reevaluated each time,
    thus ProcessPool() gets reexecuted inside that child thread, thus causing the daemonic processes bug
    """
    def __init__(self):
        print "%s::__init__()" % (self.__class__.__name__)
        self.shared_manager      = Manager()
        self.shared_process_pool = ProcessPool()
        self.shared_thread_pool  = ThreadPool()
        self.shared_lock         = Lock()  # BUG: Windows: global name 'lock' is not defined | doesn't affect cygwin
Then import safely from elsewhere in your code
from globals import Globals
Globals().shared_manager
Globals().shared_process_pool
Globals().shared_thread_pool
Globals().shared_lock
I have written a more expanded wrapper class around pathos.multiprocessing here:
https://github.com/JamesMcGuigan/python2-timeseries-datapipeline/blob/master/src/util/MultiProcessing.py
As a side note, if your use case just requires an async multiprocess map as a performance optimization, then joblib will manage all your process pools behind the scenes and allow this very simple syntax:
from joblib import Parallel, delayed

squares = Parallel(-1)(delayed(lambda num: num**2)(x) for x in range(100))
https://joblib.readthedocs.io/
This presents a workaround for when the error is seemingly a false positive. As also noted by James, this can happen due to an unintentional import from within a daemonic process.
For example, if you have the following simple code, WORKER_POOL can inadvertently be imported from a worker, leading to the error.
import multiprocessing
WORKER_POOL = multiprocessing.Pool()
A simple but reliable approach for a workaround is:
import multiprocessing
import multiprocessing.pool


class MyClass:

    @property
    def worker_pool(self) -> multiprocessing.pool.Pool:
        # Ref: https://stackoverflow.com/a/63984747/
        try:
            return self._worker_pool  # type: ignore
        except AttributeError:
            # pylint: disable=protected-access
            self.__class__._worker_pool = multiprocessing.Pool()  # type: ignore
            return self.__class__._worker_pool  # type: ignore
            # pylint: enable=protected-access
In the above workaround, MyClass.worker_pool can be used without the error. If you think this approach can be improved upon, let me know.
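For completeness, a hypothetical usage sketch of the workaround above; the pool is only created on first access, from the main process.
# Hypothetical usage: the pool is created lazily on first attribute access.
if __name__ == '__main__':
    obj = MyClass()
    print(obj.worker_pool.map(abs, [-1, -2, -3]))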
Since Python version 3.7, we can create a non-daemonic ProcessPoolExecutor.
Using if __name__ == "__main__": is necessary while using multiprocessing.
from concurrent.futures import ProcessPoolExecutor as Pool

num_pool = 10


def main_pool(num):
    print(num)
    strings_write = (f'{num}-{i}' for i in range(num))
    with Pool(num) as subp:
        subp.map(sub_pool, strings_write)
    return None


def sub_pool(x):
    print(f'{x}')
    return None


if __name__ == "__main__":
    with Pool(num_pool) as p:
        p.map(main_pool, list(range(1, num_pool + 1)))
Here is how you can start a pool, even if you are in a daemonic process already. This was tested in python 3.8.5
First, define the Undaemonize context manager, which temporarily deletes the daemon state of the current process.
import multiprocessing
import multiprocessing.process


class Undaemonize(object):
    '''Context Manager to resolve AssertionError: daemonic processes are not allowed to have children

    Tested in python 3.8.5'''
    def __init__(self):
        self.p = multiprocessing.process.current_process()
        if 'daemon' in self.p._config:
            self.daemon_status_set = True
        else:
            self.daemon_status_set = False
        self.daemon_status_value = self.p._config.get('daemon')

    def __enter__(self):
        if self.daemon_status_set:
            del self.p._config['daemon']

    def __exit__(self, type, value, traceback):
        if self.daemon_status_set:
            self.p._config['daemon'] = self.daemon_status_value
Now you can start a pool as follows, even from within a daemon process:
with Undaemonize():
    pool = multiprocessing.Pool(1)
pool.map(...  # you can do something with the pool outside of the context manager
While the other approaches here aim to create a pool that is not daemonic in the first place, this approach allows you to start a pool even if you are already in a daemonic process.
