Self-joining thread pool: where's my race condition?

Since I use a similar pattern in my work a lot, I decided to write a class that abstracts very simple worker concurrency via job queue / threading. I know there are already things out there that solve this, but I also wanted to use this as an opportunity to hone my multithreading skills.
The main challenge I've given myself is that I want this to let jobs finish, even if they are not explicitly blocked by Queue.join(). "A job finishing" is defined by the input function returning a value (or None). The way I have attempted to accomplish this is by having each job create its own results queue rq, which is then checked by _wait_for_results in a non-daemon thread. That thread blocks the automatic exit of all the daemonized threads until rq is filled by the worker in add_to_queue.
Here is the full class:
from queue import Queue  # Python 2: from Queue import Queue
from threading import Thread
class EasyPool(object):
    def __init__(self, concurrency, always_finish=True):
        def add_to_queue(q):
            # Worker loop: run each job and put its result on the job's own queue.
            while True:
                func_data, rq = q.get()
                func, args, kwargs = func_data
                if not args:
                    args = []
                if not kwargs:
                    kwargs = {}
                result = func(*args, **kwargs)
                rq.put(result)
                q.task_done()
        self.rqs = []
        self.always_finish = always_finish
        self.q = Queue(maxsize=0)
        self.workers = []
        for i in range(concurrency):
            worker = Thread(target=add_to_queue, args=(self.q,))
            self.workers.append(worker)
            worker.setDaemon(True)
            worker.start()
    def _wait_for_results(self, rq):
        rq.not_empty.acquire()
        rq.not_empty.wait()
        rq.not_empty.notify()
        rq.not_empty.release()
    def add_job(self, func, *args, **kwargs):
        rq = Queue()
        if self.always_finish:
            blocker = Thread(target=self._wait_for_results, args=(rq,))
            blocker.setDaemon(False)
            blocker.start()
        # Replace empty args/kwargs with None; the worker turns them back into [] / {}.
        to_add = [i if i else None for i in [func, args, kwargs]]
        self.q.put((to_add, rq))
        return rq.get
When a job is created via the .add_job instance method, it immediately returns a promise-like object, which is a reference to the .get method of the results queue. The problem I'm facing is that there seems to be a race condition between this .get and the _wait_for_results method. I think the answer probably involves a Lock or a Condition, but I'm not really sure. Any help is much appreciated :)
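One plausible cause of the race, offered as a guess rather than a diagnosis: _wait_for_results calls not_empty.wait() without first checking whether a result has already arrived, so if the worker's put() or the caller's get() gets there first, the wakeup is lost and the blocker thread may wait forever. The sketch below swaps the condition variable for a per-job threading.Event, which stays set once signalled. The class name EasyPoolSketch, the _work method, and the choice to return a raised exception as the result are all assumptions made for illustration.

```python
import threading
from queue import Queue


class EasyPoolSketch:
    """Hypothetical variant of EasyPool that signals completion with an Event."""

    def __init__(self, concurrency, always_finish=True):
        self.always_finish = always_finish
        self.q = Queue()
        for _ in range(concurrency):
            threading.Thread(target=self._work, daemon=True).start()

    def _work(self):
        while True:
            func, args, kwargs, rq, done = self.q.get()
            try:
                result = func(*args, **kwargs)
            except Exception as exc:
                # Assumption: hand the exception back instead of killing the worker.
                result = exc
            rq.put(result)
            done.set()  # the flag stays set, so a late waiter cannot miss it
            self.q.task_done()

    def add_job(self, func, *args, **kwargs):
        rq, done = Queue(), threading.Event()
        if self.always_finish:
            # Non-daemon thread: keeps the interpreter alive until the job is done.
            threading.Thread(target=done.wait, daemon=False).start()
        self.q.put((func, args, kwargs, rq, done))
        return rq.get


pool = EasyPoolSketch(concurrency=2)
get_result = pool.add_job(pow, 2, 5)
print(get_result())  # 32
```

Because the blocker waits on the Event rather than on the queue's internal condition, it no longer competes with the caller's rq.get() for the same notification.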

Related

How to create redis workers dynamically without blocking the main thread?

I want to have a queue - worker management tool, that allows adding new queues, and registering jobs to those queues, with workers spawned to handle those jobs.
I have this code so far:
from redis import Redis
from rq import Queue, Retry, Worker
class WorkerPool: # TODO: find a better name
    def __init__(self):
        self._queues = {}
        self._workers = []
        self._redis_conn = Redis()
    def _get_queue(self, name):
        try:
            return self._queues[name]
        except KeyError:
            new_queue = Queue(name, connection=self._redis_conn)
            self._queues[name] = new_queue
            new_worker = Worker([new_queue], connection=self._redis_conn, name=name)
            new_worker.work() # Blocking :(
            return new_queue
    def add_job(self, queue, func, *func_args):
        q = self._get_queue(queue)
        job = q.enqueue(func, *func_args, retry=Retry(max=3))
        return job
As can be seen - the work() function blocks execution, while I want it to work in the background. I guess I can just create another thread here - and call work() from one thread, while the main thread returns the job, however, this seems a bit awkward to me. Is there a built-in Redis (or other known module) solution for this?
PS, better names for my class are welcome :)
This is my take on multiprocessing it (threading won't work due to signals sent from illegal threads):
import multiprocessing as mp
from redis import Redis
from rq import Queue, Retry, Worker
class WorkerPool: # TODO: find a better name
    def __init__(self):
        self._queues = {}
        self._worker_procs = []
        self._redis_conn = Redis()
    def __del__(self):
        for proc in self._worker_procs:
            proc.kill()
    def _get_queue(self, name):
        try:
            return self._queues[name]
        except KeyError:
            new_queue = Queue(name, connection=self._redis_conn)
            self._queues[name] = new_queue
            new_worker = Worker([new_queue], connection=self._redis_conn, name=name)
            worker_process = mp.Process(target=new_worker.work)
            worker_process.start()
            self._worker_procs.append(worker_process)
            return new_queue
    def add_job(self, queue, func, *func_args):
        q = self._get_queue(queue)
        job = q.enqueue(func, *func_args, retry=Retry(max=3))
        return job
Not sure how good this is, but it seems to do what I want for now

If you only need small-scale multiprocessing, tied to one main process, all running on the one machine, take a look at the multiprocessing module and the concurrent.futures module and their Pool and ProcessPoolExecutor objects. Unless you have specific requirements, it's probably better to use the Pool or ProcessPoolExecutor rather than start up Process objects manually. (In that case Redis may or may not be overkill.)
If your needs are larger-scale, workers across multiple machines, there's a whole category of software for running these; RabbitMQ is one widely-known one, but it's just one of several, each with its own strengths and weaknesses. Each of the cloud providers (if you're in the cloud) also has its own offering for this functionality. You probably want to read up on the features of several of the off-the-shelf solutions, decide which one is a good match, then set that up.
That said, I have in the past implemented a custom Redis-based queueing system; sometimes you really do need something not provided by any of the existing solutions. In that situation, the design will be heavily influenced by what features you do need. In my case, it was fine-grained priorities...

How to use "with" with a list of objects

Suppose I have a class that will spawn a thread and implements .__enter__ and .__exit__ so I can use it as such:
with SomeAsyncTask(params) as t:
    # do stuff with `t`
    t.thread.start()
    t.thread.join()
.__exit__ might perform certain actions for clean-up purposes (ie. removing temp files, etc.)
That works all fine until I have a list of SomeAsyncTasks that I would like to start all at once.
list_of_async_task_params = [params1, params2, ...]
How should I use with on the list? I'm hoping for something like this:
with [SomeAsyncTask(params) for params in list_of_async_task_params] as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()

I think contextlib.ExitStack is exactly what you're looking for. It's a way of combining an indeterminate number of context managers into a single one safely (so that an exception while entering one context manager won't cause it to skip exiting the ones it's already entered successfully).
The example from the docs is pretty instructive:
with ExitStack() as stack:
    files = [stack.enter_context(open(fname)) for fname in filenames]
    # All opened files will automatically be closed at the end of
    # the with statement, even if attempts to open files later
    # in the list raise an exception
This can be adapted to your "hoped for" code pretty easily:
import contextlib
with contextlib.ExitStack() as stack:
    tasks = [stack.enter_context(SomeAsyncTask(params))
             for params in list_of_async_task_params]
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()
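To see the unwinding behaviour in isolation, here is a small runnable sketch. The tracked() context manager and the events list are hypothetical stand-ins for SomeAsyncTask, invented only so the example runs without threads or files. The third manager fails on entry, and the two that were already entered are still exited, in reverse order.

```python
from contextlib import ExitStack, contextmanager

events = []  # records what happened, in order


@contextmanager
def tracked(name, fail=False):
    # Hypothetical stand-in for SomeAsyncTask: optionally fails while entering.
    if fail:
        raise RuntimeError("cannot enter " + name)
    events.append("enter " + name)
    try:
        yield name
    finally:
        events.append("exit " + name)


try:
    with ExitStack() as stack:
        # "c" raises on entry; "a" and "b" have already been entered.
        tasks = [stack.enter_context(tracked(n, fail=(n == "c"))) for n in "abc"]
except RuntimeError:
    pass

print(events)  # ['enter a', 'enter b', 'exit b', 'exit a']
```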

Note: Somehow I missed the fact that your Thread subclass was also a context manager itself, so the code below doesn't make that assumption. Nevertheless, it might be helpful when using more "generic" kinds of threads (where using something like contextlib.ExitStack wouldn't be an option).
Your question is a little light on details, so I made some up; however, this might be close to what you want. It defines an AsyncTaskListContextManager class that has the __enter__() and __exit__() methods required to support the context manager protocol (and the associated with statement).
import threading
from time import sleep
class SomeAsyncTask(threading.Thread):
    def __init__(self, name, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.name = name
        self.status_lock = threading.Lock()
        self.running = False
    def run(self):
        with self.status_lock:
            self.running = True
        while True:
            with self.status_lock:
                if not self.running:
                    break
            print('task {!r} running'.format(self.name))
            sleep(.1)
        print('task {!r} stopped'.format(self.name))
    def stop(self):
        with self.status_lock:
            self.running = False
class AsyncTaskListContextManager:
    def __init__(self, params_list):
        self.threads = [SomeAsyncTask(params) for params in params_list]
    def __enter__(self):
        for thread in self.threads:
            thread.start()
        return self
    def __exit__(self, *args):
        for thread in self.threads:
            if thread.is_alive():
                thread.stop()
                thread.join() # wait for it to terminate
        return None # allows exceptions to be processed normally
params = ['Fee', 'Fie', 'Foe']
with AsyncTaskListContextManager(params) as task_list:
    for _ in range(5):
        sleep(1)
    print('leaving task list context')
print('end-of-script')
Output:
task 'Fee' running
task 'Fie' running
task 'Foe' running
task 'Foe' running
task 'Fee' running
task 'Fie' running
... etc
task 'Fie' running
task 'Fee' running
task 'Foe' running
leaving task list context
task 'Foe' stopped
task 'Fie' stopped
task 'Fee' stopped
end-of-script

@martineau's answer should work. Here's a more generic method that should work for other cases. Note that exceptions are not handled in __exit__(): if one manager's __exit__() fails, the rest won't be called. A generic solution would probably raise an aggregate exception and allow you to handle it. Another corner case is when the second manager's __enter__() method raises an exception: the first manager's __exit__() will not be called in that case.
class list_context_manager:
    def __init__(self, managers):
        self.managers = managers
    def __enter__(self):
        for m in self.managers:
            m.__enter__()
        return self.managers
    def __exit__(self, exc_type, exc_value, traceback):
        # Pass the exception details (or three Nones) on to every manager.
        for m in self.managers:
            m.__exit__(exc_type, exc_value, traceback)
It can then be used like in your question:
with list_context_manager([SomeAsyncTask(params) for params in list_of_async_task_params]) as tasks:
    # do stuff with `tasks`
    for task in tasks:
        task.thread.start()
    for task in tasks:
        task.thread.join()

asynchronous post request in python

I have a python script which has a line that makes a post request as shown below:
rsp = requests.post(img_url, data=img_json_data, headers=img_headers)
print rsp # just for debugging
But suppose I don't want my script to keep waiting for the response, but instead run the above lines asynchronously in parallel to the rest of the code. What would be the easiest way to do so?

This is a class that allows easy parallel execution on multiple workers.
Basically, it creates worker threads that wait for jobs in a Queue.
Once you put a task on the queue, a worker executes it and puts the result in another Queue.
get_results() calls join() on the task queue to wait until everything is done, then empties the results queue and returns the results as a dict keyed by task ID.
from Queue import Queue  # Python 2; on Python 3 use: from queue import Queue
import logging
from threading import Thread
logger = logging.getLogger(__name__)
class Parallel(object):
    def __init__(self, thread_num=10):
        # create queues
        self.tasks_queue = Queue()
        self.results_queue = Queue()
        # create a threading pool
        self.pool = []
        for i in range(thread_num):
            worker = Worker(i, self.tasks_queue, self.results_queue)
            self.pool.append(worker)
            worker.start()
        logger.debug('Created %s workers',thread_num)
    def add_task(self, task_id, func, *args, **kwargs):
        """
        Add task to queue, they will be started as soon as added
        :param task_id: key under which the result is returned
        :param func: function to execute
        :param args: args to transmit
        :param kwargs: kwargs to transmit
        """
        logger.debug('Adding one task to queue (%s)', func.__name__)
        # add task to queue
        self.tasks_queue.put_nowait((task_id, func, args, kwargs))
    def get_results(self):
        logger.debug('Waiting for processes to ends')
        self.tasks_queue.join()
        logger.debug('Processes terminated, fetching results')
        results = []
        while not self.results_queue.empty():
            results.append(self.results_queue.get())
        logger.debug('Results fetched, returning data')
        return dict(results)
class Worker(Thread):
    def __init__(self, thread_id, tasks, results):
        super(Worker, self).__init__()
        self.id = thread_id
        self.tasks = tasks
        self.results = results
        self.daemon = True
    def run(self):
        logger.debug('Worker %s launched', self.id)
        while True:
            task_id, func, args, kwargs = self.tasks.get()
            logger.debug('Worker %s start to work on %s', self.id, func.__name__)
            try:
                self.results.put_nowait((task_id, func(*args, **kwargs)))
            except Exception as err:
                # A failed task is only logged; it leaves no entry in the results.
                logger.debug('Thread(%s): error with task %s\n%s', self.id, repr(func.__name__), err)
            finally:
                logger.debug('Worker %s finished work on %s', self.id, func.__name__)
                self.tasks.task_done()
import requests
# create parallel instance with 4 workers
parallel = Parallel(4)
# launch jobs
for i in range(20):
    parallel.add_task(i, requests.post, img_url, data=img_json_data, headers=img_headers)
# wait for all jobs to return data
print(parallel.get_results())
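On Python 3, the standard library's concurrent.futures.ThreadPoolExecutor covers much the same ground with less code. The following is only a sketch: fake_post is a hypothetical stand-in for requests.post so that the example runs offline, and the URL and payload are made up.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def fake_post(url, data=None):
    # Hypothetical stand-in for requests.post; sleeps to imitate network latency.
    time.sleep(0.1)
    return {"url": url, "echo": data}


executor = ThreadPoolExecutor(max_workers=4)

# submit() returns immediately with a Future; the call runs on a worker thread.
future = executor.submit(fake_post, "http://example.invalid/img", data={"id": 1})

# The main thread is free to do other work in the meantime.
other_work = sum(range(1000))

# result() blocks only at the point where the response is actually needed.
result = future.result()
executor.shutdown()
```

Swapping fake_post for requests.post (with the question's img_url, img_json_data and img_headers) should give the fire-now, read-later behaviour the question asks for.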

You can use Celery for this. With Celery the processing is asynchronous, and you can check a task's status as well as its result.

You need to queue this task for asynchronous processing.
There are multiple options here:
celery, which has a steeper learning curve for a newcomer.
python-rq, which is relatively lightweight and a go-to library.
You can use any of the message queues, such as Redis or RabbitMQ.

How to use queue with concurrent future ThreadPoolExecutor in python 3?

I am using the plain threading module to run concurrent jobs. Now I would like to take advantage of the concurrent.futures module. Can someone give me an example of using a queue with the concurrent.futures library?
I am getting TypeError: 'Queue' object is not iterable
I don't know how to iterate over queues.
Code snippet:
def run(item):
    self.__log.info(str(item))
    return True
<queue filled here>
with concurrent.futures.ThreadPoolExecutor(max_workers = 100) as executor:
    furtureIteams = { executor.submit(run, item): item for item in list(queue)}
    for future in concurrent.futures.as_completed(furtureIteams):
        f = furtureIteams[future]
        print(f)

I would suggest something like this:
def run(queue):
    item = queue.get()
    self.__log.info(str(item))
    return True
<queue filled here>
workerThreadsToStart = 10
with concurrent.futures.ThreadPoolExecutor(max_workers = 100) as executor:
    furtureIteams = { executor.submit(run, queue): index for index in range(workerThreadsToStart)}
    for future in concurrent.futures.as_completed(furtureIteams):
        f = furtureIteams[future]
        print(f)
The problem you will run into is that a queue is meant to be endless: it is a medium that decouples the threads that put items into it from the threads that take items out of it.
When you have a finite number of items, or you compute all items at once and only afterwards process them in parallel, a queue makes no sense.
A ThreadPoolExecutor makes a queue obsolete in these cases.
I had a look at the ThreadPoolExecutor source:
def submit(self, fn, *args, **kwargs): # line 94
    self._work_queue.put(w) # line 102
A Queue is used inside.
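For the finite case described above, a minimal sketch without any explicit queue could look like this. The items list and the body of run are placeholders invented for illustration; the executor's internal work queue does the buffering.

```python
import concurrent.futures


def run(item):
    # Placeholder work: the real function would process the item.
    return item.upper()


items = ["a", "b", "c"]  # assumed: a finite collection known up front

results = {}
with concurrent.futures.ThreadPoolExecutor(max_workers=3) as executor:
    # Submit every item directly; no user-level Queue is involved.
    future_to_item = {executor.submit(run, item): item for item in items}
    for future in concurrent.futures.as_completed(future_to_item):
        results[future_to_item[future]] = future.result()

print(sorted(results.items()))  # [('a', 'A'), ('b', 'B'), ('c', 'C')]
```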

As commented above, you can use the iter() function to execute a ThreadPool on a queue object. A very general code for this would look something like this:
with concurrent.futures.ThreadPoolExecutor() as executor:
    executor.map(run, iter(queue.get, None))
Here the run function performs the desired work on each item of the queue.
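One detail worth spelling out: iter(queue.get, None) keeps calling queue.get() until it returns None, so the queue has to end with a None sentinel or the iteration will block forever. A runnable sketch, where the squaring run function and the name work are assumptions made for illustration:

```python
import queue
from concurrent.futures import ThreadPoolExecutor


def run(item):
    # Placeholder work for the example.
    return item * item


work = queue.Queue()
for n in range(5):
    work.put(n)
work.put(None)  # sentinel: tells iter(work.get, None) where to stop

with ThreadPoolExecutor(max_workers=3) as executor:
    # map() yields results in submission order.
    results = list(executor.map(run, iter(work.get, None)))

print(results)  # [0, 1, 4, 9, 16]
```

This also means None cannot be used as a regular work item with this pattern; a dedicated sentinel object would be needed in that case.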

end daemon processes with multiprocessing module

I include an example usage of multiprocessing below. This is a process pool model. It is not as simple as it might be, but is relatively close in structure to the code I'm actually using. It also uses sqlalchemy, sorry.
My question is - I currently have a situation where I have a relatively long running Python script which is executing a number of functions which each look like the code below, so the parent process is the same in all cases. In other words, multiple pools are created by one python script. (I don't have to do it this way, I suppose, but the alternative is to use something like os.system and subprocess.) The problem is that these processes hang around and hold on to memory. The docs say these daemon processes are supposed to stick around till the parent process exits, but what if the parent process then goes on to generate another pool of processes and doesn't exit immediately?
Calling terminate() works, but this doesn't seem terribly polite. Is there a good way to ask the processes to terminate nicely? I.e. clean up after yourself and go away now, I need to start up the next pool?
I also tried calling join() on the processes. According to the documentation this means wait for the processes to terminate. What if they don't plan to terminate? What actually happens is that the process hangs.
Thanks in advance.
Regards, Faheem.
import multiprocessing, time
class Worker(multiprocessing.Process):
    """Process executing tasks from a given tasks queue"""
    def __init__(self, queue, num):
        multiprocessing.Process.__init__(self)
        self.num = num
        self.queue = queue
        self.daemon = True
    def run(self):
        import traceback
        while True:
            func, args, kargs = self.queue.get()
            try:
                print("trying %s with args %s" % (func.__name__, args))
                func(*args, **kargs)
            except:
                traceback.print_exc()
            self.queue.task_done()
class ProcessPool:
    """Pool of threads consuming tasks from a queue"""
    def __init__(self, num_threads):
        self.queue = multiprocessing.JoinableQueue()
        self.workerlist = []
        self.num = num_threads
        for i in range(num_threads):
            self.workerlist.append(Worker(self.queue, i))
    def add_task(self, func, *args, **kargs):
        """Add a task to the queue"""
        self.queue.put((func, args, kargs))
    def start(self):
        for w in self.workerlist:
            w.start()
    def wait_completion(self):
        """Wait for completion of all the tasks in the queue"""
        self.queue.join()
        for worker in self.workerlist:
            print(worker.__dict__)
            #worker.terminate() <--- terminate used here
            worker.join() # <--- join used here
start = time.time()
from sqlalchemy import *
from sqlalchemy.orm import *
dbuser = ''
password = ''
dbname = ''
dbstring = "postgres://%s:%s@localhost:5432/%s"%(dbuser, password, dbname)
db = create_engine(dbstring, echo=True)
m = MetaData(db)
def make_foo(i):
    t1 = Table('foo%s'%i, m, Column('a', Integer, primary_key=True))
conn = db.connect()
for i in range(10):
    conn.execute("DROP TABLE IF EXISTS foo%s"%i)
conn.close()
for i in range(10):
    make_foo(i)
m.create_all()
def do(i, dbstring):
    dbstring = "postgres://%s:%s@localhost:5432/%s"%(dbuser, password, dbname)
    db = create_engine(dbstring, echo=True)
    Session = scoped_session(sessionmaker())
    Session.configure(bind=db)
    Session.execute("ALTER TABLE foo%s SET ( autovacuum_enabled = false );"%i)
    Session.execute("ALTER TABLE foo%s SET ( autovacuum_enabled = true );"%i)
    Session.commit()
pool = ProcessPool(5)
for i in range(10):
    pool.add_task(do, i, dbstring)
pool.start()
pool.wait_completion()

My way of dealing with this was:
import multiprocessing
for prc in multiprocessing.active_children():
    prc.terminate()
I like this more so I don't have to pollute the worker function with some if clause.

You know multiprocessing already has classes for worker pools, right?
The standard way is to send your workers a quit signal through the queue, one per worker:
queue.put(("QUIT", None, None))
Then check for it in the worker loop:
if func == "QUIT":
    return
