Is it possible to manually lock/unlock a Queue? - python

I'm curious if there is a way to lock a multiprocessing.Queue object manually.
I have a pretty standard Producer/Consumer pattern set up in which my main thread is constantly producing a series of values, and a pool of multiprocessing.Process workers is acting on the values produced.
It is all controlled via a sole multiprocessing.Queue().
import time
import multiprocessing

class Reader(multiprocessing.Process):
    def __init__(self, queue):
        multiprocessing.Process.__init__(self)
        self.queue = queue

    def run(self):
        while True:
            item = self.queue.get()
            if isinstance(item, str):
                break

if __name__ == '__main__':
    queue = multiprocessing.Queue()
    reader = Reader(queue)
    reader.start()

    start_time = time.time()
    while time.time() - start_time < 10:
        queue.put(1)

    queue.put('bla bla bla sentinel')
    reader.join()  # multiprocessing.Queue has no join(); wait on the process instead
The issue I'm running into is that my worker pool cannot consume and process the queue as fast as the main thread inserts values into it. So after some period of time, the queue grows so unwieldy that it raises a MemoryError.
An obvious solution would be to simply add a wait check in the producer to stall it from putting any more values into the queue. Something along the lines of:
while time.time() - start_time < 10:
    queue.put(1)
    while queue.qsize() > some_size:
        time.sleep(.1)

queue.put('bla bla bla sentinel')
reader.join()
However, because of the funky nature of the program, I'd like to dump everything in the Queue to a file for later processing. But! Without being able to temporarily lock the queue, the worker can't consume everything in it as the producer is constantly filling it back up with junk -- conceptually anyway. After numerous tests it seems that at some point one of the locks wins (but usually the one adding to the queue).
Edit: Also, I realize it'd be possible to simply stop the producer and consume it from that thread... but that makes the Single Responsibility guy in me feel sad, as the producer is a Producer, not a Consumer.
Edit:
After looking through the source of Queue, I came up with this:
def dump_queue(q):
    q._rlock.acquire()
    try:
        res = []
        while not q.empty():
            res.append(q._recv())
            q._sem.release()
        return res
    finally:
        q._rlock.release()
However, I'm too scared to use it! I have no idea if this is "correct" or not. I don't have a firm enough grasp to know whether this will hold up without blowing up any of the Queue's internals.
Anyone know if this'll break? :)

Given what was said in the comments, a Queue is simply the wrong data structure for your problem - but it is likely part of a usable solution.
It sounds like you have only one Producer. Create a new, Producer-local (not shared across processes) class implementing the semantics you really need. For example,
from collections import deque

class FlushingQueue:
    def __init__(self, mpqueue, path_to_spill_file, maxsize=1000, dumpsize=1000000):
        self.q = mpqueue  # a shared `multiprocessing.Queue`
        self.dump_path = path_to_spill_file
        self.maxsize = maxsize
        self.dumpsize = dumpsize
        self.d = deque()  # buffer for overflowing values

    def put(self, item):
        if self.q.qsize() < self.maxsize:
            self.q.put(item)
            # in case consumers have made real progress
            while self.d and self.q.qsize() < self.maxsize:
                self.q.put(self.d.popleft())
        else:
            self.d.append(item)
            if len(self.d) >= self.dumpsize:
                self.dump()

    def dump(self):
        # code to flush self.d to the spill file (self.dump_path); no
        # need to look at self.q at all
        pass
I bet you can make this work :-)
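For concreteness, here's a hedged usage sketch from the producer's side (assumptions: Reader is the consumer process from the question, the spill path is arbitrary, and dump() has been filled in to write out and clear self.d):

import multiprocessing

if __name__ == '__main__':
    mpq = multiprocessing.Queue()
    reader = Reader(mpq)   # the consumer process from the question
    reader.start()

    fq = FlushingQueue(mpq, '/tmp/spill.dat', maxsize=1000, dumpsize=100000)
    for i in range(10 ** 7):
        fq.put(i)          # never blocks: overflow goes to the deque, then the spill file
    fq.dump()              # flush whatever is still buffered locally
    mpq.put('sentinel')    # a str item stops the Reader
    reader.join()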


Does python provide a synchronized buffer?

I'm very familiar with Python queue.Queue. This is definitely the thing you want when you want to have a reliable stream between consumer and producer threads.
However, sometimes you have producers that are faster than consumers and are forced to drop data (as with live video frame capture, for example, where we may typically want to buffer just the last one or two frames).
Does Python provide an asynchronous buffer class, similar to queue.Queue?
It's not exactly obvious how to correctly implement one using queue.Queue.
I could, for example:
buf = queue.Queue(maxsize=3)

def produce(msg):
    if buf.full():
        buf.get(block=False)  # Make space
    buf.put(msg, block=False)

def consume():
    msg = buf.get(block=True)
    work(msg)
although I don't particularly like that produce is not a locked, queue-atomic operation. A consume may start between full() and get(), for example, and it would (probably) be broken in a multi-producer scenario.
Is there an out-of-the-box solution?
There's nothing built in for this, but it appears straightforward enough to build your own buffer class that wraps a Queue, provides mutual exclusion between .put() and .get() with its own lock, and uses a Condition variable to wake up would-be consumers whenever an item is added. Like so:
import queue
import threading

class SBuf:
    def __init__(self, maxsize):
        self.q = queue.Queue()
        self.maxsize = maxsize
        self.nonempty = threading.Condition()

    def get(self):
        with self.nonempty:
            while not self.q.qsize():
                self.nonempty.wait()
            assert self.q.qsize()
            return self.q.get()

    def put(self, v):
        with self.nonempty:
            while self.q.qsize() >= self.maxsize:
                self.q.get()  # discard the oldest entry
            self.q.put(v)
            assert 0 < self.q.qsize() <= self.maxsize
            self.nonempty.notify_all()
BTW, I advise against trying to build this kind of logic out of raw locks. Of course it can be done, but Condition variables are very carefully designed to save you from universes of unintended race conditions. There's a learning curve for Condition variables, but one well worth climbing: they often make things easy instead of brain-busting. Indeed, Python's threading module uses them internally to implement all sorts of things.
An Alternative
In the above, we only invoke queue.Queue methods under the protection of our own lock, so there's really no need to use a thread-safe container - we're supplying all the thread safety already.
So it would be a bit leaner to use a simpler container. Happily, a collections.deque can be configured to discard all but the most recent N entries itself, but "at C speed". Like so:
import collections
import threading

class SBuf:
    def __init__(self, maxsize):
        self.q = collections.deque(maxlen=maxsize)
        self.maxsize = maxsize
        self.nonempty = threading.Condition()

    def get(self):
        with self.nonempty:
            while not self.q:
                self.nonempty.wait()
            assert self.q
            return self.q.popleft()

    def put(self, v):
        with self.nonempty:
            self.q.append(v)  # discards oldest, if needed
            assert 0 < len(self.q) <= self.maxsize
            self.nonempty.notify()
This also changed .notify_all() to .notify(). In this use case either works correctly, but we're only adding one item, so there's no need to wake more than one consumer. If multiple consumers are waiting, .notify_all() will wake all of them, but only the first will find a non-empty queue; the others will see that it's empty and just .wait() again.
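As a quick illustration, here is a minimal usage sketch of the deque-based SBuf (the frame/producer/consumer names are placeholders, not part of the class above):

import threading
import time

buf = SBuf(maxsize=2)  # keep only the two most recent frames

def producer():
    for frame in range(100):
        buf.put(frame)     # older frames are silently discarded
        time.sleep(0.001)  # fast producer

def consumer():
    while True:
        frame = buf.get()  # blocks until something is available
        time.sleep(0.01)   # slow consumer; stale frames get skipped
        print('consumed', frame)

threading.Thread(target=producer, daemon=True).start()
threading.Thread(target=consumer, daemon=True).start()
time.sleep(1)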
Queue is already multiprocessing- and multithreading-safe, in that you can't write to and read from the queue at the same time. However, you are correct that nothing stops the queue from being modified between the full() and get() calls.
For that you can use a lock, which is how you control thread access across multiple statements. The lock can only be acquired once, so if it's currently held, all other threads will wait until it has been released before they continue.
import queue
import threading
import time

lock = threading.Lock()

def produce(msg):
    with lock:
        if buf.full():
            buf.get(block=False)  # Make space
        buf.put(msg, block=False)

def consume():
    msg = None
    while msg is None:
        with lock:
            try:
                msg = buf.get(block=False)
            except queue.Empty:
                pass
        if msg is None:
            time.sleep(0.01)  # buffer is empty: wait outside the lock, then retry
    work(msg)

Mixing multiprocessing Pool (producers) with Process (consumer) [duplicate]

I'm using the multiprocessing module to split up a very large task. It works for the most part, but I must be missing something obvious with my design, because this way it's very hard for me to effectively tell when all of the data has been processed.
I have two separate tasks that run; one that feeds the other. I guess this is a producer/consumer problem. I use a shared Queue between all processes, where the producers fill up the queue, and the consumers read from the queue and do the processing. The problem is that there is a finite amount of data, so at some point everyone needs to know that all of the data has been processed so the system can shut down gracefully.
It would seem to make sense to use the map_async() function, but since the producers are filling up the queue, I don't know all of the items up front, so I have to go into a while loop and use apply_async() and try to detect when everything is done with some sort of timeout...ugly.
I feel like I'm missing something obvious. How can this be better designed?
PRODUCER
class ProducerProcess(multiprocessing.Process):
    def __init__(self, item, consumer_queue):
        self.item = item
        self.consumer_queue = consumer_queue
        multiprocessing.Process.__init__(self)

    def run(self):
        for record in get_records_for_item(self.item):  # this takes time
            self.consumer_queue.put(record)

def start_producer_processes(producer_queue, consumer_queue, max_running):
    running = []
    while not producer_queue.empty():
        running = [r for r in running if r.is_alive()]
        if len(running) < max_running:
            producer_item = producer_queue.get()
            p = ProducerProcess(producer_item, consumer_queue)
            p.start()
            running.append(p)
        time.sleep(1)
CONSUMER
def process_consumer_chunk(queue, chunksize=10000):
    for i in xrange(0, chunksize):
        try:
            # don't wait too long for an item;
            # if new records don't arrive in 10 seconds, process what you have
            # and let the next process pick up more items.
            record = queue.get(True, 10)
        except Queue.Empty:
            break
        do_stuff_with_record(record)
MAIN
if __name__ == "__main__":
    manager = multiprocessing.Manager()
    consumer_queue = manager.Queue(1024*1024)
    producer_queue = manager.Queue()

    producer_items = xrange(0, 10)
    for item in producer_items:
        producer_queue.put(item)

    p = multiprocessing.Process(target=start_producer_processes,
                                args=(producer_queue, consumer_queue, 8))
    p.start()

    consumer_pool = multiprocessing.Pool(processes=16, maxtasksperchild=1)
Here is where it gets cheesy. I can't use map, because the list to consume is being filled up at the same time. So I have to go into a while loop and try to detect a timeout. The consumer_queue can become empty while the producers are still trying to fill it up, so I can't just detect an empty queue and quit on that.
    timed_out = False
    timeout = 1800
    while 1:
        try:
            result = consumer_pool.apply_async(process_consumer_chunk,
                                               (consumer_queue, ), dict(chunksize=chunksize,))
            if timed_out:
                timed_out = False
        except Queue.Empty:
            if timed_out:
                break
            timed_out = True
            time.sleep(timeout)
        time.sleep(1)

    consumer_queue.join()
    consumer_pool.close()
    consumer_pool.join()
I thought that maybe I could get() the records in the main thread and pass them into the consumer instead of passing the queue in, but I think I end up with the same problem that way: I still have to run a while loop and use apply_async(). Thank you in advance for any advice!
You could use a manager.Event to signal the end of the work. This event can be shared between all of your processes, and when you signal it from your main process the other workers can gracefully shut down.
while not event.is_set():
    ...rest of code...
So, your consumers would wait for the event to be set and handle the cleanup once it is set.
To determine when to set this flag you can do a join on the producer threads and when those are all complete you can then join on the consumer threads.
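A minimal sketch of that idea (Python 3; the producer/consumer bodies are placeholders for your real work -- the key point is joining the producers before setting the event):

import multiprocessing
from queue import Empty

def producer(q):
    for i in range(100):
        q.put(i)  # stand-in for real records

def consumer(q, done):
    # Keep draining until the event is set AND the queue is empty.
    while not (done.is_set() and q.empty()):
        try:
            item = q.get(timeout=0.5)
        except Empty:
            continue
        # ... do_stuff_with_record(item) ...

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    q = manager.Queue()
    done = manager.Event()
    producers = [multiprocessing.Process(target=producer, args=(q,)) for _ in range(2)]
    consumers = [multiprocessing.Process(target=consumer, args=(q, done)) for _ in range(4)]
    for p in producers + consumers:
        p.start()
    for p in producers:
        p.join()   # wait for all producers to finish first
    done.set()     # then signal the consumers that no more data is coming
    for c in consumers:
        c.join()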
I would like to strongly recommend SimPy instead of multiprocessing/threading for doing discrete event simulation.

Monitoring a threaded Python program with htop

First of all, this is the code I am referring to:
from random import randint
import time
from threading import Thread
import Queue

class TestClass(object):
    def __init__(self, queue):
        self.queue = queue

    def do(self):
        while True:
            wait = randint(1, 10)
            time.sleep(1.0/wait)
            print '[>] Enqueuing from TestClass.do...', wait
            self.queue.put(wait)

class Handler(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue

    def run(self):
        task_no = 0
        while True:
            task = self.queue.get()
            task_no += 1
            print ('[<] Dequeuing from Handler.run...', task,
                   'task_no=', task_no)
            time.sleep(1)  # emulate processing time
            print '[*] Task %d done!' % task_no
            self.queue.task_done()

def main():
    q = Queue.Queue()
    watchdog = TestClass(q)
    observer = Thread(target=watchdog.do)
    observer.setDaemon(True)

    handler = Handler(q)
    handler.setDaemon(True)

    handler.start()
    observer.start()

    try:
        while True:
            wait = randint(1, 10)
            time.sleep(1.0/wait)
            print '[>] Enqueuing from main...', wait
            q.put(wait)
    except KeyboardInterrupt:
        print '[*] Exiting...', True

if __name__ == '__main__':
    main()
While the code is not very important to my question, it is a simple script that spawns 2 threads, on top of the main one. Two of them enqueue "tasks", and one dequeues them and "executes" them.
I am just starting to study threading in Python, and I have of course run into the subject of the GIL, so I expected to have one process. But the thing is, when I monitor this particular script with htop, I notice not 1, but 3 processes being spawned.
How is this possible?
The GIL means only one thread will "do work" at a time, but it doesn't mean that Python won't spawn the threads. In your case, you asked Python to spawn two threads, so it did (giving you a total of three threads). FYI, htop lists threads as well as processes, in case this was causing your confusion.
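If you want to confirm this from inside the program rather than from htop, threading.enumerate() lists the live threads (the output below is illustrative, matching the script above):

import threading
# e.g. printed from the main loop of the script above:
print threading.enumerate()
# [<_MainThread(MainThread, started)>,
#  <Handler(Thread-2, started daemon)>,
#  <Thread(Thread-1, started daemon)>]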
Python threads are useful for when you want concurrency but don't need parallelism. Concurrency is a tool for making programs simpler and more modular; it allows you to spawn a thread per task instead of having to write one big (often messy) while loop and/or use a bunch of callbacks (like JavaScript).
If you're interested in this subject, I recommend googling "concurrency versus parallelism". The concept is not language-specific.
Edit: Alternatively, you can just read this Stack Overflow thread.

How to terminate Producer-Consumer threads from main thread in Python?

I have a Producer and a Consumer thread (threading.Thread), which share a queue of type Queue.
Producer run:
while self.running:
    product = produced()  ### I/O operations
    queue.put(product)
Consumer run:
while self.running or not queue.empty():
    product = queue.get()
    time.sleep(several_seconds)  ###
    consume(product)
Now I need to terminate both threads from the main thread, with the requirement that the queue must be empty (everything consumed) before terminating.
Currently I'm using code like below to terminate these two threads:
main thread stop:
producer.running = False
producer.join()
consumer.running = False
consumer.join()
But I guess it's unsafe if there is more than one consumer.
In addition, I'm not sure whether the sleep yields the schedule to the producer so that it can produce more products. In fact, I find the producer keeps "starving", but I'm not sure whether this is the root cause.
Is there a decent way to deal with this case?
You can put a sentinel object in the queue to signal the end of tasks, causing all consumers to terminate:
_sentinel = object()

def producer(queue):
    while running:
        # produce some data
        queue.put(data)
    queue.put(_sentinel)

def consumer(queue):
    while True:
        data = queue.get()
        if data is _sentinel:
            # put it back so that other consumers see it
            queue.put(_sentinel)
            break
        # Process data
This snippet is shamelessly copied from Python Cookbook 12.3.
Use a _sentinel to mark the end of the queue. None also works if no task produced by the producer is None, but a dedicated _sentinel is safer for the general case.
You don't need to put one end marker into the queue per consumer -- you may not even be aware of how many threads are consuming. Just put the sentinel back into the queue when a consumer finds it, so the other consumers get the signal too.
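To make that concrete, here is a small self-contained version of the sentinel-recycling pattern (Python 3; the integer "tasks" are placeholders for real work):

import threading
import queue

_sentinel = object()

def producer(q, n_items):
    for i in range(n_items):
        q.put(i)          # stand-in for real produced data
    q.put(_sentinel)      # a single sentinel is enough

def consumer(q):
    while True:
        data = q.get()
        if data is _sentinel:
            q.put(_sentinel)  # put it back for the other consumers
            break
        # ... process data ...

q = queue.Queue()
threads = [threading.Thread(target=consumer, args=(q,)) for _ in range(4)]
threads.append(threading.Thread(target=producer, args=(q, 100)))
for t in threads:
    t.start()
for t in threads:
    t.join()  # everything exits once the sentinel circulates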
Edit 2:
a) The reason your consumers keep taking so much time is that your loop runs continuously even when you have no data.
b) I added code at the bottom that shows how to handle this.
If I understood you correctly, the producer/consumer is a continuous process, e.g. it is acceptable to delay the shutdown until you exit the current blocking I/O and process the data you received from that.
In that case, to shut down your producer and consumer in an orderly fashion, I would add communication from the main thread to the producer thread to invoke a shutdown. In the most general case, this could be a queue that the main thread can use to queue a "shutdown" code, but in the simple case of a single producer that is to be stopped and never restarted, it could simply be a global shutdown flag.
Your producer should check this shutdown condition (queue or flag) in its main loop right before it would start a blocking I/O operation (e.g. after you have finished sending other data to the consumer queue). If the flag is set, then it should put a special end-of-data code (that does not look like your normal data) on the queue to tell the consumer that a shut down is occurring, and then the producer should return (terminate itself).
The consumer should be modified to check for this end-of-data code whenever it pulls data out of the queue. If the end-of-data code is found, it should do an orderly shutdown and return (terminating itself).
If there are multiple consumers, then the producer could queue multiple end-of-data messages -- one for each consumer -- before it shuts down. Since the consumers stop consuming after they read the message, they will all eventually shut down.
Alternatively, if you do not know up-front how many consumers there are, then part of the orderly shut down of the consumer could be re-queueing the end-of-data code.
This will ensure that all consumers eventually see the end-of-data code and shut down; when all are done, there will be one remaining item in the queue -- the end-of-data code queued by the last consumer.
EDIT:
The correct way to represent your end-of-data code is highly application dependent, but in many cases a simple None works very well. Since None is a singleton, the consumer can use the very efficient if data is None construct to deal with the end case.
Another possibility that can be even more efficient in some cases is to set up a try/except outside your main consumer loop, arranged so that the except fires because you tried to unpack the data in a way that always works except when you are processing the end-of-data code.
EDIT 2:
Combining these concepts with your initial code, now the producer does this:
while self.running:
    product = produced()  ### I/O operations
    queue.put(product)

for x in range(number_of_consumers):
    queue.put(None)  # Termination code
Each consumer does this:
while 1:
    product = queue.get()
    if product is None:
        break
    consume(product)
The main program can then just do this:
producer.running = False
producer.join()
for consumer in consumers:
    consumer.join()
One observation about your code: the consumer will block forever waiting to get something from the queue. Ideally you should use a timeout and handle the Empty exception, as below; this lets the while self.running or not queue.empty() condition be re-checked on every timeout.
while self.running or not queue.empty():
    try:
        product = queue.get(timeout=1)
    except Empty:
        continue  # nothing arrived; re-check the loop condition
    time.sleep(several_seconds)  ###
    consume(product)
I simulated your situation with producer and consumer threads. Below is sample code running with 2 producers and 4 consumers; it works very well. Hope this helps!
import time
import threading
from Queue import Queue, Empty

"""A multi-producer, multi-consumer queue."""

# A thread that produces data
class Producer(threading.Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs=None, verbose=None):
        threading.Thread.__init__(self, group=group, target=target, name=name,
                                  verbose=verbose)
        self.running = True
        self.name = name
        self.args = args
        self.kwargs = kwargs

    def run(self):
        out_q = self.kwargs.get('queue')
        while self.running:
            # Adding some integer
            out_q.put(10)
            # Keeping this thread asleep so it doesn't do too many iterations
            time.sleep(0.1)
        print 'producer {name} terminated\n'.format(name=self.name)

# A thread that consumes data
class Consumer(threading.Thread):
    def __init__(self, group=None, target=None, name=None,
                 args=(), kwargs=None, verbose=None):
        threading.Thread.__init__(self, group=group, target=target, name=name,
                                  verbose=verbose)
        self.args = args
        self.kwargs = kwargs
        self.producer_alive = True
        self.name = name

    def run(self):
        in_q = self.kwargs.get('queue')
        # The consumer should die once the producers are dead and the queue is empty.
        while self.producer_alive or not in_q.empty():
            try:
                data = in_q.get(timeout=1)
            except Empty:
                continue  # nothing arrived; re-check the loop condition
            # This part can do anything that consumes time
            if isinstance(data, int):
                # just doing some work; in fact you could sleep here instead
                for i in xrange(data + 10**6):
                    pass
        print 'Consumer {name} terminated (Is producer alive={pstatus}, Is Queue empty={qstatus})!\n'.format(
            name=self.name, pstatus=self.producer_alive, qstatus=in_q.empty())

# Create the shared queue and launch both thread pools
q = Queue()
producer_pool, consumer_pool = [], []

for i in range(1, 3):
    producer_worker = Producer(kwargs={'queue': q}, name=str(i))
    producer_pool.append(producer_worker)
    producer_worker.start()

for i in xrange(1, 5):
    consumer_worker = Consumer(kwargs={'queue': q}, name=str(i))
    consumer_pool.append(consumer_worker)
    consumer_worker.start()

while 1:
    control_process = raw_input('> Y/N: ')
    if control_process == 'Y':
        for producer in producer_pool:
            producer.running = False
            # Joining this to make sure all the producers die
            producer.join()
        for consumer in consumer_pool:
            # Ideally the consumers stop once the producers die and the queue drains
            consumer.producer_alive = False
        break

Running two endless parallel loops

I want to run two endless parallel loops. One reads data from a server and updates an object with a number. The other does nothing but read the object and, in case of a change, process it. They don't have to be in sync or anything. So my questions are:
In case of a write from one side and a read from the other, does Python have issues with it?
In case I get a sync problem, do I need to lock the read/write operations? Is there any other way I should do it?
What is best to use, thread or threading?
As a next step, I will read from 100 sites, update 100 objects, and watch 100 loops for changes. Is it recommended to use multiprocessing from the beginning so I can scale without problems? Do I need to worry about the read and write issues there as well?
Any help is appreciated.
Short answer: whatever you think will be understandable to you. Meaning, your code should make sense to you for learning purposes.
Here's an example; it's light and easy to use.
Getting values to and from the thread is easy.
It's not actual multi-threading, though: the threads take turns on a single CPU core because of the GIL.
from threading import Thread

class worker(Thread):
    def __init__(self, input=0):
        self.input = input
        Thread.__init__(self)
        self.start()

    def run(self):
        while 1:
            self.input += 1

x = worker(-100)
y = worker(x.input)
print y.input
This is just an example to show that the y thread can read data from x. Note that y actually receives a copy of x.input at construction time (ints are immutable), so the two counters then advance independently; genuinely sharing mutable state between threads like this would be dangerous :)
- Will not span multiple CPUs
- Easy to use (accessing data across threads is easy)
- Logical code, if you're not familiar with queue systems or distributed systems
- Will raise an error if the OS can't create more threads (a "limitation")
from threading import Thread
from Queue import Queue

class producer(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue
        self.start()

    def run(self):
        while 1:
            self.queue.put(update_value())

class consumer(Thread):
    def __init__(self, queue):
        Thread.__init__(self)
        self.queue = queue
        self.start()

    def run(self):
        while True:
            value = self.queue.get()
            do_whatever_you_want(value)

queue = Queue()
producer(queue)
consumer(queue)
Notice that you can scale by using 100 producers and one consumer (and, of course, one queue). 100 threads should be OK, but things would be different if you wanted 10,000.
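A hedged sketch of that scaling, reusing the classes above (update_value is still the answer's placeholder for reading one site):

queue = Queue()
producers = [producer(queue) for _ in range(100)]  # one thread per site
consumer(queue)  # a single consumer drains them all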
