How to correctly use queues in python? - python

I am a beginner when it comes to python threading and multiprocessing so please bear with me.
I want to make a system that consists of three Python scripts. The first one creates some data and sends it to the second script continuously. The second script takes the data and saves it to a file until the file exceeds a defined size limit. When that happens, the third script sends the data to an external device and gets rid of this "cache". I need all of this to happen concurrently. The pseudocode below sums up what I am trying to do.
def main_1():
    data = [1,2,3]
    send_to_second_script(data)
def main_2():
    rec_data = receive_from_first_script()
    save_to_file(rec_data)
    if file>limit:
        signal_third_script()
def main_3():
    if signal is true:
        send_data_to_external_device()
        remove_data_from_disk()
I understand that I can use queues to make this happen but I am not sure how.
Also, so far I have tried a different approach: one Python script that uses threading to spawn a thread for each part of the process. Is this correct, or is using queues better?

First, for Python you need to be aware of what multithreading and multiprocessing each give you. IMO you should consider multiprocessing rather than multithreading. Because of the GIL, threads in CPython do not execute Python bytecode in parallel, although they still overlap well while waiting on I/O; there are many explanations of which one to use. The easiest way to choose is to check whether your program is I/O-bound or CPU-bound. Anyway, on to the Queue, which is a simple way to pass data between multiple processes in Python.
Using your pseudocode as an example, here is how you would use a Queue.
import multiprocessing
def main_1(output_queue):
    test = 0
    while test <=10: # simple limit to not run forever
        data = [1,2,3]
        print("Process 1: Sending data")
        output_queue.put(data) # Puts data in queue (FIFO)
        test+=1
    output_queue.put("EXIT") # triggers the exit clause
def main_2(input_queue,output_queue):
    file = 0 # Dummy pseudo variables
    limit = 1
    while True:
        rec_data = input_queue.get() # Get the next item from the queue. Blocks if empty
        if rec_data == "EXIT": # Exit clause is a way to cleanly shut down your processes
            output_queue.put("EXIT")
            print("Process 2: exiting")
            break
        print("Process 2: saving to file:", rec_data, "count = ", file)
        file += 1
        #save_to_file(rec_data)
        if file>limit:
            file = 0
            output_queue.put(True)
def main_3(input_queue):
    while(True):
        signal = input_queue.get()
        if signal is True:
            print("Process 3: Data sent and removed")
            #send_data_to_external_device()
            #remove_data_from_disk()
        elif signal == "EXIT":
            print("Process 3: Exiting")
            break
if __name__== '__main__':
    q1 = multiprocessing.Queue() # Initializing the queues and the processes
    q2 = multiprocessing.Queue()
    p1 = multiprocessing.Process(target = main_1,args = (q1,))
    p2 = multiprocessing.Process(target = main_2,args = (q1,q2,))
    p3 = multiprocessing.Process(target = main_3,args = (q2,))
    p = [p1,p2,p3]
    for i in p: # Start all processes
        i.start()
    for i in p: # Ensure all processes are finished
        i.join()
The prints may come out slightly out of order because I did not bother to lock stdout. But using a queue ensures that data moves from one process to the next.
EDIT: Do be aware that you should also look at multiprocessing locks to make sure the file is accessed safely across processes when performing the move/delete. The code above only demonstrates how to use a queue.
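Since the question also asks whether plain threads would do, here is a minimal sketch of the same three-stage pipeline using threading and queue.Queue, which is usually enough when the stages are I/O-bound. It is an illustration under assumptions, not the answer's method: the names producer, cache_writer, uploader and run_pipeline are made up, the "file" is replaced by an in-memory batch, and "sending" a batch just collects it in a list.

```python
import queue
import threading

SENTINEL = None  # hypothetical end-of-stream marker


def producer(out_q, n_items):
    """Stage 1: create data and hand it to stage 2."""
    for i in range(n_items):
        out_q.put([i, i + 1, i + 2])
    out_q.put(SENTINEL)


def cache_writer(in_q, out_q, limit):
    """Stage 2: buffer records and hand over a batch once `limit` is reached."""
    batch = []
    while True:
        record = in_q.get()  # blocks until data arrives
        if record is SENTINEL:
            if batch:  # flush whatever is left
                out_q.put(batch)
            out_q.put(SENTINEL)
            break
        batch.append(record)
        if len(batch) >= limit:
            out_q.put(batch)
            batch = []


def uploader(in_q, sent):
    """Stage 3: 'send' each batch (here it is only collected in a list)."""
    while True:
        batch = in_q.get()
        if batch is SENTINEL:
            break
        sent.append(batch)


def run_pipeline(n_items=5, limit=2):
    q1, q2, sent = queue.Queue(), queue.Queue(), []
    threads = [
        threading.Thread(target=producer, args=(q1, n_items)),
        threading.Thread(target=cache_writer, args=(q1, q2, limit)),
        threading.Thread(target=uploader, args=(q2, sent)),
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sent
```

Passing the full batch through the second queue, instead of a bare True signal, means stage 3 never has to read a file that stage 2 may still be writing.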

Related

Mixing multiprocessing Pool (producers) with Process (consumer) [duplicate]

I'm using the multiprocessing module to split up a very large task. It works for the most part, but I must be missing something obvious with my design, because this way it's very hard for me to effectively tell when all of the data has been processed.
I have two separate tasks that run; one that feeds the other. I guess this is a producer/consumer problem. I use a shared Queue between all processes, where the producers fill up the queue, and the consumers read from the queue and do the processing. The problem is that there is a finite amount of data, so at some point everyone needs to know that all of the data has been processed so the system can shut down gracefully.
It would seem to make sense to use the map_async() function, but since the producers are filling up the queue, I don't know all of the items up front, so I have to go into a while loop and use apply_async() and try to detect when everything is done with some sort of timeout...ugly.
I feel like I'm missing something obvious. How can this be better designed?
PRODUCER
class ProducerProcess(multiprocessing.Process):
    def __init__(self, item, consumer_queue):
        self.item = item
        self.consumer_queue = consumer_queue
        multiprocessing.Process.__init__(self)
    def run(self):
        for record in get_records_for_item(self.item): # this takes time
            self.consumer_queue.put(record)
def start_producer_processes(producer_queue, consumer_queue, max_running):
    running = []
    while not producer_queue.empty():
        running = [r for r in running if r.is_alive()]
        if len(running) < max_running:
            producer_item = producer_queue.get()
            p = ProducerProcess(producer_item, consumer_queue)
            p.start()
            running.append(p)
        time.sleep(1)
CONSUMER
def process_consumer_chunk(queue, chunksize=10000):
    for i in xrange(0, chunksize):
        try:
            # don't wait too long for an item
            # if new records don't arrive in 10 seconds, process what you have
            # and let the next process pick up more items.
            record = queue.get(True, 10)
        except Queue.Empty:
            break
        do_stuff_with_record(record)
MAIN
if __name__ == "__main__":
    manager = multiprocessing.Manager()
    consumer_queue = manager.Queue(1024*1024)
    producer_queue = manager.Queue()
    producer_items = xrange(0,10)
    for item in producer_items:
        producer_queue.put(item)
    p = multiprocessing.Process(target=start_producer_processes, args=(producer_queue, consumer_queue, 8))
    p.start()
    consumer_pool = multiprocessing.Pool(processes=16, maxtasksperchild=1)
Here is where it gets cheesy. I can't use map, because the list to consume is being filled up at the same time. So I have to go into a while loop and try to detect a timeout. The consumer_queue can become empty while the producers are still trying to fill it up, so I can't just detect an empty queue and quit on that.
    timed_out = False
    timeout= 1800
    while 1:
        try:
            result = consumer_pool.apply_async(process_consumer_chunk, (consumer_queue, ), dict(chunksize=chunksize,))
            if timed_out:
                timed_out = False
        except Queue.Empty:
            if timed_out:
                break
            timed_out = True
            time.sleep(timeout)
        time.sleep(1)
    consumer_queue.join()
    consumer_pool.close()
    consumer_pool.join()
I thought that maybe I could get() the records in the main thread and pass those into the consumer instead of passing the queue in, but I think I end up with the same problem that way. I still have to run a while loop and use apply_async(). Thank you in advance for any advice!
You could use a manager.Event to signal the end of the work. This event can be shared between all of your processes, and when you set it from your main process the workers can shut down gracefully.
while not event.is_set():
    ...rest of code...
So, your consumers keep working until the event is set and handle the cleanup once it is.
To determine when to set this flag, join the producer processes; once those have all completed, set the event and then join the consumer processes.
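The event-based shutdown described above can be sketched as follows. This is a minimal, thread-based illustration rather than the poster's code: threading.Event and queue.Queue stand in for manager.Event() and manager.Queue() (which proxy the same set/is_set and put/get methods), and the producer and consumer bodies are invented placeholders.

```python
import queue
import threading


def producer(item, work_q):
    # Stand-in for get_records_for_item(): three records per item.
    for n in range(3):
        work_q.put((item, n))


def consumer(work_q, done, results):
    # Keep going until the producers are finished AND the queue is drained.
    while not (done.is_set() and work_q.empty()):
        try:
            record = work_q.get(timeout=0.05)
        except queue.Empty:
            continue  # nothing yet; re-check the shutdown condition
        results.append(record)


def run(n_producers=4, n_consumers=3):
    work_q, done, results = queue.Queue(), threading.Event(), []
    producers = [threading.Thread(target=producer, args=(i, work_q))
                 for i in range(n_producers)]
    consumers = [threading.Thread(target=consumer, args=(work_q, done, results))
                 for _ in range(n_consumers)]
    for t in producers + consumers:
        t.start()
    for t in producers:
        t.join()   # all records are now on the queue
    done.set()     # tell consumers no more work is coming
    for t in consumers:
        t.join()
    return results
```

The short get() timeout replaces the 30-minute sleep in the question: a consumer only exits when the event is set and the queue is empty, so an early lull in the queue cannot end the run.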
I would like to strongly recommend SimPy instead of multiprocess/threading to do discrete event simulation.

executing two class methods at the same time in Python

I am sure many similar questions have been asked before, but after reading many of them I am still not sure what I should do. I have a Python script to control some external instruments (a camera and a power meter). I have written a class for each instrument that calls the C functions in the .dll files using ctypes. Right now it looks something like this:
for i in range(10):
    power_reading = newport.get_reading(N=100,interval=1) # take power meter reading
    img = camera.capture(N=10)
    value = image_processing(img) # analyze the img (ndarray) to get some values
    results.append([power_reading,value]) # add both results to a list
I want to start executing the first two lines at the same time. Both newport.get_reading and camera.capture take about 100 ms to 1 s to run (they will run for the same time if I choose the correct arguments). I don't need them to start at EXACTLY the same time, but ideally the delay should be smaller than about 10-20% of the total run time (so less than 0.2 s delay when each takes about 1 s to run). From what I have read, I can use the multiprocessing module. So I tried something like this, based on this post:
def p_get_reading(newport,N,interval,return_dict):
    reading = newport.get_reading(N,interval,return_dict)
    return_dict['power_meter'] = reading
def p_capture(camera,N,return_dict):
    img = camera.capture(N)
    return_dict['image'] = img
for i in range(10):
    manager = multiprocessing.Manager()
    return_dict = manager.dict()
    p = multiprocessing.Process(target=p_capture, args=(camera,10))
    p.start()
    p2 = multiprocessing.Process(target=p_get_reading, args=(newport,100,1))
    p2.start()
    p.join()
    p2.join()
    print(return_dict)
I have a few problems/questions:
1. I need to get the return values from both function calls. Using my current method, return_dict only shows the entry for capture_img but not the power meter reading. Why is that? I also read that I can use Queue; what is the best method for my current purpose?
2. How can I know whether both functions indeed start running at the same time? I am thinking of using the time module to record the start and end time of both functions, maybe using some wrapper function to do the time logging, but will the use of multiprocessing pose any restrictions?
3. I usually run my code in an IDE (Spyder), and from what I have read, I need to run in a command prompt to see the output (I have some print statements inside the functions for debugging purposes). Can I still run in the IDE and have both functions run at the same time?
Using a Lock may help with synchronisation:
import multiprocessing
def p_get_reading(newport, N, interval, lock, return_dict):
    lock.acquire()
    lock.release()
    reading = newport.get_reading(N, interval)
    return_dict['power_meter'] = reading
def p_capture(camera, N, lock, return_dict):
    lock.acquire()
    lock.release()
    img = camera.capture(N)
    return_dict['image'] = img
if __name__ == "__main__":
    for i in range(10):
        manager = multiprocessing.Manager()
        return_dict = manager.dict()
        lock = multiprocessing.Lock()
        lock.acquire()
        p = multiprocessing.Process(target=p_capture, args=(camera,10,lock,return_dict))
        p.start()
        p2 = multiprocessing.Process(target=p_get_reading, args=(newport,100,1,lock,return_dict))
        p2.start()
        lock.release()
        p.join()
        p2.join()
        print(return_dict)
The two Process objects can now be created and start()ed in any order as the main routine has already acquired the lock. Once released, the two processes will fight between themselves to acquire and release the lock, and be ready almost at the same time.
Also, note the use of if __name__ == "__main__" as this helps when multiprocessing makes new processes.
I must say this seems like an abuse of a Lock
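A primitive designed for this "everyone wait, then go together" pattern is a barrier. The sketch below uses threading.Barrier (multiprocessing.Barrier offers the same wait() interface for processes) and records each start time so the skew between the two calls can be checked. The functions fake_reading and fake_capture are hypothetical stand-ins for newport.get_reading and camera.capture.

```python
import threading
import time


def fake_reading():
    # Hypothetical stand-in for newport.get_reading(...)
    time.sleep(0.1)
    return 1.23


def fake_capture():
    # Hypothetical stand-in for camera.capture(...)
    time.sleep(0.1)
    return "image"


def timed_call(name, func, barrier, out):
    barrier.wait(timeout=5)          # released only once both workers are ready
    started = time.perf_counter()    # timestamp taken right after release
    out[name] = (started, func())


def run_together():
    barrier = threading.Barrier(2)   # two parties: the two measurements
    out = {}
    workers = [
        threading.Thread(target=timed_call,
                         args=("power_meter", fake_reading, barrier, out)),
        threading.Thread(target=timed_call,
                         args=("image", fake_capture, barrier, out)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    skew = abs(out["power_meter"][0] - out["image"][0])
    return out["power_meter"][1], out["image"][1], skew
```

The returned skew is the difference between the two recorded start times in seconds, which is one way to check the "less than 10-20% delay" requirement from the question.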
The answer to your first question is no if you do it the usual way, but yes if you arrange for it. No, because the target function cannot hand a value back to the spawning thread with return. One way around this is to use a queue and wrapper functions, as follows:
from threading import Thread
from queue import Queue  # on Python 2: from Queue import Queue
def p_get_reading(newport, N, interval):
    reading = newport.get_reading(N, interval)
    return {'power_meter': reading}
def p_capture(camera, N):
    img = camera.capture(N)
    return {'image': img}
def wrapper1(func, arg1, arg2, queue):
    queue.put(func(arg1, arg2))  # runs in the worker thread; the result goes onto the queue
def wrapper2(func, arg1, arg2, arg3, queue):
    queue.put(func(arg1, arg2, arg3))
q = Queue()
t1 = Thread(target=wrapper1, args=(p_capture, camera, 10, q))
t2 = Thread(target=wrapper2, args=(p_get_reading, newport, 100, 1, q))
t1.start()
t2.start()
t1.join()  # wait for both calls to finish before reading the queue
t2.join()
Once both threads have finished, q holds the two dicts returned by p_capture() and p_get_reading(); retrieve each one with q.get().
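The standard library's concurrent.futures wraps this queue-and-wrapper pattern: submit() returns a future whose result() is the function's return value. The sketch below is a hedged illustration only; get_reading and capture are hypothetical stand-ins for the instrument calls, and it assumes the real ctypes calls release the GIL while they run, as ctypes normally does for foreign functions.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def get_reading(n, interval):
    # Hypothetical stand-in for newport.get_reading(N, interval)
    time.sleep(0.05)
    return [0.5] * n


def capture(n):
    # Hypothetical stand-in for camera.capture(N)
    time.sleep(0.05)
    return ["frame"] * n


def measure_once(pool):
    # Both calls are submitted before either result is awaited,
    # so they run side by side in the pool's two threads.
    f_power = pool.submit(get_reading, 3, 1)
    f_image = pool.submit(capture, 2)
    return {"power_meter": f_power.result(), "image": f_image.result()}


with ThreadPoolExecutor(max_workers=2) as pool:
    result = measure_once(pool)
```

Creating the pool once and calling measure_once(pool) inside the measurement loop avoids starting new threads on every iteration.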

Python: How can I stop Threading/Multiprocessing from using 100% of my CPU?

I have code that reads data from 7 devices every second for an infinite amount of time. Each loop, a thread is created which starts 7 processes. After each process is done the program waits 1 second and starts again. Here is a snippet of the code:
def all_thread(): #function that handles the threading
    thread = threading.Thread(target=all_process) #prepares a thread for the devices
    thread.start() #starts a thread for the devices
def all_process(): #function that prepares and runs processes
    processes = [] #empty list for the processes to be stored
    while len(gas_list) > 0: #this gaslist holds the connection information for my devices
        for sen in gas_list: #for each sen(sensor) in the gas list
            proc = multiprocessing.Process(target=main_reader, args=(sen, q)) #declaring a process variable that sends the gas object, value and queue information to reading function
            processes.append(proc) #adding the process to the processes list
            proc.start() #start the process
        for sen in processes: #for each sensor in the processes list
            sen.join() #wait for all the processes to complete before starting again
        time.sleep(1) #wait one second
However, this uses 100% of my CPU. Is this by design of threading and multiprocessing or just bad coding? Is there a way I can limit the CPU usage? Thanks!
Update:
The comments were mentioning the main_reader() function so I will put it into the question. All it does is read each device, takes all the data and appends it to a list. Then the list is put into a queue to be displayed in the tkinter GUI.
def main_reader(data, q): #this function reads the device which takes less than a second
output_list = get_registry(data) #this function takes the device information, reads the registry and returns a list of data
q.put(output_list) #put the output list into the queue
As you state in the comments, your main_reader takes only a fraction of a second to run, which means process creation overhead might cause your problem.
Here is an example with multiprocessing.Pool. This creates a pool of workers and submits your tasks to them. Processes are started only once and never shut down or joined if this is meant to be an infinite loop. If you want to shut your pool down, you can do so by closing it and then joining it (see the documentation for that).
from multiprocessing import Pool, Manager
from time import sleep
import threading
from random import random
gas_list = [1,2,3,4,5,6,7,8,9,10]
def main_reader(sen, rqu):
    output = "%d/%f" % (sen, random())
    rqu.put(output)
def all_processes(rq):
    p = Pool(len(gas_list) + 1)
    while True:
        for sen in gas_list:
            p.apply_async(main_reader, args=(sen, rq))
        sleep(1)
m = Manager()
q = m.Queue()
t = threading.Thread(target=all_processes, args=(q,))
t.daemon = True
t.start()
while True:
    r = q.get()
    print r
If this does not help, you need to start digging deeper. I would first increase the sleep in your infinite loop to 10 seconds or even longer. This would allow you to monitor the behaviour of your program. If CPU peaks for a moment and then settles down for 10 seconds or so, you know the problem is in your main_reader. If it is still 100%, your problem must be elsewhere.
Is it possible your problem is not in this part of your program at all? You seem to launch this all in a thread, which indicates your main program is doing something else. Can it be this something else that peaks the CPU?
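If reading a device is mostly waiting on I/O, a pool of threads created once may be a lighter alternative to a process pool. The sketch below swaps in concurrent.futures.ThreadPoolExecutor for that purpose; it is not the answer's Pool-based code, and DEVICES, read_device and poll are hypothetical names, with read_device standing in for get_registry().

```python
from concurrent.futures import ThreadPoolExecutor
import time

DEVICES = [1, 2, 3, 4, 5, 6, 7]  # hypothetical device identifiers


def read_device(dev):
    # Hypothetical stand-in for get_registry(dev)
    return (dev, dev * 10)


def poll(devices, cycles, interval, on_result):
    # The worker threads are created once and reused on every cycle,
    # so no process or thread start-up cost is paid per reading.
    with ThreadPoolExecutor(max_workers=len(devices)) as pool:
        for _ in range(cycles):
            for result in pool.map(read_device, devices):
                on_result(result)
            time.sleep(interval)
```

A call such as poll(DEVICES, cycles=3, interval=1, on_result=q.put) would push each reading onto an existing queue q for the GUI to consume; an endless loop would replace the cycles counter.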

Python multiprocessing with an updating queue and an output queue

How can I write a Python multiprocessing script that uses two Queues like these:
one as a working queue that starts with some data and that, depending on conditions of the functions to be parallelized, receives further tasks on the fly,
another that gathers results and is used to write down the result after processing finishes.
I basically need to put some more tasks in the working queue depending on what I found in its initial items. The example I post below is silly (I could transform the item as I like and put it directly in the output Queue), but its mechanics are clear and reflect part of the concept I need to develop.
Here is my attempt:
import multiprocessing as mp
def worker(working_queue, output_queue):
    item = working_queue.get() #I take an item from the working queue
    if item % 2 == 0:
        output_queue.put(item**2) # If I like it, I do something with it and conserve the result.
    else:
        working_queue.put(item+1) # If there is something missing, I do something with it and leave the result in the working queue
if __name__ == '__main__':
    static_input = range(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker,args=(working_q, output_q)) for i in range(mp.cpu_count())] #I am running as many processes as CPU my machine has (is this wise?).
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    for result in iter(output_q.get, None):
        print result #alternatively, I would like to (c)pickle.dump this, but I am not sure if it is possible.
This does not end nor print any result.
At the end of the whole process I would like to ensure that the working queue is empty, and that all the parallel functions have finished writing to the output queue before the latter is iterated to take out the results. Do you have suggestions on how to make it work?
The following code achieves the expected results. It follows the suggestions made by @tawmas.
This code makes it possible to use multiple cores when the queue that feeds data to the workers must also be updated by them during processing:
import multiprocessing as mp
def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() == True:
            break # exit once the queue looks empty; a simple stand-in for a 'poison pill' (note that empty() is not fully reliable across processes)
        else:
            picked = working_queue.get()
            if picked % 2 == 0:
                output_queue.put(picked)
            else:
                working_queue.put(picked+1)
    return
if __name__ == '__main__':
    static_input = xrange(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    results_bank = []
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker,args=(working_q, output_q)) for i in range(mp.cpu_count())]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    while True:
        if output_q.empty() == True:
            break
        results_bank.append(output_q.get_nowait())
    print len(results_bank) # length of this list should be equal to static_input, which is the range used to populate the input queue. In other words, this tells whether all the items placed for processing were actually processed.
    results_bank.sort()
    print results_bank
You have a typo in the line that creates the processes. It should be mp.Process, not mp.process. This is what is causing the exception you get.
Also, you are not looping in your workers, so they actually only consume a single item each from the queue and then exit. Without knowing more about the required logic, it's not easy to give specific advice, but you will probably want to enclose the body of your worker function inside a while True loop and add a condition in the body to exit when the work is done.
Please note that, if you do not add a condition to explicitly exit from the loop, your workers will simply stall forever when the queue is empty. You might consider using the so-called poison pill technique to signal the workers they may exit. You will find an example and some useful discussion in the PyMOTW article on Communication Between processes.
As for the number of processes to use, you will need to benchmark a bit to find what works for you, but, in general, one process per core is a good starting point when your workload is CPU bound. If your workload is IO bound, you might have better results with a higher number of workers.
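For a work queue that the workers themselves keep feeding, one way to know when everything is done is to count unfinished tasks with task_done() and join(), and only then send the poison pills. The sketch below shows this with threads and queue.Queue; multiprocessing.JoinableQueue provides the same task_done()/join() methods for processes. The function names and the use of None as the pill are assumptions made for illustration.

```python
import queue
import threading


def worker(work_q, out_q):
    while True:
        item = work_q.get()
        if item is None:             # poison pill: stop this worker
            work_q.task_done()
            break
        if item % 2 == 0:
            out_q.put(item)
        else:
            work_q.put(item + 1)     # enqueue the follow-up BEFORE task_done()
        work_q.task_done()


def process_all(items, n_workers=4):
    work_q, out_q = queue.Queue(), queue.Queue()
    for item in items:
        work_q.put(item)
    workers = [threading.Thread(target=worker, args=(work_q, out_q))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    work_q.join()                    # returns once every item, follow-ups included, is done
    for _ in workers:
        work_q.put(None)             # one pill per worker
    for w in workers:
        w.join()
    results = []
    while not out_q.empty():         # safe here: all workers have exited
        results.append(out_q.get())
    return sorted(results)
```

Because a follow-up item is put on the queue before the current one is marked done, the unfinished-task count cannot drop to zero while work is still pending, so join() does not return early.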

Filling a queue and managing multiprocessing in python

I'm having this problem in python:
I have a queue of URLs that I need to check from time to time
if the queue is filled up, I need to process each item in the queue
Each item in the queue must be processed by a single process (multiprocessing)
So far I managed to achieve this "manually" like this:
while 1:
    self.updateQueue()
    while not self.mainUrlQueue.empty():
        domain = self.mainUrlQueue.get()
        # if we didn't launched any process yet, we need to do so
        if len(self.jobs) < maxprocess:
            self.startJob(domain)
            #time.sleep(1)
        else:
            # If we already have process started we need to clear the old process in our pool and start new ones
            jobdone = 0
            # We circle through each of the process, until we find one free ; only then leave the loop
            while jobdone == 0:
                for p in self.jobs :
                    #print "entering loop"
                    # if the process finished
                    if not p.is_alive() and jobdone == 0:
                        #print str(p.pid) + " job dead, starting new one"
                        self.jobs.remove(p)
                        self.startJob(domain)
                        jobdone = 1
However, that leads to tons of problems and errors. I wondered whether I would be better off using a Pool of processes. What would be the right way to do this?
Note that my queue is empty much of the time, and it can then be filled with 300 items in a second, so I'm not too sure how to do things here.
You could use the blocking capabilities of the queue: spawn multiple processes at startup (using multiprocessing.Pool) and let them sleep until some data is available on the queue to process. If you're not familiar with that, you could try to "play" with this simple program:
import multiprocessing
import os
import time
the_queue = multiprocessing.Queue()
def worker_main(queue):
    print os.getpid(),"working"
    while True:
        item = queue.get(True)
        print os.getpid(), "got", item
        time.sleep(1) # simulate a "long" operation
the_pool = multiprocessing.Pool(3, worker_main,(the_queue,))
#                            don't forget the comma here ^
for i in range(5):
    the_queue.put("hello")
    the_queue.put("world")
time.sleep(10)
Tested with Python 2.7.3 on Linux
This will spawn 3 processes (in addition to the parent process). Each child executes the worker_main function. It is a simple loop that gets a new item from the queue on each iteration. Workers will block if nothing is ready to process.
At startup all 3 processes will sleep until the queue is fed with some data. When an item becomes available, one of the waiting workers gets it and starts to process it. After that, it tries to get another item from the queue, waiting again if nothing is available...
Added some code (submitting None to the queue) to nicely shut down the worker processes, and added code to close and join the_queue and the_pool:
import multiprocessing
import os
import time
NUM_PROCESSES = 20
NUM_QUEUE_ITEMS = 20 # so really 40, because hello and world are processed separately
def worker_main(queue):
    print(os.getpid(),"working")
    while True:
        item = queue.get(block=True) #block=True means make a blocking call to wait for items in queue
        if item is None:
            break
        print(os.getpid(), "got", item)
        time.sleep(1) # simulate a "long" operation
def main():
    the_queue = multiprocessing.Queue()
    the_pool = multiprocessing.Pool(NUM_PROCESSES, worker_main,(the_queue,))
    for i in range(NUM_QUEUE_ITEMS):
        the_queue.put("hello")
        the_queue.put("world")
    for i in range(NUM_PROCESSES):
        the_queue.put(None)
    # prevent adding anything more to the queue and wait for queue to empty
    the_queue.close()
    the_queue.join_thread()
    # prevent adding anything more to the process pool and wait for all processes to finish
    the_pool.close()
    the_pool.join()
if __name__ == '__main__':
    main()
