I'm using multiprocessing Pool + Queue to share processing work between a parent process (processing with GPUs) and child processes (processing on the CPU). My program looks like this:
from multiprocessing import Process, JoinableQueue

def reader_proc(queue):
    ## Read from the queue; this will be spawned as a separate Process
    while True:
        msg = queue.get()       # Read from the queue
        if msg == 'DONE':
            break
        do_cpu_work(msg)

if __name__ == '__main__':
    queue = JoinableQueue()
    reader_p = Process(target=reader_proc, args=(queue,))
    reader_p.start()
    for task in GPUWork:
        results = do_task(task)
        for result in results:
            queue.put(result)
    # put 'DONE' on and join and close
I'm having a severe memory leak right now, even after explicitly deleting every variable in reader_proc and calling gc.collect(). I'm calling into various C++ libraries from reader_proc and I suspect one of them could be leaking memory. While I try to debug that, I need to get some processing done on this data.
Is there any way to refresh these reader processes, e.g. periodically terminate them and restart them? Pool offers this via maxtasksperchild when it operates on an iterable, but that doesn't seem to apply to this Queue / Process based scheme.
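One possible workaround (a sketch of a general pattern, not something from this thread): restructure the Queue / Process scheme into a Pool fed by a generator, so that maxtasksperchild applies. NUM_READERS is a hypothetical worker count; do_cpu_work, GPUWork and do_task are the names from the question.

from multiprocessing import Pool

NUM_READERS = 4   # hypothetical number of CPU-side worker processes

def cpu_worker(msg):
    # Runs in a child process; because of maxtasksperchild=1 below,
    # the child exits after this one task and a fresh process replaces it.
    do_cpu_work(msg)

def gpu_results():
    # Generator yielding CPU work items as the GPU produces them.
    for task in GPUWork:
        for result in do_task(task):
            yield result

if __name__ == '__main__':
    with Pool(processes=NUM_READERS, maxtasksperchild=1) as pool:
        for _ in pool.imap_unordered(cpu_worker, gpu_results()):
            pass   # iterate only to wait for all tasks to finish

Note that the generator is consumed by the pool's internal task-feeder thread, so the GPU-side do_task calls run in that thread rather than in the main thread; raise maxtasksperchild if recycling a worker after every single task is too slow.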
Related
I am trying to run inference with TensorFlow using multiprocessing. Each process uses 1 GPU. I have a list of files, input_files[]. Every process gets one file, runs model.predict on it and writes the results to a file. To move on to the next file, I need to close the process and restart it, because TensorFlow doesn't release memory; if I reuse the same process, I get a memory leak.
I have written the code below, which works. I start 5 processes, close them and start another 5. The issue is that all processes need to wait for the slowest one before they can move on. How can I start and close each process independently of the others?
Note that Pool.map is over input_files_small, not input_files.
file1 --> start new process --> run prediction --> close process --> file2 --> start new process --> etc.
for i in range(0, len(input_files), num_process):
    input_files_small = input_files[i:i+num_process]
    try:
        process_pool = multiprocessing.Pool(processes=num_process, initializer=init_worker, initargs=(gpu_ids))
        pool_output = process_pool.map(worker_fn, input_files_small)
    finally:
        process_pool.close()
        process_pool.join()
There is no need to re-create the processing pool over and over. First, specify maxtasksperchild=1 when creating the pool; this causes a new process to be created for each task submitted. Second, instead of method map use method map_async, which does not block. If your worker function does not return results you need, you can wait for all submissions to complete implicitly with pool.close() followed by pool.join(), as in the first variation below; otherwise use the second variation:
process_pool = multiprocessing.Pool(processes=num_process, initializer=init_worker, initargs=(gpu_ids), maxtasksperchild=1)

for i in range(0, len(input_files), num_process):
    input_files_small = input_files[i:i+num_process]
    process_pool.map_async(worker_fn, input_files_small)

# wait for all outstanding tasks to complete
process_pool.close()
process_pool.join()
If you need return values from worker_fn:
process_pool = multiprocessing.Pool(processes=num_process, initializer=init_worker, initargs=(gpu_ids), maxtasksperchild=1)

results = []
for i in range(0, len(input_files), num_process):
    input_files_small = input_files[i:i+num_process]
    results.append(process_pool.map_async(worker_fn, input_files_small))

# get return values from map_async
pool_outputs = [result.get() for result in results]
# you do not need process_pool.close() and process_pool.join()
Since some "slow" tasks from an earlier invocation of map_async may still be running when tasks from a later invocation are submitted, some of the later tasks may still have to wait before they can run. But at least all of the processes in the pool should stay fairly busy.
If you are expecting exceptions from your worker function and need to handle them in your main process, it gets more complicated.
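As a rough illustration of that (my sketch, not part of the answer): exceptions raised in worker_fn can be surfaced in the main process either through an error_callback passed to map_async or by wrapping each AsyncResult.get() in try/except. worker_fn, input_files, num_process and process_pool are the names from the question; the rest is illustrative.

results = []
for i in range(0, len(input_files), num_process):
    batch = input_files[i:i + num_process]
    # error_callback runs in the main process if any task in the batch raises
    results.append(process_pool.map_async(worker_fn, batch,
                                          error_callback=lambda exc: print('worker failed:', exc)))

pool_outputs = []
for async_result in results:
    try:
        # get() re-raises the first worker exception in the main process
        pool_outputs.extend(async_result.get())
    except Exception as exc:
        print('a batch failed:', exc)

Keep in mind that if one task in a map_async batch raises, get() raises for the whole batch and the other return values from that batch are lost; submitting tasks individually with apply_async avoids that.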
Is it possible to have a few child processes running some calculations and sending results back to the main process (e.g. to update a PyQt UI) while the processes keep running, so that after a while they send back more data and the UI updates again?
With multiprocessing.Queue, it seems like the data can only be sent back after the process has terminated.
So I wonder whether this is possible or not.
I don't know what you mean by "the data can only be sent back after the process has terminated". This is exactly the use case that multiprocessing.Queue was designed for.
PyMOTW is a great resource for a whole load of Python modules, including multiprocessing. Check it out here: https://pymotw.com/2/multiprocessing/communication.html
A simple example of how to send ongoing messages from a child to the parent using multiprocessing and loops:
import multiprocessing

def child_process(q):
    for i in range(10):
        q.put(i)
    q.put("done")   # tell the parent process we've finished

def parent_process():
    q = multiprocessing.Queue()
    child = multiprocessing.Process(target=child_process, args=(q,))
    child.start()
    while True:
        value = q.get()
        if value == "done":   # no more values from child process
            break
        print(value)
    # do other stuff, child will continue to run in separate process
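For a GUI you usually don't want the blocking q.get() loop above on the main thread. Here is a small sketch (my addition, not part of the original answer) that polls the queue without blocking, e.g. from a periodic timer, so the UI stays responsive; queue.Empty is the exception multiprocessing.Queue raises when it is empty.

import queue   # only for the Empty exception raised by multiprocessing.Queue

def poll_child(q):
    # Call this periodically (e.g. from a QTimer); it drains whatever the
    # child has produced so far and returns without blocking the UI.
    while True:
        try:
            value = q.get_nowait()
        except queue.Empty:
            return                 # nothing new yet; try again on the next tick
        if value == "done":
            print("child finished")
            return
        print("got partial result:", value)   # e.g. update a widget here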
I'm building a web application that processes ~60,000 (and growing) large files, performs some analysis and returns a "best guess" that needs to be verified by a user. The files are filtered by category to avoid loading every file, but I'm still left with a scenario where I might have to process 1000+ files at a time.
These are large files that can take up to 8-9 seconds each to process, and in a 1000+ file situation it is impractical to have a user wait 8 seconds between reviews or 2+ hours while the files are processed beforehand.
To overcome this, I've decided to use multiprocessing to spawn several workers, each of which will pick from a queue of files, process them and insert into an output queue. I have another method that basically polls the output queue for items and then streams them to the client when one becomes available.
We're using gevent with uWSGI and Django in our environment, and I'm aware that child process creation via multiprocessing in the context of gevent yields an undesired event loop state in the child: greenlets spawned before forking are duplicated in the child. Therefore, I've decided to use the lets library to assist in handling the child processes.
This all works beautifully while uninterrupted. However, if a user switches categories while files are still being processed, I close the SSE connection in the browser and open another, which spawns a new set of child processes and kills the existing ones (or attempts to). This causes one of two things to happen.
When I yield out the results to the client, I get an IOError from uWSGI because the connection has closed. I wrapped the entire function in a try...finally and kill all the workers before the function exits.
I can either block while the processes are killed or do it in the background. Each method has different consequences. When trying to kill without blocking, the original processes are never killed, the new processes stop yielding, and any request (from any page) to the server hangs until I manually kill all uWSGI processes.
When blocking, uWSGI reports a segmentation fault, the main worker is killed and restarted, killing all child processes - new and old.
An example of the JavaScript used to open/close the connection:
var state = {};

function analyze(){
    // If a connection exists, close it.
    if (state.evtSource) {
        state.evtSource.close();
    }

    // Create a new connection to the server.
    evtSource = state.evtSource = new EventSource('?myarg=myval');

    evtSource.onmessage = function(message){
        // do stuff
    };
}
Server-side example code:
from item import Item
import lets
import multiprocessing
import time

MAX_WORKERS = 10

# Worker is outside of ProcessFiles because ``lets``
# pickles the target.
def worker(item):
    return item.process()

class ProcessFiles(object):

    def __init__(self):
        self.input_queue = multiprocessing.Queue()
        self.output_queue = multiprocessing.Queue()
        self.file_count = 0
        self.pool = lets.ProcessPool(MAX_WORKERS)

    def query_for_results(self):
        # Query db for records of files to process.
        # Return results and set self.file_count equal to
        # the number of records returned.
        pass

    def start(self):
        # Queue up files to process.
        for result in self.query_for_results():
            item = Item(result)
            self.input_queue.put(item)

        # Spawn up to MAX_WORKERS child processes to analyze
        # all of the items in the input queue. Put each processed
        # file on the output queue.
        while not self.input_queue.empty():
            item = self.input_queue.get()
            self.pool.apply_async(worker, args=(item,), callback=self.callback)

        # Poll for items to send to client.
        return self.get_processed_items()

    def callback(self, processed):
        self.output_queue.put(processed)

    def get_processed_items(self):
        # Wait for the output queue to hold at least 1 item.
        # When an item becomes available, yield it to the client.
        try:
            count = 0
            while count != self.file_count:
                try:
                    item = self.output_queue.get(timeout=1)
                except:
                    # Queue is empty. Wait and retry.
                    time.sleep(1)
                    continue
                count += 1
                yield item
            yield 'end'
        finally:
            # Kill all child processes.
            self.pool.kill(block=True)    # <- Causes segfault.
            #self.pool.kill(block=False)  # <- Silently fails.
This only happens when a user makes a selection and then, while those files are still being processed, makes another selection, effectively closing the current connection and opening a new one, which results in two different sets of child processes.
Why does blocking cause a segmentation fault? Why is the behavior different when blocking vs not blocking? What can I do to kill all the original processes?
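The thread doesn't contain an accepted fix for the segfault, but as a general fallback that avoids the pool's own teardown entirely (my sketch using plain multiprocessing, not lets' API): keep explicit references to the worker Process objects and terminate and join them yourself in the finally block. terminate_workers and the dummy target are illustrative names.

import multiprocessing

def terminate_workers(processes, timeout=5):
    # Ask every worker to stop, then reap it; joining after terminate()
    # avoids leaving zombie processes behind when the SSE connection closes.
    for p in processes:
        if p.is_alive():
            p.terminate()
    for p in processes:
        p.join(timeout)

if __name__ == '__main__':
    workers = [multiprocessing.Process(target=print, args=('working',))
               for _ in range(4)]
    for p in workers:
        p.start()
    try:
        pass   # stream results to the client here
    finally:
        terminate_workers(workers)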
I'm building a Python script/application which launches multiple so-called Fetchers.
They in turn do something and return data into a queue.
I want to make sure the Fetchers don't run for more than 60 seconds (because the entire application runs multiple times in one hour).
Reading the Python docs, I noticed they say to be careful when using Process.terminate() because it can break the Queue.
My current code:
# Result Queue
resultQueue = Queue()

# Create Fetcher Instance
fetcher = fetcherClass()

# Create Fetcher Process List
fetcherProcesses = []

# Run Fetchers
for config in configList:
    # Create Process to encapsulate Fetcher
    log.debug("Creating Fetcher for Target: %s" % config['object_name'])
    fetcherProcess = Process(target=fetcher.Run, args=(config, resultQueue))
    log.debug("Starting Fetcher for Target: %s" % config['object_name'])
    fetcherProcess.start()
    fetcherProcesses.append((config, fetcherProcess))

# Wait for all Workers to complete
for config, fetcherProcess in fetcherProcesses:
    log.debug("Waiting for Thread to complete (%s)." % str(config['object_name']))
    fetcherProcess.join(DEFAULT_FETCHER_TIMEOUT)
    if fetcherProcess.is_alive():
        log.critical("Fetcher thread for object %s Timed Out! Terminating..." % config['object_name'])
        fetcherProcess.terminate()

# Loop thru results, and save them in RRD
while not resultQueue.empty():
    config, fetcherResult = resultQueue.get()
    result = storage.Save(config, fetcherResult)
I want to make sure my Queue doesn't get corrupted when one of my Fetchers times out.
What is the best way to do this?
Edit: In response to a chat with sebdelsol, a few clarifications:
1) I want to start processing data as soon as possible, because otherwise I have to perform a lot of disk-intensive operations all at once. So sleeping the main thread for X_Timeout is not an option.
2) I need to wait for the timeout only once, but per process: if the main thread launches 50 fetchers, and this takes a few seconds to half a minute, I need to compensate.
3) I want to make sure the data that comes from Queue.get() was put there by a fetcher that didn't time out (since it is theoretically possible that a fetcher was putting data into the Queue when the timeout occurred and it was killed). That data should be dumped.
A timeout is not a very bad thing; it's not a desirable situation, but corrupt data is worse.
You could pass a new multiprocessing.Lock() to every fetcher you start.
In the fetcher's process, be sure to wrap the Queue.put() with this lock:
with self.lock:
    self.queue.put(result)
When you need to terminate a fetcher's process, use its lock:
with fetcherLock:
    fetcherProcess.terminate()
This way, your queue won't get corrupted by killing a fetcher during the queue access.
Some fetchers' locks could get corrupted, but that's not an issue since every new fetcher you launch gets a brand-new lock.
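A minimal end-to-end sketch of this lock-per-fetcher idea (my illustration; the original answer only shows the two fragments above). The lock must be created in the parent and passed to the child so both refer to the same underlying lock; the fetch work here is a toy stand-in.

from multiprocessing import Process, Queue, Lock
import time

def fetcher(config, result_queue, lock):
    result = ('config-%s' % config, time.time())   # stand-in for real fetch work
    with lock:                      # the fetcher is never terminated while holding
        result_queue.put(result)    # the lock, so the queue cannot be corrupted mid-put

if __name__ == '__main__':
    result_queue = Queue()
    fetchers = []
    for config in range(3):
        lock = Lock()               # one fresh lock per fetcher
        p = Process(target=fetcher, args=(config, result_queue, lock))
        p.start()
        fetchers.append((p, lock))

    time.sleep(1)                   # stand-in for the per-fetcher timeout
    for p, lock in fetchers:
        if p.is_alive():
            with lock:              # wait until the fetcher is not mid-put
                p.terminate()
        p.join()

    while not result_queue.empty():
        print(result_queue.get())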
Why not:
1) create a new queue and start all the fetchers that will use this queue,
2) have your script sleep for the amount of time you want the fetchers' processes to have for getting a result,
3) get everything from the resultQueue (it won't be corrupted, since you didn't have to kill any process),
4) finally, terminate all fetcher processes that are still alive,
5) and loop! A sketch of this loop follows.
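A rough sketch of that loop (my illustration, with a toy fetch function; DEFAULT_FETCHER_TIMEOUT and configList mirror the names from the question, and the timeout is shortened here for the demo).

from multiprocessing import Process, Queue
import time

DEFAULT_FETCHER_TIMEOUT = 5   # 60 in the question; shortened for the sketch

def fetch(config, result_queue):
    result_queue.put((config, 'some data'))   # toy stand-in for the real fetcher

def run_once(configList):
    result_queue = Queue()                    # brand-new queue every iteration
    fetchers = [Process(target=fetch, args=(config, result_queue))
                for config in configList]
    for p in fetchers:
        p.start()

    time.sleep(DEFAULT_FETCHER_TIMEOUT)       # give every fetcher the same deadline

    results = []
    while not result_queue.empty():           # safe: no process has been killed yet
        results.append(result_queue.get())

    for p in fetchers:                        # only now terminate the stragglers
        if p.is_alive():
            p.terminate()
        p.join()
    return results

if __name__ == '__main__':
    while True:                               # "loop!"
        print(run_once(['a', 'b', 'c']))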
While attempting to store a multiprocessing Process instance in the multiprocessing list variable poolList, I am getting the following exception:
SimpleQueue objects should only be shared between processes through inheritance
The reason why I would like to store the Process instances in a variable is to be able to terminate all or just some of them later (if, for example, a process freezes). If storing a Process in a variable is not an option, I would like to know how to get or list all the processes started by a multiprocessing Pool. That would be very similar to what the .current_process() method does, except .current_process() gets only a single process, while I need all the processes started or all the processes currently running.
Two questions:
Is it even possible to store an instance of the Process (as returned by mp.current_process())?
Currently I am only able to get a single process from inside the function that the process is running (from inside myFunct(), using the .current_process() method).
Instead, I would like to list all the processes currently started by multiprocessing. How can I achieve that?
import multiprocessing as mp

poolList = mp.Manager().list()
poolDict = mp.Manager().dict()   # shared dict written to below

def myFunct(arg):
    print('myFunct(): current process:', mp.current_process())
    try:
        poolList.append(mp.current_process())
    except Exception as e:
        print(e)
    for i in range(110):
        for n in range(500000):
            pass
        poolDict[arg] = i
    print('myFunct(): completed', arg, poolDict)

from multiprocessing import Pool

pool = Pool(processes=2)
myArgsList = ['arg1', 'arg2', 'arg3']
pool.map_async(myFunct, myArgsList)
pool.close()
pool.join()
To list the processes started by a Pool() instance (which is what you mean, if I understand you correctly), there is the pool._pool list, which contains the instances of the processes.
However, it is not part of the documented interface and hence really should not be used.
BUT... it seems a little unlikely that it would change just like that. I mean, should they stop keeping an internal list of processes in the pool? And stop calling it _pool?
It also bothers me that there isn't at least a get-processes method or something like it.
And handling breakage due to some name change should not be that difficult.
But still, use at your own risk:
from multiprocessing import pool

# Have to run in main
if __name__ == '__main__':
    # Create 3 worker processes
    _my_pool = pool.Pool(3)

    # Loop, terminate, and remove from the process list
    # Use a copy [:] of the list to remove items correctly
    for _curr_process in _my_pool._pool[:]:
        print("Terminating process " + str(_curr_process.pid))
        _curr_process.terminate()
        _my_pool._pool.remove(_curr_process)

    # If you call _repopulate, the pool will again contain 3 worker processes.
    _my_pool._repopulate_pool()
    for _curr_process in _my_pool._pool[:]:
        print("After repopulation " + str(_curr_process.pid))
The example creates a pool and manually terminates all processes.
It is important that you remember to delete the processes you terminate from the pool yourself if you want Pool() to continue working as usual.
_my_pool._repopulate_pool() increases the number of worker processes to 3 again. It is not needed to answer the question, but it gives a little behind-the-scenes insight.
Yes, you can get all active child processes and act on a process based on its name, e.g.
multiprocessing.Process(target=foo, name="refresh-reports")
and then
for p in multiprocessing.active_children():
    if p.name == "refresh-reports":
        p.terminate()
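A small runnable sketch of that approach (my addition): multiprocessing.active_children() returns the live Process objects started by the current process, so you can find a worker by name and terminate it.

import multiprocessing
import time

def refresh_reports():
    time.sleep(60)   # stand-in for long-running work

if __name__ == '__main__':
    multiprocessing.Process(target=refresh_reports, name="refresh-reports").start()

    # find the named worker among this process's live children and stop it
    for p in multiprocessing.active_children():
        if p.name == "refresh-reports":
            p.terminate()
            p.join()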
You're creating a managed List object, but then letting the associated Manager object expire.
Process objects aren't shareable because they aren't picklable; that is, they aren't simple objects.
Oddly, the multiprocessing module doesn't have the equivalent of threading.enumerate() -- that is, you can't list all outstanding processes. As a workaround, I just store procs in a list. I never terminate() a process, but do sys.exit(0) in the parent. It's rough, because the workers will leave things in an inconsistent state, but it's okay for smaller programs.
To kill a frozen worker, I suggest: 1) the worker receives "heartbeat" jobs in a queue every now and then, 2) if the parent notices worker A hasn't responded to a heartbeat within a certain amount of time, it calls p.terminate(). Consider restating the problem in another SO question, as it's interesting.
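A rough sketch of that heartbeat idea (my illustration, not code from the answer): the parent periodically sends a ping and terminates the worker if its last reply is too old. All names and timings here are hypothetical.

import multiprocessing, time

HEARTBEAT_TIMEOUT = 10   # seconds without a reply before a worker is considered frozen

def worker(ping_q, pong_q):
    while True:
        msg = ping_q.get()
        if msg == 'ping':
            pong_q.put(('pong', time.time()))
        # ... real work would be interleaved here ...

if __name__ == '__main__':
    ping_q, pong_q = multiprocessing.Queue(), multiprocessing.Queue()
    p = multiprocessing.Process(target=worker, args=(ping_q, pong_q))
    p.start()

    last_seen = time.time()
    for _ in range(3):                    # a few supervision rounds for the demo
        ping_q.put('ping')
        time.sleep(2)
        while not pong_q.empty():         # drain replies, remember the newest
            _, last_seen = pong_q.get()
        if time.time() - last_seen > HEARTBEAT_TIMEOUT:
            p.terminate()                 # worker looks frozen: kill it
            break
    p.terminate()                         # clean up the demo worker
    p.join()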
To be honest, the map-style approach is much easier than using a Manager.
Here's a Manager example I've used. A worker adds stuff to a shared list; another worker occasionally wakes up, processes everything on the list, then goes back to sleep. The code also has verbose logging, which is essential for easy debugging.
source
# producer adds to fixed-sized list; scanner uses them
import logging, multiprocessing, sys, time

def producer(objlist):
    '''
    add an item to list every sec; ensure fixed size list
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(1)
        except KeyboardInterrupt:
            return
        msg = 'ding: {:04d}'.format(int(time.time()) % 10000)
        logger.info('put: %s', msg)
        del objlist[0]
        objlist.append(msg)

def scanner(objlist):
    '''
    every now and then, run calculation on objlist
    '''
    logger = multiprocessing.get_logger()
    logger.info('start')
    while True:
        try:
            time.sleep(5)
        except KeyboardInterrupt:
            return
        logger.info('items: %s', list(objlist))

def main():
    logger = multiprocessing.log_to_stderr(
        level=logging.INFO
    )
    logger.info('setup')

    # create fixed-length list, shared between producer & consumer
    manager = multiprocessing.Manager()
    my_objlist = manager.list(  # pylint: disable=E1101
        [None] * 10
    )

    multiprocessing.Process(
        target=producer,
        args=(my_objlist,),
        name='producer',
    ).start()

    multiprocessing.Process(
        target=scanner,
        args=(my_objlist,),
        name='scanner',
    ).start()

    logger.info('running forever')
    try:
        manager.join()  # wait until both workers die
    except KeyboardInterrupt:
        pass
    logger.info('done')

if __name__ == '__main__':
    main()