Python multiprocessing.Process OSError: [Errno 24] Too many open files - python

I have the following code:
from multiprocessing import Process, Pipe

def formatGravities(gravities):
    # create a list to keep all processes
    processes = []
    # create a list to keep connections
    parent_connections = []
    formatted_gravities = []
    # create a process per instance
    for gravity in gravities:
        # create a pipe for communication
        parent_conn, child_conn = Pipe()
        parent_connections.append(parent_conn)
        # create the process, pass arguments
        process = Process(target=formatGravity,
                          args=(gravity, child_conn))
        processes.append(process)
    # start all processes
    for process in processes:
        process.start()
    # make sure that all processes have finished
    for process in processes:
        process.join()
    for parent_connection in parent_connections:
        formatted_gravities.append(parent_connection.recv()[0])
    return formatted_gravities
len(gravities) is on the order of millions. I understand that I can't open millions of processes at the same time, and that's probably why I get the error, but how can I change my code so that it waits to spawn new processes once the maximum number of processes is already running?
I have the requirement that I can't use multiprocessing.Queue or multiprocessing.Pool.
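One way to stay under the system limits with only Process and Pipe is to spawn the processes in fixed-size batches, so that only a bounded number of processes (and pipe file descriptors) exist at any one time. Below is a minimal sketch of that idea, not code from the thread: it assumes gravities is a list and that a batch size of 16 is acceptable; formatGravity is the worker function from the question.

from multiprocessing import Process, Pipe

BATCH_SIZE = 16  # assumed cap; tune against your file-descriptor and process limits

def formatGravitiesBatched(gravities):
    formatted_gravities = []
    # only BATCH_SIZE processes (and pipes) exist at any one time
    for start in range(0, len(gravities), BATCH_SIZE):
        batch = gravities[start:start + BATCH_SIZE]
        processes = []
        parent_connections = []
        for gravity in batch:
            parent_conn, child_conn = Pipe()
            parent_connections.append(parent_conn)
            process = Process(target=formatGravity, args=(gravity, child_conn))
            processes.append(process)
            process.start()
        # read results before joining so a full pipe buffer cannot stall the children
        for parent_connection in parent_connections:
            formatted_gravities.append(parent_connection.recv()[0])
        for process in processes:
            process.join()
    return formatted_gravities

Reading each result before joining its batch also avoids blocking on a full pipe buffer when the formatted payloads are large.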

Related

Passing objects in python multiprocess.spawn

I have 50 processes I want to run in parallel. I need to run the processes on a GPU. My machine has 8 GPUs, so I pass the device number to each process so it knows which device to run on. Once that process is done, I want to run another process on that device. The processes are run as subprocesses using Popen with the command below:
python special_process.py device
A simple way to do this would be
for group in groups:
    processes = [subprocess.Popen(f'python special_process.py {device}'.split()) for device in range(8)]
    [p.wait() for p in processes]
where groups are the 50 processes split into groups of 8.
The downside of this is that some processes take longer than others, and all processes in a group need to finish before moving on to the next group.
I was hoping to do something like multiprocess.spawn, but I need the last process to return the device number so it is clear which device is free to run on. I tried using Queue and Process from multiprocessing, but I can't get more than one process to run at once.
Any help would be much appreciated. Thanks.
A simple while loop and building your own queue worked. Just don't use wait() until the end.
import subprocess

d = list(range(20))
num_gpus = 8
procs = []
gpus_free = set([j for j in range(num_gpus)])
gpus_used = set()

while len(d) > 0:
    # iterate over a copy, since finished procs are removed from the list
    for proc, gpu in list(procs):
        poll = proc.poll()
        if poll is None:
            # Proc still running
            continue
        else:
            # Proc complete - pop from list and free its GPU
            procs.remove((proc, gpu))
            gpus_free.add(gpu)
    # Submit new processes
    if len(procs) < num_gpus:
        this_process = d.pop()
        gpu_for_this_process = gpus_free.pop()
        command = f"python3 inner_function.py {gpu_for_this_process} {this_process}"
        proc = subprocess.Popen(command, shell=True)
        procs.append((proc, gpu_for_this_process))

[proc.wait() for proc, _ in procs]
print('DONE with all')

How can I refresh Pool processes reading from Queue?

I'm using multiprocessing Pool + Queue to share processing work between a parent process (processing with GPUs) and child processes (processing on the CPU). My program looks like this:
from multiprocessing import Pool, JoinableQueue

def reader_proc(queue):
    ## Read from the queue; this will be spawned as a separate Process
    while True:
        msg = queue.get()  # Read from the queue
        do_cpu_work(msg)
        if (msg == 'DONE'):
            break

if __name__ == '__main__':
    queue = JoinableQueue()
    pool = Pool(initializer=reader_proc, initargs=(queue,))
    for task in GPUWork:
        results = do_task(task)
        for result in results:
            queue.put(result)
    # put 'DONE' on and join and close
I'm having a severe memory leak right now, even after explicitly deleting every variable in reader_proc and calling gc.collect(). I'm calling into various C++ libraries from reader_proc, and I suspect one of them could be leaking memory. While I try to debug that, I need to get some processing done on this data.
Is there any way to refresh these reader processes, e.g. periodically terminate them and restart them? This exists as maxtasksperchild for a Pool operating on an iterable, but it doesn't seem to apply to this Queue / Process based scheme.
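One pattern that can stand in for maxtasksperchild in a Process/Queue scheme is to have each reader exit voluntarily after a fixed number of messages and let the parent start a replacement. The sketch below is an assumption-laden illustration, not code from the thread: the reader count, the refresh interval, and the plain Queue (instead of JoinableQueue) are choices of the sketch, while do_cpu_work, do_task, and GPUWork are the names from the question.

import sys
import threading
from multiprocessing import Process, Queue

NUM_READERS = 4              # assumed reader count
MAX_MSGS_PER_READER = 1000   # assumed refresh interval

def reader_proc(queue):
    # handle a bounded number of messages, then exit so the parent can
    # replace this process and reclaim any leaked memory
    for _ in range(MAX_MSGS_PER_READER):
        msg = queue.get()
        if msg == 'DONE':
            return               # exitcode 0: no more work for this reader
        do_cpu_work(msg)
    sys.exit(1)                  # exitcode 1: recycle me, work may remain

def keep_readers_fresh(queue, readers):
    # replace readers that hit their message limit; stop once every
    # reader has consumed a 'DONE' sentinel (exitcode 0)
    while readers:
        survivors = []
        for reader in readers:
            reader.join(timeout=0.1)
            if reader.exitcode is None:        # still running
                survivors.append(reader)
            elif reader.exitcode != 0:         # hit its limit: start a fresh one
                fresh = Process(target=reader_proc, args=(queue,))
                fresh.start()
                survivors.append(fresh)
        readers[:] = survivors

if __name__ == '__main__':
    queue = Queue()
    readers = [Process(target=reader_proc, args=(queue,)) for _ in range(NUM_READERS)]
    for reader in readers:
        reader.start()
    # recycle readers in the background while the GPU loop produces work
    monitor = threading.Thread(target=keep_readers_fresh, args=(queue, readers))
    monitor.start()
    for task in GPUWork:
        for result in do_task(task):
            queue.put(result)
    for _ in range(NUM_READERS):
        queue.put('DONE')        # one sentinel per reader
    monitor.join()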

Multiprocessing with threading?

When I was trying to make my script multi-threaded, I found out about multiprocessing.
I wonder if there is a way to make multiprocessing work with threading?
cpu 1 -> 3 threads (worker A, B, C)
cpu 2 -> 3 threads (worker D, E, F)
...
I'm trying to do it myself but I hit so many problems.
Is there a way to make those two work together?
You can generate a number of Processes, and then spawn Threads from inside them. Each Process can handle almost anything the standard interpreter thread can handle, so there's nothing stopping you from creating new Threads or even new Processes within each Process. As a minimal example:
import multiprocessing
import threading

def foo():
    print("Thread Executing!")

def bar():
    threads = []
    for _ in range(3):  # each Process creates a number of new Threads
        thread = threading.Thread(target=foo)
        threads.append(thread)
        thread.start()
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    processes = []
    for _ in range(3):
        p = multiprocessing.Process(target=bar)  # create a new Process
        p.start()
        processes.append(p)
    for process in processes:
        process.join()
Communication between threads can be handled within each Process, and communication between the Processes can be handled at the root interpreter level using Queues or Manager objects.
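As a hedged illustration of that last point (this sketch is not part of the original answer), a multiprocessing.Queue created in the parent can be handed to each Process and then shared with its Threads, so results flow back up to the root interpreter:

import multiprocessing
import threading

def foo(queue, ident):
    # each thread sends its result over the process-level queue
    queue.put(f"result from thread {ident}")

def bar(queue):
    threads = [threading.Thread(target=foo, args=(queue, i)) for i in range(3)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

if __name__ == "__main__":
    queue = multiprocessing.Queue()
    processes = [multiprocessing.Process(target=bar, args=(queue,)) for _ in range(3)]
    for process in processes:
        process.start()
    results = [queue.get() for _ in range(9)]  # 3 processes x 3 threads each
    for process in processes:
        process.join()
    print(results)

Draining the expected number of results before joining avoids the documented pitfall of joining a process that still has items buffered in its queue feeder.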
You can define a function that takes a process and make it run 3 threads and then spawn your processes to target this function, for example:
def threader(process):
    for _ in range(3):
        threading.Thread(target=yourfunc).start()

def main():
    # spawn whatever processes here to target threader

How to use python multiprocessing pool in continuous loop

I am using the Python multiprocessing library to execute a Selenium script. My code is below:
#-- start and join multiple threads ---
thread_list = []
total_threads = 10  #-- no of parallel threads
for i in range(total_threads):
    t = Process(target=get_browser_and_start, args=[url, nlp, pixel])
    thread_list.append(t)
    print "starting thread..."
    t.start()

for t in thread_list:
    print "joining existing thread..."
    t.join()
As I understand the join() function, it will wait for each process to complete. But I want that, as soon as a process is released, it is assigned another task to perform.
It can be understood like this:
Say 8 processes are started in the first instance.
no_of_tasks_to_perform = 100
for i in range(no_of_tasks_to_perform):
    processes start(8)
    if process no 2 finished executing, start new process
    maintain 8 processes at any point of time till
    "i" is <= no_of_tasks_to_perform
Instead of starting new processes every now and then, try to put all your tasks into a multiprocessing.Queue() and start 8 long-running processes; in each process, keep reading from the task queue to get new tasks and do the job, until there are no tasks left.
In your case, it's more like this:
from multiprocessing import Queue, Process

def worker(queue):
    while not queue.empty():
        task = queue.get()
        # now start to work on your task
        get_browser_and_start(url, nlp, pixel)  # url, nlp, pixel can be unpacked from task

def main():
    queue = Queue()
    # Now put tasks into queue
    no_of_tasks_to_perform = 100
    for i in range(no_of_tasks_to_perform):
        queue.put([url, nlp, pixel, ...])

    # Now start all processes
    process = Process(target=worker, args=(queue,))
    process.start()
    ...
    process.join()
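A fuller, self-contained variation of that outline is sketched below. It is not the original answer's code: it starts a fixed number of workers and uses one None sentinel per worker instead of the queue.empty() check, which avoids a race where a worker starts before the tasks have been flushed into the queue. It assumes each task is a (url, nlp, pixel) tuple and that tasks_to_perform is a stand-in for your 100 tasks; get_browser_and_start is the function from the question.

from multiprocessing import Queue, Process

NUM_WORKERS = 8  # assumed worker count

def worker(queue):
    while True:
        task = queue.get()
        if task is None:                 # sentinel: no more tasks for this worker
            break
        url, nlp, pixel = task           # assumes each task is a (url, nlp, pixel) tuple
        get_browser_and_start(url, nlp, pixel)

def main():
    queue = Queue()
    for task in tasks_to_perform:        # tasks_to_perform: stand-in for your 100 task tuples
        queue.put(task)
    for _ in range(NUM_WORKERS):
        queue.put(None)                  # one sentinel per worker
    workers = [Process(target=worker, args=(queue,)) for _ in range(NUM_WORKERS)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()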

python multi-processing zombie processes

I have a simple implementation of python's multi-processing module
if __name__ == '__main__':
    jobs = []
    while True:
        for i in range(40):
            # fetch one by one from redis queue
            #item = item from redis queue
            p = Process(name='worker '+str(i), target=worker, args=(item,))
            # if p is not running, start p
            if not p.is_alive():
                jobs.append(p)
                p.start()
        for j in jobs:
            j.join()
            jobs.remove(j)

def worker(url_data):
    """worker function"""
    print url_data['link']
What I expect this code to do:
run in an infinite loop, keep waiting for the Redis queue.
if the Redis queue is not empty, fetch an item.
create 40 multiprocessing.Process instances, no more, no less.
if a process has finished processing, start a new process, so that ~40 processes are running at all times.
I read that, to avoid zombie processes, they should be joined by the parent; that's what I intended to achieve in the second loop. But the issue is that on launching, it spawns 40 processes; the workers finish and enter a zombie state until all currently spawned processes have finished, and then in the next iteration of "while True" the same pattern continues.
So my question is:
How can I avoid zombie processes and spawn a new process as soon as one of the 40 has finished?
For a task like the one you described, it is usually better to use a different approach: Pool.
You can have the main process fetch the data and the workers deal with it.
Here is an example of Pool from the Python docs:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)            # start 4 worker processes
    result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
    print result.get(timeout=1)         # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))        # prints "[0, 1, 4,..., 81]"
I also suggest using imap instead of map, as it seems your task can be asynchronous.
Roughly, your code will be:
p = Pool(40)
while True:
    items = items from redis queue
    p.imap_unordered(worker, items)  # the unordered version is faster

def worker(url_data):
    """worker function"""
    print url_data['link']
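A slightly fuller shape of that rough code is sketched below, as an assumption-laden illustration rather than the answer's exact program: fetch_items_from_redis is a hypothetical stand-in for the Redis read, and the loop consumes the imap_unordered iterator so that exceptions raised in workers surface in the parent.

import time
from multiprocessing import Pool

def worker(url_data):
    """worker function"""
    print(url_data['link'])

def main():
    pool = Pool(40)
    while True:
        items = fetch_items_from_redis()  # hypothetical helper for the Redis read
        if not items:
            time.sleep(1)                 # avoid spinning when the queue is empty
            continue
        # consuming the iterator surfaces worker exceptions and applies backpressure
        for _ in pool.imap_unordered(worker, items):
            pass

if __name__ == '__main__':
    main()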
