How to use Python multiprocessing queue to access GPU (through PyOpenCL)?

I have code that takes a long time to run and so I've been investigating Python's multiprocessing library in order to speed things up. My code also has a few steps that utilize the GPU via PyOpenCL. The problem is, if I set multiple processes to run at the same time, they all end up trying to use the GPU at the same time, and that often results in one or more of the processes throwing an exception and quitting.
In order to work around this, I staggered the start of each process so that they'd be less likely to bump into each other:
process_list = []
num_procs = 4
# break data into chunks so each process gets its own chunk of the data
data_chunks = chunks(data, num_procs)
for chunk in data_chunks:
    if len(chunk) == 0:
        continue
    # Instantiate the process
    p = multiprocessing.Process(target=test, args=(arg1, arg2))
    # Stick the process in a list so that it remains accessible
    process_list.append(p)
# Start the processes
j = 1
for process in process_list:
    print('\nStarting process %i' % j)
    process.start()
    time.sleep(5)
    j += 1
for process in process_list:
    process.join()
I also wrapped a try/except block around the function that calls the GPU, so that if two processes DO try to access it at the same time, the one that doesn't get access will wait a couple of seconds and try again:
wait = 2
n = 0
while True:
    try:
        gpu_out = GPU_Obj.GPU_fn(params)
    except:
        time.sleep(wait)
        print('\n Waiting for GPU memory...')
        n += 1
        if n == 5:
            raise Exception('Tried and failed %i times to allocate memory for opencl kernel.' % n)
        continue
    break
This workaround is very clunky, and even though it works most of the time, processes occasionally throw exceptions. I feel like there should be a more efficient/elegant solution using multiprocessing.Queue or something similar, but I'm not sure how to integrate it with PyOpenCL for GPU access.

Sounds like you could use a multiprocessing.Lock to synchronize access to the GPU:
data_chunks = chunks(data, num_procs)
lock = multiprocessing.Lock()
for chunk in data_chunks:
    if len(chunk) == 0:
        continue
    # Instantiates the process
    p = multiprocessing.Process(target=test, args=(arg1, arg2, lock))
    ...
Then, inside test where you access the GPU:
with lock:  # Only one process will be allowed in this block at a time.
    gpu_out = GPU_Obj.GPU_fn(params)
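For context, the worker function might look roughly like this - a minimal sketch in which prepare_params and postprocess are hypothetical helpers standing in for whatever CPU-side work test actually does:

def test(arg1, arg2, lock):
    # CPU-side preparation can run in all processes at once
    params = prepare_params(arg1, arg2)      # hypothetical helper
    # only the GPU call is serialized
    with lock:
        gpu_out = GPU_Obj.GPU_fn(params)
    # post-processing again runs outside the lock
    return postprocess(gpu_out)              # hypothetical helper

Keeping the with lock: block as small as possible means the processes still overlap on everything except the GPU call itself.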
Edit:
To do this with a pool, you'd do this:
# At global scope
lock = None

def init(_lock):
    global lock
    lock = _lock

data_chunks = chunks(data, num_procs)
lock = multiprocessing.Lock()
for chunk in data_chunks:
    if len(chunk) == 0:
        continue
    # Instantiate the pool
    p = multiprocessing.Pool(initializer=init, initargs=(lock,))
    p.apply(test, args=(arg1, arg2))
    ...
Or:
data_chunks = chunks(data, num_procs)
m = multiprocessing.Manager()
lock = m.Lock()
for chunk in data_chunks:
    if len(chunk) == 0:
        continue
    # Instantiate the pool
    p = multiprocessing.Pool()
    p.apply(test, args=(arg1, arg2, lock))
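Note the difference between the two Pool variants: a plain multiprocessing.Lock cannot be pickled and sent to pool workers as a task argument (which is why the first variant hands it over through the initializer), whereas a Manager().Lock() is a proxy object that can be passed directly. As a self-contained illustration of the Manager variant, here is a minimal sketch in which fake_gpu_fn and the chunking are made-up stand-ins for the question's real GPU call and data:

import multiprocessing
import time

def fake_gpu_fn(x):
    # stand-in for the real PyOpenCL kernel call
    time.sleep(0.1)
    return x * x

def test(chunk, lock):
    out = []
    for item in chunk:
        with lock:                      # only one process uses the "GPU" at a time
            out.append(fake_gpu_fn(item))
    return out

if __name__ == '__main__':
    data = list(range(20))
    data_chunks = [data[i::4] for i in range(4)]
    m = multiprocessing.Manager()
    lock = m.Lock()
    with multiprocessing.Pool(4) as pool:
        results = pool.starmap(test, [(chunk, lock) for chunk in data_chunks])
    print(results)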

Related

multiprocessing value hangs with lock

I've read the documentation here, and it seems that to make sure the Value does not hang we need to use a lock. I did just that, but it still gets stuck:
from multiprocessing import Process, Value, freeze_support, Lock

nb_threads = 3
nbloops = 10
v = Value('i', 0)

def run_process(lock):
    global nbloops
    i = 0
    while i < nbloops:
        # do stuff
        i += 1
        with lock:
            v.value += 1
        # wait for all the processes to finish doing something
        while v.value % nb_threads != 0:
            pass

if __name__ == '__main__':
    freeze_support()
    processes = []
    lock = Lock()
    for i in range(0, 3):
        processes.append(Process(target=run_process, args=(lock,)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
I've tried accessing the value using lock but it still blocks:
val = -1
while val % nb_threads != 0:
    with lock:
        val = v.value
How can I fix this? Thanks
Your code has a race condition; you do not guarantee that all three processes break free from the while v.value % nb_threads != 0 loop before allowing them to move on. This allows one or two of the processes to move on to the next iteration of the while i < nbloops loop, increment v.value, and then prevent the remaining process/processes from ever breaking out of their own while v.value % nb_threads != 0 loop. The kind of synchronization you're trying to do there is best handled by a Barrier, rather than looping and repeatedly checking the value.
Also, multiprocessing.Value has built-in synchronization by default, and you can explicitly access the Lock it uses for that by calling Value.get_lock(), so there is no need to explicitly pass a Lock of your own to each process. Putting it all together, you have:
from multiprocessing import Process, Value, freeze_support, Lock, Barrier

nb_threads = 3
nbloops = 10
v = Value('i', 0)

def run_process(barrier):
    global nbloops
    i = 0
    while i < nbloops:
        # do stuff
        i += 1
        with v.get_lock():
            v.value += 1
        # wait for all the processes to finish doing something
        out = barrier.wait()

if __name__ == '__main__':
    freeze_support()
    processes = []
    b = Barrier(nb_threads)
    for i in range(0, nb_threads):
        processes.append(Process(target=run_process, args=(b,)))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
The Barrier guarantees that no process can move on to the next iteration of the loop until all of them have called Barrier.wait(), at which point all three are simultaneously able to progress. The Barrier object supports re-use, so wait() can safely be called again on each iteration.
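To see the re-use in action, here is a tiny standalone sketch (separate from the code above, with made-up names) in which three processes stay in lock-step over several rounds because they all meet at the same Barrier each time:

from multiprocessing import Process, Barrier

def worker(barrier, name, rounds=3):
    for r in range(rounds):
        print("%s finished round %d" % (name, r))
        barrier.wait()   # all three re-synchronize here, once per round

if __name__ == '__main__':
    b = Barrier(3)
    procs = [Process(target=worker, args=(b, "p%d" % i)) for i in range(3)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()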

A way to wait for currently running tasks to finish then stop in multiprocessing Pool

I have a large number of tasks (40,000 to be exact) that I am using a Pool to run in parallel. To maximize efficiency, I pass the list of all tasks at once to starmap and let them run.
I would like to have it so that if my program is interrupted with Ctrl+C, the currently running tasks are allowed to finish but no new ones are started. I have figured out the signal handling part to deal with the Ctrl+C just fine using the recommended method, and this works well (at least with Python 3.6.9, which I am using):
import os
import signal
import random as rand
import multiprocessing as mp

def init():
    signal.signal(signal.SIGINT, signal.SIG_IGN)

def child(a, b, c):
    st = rand.randrange(5, 20+1)
    print("Worker thread", a+1, "sleep for", st, "...")
    os.system("sleep " + str(st))

pool = mp.Pool(initializer=init)
try:
    pool.starmap(child, [(i, 2*i, 3*i) for i in range(10)])
    pool.close()
    pool.join()
    print("True exit!")
except KeyboardInterrupt:
    pool.terminate()
    pool.join()
    print("Interrupted exit!")
The problem is that Pool seems to have no function to let the currently running tasks complete and then stop; it only has terminate and close. In the example above I use terminate, but this is not what I want, as it immediately kills all running tasks (whereas I want the currently running tasks to run to completion). close, on the other hand, simply prevents more tasks from being added, but calling close then join will wait for all pending tasks to complete (40,000 of them in my real case), whereas I only want the currently running tasks to finish, not all of them.
I could somehow add my tasks gradually, one by one or in chunks, so that I could use close and join when interrupted, but this seems less efficient unless there is a way to manually add a new task as soon as one finishes (which I'm not seeing how to do from the Pool documentation). It really seems like my use case would be common and that Pool should have a function for this, but I have not seen this question asked anywhere (or maybe I'm just not searching for the right thing).
Does anyone know how to accomplish this easily?
I tried to do something similar with concurrent.futures - see the last code block in this answer: it attempts to throttle adding tasks to the pool and only adds new tasks as tasks complete. You could change the logic to fit your needs. Maybe keep the number of pending work items slightly greater than the number of workers so you don't starve the executor. Something like:
import concurrent.futures
import random as rand
import signal
import sys
import time

def child(*args, n=0):
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    a, b, c = args
    st = rand.randrange(1, 5)
    time.sleep(st)
    x = f"Worker {n} thread {a+1} slept for {st} - args:{args}"
    return (n, x)

if __name__ == '__main__':
    nworkers = 5  # ncpus?
    results = []
    fs = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=nworkers) as executor:
        data = ((i, 2*i, 3*i) for i in range(100))
        for n, args in enumerate(data):
            try:
                # limit pending tasks
                while len(executor._pending_work_items) >= nworkers + 2:
                    # wait till one completes and get the result
                    futures = concurrent.futures.wait(fs, return_when=concurrent.futures.FIRST_COMPLETED)
                    #print(futures)
                    results.extend(future.result() for future in futures.done)
                    print(f'{len(results)} results so far')
                    fs = list(futures.not_done)
                print(f'add a new task {n}')
                fs.append(executor.submit(child, *args, **{'n': n}))
            except KeyboardInterrupt as e:
                print('ctrl-c!!', file=sys.stderr)
                # don't add any more tasks
                break
        # get leftover results as they finish
        for future in concurrent.futures.as_completed(fs):
            print(f'{len(executor._pending_work_items)} tasks pending:')
            result = future.result()
            results.append(result)
    results.sort()
    # separate the results from the value used to sort
    for n, result in results:
        print(result)
Here is a way to get the results sorted in submission order without modifying the task. It uses a dictionary to relate each future to its submission order and uses it for the sort key.
# same imports as the previous example
def child(*args):
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    a, b, c = args
    st = rand.randrange(1, 5)
    time.sleep(st)
    x = f"Worker thread {a+1} slept for {st} - args:{args}"
    return x

if __name__ == '__main__':
    nworkers = 5  # ncpus?
    sort_dict = {}
    results = []
    fs = []
    with concurrent.futures.ProcessPoolExecutor(max_workers=nworkers) as executor:
        data = ((i, 2*i, 3*i) for i in range(100))
        for n, args in enumerate(data):
            try:
                # limit pending tasks
                while len(executor._pending_work_items) >= nworkers + 2:
                    # wait till one completes and grab it
                    futures = concurrent.futures.wait(fs, return_when=concurrent.futures.FIRST_COMPLETED)
                    results.extend(future for future in futures.done)
                    print(f'{len(results)} futures completed so far')
                    fs = list(futures.not_done)
                future = executor.submit(child, *args)
                fs.append(future)
                print(f'task {n} added - future:{future}')
                sort_dict[future] = n
            except KeyboardInterrupt as e:
                print('ctrl-c!!', file=sys.stderr)
                # don't add any more tasks
                break
        # get leftover futures as they finish
        for future in concurrent.futures.as_completed(fs):
            print(f'{len(executor._pending_work_items)} tasks pending:')
            results.append(future)
    # sort the futures
    results.sort(key=lambda f: sort_dict[f])
    # get the results
    for future in results:
        print(future.result())
You could also just add an attribute to each future and sort on that (no need for the dictionary)
...
future = executor.submit(child, *args)
# add an attribute to the future that can be sorted on
future.submitted = n
fs.append(future)
...
results.sort(key=lambda f: f.submitted)

How to use pipe correctly in multiple processes(>2)

How to use a pipe correctly with multiple processes (>2)?
e.g. one producer, several consumers.
This code fails in a Linux environment, but works fine on Windows:
import multiprocessing, time

def consumer(pipe, id):
    output_p, input_p = pipe
    input_p.close()
    while True:
        try:
            item = output_p.recv()
        except EOFError:
            break
        print("%s consume:%s" % (id, item))
        #time.sleep(3)  # without this sleep the code fails on Linux, but works on Windows
    print('Consumer done')

def producer(sequence, input_p):
    for item in sequence:
        print('produce:', item)
        input_p.send(item)
        time.sleep(1)

if __name__ == '__main__':
    (output_p, input_p) = multiprocessing.Pipe()
    # create two consumer processes
    cons_p1 = multiprocessing.Process(target=consumer, args=((output_p, input_p), 1))
    cons_p1.start()
    cons_p2 = multiprocessing.Process(target=consumer, args=((output_p, input_p), 2))
    cons_p2.start()
    output_p.close()
    sequence = [i for i in range(10)]
    producer(sequence, input_p)
    input_p.close()
    cons_p1.join()
    cons_p2.join()
Do not use a pipe with multiple consumers. The documentation explicitly says the data may become corrupted when two processes (or threads) read from or write to the same end of the pipe at the same time. Which is what you do here: two readers on the same end.
The two connection objects returned by Pipe() represent the two ends of the pipe. Each connection object has send() and recv() methods (among others). Note that data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time. Of course there is no risk of corruption from processes using different ends of the pipe at the same time.
So use Queue, or JoinableQueue even.
from multiprocessing import Process, JoinableQueue
from Queue import Empty  # Python 2; on Python 3 this is "from queue import Empty"
import time

def consumer(que, pid):
    while True:
        try:
            item = que.get(timeout=10)
            print("%s consume:%s" % (pid, item))
            que.task_done()
        except Empty:
            break
    print('Consumer done')

def producer(sequence, que):
    for item in sequence:
        print('produce:', item)
        que.put(item)
        time.sleep(1)

if __name__ == '__main__':
    que = JoinableQueue()
    # create two consumer processes
    cons_p1 = Process(target=consumer, args=(que, 1))
    cons_p1.start()
    cons_p2 = Process(target=consumer, args=(que, 2))
    cons_p2.start()
    sequence = [i for i in range(10)]
    producer(sequence, que)
    que.join()
    cons_p1.join()
    cons_p2.join()
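One thing to be aware of with the version above is that each consumer only exits after its get(timeout=10) times out once the producer has finished. If you would rather have the consumers exit immediately, a common variant (a sketch of the same code, not tested against your setup) is to have the producer put one None sentinel per consumer:

def consumer(que, pid):
    while True:
        item = que.get()
        if item is None:        # sentinel: producer is finished
            que.task_done()
            break
        print("%s consume:%s" % (pid, item))
        que.task_done()
    print('Consumer done')

def producer(sequence, que, n_consumers):
    for item in sequence:
        print('produce:', item)
        que.put(item)
        time.sleep(1)
    for _ in range(n_consumers):
        que.put(None)           # one sentinel per consumer

with the call in __main__ becoming producer(sequence, que, 2).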

How to implement a dynamic amount of concurrent threads?

I am launching concurrent threads doing some stuff:
concurrent = 10
q = Queue(concurrent * 2)
for j in range(concurrent):
t = threading.Thread(target=doWork)
t.daemon = True
t.start()
try:
# process each line and assign it to an available thread
for line in call_file:
q.put(line)
q.join()
except KeyboardInterrupt:
sys.exit(1)
At the same time I have a distinct thread counting time:
def printit():
    threading.Timer(1.0, printit).start()
    print current_status

printit()
I would like to increase (or decrease) the amount of concurrent threads for the main process, let's say every minute. I can make a time counter in the timer thread and have it do things every minute, but how do I change the amount of concurrent threads in the main process?
Is it possible (and if so, how) to do that?
This is my worker:
def UpdateProcesses(start, processnumber, CachesThatRequireCalculating, CachesThatAreBeingCalculated, CacheDict, CacheLock, IdleLock, FileDictionary, MetaDataDict, CacheIndexDict):
    NewPool()
    while start[processnumber]:
        IdleLock.wait()
        while len(CachesThatRequireCalculating) > 0 and start[processnumber] == True:
            CacheLock.acquire()
            try:
                cacheCode = CachesThatRequireCalculating[0]  # the list can be empty if another process took the last item before we acquired CacheLock
                CachesThatRequireCalculating.remove(cacheCode)
                print cacheCode, "starts processing by", processnumber, "process"
            except:
                CacheLock.release()
            else:
                CacheLock.release()
                CachesThatAreBeingCalculated.append(cacheCode[:3])
                Array, b, f = TIPP.LoadArray(FileDictionary[cacheCode[:2]])  # opens the dask array
                Array = ((Array[:, :, CacheIndexDict[cacheCode[:2]][cacheCode[2]]:CacheIndexDict[cacheCode[:2]][cacheCode[2]+1]].compute()/2.**(MetaDataDict[cacheCode[:2]]["Bit Depth"])*255.).astype(np.uint16)).transpose([1, 0, 2])  # slices and calculates the array
                f.close()  # close the file
                if CachesThatAreBeingCalculated.count(cacheCode[:3]) != 0:  # if not, this cache is not needed anymore (the cacheCode is removed by a wavelength change)
                    CachesThatAreBeingCalculated.remove(cacheCode[:3])
                    try:  # if the object is not available the first time, try a second time
                        CacheDict[cacheCode[:3]] = Array
                    except:
                        CacheDict[cacheCode[:3]] = Array
                print cacheCode, "done processing by", processnumber, "process"
        if start[processnumber]:
            IdleLock.clear()
This is how I start them:
self.ProcessLst = []  # list with all the processes that calculate the caches
for processnumber in range(min(NumberOfMaxProcess, self.processes)):
    self.ProcessTerminateLst.append(True)
for processnumber in range(min(NumberOfMaxProcess, self.processes)):
    self.ProcessLst.append(process.Process(target=Proc.UpdateProcesses, args=(self.ProcessTerminateLst, processnumber, self.CachesThatRequireCalculating, self.CachesThatAreBeingCalculated, self.CacheDict, self.CacheLock, self.IdleLock, self.FileDictionary, self.MetaDataDict, self.CacheIndexDict,)))
    self.ProcessLst[-1].daemon = True
    self.ProcessLst[-1].start()
I close them like this:
for i in range(len(self.ProcessLst)):  # for both while loops in the processes self.ProcessTerminateLst[i] must be True, so either the process is now ready to be terminated or it is still in idle mode
    self.ProcessTerminateLst[i] = False
self.IdleLock.set()  # makes sure no process is in idle mode and all are ready to be terminated
I would use a pool. A pool has a maximum number of workers it uses at the same time, but you can submit any number of jobs; they stay in a waiting list until a worker is available. I don't think you can change the number of worker processes in a pool once it has been created.
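As a rough sketch of that pattern applied to the question's setup (doWork reworked as a hypothetical per-line function and 'call_file.txt' a made-up filename), a thread pool with a fixed worker count will happily accept any number of jobs and work through them as workers free up:

from multiprocessing.pool import ThreadPool

def doWork(line):
    # placeholder for the real per-line work
    return len(line)

if __name__ == '__main__':
    pool = ThreadPool(10)                      # at most 10 lines are processed at once
    with open('call_file.txt') as call_file:
        results = pool.map(doWork, call_file)  # every line is queued; workers pull as they finish
    pool.close()
    pool.join()
    print("%d lines processed" % len(results))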

multiprocessing - reading big input data - program hangs

I want to run parallel computation on some input data which is loaded from a file. (The file can be really big, so I use a generator for this.)
Up to a certain number of items my code runs OK, but above this threshold the program hangs (some of the worker processes do not end).
Any suggestions? (I am running this with Python 2.7 on 8 CPUs; 5,000 lines is still OK, 7,500 does not work.)
Firstly, you need an input file. Generate it in bash:
for i in {0..10000}; do echo -e "$i"'\r' >> counter.txt; done
Then, run this:
python2.7 main.py 100 counter.txt > run_log.txt
main.py:
#!/usr/bin/python2.7

import os, sys, signal, time
import Queue
import multiprocessing as mp

def eat_queue(job_queue, result_queue):
    """Eats input queue, feeds output queue
    """
    proc_name = mp.current_process().name
    while True:
        try:
            job = job_queue.get(block=False)
            if job == None:
                print(proc_name + " DONE")
                return
            result_queue.put(execute(job))
        except Queue.Empty:
            pass

def execute(x):
    """Does the computation on the input data
    """
    return x*x

def save_result(result):
    """Saves results in a list
    """
    result_list.append(result)

def load(ifilename):
    """Generator reading the input file and
    yielding it row by row
    """
    ifile = open(ifilename, "r")
    for line in ifile:
        line = line.strip()
        num = int(line)
        yield (num)
    ifile.close()
    print("file closed".upper())

def put_tasks(job_queue, ifilename):
    """Feeds the job queue
    """
    for item in load(ifilename):
        job_queue.put(item)
    for _ in range(get_max_workers()):
        job_queue.put(None)

def get_max_workers():
    """Returns optimal number of processes to run
    """
    max_workers = mp.cpu_count() - 2
    if max_workers < 1:
        return 1
    return max_workers

def run(workers_num, ifilename):
    job_queue = mp.Queue()
    result_queue = mp.Queue()
    # decide how many processes are to be created
    max_workers = get_max_workers()
    print "processes available: %d" % max_workers
    if workers_num < 1 or workers_num > max_workers:
        workers_num = max_workers
    workers_list = []
    # a process for feeding job queue with the input file
    task_gen = mp.Process(target=put_tasks, name="task_gen",
                          args=(job_queue, ifilename))
    workers_list.append(task_gen)
    for i in range(workers_num):
        tmp = mp.Process(target=eat_queue, name="w%d" % (i+1),
                         args=(job_queue, result_queue))
        workers_list.append(tmp)
    for worker in workers_list:
        worker.start()
    for worker in workers_list:
        worker.join()
        print "worker %s finished!" % worker.name

if __name__ == '__main__':
    result_list = []
    args = sys.argv
    workers_num = int(args[1])
    ifilename = args[2]
    run(workers_num, ifilename)
This is because nothing in your code takes anything off result_queue. The behavior then depends on internal queue buffering details: if "not a lot" of data is waiting, everything appears fine, but if "a lot" of data is waiting, everything freezes. Not much more can be said, because it involves layers of internal magic ;-) But the docs do warn about it:
Warning
As mentioned above, if a child process has put items on a queue (and it has not used JoinableQueue.cancel_join_thread), then that process will not terminate until all buffered items have been flushed to the pipe.
This means that if you try joining that process you may get a deadlock unless you are sure that all items which have been put on the queue have been consumed. Similarly, if the child process is non-daemonic then the parent process may hang on exit when it tries to join all its non-daemonic children.
Note that a queue created using a manager does not have this issue. See Programming guidelines.
One easy way to repair that: First add
result_queue.put(None)
before eat_queue() returns. Then add:
count = 0
while count < workers_num:
    if result_queue.get() is None:
        count += 1
before the main program .join()s the workers. That drains the result queue, and everything shuts down cleanly then.
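Concretely, the drain loop goes into run() between starting and joining the workers. A rough sketch of that part of run() (here the non-None results are also appended to result_list, which the fragment above simply discards):

    for worker in workers_list:
        worker.start()
    # drain the result queue; each worker puts a final None when it is done
    count = 0
    while count < workers_num:
        result = result_queue.get()
        if result is None:
            count += 1
        else:
            result_list.append(result)
    for worker in workers_list:
        worker.join()
        print "worker %s finished!" % worker.name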
BTW, this code is pretty bizarre:
while True:
    try:
        job = job_queue.get(block=False)
        if job == None:
            print(proc_name + " DONE")
            return
        result_queue.put(execute(job))
    except Queue.Empty:
        pass
Why are you doing non-blocking get()? This turns into a CPU-hog "busy loop" so long as the queue is empty. The primary point of .get() is to supply an efficient way to wait for work to show up. So:
while True:
    job = job_queue.get()
    if job is None:
        print(proc_name + " DONE")
        break
    else:
        result_queue.put(execute(job))
result_queue.put(None)
does the same thing, but far more efficiently.
Queue size caution
You didn't ask about this, but let's cover it before it bites you ;-) By default, there is no bound on a Queue's size. If, e.g., you add a billion items to the Queue, it will demand enough RAM to hold a billion items. So if your producer(s) can generate work items faster than your consumer(s) can process them, memory use can get out of hand quickly.
Fortunately, that's easy to repair: specify a maximum queue size. For example,
job_queue = mp.Queue(maxsize=10*workers_num)
^^^^^^^^^^^^^^^^^^^^^^^
Then job_queue.put(some_work_item) will block until consumers reduce the size of the queue to less than the maximum. This way you can process enormous problems with a queue that requires trivial RAM.
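A tiny standalone sketch (made-up numbers, not part of the program above) of that blocking behaviour: with maxsize=2, the producer's put() calls stall whenever two items are already buffered and only proceed as the slow consumer drains them.

import multiprocessing as mp
import time

def slow_consumer(q):
    while True:
        item = q.get()
        if item is None:
            break
        time.sleep(1)              # simulate slow processing
        print("consumed", item)

if __name__ == '__main__':
    q = mp.Queue(maxsize=2)        # bounded queue
    c = mp.Process(target=slow_consumer, args=(q,))
    c.start()
    for i in range(5):
        q.put(i)                   # blocks while 2 items are already buffered
        print("produced", i)
    q.put(None)
    c.join()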
