I am new to Python multiprocessing and I want to understand why my code does not terminate (maybe a zombie process or a deadlock) and how to fix it. The createChain function also executes a for loop and returns a tuple: (value1, value2). Inside createChain there are calls to other functions. I don't think posting the createChain code will help, because inside that function I am not doing anything related to multiprocessing. I tried making the processes daemons, but it still didn't work. The strange thing is that if I decrease the value of maxChains, e.g. to 500 or 100, it works.
I just want the processes to do some heavy tasks and put the results into a data structure.
My version of Python is 2.7.
def createTable(chainsPerCore, q, chainLength):
    for chain in xrange(chainsPerCore):
        q.put(createChain(chainLength, chain))

def initTable():
    maxChains = 1000
    chainLength = 10000
    resultsQueue = JoinableQueue()
    numOfCores = cpu_count()
    chainsPerCore = maxChains / numOfCores
    processes = [Process(target=createTable, args=(chainsPerCore, resultsQueue, chainLength,)) for x in range(numOfCores)]
    for p in processes:
        # p.daemon = True
        p.start()
    # Wait for hashing cores to finish
    for p in processes:
        p.join()
    resultsQueue.task_done()
    temp = [resultsQueue.get() for p in processes]
    print temp
Based on the very useful comments of Tadhg McDonald-Jensen, I now understand my needs better, as well as how Queues work and what they should be used for.
I changed my code to:
def initTable(output):
    maxChains = 1000
    results = []
    with closing(Pool(processes=8)) as pool:
        results = pool.map(createChain, xrange(maxChains))
        pool.terminate()
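For anyone who wants to keep the Process/Queue design: the likely cause of the hang in the first version is that join() is called before the queue is drained. A process that has put items on a multiprocessing queue does not exit until its buffered items have been flushed to the underlying pipe, so with many results the children block and join() never returns, while a small maxChains fits in the pipe buffer and appears to work. The original also calls get() only once per process rather than once per result. Below is a minimal sketch written for Python 3; create_chain is a hypothetical stand-in for createChain, and the default of two worker processes is an assumption made for illustration.

```python
import multiprocessing as mp


def create_chain(chain_length, chain):
    # Hypothetical stand-in for createChain: returns a (value1, value2) tuple.
    return (chain, chain_length)


def create_table(chains_per_core, q, chain_length):
    # Each worker puts one result per chain on the shared queue.
    for chain in range(chains_per_core):
        q.put(create_chain(chain_length, chain))


def init_table(max_chains=1000, chain_length=10000, num_cores=2):
    results_queue = mp.Queue()
    chains_per_core = max_chains // num_cores
    processes = [
        mp.Process(target=create_table,
                   args=(chains_per_core, results_queue, chain_length))
        for _ in range(num_cores)
    ]
    for p in processes:
        p.start()
    # Drain the queue BEFORE joining: one get() per expected result.
    expected = chains_per_core * num_cores
    results = [results_queue.get() for _ in range(expected)]
    # The children have nothing left to flush, so join() returns.
    for p in processes:
        p.join()
    return results


if __name__ == "__main__":
    print(len(init_table()))
```

The results arrive in whatever order the workers produce them, so sort them or include an index in each tuple if the order matters.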
I'm used to multiprocessing, but now I have a problem where mp.Pool isn't the tool that I need.
I have a process that prepares input and another process that uses it. I'm not using up all of my cores, so I want to have the two go at the same time, with the first getting the batch ready for the next iteration. How do I do this? And (importantly) what is this sort of thing called, so that I can go and google it?
Here's a dummy example. The following code takes 8 seconds:
import time

def make_input():
    time.sleep(1)
    return "cthulhu r'lyeh wgah'nagl fhtagn"

def make_output(input):
    time.sleep(1)
    return input.upper()

start = time.time()
for i in range(4):
    input = make_input()
    output = make_output(input)
    print(output)
print(time.time() - start)
CTHULHU R'LYEH WGAH'NAGL FHTAGN
CTHULHU R'LYEH WGAH'NAGL FHTAGN
CTHULHU R'LYEH WGAH'NAGL FHTAGN
CTHULHU R'LYEH WGAH'NAGL FHTAGN
8.018263101577759
If I were preparing input batches at the same time as I was doing the output, it would take four seconds. Something like this:
next_input = make_input()
start = time.time()
for i in range(4):
    res = do_at_the_same_time(
        output = make_output(next_input),
        next_input = make_input()
    )
    print(output)
print(time.time() - start)
But, obviously, that doesn't work. How can I accomplish what I'm trying to accomplish?
Important note: I tried the following, but it failed because the executing worker was working in the wrong scope (like, for my actual use-case). In my dummy use-case, it doesn't work because it prints in a different process.
def proc(i):
    if i == 0:
        return make_input()
    if i == 1:
        return make_output(next_input)

next_input = make_input()
for i in range(4):
    pool = mp.Pool(2)
    next_input = pool.map(proc, [0, 1])[0]
    pool.close()
So I need a solution where the second process runs in the same scope or environment as the for loop, and where the output of the first can be retrieved from that scope.
You should be able to use Pool. If I understand correctly, you want one worker to prepare the input for the next worker, which then runs and does something more with it. Given your example functions, this should do just that:
pool = mp.Pool(2)
for i in range(4):
    next_input = pool.apply(make_input)
    pool.apply_async(make_output, (next_input, ), callback=print)
pool.close()
pool.join()
We prepare a pool with 2 workers, then run the loop to execute our pair of tasks four times.
We delegate make_input to a worker using apply(), wait for the function to complete, and assign the result to next_input. Note: in this example we could have used a single-worker pool and just run next_input = make_input() (i.e. in the same process your script runs in) and delegated only make_output().
Now the more interesting bit: by using apply_async() we ask a worker to run make_output, passing the single parameter next_input to it, and we tell the pool to call print (or any other function registered with callback) with the result of make_output as its argument.
Then we close() the pool so that it accepts no more jobs, and join() to wait for the processes to complete their jobs.
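If make_output has to run in the scope of the for loop, as the question asks, one option is to prefetch only the input in the background and keep the output step in the main loop. The keyword to search for is "pipelining" or "prefetching" (a producer/consumer pattern). Here is a minimal sketch for Python 3, assuming a single background worker from concurrent.futures; a thread is used to keep the example short, and swapping in ProcessPoolExecutor should work the same way when make_input is CPU-bound. run_pipeline and the delay parameter are names introduced for this illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_input(delay=0.05):
    time.sleep(delay)
    return "cthulhu r'lyeh wgah'nagl fhtagn"


def make_output(text, delay=0.05):
    time.sleep(delay)
    return text.upper()


def run_pipeline(n_batches, delay=0.05):
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as executor:
        # Start preparing the first batch in the background.
        pending = executor.submit(make_input, delay)
        for i in range(n_batches):
            current = pending.result()  # wait for the prepared batch
            if i + 1 < n_batches:
                # Prefetch the next batch while this one is processed.
                pending = executor.submit(make_input, delay)
            # make_output runs here, in the scope of the loop.
            outputs.append(make_output(current, delay))
    return outputs


if __name__ == "__main__":
    start = time.time()
    for line in run_pipeline(4):
        print(line)
    print(time.time() - start)
```

With equal delays for both steps, the loop should take roughly (n_batches + 1) delays instead of 2 * n_batches, because only the first input is not overlapped with an output.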
Is there a way to implement multithreading for multiple for loops within a single function? I am aware that it can be achieved with separate functions, but is it possible within the same function?
For example:
def sqImport():
    for i in (0,50):
        do something specific to 0-49
    for i in (50,100):
        do something specific to 50-99
    for i in (100,150):
        do something specific to 100-149
If there are 3 separate functions for 3 different for loops then we can do:
threadA = Thread(target = loopA)
threadB = Thread(target = loopB)
threadC = Thread(target = loopC)
threadA.run()
threadB.run()
threadC.run()
# Do work independent of loopA and loopB
threadA.join()
threadB.join()
threadC.join()
But is there a way to achieve this under a single function?
First of all: I think you really should take a look at multiprocessing.pool.ThreadPool if you are going to use this in a production system. What I describe below is just a possible workaround (which might be simpler and could therefore be used for testing purposes).
You could pass an id to the function and use that to decide which loop you take like so:
from threading import Thread

def sqImport(tId):
    if tId == 0:
        for i in range(0,50):
            print i
    elif tId == 1:
        for i in range(50,100):
            print i
    elif tId == 2:
        for i in range(100,150):
            print i

threadA = Thread(target = sqImport, args=[0])
threadB = Thread(target = sqImport, args=[1])
threadC = Thread(target = sqImport, args=[2])
threadA.start()
threadB.start()
threadC.start()
# Do work independent of loopA and loopB
threadA.join()
threadB.join()
threadC.join()
Note that I used start() instead of run(), because run() does not start a new thread but executes in the current thread. I also changed your for i in (x, y) loops into for i in range(x, y) loops, because I think you want to iterate over a range and not over a tuple (which would iterate only over x and y).
An alternative solution using multiprocessing.dummy might look like this:
from multiprocessing.dummy import Pool as ThreadPool

# The worker function
def sqImport(data):
    for i in data:
        print i

# The three ranges for the three different threads
ranges = [
    range(0, 50),
    range(50, 100),
    range(100, 150)
]
# Create a threadpool with 3 threads
pool = ThreadPool(3)
# Run sqImport() on all ranges
pool.map(sqImport, ranges)
pool.close()
pool.join()
You can use multiprocessing.pool.ThreadPool, which will divide your tasks among the running threads.
See "Threading pool similar to the multiprocessing Pool?" for more on this.
If you are really looking for parallel execution of CPU-bound work, go for processes, because threads are limited by Python's GIL (Global Interpreter Lock).
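The answers above only print from the worker; if each loop needs to hand a result back, the same "one function, one range per thread" idea can return values. This is a small sketch for Python 3 using concurrent.futures.ThreadPoolExecutor instead of the pools named above. sq_import and run_all are hypothetical names, and summing squares is a placeholder for the real per-range work.

```python
from concurrent.futures import ThreadPoolExecutor


def sq_import(bounds):
    # One function handles every range; the bounds select the work.
    start, stop = bounds
    return sum(i * i for i in range(start, stop))


def run_all(ranges):
    # One thread per range; map() returns results in input order.
    with ThreadPoolExecutor(max_workers=len(ranges)) as executor:
        return list(executor.map(sq_import, ranges))


totals = run_all([(0, 50), (50, 100), (100, 150)])
print(totals)
```

Because of the GIL this only pays off when the loop bodies wait on I/O (database imports, network calls); for pure computation a process pool is the better fit.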
I have a Producer process that runs and puts its results in a Queue.
I also have a Consumer function that takes the results from the Queue and processes them, for example:
def processFrame(Q,commandsFile):
    fr = Q.get()
    frameNum = fr[0]
    Frame = fr[1]
    #
    # Process the frame
    #
    commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes; their number should be set by the user:
processes = raw_input('Enter the number of process you want to use: ')
I tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
When I try this, it raises a RuntimeError: Queue objects should only be shared between processes through inheritance.
What does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
    mp = [Process(target=processFrame, args=(q,toFile)) for x in range(int(processes))]
    for p in mp:
        p.start()
    for p in mp:
        p.join()
This one seems to run, but not as expected.
It uses multiple processes on the same frame from the Queue. Doesn't Queue have locks?
Also, in this case the number of processes I'm allowed to use must divide the number of frames without a remainder. For example:
if I have 10 frames I can use only 1, 2, 5 or 10 processes. If I use 3, 4, ... it will create a process while the Queue is empty and won't work.
If you want to reuse the process until the queue is empty, you should try something like this:
code1:
def proccesframe():
    while True:
        frame = queue.get()
        # do something
Your process will block until there is something in the queue.
I don't think it's a good idea to use multiprocessing on the consumer side; you should use it on the producer.
If you want to terminate the process when the queue is empty, you can do something like this:
code2:
def proccesframe():
    while not queue.empty():
        frame = queue.get()
        # do something
    terminate_procces()
Update:
If you want to use multiprocessing in the consumer part, just start the processes in a simple loop and use code2; each process will then be able to exit once it has finished working through the queue.
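One caveat with code2: the multiprocessing documentation describes Queue.empty() as unreliable, so a consumer can see an empty queue while the producer is still working, or block forever in get() after another consumer took the last item. A common alternative is to put one sentinel value per consumer on the queue once production is finished. Below is a minimal sketch for Python 3 under that assumption; consume, run_consumers, and the upper() call standing in for the frame processing are all hypothetical. The queues are passed as Process arguments, which is the "inheritance" the RuntimeError in the question refers to, and results are returned to the parent so that only one process writes the file.

```python
import multiprocessing as mp

SENTINEL = None  # marks "no more frames" for one consumer


def consume(frames, results):
    while True:
        item = frames.get()  # blocks until something arrives
        if item is SENTINEL:
            break  # this consumer is done
        frame_num, frame = item
        # Placeholder for the real frame processing.
        results.put((frame_num, frame.upper()))


def run_consumers(frame_list, n_consumers):
    frames = mp.Queue()
    results = mp.Queue()
    workers = [mp.Process(target=consume, args=(frames, results))
               for _ in range(n_consumers)]
    for w in workers:
        w.start()
    for item in enumerate(frame_list):
        frames.put(item)
    for _ in workers:
        frames.put(SENTINEL)  # one sentinel per consumer
    # Collect every result before joining the workers.
    processed = [results.get() for _ in frame_list]
    for w in workers:
        w.join()
    return sorted(processed)


if __name__ == "__main__":
    print(run_consumers(["a", "b", "c", "d", "e"], 3))
```

With this layout the number of consumers does not have to divide the number of frames, since each consumer simply keeps pulling until it sees its sentinel.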
I am not entirely sure what you are trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its methods map or map_async?
from multiprocessing import Pool
from foo import bar # your function

if __name__ == "__main__":
    p = Pool(4) # your number of processes
    result = p.map_async(bar, [("arg #1", "arg #2"), ...])
    print result.get()
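result.get() returns a list with the results of your function, in the same order as the inputs (map_async preserves order even though the tasks may finish out of order), and you can use it however you wish.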
UPDATE
I think you should not use a queue and should be more straightforward:
from multiprocessing import Pool

def process_frame(fr): # PEP8 and see the difference in definition
    # magic
    return result # and result handling!

if __name__ == "__main__":
    p = Pool(4) # your number of processes
    results = p.map_async(process_frame, [fr_1, fr_2, ...])
    # Do not ever write or manipulate with files in parallel processes
    # if you are not 100% sure what you are doing!
    for result in results.get():
        commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time

def f(x):
    return x*x

def g(yr):
    with open("result.txt", "ab") as f:
        for y in yr:
            f.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # here you fetch new data and send it to process
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
This is an example of how to do it. I updated the algorithm to be "infinite"; it can only be stopped by an interrupt or a kill command from outside. You can also use apply_async, but it would slow down result handling (depending on the speed of processing).
I also tried keeping result.txt open for a long time in global scope, but every time it hit a deadlock.
How can I get the result from my process without using a pool?
(I want to keep an eye on the progress:
(print "\r",float(done)/total,"%",)
which can't be done using a pool as far as I know)
def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    jobs = []
    while argslist != []:
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function,args=(argslist.pop(),))
            jobs.append(p)
            p.start()
            done+=1
            print "\r",float(done)/total,"%",
    #get results here
    for job in jobs:
        job.get_my_result()???
The processes are really short (<0.5 seconds) but I have around 1 million of them.
I saw the thread "Can I get a return value from multiprocessing.Process?" and tried to reproduce it, but I couldn't make it work properly.
At your entire disposal for any further information.
This question may be considered as a duplicate but anyway here is the solution to my problem:
def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    result_queue = mp.Queue()
    jobs = []
    while argslist != [] and done<10 :
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function,args=(result_queue, argslist.pop(),))
            jobs.append(p)
            p.start()
            done+=1
            print "\r",float(done)/total,"%",
    #get results here
    res = [result_queue.get() for p in jobs]
    print res
and I had to change as well the
return function_result
into
result_queue.put(function_result)
The easiest way should be a queue that is passed as an argument to your function. The function puts its results into that queue, and later on you can iterate over the queue to collect all the results, or process each result as soon as it arrives. However, this only works when you can work with "unordered" results. See the Python documentation for details: Examples for Multiprocessing and Queues
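On the progress question: a pool can report progress too, because Pool.imap_unordered yields each result as soon as a worker finishes it, and with around a million very short tasks a pool also avoids starting one process per task. Below is a minimal sketch for Python 3; run_with_progress and square are hypothetical names, and the chunksize of 64 is an arbitrary assumption that would need tuning.

```python
import multiprocessing as mp
import sys


def square(x):
    # Placeholder for the real short-running task.
    return x * x


def run_with_progress(function, argslist, ncpu):
    total = len(argslist)
    results = []
    pool = mp.Pool(ncpu)
    try:
        # Results are yielded as they complete, in arbitrary order.
        finished = pool.imap_unordered(function, argslist, chunksize=64)
        for done, result in enumerate(finished, 1):
            results.append(result)
            sys.stderr.write("\r%5.1f%%" % (100.0 * done / total))
    finally:
        pool.close()
        pool.join()
    sys.stderr.write("\n")
    return results


if __name__ == "__main__":
    values = run_with_progress(square, list(range(1000)), 4)
    print(len(values))
```

As with the queue approach, the results come back unordered; use imap instead of imap_unordered if the input order has to be preserved.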
I am using multiprocessing.Pool() to speed up an "embarrassingly parallel" loop. I actually have a nested loop, and am using multiprocessing.Pool to speed up the inner loop. For example, without parallelizing the loop, my code would be as follows:
outer_array=[random_array1]
inner_array=[random_array2]
output=[empty_array]
for i in outer_array:
    for j in inner_array:
        output[j][i]=full_func(j,i)
With parallelizing:
import multiprocessing
from functools import partial

outer_array=[random_array1]
inner_array=[random_array2]
output=[empty_array]
for i in outer_array:
    partial_func=partial(full_func,arg=i)
    pool=multiprocessing.Pool()
    output[:][i]=pool.map(partial_func,inner_array)
    pool.close()
My main question is whether this is correct and I should be creating the multiprocessing.Pool() inside the loop, or whether I should instead create the pool outside the loop, i.e.:
pool=multiprocessing.Pool()
for i in outer_array:
    partial_func=partial(full_func,arg=i)
    output[:][i]=pool.map(partial_func,inner_array)
Also, I am not sure if I should include the line "pool.close()" at the end of each loop in the second example above; what would be the benefits of doing so?
Thanks!
Ideally, you should call the Pool() constructor exactly once - not over & over again. There are substantial overheads when creating worker processes, and you pay those costs every time you invoke Pool(). The processes created by a single Pool() call stay around! When they finish the work you've given to them in one part of the program, they stick around, waiting for more work to do.
As to Pool.close(), you should call that when - and only when - you're never going to submit more work to the Pool instance. So Pool.close() is typically called when the parallelizable part of your main program is finished. Then the worker processes will terminate when all work already assigned has completed.
It's also excellent practice to call Pool.join() to wait for the worker processes to terminate. Among other reasons, there's often no good way to report exceptions in parallelized code (exceptions occur in a context only vaguely related to what your main program is doing), and Pool.join() provides a synchronization point that can report some exceptions that occurred in worker processes that you'd otherwise never see.
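Put together, the advice above could look roughly like the following sketch (Python 3): the pool is created once, reused for every outer iteration, and closed and joined only after the last map() call. full_func here is a hypothetical stand-in with made-up arithmetic, compute_table is a name introduced for the illustration, and the results are stored per outer value in a dict rather than in the question's output array.

```python
import multiprocessing
from functools import partial


def full_func(j, i):
    # Hypothetical stand-in for the real computation.
    return j * 10 + i


def compute_table(outer, inner, processes=2):
    output = {}
    pool = multiprocessing.Pool(processes)  # created exactly once
    try:
        for i in outer:
            # The same worker processes are reused on every iteration.
            output[i] = pool.map(partial(full_func, i=i), inner)
    finally:
        pool.close()  # no more work will be submitted
        pool.join()   # wait for the workers to exit
    return output


if __name__ == "__main__":
    print(compute_table([0, 1], [1, 2, 3]))
```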
Have fun :-)
import itertools
import multiprocessing as mp

def job(params):
    a = params[0]
    b = params[1]
    return a*b

def multicore():
    a = range(1000)
    b = range(2000)
    paramlist = list(itertools.product(a,b))
    print(paramlist[0])
    pool = mp.Pool(processes = 4)
    res=pool.map(job, paramlist)
    for i in res:
        print(i)

if __name__=='__main__':
    multicore()
how about this?
import time
from pathos.parallel import stats
from pathos.parallel import ParallelPool as Pool
def work(x, y):
    return x * y
pool = Pool(5)
pool.ncpus = 4
pool.servers = ('localhost:5654',)
t1 = time.time()
results = pool.imap(work, range(1, 2), range(1, 11))
print("INFO: List is: %s" % list(results))
print(stats())
t2 = time.time()
print("TIMER: Function completed time is: %.5f" % (t2 - t1))