How can I get the result from my process without using a pool?
(I want to keep an eye on the progress:
print "\r", float(done)/total, "%",
which, as far as I know, can't be done using a pool.)
def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    jobs = []
    while argslist != []:
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function, args=(argslist.pop(),))
            jobs.append(p)
            p.start()
            done += 1
            print "\r", float(done)/total, "%",
    # get results here
    for job in jobs:
        job.get_my_result()???
The processes are really short (< 0.5 seconds each), but I have around 1 million of them.
I saw the thread Can I get a return value from multiprocessing.Process? and tried to reproduce it, but I couldn't make it work properly.
I'm happy to provide any further information.
This question may be considered a duplicate, but anyway, here is the solution to my problem:
def multiprocess(function, argslist, ncpu):
    total = len(argslist)
    done = 0
    result_queue = mp.Queue()
    jobs = []
    while argslist != [] and done < 10:
        if len(mp.active_children()) < ncpu:
            p = mp.Process(target=function, args=(result_queue, argslist.pop(),))
            jobs.append(p)
            p.start()
            done += 1
            print "\r", float(done)/total, "%",
    # get results here
    res = [result_queue.get() for p in jobs]
    print res
and I also had to change the
return function_result
into
result_queue.put(function_result)
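For clarity, a minimal sketch of what the worker looks like after that change (the name function and its body are placeholders, not the real worker):

def function(result_queue, arg):
    # ...the actual work on arg goes here (placeholder)...
    function_result = arg          # stand-in for the real result
    # instead of `return function_result`, push it onto the shared queue
    result_queue.put(function_result)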
The easiest way should be a queue that is passed as an argument to your function. The results of that function can be put into that queue, and later on you can iterate over the queue to collect all the results, or process them as soon as they arrive. However, this only works when you can work with "unordered" results. See the Python documentation for details: Examples for Multiprocessing and Queues
Related
I have 800 files with some data to process; it's enough that I want to use multiprocessing, but I don't think I'm doing it correctly.
Inside my main() function I'm trying to spin off one process for each file that needs processing (I'm guessing this is not a good idea because my computer won't be able to handle 800 concurrent processes, but I haven't gotten that far yet).
Here is my main():
manager = multiprocessing.Manager()
arr = manager.list()

def main():
    count = 0
    with open("loc.csv") as loc_file:
        locs = csv.reader(loc_file, delimiter=',')
        for loc in locs:
            if count != 0:
                process = multiprocessing.Process(target=sort_run, args=[loc])
                process.start()
                process.join()
            count += 1
And then my code that is the target of the process:
def sort_run(loc):
    start_time = time.time()
    sorted_list = sort_splits.sort_splits(loc[0])
    value = process_reads.count_coverage(sorted_list, loc[0])
    arr.append([loc[0], value])
I'm using multiprocessing.Manager() so that my processes can access the arr list properly. I received this error:
An attempt has been made to start a new process before the current
process has finished its bootstrapping phase.
I think what's happening is that the loop is too fast to spin off the processes correctly. Or maybe each process needs its own variable, not just "process = ...".
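For reference, the quoted error typically appears when new processes are started at import time without an if __name__ == '__main__': guard. A hedged sketch of that structure, using a bounded worker pool instead of one process per file (the pool size of 4 and the body of sort_run are placeholders):

import csv
import multiprocessing

def sort_run(loc):
    # placeholder for the real per-file processing shown above;
    # return the value instead of appending to a shared list
    return [loc[0], len(loc)]

def main():
    with open("loc.csv") as loc_file:
        rows = list(csv.reader(loc_file, delimiter=','))[1:]  # skip the header row, as the count != 0 check intended
    pool = multiprocessing.Pool(processes=4)   # bounded number of workers, not 800
    results = pool.map(sort_run, rows)         # one result per file, in input order
    pool.close()
    pool.join()
    return results

if __name__ == '__main__':   # the guard that avoids the bootstrapping error
    main()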
I am pulling 0.8 million records in one go (this is a one-time process) from MongoDB using pymongo and performing some operations on them.
My code looks like this:
procs = []
cnt = 0
for rec in cursor:  # cursor has 0.8 million rows
    print cnt
    cnt = cnt + 1
    url = rec['urlk']
    mkptid = rec['mkptid']
    cii = rec['cii']
    #self.process_single_layer(url, mkptid, cii)
    proc = Process(target=self.process_single_layer, args=(url, mkptid, cii))
    procs.append(proc)
    proc.start()

# complete the processes
for proc in procs:
    proc.join()
process_single_layer is a function that basically downloads URLs from the cloud and stores them locally.
Now the problem is that the downloading is slow, since each call has to hit a URL, and with this many records it takes 6 minutes just to process 1k rows.
To reduce the time I wanted to use multiprocessing, but it is hard to see any difference with the code above.
Please suggest how I can improve the performance in this scenario.
First of all, you need to count all the rows in your input and then spawn a fixed number of processes (ideally matching the number of your processor cores), to each of which you feed, via its own queue, roughly total_number_of_rows / number_of_cores rows. The idea behind this approach is that you split the processing of those rows between multiple processes, hence achieving parallelism.
A way to find out the number of cores dynamically is by doing:
import multiprocessing as mp
cores_count = mp.cpu_count()
A slight improvement, which avoids the initial row count, is to distribute the rows cyclically: create the list of queues and then apply a cycle iterator to it, putting each row on the next queue in turn.
A full example:
import queue
import multiprocessing as mp
import itertools as itools

cores_count = mp.cpu_count()

def dosomething(q):
    while True:
        try:
            row = q.get(timeout=5)
        except queue.Empty:
            break
        # ..do some processing here with the row
        pass

if __name__ == '__main__':
    processes = []
    queues = []
    # spawn the processes
    for i in range(cores_count):
        q = mp.Queue()
        queues.append(q)
        proc = mp.Process(target=dosomething, args=(q,))
        proc.start()
        processes.append(proc)

    queues_cycle = itools.cycle(queues)
    for row in cursor:  # cursor comes from the pymongo query in the question
        q = next(queues_cycle)
        q.put(row)

    # do the join after spawning all the processes
    for p in processes:
        p.join()
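As a design note, the 5-second timeout above is just one way to let workers know they are done. An alternative sketch (not from the original answer) pushes one sentinel value per worker once all real rows have been queued, so each worker exits deterministically:

import multiprocessing as mp

SENTINEL = None  # assumes None never appears as a real row

def dosomething(q):
    while True:
        row = q.get()           # block until something arrives
        if row is SENTINEL:     # producer signalled "no more rows"
            break
        # ..do some processing here with the row

if __name__ == '__main__':
    q = mp.Queue()
    worker = mp.Process(target=dosomething, args=(q,))
    worker.start()
    for row in [1, 2, 3]:       # stand-in for the real rows
        q.put(row)
    q.put(SENTINEL)             # one sentinel per worker
    worker.join()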
It's easier to use a pool in this scenario.
Queues are not necessary, as you don't need to communicate between your spawned processes. We can use Pool.map to distribute the workload.
Pool.imap or Pool.imap_unordered might be faster with a larger chunksize (see https://docs.python.org/3/library/multiprocessing.html#multiprocessing.pool.Pool.imap). You can also use Pool.starmap if you want to get rid of the tuple unpacking.
from multiprocessing import Pool

def process_single_layer(data):
    # unpack the tuple and do the processing
    url, mkptid, cii = data
    return "downloaded" + url

def get_urls():
    # replace this code: iterate over cursor and yield the necessary data as a tuple
    for rec in range(8):
        url = "url:" + str(rec)
        mkptid = "mkptid:" + str(rec)
        cii = "cii:" + str(rec)
        yield (url, mkptid, cii)

# you can come up with a suitable process count based on the number of CPUs.
with Pool(processes=4) as pool:
    print(pool.map(process_single_layer, get_urls()))
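To illustrate the imap_unordered and starmap variants mentioned above (the chunksize of 2 is an arbitrary illustration, not a tuned value):

from multiprocessing import Pool

def process_single_layer(url, mkptid, cii):
    # starmap unpacks each tuple into separate arguments for us
    return "downloaded " + url

def process_packed(data):
    # imap_unordered passes each item as a single argument
    url, mkptid, cii = data
    return "downloaded " + url

if __name__ == '__main__':
    data = [("url:" + str(i), "mkptid:" + str(i), "cii:" + str(i)) for i in range(8)]
    with Pool(processes=4) as pool:
        # starmap: no manual tuple unpacking inside the worker
        print(pool.starmap(process_single_layer, data))
        # imap_unordered: results arrive as each chunk finishes, in arbitrary order
        for result in pool.imap_unordered(process_packed, data, chunksize=2):
            print(result)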
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing

my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]

def worker(data):
    return data['list_num'] + data['my_num']

pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now, however, there's a wrinkle to my problem. Suppose I want to alternate between adding two numbers (instead of just one): around half the time I want to add my_num1, and the other half of the time I want to add my_num2. It doesn't matter which number gets added to which item in the list. However, the one requirement is that the different processes must never be adding the same number at the same time. What this boils down to, essentially (I think), is that I want to use the first number exclusively on Process 1 and the second number exclusively on Process 2, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]

def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you pass an initializer function which is executed in each worker before the actual given function is run.
You can use it together with a global variable to let your function know which process it is running in.
You probably want to control which number each process gets initially; you can use a Queue to tell each process which index to pick up.
This solution is not optimal but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomic get the process index

def function(value):
    print "I'm process %s" % process_number
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)

    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))
My PC is a dual core, so as you can see only Process-0 and Process-1 are used.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
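Applied to the question's two-number example, a hedged sketch of the same initializer idea might look like this (the names my_num1 and my_num2 come from the question; the pool size of 2 matches the two numbers, and the rest is illustrative):

import multiprocessing

my_number = None  # set once per worker by the initializer

def initializer(queue):
    global my_number
    my_number = queue.get()  # each worker takes a different number

def worker(list_num):
    # every call in this worker adds the same, worker-specific number
    return list_num + my_number

def main():
    my_num1, my_num2 = 5, 100
    queue = multiprocessing.Queue()
    queue.put(my_num1)
    queue.put(my_num2)

    pool = multiprocessing.Pool(processes=2, initializer=initializer, initargs=[queue])
    print(pool.map(worker, list(range(100))))

if __name__ == '__main__':
    main()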
I am new to Python multiprocessing and I want to understand why my code does not terminate (maybe a zombie or a deadlock) and how to fix it. The createChain function also executes a for loop and returns a tuple: (value1, value2). Inside createChain there are calls to other functions, but I don't think posting its code will help, because inside it I am not doing anything related to multiprocessing. I tried making the processes daemons, but it still didn't work. The strange thing is that if I decrease the value of maxChains, e.g. to 500 or 100, it works.
I just want the processes to do some heavy tasks and put the results into a data structure.
My version of Python is 2.7.
def createTable(chainsPerCore, q, chainLength):
    for chain in xrange(chainsPerCore):
        q.put(createChain(chainLength, chain))

def initTable():
    maxChains = 1000
    chainLength = 10000
    resultsQueue = JoinableQueue()
    numOfCores = cpu_count()
    chainsPerCore = maxChains / numOfCores

    processes = [Process(target=createTable, args=(chainsPerCore, resultsQueue, chainLength,)) for x in range(numOfCores)]

    for p in processes:
        # p.daemon = True
        p.start()

    # Wait for hashing cores to finish
    for p in processes:
        p.join()

    resultsQueue.task_done()
    temp = [resultsQueue.get() for p in processes]
    print temp
Based on the very useful comments of Tadhg McDonald-Jensen, I understood my needs better, as well as how Queues work and what they should be used for: joining the workers before draining the queue can deadlock, because each child process blocks until everything it has put on the queue has been consumed.
I changed my code to:
from contextlib import closing
from multiprocessing import Pool

def initTable(output):
    maxChains = 1000
    results = []
    with closing(Pool(processes=8)) as pool:
        results = pool.map(createChain, xrange(maxChains))  # createChain is defined elsewhere
    pool.terminate()
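For completeness, here is a hedged sketch of how the original Queue-based version could be made to terminate, using a plain Queue instead of JoinableQueue (the task_done bookkeeping isn't needed here) and assuming createChain as in the question. The key change is draining the queue before joining the workers, because a child process will not exit while data it has put on the queue is still waiting to be consumed:

from multiprocessing import Process, Queue, cpu_count

def createTable(chainsPerCore, q, chainLength):
    for chain in xrange(chainsPerCore):
        q.put(createChain(chainLength, chain))  # createChain defined elsewhere

def initTable():
    maxChains = 1000
    chainLength = 10000
    resultsQueue = Queue()
    numOfCores = cpu_count()
    chainsPerCore = maxChains / numOfCores

    processes = [Process(target=createTable, args=(chainsPerCore, resultsQueue, chainLength))
                 for x in range(numOfCores)]
    for p in processes:
        p.start()

    # drain the queue first: one result per chain, not one per process
    temp = [resultsQueue.get() for _ in range(chainsPerCore * numOfCores)]

    # only join once everything has been consumed
    for p in processes:
        p.join()

    print temp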
I have a producer process that runs and puts its results in a Queue.
I also have a consumer function that takes the results from the Queue and processes them, for example:
def processFrame(Q, commandsFile):
    fr = Q.get()
    frameNum = fr[0]
    Frame = fr[1]
    #
    # Process the frame
    #
    commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes; their number should be set by the user:
processes = raw_input('Enter the number of process you want to use: ')
I tried using a Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
When I try this, it returns a RuntimeError: Queue objects should only be shared between processes through inheritance.
What does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
    mp = [Process(target=processFrame, args=(q, toFile)) for x in range(int(processes))]
    for p in mp:
        p.start()
    for p in mp:
        p.join()
This one seems to run, but not as expected.
It uses multiple processes on the same frame from the Queue; doesn't the Queue have locks?
Also, in this case the number of processes I'm allowed to use must divide the number of frames without a remainder. For example:
if I have 10 frames I can only use 1, 2, 5, or 10 processes; if I use 3, 4, etc., a process will be created while the Queue is empty and it won't work.
If you want to reuse the process until the queue is empty, you should just try something like this:
code1:
def processframe():
    while True:
        frame = queue.get()
        ## do something
Your process will be blocked until there is something in the queue.
I don't think it's a good idea to use multiprocessing on the consumer part; you should use it on the producer.
If you want to terminate the process when the queue is empty, you can do something like this:
code2:
def processframe():
    while not queue.empty():
        frame = queue.get()
        ## do something
    terminate_process()
Update:
If you want to use multiprocessing in the consumer part, just do a simple loop and add code2; then you will be able to close your processes when you finish doing stuff with the queue.
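To make the update concrete, here is a hedged sketch of several consumer processes draining one queue. Instead of checking queue.empty() before each get (which can race between consumers), each consumer uses a short get timeout and stops once the queue stays empty. The frame values and the process count are placeholders:

from multiprocessing import Process, Queue
from Queue import Empty  # Python 2; on Python 3 this is "from queue import Empty"

def process_frame(q):
    while True:
        try:
            frame = q.get(timeout=1)  # stop once the queue stays empty for a second
        except Empty:
            break
        # ..do something with frame here

if __name__ == '__main__':
    q = Queue()
    for frame in range(10):           # stand-in for the producer's real frames
        q.put(frame)

    n_processes = 4                   # would come from the user's raw_input
    consumers = [Process(target=process_frame, args=(q,)) for _ in range(n_processes)]
    for p in consumers:
        p.start()
    for p in consumers:
        p.join()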
I am not entirely sure what you are trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its map or map_async methods?
from multiprocessing import Pool
from foo import bar  # your function

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    result = p.map_async(bar, [("arg #1", "arg #2"), ...])
    print result.get()
It collects the results from your function into a single iterable, and you can use them however you wish.
UPDATE
I think you should not use a queue and be more straightforward:
from multiprocessing import Pool

def process_frame(fr):  # PEP8, and see the difference in definition
    # magic
    return result  # and result handling!

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    results = p.map_async(process_frame, [fr_1, fr_2, ...])
    # Do not ever write or manipulate with files in parallel processes
    # if you are not 100% sure what you are doing!
    for result in results.get():
        commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time

def f(x):
    return x*x

def g(yr):
    with open("result.txt", "ab") as f:
        for y in yr:
            f.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # here you fetch new data and send it to process
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
This is an example of how to do it; I updated the algorithm to be "infinite", so it can only be stopped by an interrupt or a kill command from outside. You can also use apply_async, but it would cause slowdowns in result handling (depending on the speed of processing).
I also tried keeping result.txt open for a long time in global scope, but it hit a deadlock every time.
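For comparison, a minimal sketch of the apply_async variant mentioned above, using the same toy worker f; each call submits one task and returns an AsyncResult whose .get() blocks until that particular result is ready:

from multiprocessing import Pool
import random

def f(x):
    return x * x

if __name__ == '__main__':
    pool = Pool(4)
    new_data = [random.randint(1, 50) for i in range(4)]
    async_results = [pool.apply_async(f, (x,)) for x in new_data]  # one AsyncResult per task
    print [r.get() for r in async_results]  # .get() blocks per result, hence the slowdown mentioned above
    pool.close()
    pool.join()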