I am trying to use multiprocessing to process a very large number of files.
I put the list of files into a queue and had 3 workers split the load through a shared Queue. However, this does not seem to work. I am probably misunderstanding how the queue in the multiprocessing package works.
Below is the example source code:
import multiprocessing
from multiprocessing import Queue

def worker(i, qu):
    """worker function"""
    while ~qu.empty():
        val=qu.get()
        print 'Worker:',i, ' start with file:',val
        j=1
        for k in range(i*10000,(i+1)*10000): # some time consuming process
            for j in range(i*10000,(i+1)*10000):
                j=j+k
        print 'Worker:',i, ' end with file:',val

if __name__ == '__main__':
    jobs = []
    qu=Queue()
    for j in range(100,110): # files numbers are from 100 to 110
        qu.put(j)
    for i in range(3): # 3 multiprocess
        p = multiprocessing.Process(target=worker, args=(i,qu))
        jobs.append(p)
        p.start()
    p.join()
Thanks for the comments.
I have since learned that using Pool is the best solution:
import multiprocessing
import time

def worker(val):
    """worker function"""
    print 'Worker: start with file:',val
    time.sleep(1.1)
    print 'Worker: end with file:',val

if __name__ == '__main__':
    file_list=range(100,110)
    p = multiprocessing.Pool(2)
    p.map(worker, file_list)
Three issues:
1) You are joining only the 3rd process.
2) Why not use multiprocessing.Pool?
3) There is a race condition between qu.empty() and qu.get().
For 1) and 3):
import multiprocessing
from multiprocessing import Queue
from Queue import Empty # Python 2 module name; on Python 3 use "from queue import Empty"

def worker(i, qu):
    """worker function"""
    while 1:
        try:
            val=qu.get(timeout=1) # wait up to 1 second (an arbitrary choice) for an item
        except Empty: break # Yay no race condition
        print 'Worker:',i, ' start with file:',val
        j=1
        for k in range(i*10000,(i+1)*10000): # some time consuming process
            for j in range(i*10000,(i+1)*10000):
                j=j+k
        print 'Worker:',i, ' end with file:',val

if __name__ == '__main__':
    jobs = []
    qu=Queue()
    for j in range(100,110): # files numbers are from 100 to 110
        qu.put(j)
    for i in range(3): # 3 multiprocess
        p = multiprocessing.Process(target=worker, args=(i,qu))
        jobs.append(p)
        p.start()
    for p in jobs: #<--- join on all processes ...
        p.join()
2) For how to use the Pool, see:
https://docs.python.org/2/library/multiprocessing.html
You are joining only the last of the processes you created. That means that if the first or second process is still working when the third has finished, the main process carries on without waiting for them, so anything that runs after the join may start before all the work is done.
You should join them all in order to wait until every one of them has finished:
for p in jobs:
    p.join()
Another thing is you should consider using qu.get_nowait() in order to get rid of the race condition between qu.empty() and qu.get().
For example:
from Queue import Empty # Python 2; on Python 3 use "from queue import Empty"

try:
    while 1:
        message = qu.get_nowait()
        # do something fancy here
except Empty:
    pass
I hope that helps
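As a further option, the race between qu.empty() and qu.get() can be sidestepped by putting one sentinel value per worker on the queue after the real work items, so that every worker simply blocks on get() until it is told to stop. The following is only a minimal sketch in Python 3 syntax, not the asker's actual program: the names SENTINEL and process_file are hypothetical, and process_file stands in for the real per-file work.

```python
import multiprocessing

SENTINEL = None  # hypothetical marker meaning "no more work"


def process_file(val):
    """Placeholder for the real, time-consuming work on one file."""
    return val * 2


def worker(i, qu):
    """Handle items until the sentinel arrives; get() blocks, so empty() is never needed."""
    handled = []
    for val in iter(qu.get, SENTINEL):
        print('Worker:', i, 'start with file:', val)
        process_file(val)
        print('Worker:', i, 'end with file:', val)
        handled.append(val)
    # The return value is ignored by Process; it is only useful
    # when worker() is called directly, e.g. in a test.
    return handled


if __name__ == '__main__':
    n_workers = 3
    qu = multiprocessing.Queue()
    for val in range(100, 110):
        qu.put(val)
    for _ in range(n_workers):
        qu.put(SENTINEL)  # one sentinel per worker
    jobs = [multiprocessing.Process(target=worker, args=(i, qu))
            for i in range(n_workers)]
    for p in jobs:
        p.start()
    for p in jobs:  # join every worker, not just the last one
        p.join()
```

Because each worker consumes exactly one sentinel, all three stop cleanly no matter how the files happen to be distributed among them.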
I am following some examples online to learn how to program in parallel, i.e., how to use multiprocessing.
I am running on Windows 10, with Spyder 3.3.6 and Python 3.7.
import os
import time
from multiprocessing import Process, Queue

def square(numbers, queue):
    print("started square")
    for i in numbers:
        queue.put(i*i)
        print(i*i)
    print(f"{os.getpid()}")

def cube(numbers, queue):
    print("started cube")
    for i in numbers:
        queue.put(i*i*i)
        print(i*i*i)
    print(f"{os.getpid()}")

if __name__ == '__main__':
    numbers = range(5)
    queue = Queue()
    square_process = Process(target=square, args=(numbers,queue))
    cube_process = Process(target=cube, args=(numbers,queue))
    square_process.start()
    cube_process.start()
    square_process.join()
    cube_process.join()
    print("Already joined")
    while not queue.empty():
        print(queue.get())
I expected the output from the queue to be mixed or unpredictable, since it depends on how quickly each process starts and how quickly the first process finishes its statements.
In theory, we could get something like 0, 1, 4, 8, 9, 27, 16, 64.
But the actual output is sequential, as shown below:
0
1
4
9
16
0
1
8
27
64
There are a few things to understand here:
The two processes execute the square and cube functions independently. Within each function the order is preserved, because it is governed by the for loop.
The only part that is unpredictable at any point in time is which process is running and what it is adding to the queue. So the square process may be in its 5th iteration (i = 4) while the cube process is in its 2nd iteration (i = 1).
You are using a single Queue instance to collect items from two processes that run the square and cube functions separately. Queues are first in, first out (FIFO), so when you get from the Queue (and print in the main process) the items come out in the order in which the queue received them.
Run the following updated version of your program to see this more clearly:
import os
import time
from multiprocessing import Process, Queue

def square(numbers, queue):
    print("started square process id is %s"%os.getpid())
    for i in numbers:
        queue.put("Square of %s is %s "%(i, i*i))
        print("square: added %s in queue:"%i)

def cube(numbers, queue):
    print("started cube process id is %s"%os.getpid())
    for i in numbers:
        queue.put("Cube of %s is %s "%(i, i*i*i))
        print("cube: added %s in queue:"%i)

if __name__ == '__main__':
    numbers = range(15)
    queue = Queue()
    square_process = Process(target=square, args=(numbers,queue))
    cube_process = Process(target=cube, args=(numbers,queue))
    square_process.start()
    cube_process.start()
    square_process.join()
    cube_process.join()
    print("Already joined")
    while not queue.empty():
        print(queue.get())
I'm pretty sure this is just because spinning up a process takes some time, so the processes tend to run one after the other.
I rewrote it to give the jobs a better chance of running in parallel:
from multiprocessing import Process, Queue
from time import time, sleep

def fn(queue, offset, start_time):
    # Wait for the shared start time; never pass a negative value to sleep(),
    # which can happen if starting the process took longer than expected.
    sleep(max(0, start_time - time()))
    for i in range(10):
        queue.put(offset + i)

if __name__ == '__main__':
    queue = Queue()
    start_time = time() + 0.1
    procs = []
    for i in range(2):
        args = (queue, i * 10, start_time)
        procs.append(Process(target=fn, args=args))
    for p in procs: p.start()
    for p in procs: p.join()
    while not queue.empty():
        print(queue.get())
I should note that I get nondeterministic ordering of the output, as you seemed to be expecting. I'm on Linux, so you might get something different on Windows, but I think that's unlikely.
Looks like MisterMiyagi is right. Starting an additional Python process is much more expensive than calculating the squares from 0 to 4 :) I've created a version of the code with a lock primitive, so now we can be sure that the processes start working at the same time.
import os
from multiprocessing import Process, Queue, Lock

def square(numbers, queue, lock):
    print("started square")
    # Block here, until lock release
    lock.acquire()
    for i in numbers:
        queue.put(i*i)
    print(f"{os.getpid()}")

def cube(numbers, queue, lock):
    # Finally release lock
    lock.release()
    print("started cube")
    for i in numbers:
        queue.put(i*i*i)
    print(f"{os.getpid()}")

if __name__ == '__main__':
    numbers = range(5)
    queue = Queue()
    lock = Lock()
    # Activate lock
    lock.acquire()
    square_process = Process(target=square, args=(numbers,queue,lock))
    cube_process = Process(target=cube, args=(numbers,queue,lock))
    square_process.start()
    cube_process.start()
    cube_process.join()
    square_process.join()
    print("Already joined")
    while not queue.empty():
        print(queue.get())
My output is:
0
0
1
4
1
9
8
16
27
64
The processes themselves are not doing anything CPU-heavy or network-bound, so they take a negligible amount of time to execute. My guess is that by the time the second process has started, the first one has already finished. Processes are parallel by nature, but because your tasks are so trivial it gives the illusion that they run sequentially. You can introduce some randomness into your script to see the parallelism in action:
import os
from multiprocessing import Process, Queue
from random import randint
from time import sleep

def square(numbers, queue):
    print("started square")
    for i in numbers:
        if randint(0,1000)%2==0:
            sleep(3)
        queue.put(i*i)
        print(i*i)
    print(f"square PID : {os.getpid()}")

def cube(numbers, queue):
    print("started cube")
    for i in numbers:
        if randint(0,1000)%2==0:
            sleep(3)
        queue.put(i*i*i)
        print(i*i*i)
    print(f"cube PID : {os.getpid()}")

if __name__ == '__main__':
    numbers = range(5)
    queue = Queue()
    square_process = Process(target=square, args=(numbers,queue))
    cube_process = Process(target=cube, args=(numbers,queue))
    square_process.start()
    cube_process.start()
    square_process.join()
    cube_process.join()
    print("Already joined")
    while not queue.empty():
        print(queue.get())
Here the two processes randomly pause their execution, so while one process is paused the other gets a chance to add a number to the queue (multiprocessing.Queue is thread and process safe). If you run this script a couple of times, you'll see that the order of the items in the queue is not always the same.
I am following a YouTube tutorial to learn multiprocessing:
from multiprocessing import Pool
import subprocess
import time

def f(n):
    sum = 0
    for x in range(1000):
        sum += x*x
    return sum

if __name__ == "__main__":
    t1 = time.time()
    p = Pool()
    result = p.map(f, range(10000))
    p.close()
    p.join()
    print("Pool took: ", time.time()-t1)
I am puzzled about p.close() and p.join().
Once the processes have been closed they no longer exist, so how can .join() be called on them?
join() waits for the worker processes to exit. close() doesn't kill any process; it only tells the pool that no more tasks will be submitted, so the workers exit once the outstanding work is finished. That is why close() (or terminate()) has to be called before join().
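The order of the two calls can be illustrated with a small sketch. It reuses the toy workload f from the question (with the accumulator renamed so it does not shadow the built-in sum); the helper name run_pool is hypothetical. As far as I know, calling join() on a pool that has been neither closed nor terminated raises an error (ValueError in recent Python 3 releases), which is why close() comes first.

```python
from multiprocessing import Pool


def f(n):
    """Same toy workload as in the question."""
    total = 0
    for x in range(1000):
        total += x * x
    return total


def run_pool(n_tasks, processes=2):
    """Hypothetical helper: map f over n_tasks inputs, then shut the pool down."""
    p = Pool(processes)
    try:
        result = p.map(f, range(n_tasks))  # blocks until all results are in
    finally:
        p.close()  # no more tasks will be submitted; workers exit when idle
        p.join()   # wait for the worker processes to actually exit
    return result


if __name__ == "__main__":
    print(len(run_pool(100)))
```

By the time map() returns, the results are already available, so close() and join() here only make the shutdown of the workers explicit and orderly.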
I know the basic usage of multiprocessing pools, and I use apply_async() to avoid blocking. My problem code looks like this:
from multiprocessing import Pool, Queue
import time

q = Queue(maxsize=20)
script = "my_path/my_exec_file"

def initQueue():
    ...

def test_func(queue):
    print 'Coming'
    while True:
        do_sth
        ...

if __name__ == '__main__':
    initQueue()
    pool = Pool(processes=3)
    for i in xrange(11,20):
        result = pool.apply_async(test_func, (q,))
    pool.close()
    while True:
        if q.empty():
            print 'Queue is empty, quit'
            break
        print 'Main Process Listening'
        time.sleep(2)
The output is always 'Main Process Listening'; the word 'Coming' never appears.
The code above raises no syntax error and no exception.
Can anyone help? Thanks!
I'm optimizing the parameters of a complex simulation. I'm using the multiprocessing module to improve the performance of the optimization algorithm. I learned the basics of multiprocessing at http://pymotw.com/2/multiprocessing/basics.html.
The simulation takes a varying amount of time depending on the parameters supplied by the optimization algorithm, around 1 to 5 minutes. If the parameters are chosen very badly, the simulation can last 30 minutes or more and the results are not useful. So I was thinking about building a timeout into the multiprocessing code that terminates all simulations that last longer than a defined time. Here is an abstracted version of the problem:
import numpy as np
import time
import multiprocessing

def worker(num):
    time.sleep(np.random.random()*20)

def main():
    pnum = 10
    procs = []
    for i in range(pnum):
        p = multiprocessing.Process(target=worker, args=(i,), name = ('process_' + str(i+1)))
        procs.append(p)
        p.start()
        print('starting', p.name)
    for p in procs:
        p.join(5)
        print('stopping', p.name)

if __name__ == "__main__":
    main()
The line p.join(5) defines a timeout of 5 seconds. Because of the loop for p in procs:, the program waits up to 5 seconds for the first process to finish, then up to another 5 seconds for the second process, and so on. What I want instead is for the program to terminate all processes that run for more than 5 seconds. Additionally, if none of the processes lasts longer than 5 seconds, the program must not wait for the full 5 seconds.
You can do this by creating a loop that will wait for some timeout amount of seconds, frequently checking to see if all processes are finished. If they don't all finish in the allotted amount of time, then terminate all of the processes:
TIMEOUT = 5
start = time.time()
while time.time() - start <= TIMEOUT:
    if not any(p.is_alive() for p in procs):
        # All the processes are done, break now.
        break
    time.sleep(.1)  # Just to avoid hogging the CPU
else:
    # We only enter this if we didn't 'break' above.
    print("timed out, killing all processes")
    for p in procs:
        p.terminate()
        p.join()
If you want to kill all the processes, you could use the Pool from multiprocessing.
You'll need to define one general timeout for the whole execution, as opposed to individual timeouts.
import numpy as np
import time
from multiprocessing import Pool

def worker(num):
    xtime = np.random.random()*20
    time.sleep(xtime)
    return xtime

def main():
    pnum = 10
    pool = Pool()
    args = range(pnum)
    pool_result = pool.map_async(worker, args)
    # wait 5 minutes for every worker to finish
    pool_result.wait(timeout=300)
    # once the timeout has finished we can try to get the results
    if pool_result.ready():
        print(pool_result.get(timeout=1))

if __name__ == "__main__":
    main()
This will get you a list with the return values for all your workers in order.
More information here:
https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.pool
Thanks to the help of dano I found a solution:
import numpy as np
import time
import multiprocessing

def worker(num):
    time.sleep(np.random.random()*20)

def main():
    pnum = 10
    TIMEOUT = 5
    procs = []
    bool_list = [True]*pnum
    for i in range(pnum):
        p = multiprocessing.Process(target=worker, args=(i,), name = ('process_' + str(i+1)))
        procs.append(p)
        p.start()
        print('starting', p.name)
    start = time.time()
    while time.time() - start <= TIMEOUT:
        for i in range(pnum):
            bool_list[i] = procs[i].is_alive()
        print(bool_list)
        if np.any(bool_list):
            time.sleep(.1)
        else:
            break
    else:
        print("timed out, killing all processes")
        for p in procs:
            p.terminate()
    for p in procs:
        print('stopping', p.name,'=', p.is_alive())
        p.join()

if __name__ == "__main__":
    main()
It's not the most elegant way; I'm sure there is a better approach than using bool_list. Processes that are still alive after the 5-second timeout will be killed. If you set shorter times in the worker function than the timeout, you will see that the program stops before the 5-second timeout is reached. I'm still open to more elegant solutions if there are any :)
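One possibly tidier variant, offered as a sketch rather than a definitive answer, gives every join() call only the time remaining until a shared deadline, so the total wait never exceeds the timeout and the loop returns early when all processes finish sooner. The helper name join_all and the short sleep times in the demo are assumptions made for illustration.

```python
import multiprocessing
import time


def join_all(procs, timeout):
    """Wait at most `timeout` seconds in total, then terminate stragglers.

    Returns the list of processes that had to be terminated.
    """
    deadline = time.time() + timeout
    for p in procs:
        remaining = deadline - time.time()
        p.join(max(0, remaining))  # returns at once if p has already finished
    killed = [p for p in procs if p.is_alive()]
    for p in killed:
        p.terminate()
    for p in killed:
        p.join()  # reap the terminated process
    return killed


def worker(seconds):
    """Stand-in for the simulation: just sleep for the given time."""
    time.sleep(seconds)


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker, args=(s,))
             for s in (0.1, 0.2, 5)]
    for p in procs:
        p.start()
    killed = join_all(procs, timeout=1)
    print(len(killed), "process(es) terminated")
```

No polling loop or bool_list is needed, because join() itself does the waiting and is_alive() is checked only once, after the deadline has passed or every process has been joined.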
I'm facing problems with the following example code:
from multiprocessing import Lock, Process, Queue, current_process

def worker(work_queue, done_queue):
    for item in iter(work_queue.get, 'STOP'):
        print("adding ", item, "to done queue")
        #this works: done_queue.put(item*10)
        done_queue.put(item*1000) #this doesnt!
    return True

def main():
    workers = 4
    work_queue = Queue()
    done_queue = Queue()
    processes = []
    for x in range(10):
        work_queue.put("hi"+str(x))
    for w in range(workers):
        p = Process(target=worker, args=(work_queue, done_queue))
        p.start()
        processes.append(p)
        work_queue.put('STOP')
    for p in processes:
        p.join()
    done_queue.put('STOP')
    for item in iter(done_queue.get, 'STOP'):
        print(item)

if __name__ == '__main__':
    main()
When the done queue becomes big enough (a limit of about 64k, I think), the whole thing freezes without any further notice.
What is the general approach for such a situation, when the queue becomes too big? Is there some way to remove elements on the fly once they are processed? The Python docs recommend removing the p.join(), but in a real application I cannot estimate when the processes have finished. Is there a simple solution for this problem besides infinite looping and using .get_nowait()?
This works for me with 3.4.0alpha4, 3.3, 3.2, 3.1 and 2.6. It tracebacks with 2.7 and 3.0. I pylint'd it, BTW.
#!/usr/local/cpython-3.3/bin/python
'''SSCCE for a queue deadlock'''
import sys
import multiprocessing

def worker(workerno, work_queue, done_queue):
    '''Worker function'''
    #reps = 10 # this worked for the OP
    #reps = 1000 # this worked for me
    reps = 10000 # this didn't
    for item in iter(work_queue.get, 'STOP'):
        print("adding", item, "to done queue")
        #this works: done_queue.put(item*10)
        for thing in item * reps:
            #print('workerno: {}, adding thing {}'.format(workerno, thing))
            done_queue.put(thing)
    done_queue.put('STOP')
    print('workerno: {0}, exited loop'.format(workerno))
    return True

def main():
    '''main function'''
    workers = 4
    work_queue = multiprocessing.Queue(maxsize=0)
    done_queue = multiprocessing.Queue(maxsize=0)
    processes = []
    for integer in range(10):
        work_queue.put("hi"+str(integer))
    for workerno in range(workers):
        dummy = workerno
        process = multiprocessing.Process(target=worker, args=(workerno, work_queue, done_queue))
        process.start()
        processes.append(process)
        work_queue.put('STOP')
    itemno = 0
    stops = 0
    while True:
        item = done_queue.get()
        itemno += 1
        sys.stdout.write('itemno {0}\r'.format(itemno))
        if item == 'STOP':
            stops += 1
            if stops == workers:
                break
    print('exited done_queue empty loop')
    for workerno, process in enumerate(processes):
        print('attempting process.join() of workerno {0}'.format(workerno))
        process.join()
    done_queue.put('STOP')

if __name__ == '__main__':
    main()
HTH
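The reason for reading the done queue before joining, as far as I understand the multiprocessing documentation, is that a process which has put items on a Queue does not exit until its buffered items have been flushed to the underlying pipe, so joining it before the queue has been read can deadlock. A compact sketch of the same idea follows, under the assumption that the parent knows how many results to expect; the helper name collect and the variable n_items are illustrative only.

```python
from multiprocessing import Process, Queue


def worker(work_queue, done_queue):
    """Repeat each work item 1000 times and pass it on until 'STOP' arrives."""
    for item in iter(work_queue.get, 'STOP'):
        done_queue.put(item * 1000)


def collect(done_queue, n_items):
    """Read exactly n_items results; call this BEFORE joining the workers."""
    return [done_queue.get() for _ in range(n_items)]


if __name__ == '__main__':
    workers = 4
    n_items = 10
    work_queue, done_queue = Queue(), Queue()
    for x in range(n_items):
        work_queue.put("hi" + str(x))
    processes = [Process(target=worker, args=(work_queue, done_queue))
                 for _ in range(workers)]
    for p in processes:
        p.start()
        work_queue.put('STOP')  # one stop marker per worker
    results = collect(done_queue, n_items)  # drain first ...
    for p in processes:
        p.join()                            # ... then join
    print(len(results), "results received")
```

When the number of results is not known in advance, counting one 'STOP' marker per worker on the done queue, as in the answer above, serves the same purpose.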