I've got a list of links in my program, and I want to visit these links and collect some data from them. I also have three other, initially empty, lists which I append to depending on the result. I don't know how to apply multiprocessing to this task.
All processes should read from the main list and write (append) to one of the other three lists.
How do I distribute the links between processes?
Is multiprocessing even suitable for this task? If yes, which method should be used?
import os
from multiprocessing import Process

def remove_redirect():
    main_list = get_data_from_db()
    empty_list_1 = []
    empty_list_2 = []
    empty_list_3 = []
    for link in main_list:
        req_result = manage_requests(link)
        if isinstance(req_result, bool):
            pass
        else:
            try:
                if req_result.find('div', class_="errorName"):
                    empty_list_1.append(link)
                elif req_result.find('section', class_='filter'):
                    empty_list_2.append(link)
                elif req_result.find('div', class_="proStatus"):
                    empty_list_3.append(link)
            except Exception as error:
                print(error)

if __name__ == "__main__":
    processes = []
    for i in range(os.cpu_count()):
        print('Registering process %d' % i)
        processes.append(Process(target=remove_redirect))
    for process in processes:
        process.start()
    for process in processes:
        process.join()
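Note that the __main__ block above starts os.cpu_count() processes that each run remove_redirect over the whole list, so the work is duplicated rather than divided. Below is a minimal sketch of one way it could actually be divided, assuming the get_data_from_db and manage_requests helpers from the question: let a Pool hand out the links, have each worker return a tag for its link, and do all of the appending in the parent process so the three lists never need to be shared between processes.

import os
from multiprocessing import Pool

def classify(link):
    # Runs in a worker process: fetch the page and return (tag, link).
    req_result = manage_requests(link)    # assumed helper from the question
    if isinstance(req_result, bool):
        return None
    if req_result.find('div', class_="errorName"):
        return ('error', link)
    if req_result.find('section', class_='filter'):
        return ('filter', link)
    if req_result.find('div', class_="proStatus"):
        return ('pro', link)
    return None

if __name__ == "__main__":
    main_list = get_data_from_db()        # assumed helper from the question
    error_links, filter_links, pro_links = [], [], []
    buckets = {'error': error_links, 'filter': filter_links, 'pro': pro_links}
    with Pool(os.cpu_count()) as pool:
        # imap_unordered hands each worker one link at a time and yields
        # results as soon as they are ready.
        for result in pool.imap_unordered(classify, main_list):
            if result is not None:
                tag, link = result
                buckets[tag].append(link)

Since the work here is mostly waiting on network I/O, a thread pool (multiprocessing.dummy.Pool or concurrent.futures.ThreadPoolExecutor) would likely serve just as well with the same interface.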
I'm wondering if there can be a sort of deadlock in the following code. I have to read each element of a database (about 1 million items), process it, then collect the results in a single file.
I've parallelized the execution with multiprocessing, using two Queues and three types of processes:
Reader: the main process, which reads the database and adds the read items to a task_queue
Worker: a pool of processes. Each worker gets an item from task_queue, processes the item, saves the result in an intermediate file stored at item_name/item_name.txt, and puts the item_name in a completed_queue
Writer: a process which gets an item_name from completed_queue, reads the intermediate result from item_name/item_name.txt and writes it to results.txt
from multiprocessing import Pool, Process, Queue

class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self,):
        pool = Pool(n_cpus, self.worker, args=())
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        pool.close()
        pool.join()
        self.completed_queue.put("DONE")
        writer.join()
The code works, but it seems that sometimes the writer or the pool stops working (or they are very slow). Is a deadlock possible in this scenario?
There are a couple of issues with your code. First, by using the queues as you are, you are in effect creating your own process pool and have no need for the multiprocessing.Pool class at all. You are using a pool initializer as an actual pool worker, which is a bit of a misuse of this class; you would be better off just using regular Process instances (my opinion, anyway).
Second, although it is well and good that you are putting the message DONE on the completed queue to signal writer_process to terminate, you have not done the same for the self.n_cpus worker processes, which are looking for 'STOP' messages; therefore the reader function needs to put self.n_cpus STOP messages on the task queue:
from multiprocessing import Process, Queue

class Computation():

    def __init__(self, K):
        self.task_queue = Queue()
        self.completed_queue = Queue()
        self.n_cpus = K

    def reader(self,):
        with open(db, "r") as db:
            ... # Read an item
            self.task_queue.put(item)
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            self.task_queue.put('STOP')

    def worker(self,):
        while True:
            item = self.task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self,):
        while True:
            f = self.completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        processes = [Process(target=self.worker) for _ in range(self.n_cpus)]
        for p in processes:
            p.start()
        writer = Process(target=self.writer_process, args=())
        writer.start()
        self.reader()
        for p in processes:
            p.join()
        self.completed_queue.put("DONE")
        writer.join()
Personally, instead of using 'STOP' and 'DONE' as the sentinel messages, I would use None, assuming that is not a valid actual message. I have tested the above code where reader just processed strings in a list, self.process_item(item) simply appended ' done' to each of those strings and put the modified string on the completed_queue, and self.write_f in the writer_process was replaced with a print call. I did not see any problems with the code as is.
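As a self-contained toy version of that test, with a single worker and None as the sentinel instead of 'STOP'/'DONE' (the item names and the ' done' suffix are just placeholders, not the original test code):

from multiprocessing import Process, Queue

def worker(task_queue, completed_queue):
    while True:
        item = task_queue.get()
        if item is None:                  # None is the stop sentinel
            break
        completed_queue.put(item + ' done')
    completed_queue.put(None)             # tell the "writer" side we are finished

if __name__ == '__main__':
    task_queue, completed_queue = Queue(), Queue()
    p = Process(target=worker, args=(task_queue, completed_queue))
    p.start()
    for item in ['item1', 'item2', 'item3']:
        task_queue.put(item)
    task_queue.put(None)                  # one sentinel per worker (here: one worker)
    # the "writer" side, here just printing:
    while True:
        f = completed_queue.get()
        if f is None:
            break
        print(f)
    p.join()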
Update to use a Managed Queue
Disclaimer: I have had no experience using mpi4py and have no idea how the queue proxies would get distributed across different computers. The code above may not be sufficient, as suggested by the following article, How to share multiprocessing queue object between multiple computers. However, that article's code creates instances of Queue.Queue (it is Python 2 code) and not the proxies that are returned by a multiprocessing.SyncManager. The documentation on this is very poor. Try the change below to see if it works better (it will be slower).
Because the queues are now proxies returned by manager.Queue() rather than attributes created in __init__, I have had to rearrange the code a bit; the queues are now passed explicitly as arguments to the process functions:
from multiprocessing import Process, Manager

class Computation():

    def __init__(self, K):
        self.n_cpus = K

    def reader(self, task_queue):
        with open(db, "r") as db:
            ... # Read an item
        # signal to the worker processes to terminate:
        for _ in range(self.n_cpus):
            task_queue.put('STOP')

    def worker(self, task_queue, completed_queue):
        while True:
            item = task_queue.get(True)
            if item == "STOP":
                break
            self.process_item(item)

    def writer_process(self, completed_queue):
        while True:
            f = completed_queue.get(True)
            if f == "DONE":
                break
            self.write_f(f)

    def run(self):
        with Manager() as manager:
            task_queue = manager.Queue()
            completed_queue = manager.Queue()
            processes = [Process(target=self.worker, args=(task_queue, completed_queue))
                         for _ in range(self.n_cpus)]
            for p in processes:
                p.start()
            writer = Process(target=self.writer_process, args=(completed_queue,))
            writer.start()
            self.reader(task_queue)
            for p in processes:
                p.join()
            completed_queue.put("DONE")
            writer.join()
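To see the rearranged class in action, here is a hypothetical completion of it, invented purely for illustration (in the original, process_item and write_f do the real file I/O; here the worker is also overridden so results flow through completed_queue):

class MyComputation(Computation):
    """Hypothetical completion of the class above, for illustration only."""

    def reader(self, task_queue):
        for item in ['item1', 'item2', 'item3', 'item4']:   # stand-in for reading the db
            task_queue.put(item)
        for _ in range(self.n_cpus):                        # signal the workers to stop
            task_queue.put('STOP')

    def worker(self, task_queue, completed_queue):
        while True:
            item = task_queue.get(True)
            if item == "STOP":
                break
            completed_queue.put(self.process_item(item))    # forward results to the writer

    def process_item(self, item):
        return item + ' processed'                          # stand-in for the real work

    def write_f(self, f):
        print(f)                                            # stand-in for writing results.txt

if __name__ == '__main__':
    MyComputation(2).run()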
How to use a pipe correctly with multiple processes (>2)?
e.g. one producer, several consumers.
This code fails in a Linux environment, but works fine on Windows.
import multiprocessing, time

def consumer(pipe, id):
    output_p, input_p = pipe
    input_p.close()
    while True:
        try:
            item = output_p.recv()
        except EOFError:
            break
        print("%s consume: %s" % (id, item))
        #time.sleep(3)  # without this sleep the code fails on Linux,
                        # but works fine on Windows
    print('Consumer done')

def producer(sequence, input_p):
    for item in sequence:
        print('produce:', item)
        input_p.send(item)
        time.sleep(1)

if __name__ == '__main__':
    (output_p, input_p) = multiprocessing.Pipe()
    # create two consumer processes
    cons_p1 = multiprocessing.Process(target=consumer, args=((output_p, input_p), 1))
    cons_p1.start()
    cons_p2 = multiprocessing.Process(target=consumer, args=((output_p, input_p), 2))
    cons_p2.start()
    output_p.close()
    sequence = [i for i in range(10)]
    producer(sequence, input_p)
    input_p.close()
    cons_p1.join()
    cons_p2.join()
Do not use a pipe for multiple consumers. The documentation explicitly says data may become corrupted when two processes read from or write to the same end of the pipe at the same time, which is what you have here: two readers on the same end.
The two connection objects returned by Pipe() represent the two ends of the pipe. Each connection object has send() and recv() methods (among others). Note that data in a pipe may become corrupted if two processes (or threads) try to read from or write to the same end of the pipe at the same time. Of course there is no risk of corruption from processes using different ends of the pipe at the same time.
So use Queue, or JoinableQueue even.
from multiprocessing import Process, JoinableQueue
from queue import Empty   # on Python 2: from Queue import Empty
import time

def consumer(que, pid):
    while True:
        try:
            item = que.get(timeout=10)
            print("%s consume:%s" % (pid, item))
            que.task_done()
        except Empty:
            break
    print('Consumer done')

def producer(sequence, que):
    for item in sequence:
        print('produce:', item)
        que.put(item)
        time.sleep(1)

if __name__ == '__main__':
    que = JoinableQueue()
    # create two consumer processes
    cons_p1 = Process(target=consumer, args=(que, 1))
    cons_p1.start()
    cons_p2 = Process(target=consumer, args=(que, 2))
    cons_p2.start()
    sequence = [i for i in range(10)]
    producer(sequence, que)
    que.join()
    cons_p1.join()
    cons_p2.join()
I have a dictionary with names and ages:
classmates = {'Anne':15, 'Laura':17, 'Michel':16, 'Lee':15, 'Mick':17, 'Liz':16}
I want to select all the names that start with the letter "L". I can do it like this:
for name, age in classmates.items():
    if "L" in name:
        print(name)
or
Lnames = [name for name in classmates if "L" in name]
Is there a more efficient way of doing when I have millions of entries and I need to repeat the operation millions of times?
One liner with List Comprehension.
[ key for key in classmates.keys() if key.startswith('L') ]
#driver values
In : classmates = {'Anne':15, 'Laura':17, 'Michel':16, 'Lee':15, 'Mick':17, 'Liz':16}
Out : ['Lee', 'Liz', 'Laura']
As others have pointed out, use startswith instead of in to check whether the character is at the start.
There will be no improvement in your search time unless you use some kind of parallel computation. You can apply the filtering that @Kaushnik NP mentioned to chunks of the data split across several processes.
So you would split your dictionary into, say, 4 smaller dictionaries (depending on the number of cores your processor has), run a worker on each chunk, and collect the results somewhere common.
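A rough sketch of that idea, assuming the filtering really is the bottleneck: split the keys into one chunk per core, let a Pool filter each chunk with startswith, and merge the partial results (the chunking helper and the letter 'L' are just placeholders):

import multiprocessing

def filter_chunk(names):
    # filter one chunk of keys in a worker process
    return [name for name in names if name.startswith('L')]

def chunks(seq, n):
    # split seq into n roughly equal slices
    size = (len(seq) + n - 1) // n
    return [seq[i:i + size] for i in range(0, len(seq), size)]

if __name__ == '__main__':
    classmates = {'Anne': 15, 'Laura': 17, 'Michel': 16,
                  'Lee': 15, 'Mick': 17, 'Liz': 16}
    n_workers = multiprocessing.cpu_count()
    with multiprocessing.Pool(n_workers) as pool:
        parts = pool.map(filter_chunk, chunks(list(classmates), n_workers))
    l_names = [name for part in parts for name in part]
    print(l_names)   # ['Laura', 'Lee', 'Liz']

With only six entries the process startup and pickling overhead will dominate; splitting only starts to pay off at the "millions of entries" scale the question mentions, and even then the cost of shipping the chunks to the workers can eat much of the gain.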
Here is a more general snippet that uses the multiprocessing library and queues to schedule some work and store its results:
#!/usr/bin/env python
import multiprocessing, os, signal, time, Queue

def do_work():
    print 'Work Started: %d' % os.getpid()
    time.sleep(2)
    return 'Success'

def manual_function(job_queue, result_queue):
    signal.signal(signal.SIGINT, signal.SIG_IGN)
    while not job_queue.empty():
        try:
            job = job_queue.get(block=False)
            result_queue.put(do_work())
        except Queue.Empty:
            pass
        #except KeyboardInterrupt: pass

def main():
    job_queue = multiprocessing.Queue()
    result_queue = multiprocessing.Queue()

    for i in range(6):
        job_queue.put(None)

    workers = []
    for i in range(3):
        tmp = multiprocessing.Process(target=manual_function,
                                      args=(job_queue, result_queue))
        tmp.start()
        workers.append(tmp)

    try:
        for worker in workers:
            worker.join()
    except KeyboardInterrupt:
        print 'parent received ctrl-c'
        for worker in workers:
            worker.terminate()
            worker.join()

    while not result_queue.empty():
        print result_queue.get(block=False)

if __name__ == "__main__":
    main()
For other examples on that topic, use that link
Suppose I have the following multiprocessing structure:
import multiprocessing as mp

def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() == True:
            break
        else:
            picked = working_queue.get()
            res_item = "Number " + str(picked)
            output_queue.put(res_item)
    return

if __name__ == '__main__':
    static_input = xrange(100)
    working_q = mp.Queue()
    output_q = mp.Queue()
    results_bank = []
    for i in static_input:
        working_q.put(i)
    processes = [mp.Process(target=worker, args=(working_q, output_q)) for i in range(2)]
    for proc in processes:
        proc.start()
    for proc in processes:
        proc.join()
    results_bank = []
    while True:
        if output_q.empty() == True:
            break
        results_bank.append(output_q.get_nowait())
    if len(results_bank) == len(static_input):
        print "Good run"
    else:
        print "Bad run"
My question: How would I 'batch' write my results to a single file while the working_queue is still 'working' (or at least, not finished)?
Note: My actual data structure is not sensitive to unordered results relative to inputs (despite my example using integers).
Also, I think that batch/set writing from the output queue is better practice than writing from the growing results_bank object. However, I am open to solutions relying on either approach. I am new to multiprocessing, so I am unsure of the best practice or the most efficient solution(s) to this question.
If you wish to use mp.Processes and mp.Queues, here is a way to process the results in batches. The main idea is in the writer function, below:
import itertools as IT
import multiprocessing as mp

SENTINEL = None
static_len = 100

def worker(working_queue, output_queue):
    for picked in iter(working_queue.get, SENTINEL):
        res_item = "Number {:2d}".format(picked)
        output_queue.put(res_item)

def writer(output_queue, threshold=10):
    result_length = 0
    items = iter(output_queue.get, SENTINEL)
    for batch in iter(lambda: list(IT.islice(items, threshold)), []):
        print('\n'.join(batch))
        result_length += len(batch)
    state = 'Good run' if result_length == static_len else 'Bad run'
    print(state)

if __name__ == '__main__':
    num_workers = 2
    static_input = range(static_len)
    working_q = mp.Queue()
    output_q = mp.Queue()

    writer_proc = mp.Process(target=writer, args=(output_q,))
    writer_proc.start()

    for i in static_input:
        working_q.put(i)

    processes = [mp.Process(target=worker, args=(working_q, output_q))
                 for i in range(num_workers)]
    for proc in processes:
        proc.start()

    # Put SENTINELs in the Queue to tell the workers to exit their for-loop
    for _ in range(num_workers):
        working_q.put(SENTINEL)

    for proc in processes:
        proc.join()

    output_q.put(SENTINEL)
    writer_proc.join()
When passed two arguments, iter expects a callable and a sentinel:
iter(callable, sentinel). The callable (i.e. a function) gets called repeatedly until it returns a value equal to the sentinel. So
items = iter(output_queue.get, SENTINEL)
defines items to be an iterable which, when iterated over, will return items from output_queue
until output_queue.get() returns SENTINEL.
The for-loop:
for batch in iter(lambda: list(IT.islice(items, threshold)), []):
calls the lambda function repeatedly until an empty list is returned. When called, the lambda function returns a list of up to threshold number of items from the iterable items. Thus, this is an idiom for "grouping by n items without padding". See this post for more on this idiom.
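A quick stand-alone illustration of that grouping idiom, independent of the multiprocessing code:

import itertools as IT

items = iter(range(10))
for batch in iter(lambda: list(IT.islice(items, 3)), []):
    print(batch)
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]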
Note that it is not a good practice to test working_q.empty(). It could lead to a race condition. For example, suppose we have the 2 worker processes on these lines when the working_q has only 1 item left in it:
def worker(working_queue, output_queue):
    while True:
        if working_queue.empty() == True:    <-- Process-1
            break
        else:
            picked = working_queue.get()     <-- Process-2
            res_item = "Number " + str(picked)
            output_queue.put(res_item)
    return
Suppose Process-1 calls working_queue.empty() while there is still one item in the queue. So it returns False. Then Process-2 calls working_queue.get() and obtains the last item. Then Process-1 gets to line picked = working_queue.get() and hangs because there are no more items in the queue.
Therefore, use sentinels (as shown above) to concretely signal when a for-loop
or while-loop should stop instead of checking queue.empty().
There is no operation like a "batch q.get". But it is good practice to put/get a batch of items instead of one item at a time.
Which is exactly what multiprocessing.Pool.map does with its chunksize parameter :)
For writing output as soon as possible there is Pool.imap_unordered, which returns an iterable instead of a list.
import multiprocessing

def work(item):
    return "Number " + str(item)

if __name__ == '__main__':
    static_input = range(100)
    chunksize = 10
    with multiprocessing.Pool() as pool:
        for out in pool.imap_unordered(work, static_input, chunksize):
            print(out)
I am trying to use multiprocessing to process a very large number of files.
I tried to put the list of files into a queue and have 3 workers split the load through a common Queue. However, this does not seem to work. I am probably misunderstanding the queue in the multiprocessing package.
Below is the example source code:
import multiprocessing
from multiprocessing import Queue

def worker(i, qu):
    """worker function"""
    while ~qu.empty():
        val=qu.get()

        print 'Worker:',i, ' start with file:',val

        j=1
        for k in range(i*10000,(i+1)*10000): # some time consuming process
            for j in range(i*10000,(i+1)*10000):
                j=j+k

        print 'Worker:',i, ' end with file:',val

if __name__ == '__main__':
    jobs = []
    qu=Queue()
    for j in range(100,110): # files numbers are from 100 to 110
        qu.put(j)

    for i in range(3): # 3 multiprocess
        p = multiprocessing.Process(target=worker, args=(i,qu))
        jobs.append(p)
        p.start()
    p.join()
Thanks for the comments.
I've come to realize that using Pool is the best solution.
import multiprocessing
import time

def worker(val):
    """worker function"""
    print 'Worker: start with file:',val
    time.sleep(1.1)
    print 'Worker: end with file:',val

if __name__ == '__main__':
    file_list = range(100,110)
    p = multiprocessing.Pool(2)
    p.map(worker, file_list)
Three issues:
1) you are joining only on the 3rd process
2) Why not use multiprocessing.Pool?
3) race condition on qu.get()
1 & 3)
import multiprocessing
from multiprocessing import Queue
from Queue import Empty   # the Empty exception lives in the Queue module

def worker(i, qu):
    """worker function"""
    while 1:
        try:
            val = qu.get(timeout=10)
        except Empty:
            break # Yay no race condition

        print 'Worker:',i, ' start with file:',val

        j=1
        for k in range(i*10000,(i+1)*10000): # some time consuming process
            for j in range(i*10000,(i+1)*10000):
                j=j+k

        print 'Worker:',i, ' end with file:',val

if __name__ == '__main__':
    jobs = []
    qu=Queue()
    for j in range(100,110): # files numbers are from 100 to 110
        qu.put(j)

    for i in range(3): # 3 multiprocess
        p = multiprocessing.Process(target=worker, args=(i,qu))
        jobs.append(p)
        p.start()

    for p in jobs: #<--- join on all processes ...
        p.join()
2)
for how to use the Pool, see:
https://docs.python.org/2/library/multiprocessing.html
You are joining only the last of your created processes. That means if the first or the second process is still working while the third has finished, your main process goes down and kills the remaining processes before they are finished.
You should join them all in order to wait until they are finished:
for p in jobs:
    p.join()
Another thing: you should consider using qu.get_nowait() to get rid of the race condition between qu.empty() and qu.get().
For example:
try:
    while 1:
        message = self.queue.get_nowait()
        """ do something fancy here """
except Queue.Empty:
    pass
I hope that helps