I've already done some research but I'm not able to run the code correctly. Below is the meaning of each function argument:
file - the search list file that will be distributed to all processes
header - there are 6 headers, one for each process
count1 - when it reaches 5, trigger time.sleep
def google(file, header, count1):
    # function body...

if __name__ == '__main__':
    with Pool(6) as x:
        x.starmap(google, lista, [headers_1, headers_2, headers_3, headers_4, headers_5, headers_6], 0)
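For reference, Pool.starmap expects a single iterable of argument tuples, one tuple per call, rather than separate iterables per parameter. A minimal sketch of what that could look like here, assuming lista and headers_1 through headers_6 are defined as described above, that each process starts with count1 = 0, and that the list is split into six slices (the chunking rule is an assumption, not part of the original code):

from multiprocessing import Pool

def google(file, header, count1):
    ...  # original function body goes here

if __name__ == '__main__':
    headers = [headers_1, headers_2, headers_3, headers_4, headers_5, headers_6]
    chunks = [lista[i::6] for i in range(6)]   # one slice of the search list per process (assumed)
    args = [(chunk, header, 0) for chunk, header in zip(chunks, headers)]
    with Pool(6) as pool:
        pool.starmap(google, args)             # calls google(chunk, header, 0) for each tuple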
Let us consider the following code where I calculate the factorial of 4 really large numbers, saving each output to a separate .txt file (out_mp_{idx}.txt). I use multiprocessing (4 processes) to reduce the computation time. Though this works fine, I want to output all the 4 results in one file.
One way is to open each of the (4) files generated by the code below and append their contents to a new file, but that's not what I want (below is just a simplified version of my code; I have too many files to handle, which defeats the purpose of the time saved via multiprocessing). Is there a better way to automate this so that the results from all the processes are dumped/appended to a single file? Also, in my case the result returned from each process could be several lines, so how do we avoid a file-access conflict when one process is appending its results to the output file and a second process finishes and wants to open/access the same file?
As an alternative, I tried the imap route, but that's not as computationally efficient as the code below. Something like this SO post.
from multiprocessing import Process
import os
import time

tic = time.time()

def factorial(n, idx):  # function to calculate the factorial
    num = 1
    while n >= 1:
        num *= n
        n = n - 1
    with open(f'out_mp_{idx}.txt', 'w') as f0:  # saving output to a separate file
        f0.write(str(num))

def My_prog():
    jobs = []
    N = [10000, 20000, 40000, 50000]  # numbers for which factorial is desired
    n_procs = 4
    # executing multiple processes
    for i in range(n_procs):
        p = Process(target=factorial, args=(N[i], i))
        jobs.append(p)
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    print(f'Exec. Time: {time.time() - tic} [s]')

if __name__ == '__main__':
    My_prog()
You can do this:
1) Create a Queue
a) manager = Manager()
b) data_queue = manager.Queue()
c) put all data in this queue.
2) Create a thread and start it before the multiprocessing work begins
a) create a function which waits on data_queue. Something like:
def fun():
    while True:
        data = data_queue.get()
        if isinstance(data, Sentinel):
            break
        # write to a file
3) Remember to send a Sentinel object after all the worker processes are done.
You can also make this thread a daemon thread and skip the sentinel part.
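Applied to the factorial question above, a rough sketch of this pattern might look like the following; the Sentinel class, the writer thread, and the combined output filename out_mp_all.txt are assumptions for illustration, not part of the original code:

import threading
from multiprocessing import Process, Manager

class Sentinel:  # assumed marker object signalling "no more results"
    pass

def factorial(n, idx, data_queue):
    num = 1
    while n >= 1:
        num *= n
        n -= 1
    data_queue.put(f'case {idx}: {num}\n')   # send the result instead of writing a per-process file

def writer(data_queue):
    # Runs in a thread of the main process; it is the only writer of the file,
    # so there is no open-file conflict between the worker processes.
    with open('out_mp_all.txt', 'w') as f0:
        while True:
            data = data_queue.get()
            if isinstance(data, Sentinel):
                break
            f0.write(data)

if __name__ == '__main__':
    manager = Manager()
    data_queue = manager.Queue()
    t = threading.Thread(target=writer, args=(data_queue,))
    t.start()
    N = [10000, 20000, 40000, 50000]
    jobs = [Process(target=factorial, args=(n, i, data_queue)) for i, n in enumerate(N)]
    for j in jobs:
        j.start()
    for j in jobs:
        j.join()
    data_queue.put(Sentinel())   # tell the writer thread to stop
    t.join()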
I'm trying to use multiprocessing in this way:
from multiprocessing import Pool

added = []

def foo(i):
    added = []
    # do something
    added.append(x[i])
    return added

if __name__ == '__main__':
    h = 0
    while len(added) < len(c):
        pool = Pool(4)
        result = pool.imap_unordered(foo, c)
        added.append(result[-1])
        pool.close()
        pool.join()
        h = h + 1
Multiprocessing takes place in the while loop, and the added list is created inside the foo function. In each subsequent step h of the loop, the list added should grow by further values, and the current contents of added should be used inside the function foo. Is it possible to pass the current contents of the list to the function in each subsequent step of the loop? In the above code, the foo function creates the contents of added from scratch each time. How can this be solved?
You can use a multiprocessing.Queue. The rough idea is to construct one of these in your main process, pass it to the child processes, and have each foo() invocation call put(x[i]) to add a value to the queue.
The main process then reads the queue to collect the results.
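A minimal sketch of that idea, using a Manager queue so it can be passed through Pool.map (a plain multiprocessing.Queue cannot be pickled into pool tasks); the stand-in definitions of x and c are assumptions for illustration:

import multiprocessing
from functools import partial

x = list(range(20))          # assumed stand-in for the question's x
c = list(range(len(x)))      # assumed stand-in for the question's c

def foo(queue, i):
    # do something with x[i], then publish the value instead of returning it
    queue.put(x[i])

if __name__ == '__main__':
    manager = multiprocessing.Manager()
    queue = manager.Queue()   # manager queue proxies are picklable, so workers can receive them
    with multiprocessing.Pool(4) as pool:
        pool.map(partial(foo, queue), c)
    added = []
    while not queue.empty():  # collect everything the workers produced
        added.append(queue.get())
    print(added)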
Please bear with me as this is a bit of a contrived example of my real application. Suppose I have a list of numbers and I wanted to add a single number to each number in the list using multiple (2) processes. I can do something like this:
import multiprocessing
my_list = list(range(100))
my_number = 5
data_line = [{'list_num': i, 'my_num': my_number} for i in my_list]
def worker(data):
    return data['list_num'] + data['my_num']
pool = multiprocessing.Pool(processes=2)
pool_output = pool.map(worker, data_line)
pool.close()
pool.join()
Now however, there's a wrinkle to my problem. Suppose that I wanted to alternate adding two numbers (instead of just adding one). So around half the time I want to add my_number1, and the other half of the time I want to add my_number2. It doesn't matter which number gets added to which item on the list. However, the one requirement is that I don't want the different processes to be adding the same number simultaneously. What this boils down to essentially (I think) is that I want to use the first number on Process 1 and the second number on Process 2 exclusively, so that the processes are never simultaneously adding the same number. So something like:
my_num1 = 5
my_num2 = 100
data_line = [{'list_num': i, 'my_num1': my_num1, 'my_num2': my_num2} for i in my_list]
def worker(data):
    # if in Process 1:
    return data['list_num'] + data['my_num1']
    # if in Process 2:
    return data['list_num'] + data['my_num2']
    # and so forth
Is there an easy way to specify specific inputs per process? Is there another way to think about this problem?
multiprocessing.Pool lets you provide an initializer function, which is executed in each worker process before the actual given function is run.
You can use it together with a global variable to let your function know in which process it is running.
You probably want to control the initial number each process will get. You can use a Queue to tell the processes which number to pick up.
This solution is not optimal, but it works.
import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()  # atomically get the process index

def function(value):
    print("I'm process %s" % process_number)
    return value[process_number]

def main():
    queue = multiprocessing.Queue()
    for index in range(multiprocessing.cpu_count()):
        queue.put(index)
    pool = multiprocessing.Pool(initializer=initializer, initargs=[queue])
    tasks = [{0: 'Process-0', 1: 'Process-1', 2: 'Process-2'}, ...]
    print(pool.map(function, tasks))
My PC has a dual-core CPU; as you can see, only Process-0 and Process-1 do the processing.
I'm process 0
I'm process 0
I'm process 1
I'm process 0
I'm process 1
...
['Process-0', 'Process-0', 'Process-1', 'Process-0', ... ]
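To connect this back to the original worker, here is a rough sketch under the assumption that two worker processes are used and that the even-numbered worker always adds my_num1 while the odd-numbered worker always adds my_num2; the selection rule and the literal numbers are assumptions for illustration, not part of the answer above:

import multiprocessing

process_number = None

def initializer(queue):
    global process_number
    process_number = queue.get()   # each worker takes a distinct index

def worker(data):
    # Assumed rule: even-numbered worker uses my_num1, odd-numbered worker uses my_num2,
    # so the two numbers are never added by the same worker process at the same time.
    key = 'my_num1' if process_number % 2 == 0 else 'my_num2'
    return data['list_num'] + data[key]

if __name__ == '__main__':
    my_list = list(range(100))
    data_line = [{'list_num': i, 'my_num1': 5, 'my_num2': 100} for i in my_list]
    queue = multiprocessing.Queue()
    for index in range(2):
        queue.put(index)
    pool = multiprocessing.Pool(processes=2, initializer=initializer, initargs=[queue])
    print(pool.map(worker, data_line))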
I have a complex problem with the Python multiprocessing module.
I have built a script that at one point has to call a multi-argument function (call_function) for each element in a specific list. My idea is to define an integer 'N' and divide this problem among single subprocesses.
li = [a, b, c, d, e]  # elements are ints
for element in li:
    call_function(element, string1, string2, int1)
call_summary_function()
The summary function will analyze the results obtained by all iterations of the loop. Now, I want each iteration to be carried out by a single subprocess, but there cannot be more than N subprocesses running altogether. If the limit is reached, the main process should wait until one of the subprocesses ends and then start another iteration. Also, call_summary_function needs to be called after all the subprocesses finish.
I have tried my best with the multiprocessing module, Locks and global variables to keep track of the actual number of subprocesses running (to compare to N), but every time I get an error.
//--------------EDIT-------------//
Firstly, the main process code:
MAX_PROCESSES = 3
lock = multiprocessing.Lock()
processes = 0
k = 0
while k < len(k_list):
    if processes <= MAX_PROCESSES:  # running processes <= 'N' set by me
        p = multiprocessing.Process(target=single_analysis, args=(k_list[k], main_folder, training_testing, subsets, positive_name, ratio_list, lock, processes))
        p.start()
        k += 1
    else:
        time.sleep(1)
while processes > 0:
    time.sleep(1)
Now, the function that is called by multiprocessing:
def single_analysis(k, main_folder, training_testing, subsets, positive_name, ratio_list, lock, processes):
    lock.acquire()
    processes += 1
    lock.release()
    # stuff to do
    lock.acquire()
    processes -= 1
    lock.release()
I get the problem that the int value (the processes variable) is always equal to 0, since the single_analysis() function seems to create a new, local processes variable.
When I make processes global, declare it in single_analysis() with the global keyword, and print processes within the function, I get len(li) times 1...
What you're describing is perfectly suited for multiprocessing.Pool - specifically its map method:
import multiprocessing
from functools import partial

def call_function(string1, string2, int1, element):
    # Do stuff here
    ...

if __name__ == "__main__":
    li = [a, b, c, d, e]
    p = multiprocessing.Pool(N)  # The pool will contain N worker processes.
    # Use partial so that we can pass a method that takes more than one argument to map.
    func = partial(call_function, string1, string2, int1)
    results = p.map(func, li)
    call_summary_function(results)
p.map will call call_function(string1, string2, int1, element) for each element in the li list. results will be a list containing the value returned by each call to call_function. You can pass that list to call_summary_function to process the results.
I want to use subprocesses to let 20 instances of a written script run in parallel. Let's say I have a big list of URLs with like 100,000 entries, and my program should ensure that at all times 20 instances of my script are working on that list. I wanted to code it as follows:
urllist = [url1, url2, url3, .. , url100000]
i = 0
while number_of_subprocesses < 20 and i < 100000:
    subprocess.Popen(['python', 'script.py', urllist[i]])
    i = i + 1
My script just writes something into a database or text file. It doesn't output anything and doesn't need more input than the URL.
My problem is that I wasn't able to find out how to get the number of subprocesses that are active. I'm a novice programmer, so every hint and suggestion is welcome. I was also wondering how, once the 20 subprocesses are loaded, I can make the while loop check the conditions again. I thought of maybe putting another while loop over it, something like
while i < 100000:
    while number_of_subprocesses < 20:
        subprocess.Popen(['python', 'script.py', urllist[i]])
        i = i + 1
    if number_of_subprocesses == 20:
        sleep()  # wait some time until checking again
Or maybe there's a better possibility where the while loop always checks the number of subprocesses?
I also considered using the multiprocessing module, but I found it really convenient to just call script.py with subprocess instead of a function with multiprocessing.
Maybe someone can help me and lead me in the right direction. Thanks a lot!
Taking a different approach from the above - as it seems that the callback can't be sent as a parameter:
NextURLNo = 0
MaxProcesses = 20
MaxUrls = 100000  # Note this would be better to be len(urllist)
Processes = []

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global Processes
    if NextURLNo < MaxUrls:
        proc = subprocess.Popen(['python', 'script.py', urllist[NextURLNo]])
        print("Started to Process %s" % urllist[NextURLNo])
        NextURLNo += 1
        Processes.append(proc)

def CheckRunning():
    """ Check any running processes and start new ones if there are spare slots."""
    global Processes
    global NextURLNo
    for p in range(len(Processes) - 1, -1, -1):  # Check the processes in reverse order
        if Processes[p].poll() is not None:  # poll() returns None while the process is still running
            del Processes[p]  # Remove from list - this is why we needed reverse order
    while (len(Processes) < MaxProcesses) and (NextURLNo < MaxUrls):  # More to do and some spare slots
        StartNew()

if __name__ == "__main__":
    CheckRunning()  # This will start the max processes running
    while (len(Processes) > 0):  # Something still going on.
        time.sleep(0.1)  # You may wish to change the time for this
        CheckRunning()
    print("Done!")
Just keep count as you start them and use a callback from each subprocess to start a new one if there are any url list entries to process.
e.g. Assuming that your sub-process calls the OnExit method passed to it as it ends:
NextURLNo = 0
MaxProcesses = 20
NoSubProcess = 0
MaxUrls = 100000

def StartNew():
    """ Start a new subprocess if there is work to do """
    global NextURLNo
    global NoSubProcess
    if NextURLNo < MaxUrls:
        subprocess.Popen(['python', 'script.py', urllist[NextURLNo], OnExit])
        print("Started to Process", urllist[NextURLNo])
        NextURLNo += 1
        NoSubProcess += 1

def OnExit():
    global NoSubProcess
    NoSubProcess -= 1

if __name__ == "__main__":
    for n in range(MaxProcesses):
        StartNew()
    while (NoSubProcess > 0):
        time.sleep(1)
        if (NextURLNo < MaxUrls):
            for n in range(NoSubProcess, MaxProcesses):
                StartNew()
To keep constant number of concurrent requests, you could use a thread pool:
#!/usr/bin/env python
from multiprocessing.dummy import Pool

def process_url(url):
    # ... handle a single url

urllist = [url1, url2, url3, .. , url100000]
for _ in Pool(20).imap_unordered(process_url, urllist):
    pass
To run processes instead of threads, remove .dummy from the import.
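For the process-based variant, a rough sketch might look like the following; the process_url body (shelling out to the question's script.py via subprocess.run) and the stand-in urllist are assumptions, and the __main__ guard becomes necessary once real processes are spawned:

#!/usr/bin/env python
from multiprocessing import Pool
import subprocess

def process_url(url):
    # Assumed placeholder: run the existing script.py on a single url.
    subprocess.run(['python', 'script.py', url])

if __name__ == '__main__':
    urllist = ['http://example.com/page/%d' % i for i in range(100)]  # stand-in list
    with Pool(20) as pool:                      # 20 worker processes, as in the question
        for _ in pool.imap_unordered(process_url, urllist):
            pass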