Run parallel Stata do files in Python using multiprocessing and subprocess - python

I have a Stata do file pyexample3.do, which uses its argument as a regressor to run a regression. The F-statistic from the regression is saved in a text file. The code is as follows:
clear all
set more off
local y `1'
display `"first parameter: `y'"'
sysuse auto
regress price `y'
local f=e(F)
display "`f'"
file open myhandle using test_result.txt, write append
file write myhandle "`f'" _n
file close myhandle
exit, STATA clear
Now I am trying to run the Stata do file in parallel in Python and write all the F-statistics to one text file. My CPU has 4 cores.
import multiprocessing
import subprocess

def work(staname):
    dofile = "pyexample3.do"
    cmd = ["StataMP-64.exe", "/e", "do", dofile, staname]
    return subprocess.call(cmd, shell=False)

if __name__ == '__main__':
    my_list = ["mpg", "rep78", "headroom", "trunk", "weight", "length",
               "turn", "displacement", "gear_ratio"]
    my_list.sort()
    print my_list

    # Get the number of processors available
    num_processes = multiprocessing.cpu_count()
    threads = []
    len_stas = len(my_list)
    print "+++ Number of stations to process: %s" % (len_stas)

    # run until all the threads are done, and there is no data left
    for list_item in my_list:
        # if we aren't using all the processors AND there is still data left to
        # compute, then spawn another thread
        if len(threads) < num_processes:
            p = multiprocessing.Process(target=work, args=[list_item])
            p.start()
            print p, p.is_alive()
            threads.append(p)
        else:
            for thread in threads:
                if not thread.is_alive():
                    threads.remove(thread)
Although the do file is supposed to run 9 times, since there are 9 strings in my_list, it only ran 4 times. So where did it go wrong?

In your for list_item in my_list loop, after the first 4 processes have been started, execution falls into the else branch:
for thread in threads:
    if not thread.is_alive():
        threads.remove(thread)
Since thread.is_alive() does not block, this loop executes immediately, before any of those 4 processes has finished its task. Therefore only the first 4 processes are ever started.
You could simply use a while loop to repeatedly check the process status at a small interval:
# (requires "import time" at the top of the script)
keep_checking = True
while keep_checking:
    for thread in threads:
        if not thread.is_alive():
            threads.remove(thread)
            keep_checking = False
    time.sleep(0.5)  # wait 0.5 s before checking again
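Alternatively, if the goal is simply to keep at most num_processes Stata runs going at once until all nine regressors are processed, a multiprocessing.Pool does that bookkeeping for you. A minimal sketch, assuming the same work function and file names from the question:
import multiprocessing
import subprocess

def work(staname):
    # one Stata batch run per regressor, exactly as in the question
    dofile = "pyexample3.do"
    cmd = ["StataMP-64.exe", "/e", "do", dofile, staname]
    return subprocess.call(cmd, shell=False)

if __name__ == '__main__':
    my_list = ["mpg", "rep78", "headroom", "trunk", "weight",
               "length", "turn", "displacement", "gear_ratio"]
    # a pool of cpu_count() workers picks up the next item as soon as one finishes
    pool = multiprocessing.Pool(multiprocessing.cpu_count())
    return_codes = pool.map(work, my_list)
    pool.close()
    pool.join()
    print(return_codes)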

Related

Multiprocessing queue termination

I have a program I want to split into 10 parts with multiprocessing. Each worker will be searching for the same answer using different variables to look for it (in this case it's brute-forcing a password). How do I get the processes to communicate their status, and how do I terminate all processes once one process has found the answer? Thank you!
If you are going to split it into 10 parts, then either you should have 10 cores or your worker function should not be 100% CPU-bound.
The following code initializes each pool process with a multiprocessing.Queue instance to which the worker function will write its result. The main process waits for the first entry written to the queue and then terminates all pool processes. For this demo, the worker function is passed the arguments 1, 2, 3, ..., 10; each call sleeps for that number of seconds and then writes the argument it received to the queue. So we would expect the worker that was passed the value 1 to complete first, and the total running time of the program to be slightly more than 1 second (it takes some time to create the 10 processes):
import multiprocessing
import time

def init_pool(q):
    global queue
    queue = q

def worker(x):
    time.sleep(x)
    # write result to queue
    queue.put_nowait(x)

def main():
    queue = multiprocessing.Queue()
    pool = multiprocessing.Pool(10, initializer=init_pool, initargs=(queue,))
    for i in range(1, 11):
        # non-blocking:
        pool.apply_async(worker, args=(i,))
    # wait for first result
    result = queue.get()
    pool.terminate()  # kill all tasks
    print('Result: ', result)

# required for Windows:
if __name__ == '__main__':
    t = time.time()
    main()
    print('total time =', time.time() - t)
Prints:
Result: 1
total time = 1.2548246383666992

Python multiprocessing: How to run a process again from a set of processes with the next element of a list?

I have a list which contains table names; let's say the size of the list is n. I have m servers, so I have opened m cursors, one corresponding to each server, which are also in another list. Now for every table I want to call a certain function which takes these two lists as parameters.
templst = [T1,T2,T3,T4,T5,T6, T7,T8,T9,T10,T11]
curlst = [cur1,cur2,cur3,cur4,cur5]
These cursors are opened as cur = conn.cursor(), so they are objects.
def extract_single(tableName, cursorconn):
    qry2 = "Select * FROM %s" % (tableName)
    cursorconn.execute(qry2).fetchall()
    print " extraction done"
    return
Now I have opened 5 processes (since I have 5 cursors) so as to run them in parallel.
processes = []
x = 0
for x in range(5):
    new_p = 'p%x' % x
    print "process :", new_p
    new_p = multiprocessing.Process(target=extract_single, args=(templst[x], curlst[x]))
    new_p.start()
    processes.append(new_p)

for process in processes:
    process.join()
So this makes sure that I have opened 5 processes, one for each cursor, and they took the first 5 table names.
Now I want that as soon as any of the 5 processes finishes, it should immediately take the 6th table from my templst, and the same thing goes on until all of templst is done.
How do I modify this code for this behaviour?
For example
As a simple example of what I want to do, let us consider templst as a list of ints for which I want to call the sleep function:
templst = [1,2,5,7,4,3,6,8,9,10,11]
curlst = [cur1,cur2,cur3,cur4,cur5]
def extract_single(sec, cursorconn):
    print "Sleeping for second=%s done by cursor=%s" % (sec, cursorconn)
    time.sleep(sec)
    print " sleeping done"
    return
So when I start the 5 cursors, it is possible that either sleep(1) or sleep(2) finishes first;
as soon as one finishes, I want to run sleep(3) with that cursor.
My real query will depend on the cursor, since it will be an SQL query.
Modified approach
Considering the previous sleep example, I now want to implement the following: suppose I have 10 cursors and my sleep queue is sorted in increasing or decreasing order.
Considering the list in increasing order:
Out of the 10 cursors, the first 5 cursors will take the first 5 elements from the queue and my other set of 5 cursors will take the last five.
So basically my cursor queue is divided into two halves: one half will take the lowest values and the other half will take the highest values.
Now if a cursor from the first half finishes, it should take the next lowest value available, and if a cursor from the second half finishes, it should take the (n-6)th value, i.e. the 6th value from the end.
I need to traverse the queue from both sides and have two sets of 5 cursors each.
example: curlst1 = [cur1,cur2,cur3,cur4,cur5]
curlst2 = [cur6,cur7,cur8,cur9,cur10 ]
templst = [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16]
so cur1 -> 1
cur2 -> 2
... cur5 -> 5
cur6 -> 16
cur7 -> 15
... cur10 -> 12
Now cur1 finishes first, so it will take 6 (the first available element from the front).
When cur2 finishes, it takes 7, and so on.
If cur10 finishes, it will take 11 (the next available element from the back),
and so on until all elements of templst are done.
Place your templst arguments, whether table names as in your real example or numbers of seconds to sleep as in the example below, on a multiprocessing queue. Each process then loops, reading the next item from the queue. When the queue is empty, there is no more work to be performed and the process can return. In effect, you have implemented your own process pool, where each process has its own dedicated cursor connection. The function extract_single now takes as its first argument the queue from which to retrieve the table name or seconds argument.
import multiprocessing
import Queue
import time

def extract_single(q, cursorconn):
    while True:
        try:
            sec = q.get_nowait()
            print "Sleeping for second=%s done by cursor=%s" % (sec, cursorconn)
            time.sleep(sec)
            print " sleeping done"
        except Queue.Empty:
            return

def main():
    q = multiprocessing.Queue()
    templst = [1, 2, 5, 7, 4, 3, 6, 8, 9, 10, 11]
    for item in templst:
        q.put(item)  # add items to queue
    curlst = [cur1, cur2, cur3, cur4, cur5]
    process = []
    for i in xrange(5):
        p = multiprocessing.Process(target=extract_single, args=(q, curlst[i]))
        process.append(p)
        p.start()
    for p in process:
        p.join()

if __name__ == '__main__':
    main()
Note
If you have fewer than 5 processors, you might try running this with 5 (or more) threads, in which case a regular Queue object should be used.
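For what it's worth, a minimal sketch of that threaded variant, assuming the same five cursor objects and the sleep-based extract_single from above (Python 2 syntax to match the answer's code; Queue here is the Python 2 standard-library module):
import threading
import Queue   # use "queue" in Python 3
import time

def extract_single(q, cursorconn):
    while True:
        try:
            sec = q.get_nowait()
        except Queue.Empty:
            return                      # queue drained, this thread is done
        print "Sleeping for second=%s done by cursor=%s" % (sec, cursorconn)
        time.sleep(sec)
        print " sleeping done"

def main():
    q = Queue.Queue()                   # a plain thread-safe queue is enough for threads
    for item in [1, 2, 5, 7, 4, 3, 6, 8, 9, 10, 11]:
        q.put(item)
    curlst = [cur1, cur2, cur3, cur4, cur5]   # your existing cursor objects
    threads = [threading.Thread(target=extract_single, args=(q, cur))
               for cur in curlst]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()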
Updated Answer to Updated Question
The data structure that allows you to remove items from both the front and the end of the queue is known as a deque (double-ended queue). Unfortunately, there is no version of a deque supported for multiprocessing. But I think your table processing might work just as well with threading, and it is highly unlikely that your computer has 10 processors to run 10 concurrent processes anyway.
import threading
from collections import deque
import time
import sys

templst = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
q = deque(templst)
curlst1 = [cur1, cur2, cur3, cur4, cur5]
curlst2 = [cur6, cur7, cur8, cur9, cur10]

def extract_single(cursorconn, from_front):
    while True:
        try:
            sec = q.popleft() if from_front else q.pop()
            #print "Sleeping for second=%s done by cursor=%s" % (sec, cursorconn)
            sys.stdout.write("Sleeping for second=%s done by cursor=%s\n" % (sec, cursorconn))
            sys.stdout.flush()  # flush output
            time.sleep(sec)
            #print " sleeping done"
            sys.stdout.write("sleeping done by %s\n" % cursorconn)
            sys.stdout.flush()  # flush output
        except IndexError:
            return

def main():
    threads = []
    for cur in curlst1:
        t = threading.Thread(target=extract_single, args=(cur, True))
        threads.append(t)
        t.start()
    for cur in curlst2:
        t = threading.Thread(target=extract_single, args=(cur, False))
        threads.append(t)
        t.start()
    for t in threads:
        t.join()

if __name__ == '__main__':
    main()

Python multiprocessing, spinning off processes inside a for loop

I have 800 files with some data to process; there are enough of them that I want to use multiprocessing, but I think I'm not doing it correctly.
Inside my main() function I'm trying to spin off 1 process for each file that needs processing (I'm guessing this is not a good idea because my computer won't be able to handle 800 concurrent processes, but I haven't gotten that far yet).
Here is my main():
manager = multiprocessing.Manager()
arr = manager.list()

def main():
    count = 0
    with open("loc.csv") as loc_file:
        locs = csv.reader(loc_file, delimiter=',')
        for loc in locs:
            if count != 0:
                process = multiprocessing.Process(target=sort_run, args=[loc])
                process.start()
                process.join()
            count += 1
And then my code that is the target of the process:
def sort_run(loc):
    start_time = time.time()
    sorted_list = sort_splits.sort_splits(loc[0])
    value = process_reads.count_coverage(sorted_list, loc[0])
    arr.append([loc[0], value])
I'm using the multiprocessing.Manager() so that my processes can access the arr list properly. I received the error:
An attempt has been made to start a new process before the current
process has finished its bootstrapping phase.
I think what's happening is that the loop is too fast to spin off the processes correctly. Or maybe each process has to have its own specific variable, not just "process = ...".
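For what it's worth, that particular error is usually raised when process-spawning code (including multiprocessing.Manager(), which starts a helper process) runs at module import time on a platform that uses the spawn start method, such as Windows; the usual remedy is to keep all of it under the if __name__ == '__main__': guard. A rough sketch of that structure, with sort_run reduced to a placeholder and the join moved after all starts so the runs actually overlap:
import csv
import multiprocessing

def sort_run(loc, arr):
    # placeholder for the real sort_splits / count_coverage work
    arr.append([loc[0], "value"])

def main():
    # create the Manager inside main() so child processes that re-import
    # this module do not try to start it again during bootstrapping
    manager = multiprocessing.Manager()
    arr = manager.list()
    processes = []
    with open("loc.csv") as loc_file:
        locs = csv.reader(loc_file, delimiter=',')
        next(locs, None)                          # skip the header row
        for loc in locs:
            p = multiprocessing.Process(target=sort_run, args=(loc, arr))
            p.start()
            processes.append(p)
    for p in processes:                           # join only after all have started
        p.join()
    print(list(arr))

if __name__ == '__main__':
    main()
With 800 files, a multiprocessing.Pool would additionally cap how many processes run at once.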

Multi-process, using Queue & Pool

I have a Producer process that runs and puts its results in a Queue.
I also have a Consumer function that takes the results from the Queue and processes them, for example:
def processFrame(Q, commandsFile):
    fr = Q.get()
    frameNum = fr[0]
    Frame = fr[1]
    #
    # Process the frame
    #
    commandsFile.write(theProcessedResult)
I want to run my consumer function using multiple processes; their number should be set by the user:
processes = raw_input('Enter the number of process you want to use: ')
I tried using Pool:
pool = Pool(int(processes))
pool.apply(processFrame, args=(q,toFile))
When I try this, it returns RuntimeError: Queue objects should only be shared between processes through inheritance.
What does that mean?
I also tried to use a list of processes:
while (q.empty() == False):
    mp = [Process(target=processFrame, args=(q, toFile)) for x in range(int(processes))]
    for p in mp:
        p.start()
    for p in mp:
        p.join()
This one seems to run, but not as expected. It uses multiple processes on the same frame from the Queue; doesn't Queue have locks?
Also, in this case the number of processes I'm allowed to use must divide the number of frames without a remainder. For example:
if I have 10 frames, I can only use 1, 2, 5 or 10 processes; if I use 3 or 4, it will create a process while the Queue is empty and it won't work.
If you want to recycle the process until the queue is empty, you should try something like this:
code1:
def process_frame():
    while True:
        frame = queue.get()
        ## do something
Your process will be blocked until there is something in the queue.
I don't think it's a good idea to use multiprocessing on the consumer part; you should use it on the producer.
If you want to terminate the process when the queue is empty, you can do something like this:
code2:
def process_frame():
    while not queue.empty():
        frame = queue.get()
        ## do something
    terminate_process()
Update:
If you want to use multiprocessing in the consumer part, just do a simple loop and add code2; then you will be able to close your processes when you finish working with the queue (see the sketch below).
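A rough sketch of that suggestion, assuming several consumer processes each run a code2-style loop over a shared multiprocessing.Queue; results are pushed onto a second queue so only the parent writes the output file (process_frame and the frame data are placeholders, Python 2 syntax to match the question):
from multiprocessing import Process, Queue
import Queue as QueueModule   # standard-library module that defines the Empty exception

def process_frame(in_q, out_q):
    # loop until the input queue reports empty, then exit
    while True:
        try:
            frame_num, frame = in_q.get_nowait()
        except QueueModule.Empty:
            return
        out_q.put("processed frame %d" % frame_num)   # placeholder for real work

if __name__ == "__main__":
    in_q, out_q = Queue(), Queue()
    for i in range(10):
        in_q.put((i, "frame-%d" % i))
    processes = int(raw_input('Enter the number of process you want to use: '))
    consumers = [Process(target=process_frame, args=(in_q, out_q))
                 for _ in range(processes)]
    for p in consumers:
        p.start()
    for p in consumers:
        p.join()
    with open("commands.txt", "w") as commands_file:
        while not out_q.empty():
            commands_file.write(out_q.get() + "\n")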
I am not entirely sure what you are trying to accomplish from your explanation, but have you considered using multiprocessing.Pool with its map or map_async methods?
from multiprocessing import Pool
from foo import bar  # your function

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    result = p.map_async(bar, [("arg #1", "arg #2"), ...])
    print result.get()
It collects the results from your function into an iterable that you can use however you wish. (Note that map_async preserves the order of the inputs; imap_unordered is the variant that yields results in completion order.)
UPDATE
I think you should not use a queue here and should be more straightforward:
from multiprocessing import Pool

def process_frame(fr):  # PEP8 and see the difference in definition
    # magic
    return result  # and result handling!

if __name__ == "__main__":
    p = Pool(4)  # your number of processes
    results = p.map_async(process_frame, [fr_1, fr_2, ...])
    # Do not ever write or manipulate with files in parallel processes
    # if you are not 100% sure what you are doing!
    for result in results.get():
        commands_file.write(result)
UPDATE 2
from multiprocessing import Pool
import random
import time

def f(x):
    return x*x

def g(yr):
    # callback runs in the parent process, so only it touches the file
    with open("result.txt", "ab") as out:
        for y in yr:
            out.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # here you fetch new data and send it to process
        new_data = [random.randint(1, 50) for i in range(4)]
        pool.map_async(f, new_data, callback=g)
        time.sleep(1)  # throttle so batches are not submitted faster than workers finish
This is some example of how to do it; I updated the algorithm to be "infinite", so it can only be stopped by interruption or a kill command from outside. You can also use apply_async, but it would slow down result handling (depending on the speed of processing).
I have also tried keeping result.txt open long-term in global scope, but it hit a deadlock every time.
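For comparison, a rough sketch of the apply_async variant mentioned above, with one callback per task instead of one per batch; the function names mirror the example and the throttling sleep is my own addition:
from multiprocessing import Pool
import random
import time

def f(x):
    return x * x

def write_result(y):
    # callbacks run in the parent process, so only one process touches the file
    with open("result.txt", "a") as out:
        out.write("{}\n".format(y))

if __name__ == '__main__':
    pool = Pool(4)
    while True:
        # fetch a new batch of data and submit each item separately
        new_data = [random.randint(1, 50) for i in range(4)]
        for x in new_data:
            pool.apply_async(f, (x,), callback=write_result)
        time.sleep(1)   # throttle so submissions do not outrun the workers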

Run separate processes in parallel - Python

I use the Python 'multiprocessing' module to run single processes on multiple cores, but I want to run a couple of independent processes in parallel.
For example, process one parses large files, process two finds patterns in different files, and process three does some calculation; can these three different processes, which have different sets of arguments, be run in parallel?
def Process1(largefile):
    # Parse large file
    # runtime 2 hrs
    return parsed_file

def Process2(bigfile):
    # Find pattern in big file
    # runtime 2.5 hrs
    return pattern

def Process3(integer):
    # Do astronomical calculation
    # runtime 2.25 hrs
    return calculation_results

def FinalProcess(parsed, pattern, calc_results):
    # Do analysis
    # runtime 10 min
    return final_results

def main():
    parsed = Process1(largefile)
    pattern = Process2(bigfile)
    calc_res = Process3(integer)
    Final = FinalProcess(parsed, pattern, calc_res)

if __name__ == '__main__':
    main()
    sys.exit()
In the above pseudo-code, Process1, Process2 and Process3 are single-core processes, i.e. they can't be run on multiple processors. These processes run sequentially and take 2 + 2.5 + 2.25 hrs = 6.75 hrs. Is it possible to run these three processes in parallel, so that they run at the same time on different processors/cores, and when the most time-consuming one (Process2) finishes, we move on to FinalProcess?
From 16.6.1.5. Using a pool of workers:
from multiprocessing import Pool

def f(x):
    return x*x

if __name__ == '__main__':
    pool = Pool(processes=4)            # start 4 worker processes
    result = pool.apply_async(f, [10])  # evaluate "f(10)" asynchronously
    print result.get(timeout=1)         # prints "100" unless your computer is *very* slow
    print pool.map(f, range(10))        # prints "[0, 1, 4,..., 81]"
You can, therefore, apply_async against a pool and get your results after everything is ready.
from multiprocessing import Pool

# all your method declarations above go here
# (...)

def main():
    pool = Pool(processes=3)
    parsed = pool.apply_async(Process1, [largefile])
    pattern = pool.apply_async(Process2, [bigfile])
    calc_res = pool.apply_async(Process3, [integer])
    pool.close()
    pool.join()
    final = FinalProcess(parsed.get(), pattern.get(), calc_res.get())

# your __main__ handler goes here
# (...)
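For completeness, the __main__ handler referred to in the comment above could stay exactly as in the question's pseudo-code (sys import assumed):
import sys

if __name__ == '__main__':
    main()
    sys.exit()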
