I read the multiprocessing documentation for Python and found that tasks can be assigned to different CPU cores. I would like to run the following code (as a start) in parallel.
from multiprocessing import Process
import os

def do(a):
    for i in range(a):
        print i

if __name__ == "__main__":
    proc1 = Process(target=do, args=(3,))
    proc2 = Process(target=do, args=(6,))
    proc1.start()
    proc2.start()
Now I get the output as 1 2 3 and then 1 ... 6, but I need it to be 1 1 2 2 ..., i.e. I want proc1 and proc2 to run in parallel (not one after the other).
So you can have your code execute in parallel just by using map. I am using a delay (with time.sleep) to slow the code down so it prints as you want it to. If you don't use the sleep, the first process will finish before the second starts… and you get 0 1 2 0 1 2 3 4 5.
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool()
>>>
>>> def do(a):
...     for i in range(a):
...         import time
...         time.sleep(1)
...         print i
...
>>> _ = p.map(do, [3,6])
0
0
1
1
2
2
3
4
5
>>>
I'm using the multiprocessing fork pathos.multiprocessing because I'm the author and I'm too lazy to code it in a file. pathos enables you to do multiprocessing in the interpreter, but otherwise it's basically the same.
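If you'd rather stick with the standard library, a roughly equivalent script using multiprocessing.Pool looks like the sketch below; it reuses the do function and the [3, 6] inputs from above, and unlike the interpreter session it needs to live in a file so the worker function is importable by the child processes.

# Sketch: the same map-based approach with the stdlib multiprocessing.Pool.
from multiprocessing import Pool
import time

def do(a):
    for i in range(a):
        time.sleep(1)   # slow the work down so the two tasks interleave
        print(i)

if __name__ == "__main__":
    p = Pool()
    p.map(do, [3, 6])   # runs do(3) and do(6) in separate worker processes
    p.close()
    p.join()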
You can also use the library pp. I prefer pp over multiprocessing because it allows for parallel processing across different CPUs on the network. A function (func) can be applied to a list of inputs (args) with a few lines of code:
job_server = pp.Server(ncpus=num_local_procs, ppservers=nodes)
jobs = [job_server.submit(func, (arg,)) for arg in args]
result = [job() for job in jobs]
You can also check out more examples at: https://github.com/gopiks/mappy/blob/master/map.py
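For a self-contained illustration, here is a minimal local pp run; the square function, the input list, and ncpus=2 are made up for the example:

# Hypothetical minimal pp example: square a list of numbers on local CPUs.
import pp

def square(x):
    return x * x

job_server = pp.Server(ncpus=2)                           # local workers only
jobs = [job_server.submit(square, (n,)) for n in [1, 2, 3, 4]]
print([job() for job in jobs])                            # [1, 4, 9, 16]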
I have the following section of code, which uses multiprocessing to run chi2(i) and then prints out the full output:
import cmath, csv, sys, math, re
import numpy as np
import multiprocessing as mp

x1 = np.zeros(npt, dtype=float)
x2 = np.zeros(npt, dtype=float)

def chi2(i):
    print("wavelength", i+1, " of ", npt)
    # ...some calculations that generate x1[(i)], x2[(i)] and x[(1,i)]...
    print("\t", i+1, "x1:", x1[(i)])
    print("\t", i+1, "x2:", x2[(i)])
    x[(1,i)] = x1[(i)] * x2[(i)]
    print("\t", i+1, "x:", x[(1,i)])
    return x[(1,i)]

#-----------single process--------------
#for i in range(npt):
#    chi2(i)

#------------parallel processes-------------
pool = mp.Pool(cpu)
x[1] = pool.map(chi2, [i for i in range(npt)])
pool.close()

#general output
print("x: \n", x.T)
If I run the script using a single process (commented section in script), the output is in the form I desire:
wavelength 1 of 221
1 x1: -0.3253846181978943
1 x2: -0.012596285460978723
1 x: 0.004098637535432249
wavelength 2 of 221
2 x1: -0.35587046869939154
2 x2: -0.014209153301058522
2 x: 0.005056618045069202
...
x:
[[3.30000000e+02 4.09863754e-03]
[3.40000000e+02 5.05661805e-03]
[3.50000000e+02 6.20083938e-03]
...
However, if I run the script with parallel processes, the output of wavelength i of npt is printed after that of print("x: \n",x.T) even though it appears first in the script:
x:
[[3.30000000e+02 4.09863754e-03]
[3.40000000e+02 5.05661805e-03]
[3.50000000e+02 6.20083938e-03]
...
wavelength 1 of 221
1 x1: -0.3253846181978943
1 x2: -0.012596285460978723
1 x: 0.004098637535432249
wavelength 2 of 221
2 x1: -0.35587046869939154
2 x2: -0.014209153301058522
2 x: 0.005056618045069202
...
I suspect this has something to do with the processing time of the mp.Pool, which takes longer to produce its output after pool.close() than the simpler print("x: \n", x.T). How can I correct the sequence of output so that running the script with parallel processes gives the same order of output as the single-process run?
The point of multiprocessing is to run two processes simultaneously rather than sequentially. Since the processes are independent of each other, they print to the console independently, so the order of printing may change from execution to execution.
When you do pool.close(), the pool closes but its processes continue to run. The main process, on the other hand, continues and prints to the console.
If you want to print only after the processes of the pool are done executing, add pool.join() after pool.close(), which will wait for the pool's processes to finish before proceeding with the main process.
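As a self-contained illustration of that ordering (the work function and inputs here are invented for the demo, not taken from your script):

import multiprocessing as mp

def work(i):
    # stand-in for chi2: just announce which task ran
    print("task", i, "done")
    return i * i

if __name__ == "__main__":
    pool = mp.Pool(4)
    results = pool.map(work, range(8))
    pool.close()
    pool.join()                    # wait for every worker in the pool to finish
    print("results:", results)     # printed only after all task output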
I need to perform ~18000 somewhat expensive calculations on a supercomputer and I'm trying to figure out how to parallelize the code. I had it mostly working with multiprocessing.Process but it would hang at the .join() step if I did more than ~350 calculations.
One of the computer scientists managing the supercomputer recommended I use multiprocessing.Pool instead of Process.
When using Process, I would set up an output Queue and a list of processes, then run and join the processes like this:
output = mp.Queue()
processes = [mp.Process(target=some_function, args=(x, output)) for x in some_array]
for p in processes:
    p.start()
for p in processes:
    p.join()
Because processes is a list, it is iterable, and I can use output.get() inside a list comprehension to get all the results:
result = [output.get() for p in processes]
What is the equivalent of this when using a Pool? If the Pool is not iterable, how can I get the output of each process that is inside it?
Here is my attempt with dummy data and a dummy calculation:
import pandas as pd
import multiprocessing as mp

##dummy function
def predict(row, output):
    calc = [len(row.c1)**2, len(row.c2)**2]
    output.put([row.c1+' - '+row.c2, sum(calc)])

#dummy data
c = pd.DataFrame(data=[['a','bb'],['ccc','dddd'],['ee','fff'],['gg','hhhh'],['i','jjj']], columns=['c1','c2'])

if __name__ == '__main__':
    #output queue
    print('initializing output container...')
    output = mp.Manager().Queue()

    #pool of processes
    print('initializing and storing calculations...')
    pool = mp.Pool(processes=5)
    for i, row in c.iterrows(): #try some smaller subsets here
        pool.apply_async(predict, args=(row, output))

    #run processes and keep a counter-->I'm not sure what replaces this with Pool!
    #for p in processes:
    #    p.start()

    ##exit completed processes-->or this!
    #for p in processes:
    #    p.join()

    #pool.close() #is this right?
    #pool.join() #this?

    #store each calculation
    print('storing output of calculations...')
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
    print(p)
The output I get is:
initializing output container...
initializing and storing calculations...
storing output of calculations...
Traceback (most recent call last):
  File "parallel_test.py", line 37, in <module>
    p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
TypeError: 'Pool' object is not iterable
What I want is for p to print and look like:
0 1
0 a - bb 5
1 ccc - dddd 25
2 ee - fff 13
3 gg - hhhh 20
4 i - jjj 10
How do I get the output from each calculation instead of just the first one?
You store all your useful results in the queue output, so you need to fetch them by calling output.get() as many times as items were put into the queue (the number of rows, len(c), in your case). For me it works if you change the lines:
print('storing output of calculations...')
p = pd.DataFrame([output.get() for p in pool]) ## <-- this is where the code breaks because pool is not iterable
to:
print('storing output of calculations...')
p = pd.DataFrame([output.get() for _ in range(len(c))]) ## <-- no longer breaks
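Note that output.get() blocks until an item arrives, so the comprehension also implicitly waits for the workers. An alternative, if you don't mind dropping the queue entirely, is to keep the AsyncResult objects that apply_async returns and collect them with .get(); a sketch below, which assumes predict is rewritten to return its value instead of putting it on a queue, and reuses pd, mp, and c from your script:

# Sketch: let apply_async return results directly instead of using a Manager queue.
def predict(row):
    calc = [len(row.c1)**2, len(row.c2)**2]
    return [row.c1 + ' - ' + row.c2, sum(calc)]

if __name__ == '__main__':
    pool = mp.Pool(processes=5)
    async_results = [pool.apply_async(predict, args=(row,)) for _, row in c.iterrows()]
    pool.close()
    pool.join()                                    # wait for all workers to finish
    p = pd.DataFrame([r.get() for r in async_results])
    print(p)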
I used to be able to run 100 parallel processes this way:
from multiprocessing import Process
import time

def run_in_parallel(some_list):
    proc = []
    for list_element in some_list:
        time.sleep(20)
        p = Process(target=main, args=(list_element,))
        p.start()
        proc.append(p)
    for p in proc:
        p.join()

run_in_parallel(some_list)
But now my inputs are a bit more complicated and I'm getting "that" pickle error, so I had to switch to pathos.
The following minimal example of my code works well, but it seems to be limited by the number of threads. How can I get pathos to scale up to 100 parallel processes? My CPU only has 4 cores. My processes are idling most of the time, but they have to run for days. I don't mind having that "time.sleep(20)" in there for the initialization.
import itertools
from pathos.multiprocessing import ProcessingPool as Pool

input = zip(itertools.repeat((variable1, variable2, class1), len(some_list)), some_list)
p = Pool()
p.map(main, input)
edit:
Ideally I would like to do p = Pool(nodes=len(some_list)), which does not work of course.
I'm the pathos author. I'm not sure I'm interpreting your question correctly -- it's a bit easier to interpret the question when you have provided a minimal working code sample. However...
Is this what you mean?
>>> def name(x):
...     import multiprocess as mp
...     return mp.process.current_process().name
...
>>> from pathos.multiprocessing import ProcessingPool as Pool
>>> p = Pool(ncpus=10)
>>> p.map(name, range(10))
['PoolWorker-1', 'PoolWorker-2', 'PoolWorker-3', 'PoolWorker-4', 'PoolWorker-6', 'PoolWorker-5', 'PoolWorker-7', 'PoolWorker-8', 'PoolWorker-9', 'PoolWorker-10']
>>> p.map(name, range(20))
['PoolWorker-1', 'PoolWorker-2', 'PoolWorker-3', 'PoolWorker-4', 'PoolWorker-6', 'PoolWorker-5', 'PoolWorker-7', 'PoolWorker-8', 'PoolWorker-9', 'PoolWorker-10', 'PoolWorker-1', 'PoolWorker-2', 'PoolWorker-3', 'PoolWorker-4', 'PoolWorker-6', 'PoolWorker-5', 'PoolWorker-7', 'PoolWorker-8', 'PoolWorker-9', 'PoolWorker-10']
>>>
Then, for example, if you wanted to reconfigure to only use 4 cpus, you can do this:
>>> p.ncpus = 4
>>> p.map(name, range(20))
['PoolWorker-11', 'PoolWorker-11', 'PoolWorker-12', 'PoolWorker-12', 'PoolWorker-13', 'PoolWorker-13', 'PoolWorker-14', 'PoolWorker-14', 'PoolWorker-11', 'PoolWorker-11', 'PoolWorker-12', 'PoolWorker-12', 'PoolWorker-13', 'PoolWorker-13', 'PoolWorker-14', 'PoolWorker-14', 'PoolWorker-11', 'PoolWorker-11', 'PoolWorker-12', 'PoolWorker-12']
I'd worry that if you have only 4 cores but want 100-way parallelism, you may not get the scaling you expect. Depending on how long the function you want to parallelize takes, you might want to use one of the other pools, such as pathos.threading.ThreadPool or an MPI-centric pool from pyina.
What happens with only 4 cores and 100 processes is that the 4 cores will have 100 instances of Python spawned at once... so that may be a serious memory hit, and the multiple instances of Python on a single core will compete for CPU time... so it's best to play with the configuration a bit to find the right mix of resource oversubscription and resource idling.
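Since your workers spend most of their time idling, a thread pool sized to the number of tasks may be a simpler fit than 100 processes; a rough sketch follows, reusing main and the zipped input from your snippet and assuming ThreadPool accepts the usual nodes argument:

from pathos.threading import ThreadPool

# One worker thread per task: threads are cheap to create, and the
# oversubscription matters less here because the workers mostly sit idle.
tp = ThreadPool(nodes=len(some_list))   # 'nodes' sizing is an assumption
tp.map(main, input)                     # same 'main' and zipped 'input' as above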
I have a function that takes a text file as input, does some processing, and writes a pickled result to file. I'm trying to perform this in parallel across multiple files. The order in which files are processed doesn't matter, and the processing of each is totally independent. Here's what I have now:
import multiprocessing as mp
import pandas as pd
from glob import glob

def processor(fi):
    df = pd.read_table(fi)
    # ...do some processing to the df...
    filename = fi.split('/')[-1][:-4]
    df.to_pickle('{}.pkl'.format(filename))

if __name__ == '__main__':
    files = glob('/path/to/my/files/*.txt')
    pool = mp.Pool(8)
    for _ in pool.imap_unordered(processor, files):
        pass
Now, this actually works totally fine as far as I can tell, but the syntax seems really hinky and I'm wondering if there is a better way of going about it. E.g. can I get the same result without having to perform an explicit loop?
I tried map_async(processor, files), but this doesn't generate any output files (but doesn't throw any errors).
Suggestions?
You can use map_async, but you need to wait for it to finish, since the async bit means "don't block after setting off the jobs, but return immediately". If you don't wait and there's nothing after that call, your program will exit and all subprocesses will be killed immediately, before completing - not what you want!
The following example should help:
from multiprocessing.pool import Pool
from time import sleep

def my_func(val):
    print('Executing %s' % val)
    sleep(0.5)
    print('Done %s' % val)

pl = Pool()
async_result = pl.map_async(my_func, [1, 2, 3, 4, 5])
res = async_result.get()
print('Pool done: %s' % res)
The output of which (when I ran it) is:
Executing 2
Executing 1
Executing 3
Executing 4
Done 2
Done 1
Executing 5
Done 4
Done 3
Done 5
Pool done: [None, None, None, None, None]
Alternatively, plain map would also do the trick; then you don't have to wait for it, since it is not "asynchronous" and waits synchronously for all jobs to complete:
pl = Pool()
res = pl.map(my_func, [1, 2, 3, 4, 5])
print('Pool done: %s' % res)
I'm using Parallel Python for executing a computation heavy code on multiple cores.
I have an i7-4600M processor, which has 2 cores and 4 threads.
The interesting thing is that the computation takes nearly the same time whether I use 2 or 4 threads. I wrote a little example that demonstrates this phenomenon.
import itertools
import pp
import time

def cc(data, n):
    count = 0
    for A in data:
        for B in itertools.product((-1,0,1), repeat=n):
            inner_product = sum(a*b for a,b in zip(A,B))
            if inner_product == 0:
                count += 1
    return count

n = 9

for thread_count in (1, 2, 3, 4):
    print("Thread_count = {}".format(thread_count))
    ppservers = ()
    job_server = pp.Server(thread_count, ppservers=ppservers)

    datas = [[] for _ in range(thread_count)]
    for index, A in enumerate(itertools.product((0,1), repeat=n)):
        datas[index % thread_count].append(A)
    print("Data sizes: {}".format(map(len, datas)))

    time_start = time.time()
    jobs = [job_server.submit(cc, (data,n), (), ("itertools",)) for data in datas]
    result = sum(job() for job in jobs)
    time_end = time.time()

    print("Time = {}".format(time_end - time_start))
    print("Result = {}".format(result))
    print
Here's a short video of the program running and the CPU usage: https://www.screenr.com/1ULN When I use 2 threads the CPU is at 50% usage; with 4 threads it is at 100%, but it's only slightly faster. Using 2 threads I get a speedup of 1.8x, using 3 threads a speedup of 1.9x, and using 4 threads a speedup of 2x.
If the code is too fast, use n = 10 or n = 11. But be careful, the complexity is 6^n. So n = 10 will take 6x as long as n = 9.
2 cores and 4 threads means you have two hyperthreads on each core, which won't scale linearly, since they share resources and can get in each other's way, depending on the workload. Parallel Python uses processes and IPC behind the scenes. Each core is scheduling two distinct processes, so you're probably seeing cache thrashing (a core's cache is shared between hyperthreads).
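If you want to confirm what the scheduler is actually working with, you can compare logical and physical core counts; a small sketch below, using psutil, a third-party package assumed to be installed:

import multiprocessing
import psutil  # third-party; pip install psutil

print("logical CPUs:", multiprocessing.cpu_count())        # 4 on an i7-4600M
print("physical cores:", psutil.cpu_count(logical=False))  # 2 on an i7-4600M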
I know this thread is a bit old, but I figured some added data points might help. I ran this on a VM with 4 virtual CPUs (2.93 GHz Xeon X5670) and 8 GB of RAM allocated. The VM was hosted on Hyper-V and runs Python 2.7.8 on Ubuntu 14.10 64-bit, but my version of PP is the fork PPFT.
In the first run the number of threads was 4. In the second I modified the for loop to go to 8.
Output: http://pastebin.com/ByF7nbfm
Adding 4 more cores and doubling the RAM, with the same for loop looping to 8:
Output: http://pastebin.com/irKGWMRy