MPI spawns all processes on master node - python

Here's my code:
import sys
sys.path.append("/apps/anaconda3/pkgs/qgis-3.8.1-py37h59d211b_0/lib")
import qgis
import os
from qgis.core import *
from mpi4py import MPI
from timeit import default_timer as dt
import numpy as np
# Initialize MPI
comm = MPI.COMM_WORLD
rank = int(comm.Get_rank())
if rank == 0:
    print("Project loaded successfully")
# counter function
def counter(feats, vectors, rank):
    cnt = 0
    for feature in feats:
        cands = vectors.getFeatures(QgsFeatureRequest().setFilterRect(feature.geometry().boundingBox()))
        for area_feature in cands:
            if feature.geometry().intersects(area_feature.geometry()):
                cnt += 1
    return cnt
start = MPI.Wtime()
# loading layers
layer_ids = list(project.mapLayers().keys())
vec = layer_ids[0]
vectorlayer = project.mapLayers()[vec]
grid = QgsVectorLayer("/home/600.shp", "grid", "ogr")
# define number of jobs
cores = comm.Get_size()
# split data
gfeats = np.array(list(grid.getFeatures()))
subs = np.array_split(gfeats,cores)
# start process based on rank
print("rank : ",rank," len : ",len(subs[rank]))
value = counter(subs[rank],vectorlayer,rank)
ttime = MPI.Wtime()-start
# perform the reductions:
tcount = comm.reduce(value,op=MPI.SUM, root=0)
tstime = comm.reduce(ttime, op=MPI.MAX, root=0)
# Print on rank 0
if rank == 0:
    print(' Rank 0: count = ',tcount)
    print(' Rank 0: time = ',tstime)
Executed Using:
mpiexec -n 50 python mptest.py
In the code, I split the array into as many parts as there are processes (say 50) using np.array_split().
Each process (rank) then uses its rank as an index and works only on its own part.
The HPC cluster has 5 nodes with 24 cores each.
When I run the code, the run time is the same as when I used Python's multiprocessing module,
which suggests that all processes are being spawned on a single node.
How can I use all 5 nodes to run this code and get maximum parallelism?
Thanks in advance!
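For reference, the per-rank split described in the question can be reproduced without MPI or NumPy. The sketch below is only an illustration: it assumes the rule np.array_split uses (the first `n % parts` chunks get one extra item), `split_bounds` is a hypothetical helper name that is not part of the original script, and the numbers 600 and 50 are made up. It shows which slice each rank would work on, which does not depend on the node the rank runs on.

```python
def split_bounds(n_items, parts):
    """Return one (start, stop) index pair per rank, mimicking np.array_split."""
    base, extra = divmod(n_items, parts)
    bounds = []
    start = 0
    for rank in range(parts):
        # The first `extra` ranks receive one additional item.
        size = base + (1 if rank < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds

# Illustrative numbers only: 600 items shared among 50 ranks.
bounds = split_bounds(600, 50)
print(bounds[0], bounds[-1])
```

Rank r would then process items `bounds[r][0]` up to (but not including) `bounds[r][1]`.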

Related

Parallelization with ray not working as expected

I am a beginner with parallel processing and I currently experiment with a simple program to understand how Ray works.
import numpy as np
import time
from pprint import pprint
import ray
ray.init(num_cpus = 4) # Specify this system has 4 CPUs.
data_rows = 800
data_cols = 10000
batch_size = int(data_rows/4)
# Prepare data
np.random.RandomState(100)
arr = np.random.randint(0, 100, size=[data_rows, data_cols])
data = arr.tolist()
# Solution without parallelization
def howmany_within_range(row, minimum, maximum):
    """Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
    count = 0
    for n in row:
        if minimum <= n <= maximum:
            count = count + 1
    return count
results = []
start = time.time()
for row in data:
    results.append(howmany_within_range(row, minimum=75, maximum=100))
end = time.time()
print("Without parallelization")
print("-----------------------")
pprint(results[:5])
print("Total time: ", end-start, "sec")
# Parallelization with ray
results = []
y = []
z = []
w = []
@ray.remote
def solve(data, minimum, maximum):
    count = 0
    count_row = 0
    for i in data:
        for n in i:
            if minimum <= n <= maximum:
                count = count + 1
        count_row = count
        count = 0
    return count_row
start = time.time()
results = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(0, batch_size)])
y = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(1*batch_size, 2*batch_size)])
z = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(2*batch_size, 3*batch_size)])
w = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(3*batch_size, 4*batch_size)])
end = time.time()
results += y+z+w
print("With parallelization")
print("--------------------")
print(results[:5])
print("Total time: ", end-start, "sec")
I am getting much slower performance with Ray:
$ python3 raytest.py
Without parallelization
-----------------------
[2501, 2543, 2530, 2410, 2467]
Total time: 0.5162293910980225 sec
(solve pid=26294)
With parallelization
--------------------
[2501, 2543, 2530, 2410, 2467]
Total time: 1.1760196685791016 sec
In fact, if I scale up the input data I get messages in the terminal with the pid of the function and the program stalls.
Essentially, I try to split computations in batches of rows and assign each computation to a cpu core. What am I doing wrong?
There are two main problems when it comes to multiprocessing (and your code):
There is an overhead associated with spawning the new processes that do your work.
There is an overhead associated with transferring data between processes.
To spawn a new process, a new instance of the Python interpreter is created and initialized (separate processes are needed because of the GIL). Also, when you transfer data between processes, the data has to be serialized at the sender and deserialized at the receiver, which in your program happens twice (once from the main process to the workers, and again from the workers back to the main process). In short, your program spends all its time paying this overhead instead of doing the actual computation.
If you want to benefit from multiprocessing in Python, the workers should do more computation with as little data transfer as possible. My usual rule of thumb is that multiprocessing is worth considering if the task takes more than 5 seconds to complete on a single CPU.
Another good way to reduce data transfer is to slice your arrays into chunks (multiple rows) instead of passing a single row per function call, as each row has to be serialized separately, which adds extra overhead.
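The chunking idea above can be sketched in plain Python. This is a minimal illustration, not the asker's program: `chunk_rows` and `count_in_range_chunk` are hypothetical names, and the tiny `data` list is made up. Each chunk would be the argument of one remote call, so 800 rows split into 4 chunks means 4 serialized arguments instead of 800.

```python
def chunk_rows(rows, n_chunks):
    """Split `rows` into at most `n_chunks` contiguous chunks of near-equal size."""
    size = max(1, -(-len(rows) // n_chunks))  # ceiling division, at least 1
    return [rows[i:i + size] for i in range(0, len(rows), size)]

def count_in_range_chunk(rows, minimum, maximum):
    """Return one count per row, so a whole chunk is handled in a single call."""
    return [sum(1 for n in row if minimum <= n <= maximum) for row in rows]

# Made-up data: 5 short rows instead of 800 rows of 10000 values.
data = [[80, 10, 99], [5, 75, 100], [1, 2, 3], [90, 90, 90], [0, 76, 50]]
chunks = chunk_rows(data, 2)

# Sequential stand-in for "one remote call per chunk"; results keep the row order.
results = [c for chunk in chunks for c in count_in_range_chunk(chunk, 75, 100)]
print(results)
```

With Ray, the assumption is that `count_in_range_chunk` would carry the `@ray.remote` decorator and be invoked once per chunk, with a single `ray.get` over all returned references.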

Python multiprocessing: how to create x number of processes and get return value back

I have a program that I created using threads, but then I learned that threads don't run concurrently in python and processes do. As a result, I am trying to rewrite the program using multiprocessing, but I am having a hard time doing so. I have tried following several examples that show how to create the processes and pools, but I don't think it's exactly what I want.
Below is my code with the attempts I have tried. The program tries to estimate the value of pi by randomly placing points on a graph that contains a circle. The program takes two command-line arguments: one is the number of threads/processes I want to create, and the other is the total number of points to try placing on the graph (N).
import math
import sys
from time import time
import concurrent.futures
import random
import multiprocessing as mp
def myThread(arg):
    # Take care of input argument
    n = int(arg)
    print("Thread received. n = ", n)
    # main calculation loop
    count = 0
    for i in range (0, n):
        x = random.uniform(0,1)
        y = random.uniform(0,1)
        d = math.sqrt(x * x + y * y)
        if (d < 1):
            count = count + 1
    print("Thread found ", count, " points inside circle.")
    return count
# end myThread
# receive command line arguments
if (len(sys.argv) == 3):
    N = sys.argv[1] # original ex: 0.01
    N = int(N)
    totalThreads = sys.argv[2]
    totalThreads = int(totalThreads)
    print("N = ", N)
    print("totalThreads = ", totalThreads)
else:
    print("Incorrect number of arguments!")
    sys.exit(1)
if ((totalThreads == 1) or (totalThreads == 2) or (totalThreads == 4) or (totalThreads == 8)):
    print()
else:
    print("Invalid number of threads. Please use 1, 2, 4, or 8 threads.")
    sys.exit(1)
# start experiment
t = int(time() * 1000) # begin run time
total = 0
# ATTEMPT 1
# processes = []
# for i in range(totalThreads):
#     process = mp.Process(target=myThread, args=(N/totalThreads))
#     processes.append(process)
#     process.start()
# for process in processes:
#     process.join()
# ATTEMPT 2
# pool = mp.Pool(mp.cpu_count())
# total = pool.map(myThread, [N/totalThreads])
# ATTEMPT 3
# for i in range(totalThreads):
#     total = total + pool.map(myThread, [N/totalThreads])
#     p = mp.Process(target=myThread, args=(N/totalThreads))
#     p.start()
# ATTEMPT 4
# with concurrent.futures.ThreadPoolExecutor() as executor:
#     for i in range(totalThreads):
#         future = executor.submit(myThread, N/totalThreads) # start thread
#         total = total + future.result() # get result
# analyze results
pi = 4 * total / N
print("pi estimate =", pi)
delta_time = int(time() * 1000) - t # calculate time required
print("Time =", delta_time, " milliseconds")
I thought that creating a loop from 0 to totalThreads that creates a process for each iteration would work. I also wanted to pass in N/totalThreads (to divide the work), but it seems that processes take in an iterable list rather than an argument to pass to the method.
What is it I am missing with multiprocessing? Is it at all possible to even do what I want to do with processes?
Thank you in advance for any help, it is greatly appreciated :)
I have simplified your code and used some hard-coded values which may or may not be reasonable.
import math
import concurrent.futures
import random
from datetime import datetime
def myThread(arg):
    count = 0
    for i in range(0, arg[0]):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        d = math.sqrt(x * x + y * y)
        if (d < 1):
            count += 1
    return count
N = 10_000
T = 8
_start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {executor.submit(myThread, (int(N / T),)): _ for _ in range(T)}
    total = 0
    for future in concurrent.futures.as_completed(futures):
        total += future.result()
_end = datetime.now()
print(f'Estimate for PI = {4 * total / N}')
print(f'Run duration = {_end-_start}')
A typical output on my machine looks like this:
Estimate for PI = 3.1472
Run duration = 0:00:00.008895
Bear in mind that the number of threads actually started is managed by the ThreadPoolExecutor (TPE) when it is constructed with no parameters: its default worker limit is derived from your machine's processing capacity (number of cores etc.). Therefore you could, if you really wanted to, set T to a very high number; the TPE queues the extra tasks and runs them only as worker threads become free.
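Since the question asks specifically about processes and their return values, here is a hedged variant that swaps in ProcessPoolExecutor from the same concurrent.futures module. It is a sketch under assumptions, not the answer's code: `split_work`, `count_inside` and `estimate_pi` are hypothetical names, and the per-worker seeds are an illustrative choice so that each worker uses its own random stream.

```python
import random
from concurrent.futures import ProcessPoolExecutor

def split_work(total, workers):
    """Split `total` points into `workers` integer shares that sum to `total`."""
    base, extra = divmod(total, workers)
    return [base + (1 if i < extra else 0) for i in range(workers)]

def count_inside(args):
    """Count random points of the unit square that fall inside the quarter circle."""
    n, seed = args
    rng = random.Random(seed)  # private generator for this task
    count = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1.0:
            count += 1
    return count

def estimate_pi(total, workers):
    """Run one task per worker process and combine the returned counts."""
    shares = split_work(total, workers)
    tasks = [(n, seed) for seed, n in enumerate(shares)]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        inside = sum(pool.map(count_inside, tasks))  # map yields each task's return value
    return 4 * inside / total

if __name__ == "__main__":
    # The guard keeps child processes from re-running this call when they import the module.
    print("pi estimate =", estimate_pi(200_000, 4))
```

The integer split avoids passing a float such as N/totalThreads to range(), and pool.map takes an iterable with one entry per task, which is why the work is packed into a list of tuples.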

python MPI for dictionary iteration

I would like to split the iteration over a dictionary across processes using MPI (the Message Passing Interface, via mpi4py).
For example:
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
tmp_list = []
for key, value in some_dict.items():
    tmp_values = some_function(key,value)
    tmp_list.append(tmp_values)
This is the simple serial code.
How do I write the MPI code for this iteration?
You will first need to convert the dict into a list and then split that list into as many chunks as there are processes. This is necessary so that comm.scatter can send one part of the data to each process. The final results can then be collected on the root process using comm.gather.
script.py
#!/usr/bin/python
from mpi4py import MPI
comm = MPI.COMM_WORLD
size = comm.Get_size() # get number of processes
rank = comm.Get_rank() # get the current rank of process
def some_fun(x,y): # some random function
    return x+y
def chunkIt(seq, num): # function to chunk list into {size} parts for scatter to work
    avg = len(seq) / float(num)
    out = []
    last = 0.0
    while last < len(seq):
        out.append(seq[int(last):int(last + avg)])
        last += avg
    return out
some_dict = {i: i**2 for i in range(100)} # some random data to work on
if rank == 0:
    data = chunkIt(list(some_dict.items()), size) # convert dict into list first and then divide the data into {size} parts
    # print(f"rank: {rank} / data: {data}")
else:
    data = None
data = comm.scatter(data, root=0) # scatter the data across the given processes
print(f"rank: {rank} / data: {data}\n")
sub_dict = dict(data) # convert the list into dict
tmp_list = [] # local to each process
for key, value in sub_dict.items():
    tmp_values = some_fun(key, value)
    tmp_list.append(tmp_values)
tmp_list = comm.gather(tmp_list, root=0) # gathering data from all procs to root proc
if rank == 0:
    print(f"length of tmp_list on rank: {rank} is: {len(tmp_list)}")
    print(f"tmp_list: {tmp_list}") # tmp_list is a list of lists; make sure to convert it into the required data structure
else:
    assert tmp_list is None
Make it executable using chmod:
chmod +x script.py
and then run:
mpiexec -n 4 script.py
Here -n is the number of processes to run.
Note: I am using Ubuntu 16.04, Python 3.7.10 and mpi4py==3.0.3.
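Because the scattered list needs exactly one entry per process, it can help to build the chunks with integer arithmetic so that the chunk count is exact by construction. The sketch below is a hedged alternative to the float-based chunkIt above, not part of the original answer: `chunk_exact` is a hypothetical name, and the snippet runs without MPI just to show the property.

```python
def chunk_exact(seq, num):
    """Split `seq` into exactly `num` contiguous chunks (some may be empty)."""
    base, extra = divmod(len(seq), num)
    out, start = [], 0
    for i in range(num):
        # The first `extra` chunks get one additional element.
        stop = start + base + (1 if i < extra else 0)
        out.append(seq[start:stop])
        start = stop
    return out

some_dict = {i: i**2 for i in range(10)}  # small made-up data
chunks = chunk_exact(list(some_dict.items()), 4)
print([len(c) for c in chunks])
```

In the script above, the assumption is that `chunk_exact(list(some_dict.items()), size)` would simply replace the chunkIt call on rank 0.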

Increase speed by eliminating loops

I have the following problem. The code below successfully fits a line to my data from sample 50 to sample 400 (I never have more than 400 samples, and the first 50 are of horrendous quality). The third dimension has size 7 and the fourth dimension can have up to 10000 values, so this loop "solution" would take a lot of time. How can I avoid the for loops and reduce my run time? Thank you for your help (I am pretty new to Python).
from sklearn.linear_model import TheilSenRegressor
import numpy as np
#ransac = linear_model.RANSACRegressor()
skip_v=50#number of values to be skipped
N=400
test_n=np.reshape(range(skip_v, N),(-1,1))
f_n=7
d4=np.shape(data)
a6=np.ones((f_n,d4[3]))
b6=np.ones((f_n,d4[3]))
for j in np.arange(d4[3]):
    for i in np.arange(f_n):
        theil = TheilSenRegressor(random_state=0).fit(test_n,np.log(data[skip_v:,3,i,j]))
        a6[i,j]=theil.coef_
        b6[i,j]=theil.intercept_
You can use multiprocessing to run the loop iterations in parallel. The following code is not a working example as-is (data still has to be defined); it only demonstrates the approach. It is only worthwhile if your problem is really large. Otherwise, running sequentially is faster.
from sklearn.linear_model import TheilSenRegressor
import numpy as np
import multiprocessing as mp
from itertools import product
def worker_function(input_queue, output_queue, skip_v, test_n, data):
    # keep taking tasks until the 'STOP' sentinel arrives
    for task in iter(input_queue.get, 'STOP'):
        i = task[0]
        j = task[1]
        theil = TheilSenRegressor(random_state=0).fit(test_n,np.log(data[skip_v:,3,i,j]))
        output_queue.put([i, j, theil])
if __name__ == "__main__":
    # define data here
    f_n = 7
    d4 = np.shape(data)
    skip_v = 50
    N=400
    test_n=np.reshape(range(skip_v, N),(-1,1))
    # result arrays, as in the question
    a6 = np.ones((f_n, d4[3]))
    b6 = np.ones((f_n, d4[3]))
    input_queue = mp.Queue()
    output_queue = mp.Queue()
    # here you create all combinations of j and i of your loop
    list1 = range(f_n)
    list2 = range(d4[3])
    list3 = [list1, list2]
    tasks = [p for p in product(*list3)]
    numProc = 4
    # start processes
    process = [mp.Process(target=worker_function,
                          args=(input_queue, output_queue,
                                skip_v, test_n, data)) for x in range(numProc)]
    for p in process:
        p.start()
    # queue tasks
    for i in tasks:
        input_queue.put(i)
    # signal workers to stop after tasks are all done
    for i in range(numProc):
        input_queue.put('STOP')
    # get the results
    for i in range(len(tasks)):
        res = output_queue.get(block=True) # wait for results
        a6[res[0], res[1]] = res[2].coef_
        b6[res[0], res[1]] = res[2].intercept_
    # wait for the workers to exit
    for p in process:
        p.join()
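The sentinel-terminated loop that worker_function relies on, `iter(input_queue.get, 'STOP')`, can be tried in isolation. The sketch below is only an illustration of that pattern: it uses threads and queue.Queue as a stand-in so it runs without sklearn or real data, and the "fit" is replaced by a dummy arithmetic expression. With multiprocessing the shape is the same, using mp.Queue and mp.Process instead.

```python
import queue
import threading

def worker(input_queue, output_queue):
    # iter(callable, sentinel) calls input_queue.get() until it returns 'STOP'.
    for i, j in iter(input_queue.get, 'STOP'):
        output_queue.put((i, j, i * 10 + j))  # dummy stand-in for the real fit

input_queue, output_queue = queue.Queue(), queue.Queue()
tasks = [(i, j) for i in range(2) for j in range(3)]
n_workers = 2

threads = [threading.Thread(target=worker, args=(input_queue, output_queue))
           for _ in range(n_workers)]
for t in threads:
    t.start()
for task in tasks:
    input_queue.put(task)
for _ in range(n_workers):
    input_queue.put('STOP')  # one sentinel per worker, queued after all tasks
for t in threads:
    t.join()

# Results arrive in completion order, so each one carries its own (i, j) index.
results = sorted(output_queue.get() for _ in tasks)
print(results)
```

Carrying the indices with each result is what lets the main process write into a6[i, j] and b6[i, j] regardless of the order in which workers finish.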

MPI processor quantity creates error, how to implement broadcast?

I have created a Python program to calculate pi. I then decided to write it with mpi4py to run with several processes. The program works, but it returns a different value for pi than the original Python version. As I looked into this problem more, I found that it returns a less accurate value when I run it with more processors. Why does the MPI version change the result with more processors? Also, would it make more sense to use a broadcast rather than sending lots of individual messages? How would I implement a broadcast if it is more effective?
MPI version:
#!/apps/moose/miniconda/bin/python
from mpi4py import MPI
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
def f(x):
    return (1-(float(x)**2))**float(0.5)
n = 1000000
nm = dict()
pi = dict()
for i in range(1,size+1):
    if i == size:
        nm[i] = (i*n/size)+1
    else:
        nm[i] = i*n/size
if rank == 0:
    val = 0
    for i in range(0,nm[1]):
        val = val+f(float(i)/float(n))
    val = val*2
    pi[0] = (float(2)/n)*(float(1)+val)
    print name, "rank", rank, "calculated", pi[0]
    for i in range(1, size):
        pi[i] = comm.recv(source=i, tag=i)
    number = sum(pi.itervalues())
    number = "%.20f" %(number)
    import time
    time.sleep(0.3)
    print "Pi is approximately", number
for proc in range(1, size):
    if proc == rank:
        val = 0
        for i in range(nm[proc]+1,nm[proc+1]):
            val = val+f(float(i)/float(n))
        val = val*2
        pi[proc] = (float(2)/n)*(float(1)+val)
        comm.send(pi[proc], dest=0, tag = proc)
        print name, "rank", rank, "calculated", pi[proc]
Original Python version:
#!/usr/bin/python
n = 1000000
def f(x):
    return (1-(float(x)**2))**float(0.5)
val = 0
for i in range(n):
    i = i+1
    val = val+f(float(i)/float(n))
val = val*2
pi = (float(2)/n)*(float(1)+val)
print pi
Your code estimates pi by computing the area of a quarter of a disk, that is, the integral of sqrt(1 - x^2) over [0, 1], using the trapezoidal rule.
The problem of your code is that the ranges of the i values for each process are not complete. Indeed, use a small n and print i to see what is happening. For instance, for i in range(nm[proc]+1,nm[proc+1]): must be changed to for i in range(nm[proc],nm[proc+1]):. Otherwise, i=nm[proc] is never handled.
In addition, in pi[0] = (float(2)/n)*(float(1)+val) and pi[proc] = (float(2)/n)*(float(1)+val), the term float(1) comes from x=0 in the integral. But it is counted many times, once by each process! As the number of errors varies directly with the number of processes, increasing the number of processes decreases the accuracy, which is the symptom that you have reported.
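The two points above can be checked without MPI. The sketch below is a plain-Python illustration under assumptions: `partial_sum` and `pi_from_parts` are hypothetical names, and the "processes" are simulated by a loop. The interior points i = 1 .. n-1 are split into contiguous ranges with no gaps, and the endpoint terms f(0) = 1 and f(1) = 0 are added exactly once, after the partial sums are combined.

```python
import math

def f(x):
    return math.sqrt(max(0.0, 1.0 - x * x))

def partial_sum(n, start, stop):
    """Sum f(i/n) for i in [start, stop): the share of one simulated process."""
    return sum(f(i / n) for i in range(start, stop))

def pi_from_parts(n, size):
    # Contiguous ranges covering i = 1 .. n-1 with no gaps and no overlaps.
    bounds = [1 + (k * (n - 1)) // size for k in range(size + 1)]
    total = sum(partial_sum(n, bounds[k], bounds[k + 1]) for k in range(size))
    # Trapezoidal rule: both endpoints carry weight 1/2 and are counted once.
    return (4.0 / n) * (0.5 * f(0.0) + total + 0.5 * f(1.0))

n = 100_000
serial = pi_from_parts(n, 1)
for size in (2, 4, 7):
    # Differences are at the level of floating-point rounding only.
    print(size, abs(pi_from_parts(n, size) - serial))
```

If each simulated process added its own copy of the f(0) term instead, the result would drift by 2/n per extra process, which matches the loss of accuracy reported in the question.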
A broadcast corresponds to a situation where all processes of a communicator must get the same piece of data from a given process. Here, on the contrary, data from all processes must be combined using a sum to produce a result available to a single process (called the "root"). This operation is called a reduction, and it is performed by comm.Reduce().
Here is a piece of code based on yours using comm.Reduce() instead of send() and recv().
from mpi4py import MPI
import numpy as np
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()
name = MPI.Get_processor_name()
def f(x):
    return (1-(float(x)**2))**float(0.5)
n = 10000000
nm = np.zeros(size+1,'i')
nm[0] = 1
for i in range(1,size+1):
    if i == size:
        nm[i] = n
    else:
        nm[i] = (i*n)/size
val = 0
for i in range(nm[rank],nm[rank+1]):
    val = val+f((float(i))/float(n))
out = np.array(0.0, 'd')
vala = np.array(val, 'd')
comm.Reduce([vala,MPI.DOUBLE],[out,MPI.DOUBLE],op=MPI.SUM,root=0)
if rank == 0:
    number = (float(4)/n)*(out)+float(2)/n
    number = "%.20f" %(number)
    import time
    time.sleep(0.3)
    print "Pi is approximately", number
