Parallelization with ray not working as expected

Parallelization with ray not working as expected - python

I am a beginner with parallel processing and I currently experiment with a simple program to understand how Ray works.
import numpy as np
import time
from pprint import pprint
import ray
ray.init(num_cpus = 4) # Specify this system has 4 CPUs.
data_rows = 800
data_cols = 10000
batch_size = int(data_rows/4)
# Prepare data
np.random.RandomState(100)
arr = np.random.randint(0, 100, size=[data_rows, data_cols])
data = arr.tolist()
# Solution Without Paralleization
def howmany_within_range(row, minimum, maximum):
"""Returns how many numbers lie within `maximum` and `minimum` in a given `row`"""
count = 0
for n in row:
if minimum <= n <= maximum:
count = count + 1
return count
results = []
start = time.time()
for row in data:
results.append(howmany_within_range(row, minimum=75, maximum=100))
end = time.time()
print("Without parallelization")
print("-----------------------")
pprint(results[:5])
print("Total time: ", end-start, "sec")
# Parallelization with ray
results = []
y = []
z = []
w = []
#ray.remote
def solve(data, minimum, maximum):
count = 0
count_row = 0
for i in data:
for n in i:
if minimum <= n <= maximum:
count = count + 1
count_row = count
count = 0
return count_row
start = time.time()
results = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(0, batch_size)])
y = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(1*batch_size, 2*batch_size)])
z = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(2*batch_size, 3*batch_size)])
w = ray.get([solve.remote(data[i:i+1], 75, 100) for i in range(3*batch_size, 4*batch_size)])
end = time.time()
results += y+z+w
print("With parallelization")
print("--------------------")
print(results[:5])
print("Total time: ", end-start, "sec")
I am getting much slower performance with Ray:
$ python3 raytest.py
Without parallelization
-----------------------
[2501, 2543, 2530, 2410, 2467]
Total time: 0.5162293910980225 sec
(solve pid=26294)
With parallelization
--------------------
[2501, 2543, 2530, 2410, 2467]
Total time: 1.1760196685791016 sec
In fact, if I scale up the input data I get messages in the terminal with the pid of the function and the program stalls.
Essentially, I try to split computations in batches of rows and assign each computation to a cpu core. What am I doing wrong?

there are two main problems when it comes to multiprocessing (your code)
there's an overhead associated with spawning the new processes to do your work.
there's an overhead associated with transferring data between different processes.
in order to spawn a new process, a new instance of the python interpreter is created and initialized (due to the GIL). also when you transfer data between processes, this data has to be serialized/deserialized at the sender/receiver, which in your program is happening twice (once from main process to workers, and again from workers to the main process.), so in short your program is spending all it's time paying this overhead instead of doing the actual computation.
if you want to utilize the benefit of multiprocessing in python you should have more computation being done at the workers using as little data transfer as possible, the way I usually determine if using multiprocessing will be a good idea is if the task is going to take more than 5 seconds to complete on a single cpu.
another good idea to reduce data transfer is slicing your arrays in chucks (multiple rows) instead of a single row per function call, as each row has to be serialized separately, which adds extra overhead.

Related

python multiprocess very slow

I'm having trouble using python multiprocess.
im trying with a minimal version of code:
import os
os.environ["OMP_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["OPENBLAS_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["MKL_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["VECLIB_MAXIMUM_THREADS"] = "1" # just in case the system uses multithrad somehow
os.environ["NUMEXPR_NUM_THREADS"] = "1" # just in case the system uses multithrad somehow
import numpy as np
from datetime import datetime as dt
from multiprocessing import Pool
from pandas import DataFrame as DF
def trytrytryshare(times):
i = 0
for j in range(times):
i+=1
return
def trymultishare(thread = 70 , times = 10):
st = dt.now()
args_l = [(times,) for i in range(thread)]
print(st)
p = Pool(thread)
for i in range(len(args_l)):
p.apply_async(func = trytrytryshare, args = (args_l[i]))
p.close()
p.join()
timecost = (dt.now()-st).total_seconds()
print('%d threads finished in %f secs' %(thread,timecost))
return timecost
if __name__ == '__main__':
res = DF(columns = ['thread','timecost'])
n = 0
for j in range(5):
for i in range(1,8,3):
timecost = trymultishare(thread = i,times = int(1e8))
res.loc[n] = [i,timecost]
n+=1
timecost = trymultishare(thread = 70,times = int(1e8))
res.loc[n] = [70,timecost]
n+=1
res_sum = res.groupby('thread').mean()
res_sum['decay'] = res_sum.loc[1,'timecost'] / res_sum['timecost']
on my own computer (8cores):
on my server (80 cores, im the only one using it)
i tried again, make one thread job longer.
the decay is really bad....
any idea how to "fix" this, or this is just what i can get when using multi-process?
thanks

The way you're timing apply_async is flawed. You won't know when the subprocesses have completed unless you wait for their results.
It's a good idea to work out an optimum process pool size based on number of CPUs. The code that follows isn't necessarily the best for all cases but it's what I use.
You shouldn't set the pool size to the number of processes you intend to run. That's the whole point of using a pool.
So here's a simpler example of how you could test subprocess performance.
from multiprocessing import Pool
from time import perf_counter
from os import cpu_count
def process(n):
r = 0
for _ in range(n):
r += 1
return r
POOL = max(cpu_count()-2, 1)
N = 1_000_000
def main(procs):
# no need for pool size to be bigger than the numer of processes to be run
poolsize = min(POOL, procs)
with Pool(poolsize) as pool:
_start = perf_counter()
for result in [pool.apply_async(process, (N,)) for _ in range(procs)]:
result.wait() # wait for async processes to terminate
_end = perf_counter()
print(f'Duration for {procs} processes with pool size of {poolsize} = {_end-_start:.2f}s')
if __name__ == '__main__':
print(f'CPU count = {cpu_count()}')
for procs in range(10, 101, 10):
main(procs)
Output:
CPU count = 20
Duration for 10 processes with pool size of 10 = 0.12s
Duration for 20 processes with pool size of 18 = 0.19s
Duration for 30 processes with pool size of 18 = 0.18s
Duration for 40 processes with pool size of 18 = 0.28s
Duration for 50 processes with pool size of 18 = 0.30s
Duration for 60 processes with pool size of 18 = 0.39s
Duration for 70 processes with pool size of 18 = 0.42s
Duration for 80 processes with pool size of 18 = 0.45s
Duration for 90 processes with pool size of 18 = 0.54s
Duration for 100 processes with pool size of 18 = 0.59s

My guess is that you're observing the cost of spawning new processes, since apply_async returns immediately. It's much cheaper to spawn one process in the case of thread==1 instead of spawning 70 processes (your last case with the worst decay).
The fact that the server with 80 cores performs better than you laptop with 8 cores could be due to the server containing better hardware in general (better heat removal, faster CPU, etc) or it might contain a different OS. Benchmarking across different machines is non-trivial.

Python multiprocessing: how to create x number of processes and get return value back

I have a program that I created using threads, but then I learned that threads don't run concurrently in python and processes do. As a result, I am trying to rewrite the program using multiprocessing, but I am having a hard time doing so. I have tried following several examples that show how to create the processes and pools, but I don't think it's exactly what I want.
Below is my code with the attempts I have tried. The program tries to estimate the value of pi by randomly placing points on a graph that contains a circle. The program takes two command-line arguments: one is the number of threads/processes I want to create, and the other is the total number of points to try placing on the graph (N).
import math
import sys
from time import time
import concurrent.futures
import random
import multiprocessing as mp
def myThread(arg):
# Take care of imput argument
n = int(arg)
print("Thread received. n = ", n)
# main calculation loop
count = 0
for i in range (0, n):
x = random.uniform(0,1)
y = random.uniform(0,1)
d = math.sqrt(x * x + y * y)
if (d < 1):
count = count + 1
print("Thread found ", count, " points inside circle.")
return count;
# end myThread
# receive command line arguments
if (len(sys.argv) == 3):
N = sys.argv[1] # original ex: 0.01
N = int(N)
totalThreads = sys.argv[2]
totalThreads = int(totalThreads)
print("N = ", N)
print("totalThreads = ", totalThreads)
else:
print("Incorrect number of arguments!")
sys.exit(1)
if ((totalThreads == 1) or (totalThreads == 2) or (totalThreads == 4) or (totalThreads == 8)):
print()
else:
print("Invalid number of threads. Please use 1, 2, 4, or 8 threads.")
sys.exit(1)
# start experiment
t = int(time() * 1000) # begin run time
total = 0
# ATTEMPT 1
# processes = []
# for i in range(totalThreads):
# process = mp.Process(target=myThread, args=(N/totalThreads))
# processes.append(process)
# process.start()
# for process in processes:
# process.join()
# ATTEMPT 2
#pool = mp.Pool(mp.cpu_count())
#total = pool.map(myThread, [N/totalThreads])
# ATTEMPT 3
#for i in range(totalThreads):
#total = total + pool.map(myThread, [N/totalThreads])
# p = mp.Process(target=myThread, args=(N/totalThreads))
# p.start()
# ATTEMPT 4
# with concurrent.futures.ThreadPoolExecutor() as executor:
# for i in range(totalThreads):
# future = executor.submit(myThread, N/totalThreads) # start thread
# total = total + future.result() # get result
# analyze results
pi = 4 * total / N
print("pi estimate =", pi)
delta_time = int(time() * 1000) - t # calculate time required
print("Time =", delta_time, " milliseconds")
I thought that creating a loop from 0 to totalThreads that creates a process for each iteration would work. I also wanted to pass in N/totalThreads (to divide the work), but it seems that processes take in an iterable list rather than an argument to pass to the method.
What is it I am missing with multiprocessing? Is it at all possible to even do what I want to do with processes?
Thank you in advance for any help, it is greatly appreciated :)

I have simplified your code and used some hard-coded values which may or may not be reasonable.
import math
import concurrent.futures
import random
from datetime import datetime
def myThread(arg):
count = 0
for i in range(0, arg[0]):
x = random.uniform(0, 1)
y = random.uniform(0, 1)
d = math.sqrt(x * x + y * y)
if (d < 1):
count += 1
return count
N = 10_000
T = 8
_start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
futures = {executor.submit(myThread, (int(N / T),)): _ for _ in range(T)}
total = 0
for future in concurrent.futures.as_completed(futures):
total += future.result()
_end = datetime.now()
print(f'Estimate for PI = {4 * total / N}')
print(f'Run duration = {_end-_start}')
A typical output on my machine looks like this:-
Estimate for PI = 3.1472
Run duration = 0:00:00.008895
Bear in mind that the number of threads you start is effectively managed by the ThreadPoolExecutor (TPE) [ when constructed with no parameters ]. It makes decisions about the number of threads that can run based on your machine's processing capacity (number of cores etc). Therefore you could, if you really wanted to, set T to a very high number and the TPE will block execution of any new threads until it determines that there is capacity.

ProcessPoolExecutor on shared dataset and multiple arguments

I am facing an issue I was not able to solve by doing some search on the web.
I am using the minimal code below. The goal is to run some function 'f_sum' several million times by multiprocessing (using the ProcessPoolExecutor). I am adding multiple arguments by a list of tuples 'args'. In addition, the function is supposed to use some sort of data which is the same for all executions (in the example it's just one number). I do not want to add the data to the 'args' tuple for memory reasons.
The only option I found so far is adding the data outside of the "if name == 'main'". This will (for some reason that I do not understand) make the variable available to all processes. However, updating is not possible. Also, I do not really want to make the data definition outside because in the actual code it will be based on data import and might require additional manipulation.
Hope you can help and thanks in advance!
PS: I am using Python 3.7.9 on Win 10.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
data = 0 # supposed to be a large data set & shared among all calculations)
num_workers = 6 # number of CPU cores
num_iterations = 10 # supposed to be large number
def f_sum(args):
(x,y) = args
print('This is process', x, 'with exponent:', y)
value = 0
for i in range(10**y):
value += i
return value/10**y + data
def multiprocessing(func, args, workers):
with ProcessPoolExecutor(workers) as executor:
results = executor.map(func, args)
return list(results)
if __name__ == '__main__':
data = 0.5 # try to update data, should not be part of 'args' due to memory
args = []
for k in range(num_iterations):
args.append((k, np.random.randint(1,8)))
result = multiprocessing(f_sum, args, num_workers)
if np.abs(result[0]-np.round(result[0])) > 0:
print('data NOT updated')
Edit to original question:
>> Performance Example 1
from concurrent.futures import ProcessPoolExecutor
import numpy as np
import time
data_size = 10**8
num_workers = 4
num_sum = 10**7
num_iterations = 100
data = np.random.randint(0,100,size=data_size)
# data = np.linspace(0,data_size,data_size+1, dtype=np.uintc)
def f_sum(args):
(x,y) = args
print('This is process', x, 'random number:', y, 'last data', data[-1])
value = 0
for i in range(num_sum):
value += i
result = value - num_sum*(num_sum-1)/2 + data[-1]
return result
def multiprocessing(func, args, workers):
with ProcessPoolExecutor(workers) as executor:
results = executor.map(func, args)
return list(results)
if __name__ == '__main__':
t0 = time.time()
args = []
for k in range(num_iterations):
args.append((k, np.random.randint(1,10)))
result = multiprocessing(f_sum, args, num_workers)
print(f'expected result: {data[-1]}, actual result: {np.unique(result)}')
t1 = time.time()
print(f'total time: {t1-t0}')
>> Output
This is process 99 random number: 6 last data 9
expected result: 86, actual result: [ 3. 9. 29. 58.]
total time: 11.760863542556763
Leads to false result if randint is used. For linspace result is correct.
>> Performance Example 2 - based on proposal in answer
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from multiprocessing import Array
import time
data_size = 10**8
num_workers = 4
num_sum = 10**7
num_iterations = 100
input = np.random.randint(0, 100, size=data_size)
# input = np.linspace(0, data_size, data_size + 1, dtype=np.uintc)
def f_sum(args):
(x,y) = args
print('This is process', x, 'random number:', y, 'last data', data[-1])
value = 0
for i in range(num_sum):
value += i
result = value - num_sum*(num_sum-1)/2 + data[-1]
return result
def init_pool(the_data):
global data
data = the_data
def multiprocessing(func, args, workers, input):
data = Array('i', input, lock=False)
with ProcessPoolExecutor(max_workers=workers, initializer=init_pool, initargs=(data,)) as executor:
results = list(executor.map(func, args))
return results
if __name__ == '__main__':
t0 = time.time()
args = []
for k in range(num_iterations):
args.append((k, np.random.randint(1,10)))
result = multiprocessing(f_sum, args, num_workers, input)
print(f'expected result: {input[-1]}, actual result:{np.unique(result)}')
t1 = time.time()
print(f'total time: {t1-t0}')
>> Output
This is process 99 random number: 7 last data 29
expected result: 29, actual result: [29.]
total time: 30.8266122341156
#Booboo
I added two examples to my original question, the "Performance Example 2" is based on your code. First interesting finding, my original code actually gives incorrect results if the data array is initialized with random integers. I noticed, that each process by itself initializes the data array. Since it is based on random numbers each process uses a different array for calculation, and even different than the main. So that use case would not work with this code, in your code it is correct all the time.
If using linspace, however, it works, since this gives the same result each time. Same would be true for the use case where some data is read from a file (which is my actual use case). Example 1 is still about 3x faster than Example 2, and I think the time is mainly used by the initializing of the array in your method.
Regarding memory usage I don't see a relevant difference in my task manager. Both Example produce a similar increase in memory, even if the shape is different.
I still believe that your method is the correct approach, however, memory usage seems to be similar and speed is slower in the example above.

The most efficient used of memory would be to use shared memory so that all processes are working on the same instance of data. This would be absolutely necessary if the processes updated data. In the example below, since the access to data is read only and I am using a simple array of integers, I am using multiprocessing.Array with no locking specified. The "trick" is to initialize your pool by specifying the initializer and initargs arguments so that each process in the pool has access to this shared memory. I have made a couple of other changes to the code, which I have commented
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from multiprocessing import Array, cpu_count # new imports
def init_pool(the_data):
global data
data = the_data
def f_sum(args):
(x,y) = args
print('This is process', x, 'with exponent:', y)
value = 0
for i in range(10**y):
value += i
return value/10**y + len(data) # just use the length of data for now
def multiprocessing(func, args, workers):
data = Array('i', range(1000), lock=False) # read-only, integers 0, 1, 2, ... 999
with ProcessPoolExecutor(max_workers=workers, initializer=init_pool, initargs=(data,)) as executor:
results = list(executor.map(func, args)) # create the list of results here
print(results) # so that it can be printed out for demo purposes
return results
if __name__ == '__main__':
num_iterations = 10 # supposed to be large number
#num_workers = 6 # number of CPU cores
num_workers = cpu_count() # number of CPU cores
args = []
for k in range(num_iterations):
args.append((k, np.random.randint(1,8)))
result = multiprocessing(f_sum, args, num_workers)
if np.abs(result[0]-np.round(result[0])) > 0:
print('data NOT updated')
Prints:
This is process 0 with exponent: 2
This is process 1 with exponent: 1
This is process 2 with exponent: 4
This is process 3 with exponent: 3
This is process 4 with exponent: 5
This is process 5 with exponent: 1
This is process 6 with exponent: 5
This is process 7 with exponent: 2
This is process 8 with exponent: 6
This is process 9 with exponent: 6
[1049.5, 1004.5, 5999.5, 1499.5, 50999.5, 1004.5, 50999.5, 1049.5, 500999.5, 500999.5]
data NOT updated
Updated Example 2
You saw my comments to your question concerning Example 1.
Your Example 2 is still not ideal: You have the statement input = np.random.randint(0, 100, size=data_size) as a global being needlessly executed by every process as it is initialized for use in the process pool. Below is an updated solution that also shows one way how you can have your worker function work directly with a numpy array that is backed up a multiprocessing.Array instance so that the numpy array exists in shared memory. You don't have to use this technique for what you are doing since you are only using numpy to create random numbers (I an not sure why), but it is a useful technique to know. But you should re-rerun your code after moving the initialization code of input as I have so it is only executed once.
I don't have the occasion to work with numpy day to day but I have come to learn that it uses multiprocessing internally for many of its own functions. So it is often not the best match for use with multiprocessing, although that does not seem to be applicable here since even in the case below we are just indexing an element of an array and it would not be using a sub-process to accomplish that.
from concurrent.futures import ProcessPoolExecutor
import numpy as np
from multiprocessing import Array
import time
import ctypes
data_size = 10**8
num_workers = 4
num_sum = 10**7
num_iterations = 100
# input = np.linspace(0, data_size, data_size + 1, dtype=np.uintc)
def to_shared_array(arr, ctype):
shared_array = Array(ctype, arr.size, lock=False)
temp = np.frombuffer(shared_array, dtype=arr.dtype)
temp[:] = arr.flatten(order='C')
return shared_array
def to_numpy_array(shared_array, shape):
'''Create a numpy array backed by a shared memory Array.'''
arr = np.ctypeslib.as_array(shared_array)
return arr.reshape(shape)
def f_sum(args):
(x,y) = args
print('This is process', x, 'random number:', y, 'last data', data[-1])
value = 0
for i in range(num_sum):
value += i
result = value - num_sum*(num_sum-1)/2 + data[-1]
return result
def init_pool(shared_array, shape):
global data
data = to_numpy_array(shared_array, shape)
def multiprocessing(func, args, workers, input):
input = np.random.randint(0, 100, size=data_size)
shape = input.shape
shared_array = to_shared_array(input, ctypes.c_long)
with ProcessPoolExecutor(max_workers=workers, initializer=init_pool, initargs=(shared_array, shape)) as executor:
results = list(executor.map(func, args))
return input, results
if __name__ == '__main__':
t0 = time.time()
args = []
for k in range(num_iterations):
args.append((k, np.random.randint(1,10)))
input, result = multiprocessing(f_sum, args, num_workers, input)
print(f'expected result: {input[-1]}, actual result:{np.unique(result)}')
t1 = time.time()
print(f'total time: {t1-t0}')

My computationally intensive Numba function runs 10x slower on the GPU than on CPU. Am I missing anything?

I'ḿ testing Numba's performance and tried a dummy but computationally intense function two ways: on the CPU with parallelism enabled (and prange), and on the GPU where I see it occupies 100% of the GPU when running. Both run fine, but the CPU one takes 10X less time to complete. I was epxecting the GPU to be faster in this case, even though my GPU is not very strong (Geforce 1050 ti) and my CPU is strong (Threadripper 3970x).
Here is my CPU benchmark:
from numba import *
import numpy as np
from numba import cuda
import time
def benchmark():
input_list = np.random.randint(10, size=320000)
out = np.zeros(len(input_list))
cpu_run_test(input_list, out)
print('Result size: ' + str(len(out)) + ' ' + str(out))
#njit(parallel=True, fastmath=True, nogil=True)
def cpu_run_test(input_list, out):
for step in range(len(input_list)):
for j in range(10):
count = 0
for item2 in input_list:
if input_list[step] == item2:
count = count + 1
out[step] = count
if __name__ == '__main__':
import timeit
print(timeit.timeit("benchmark()", setup="from __main__ import benchmark", number=1))
And here is my GPU benchmark (same computation, just partitioning work differently to take advantage of the GPUs blocks and threads appropriately:
from numba import *
import numpy as np
from numba import cuda
import time
def benchmark():
for xx in range(1):
new_array_duration = time.time()
input_list = np.random.randint(10, size=320000)
new_array_duration = time.time() - new_array_duration
print('New array duration: ' + str(new_array_duration))
to_device_duration = time.time()
d_array = cuda.to_device(input_list)
to_device_duration = time.time() - to_device_duration
print('To device duration: ' + str(to_device_duration))
kernel_duration = time.time()
run_test[16, 768](d_array)
kernel_duration = time.time() - kernel_duration
print('Kernel duration: ' + str(kernel_duration))
to_host_duration = time.time()
out = d_array.copy_to_host()
to_host_duration = time.time() - to_host_duration
print('To host duration: ' + str(to_host_duration))
#cuda.jit(fastmath=True)
def run_test(d_array):
array_slice_len = len(d_array) / cuda.blockDim.x
slice_start = (cuda.threadIdx.x * (cuda.blockIdx.x + 1)) * array_slice_len
for step in prange(slice_start, slice_start + array_slice_len):
if step > len(d_array) - 1:
return
for j in range(10):
count = 0
for item2 in d_array:
if d_array[step] == item2:
count = count + 1
d_array[step] = count
if __name__ == '__main__':
import timeit
# make_multithread(benchmark, 64)
print(timeit.timeit("benchmark()", setup="from __main__ import benchmark", number=1))
Anyone can just copy and paste and run either. I'm on Linux Mint 20, Python 3.7, latest Numba (0.51) and latest cudatoolkit installed.
The results are as follows:
CPU: 15.20 secs
GPU: 145.54 secs
Is this correct or I am missing some optimizations for getting the GPU code running faster?
What am I missing?

Not necessarely an answer, but way too long for a comment.
Some things I would try; altering the blocks and threads and see what happens. Also utilizing Local, Shared and Global Memory on the GPU.
Global Memory is the only one that can be set by the host and is accessible to all threads. It is also the slowest type of memory for access, so you generally want to minimize the amount of transfers and use Memory Coalescing where possible.
Shared Memory is memory that is available for all threads in one Block and is faster than global memory. Local Memory is the fastest and only visible to one thread.
Is starting prange on the GPU actually using the GPU threads? Because normaly (non-python) you don't have to invoke stuff in parallel on the GPU. Maybe that loop runs in serial instead?
You should also try to find an algorithm that does not involve branching (if there is one). If I understand your code correctly, you have a list with 320.000 items containing random integers in the range [0;10). For each item, you count how often that item appears and replace the item with the count of that item. And 10x for good measure. This will cause the items of the resulting set to oscillate between the values [0;10) and somewhat between [0;320.000), but probably more around 32.000 in general.
To make it easier to transform the algorithm by getting rid of that unpleasant oscillation, increasing the list size by factor 10 and getting rid of the loop should do the same.
As I don't speak python, here's some pseudocode for a non-branching vectorized implementation. Although I don't know if this will be faster at all, it is something I would try.
{l} local memory
{s} shared memory
{g} global memory
Host:
input_list{g} = (0..9)[3.200.000]
accumulator{g} = (0..0)[10]
run _1
run _2
Device Kernel _1:
let blocksize = get_blocksize()
let threadId = get_threadId()
let globalId = get_globalId()
let localSum{s} = arr[10]
let localCopy{s} = arr[blocksize.x]
if(globalId.x < size(input_list)) {
localCopy[threadId] = input_list[globalId]
localSum[localCopy[threadId]]++
}
if(threadId.x < 10) {
accumulator[threadId] += localSum[threadId]
}
Device Kernel _2:
let blocksize = get_blocksize()
let threadId = get_threadId()
let globalId = get_globalId()
let localAcc{l} = arr[10]
for(i = 0 to 10) {
localAcc[i] = accumulator[i]
}
if(globalId < size(input_list)) {
let val = input_list[globalId]
input_list[globalId] = localAcc[val]
}

Parallelization/multiprocessing of conditional for loop

I want to use multiprocessing in Python to speed up a while loop.
More specifically:
I have a matrix (samples*features). I want to select x subsets of samples whose values at a random subset of features is unequal to a certain value (-1 in this case).
My serial code:
np.random.seed(43)
datafile = '...'
df = pd.read_csv(datafile, sep=" ", nrows = 89)
no_feat = 500
no_samp = 5
no_trees = 5
i=0
iter=0
samples = np.zeros((no_trees, no_samp))
features = np.zeros((no_trees, no_feat))
while i < no_trees:
rand_feat = np.random.choice(df.shape[1], no_feat, replace=False)
iter_order = np.random.choice(df.shape[0], df.shape[0], replace=False)
samp_idx = []
a=0
#--------------
#how to run in parallel?
for j in iter_order:
pot_samp = df.iloc[j, rand_feat]
if len(np.where(pot_samp==-1)[0]) == 0:
samp_idx.append(j)
if len(samp_idx) == no_samp:
print a
break
a+=1
#--------------
if len(samp_idx) == no_samp:
samples[i,:] = samp_idx
features[i, :] = rand_feat
i+=1
iter+=1
if iter>1000: #break if subsets cannot be found
break
Searching for fitting samples is the potentially expensive part (the j for loop), which in theory can be run in parallel. In some cases, it is not necessary to iterate over all samples to find a large enough subset, which is why I am breaking out of the loop as soon as the subset is large enough.
I am struggling to find an implementation that would allow for checks of how many valid results are generated already. Is it even possible?
I have used joblib before. If I understand correctly this uses the pool methods of multiprocessing as a backend which only works for separate tasks? I am thinking that queues might be helpful but thus far I failed at implementing them.

I found a working solution. I decided to run the while loop in parallel and have the different processes interact over a shared counter. Furthermore, I vectorized the search for suitable samples.
The vectorization yielded a ~300x speedup and running on 4 cores speeds up the computation ~twofold.
First I tried to implement separate processes and put the results into a queue. Turns out these aren't made to store large amounts of data.
If someone sees another bottleneck in that code I would be glad if someone pointed it out.
With my basically nonexistent knowledge about parallel computing I found it really hard to puzzle this together, especially since the example on the internet are all very basic. I learnt a lot though =)
My code:
import numpy as np
import pandas as pd
import itertools
from multiprocessing import Pool, Lock, Value
from datetime import datetime
import settings
val = Value('i', 0)
worker_ID = Value('i', 1)
lock = Lock()
def findSamp(no_trees, df, no_feat, no_samp):
lock.acquire()
print 'starting worker - {0}'.format(worker_ID.value)
worker_ID.value +=1
worker_ID_local = worker_ID.value
lock.release()
max_iter = 100000
samp = []
feat = []
iter_outer = 0
iter = 0
while val.value < no_trees and iter_outer<max_iter:
rand_feat = np.random.choice(df.shape[1], no_feat, replace=False
#get samples with random features from dataset;
#find and select samples that don't have missing values in the random features
samp_rand = df.iloc[:,rand_feat]
nan_idx = np.unique(np.where(samp_rand == -1)[0])
all_idx = np.arange(df.shape[0])
notnan_bool = np.invert(np.in1d(all_idx, nan_idx))
notnan_idx = np.where(notnan_bool == True)[0]
if notnan_idx.shape[0] >= no_samp:
#if enough samples for random feature subset, select no_samp samples randomly
notnan_idx_rand = np.random.choice(notnan_idx, no_samp, replace=False)
rand_feat_rand = rand_feat
lock.acquire()
val.value += 1
#x = val.value
lock.release()
#print 'no of trees generated: {0}'.format(x)
samp.append(notnan_idx_rand)
feat.append(rand_feat_rand)
else:
#increase iter_outer counter if no sample subset could be found for random feature subset
iter_outer += 1
iter+=1
if iter >= max_iter:
print 'exiting worker{0} because iter >= max_iter'.format(worker_ID_local)
else:
print 'worker{0} - finished'.format(worker_ID_local)
return samp, feat
def initialize(*args):
global val, worker_ID, lock
val, worker_ID, lock = args
def star_findSamp(i_df_no_feat_no_samp):
return findSamp(*i_df_no_feat_no_samp)
if __name__ == '__main__':
np.random.seed(43)
datafile = '...'
df = pd.read_csv(datafile, sep=" ", nrows = 89)
df = df.fillna(-1)
df = df.iloc[:, 6:]
no_feat = 700
no_samp = 10
no_trees = 5000
startTime = datetime.now()
print 'starting multiprocessing'
ncores = 4
p = Pool(ncores, initializer=initialize, initargs=(val, worker_ID, lock))
args = itertools.izip([no_trees]*ncores, itertools.repeat(df), itertools.repeat(no_feat), itertools.repeat(no_samp))
result = p.map(star_findSamp, args)#, callback=log_result)
p.close()
p.join()
print '{0} sample subsets for tree training have been found'.format(val.value)
samples = [x[0] for x in result if x != None]
samples = np.vstack(samples)
features = [x[1] for x in result if x != None]
features = np.vstack(features)
print datetime.now() - startTime

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Parallelization with ray not working as expected - python

Related

python multiprocess very slow

Python multiprocessing: how to create x number of processes and get return value back

ProcessPoolExecutor on shared dataset and multiple arguments

My computationally intensive Numba function runs 10x slower on the GPU than on CPU. Am I missing anything?

Parallelization/multiprocessing of conditional for loop

Categories

Resources