Python multiprocessing is slower than regular processing. How can I improve?

Basically I have a script that combs a dataset of nodes/points to remove those that overlap. The actual script is more complicated, but I pared it down to a simple overlap check that does nothing with the result, for demonstration.
I tried a few variants with locks, queues, and pools, adding one job at a time versus adding in bulk. Some of the worst offenders were slower by a couple of orders of magnitude. Eventually I got it to the fastest I could.
The overlap-checking algorithm, which is sent to the individual processes:
def check_overlap(args):
    tolerance = args['tolerance']
    this_coords = args['this_coords']
    that_coords = args['that_coords']

    overlaps = False
    distance_x = this_coords[0] - that_coords[0]
    if distance_x <= tolerance:
        distance_x = pow(distance_x, 2)
        distance_y = this_coords[1] - that_coords[1]
        if distance_y <= tolerance:
            distance = pow(distance_x + pow(distance_y, 2), 0.5)
            if distance <= tolerance:
                overlaps = True
    return overlaps
The processing function:
def process_coords(coords, num_processors=1, tolerance=1):
    import multiprocessing as mp
    import time

    if num_processors > 1:
        pool = mp.Pool(num_processors)
        start = time.time()
        print "Start script w/ multiprocessing"
    else:
        num_processors = 0
        start = time.time()
        print "Start script w/ standard processing"

    total_overlap_count = 0

    # outer loop through nodes
    start_index = 0
    last_index = len(coords) - 1
    while start_index <= last_index:
        # nature of the original problem means we can process all pairs of a single node at once, but not multiple, so batch jobs by outer loop
        batch_jobs = []

        # inner loop against all pairs for this node
        start_index += 1
        count_overlapping = 0
        for i in range(start_index, last_index+1, 1):
            if num_processors:
                # add job
                batch_jobs.append({
                    'tolerance': tolerance,
                    'this_coords': coords[start_index],
                    'that_coords': coords[i]
                })
            else:
                # synchronous processing
                this_coords = coords[start_index]
                that_coords = coords[i]
                distance_x = this_coords[0] - that_coords[0]
                if distance_x <= tolerance:
                    distance_x = pow(distance_x, 2)
                    distance_y = this_coords[1] - that_coords[1]
                    if distance_y <= tolerance:
                        distance = pow(distance_x + pow(distance_y, 2), 0.5)
                        if distance <= tolerance:
                            count_overlapping += 1

        if num_processors:
            res = pool.map_async(check_overlap, batch_jobs)
            results = res.get()
            for r in results:
                if r:
                    count_overlapping += 1

        # stuff normally happens here to process nodes connected to this node
        total_overlap_count += count_overlapping

    print total_overlap_count
    print " time: {0}".format(time.time() - start)
And the testing code:
from random import random

coords = []
num_coords = 1000
spread = 100.0
half_spread = 0.5*spread
for i in range(num_coords):
    coords.append([
        random()*spread-half_spread,
        random()*spread-half_spread
    ])

process_coords(coords, 1)
process_coords(coords, 4)
Still, the non-multiprocessing version consistently runs in less than 0.4 s, while the best I can get out of the multiprocessing version as it stands above is just under 3.0 s. I get that maybe the algorithm here is too simple to really reap benefits, but considering the above case has nearly half a million iterations and the real case has significantly more, it's weird to me that multiprocessing is an order of magnitude slower.
What am I missing / what can I do to improve?

Building O(N**2) 3-element dicts that aren't used in the serial code, and transmitting them over interprocess pipes, is a pretty good way to guarantee multiprocessing can't help ;-) Nothing comes for free - everything costs.
Below is a rewrite that executes much the same code regardless of whether it's run in serial or multiprocessing modes. No new dicts, etc. In general, the larger len(coords), the more benefit it gets from multiprocessing. On my box, at 20000 the multiprocessing run takes about a third of the wall-clock time.
Key to this is that all processes have their own copy of coords. This is done below by transmitting it just once, when the pool is created. That should work on all platforms. On Linux-y systems, it could happen "by magic" instead via forked process inheritance. Reducing the amount of data sent across processes from O(N**2) to O(N) is a huge improvement.
Getting more out of multiprocessing would require better load balancing. As is, a call to check_overlap(i) compares coords[i] to each value in coords[i+1:]. The larger i, the less work there is for it to do, and for the largest values of i just the cost of transmitting i between processes - and transmitting the result back - swamps the amount of time spent in check_overlap(i).
def init(*args):
    global _coords, _tolerance
    _coords, _tolerance = args

def check_overlap(start_index):
    coords, tolerance = _coords, _tolerance
    tsq = tolerance ** 2
    overlaps = 0
    start0, start1 = coords[start_index]
    for i in range(start_index + 1, len(coords)):
        that0, that1 = coords[i]
        dx = abs(that0 - start0)
        if dx <= tolerance:
            dy = abs(that1 - start1)
            if dy <= tolerance:
                if dx**2 + dy**2 <= tsq:
                    overlaps += 1
    return overlaps

def process_coords(coords, num_processors=1, tolerance=1):
    global _coords, _tolerance
    import multiprocessing as mp
    _coords, _tolerance = coords, tolerance
    import time

    if num_processors > 1:
        pool = mp.Pool(num_processors, initializer=init, initargs=(coords, tolerance))
        start = time.time()
        print("Start script w/ multiprocessing")
    else:
        num_processors = 0
        start = time.time()
        print("Start script w/ standard processing")

    N = len(coords)
    if num_processors:
        total_overlap_count = sum(pool.imap_unordered(check_overlap, range(N)))
    else:
        total_overlap_count = sum(check_overlap(i) for i in range(N))

    print(total_overlap_count)
    print(" time: {0}".format(time.time() - start))

if __name__ == "__main__":
    from random import random

    coords = []
    num_coords = 20000
    spread = 100.0
    half_spread = 0.5*spread
    for i in range(num_coords):
        coords.append([
            random()*spread-half_spread,
            random()*spread-half_spread
        ])

    process_coords(coords, 1)
    process_coords(coords, 4)
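To act on the load-balancing point made before the code, here is a hedged sketch of my own (not part of the answer above): hand each task a cheap outer index together with an expensive one, so every task does roughly the same number of comparisons, and pass a chunksize so fewer results cross the pipe one at a time. The names check_overlap_pair and balanced_indices are illustrative assumptions; check_overlap is the worker function from the rewrite above and still relies on the pool initializer.
def check_overlap_pair(pair):
    # run two outer indices in one task; index i does ~N-1-i comparisons and
    # index N-1-i does ~i, so each pair costs roughly the same
    i, j = pair
    total = check_overlap(i)
    if j != i:
        total += check_overlap(j)
    return total

def balanced_indices(n):
    # pair the cheapest outer index with the most expensive one: (0, n-1), (1, n-2), ...
    for i in range((n + 1) // 2):
        yield (i, n - 1 - i)

# inside process_coords(), the multiprocessing branch could then become:
# total_overlap_count = sum(pool.imap_unordered(check_overlap_pair,
#                                               balanced_indices(len(coords)),
#                                               chunksize=16))
Whether this wins in practice depends on how much the per-task IPC cost dominates; the chunksize value here is just a placeholder to tune.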

Related

How can I implement multithreading in this for loop?

Consider this code snippet
from tqdm import trange

def main_game(depth1, depth2):
    # some operator with complexity O(20^max(depth1,depth2))
    return depth1+depth2

DEPTH_MAX = 5
total = 0
for depth1 in range(1, DEPTH_MAX + 1):
    for depth2 in range(1, DEPTH_MAX + 1):
        for i in trange(100):
            total += main_game(depth1, depth2)
print(total)
I'm using a minimax algorithm in main_game() with branching factor = 10.
Now, since the third for-loop calls a time-consuming function (up to 100*O(20^5) in time complexity), is there any way I can make it run faster? I'm thinking of parallelizing (multithreading, for example). Any suggestions?
Use multiprocessing, and from there Pool().starmap(). starmap() feeds your function the prepared tuples of arguments in parallel and collects the results synchronously.
If the order of the results doesn't matter, you can use the asynchronous version, .starmap_async().get().
There are also Pool().apply() and Pool().map() with their _async() versions, but you really only need to learn Pool().starmap(); the differences are just syntax.
import multiprocessing as mp

n_cpu = mp.cpu_count()

# let's say your function is a dyadic function (takes two arguments)
def main_game(depth1, depth2):
    return depth1 + depth2

DEPTH_MAX = 5
depths = list(range(1, DEPTH_MAX + 1))

# let's pre-prepare the arguments - because that goes fast!
depth1_depth2_pairs = [(d1, d2) for d1 in depths for d2 in depths]

# 1: Init multiprocessing.Pool()
pool = mp.Pool(n_cpu)

# 2: pool.starmap()
results = pool.starmap(main_game, depth1_depth2_pairs)

# 3: pool.close()
pool.close()

total = sum(results)  # this does your `total +=`

## in this case, you could even use (before closing the pool)
# results = pool.starmap_async(main_game, depth1_depth2_pairs).get()
## because the order doesn't matter, if you sum them all up,
## which is commutative.
You can write all of this a little more nicely using the with construct (it closes the pool automatically, even if an error occurs, so it doesn't just save you typing but is also safer):
import multiprocessing as mp

n_cpu = mp.cpu_count()

def main_game(depth1, depth2):
    return depth1 + depth2

DEPTH_MAX = 5
depths = range(1, DEPTH_MAX + 1)
depth1_depth2_pairs = [(d1, d2) for d1 in depths for d2 in depths]

with mp.Pool(n_cpu) as pool:
    results = pool.starmap_async(main_game, depth1_depth2_pairs).get()

total = sum(results)
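One practical note to add (mine, not the answer's): on platforms where multiprocessing spawns workers rather than forking them (Windows, and macOS on current Python versions), the pool creation must sit under an if __name__ == "__main__": guard, because each worker re-imports the module. A minimal sketch of the same starmap_async pattern with the guard:
import multiprocessing as mp

def main_game(depth1, depth2):
    # placeholder for the real O(20^max(depth1, depth2)) evaluation
    return depth1 + depth2

if __name__ == "__main__":
    DEPTH_MAX = 5
    depths = range(1, DEPTH_MAX + 1)
    pairs = [(d1, d2) for d1 in depths for d2 in depths]
    with mp.Pool(mp.cpu_count()) as pool:
        # order doesn't matter since we only sum the results
        total = sum(pool.starmap_async(main_game, pairs).get())
    print(total)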

Python multiprocessing: how to create x number of processes and get return value back

I have a program that I created using threads, but then I learned that threads don't run in parallel in Python, while processes do. As a result, I am trying to rewrite the program using multiprocessing, but I am having a hard time doing so. I have tried following several examples that show how to create the processes and pools, but I don't think it's exactly what I want.
Below is my code with the attempts I have tried. The program tries to estimate the value of pi by randomly placing points on a graph that contains a circle. The program takes two command-line arguments: one is the number of threads/processes I want to create, and the other is the total number of points to try placing on the graph (N).
import math
import sys
from time import time
import concurrent.futures
import random
import multiprocessing as mp

def myThread(arg):
    # Take care of input argument
    n = int(arg)
    print("Thread received. n = ", n)

    # main calculation loop
    count = 0
    for i in range(0, n):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        d = math.sqrt(x * x + y * y)
        if (d < 1):
            count = count + 1
    print("Thread found ", count, " points inside circle.")
    return count
# end myThread

# receive command line arguments
if (len(sys.argv) == 3):
    N = sys.argv[1]  # original ex: 0.01
    N = int(N)
    totalThreads = sys.argv[2]
    totalThreads = int(totalThreads)
    print("N = ", N)
    print("totalThreads = ", totalThreads)
else:
    print("Incorrect number of arguments!")
    sys.exit(1)

if ((totalThreads == 1) or (totalThreads == 2) or (totalThreads == 4) or (totalThreads == 8)):
    print()
else:
    print("Invalid number of threads. Please use 1, 2, 4, or 8 threads.")
    sys.exit(1)

# start experiment
t = int(time() * 1000)  # begin run time
total = 0

# ATTEMPT 1
# processes = []
# for i in range(totalThreads):
#     process = mp.Process(target=myThread, args=(N/totalThreads))
#     processes.append(process)
#     process.start()
# for process in processes:
#     process.join()

# ATTEMPT 2
# pool = mp.Pool(mp.cpu_count())
# total = pool.map(myThread, [N/totalThreads])

# ATTEMPT 3
# for i in range(totalThreads):
#     total = total + pool.map(myThread, [N/totalThreads])
#     p = mp.Process(target=myThread, args=(N/totalThreads))
#     p.start()

# ATTEMPT 4
# with concurrent.futures.ThreadPoolExecutor() as executor:
#     for i in range(totalThreads):
#         future = executor.submit(myThread, N/totalThreads)  # start thread
#         total = total + future.result()  # get result

# analyze results
pi = 4 * total / N
print("pi estimate =", pi)
delta_time = int(time() * 1000) - t  # calculate time required
print("Time =", delta_time, " milliseconds")
I thought that creating a loop from 0 to totalThreads that creates a process for each iteration would work. I also wanted to pass in N/totalThreads (to divide the work), but it seems that processes take in an iterable list rather than an argument to pass to the method.
What is it I am missing with multiprocessing? Is it at all possible to even do what I want to do with processes?
Thank you in advance for any help, it is greatly appreciated :)
I have simplified your code and used some hard-coded values which may or may not be reasonable.
import math
import concurrent.futures
import random
from datetime import datetime

def myThread(arg):
    count = 0
    for i in range(0, arg[0]):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        d = math.sqrt(x * x + y * y)
        if (d < 1):
            count += 1
    return count

N = 10_000
T = 8

_start = datetime.now()
with concurrent.futures.ThreadPoolExecutor() as executor:
    futures = {executor.submit(myThread, (int(N / T),)): _ for _ in range(T)}
    total = 0
    for future in concurrent.futures.as_completed(futures):
        total += future.result()
_end = datetime.now()

print(f'Estimate for PI = {4 * total / N}')
print(f'Run duration = {_end-_start}')
A typical output on my machine looks like this:-
Estimate for PI = 3.1472
Run duration = 0:00:00.008895
Bear in mind that the number of threads you start is effectively managed by the ThreadPoolExecutor (TPE) when it is constructed with no parameters. It makes decisions about the number of threads that can run based on your machine's processing capacity (number of cores, etc.). Therefore you could, if you really wanted to, set T to a very high number, and the TPE will defer starting new threads until it determines that there is capacity.
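Since the question asked specifically about processes (a CPU-bound loop like this cannot run in parallel on CPython threads because of the GIL), here is a hedged variant of the same idea using ProcessPoolExecutor. It is my sketch, not the answer's code; count_inside takes the point count directly rather than a tuple, and the __main__ guard is required because worker processes re-import the module.
import math
import random
import concurrent.futures

def count_inside(n):
    # count random points that fall inside the unit quarter-circle
    count = 0
    for _ in range(n):
        x = random.uniform(0, 1)
        y = random.uniform(0, 1)
        if math.sqrt(x * x + y * y) < 1:
            count += 1
    return count

if __name__ == "__main__":
    N = 1_000_000
    T = 8
    with concurrent.futures.ProcessPoolExecutor(max_workers=T) as executor:
        futures = [executor.submit(count_inside, N // T) for _ in range(T)]
        total = sum(f.result() for f in concurrent.futures.as_completed(futures))
    print(f"Estimate for PI = {4 * total / N}")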

problem parallelizing dask code on a single machine

Parallelizing with dask is slower than sequential code.
I have nested for loops which I am trying to parallelize on a local cluster but can't find the right way.
I want to parallelize the inner loop.
I have 2 big numpy matrices which I am trying to iterate over and perform a mathematical calculation on a subset of the matrices.
dimensions:
data_mat.shape = (38, 243863)
indices_mat.shape = (243863, 27)
idxX.shape = (19,)
idxY.shape = (19,)
seq_code:
start = datetime.datetime.now()
for i in range(num+1):
    if i == 0:
        labels = np.array(true_labels)
    else:
        labels = label_mat[i]
    idxX = list(np.where(labels == 1))
    idxY = list(np.where(labels == 2))
    ansColumn = []
    for j in range(indices.shape[0]):
        list_of_indices = [[i] for i in indices_slice]
        dataX = (data_mat[idxX, list_of_indices]).T
        dataY = (data_mat[idxY, list_of_indices]).T
        ansColumn.append(calc_func(dataX, dataY))
    if i == 0:
        ansMat = ansColumn
    else:
        ansMat = np.c_[ansMat, ansColumn]
end = datetime.datetime.now()
print(end - start)
parallel code:
start = datetime.datetime.now()
cluster = LocalCluster(n_workers=4, processes=False)
client = Client(cluster)
for i in range(num+1):
    if i == 0:
        labels = np.array(true_labels)
    else:
        labels = label_mat[i]
    idxX = list(np.where(labels == 1))
    idxY = list(np.where(labels == 2))
    [big_future] = client.scatter([data_mat], broadcast=True)
    [idx_b] = client.scatter([idxX], broadcast=True)
    [idy_b] = client.scatter([idxY], broadcast=True)
    futures = [client.submit(prep_calc_func, idx_b, idy_b, indices[j, :], big_future) for j in range(indices.shape[0])]
    ansColumn = []
    for fut in dask.distributed.client.as_completed(futures):
        ansColumn.append(fut.result())
    if i == 0:
        ansMat = ansColumn
    else:
        ansMat = np.c_[ansMat, ansColumn]
end = datetime.datetime.now()
print(end - start)
helper function:
def prep_calc_func(idxX, idxY, subset_of_indices, data_mat):
    list_of_indices = [[i] for i in indices_slice]
    dataX = (data_mat[idxX, subset_of_indices]).T
    dataY = (data_mat[idxY, subset_of_indices]).T
    ret_val = calc_func(dataX, dataY)
    return ret_val
local machine: MacBook Pro (Retina, 13-inch, Mid 2014)
Processor: 2.6 GHz Intel Core i5
hw.physicalcpu: 2
hw.logicalcpu: 4
Memory: 8 GB 1600 MHz DDR3
When I execute the sequential code it takes 1:52 min to complete (less than 2 minutes), but when I try the parallel code it takes well over 15 min (no matter which method I use: compute, result and client.submit, or dask delayed).
(I prefer to use the dask.distributed package because the next phase may use remote clusters too.)
Any idea what I am doing wrong?
There are many reasons why something can be slow. There might be a lot of communication. Your tasks might be too small (recall that Dask's overhead is around 1ms per task), or something else entirely. For more information on understanding performance in Dask I recommend the following documents:
https://docs.dask.org/en/latest/delayed-best-practices.html
https://docs.dask.org/en/latest/understanding-performance.html
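As a hedged illustration of the "tasks might be too small" point above (my sketch, not tested against the original data): instead of submitting one task per row of indices, each task can process a whole block of rows, which amortizes the roughly 1 ms per-task overhead. prep_calc_block is an assumed helper wrapping the poster's prep_calc_func; data_mat, indices, idxX, idxY and prep_calc_func are assumed to exist as in the question, and only one iteration of the outer loop is shown.
from dask.distributed import Client, LocalCluster

def prep_calc_block(idxX, idxY, indices_block, data_mat):
    # process a whole block of index rows in one task to amortize scheduler overhead
    return [prep_calc_func(idxX, idxY, row, data_mat) for row in indices_block]

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=4, processes=False)
    client = Client(cluster)

    # scatter the big array once; small arguments can just ride along with each task
    [data_f] = client.scatter([data_mat], broadcast=True)

    block_size = 2000  # tune so each task takes well over a few milliseconds
    blocks = [indices[j:j + block_size, :] for j in range(0, indices.shape[0], block_size)]
    futures = [client.submit(prep_calc_block, idxX, idxY, block, data_f) for block in blocks]

    # gather preserves submission order, so the column comes back in row order
    ansColumn = [val for block_result in client.gather(futures) for val in block_result]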

Vectorizing numpy Multiple Condition Nested Loops

While attempting to implement Automatic Peak Detection in Noisy Periodic and Quasi-Periodic Signals (Felix Scholkmann, Jens Boss and Martin Wolf) in Python, I've hit a stumbling block in the implementation.
Upon attempting to optimise, I've noticed that the nested for loops are creating a bottleneck in processing time (taking 115394 ms on average to complete).
Is there a more efficient means of constructing the nested for loop?
N.B.:
The parameter signal is a list of values which the algorithm processes, of the form
-48701.0
-20914.0
-1757.0
-49278.0
-106781.0
-88139.0
-13587.0
28071.0
11880.0
-13375.0
-18056.0
-15248.0
-12476.0
-9832.0
-26365.0
-65734.0
-81657.0
-41566.0
6382.0
872.0
-30666.0
-20261.0
17543.0
6278.0
...
The list is 32768 lines long.
The function returns the indices of the detected peaks, which are processed in another function.
import math
import time
from operator import itemgetter

import numpy as np

def ampd(signal):
    s_time = range(1, len(signal)+1)
    [fitPolynomial, fitError] = np.polyfit(s_time, signal, 1)
    fitSignal = np.polyval([fitPolynomial, fitError], s_time)
    dtrSignal = signal - fitSignal

    N = len(dtrSignal)
    L = math.ceil(N/2.0)-1

    creation_start = time.time()
    np.random.seed(1969)
    LSM = np.random.uniform(0, 2, size=(L, N))
    creation_elapsedTime = time.time() - creation_start
    print('LSM created in %s ms' % int(creation_elapsedTime * 1000))

    loop_start = time.time()
    for k in range(1, L):
        for i in range(k+2, N-k+1):
            if signal[i-1] > signal[i-k-1] and signal[i-1] > signal[i+k-1]:
                LSM[k, i] = 0
    loop_elapsedTime = time.time() - loop_start
    print('Loop completed in %s ms' % int(loop_elapsedTime * 1000))

    G = np.sum(LSM, axis=1)
    l = min(enumerate(G), key=itemgetter(1))[0]

    MLSM = LSM[0:l]
    S = np.std(MLSM, ddof=1)
    found_indices = np.where(MLSM == ((S-1) == 0))
    del LSM
    del MLSM
    return found_indices[1]
Here is a solution which uses only one loop
for k in range(1, L):
    mat = 1 - ((signal[k+1:N-k] > signal[1:N-2*k]) & (signal[k+1:N-k] > signal[2*k+1:N]))
    LSM[k, k+2:N-k+1] *= mat
It's faster and seems to give the same solutions. You compare slices (as suggested by Ami Tavory) and combine the comparisons with &, which gives a True/False array; the 1 - ... operation turns it into ones and zeros, the zeros corresponding to where the conditions are met. Lastly, you multiply the row by the result.
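A minimal, hedged check of that slice trick on a tiny array (this is my own demo, not part of the answer; signal here is just a small NumPy array, and the masks of ones stand in for a row of LSM):
import numpy as np

signal = np.array([0.0, 3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0])
N = len(signal)
k = 1

# original inner loop for a single k
loop_mask = np.ones(N, dtype=int)
for i in range(k + 2, N - k + 1):
    if signal[i - 1] > signal[i - k - 1] and signal[i - 1] > signal[i + k - 1]:
        loop_mask[i] = 0

# sliced version for the same k
sliced = 1 - ((signal[k + 1:N - k] > signal[1:N - 2 * k]) &
              (signal[k + 1:N - k] > signal[2 * k + 1:N]))
vec_mask = np.ones(N, dtype=int)
vec_mask[k + 2:N - k + 1] *= sliced

print(np.array_equal(loop_mask, vec_mask))  # expected: True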

comparing large vectors in python

I have two large vectors (~133000 values) of different lengths. They are each sorted from small to large values. I want to find values that are similar within a given tolerance. This is my solution, but it is very slow. Is there a way to speed this up?
import numpy as np

for lv in range(np.size(vector1)):
    for lv_2 in range(np.size(vector2)):
        if np.abs(vector1[lv_2]-vector2[lv]) < .02:
            print(vector1[lv_2], vector2[lv], lv, lv_2)
            break
Your algorithm is far from optimal. You compare way too many values. Assume you are at a certain position in vector1 and the current value in vector2 is already more than 0.02 bigger. Why would you compare against the rest of vector2?
Start with something like
pos1 = 0
pos2 = 0
Now compare the values at those positions in your vectors. If the difference is too big, move the position of the smaller one forward and check again. Continue until you reach the end of one vector.
I haven't tested it, but the following should work. The idea is to exploit the fact that the vectors are sorted:
lv_1, lv_2 = 0, 0
while lv_1 < len(vector1) and lv_2 < len(vector2):
    if np.abs(vector1[lv_1] - vector2[lv_2]) < .02:
        print(vector1[lv_1], vector2[lv_2], lv_1, lv_2)
        lv_1 += 1
        lv_2 += 1
    elif vector1[lv_1] < vector2[lv_2]:
        lv_1 += 1
    else:
        lv_2 += 1
The following code gives a nice increase in performance that depends upon how dense the numbers are. Using a set of 1000 random numbers, sampled uniformly between 0 and 100, it runs about 30 times faster than your implementation.
pos1_start = 0
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
The timing:
time new method: 0.112464904785
time old method: 3.59720897675
Which is produced by the following script:
import random
import numpy as np
import time

# initialize the vectors to be compared
vector1 = [random.uniform(0, 40) for i in range(1000)]
vector2 = [random.uniform(0, 40) for i in range(1000)]
vector1.sort()
vector2.sort()

# the arrays that will contain the results for the first method
results1 = []
# the arrays that will contain the results for the second method
results2 = []

pos1_start = 0
t_start = time.time()
for i in range(np.size(vector1)):
    for j in range(pos1_start, np.size(vector2)):
        if np.abs(vector1[i] - vector2[j]) < .02:
            results1 += [(vector1[i], vector2[j], i, j)]
        else:
            if vector2[j] < vector1[i]:
                pos1_start += 1
            else:
                break
t1 = time.time() - t_start
print "time new method:", t1

t = time.time()
for lv1 in range(np.size(vector1)):
    for lv2 in range(np.size(vector2)):
        if np.abs(vector1[lv1]-vector2[lv2]) < .02:
            results2 += [(vector1[lv1], vector2[lv2], lv1, lv2)]
t2 = time.time() - t
print "time old method:", t2

# sort the results
results1.sort()
results2.sort()
print np.allclose(results1, results2)
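For completeness, a different technique (my hedged suggestion, not from the answers above): since both vectors are sorted, np.searchsorted can find, for every element of one vector, the window of the other vector that lies within the tolerance, with no Python-level scan of the second vector. The boundary here is inclusive of exactly-tolerance differences, which rarely matters for random floats.
import numpy as np

rng = np.random.default_rng(0)
vector1 = np.sort(rng.uniform(0, 40, 1000))
vector2 = np.sort(rng.uniform(0, 40, 1000))
tol = 0.02

# for each value v in vector1, find the indices of vector2 inside [v - tol, v + tol]
lo = np.searchsorted(vector2, vector1 - tol, side="left")
hi = np.searchsorted(vector2, vector1 + tol, side="right")

pairs = [(vector1[i], vector2[j], i, j)
         for i in range(len(vector1))
         for j in range(lo[i], hi[i])]
print(len(pairs))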
