I'm still an amateur when it comes to thinking about how to optimize. I have this section of code that takes in a list of found peaks and finds where these peaks, +/- some value, are located in a multidimensional array. It then adds 1 at the matching indices of a zeros array. The code works well, but it takes a long time to execute. For instance, it takes close to 45 min to run if ind has 270 values and refVals has a shape of (3050,3130,80). I understand that it's a lot of data to churn through, but is there a more efficient way of going about this?
maskData = np.zeros_like(refVals).astype(np.int16)
for peak in ind:
    tmpArr = np.ma.masked_outside(refVals,x[peak]-2,x[peak]+2).astype(np.int16)
    maskData[tmpArr.mask == False ] += 1
    tmpArr = None
maskData = np.sum(maskData,axis=2)
Approach #1 : Memory permitting, here's a vectorized approach using broadcasting -
# Create +/-2 limits using ind
r = x[ind[:,None]] + [-2,2]
# Use limits to get inside matches and sum over the iterative and last dim
mask = (refVals >= r[:,None,None,None,0]) & (refVals <= r[:,None,None,None,1])
out = mask.sum(axis=(0,3))
Approach #2 : If the previous approach runs out of memory, we could use a loop with plain NumPy boolean arrays, which should be more efficient than masked arrays. Also, we would perform one more level of sum-reduction inside the loop, so that we carry less data across iterations. Thus, the alternative implementation would look something like this -
out = np.zeros(refVals.shape[:2]).astype(np.int16)
x_ind = x[ind]
for i in x_ind:
    out += ((refVals >= i-2) & (refVals <= i+2)).sum(-1)
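Approach #3 : Alternatively, we could replace that limit-based comparison with np.isclose in approach #2. Note that np.isclose adds a relative tolerance (rtol, 1e-05 by default) on top of atol, so set rtol=0 to keep the same +/-2 window. Thus, the only step inside the loop would become -
out += np.isclose(refVals,i,rtol=0,atol=2).sum(-1)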
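As a sanity check, here is a minimal sketch that runs the original masked-array loop, approach #1 and approach #2 on a small random array and compares the results. The array sizes, the x axis and the ind values are made up for illustration and are not from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
refVals = rng.uniform(0, 100, size=(6, 5, 4))  # small stand-in for the (3050, 3130, 80) array
x = np.linspace(0, 100, 50)                    # hypothetical peak-position axis
ind = np.array([3, 10, 25, 40])                # hypothetical peak indices into x

# Reference: the masked-array loop from the question
maskData = np.zeros(refVals.shape, dtype=np.int16)
for peak in ind:
    tmpArr = np.ma.masked_outside(refVals, x[peak] - 2, x[peak] + 2)
    maskData[~np.ma.getmaskarray(tmpArr)] += 1
expected = maskData.sum(axis=2)

# Approach #1: one big broadcasted comparison, shape (len(ind), 6, 5, 4)
r = x[ind[:, None]] + [-2, 2]
mask = (refVals >= r[:, None, None, None, 0]) & (refVals <= r[:, None, None, None, 1])
out1 = mask.sum(axis=(0, 3))

# Approach #2: loop over peaks, reduce the last axis inside the loop
out2 = np.zeros(refVals.shape[:2], dtype=np.int16)
for i in x[ind]:
    out2 += ((refVals >= i - 2) & (refVals <= i + 2)).sum(-1)

print(np.array_equal(out1, expected), np.array_equal(out2, expected))
```

Approach #1 needs a temporary boolean array of len(ind) times the size of refVals, which is why approach #2 is the safer choice for the full-size data.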
Related
I need to iterate over a lot of itemsets and need to remove the outlier. As a threshold, I simply use the standard val > 3* σ. Currently, I have the following solution:
def remove_outlier(data):
    data_t = np.array(data).T.tolist()
    for i, ele in enumerate(data_t):
        temp = []
        for val in ele:
            if (val < (3 * np.std(ele)) + np.mean(ele)) and (val > (np.mean(ele) - 3 * np.std(ele))):
                temp.append(val)
        data_t[i] = np.asarray(temp)
    data = np.asarray(data_t).T
    return data
I'm looking for a faster solution, because it takes up to 7 seconds per dataset (foreseeable for a double for-loop).
I've come across scipy's z-score method and since it also supports the axis=1 argument, it seems more valuable and faster than my solution. Is there a shortcut for removing the values with outlying z-scores from my dataset?
I played around with numpy.where(), but it returns only certain values if compared above a threshold.
The shape of the data is usually around 1000x8, but can also be transposed without any problem.
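One possible vectorized version is sketched below. It computes the z-scores per column with plain NumPy and keeps the array rectangular by dropping every row that contains at least one outlier. That row-dropping policy and the function name remove_outlier_rows are assumptions made for this sketch; the original loop removes individual values per column instead, which produces columns of different lengths.

```python
import numpy as np

def remove_outlier_rows(data, n_sigma=3.0):
    """Drop rows holding any value n_sigma or more std devs from its column mean."""
    data = np.asarray(data, dtype=float)
    mean = data.mean(axis=0)
    std = data.std(axis=0)                    # population std, same as np.std in the loop
    safe_std = np.where(std == 0, 1.0, std)   # avoid division by zero for constant columns
    z = np.abs(data - mean) / safe_std
    keep = (z < n_sigma).all(axis=1)          # True for rows without any outlier
    return data[keep]

# Example with the shape mentioned in the question
rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 8))
data[5, 2] = 50.0                             # plant one obvious outlier
cleaned = remove_outlier_rows(data)
print(data.shape, cleaned.shape)
```

The per-column means and standard deviations are computed once instead of once per value, which is where most of the time in the double loop goes.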
Task
Given a numpy or pytorch matrix, find the indices of cells that have values that are larger than a given threshold.
My implementation
# abs_cosine is the matrix
# sim_vec is the wanted result
sim_vec = []
for m in range(abs_cosine.shape[0]):
    for n in range(abs_cosine.shape[1]):
        # exclude diagonal cells
        if m != n and abs_cosine[m][n] >= threshold:
            sim_vec.append((m, n))
Concerns
Speed. All other computations are built on PyTorch; using NumPy is already a compromise, because it moves computations from GPU to CPU. Pure Python for loops make the whole process even worse (already 5 times slower for a small data set). I was wondering if we can move the whole computation to NumPy (or PyTorch) without invoking any for loops?
An improvement I can think of (but got stuck...)
bool_cosine = abs_cosine > threshold
which returns a boolean matrix of True and False. But I cannot find a way to quickly retrieve the indices of the True cells.
The following is for PyTorch (fully on GPU)
# abs_cosine should be a Tensor of shape (m, m)
mask = torch.ones(abs_cosine.size()[0], device=abs_cosine.device)
mask = 1 - mask.diag()
sim_vec = torch.nonzero((abs_cosine >= threshold)*mask)
# sim_vec is a tensor of shape (?, 2) where the first column is the row index and second is the column index
The following works in numpy
mask = 1 - np.diag(np.ones(abs_cosine.shape[0]))
sim_vec = np.nonzero((abs_cosine >= threshold)*mask)
# sim_vec is a 2-array tuple where the first array is the row index and the second array is column index
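If a list of (row, column) pairs is preferred over the tuple of index arrays, a minimal NumPy sketch could look like the following. The helper name indices_above_threshold and the 3x3 example matrix are made up for illustration.

```python
import numpy as np

def indices_above_threshold(abs_cosine, threshold):
    hits = abs_cosine >= threshold   # boolean matrix, a new array
    np.fill_diagonal(hits, False)    # exclude diagonal cells in place
    return np.argwhere(hits)         # shape (k, 2): one (row, col) pair per hit

abs_cosine = np.array([[1.0, 0.3, 0.1],
                       [0.3, 1.0, 0.5],
                       [0.1, 0.5, 1.0]])
pairs = indices_above_threshold(abs_cosine, 0.2)
print(pairs.tolist())  # [[0, 1], [1, 0], [1, 2], [2, 1]]
```

Clearing the diagonal on the boolean matrix avoids building and multiplying by a separate float mask of the same size.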
This is about twice as fast as np.where
import numba as nb
import numpy as np
@nb.njit(fastmath=True)
def get_threshold(abs_cosine,threshold):
    idx=0
    sim_vec=np.empty((abs_cosine.shape[0]*abs_cosine.shape[1],2),dtype=np.uint32)
    for m in range(abs_cosine.shape[0]):
        for n in range(abs_cosine.shape[1]):
            # exclude diagonal cells
            if m != n and abs_cosine[m,n] >= threshold:
                sim_vec[idx,0]=m
                sim_vec[idx,1]=n
                idx+=1
    return sim_vec[0:idx,:]
The first call takes about 0.2s longer (compilation overhead). If the array is on the GPU, there may also be a way to do the whole computation on the GPU.
Nevertheless I am not really satisfied with the performance, since a simple boolean operation is about 5 times faster than the solution shown above and 10 times faster than np.where. If the order of the indices doesn't matter this problem can also be parallelized.
I have an image of the sun. I found its center and radius, and now I want to process pixels differently depending on whether they are inside or outside the disk. The ideal solution would be to interpolate the parameters of the processing function, in order to transition smoothly from disk to background.
Here is what I'm doing now:
for index,value in np.ndenumerate(sun_img):
    if distance.euclidean(index,center) > radius:
        sun_img[index] = processing_function(index,value)
Like this it works but it takes forever to compute the image. I'm sure there is a more efficient way to do that. How would you solve this?
Image shape is around (1000, 1000)
Processing_function is basically not doing anything right now: value += 1
The function should be something like a non-linear "step function" with value 0.0 up to the radius and 1.0 from 5 px beyond it, something like: _______/''''''''''''''''''''' multiplied by the value of the pixel. The slope should start at the radius. I want to do this in order to enhance the protuberances.
Here's a vectorized way leveraging NumPy broadcasting -
m,n = sun_img.shape
I,J = np.ogrid[:m,:n]
sq_dist = (I - center[0])**2 + (J - center[1])**2
valid_mask = sq_dist > radius**2
Now, for a processing_function that just adds 1 to the valid places, defined by the IF-conditional, do -
sun_img[valid_mask] += 1
If you need to implement a custom operation with processing_function that needs those row, column indices, use np.where to get those indices and then iterate through the valid elements, like so -
r,c = np.where(valid_mask)
for index in zip(r,c):
    sun_img[index] = processing_function(index,sun_img[index])
If you have a lot of such valid places, then computing r,c might make things slow. In that case, directly use the mask, like so -
for index,value in np.ndenumerate(sun_img):
    if valid_mask[index]:
        sun_img[index] = processing_function(index,value)
Compared to the original code, the benefit is that we have the conditional values pre-computed before going into the loop. The best way again would be to vectorize processing_function itself so that it works on a bigger chunk of data, but that would depend on its implementation.
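For the smooth transition described in the question (0.0 up to the radius, 1.0 from 5 px beyond it, multiplied by the pixel value), a fully vectorized weight could be sketched as below. A linear ramp is assumed here for simplicity, whereas the question asks for a non-linear one. The function name ramp_weight, the width parameter and the constant test image are made up for illustration.

```python
import numpy as np

def ramp_weight(shape, center, radius, width=5.0):
    """Weight that is 0.0 inside the disk, rises linearly over `width` px, then stays at 1.0."""
    I, J = np.ogrid[:shape[0], :shape[1]]
    dist = np.sqrt((I - center[0])**2 + (J - center[1])**2)
    return np.clip((dist - radius) / width, 0.0, 1.0)

sun_img = np.full((1000, 1000), 10.0)   # stand-in for the real image
center, radius = (500, 500), 300
w = ramp_weight(sun_img.shape, center, radius)
processed = sun_img * w                  # disk suppressed, background kept
print(w[500, 500], w[500, 802], w[500, 805])
```

Any other monotonic profile can be obtained by applying a function to the clipped ramp, for example w**2 for a slower start.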
1. Consider the following traversal of a numpy.ndarray
for ii in xrange(0,(nxTes-2)):
    if ( (xCom-dtaCri-xcTes[ii]) * (xCom-dtaCri-xcTes[ii+1]) ) <= 0.0:
        nxL=ii
    if ( (xCom+dtaCri-xcTes[ii]) * (xCom+dtaCri-xcTes[ii+1]) ) <= 0.0:
        nxR=ii+1
2. xCom, dtaCri and xcTes are of type() numpy.float64, float and numpy.ndarray respectively
3. The full block above is repeated for nyTes and nzTes i.e. a total of three blocks are done in the main algorithm loop. The goal is to create a region of interest with window size dtaCri and center at comparison point xCom using positional data from xcTes
4. The above code is more or less a straight port from Matlab wherein the same block executes at somewhere around three to four times the speed.
5. Question: Is it possible to optimize the block above with respect to execution time and if so how?
6. So far I have tried some minor tweaks such as altering data types and using range() instead of xrange() from which I saw no noticeable changes in performance.
Pre-compute those boolean conditional outputs before going into the loop in a vectorized manner and making use of slicing, which are just views into the input array, like so -
parte1 = ( (xCom-dtaCri-xcTes[:nxTes-2]) * (xCom-dtaCri-xcTes[1:nxTes-1]) ) <=0.0
parte2 = ( (xCom+dtaCri-xcTes[:nxTes-2]) * (xCom+dtaCri-xcTes[1:nxTes-1]) ) <=0.0
We can see that a few computations are repeated, so we can reuse them -
p = xCom-xcTes[:nxTes-1]
p0 = p - dtaCri
p1 = p + dtaCri
parte1 = p0[:-1]*p0[1:] <= 0.0
parte2 = p1[:-1]*p1[1:] <= 0.0
Then, just use those bools in the loop -
for ii in xrange(0,(nxTes-2)):
    if parte1[ii]:
        nxL=ii
    if parte2[ii]:
        nxR=ii+1
The idea is to do minimal work inside the loop with focus on performance.
I am assuming you have more work going in the loop that is using nxL and nxR, because otherwise we are overwriting values into those two variables.
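If, on the other hand, only the final values of nxL and nxR matter, the loop can be dropped entirely, because the last True entry of each boolean array decides the result. A minimal sketch under that assumption follows. The function name window_bounds and the return value None for "no match" are made up here; in the original loop the variables would simply keep their previous values.

```python
import numpy as np

def window_bounds(xcTes, nxTes, xCom, dtaCri):
    p = xCom - xcTes[:nxTes - 1]
    p0 = p - dtaCri
    p1 = p + dtaCri
    # Indices where the sign changes between neighbouring grid points
    left_hits = np.flatnonzero(p0[:-1] * p0[1:] <= 0.0)
    right_hits = np.flatnonzero(p1[:-1] * p1[1:] <= 0.0)
    # The loop overwrites nxL / nxR on every hit, so only the last hit survives
    nxL = int(left_hits[-1]) if left_hits.size else None
    nxR = int(right_hits[-1]) + 1 if right_hits.size else None
    return nxL, nxR

xcTes = np.arange(10.0)                     # grid positions 0.0 ... 9.0
print(window_bounds(xcTes, 10, 4.5, 2.0))   # (2, 7)
```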
I am trying to find an efficient code instead of the following piece of code (that is only one part of my code), to increase the speed:
for pr in some_list:
    Tp = T[partition[pr]].sum(0)
    Tpx = np.dot(Tp, xhat)
    hp = h[partition[pr]].sum(0)
    up = (uk[partition[pr][:]].sum(0))/len(partition[pr])
    hpu = hpu + np.dot(hp.T, up)
    Tpu = Tpu + np.dot(Tp.T, up)
I have at least two more similar blocks of code. As you can see, I used fancy indexing three times (really couldn't find another way). In my algorithm, I need this part to be done very quickly, but it's not happening now. I will really appreciate any suggestion.
Thank you all.
Best,
If your partitions are few and have many elements each, you should consider swapping around the indices of your objects. Summing an array of shape (30,1000) along its second dimension should be faster than summing an array of shape (1000,30) along its first dimension, since in the former case you are always summing contiguous blocks of memory (i.e. arr[k,:] for each k) for each remaining index. So if you put the summation index last (and get rid of any trailing singleton dimension while you're at it), you might get a speed-up.
As hpaulj noted in a comment, it's not clear how your loop could be vectorized. However, since it's performance-critical, you could still try vectorizing some of the work.
I suggest that you store hp, up and Tp for each partition (following pre-allocation), then perform the scalar/matrix products in a single vectorized step. Also note that Tpx is unused in your example, so I omitted it here (whatever you're doing with it, you can do it similarly to the other examples):
part_len = len(some_list) # number of partitions, N
Tpshape = (part_len,) + T.shape[1:] # (N,30,100) if T was (1000,30,100)
hpshape = (part_len,) + h.shape[1:] # (N,30,1) if h was (1000,30,1)
upshape = (part_len,) + uk.shape[1:] # (N,30,1) if uk was (1000,30,1)
Tp = np.zeros(Tpshape)
hp = np.zeros(hpshape)
up = np.zeros(upshape)
for ipr,pr in enumerate(some_list):
    Tp[ipr,:,:] = T[partition[pr]].sum(0)
    hp[ipr,:,:] = h[partition[pr]].sum(0)
    up[ipr,:,:] = uk[partition[pr]].sum(0)/len(partition[pr])
# compute vectorized dot products:
#Tpx unclear in original, omitted
# sum over second index (dot), sum over first index (sum in loop)
hpu = np.einsum('abc,abd->cd',hp,up) # shape (1,1)
Tpu = np.einsum('abc,abd->cd',Tp,up) # shape (100,1)
Clearly the key player is numpy.einsum. And of course if hpu and Tpu had some prior values before the loop, you have to increment those values with the results from einsum above.
As for einsum, it performs summations and contractions of arrays of arbitrary dimensions. The pattern appearing above, 'abc,abd->cd', when applied to 3d arrays A and B, will return a 2d array C, with the following definition (math pseudocode):
C(c,d) = sum_a sum_b A(a,b,c)*B(a,b,d)
For a fixed value of the summation index a, what's inside is
sum_b A(a,b,c)*B(a,b,d)
which, if the c and d indices are kept, will be equivalent to np.dot(A(a,:,:).T,B(a,:,:)). Since we're summing these matrices with respect to a too, we end up doing exactly what your loopy version does: adding up each np.dot() contribution to the total sum.
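This equivalence is easy to check numerically. The sketch below uses random arrays with the example shapes from the comments above (N partitions, 30, 100 and 30, 1); the number of partitions and the random data are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, b, c, d = 4, 30, 100, 1
A = rng.normal(size=(N, b, c))   # plays the role of Tp
B = rng.normal(size=(N, b, d))   # plays the role of up

# Loopy version: accumulate one np.dot per partition
loop_sum = np.zeros((c, d))
for a in range(N):
    loop_sum += np.dot(A[a].T, B[a])

# Single einsum call: contract over a and b at once
vectorized = np.einsum('abc,abd->cd', A, B)

print(vectorized.shape, np.allclose(loop_sum, vectorized))
```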