I am trying to iterate over numpy arrays and generate an output, which has conditions similar to that described below:
min1 = 3
max1 = 1
a1 = np.array([1, 2, 5, 3, 4])
a2 = np.array([5, 2, 6, 2, 1])
output = np.zeros(5)
for i in range(0, 5):
    if((a1[i] - a2[i]) > min1):
        output[i] = 3 * (a1[i] - a2[i])
    if((a1[i] - a2[i]) < max1):
        output = 5 * (a1[i] - a2[i])
I need to optimize the above code so that I make the best use of numpy's functionality and avoid using a loop. How should I do it?
While functions like select and where can condense the code, I think it's a good idea to know how to do this with basic boolean masking. It's applicable in many cases, and nearly always as fast.
Calculate the difference which is used several times:
In [432]: diff = a1-a2
In [433]: diff
Out[433]: array([-4, 0, -1, 1, 3])
In [435]: output = np.zeros_like(a1)
find those cases where it meets the first condition, and set the corresponding elements of output:
In [436]: mask1 = diff>min1
In [437]: output[mask1] = 3*diff[mask1]
repeat for the second condition:
In [438]: mask2 = diff<max1
In [439]: output[mask2] = 5*diff[mask2]
and again if there are more conditions.
In [440]: output
Out[440]: array([-20, 0, -5, 0, 0])
In this example, the -4, 0, and -1 met condition 2 (5*0 is still 0, so that element stays 0), and none met condition 1.
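Putting those steps together, a minimal self-contained version of this masking approach (using the sample data from the question) would look like this:

import numpy as np

min1, max1 = 3, 1
a1 = np.array([1, 2, 5, 3, 4])
a2 = np.array([5, 2, 6, 2, 1])

diff = a1 - a2                  # compute the difference once
output = np.zeros_like(a1)      # integer output, same shape as a1

mask1 = diff > min1             # condition 1
output[mask1] = 3 * diff[mask1]

mask2 = diff < max1             # condition 2
output[mask2] = 5 * diff[mask2]

print(output)                   # [-20   0  -5   0   0]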
Welcome to SO! First, a tip for questions:
Even your current loopy code doesn't work: in the second condition you assign to output instead of output[i]. Try to make sure that if you're asking for a code refactor, your original code works (and outside of numpy-tagged questions, asking for code refactoring on SO will normally get you downvoted).
You're going to want a nested np.where call like this:
output = np.where((a1 - a2) > min1, 3 * (a1 - a2), (np.where((a1 - a2) < max1, 5 * (a1 - a2), 0)))
This way you don't need to initialize output, and no more loopy code.
If you have lots of conditions, you can also use np.select
d = a1 - a2
condlist = [d > min1, d < max1]
choicelist = [3 * d, 5 * d]
output = np.select(condlist, choicelist)
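For reference, on the question's sample data both the nested np.where and the np.select versions give the same result as the boolean-masking answer above; a quick check (assuming the arrays from the question):

import numpy as np

min1, max1 = 3, 1
a1 = np.array([1, 2, 5, 3, 4])
a2 = np.array([5, 2, 6, 2, 1])
d = a1 - a2

out_where = np.where(d > min1, 3 * d, np.where(d < max1, 5 * d, 0))
out_select = np.select([d > min1, d < max1], [3 * d, 5 * d])

print(out_where)   # [-20   0  -5   0   0]
print(out_select)  # [-20   0  -5   0   0]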
Your question is a bit vague. Here's a possible solution depending on what you really want.
This solution will return two arrays containing the values that satisfy the min and max conditions you indicated. It first performs the subtraction on your two arrays; then, instead of an if, it uses numpy's where function, which does the selection for you without explicitly iterating through the whole array.
import numpy as np
min1=3
max1=1
a1=np.array([1,2,5,3,4])
a2=np.array([5,2,6,2,1])
array_op = a1-a2
min_output = 3*(array_op[np.where((array_op)>min1)])
max_output = 5*(array_op[np.where((array_op)<max1)])
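Note that, unlike the other answers, this returns only the filtered values rather than a full-length output array. On the sample data it gives:

print(min_output)  # [] -- no difference exceeds min1 = 3
print(max_output)  # [-20   0  -5] -- the differences below max1 = 1, times 5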
The solution is
import numpy as np
min1=3
max1=1
a1=np.array([1,2,5,3,4])
a2=np.array([5,2,6,2,1])
output=np.zeros(5)
diff = a1-a2
output[np.where(diff < max1)] = diff[np.where(diff < max1)]*5
output[np.where(diff > min1)] = diff[np.where(diff > min1)]*3
In the original loop, a1[i]-a2[i] is computed 4 times per iteration.
Saving it in a variable avoids 3*5 = 15 redundant subtractions.
Related
I have a very big dataframe with this structure:
Timestamp Val1
Here you can see a real sample:
Timestamp Temp
0 1622471518.92911 36.443
1 1622471525.034114 36.445
2 1622471531.148139 37.447
3 1622471537.284337 36.449
4 1622471543.622588 43.345
5 1622471549.734765 36.451
6 1622471556.2518 36.454
7 1622471562.361368 41.461
8 1622471568.472718 42.468
9 1622471574.826475 36.470
What I want to do is compare the Temp column with itself: if one value is higher than another by "X" (for example 4) and the time between them is lower than "Y" (for example 180 min), then I save some data about them.
Right now I'm using two nested for loops, but this takes too much time, and pandas usually has a way to avoid this.
This is my code:
import datetime as dt

# df is the dataframe shown above
cap_time, maxim = 180, 4
cap_time = cap_time * 60
temps = df['Temperature'].values
times = df['Timestamp'].values
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        print(i, j, len(temps))
        if float(temps[j]) > float(temps[i])*maxim:
            timeIn = dt.datetime.fromtimestamp(float(times[i]))
            timeOut = dt.datetime.fromtimestamp(float(times[j]))
            diff = timeOut - timeIn
            tdiff = diff.total_seconds()
            if tdiff > cap_time:
                break
            else:
                res = [temps[i], temps[j], times[i], times[j], tdiff/60, cap_time/60, maxim]
                results.append(res)
                break
# Then I save it in a dataframe and do other things
Can pandas help me achieve my goal and reduce the execution time? I found DataFrame.diff() but I'm not sure it's what I want (or I don't know how to use it).
Thank you very much.
Short of avoiding the nested for loops, you can already speed things up by avoiding all unnecessary calculations and conversions within the loops. In particular, you can use NumPy broadcasting to define a Boolean array beforehand, in which you can look up whether the condition is met:
import numpy as np
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results = []
for i in range(len(temps)):
    for j in range(i+1, len(temps)):
        if condition[i, j]:
            results.append([temps[i], temps[j],
                            times[i], times[j],
                            times_diff[i, j]])
results
[[36.443, 43.345, 1622471518.92911, 1622471543.622588, 24.693477869033813],
...
[36.454, 42.468, 1622471556.2518, 1622471568.472718, 12.22091794013977]]
To avoid the loops altogether, you could define a 3-dimensional full results array and then use the condition array as a Boolean mask to filter out the results you want:
import numpy as np
n = len(temps)
temps_diff = temps - temps[:, None]
times_diff = times - times[:, None]
condition = np.logical_and(temps_diff > maxim,
                           times_diff < cap_time)
results_full = np.stack([np.repeat(temps[:, None], n, axis=1),
                         np.tile(temps, (n, 1)),
                         np.repeat(times[:, None], n, axis=1),
                         np.tile(times, (n, 1)),
                         times_diff])
results = results_full[np.stack(results_full.shape[0] * [condition])]
results.reshape((5, -1)).T
array([[ 3.64430000e+01,  4.33450000e+01,  1.62247152e+09,
         1.62247154e+09,  2.46934779e+01],
       ...
       [ 3.64540000e+01,  4.24680000e+01,  1.62247156e+09,
         1.62247157e+09,  1.22209179e+01],
       ...
      ])
As you can see, the resulting numbers are the same as above, although this time the results array will contain more rows, because we didn't use the shortcut of starting the inner loop at i+1.
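If you do want the fully vectorized version to mimic that i+1 shortcut, one possible tweak (my addition, not part of the original answer, and it still ignores the early break in the loop) is to restrict the condition to the upper triangle, so that only pairs with j > i survive:

# keep only pairs (i, j) with j > i, mirroring `for j in range(i+1, ...)`
upper = np.arange(n) > np.arange(n)[:, None]
condition_upper = condition & upper

results = results_full[np.stack(results_full.shape[0] * [condition_upper])]
results.reshape((5, -1)).T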
For a given 2D matrix, np.array([[1,3,1],[2,0,5]]), I need to compute, for each row, the max over all columns excluding the element's own column; the expected result is np.array([[3,1,3],[5,5,2]]). What is the most efficient way to do this?
Currently I implemented it with a loop to exclude its own col index:
n = x.shape[1]
row_max_mat = np.zeros(x.shape)
rng = np.arange(n)
for i in rng:
    row_max_mat[:, i] = np.amax(x[:, rng != i], axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing:
cols = a.shape[1]  # a is the input matrix, e.g. np.array([[1,3,1],[2,0,5]])
mask = ~np.eye(cols, dtype=bool)
a[:,np.where(mask)[1]].reshape((a.shape[0], a.shape[1]-1, -1)).max(1)
Output:
array([[3, 1, 3],
[5, 5, 2]])
You could do this using np.maximum.accumulate. Compute the forward and backward running maximums along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
# [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x with an in-place update). Adding a 3rd and 4th dimension could also work with a mask, but that would require columns² times the matrix's size to process and would likely be slower.
If needed, you can apply the same technique column wise or to both dimensions (by combining row wise and column wise results).
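As a rough sketch of that column-wise variant (my own extension of the answer, simply swapping the axis), the same accumulate trick applied along axis 0 gives, for each element, the max of its column excluding its own row:

import numpy as np

m = np.array([[1, 3, 1],
              [2, 0, 5]])

fmax = np.maximum.accumulate(m, axis=0)                    # running max down each column
bmax = np.maximum.accumulate(m[::-1, :], axis=0)[::-1, :]  # running max up each column

r = np.full(m.shape, np.min(m))
r[:-1, :] = np.maximum(r[:-1, :], bmax[1:, :])
r[1:, :] = np.maximum(r[1:, :], fmax[:-1, :])
print(r)
# [[2 0 5]
#  [1 3 1]]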
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)                  # per-row maximum, as a column vector
b = (((a // row_max)+1)%2)                             # 0 at each row's max position, 1 elsewhere (relies on non-negative entries)
c = b*row_max                                          # every non-max position gets the row's max
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))   # the max position gets the row's second-largest value
c+d # result
Since we are looking to get the max excluding each element's own column, each row of the output is basically filled with that row's max, except at the position of the max itself, which needs the row's second-largest value instead. As such, argpartition seems to fit right in. So, here's one solution with it -
def max_exclude_own_col(m):
    out = np.full(m.shape, m.max(1, keepdims=True))
    sidx = np.argpartition(-m, 2, axis=1)
    R = np.arange(len(sidx))
    s0, s1 = sidx[:,0], sidx[:,1]
    mask = m[R,s0] > m[R,s1]
    L1c, L2c = np.where(mask,s0,s1), np.where(mask,s1,s0)
    out[R,L1c] = m[R,L2c]
    return out
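A quick sanity check on the sample matrix from the question:

import numpy as np

m = np.array([[1, 3, 1],
              [2, 0, 5]])
print(max_exclude_own_col(m))
# [[3 1 3]
#  [5 5 2]]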
Benchmarking
Other working solution(s) for large arrays -
# @Alain T.'s soln
def max_accum(m):
    fmax = np.maximum.accumulate(m,axis=1)
    bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
    r = np.full(m.shape,np.min(m))
    r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
    r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
    return r
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions max_exclude_own_col max_accum
Shape
100000x10 0.017721 0.014580
100000x20 0.028078 0.028124
100000x50 0.056355 0.089285
100000x100 0.103563 0.200085
100000x200 0.188760 0.407956
100000x500 0.439726 0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions max_exclude_own_col Ref:max_accum
Shape
100000x10 0.822783 1.0
100000x20 1.001660 1.0
100000x50 1.584334 1.0
100000x100 1.932017 1.0
100000x200 2.161241 1.0
100000x500 2.220725 1.0
I have two numpy arrays:
A= [ 3.8357 3.2450]
B= [ 5.6132 3.2415 3.6086 3.5666 3.8769 4.3587]
I want to compare A to B and only keep the values in A that are unique, i.e. outside of a +/-0.04 tolerance of every value in B (here, A = [3.8357]).
Any ideas as to how I can do this?
Approach #1
We could use broadcasting -
A[(np.abs(np.subtract.outer(A,B)) > 0.04).all(1)]
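To make the broadcasting explicit: np.subtract.outer(A, B) has shape (len(A), len(B)), and .all(1) keeps only the elements of A whose distance to every element of B exceeds the tolerance. A quick check with the question's arrays:

import numpy as np

A = np.array([3.8357, 3.2450])
B = np.array([5.6132, 3.2415, 3.6086, 3.5666, 3.8769, 4.3587])

dists = np.abs(np.subtract.outer(A, B))  # shape (2, 6): |A[i] - B[j]|
keep = (dists > 0.04).all(1)             # True only where A[i] is far from every B[j]
print(A[keep])                           # [3.8357]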
Approach #2
We could leverage searchsorted to have a generic numpy.isin with tolerance specifier for use in generic problems, like so -
def isin_tolerance(A, B, tol):
    A = np.asarray(A)
    B = np.asarray(B)
    Bs = np.sort(B)  # skip if already sorted
    idx = np.searchsorted(Bs, A)
    linvalid_mask = idx == len(B)
    idx[linvalid_mask] = len(B) - 1
    lval = Bs[idx] - A
    lval[linvalid_mask] *= -1
    rinvalid_mask = idx == 0
    idx1 = idx - 1
    idx1[rinvalid_mask] = 0
    rval = A - Bs[idx1]
    rval[rinvalid_mask] *= -1
    return np.minimum(lval, rval) <= tol
Hence, to solve our case -
out = A[~isin_tolerance(A, B, tol=0.04)]
Sample run -
In [294]: A
Out[294]: array([13.8357, 3.245 , 3.8357])
In [295]: B
Out[295]: array([5.6132, 3.2415, 3.6086, 3.5666, 3.8769, 4.3587])
In [296]: A[~isin_tolerance(A, B, tol=0.04)]
Out[296]: array([13.8357, 3.8357])
Is there a way to get rid of the loop in the code below and replace it with vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
                   [0.35, 0.5],
                   [0.45, 0.6],
                   [0.52, 0.65],
                   [0.6, 0.8],
                   [0.75, 0.92]])
# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6).astype(np.int)
for i in xrange(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i,ind])]
print result
# Result: [35 8 22 8 34 78]
print data[np.arange(6),result]
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
Approach #3 : If you are iterating only a few times (just like the loop of 6 iterations in the sample), you might want to stick to a loop for memory efficiency, but make use of more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in xrange(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i,mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data. Then, use the two range limits for each row to decide on the start and stop indices with help from np.searchsorted. Further, we would use those indices to slice and then get the minimum values. The benefit of slicing that way is that we would be working with views, and as such it would be very efficient, both in memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in xrange(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i,start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on a suggestion by @Daniel F, for the case when we are dealing with ranges that are within the limits of the given data, we could simply use the start indices -
out[i] = sdata[i, start]
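One way to vectorize those start/stop computations (my own sketch, using broadcasted comparisons against the sorted rows rather than an actual vectorized searchsorted) would be:

sdata = np.sort(data, axis=1)

# counts of sorted values strictly below / not above each row's limits
start = (sdata < ranges[:, None, 0]).sum(1)
stop = (sdata <= ranges[:, None, 1]).sum(1)

# assuming every range contains at least one value, the minimum is the first in-range value
out = sdata[np.arange(len(sdata)), start]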
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6)
for i in xrange(6):
    lt = (ranges[i,0] >= data[i]).sum()
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting
I'm writing a few Python lines of code doing the following:
I have two arrays a and b; b contains (non-strictly) increasing integers.
I want to extract from a the values for which the corresponding value of b is a multiple of 20, but I don't want duplicates, in the sense that if b has values ..., 40, 40, 41, ... I only want the value in a corresponding to the first 40, not the second one.
That's why a[b%20==0] does not work.
I've been using:
factors = [20*i for i in xrange(1,int(b[-1]/20 +1))]
sample = numpy.array([a[numpy.nonzero(b==factor)[0][0]] for factor in factors])
but it is both slow and fairly inelegant.
Is there a Pythonista 'cute' way of doing it?
a[(b % 20 == 0) & np.r_[True, np.diff(b) > 0]]
The b % 20 == 0 part gives a boolean mask that selects all the elements of b that are a multiple of 20. The np.r_[True, np.diff(b) > 0] part creates a boolean mask that selects only the elements that differ from the previous element (we explicitly prepend a True, as the first element has no previous element). Combine the masks with & and voila!
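A small runnable demo of how the two masks combine, using a made-up b with the ..., 40, 40, 41, ... pattern from the question (the arrays here are my own example):

import numpy as np

a = np.array([10, 11, 12, 13, 14, 15])
b = np.array([19, 40, 40, 41, 60, 60])

keep = (b % 20 == 0) & np.r_[True, np.diff(b) > 0]
# b % 20 == 0              -> [False  True  True False  True  True]
# np.r_[True, diff(b) > 0] -> [ True  True False  True  True False]
print(a[keep])             # [11 14] -- only the first 40 and the first 60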
Let's say we create a boolean array which marks the unique values in b:
c = np.zeros(b.shape, dtype=bool)
c[np.unique(b, return_index = True)[1]] = True
Now you can do:
a[np.logical_and(b % 20 == 0, c)]
If your b is sorted, using diff should be a bit faster than using unique:
import numpy
a = numpy.random.randint(0, 1001, 1000)
b = numpy.random.randint(0, 1001, 1000)
b.sort()
subset = a[(numpy.diff(b) != 0) * (b[:-1]%20 == 0)]