Computing the conditional column mean of a 2D numpy array - python

I'm new to Python.
I have a 2D np.array (e.g. 50 rows and 12 columns) and I need the mean of the 3rd column for the rows where the 1st column == x and the 9th column == y.
I can't figure out how to do it without using ifs...
Any help would be appreciated.

Let's assume your array is called arr. You want to apply two different filters: the first is 1st column == x, the second is 9th column == y. To begin with, you should create each filter (mask) separately, then decide how to combine them logically to get your expected output.
mask1 = arr[:, 0] == x  # 1st column == x
mask2 = arr[:, 8] == y  # 9th column == y
Now you can use or, and, or any other logical operator to create your final mask; in this case it's and. For that, NumPy provides logical functions.
final_mask = np.logical_and(mask1, mask2)
And finally all you need is to filter your array based on the final_mask and perform the calculations you intended to do:
filtered_3rd_column = arr[final_mask, 2]
_mean = filtered_3rd_column.mean()

You can use np.where():
x = 1
y = 2
a[np.where((a[:, 0] == x) & (a[:, 8] == y)), 2].mean()

I solved the problem as follows (thanks to Kasrâmvd):
mask1 = arr[:, 0] == x  # 1st column == x
mask2 = arr[:, 8] == y  # 9th column == y
final_mask = np.logical_and(mask1, mask2)
filtered_arr = arr[final_mask,:]
mean_3rd_column = filtered_arr[:,2].mean()
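For reference, here is the whole thing as one runnable sketch; the array contents and the values of x and y below are made up for illustration and are not from the original question.
import numpy as np

arr = np.arange(72, dtype=float).reshape(6, 12)   # hypothetical 6 x 12 array
arr[:, 0] = [1, 2, 1, 1, 2, 1]                    # 1st column
arr[:, 8] = [7, 7, 7, 9, 9, 7]                    # 9th column
x, y = 1, 7

mask1 = arr[:, 0] == x          # rows where the 1st column == x
mask2 = arr[:, 8] == y          # rows where the 9th column == y
final_mask = mask1 & mask2      # both conditions must hold

mean_3rd_column = arr[final_mask, 2].mean()
print(mean_3rd_column)          # mean of the 3rd column over the matching rows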

Related

What's a more efficient way to calculate the max of each row in a matrix excluding its own column?

For a given 2D matrix np.array([[1,3,1],[2,0,5]]), if one needs to calculate the max of each row excluding the entry's own column, with expected example return np.array([[3,1,3],[5,5,2]]), what would be the most efficient way to do so?
Currently I implemented it with a loop to exclude its own col index:
n = x.shape[1]
row_max_mat = np.zeros(x.shape)
rng = np.arange(n)
for i in rng:
    row_max_mat[:, i] = np.amax(x[:, rng != i], axis=1)
Is there a faster way to do so?
Similar idea to yours (exclude columns one by one), but with indexing:
cols = a.shape[1]
mask = ~np.eye(cols, dtype=bool)
a[:, np.where(mask)[1]].reshape((a.shape[0], a.shape[1] - 1, -1)).max(1)
Output:
array([[3, 1, 3],
[5, 5, 2]])
You could do this using np.maximum.accumulate. Compute the forward and backward accumulated maxima along the horizontal axis and then combine them with an offset of one:
import numpy as np
m = np.array([[1,3,1],[2,0,5]])
fmax = np.maximum.accumulate(m,axis=1)
bmax = np.maximum.accumulate(m[:,::-1],axis=1)[:,::-1]
r = np.full(m.shape,np.min(m))
r[:,:-1] = np.maximum(r[:,:-1],bmax[:,1:])
r[:,1:] = np.maximum(r[:,1:],fmax[:,:-1])
print(r)
# [[3 1 3]
# [5 5 2]]
This will require 3x the size of your matrix to process (although you could take that down to 2x if you want an in-place update). Adding a 3rd and 4th dimension could also work using a mask, but that would require columns^2 times the matrix's size to process and will likely be slower.
If needed, you can apply the same technique column-wise or to both dimensions (by combining the row-wise and column-wise results).
a = np.array([[1,3,1],[2,0,5]])
row_max = a.max(axis=1).reshape(-1,1)
b = (((a // row_max)+1)%2)
c = b*row_max
d = (a // row_max)*((a*b).max(axis=1).reshape(-1,1))
c+d # result
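One caveat worth noting (my own observation, not part of the answer above): this arithmetic trick appears to rely on the row maximum being unique, since a // row_max marks every position holding the maximum. A quick check with a repeated maximum:
import numpy as np

a = np.array([[3, 3, 1]])                 # hypothetical row with a repeated maximum
row_max = a.max(axis=1).reshape(-1, 1)
b = (((a // row_max) + 1) % 2)
c = b * row_max
d = (a // row_max) * ((a * b).max(axis=1).reshape(-1, 1))
print(c + d)                              # [[1 1 3]], whereas the expected answer is [[3 3 3]]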
Since we are looking to get the max excluding its own column, the output would basically have each row filled with that row's max, except at the max element's position, which needs to be filled with the second-largest value. As such, argpartition seems to fit right in. So, here's one solution with it -
def max_exclude_own_col(m):
    out = np.full(m.shape, m.max(1, keepdims=True))
    sidx = np.argpartition(-m, 2, axis=1)
    R = np.arange(len(sidx))
    s0, s1 = sidx[:, 0], sidx[:, 1]
    mask = m[R, s0] > m[R, s1]
    L1c, L2c = np.where(mask, s0, s1), np.where(mask, s1, s0)
    out[R, L1c] = m[R, L2c]
    return out
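Sample run on the matrix from the question (assuming numpy is imported as np and the function above is defined):
m = np.array([[1, 3, 1], [2, 0, 5]])
print(max_exclude_own_col(m))
# [[3 1 3]
#  [5 5 2]]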
Benchmarking
Other working solution(s) for large arrays -
# @Alain T.'s soln
def max_accum(m):
    fmax = np.maximum.accumulate(m, axis=1)
    bmax = np.maximum.accumulate(m[:, ::-1], axis=1)[:, ::-1]
    r = np.full(m.shape, np.min(m))
    r[:, :-1] = np.maximum(r[:, :-1], bmax[:, 1:])
    r[:, 1:] = np.maximum(r[:, 1:], fmax[:, :-1])
    return r
Using the benchit package (a few benchmarking tools packaged together; disclaimer: I am its author) to benchmark the proposed solutions.
So, we will test out with large arrays of various shapes for timings and speedups -
In [54]: import benchit
In [55]: funcs = [max_exclude_own_col, max_accum]
In [170]: inputs = [np.random.randint(0,100,(100000,n)) for n in [10, 20, 50, 100, 200, 500]]
In [171]: T = benchit.timings(funcs, inputs, indexby='shape')
In [172]: T
Out[172]:
Functions max_exclude_own_col max_accum
Shape
100000x10 0.017721 0.014580
100000x20 0.028078 0.028124
100000x50 0.056355 0.089285
100000x100 0.103563 0.200085
100000x200 0.188760 0.407956
100000x500 0.439726 0.976510
# Speedups with max_exclude_own_col over max_accum
In [173]: T.speedups(ref_func_by_index=1)
Out[173]:
Functions max_exclude_own_col Ref:max_accum
Shape
100000x10 0.822783 1.0
100000x20 1.001660 1.0
100000x50 1.584334 1.0
100000x100 1.932017 1.0
100000x200 2.161241 1.0
100000x500 2.220725 1.0

How to eliminate data that contains two runs of three consecutive numbers in python?

Here's my data
id
123246512378
632746378456
378256364036
159204652855
327445634589
I want to reduce the data by removing ids that contain two runs of three consecutive digits, like 123246512378 and 327445634589, so that what remains is
id
632746378456
378256364036
159204652855
First, turn df.id into an array of single-digit integers.
a = np.array(list(map(list, map(str, df.id))), dtype=int)
Then check to see if one digit is one less than the next digit... twice
first = a[:, :-2] == a[:, 1:-1] - 1
second = a[:, 1:-1] == a[:, 2:] - 1
Create a mask that rejects the rows where this happens more than once
mask = np.count_nonzero(first & second, axis=1) < 2
df[mask]
id
1 632746378456
2 378256364036
3 159204652855
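For completeness, here is the same approach as a self-contained sketch; the DataFrame construction and imports are my addition, not part of the original answer.
import numpy as np
import pandas as pd

df = pd.DataFrame({'id': [123246512378, 632746378456, 378256364036,
                          159204652855, 327445634589]})

a = np.array(list(map(list, map(str, df.id))), dtype=int)   # one digit per column; assumes all ids have the same length
first = a[:, :-2] == a[:, 1:-1] - 1     # digit is one less than its right neighbour
second = a[:, 1:-1] == a[:, 2:] - 1     # and that neighbour is one less than the next
mask = np.count_nonzero(first & second, axis=1) < 2         # keep rows with fewer than 2 such runs
print(df[mask])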
Not sure if this is faster than @piRSquared's as I'm not good enough with pandas to generate my own test data, but it seems like it should be:
def mask_cons(df):
    a = np.array(list(map(list, df.id.astype(str))), dtype=float)
    # same as @piRSquared, but float
    g_a = np.gradient(a, axis=1)[:, 1:-1]
    # 3 consecutive values will give grad(a) = +/-1
    mask = (np.abs(g_a) == 1).sum(1) > 1
    # this assumes 4 consecutive values count as 2 instances of 3 consecutive values
    # otherwise more complicated methods are needed (probably @jit)
    return df[mask]
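One caveat (my observation, not from the answer): a central difference of 1 does not guarantee three consecutive digits, so this gradient test can fire on patterns the digit-by-digit comparison above would not flag. For example:
import numpy as np

# the digits 2, 7, 4 are not consecutive, yet their central difference is 1
print(np.gradient(np.array([2., 7., 4.]))[1])   # 1.0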

Vectorize a numpy.argmin search with a variable range per matrix row

Is there a way to get rid of the loop in the code below and replace it with vectorized operation?
Given a data matrix, for each row I want to find the index of the minimal value that fits within ranges defined (per row) in a separate array.
Here's an example:
import numpy as np
np.random.seed(10)
# Values of interest, for this example a random 6 x 100 matrix
data = np.random.random((6,100))
# For each row, define an inclusive min/max range
ranges = np.array([[0.3, 0.4],
                   [0.35, 0.5],
                   [0.45, 0.6],
                   [0.52, 0.65],
                   [0.6, 0.8],
                   [0.75, 0.92]])
# For each row, find the index of the minimum value that fits inside the given range
result = np.zeros(6).astype(np.int)
for i in xrange(6):
    ind = np.where((ranges[i][0] <= data[i]) & (data[i] <= ranges[i][1]))[0]
    result[i] = ind[np.argmin(data[i, ind])]
print result
# Result: [35 8 22 8 34 78]
print data[np.arange(6),result]
# Result: [ 0.30070006 0.35065639 0.45784951 0.52885388 0.61393513 0.75449247]
Approach #1 : Using broadcasting and np.minimum.reduceat -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
r,c = np.nonzero(mask)
cut_idx = np.unique(r, return_index=1)[1]
out = np.minimum.reduceat(data[mask], cut_idx)
Improvement to avoid np.nonzero and compute cut_idx directly from mask :
cut_idx = np.concatenate(( [0], np.count_nonzero(mask[:-1],1).cumsum() ))
Approach #2 : Using broadcasting and filling invalid places with NaNs and then using np.nanargmin -
mask = (ranges[:,None,0] <= data) & (data <= ranges[:,None,1])
result = np.nanargmin(np.where(mask, data, np.nan), axis=1)
out = data[np.arange(6),result]
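Reusing data, ranges, and the seed from the question, approach #2 should reproduce the loop's answer, which makes for an easy sanity check:
print(result)   # expected to match the loop: [35  8 22  8 34 78]
print(out)      # expected to match the minima printed in the question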
Approach #3 : If you don't have many iterations (just like the loop of 6 iterations in the sample), you might want to stick to a loop for memory efficiency, but make use of more efficient masking with a boolean array instead -
out = np.zeros(6)
for i in xrange(6):
    mask_i = (ranges[i,0] <= data[i]) & (data[i] <= ranges[i,1])
    out[i] = np.min(data[i,mask_i])
Approach #4 : There is one more loopy solution possible here. The idea would be to sort each row of data. Then, use the two range limits for each row to decide on the start and stop indices with help from np.searchsorted. Further, we would use those indices to slice and then get the minimum values. The benefit of slicing that way is that we would be working with views, which is very efficient on both memory and performance.
The implementation would look something like this -
out = np.zeros(6)
sdata = np.sort(data, axis=1)
for i in xrange(6):
    start = np.searchsorted(sdata[i], ranges[i,0])
    stop = np.searchsorted(sdata[i], ranges[i,1], 'right')
    out[i] = np.min(sdata[i,start:stop])
Furthermore, we could get those start, stop indices in a vectorized manner following an implementation of vectorized searchsorted.
Based on suggestion by @Daniel F for the case when we are dealing with ranges that are within the limits of the given data, we could simply use the start indices -
out[i] = sdata[i, start]
Assuming at least one value in range, you don't even have to bother with the upper limit:
result = np.empty(6)
for i in xrange(6):
    lt = (ranges[i,0] >= data[i]).sum()
    result[i] = np.argpartition(data[i], lt)[lt]
Actually, you could even vectorize the whole thing using argpartition
lt = (ranges[:,None,0] >= data).sum(1)
result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
Of course, this is only efficient if data.shape[0] << data.shape[1], as otherwise you're basically sorting
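As a quick sanity check (reusing data and ranges from the question, and assuming no value ties exactly with a range boundary), the vectorized argpartition version should pick out the same minima as the question's loop:
lt = (ranges[:, None, 0] >= data).sum(1)
vec_result = np.argpartition(data, lt)[np.arange(data.shape[0]), lt]
print(data[np.arange(6), vec_result])   # expected to match the minima printed in the question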

python numpy: iterate for different conditions without using a loop

I am trying to iterate over numpy arrays and generate an output under conditions similar to those described below:
min1 = 3
max1 = 1
a1 = np.array([1, 2, 5, 3, 4])
a2 = np.array([5, 2, 6, 2, 1])
output = np.zeros(5)
for i in range(0, 5):
    if((a1[i] - a2[i]) > min1):
        output[i] = 3 * (a1[i] - a2[i])
    if((a1[i] - a2[i]) < max1):
        output = 5 * (a1[i] - a2[i])
I need to optimize the above code so that it makes the best use of numpy functionality and avoids the loop. How should I do it?
While functions like select and where can condense the code, I think it's a good idea to know how to do this with basic boolean masking. It's applicable in many cases, and nearly always as fast.
Calculate the difference which is used several times:
In [432]: diff = a1-a2
In [433]: diff
Out[433]: array([-4, 0, -1, 1, 3])
In [435]: output = np.zeros_like(a1)
find those cases where it meets the first condition, and set the corresponding elements of output:
In [436]: mask1 = diff>min1
In [437]: output[mask1] = 3*diff[mask1]
repeat for the second condition:
In [438]: mask2 = diff<max1
In [439]: output[mask2] = 5*diff[mask2]
and again if there are more conditions.
In [440]: output
Out[440]: array([-20, 0, -5, 0, 0])
In this example, -4, 0 and -1 met condition 2, and none met condition 1.
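Wrapped up as a small function, the same masking logic looks like this (a compact restatement of the session above; the function name is just for illustration):
def piecewise_output(a1, a2, min1, max1):
    diff = a1 - a2
    output = np.zeros_like(diff)
    output[diff > min1] = 3 * diff[diff > min1]
    output[diff < max1] = 5 * diff[diff < max1]
    return output

print(piecewise_output(a1, a2, min1, max1))   # [-20 0 -5 0 0]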
Welcome to SO! First, a tip for questions:
Even your current loopy code doesn't work, as you're assigning values to output instead of output[i]. Try to make sure that if you're asking for a code refactor your original code works (and outside the numpy tag, asking for code refactoring on SO will normally get you downvoted).
You're going to want a nested np.where statement like this
output = np.where((a1 - a2) > min1, 3 * (a1 - a2), (np.where((a1 - a2) < max1, 5 * (a1 - a2), 0)))
This way you don't need to initialize output, and no more loopy code.
If you have lots of conditions, you can also use np.select
d = a1 - a2
condlist = [d > min1, d < max1]
choicelist = [3 * d, 5 * d]
output = np.select(condlist, choicelist)
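For the sample arrays in the question this gives the following (np.select uses the first matching condition and fills in 0 where neither condition holds):
d = a1 - a2                                       # array([-4,  0, -1,  1,  3])
output = np.select([d > min1, d < max1], [3 * d, 5 * d])
print(output)                                     # [-20 0 -5 0 0]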
Your question is a bit vague. Here's a possible solution depending on what you really want.
This solution will return 2 arrays containing the values that fall under the min and max conditions you indicated. It applies the - operation to your two arrays and then, instead of an if, uses numpy's where function, which applies the condition for you without iterating through the whole array.
import numpy as np
min1=3
max1=1
a1=np.array([1,2,5,3,4])
a2=np.array([5,2,6,2,1])
array_op = a1-a2
min_output = 3*(array_op[np.where((array_op)>min1)])
max_output = 5*(array_op[np.where((array_op)<max1)])
The solution is
import numpy as np
min1=3
max1=1
a1=np.array([1,2,5,3,4])
a2=np.array([5,2,6,2,1])
output=np.zeros(5)
diff = a1-a2
output[np.where(diff < max1)] = diff[np.where(diff < max1)]*5
output[np.where(diff > min1)] = diff[np.where(diff > min1)]*3
The operation a1[i] - a2[i] is done 4 times in a single iteration.
Saving it in a variable saves 3 * 5 = 15 computations.

Pythonista way of extracting elements from an array

I'm writing a few Python lines of code doing the following:
I have two arrays a and b; b contains (non-strictly) increasing integers.
I want to extract from a the values for which the value of b is a multiple of 20, but I don't want duplicates, in the sense that if b has values ..., 40, 40, 41, ... I only want the value of a corresponding to the first 40, not the second one.
That's why a[b % 20 == 0] does not work.
I've been using:
factors = [20*i for i in xrange(1,int(b[-1]/20 +1))]
sample = numpy.array([a[numpy.nonzero(b==factor)[0][0]] for factor in factors])
but it is both slow and fairly inelegant.
Is there a Pythonista 'cute' way of doing it?
a[(b % 20 == 0) & np.r_[True, np.diff(b) > 0]]
The b % 20 == 0 part gives a boolean mask that selects all the elements of b that are a multiple of 20. The np.r_[True, np.diff(b) > 0] part creates a boolean mask that selects only the elements that differ from the previous element (we explicitly add a True at the beginning, as the first element has no previous element). Combine the two masks with & and voila!
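A small made-up example to see the combined mask in action (the values of a and b below are chosen so that 40 appears twice in b):
import numpy as np

a = np.array([10, 11, 12, 13, 14, 15])
b = np.array([19, 40, 40, 41, 60, 61])

keep = (b % 20 == 0) & np.r_[True, np.diff(b) > 0]
print(a[keep])    # [11 14]: the first 40 and the 60 are kept, the duplicate 40 is not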
Let's say we create a boolean array which marks the unique values in b:
c = np.zeros(b.shape, dtype=np.bool)
c[np.unique(b, return_index = True)[1]] = True
Now you can do:
a[np.logical_and(b % 20 == 0, c)]
If your b is sorted, using diff should be a bit faster than using unique:
import numpy
a = numpy.random.random_integers(0, 1000, 1000)
b = numpy.random.random_integers(0, 1000, 1000)
b.sort()
subset = a[numpy.concatenate(([True], numpy.diff(b) != 0)) & (b % 20 == 0)]  # keep only the first of equal consecutive values
