numpy.where with more than one condition - python

I have a question about the function numpy.where.
I need to write a program that rounds a student's grades to the Danish grading scale.
(The Danish grading scale is a 7-step scale from the best grade (12) to the worst (−3): 12, 10, 7, 4, 02, 00, −3.)
Here is the array of the grades :
grades=np.array([[-3,-2,-1,0],[1,2,3,4],[5,6,7,8],[9,10,11,12]])
and what I am trying to do is this :
gradesrounded=np.where(grades<-1.5, -3, grades)
gradesrounded=np.where(-1.5<=grades and grades<1, 0, grades)
gradesrounded=np.where(grades>=1 and grades<3, 2, grades)
gradesrounded=np.where(grades>=3 and grades<5.5, 4, grades)
gradesrounded=np.where(grades>=5.5 and grades<8.5, 7, grades)
gradesrounded=np.where(grades>=8.5 and grades<11, 10, grades)
gradesrounded=np.where(grades>=11, 12, grades)
print(gradesrounded)
What I found out is that np.where works when there is one condition (grades below -1.5 and grades over 11 work, for example), but when there are two conditions combined with and (for example np.where(grades>=1 and grades<3, 2, grades)) it won't work.
Do you know how I could fix this ?
Thank you very much.

Another way is np.searchsorted:
scales = np.array([-3,0,2,4,7,10,12])
grades=np.array([[-3,-2,-1,0],[1,2,3,4],[5,6,7,8],[9,10,11,12]])
thresh = [-1.5, 0.5, 2.5, 5.5, 8.5, 10]
out = scales[np.searchsorted(thresh, grades)]
# or
# thresh = [-3, -1.5, 1, 3, 5.5, 8.5, 11]
# out = scales[np.searchsorted(thresh, grades, side='right')-1]
Out:
array([[-3, -3,  0,  0],
       [ 2,  2,  4,  4],
       [ 4,  7,  7,  7],
       [10, 10, 12, 12]])
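As a quick sanity check (my own addition, reusing the arrays from above), the two threshold conventions produce identical results:

```python
import numpy as np

scales = np.array([-3, 0, 2, 4, 7, 10, 12])
grades = np.array([[-3, -2, -1, 0], [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Upper-edge thresholds with the default side='left'.
out_left = scales[np.searchsorted([-1.5, 0.5, 2.5, 5.5, 8.5, 10], grades)]

# Lower-edge thresholds with side='right', stepping back one bin.
thresh = [-3, -1.5, 1, 3, 5.5, 8.5, 11]
out_right = scales[np.searchsorted(thresh, grades, side='right') - 1]

print(np.array_equal(out_left, out_right))  # True
```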

You are using the logical operator and, which doesn't work for array operations. Use the bitwise operators instead, which operate element by element (note the parentheses around each comparison, since & binds more tightly than the comparisons):
np.where((grades >= 1) & (grades < 3), 2, grades)
Have a look at this: link
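There is a second pitfall in the original code beyond and: every np.where call falls back on grades, so each assignment throws away the previous rounding and only the last line survives. A minimal sketch combining the & fix with chaining each result into the next call:

```python
import numpy as np

grades = np.array([[-3, -2, -1, 0], [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Each np.where must fall back on the *previous* result, not on grades,
# otherwise every assignment discards the rounding done so far.
rounded = np.where(grades < -1.5, -3, grades)
rounded = np.where((grades >= -1.5) & (grades < 1), 0, rounded)
rounded = np.where((grades >= 1) & (grades < 3), 2, rounded)
rounded = np.where((grades >= 3) & (grades < 5.5), 4, rounded)
rounded = np.where((grades >= 5.5) & (grades < 8.5), 7, rounded)
rounded = np.where((grades >= 8.5) & (grades < 11), 10, rounded)
rounded = np.where(grades >= 11, 12, rounded)
print(rounded)
```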

This is an excellent case for the np.select() function. The docs can be found here.
The setup is simple:
Create a list of Danish system grades.
Create a list of condition mappings. The cases below use the bitwise AND operator (&) to link multiple conditions.
Setup:
import numpy as np
# Sample grades.
x = np.array([-3, -2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
# Define limits and lookup.
grades = [12, 10, 7, 4, 2, 0, -3]
scale = [(x >= 11),
         (x >= 8.5) & (x < 11),
         (x >= 5.5) & (x < 8.5),
         (x >= 3.0) & (x < 5.5),
         (x >= 1.0) & (x < 3.0),
         (x >= -1.5) & (x < 1.0),
         (x < -1.5)]
Use:
Call the np.select function and pass in the two lists created above.
# Map grades to Danish system.
np.select(condlist=scale, choicelist=grades)
Output:
array([-3, -3, 0, 0, 2, 2, 4, 4, 4, 7, 7, 7, 10, 10, 12, 12])
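Since the condition arrays in scale keep whatever shape x has, the same mapping works unchanged on the 2-D array from the question; a quick sketch:

```python
import numpy as np

# Same np.select mapping as above, applied to the question's 2-D grades array.
x = np.array([[-3, -2, -1, 0], [1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
grades = [12, 10, 7, 4, 2, 0, -3]
scale = [(x >= 11),
         (x >= 8.5) & (x < 11),
         (x >= 5.5) & (x < 8.5),
         (x >= 3.0) & (x < 5.5),
         (x >= 1.0) & (x < 3.0),
         (x >= -1.5) & (x < 1.0),
         (x < -1.5)]
print(np.select(condlist=scale, choicelist=grades))
```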

If you want to express the condition as a string, you could also use a pandas DataFrame and its query method. Here is an example:
df = df.query('grades>=1 & grades<3')
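The answer gives no setup, so here is a self-contained sketch; the DataFrame and its grades column are assumptions on my part:

```python
import pandas as pd

# Hypothetical DataFrame with a 'grades' column, since the answer omits the setup.
df = pd.DataFrame({"grades": [-3, -2, -1, 0, 1, 2, 3, 4]})

# query parses '&' as a boolean AND between the two comparisons.
between = df.query('grades >= 1 & grades < 3')
print(between)
```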

Related

Applying a mask to a dataframe, but only over a certain range inside the dataframe

I currently have some code that uses a mask to calculate the mean of values that are overloads, and values that are baseline values. It does this over the entire length of the dataframe. However, now I want to only apply this to a certain range in the dataframe column, between first and last values (ie, a specified region in the column, dictated by user input). Here is my code as it stands:
mask_number = 5
no_overload_cycles = 1
hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})
list_test = []
for i in range(0, len(hyst) - 1, mask_number):
    for x in range(no_overload_cycles):
        list_test.append(i + x)
mask = np.array(list_test)
print(mask)
[0 1 5 10 15 20]
first = 4
last = 17
regression_area = hyst.iloc[first:last]
mean_range_overload = regression_area.loc[np.where(mask == regression_area.index)]['test'].mean()
mean_range_baseline = regression_area.drop(mask[first:last])['test'].mean()
So the overload mean would be over cycles 5, 10, and 15 in test, and the baseline mean would be over positions 4 to 17, excluding 5, 10, and 15. This would be my expected output:
print (mean_range_overload)
4
print(mean_range_baseline)
4.545454
However, the no_overload_cycles value can change, and may for example, be 3, which would then create a mask of this:
mask_number = 5
no_overload_cycles = 3
hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})
list_test = []
for i in range(0, len(hyst) - 1, mask_number):
    for x in range(no_overload_cycles):
        list_test.append(i + x)
mask = np.array(list_test)
print(mask)
[0 1 2 5 6 7 10 11 12 15 16 17 20]
So the mean_range_overload would be the mean of the values at 5, 6, 7, 10, 11, 12, 15, 16, 17, and the mean_range_baseline would be the mean of the values in between these, within the range of first and last in the dataframe column.
Any help on this would be greatly appreciated!
Assuming no_overload_cycles == 1 always, you can simply use slice objects to index the DataFrame.
Say you wish to, in your example, specifically pick cycles 5, 10 and 15 and use them as overload. Then you can get them by doing df.loc[5:15:5].
On the other hand, if you wish to pick the 5th, 10th and 15th cycles from the range you selected, you can get them by doing df.iloc[5:15+1:5] (iloc does not include the right index, so we add one). No loops required.
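A quick runnable sketch of those two options, using the hyst frame from the question (note that loc includes both slice endpoints, while iloc excludes the right one):

```python
import pandas as pd

hyst = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9,
                              7, 5, 3, 6, 3, 2, 1, 5, 2]})

by_label = hyst.loc[5:15:5, "test"]          # rows labelled 5, 10, 15
by_position = hyst.iloc[5:15 + 1:5]["test"]  # 5th, 10th, 15th positions

print(by_label.tolist(), by_label.mean())
```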
As mentioned in the comments, your question is slightly confusing, and it'd be helpful if you gave a better description and some expected results. In general I'd also advise you to decouple the domain-specific part of your problem before asking in a forum, since not everyone knows what you mean by "overload", "baseline", "cycles" etc. (I'm putting this here rather than in a comment since I don't yet have enough reputation to comment.)
I renamed a few of the variables, so what I called a "mask" is not exactly what you called a mask, but I reckon this is what you were trying to make:
import pandas as pd

mask_length = 5
overload_cycles_per_mask = 3
df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9, 7, 5, 3, 6, 3, 2, 1, 5, 2]})
selected_range = (4, 17)

overload_indices = []
baseline_indices = []

# `range` does not include the right-hand side, so we add one;
# ideally you would specify the range as (4, 18) instead.
for i in range(selected_range[0], selected_range[1] + 1):
    if i % mask_length < overload_cycles_per_mask:
        overload_indices.append(i)
    else:
        baseline_indices.append(i)

print(overload_indices)
print(df.iloc[overload_indices].test.mean())
print(baseline_indices)
print(df.iloc[baseline_indices].test.mean())
Basically, the DataFrame rows inside selected_range are divided into segments of length mask_length, each of which has their first overload_cycles_per_mask elements marked as overload, and any others, as baseline.
With that, you get two lists of indices, which you can directly pass to df.iloc, as according to the documentation it supports a list of integers.
Here is the output for mask_length = 5 and overload_cycles_per_mask = 1:
[5, 10, 15]
4.0
[4, 6, 7, 8, 9, 11, 12, 13, 14, 16, 17]
4.545454545454546
And here is for mask_length = 5 and overload_cycles_per_mask = 3:
[5, 6, 7, 10, 11, 12, 15, 16, 17]
3.6666666666666665
[4, 8, 9, 13, 14]
5.8
I do believe calling this a single mask makes things more confusing. In any case, I would tuck the logic for getting the indices away in some separate function to the one which calculates the mean.
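As a sketch of that suggestion (the function name and signature are my own), the index-building loop could be pulled out like this:

```python
import pandas as pd

def split_indices(first, last, mask_length, overload_cycles_per_mask):
    """Split row positions in [first, last] into overload and baseline lists."""
    overload, baseline = [], []
    for i in range(first, last + 1):
        if i % mask_length < overload_cycles_per_mask:
            overload.append(i)
        else:
            baseline.append(i)
    return overload, baseline

df = pd.DataFrame({"test": [12, 4, 5, 4, 1, 3, 2, 5, 10, 9,
                            7, 5, 3, 6, 3, 2, 1, 5, 2]})
overload, baseline = split_indices(4, 17, mask_length=5, overload_cycles_per_mask=1)
print(df.iloc[overload]["test"].mean())   # overload mean
print(df.iloc[baseline]["test"].mean())   # baseline mean
```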

Conditional selection in array

I have following list with arrays:
[array([10, 1, 7, 3]),
array([ 0, 14, 12, 13]),
array([ 3, 10, 7, 8]),
array([7, 5]),
array([ 5, 12, 3]),
array([14, 8, 10])]
What I want is to mark rows as "1" or "0", conditional on whether the row matches "10" AND "7" OR "10" AND "3".
np.where(output== 10 & output == 7 ) | (output == 10 & output == 3 ) | (output == 10 & output == 8 ), 1, 0)
returns
array(0)
What's the correct syntax to get into the array of the array?
Expected output:
[ 1, 0, 1, 0, 0, 1 ]
Note:
What is output? After training a CountVectorizer/LDA topic classifier in Scikit, the following script assigns topic probabilities to new documents. Topics above the threshold of 0.2 are then stored in an array.
def sortthreshold(x, thresh):
    idx = np.arange(x.size)[x > thresh]
    return idx[np.argsort(x[idx])]

output = []
for x in newdoc:
    y = lda.transform(bowvectorizer.transform([x]))
    output.append(sortthreshold(y[0], 0.2))
Thanks!
Your input data is a plain Python list of NumPy arrays of unequal length, so it can't simply be converted to a 2D NumPy array and processed directly by NumPy. But it can be processed using the usual Python list-processing tools.
Here's a list comprehension that uses numpy.isin to test whether a row contains any of (3, 7, 8). We first use simple == testing to see if the row contains 10, and only call isin if it does; the Python and operator will not evaluate its second operand if the first operand is false-ish.
We use np.any to see if any row item passes each test. np.any returns a Boolean value of False or True, but we can pass those values to int to convert them to 0 or 1.
import numpy as np
data = [
    np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
    np.array([3, 10, 7, 8]), np.array([7, 5]),
    np.array([5, 12, 3]), np.array([14, 8, 10]),
]
mask = np.array([3, 7, 8])
result = [int(np.any(row == 10) and np.any(np.isin(row, mask)))
          for row in data]
print(result)
output
[1, 0, 1, 0, 0, 1]
I've just performed some timeit tests. Curiously, Reblochon Masque's code is faster on the data given in the question, presumably because of the short-circuiting behaviour of plain Python any, and, and or. Also, it appears that numpy.in1d is faster than numpy.isin, even though the docs recommend using the latter in new code.
Here's a new version that's about 10% slower than Reblochon's.
mask = np.array([3, 7, 8])
result = [int(any(row == 10) and any(np.in1d(row, mask)))
          for row in data]
Of course, the true speed on large amounts of real data may vary from what my tests indicate. And time may not be an issue: even on my slow old 32 bit single core 2GHz machine I can process the data in the question almost 3000 times in one second.
hpaulj has suggested an even faster way. Here's some timeit test info, comparing the various versions. These tests were performed on my old machine, YMMV.
import numpy as np
from timeit import Timer
the_data = [
    np.array([10, 1, 7, 3]), np.array([0, 14, 12, 13]),
    np.array([3, 10, 7, 8]), np.array([7, 5]),
    np.array([5, 12, 3]), np.array([14, 8, 10]),
]

def rebloch0(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
                                    (any(output == 10) and any(output == 3)) or
                                    (any(output == 10) and any(output == 8)), 1, 0) == True else 0)
    return result

def rebloch1(data):
    result = []
    for output in data:
        result.append(1 if np.where((any(output == 10) and any(output == 7)) or
                                    (any(output == 10) and any(output == 3)) or
                                    (any(output == 10) and any(output == 8)), 1, 0) else 0)
    return result

def pm2r0(data):
    mask = np.array([3, 7, 8])
    return [int(np.any(row == 10) and np.any(np.isin(row, mask)))
            for row in data]

def pm2r1(data):
    mask = np.array([3, 7, 8])
    return [int(any(row == 10) and any(np.in1d(row, mask)))
            for row in data]

def hpaulj0(data):
    mask = np.array([3, 7, 8])
    return [int(any(row == 10) and any((row[:, None] == mask).flat))
            for row in data]

def hpaulj1(data, mask=np.array([3, 7, 8])):
    return [int(any(row == 10) and any((row[:, None] == mask).flat))
            for row in data]

functions = (
    rebloch0,
    rebloch1,
    pm2r0,
    pm2r1,
    hpaulj0,
    hpaulj1,
)

# Verify that all functions give the same result
for func in functions:
    print('{:8}: {}'.format(func.__name__, func(the_data)))
print()

def time_test(loops, data):
    timings = []
    for func in functions:
        t = Timer(lambda: func(data))
        result = sorted(t.repeat(3, loops))
        timings.append((result, func.__name__))
    timings.sort()
    for result, name in timings:
        print('{:8}: {:.6f}, {:.6f}, {:.6f}'.format(name, *result))
    print()

time_test(1000, the_data)
typical output
rebloch0: [1, 0, 1, 0, 0, 1]
rebloch1: [1, 0, 1, 0, 0, 1]
pm2r0 : [1, 0, 1, 0, 0, 1]
pm2r1 : [1, 0, 1, 0, 0, 1]
hpaulj0 : [1, 0, 1, 0, 0, 1]
hpaulj1 : [1, 0, 1, 0, 0, 1]
hpaulj1 : 0.140421, 0.154910, 0.156105
hpaulj0 : 0.154224, 0.154822, 0.167101
rebloch1: 0.281700, 0.282764, 0.284599
rebloch0: 0.339693, 0.359127, 0.375715
pm2r1 : 0.367677, 0.368826, 0.371599
pm2r0 : 0.626043, 0.628232, 0.670199
Nice work, hpaulj!
You can use the built-in any combined with np.where, and avoid the bitwise operators | and &, which operate on whole arrays and run into precedence and ambiguous-truth-value problems here.
import numpy as np
a = [np.array([10, 1, 7, 3]),
     np.array([0, 14, 12, 13]),
     np.array([3, 10, 7, 8]),
     np.array([7, 5]),
     np.array([5, 12, 3]),
     np.array([14, 8, 10])]
for output in a:
    print(np.where((any(output == 10) and any(output == 7)) or
                   (any(output == 10) and any(output == 3)) or
                   (any(output == 10) and any(output == 8)), 1, 0))
output:
1
0
1
0
0
1
If you want it as a list as the edited question shows:
result = []
for output in a:
    result.append(1 if np.where((any(output == 10) and any(output == 7)) or
                                (any(output == 10) and any(output == 3)) or
                                (any(output == 10) and any(output == 8)), 1, 0) == True else 0)
result
result:
[1, 0, 1, 0, 0, 1]

Numpy array: get upper diagonal and lower diagonal for a given element

import numpy
square = numpy.reshape(range(0,16),(4,4))
square
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])
In the above array, how do I access the primary diagonal and secondary diagonal of any given element? For example 9.
by primary diagonal, I mean - [4,9,14],
by secondary diagonal, I mean - [3,6,9,12]
I can't use numpy.diag() cause it takes the entire array to get the diagonal.
Based on your description, you can use np.where, np.diagonal, and np.fliplr:
import numpy as np
x, y = np.where(square == 9)
np.diagonal(square, offset=-(x - y))
Out[382]: array([ 4,  9, 14])

x, y = np.where(np.fliplr(square) == 9)
np.diagonal(np.fliplr(square), offset=-(x - y))
Out[396]: array([ 3,  6,  9, 12])
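The two lookups can also be wrapped in a small helper; the function name and return convention here are my own, and it assumes the value occurs exactly once in the array:

```python
import numpy as np

def diagonals_through(square, value):
    """Return the primary and secondary diagonals passing through `value`."""
    r, c = (int(a[0]) for a in np.where(square == value))
    primary = np.diagonal(square, offset=c - r)
    # Flip left-right so the anti-diagonal becomes an ordinary diagonal.
    flipped = np.fliplr(square)
    fr, fc = (int(a[0]) for a in np.where(flipped == value))
    secondary = np.diagonal(flipped, offset=fc - fr)
    return primary, secondary

square = np.reshape(np.arange(16), (4, 4))
primary, secondary = diagonals_through(square, 9)
print(primary)    # [ 4  9 14]
print(secondary)  # [ 3  6  9 12]
```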
For the first diagonal, use the fact that both the x and y coordinates increase by 1 each step:
def first_diagonal(x, y, length_array):
    if x < y:
        return zip(range(x, length_array), range(length_array - x))
    else:
        return zip(range(length_array - y), range(y, length_array))
For the secondary diagonal, use the fact that the x_coordinate + y_coordinate = constant.
def second_diagonal(x, y, length_array):
    tot = x + y
    return zip(range(tot + 1), range(tot, -1, -1))
This gives you two lists you can use to access your matrix.
Of course, if you have a non-square matrix these functions will have to be adjusted a bit.
To illustrate how to get the desired output:
a = np.reshape(range(0,16),(4,4))
first = first_diagonal(1, 2, len(a))
second = second_diagonal(1, 2, len(a))
primary_diagonal = [a[i[0]][i[1]] for i in first]
secondary_diagonal = [a[i[0]][i[1]] for i in second]
print(primary_diagonal)
print(secondary_diagonal)
this outputs:
[4, 9, 14]
[3, 6, 9, 12]

Finding the point of a slope change as a free parameter- Python

Say I have two lists of data as follows:
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
That is, it's pretty clear that merely fitting a single line to this data doesn't work; instead, the slope changes at some point in the data. (Obviously, one can pinpoint that change pretty easily in this data set, but it's not as clear in the set I'm working with, so let's ignore that.) Something with the derivative, I'm guessing, but the point is that I want to treat the change point as a free parameter, where I say "it's this point, +/- this uncertainty, and here is the linear slope before and after this point."
Note, I can do this with an array if it's easier. Thanks!
You need to find two slopes (== taking two derivatives). First, find the slope between every two points (using numpy):
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14], dtype=float)
m = np.diff(y)/np.diff(x)
print (m)
# [ 1. 1. 1. 1. 1. 2. 2. 2. 2.]
Clearly, slope changes from 1 to 2 in the sixth interval (between sixth and seventh points). Then take the derivative of this array, which tells you when the slope changes:
print (np.diff(m))
[ 0. 0. 0. 0. 1. 0. 0. 0.]
To find the index of the non-zero value:
idx = np.nonzero(np.diff(m))[0]
print (idx)
# 4
Since we took one derivative with respect to x, and indices start from zero in Python, idx+2 tells you that the slope is different before and after the sixth point.
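The question also asked for the slopes on either side of the change point. Building on the detection above, here is a sketch using np.polyfit on the two segments (the split logic and variable names are my own):

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], dtype=float)
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14], dtype=float)

# Detect the change interval as before, then convert it to a point index:
# the last point still on the old slope.
m = np.diff(y) / np.diff(x)
break_pt = np.nonzero(np.diff(m))[0][0] + 1

# Fit a line to each segment; the break point belongs to both fits.
slope_before = np.polyfit(x[:break_pt + 1], y[:break_pt + 1], 1)[0]
slope_after = np.polyfit(x[break_pt:], y[break_pt:], 1)[0]
print(x[break_pt], slope_before, slope_after)
```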
I'm not sure I understand exactly what you want, but you can see the evolution this way (first derivative):
>>> y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14]
>>> dy=[y[i+1]-y[i] for i in range(len(y)-1)]
>>> dy
[1, 1, 1, 1, 1, 2, 2, 2, 2]
and then find the point where it change (second derivative):
>>> dpy=[dy[i+1]-dy[i] for i in range(len(dy)-1)]
>>> dpy
[0, 0, 0, 0, 1, 0, 0, 0]
if you want the index of this point :
>>> dpy.index(1)
4
which gives you the value of the last point before the change of slope:
>>> change=dpy.index(1)
>>> y[change]
5
In your y = [1, 2, 3, 4, 5, 6, 8, 10, 12, 14] the change happens at index [4] (list indexing starts at 0) and the value of y at this point is 5.
You can calculate the slope as the difference between each pair of points (the first derivative). Then check where the slope changes (the second derivative). If it changes, append the index location to idx, the collection of points where the slope changes.
Note that the first point does not have a unique slope. The second pair of points will give you the slope, but you need the third pair before you can measure the change in slope.
idx = []
prior_slope = float(y[1] - y[0]) / (x[1] - x[0])
for n in range(2, len(x)):  # Start from 3rd pair of points.
    slope = float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
    if slope != prior_slope:
        idx.append(n)
    prior_slope = slope
>>> idx
[6]
Of course this could be done more efficiently in Pandas or Numpy, but I am just giving you a simple Python 2 solution.
A simple conditional list comprehension should also be pretty efficient, although it is more difficult to understand.
idx = [n for n in range(2, len(x))
       if float(y[n] - y[n - 1]) / (x[n] - x[n - 1])
       != float(y[n - 1] - y[n - 2]) / (x[n - 1] - x[n - 2])]
Knee point might be a potential solution.
from kneed import KneeLocator
import numpy as np
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([1, 2, 3, 4, 5, 6, 8, 10, 12, 14])
kn = KneeLocator(x, y, curve='convex', direction='increasing')
# You can use array y to automatically determine 'convex' and 'increasing' if y is well-behaved
idx = (np.abs(x - kn.knee)).argmin()
>>> print(x[idx], y[idx])
6 6

Elegant list comprehension to extract values in one dimension of an array based on values in another dimension

I'm looking for an elegant solution to this:
data = np.loadtxt(file)
# data[:,0] is a time
# data[:,1] is what I want to extract
mean = 0.0
count = 0
for n in xrange(np.size(data[:,0])):
    if data[n,0] >= tstart and data[n,0] <= tend:
        mean = mean + data[n,1]
        count = count + 1
mean = mean / float(count)
I'm guessing I could alternatively first extract my 2D array and then apply np.mean on it but I feel like there could be some list comprehension goodness to make this more elegant (I come from a FORTRAN background...). I was thinking something like (obviously wrong since i would not be an index):
np.mean([x for x in data[i,1] for i in data[:,0] if i >= tstart and i <= tend])
In numpy, rather than listcomps you can use lists and arrays for indexing purposes. To be specific, say we have a 2D array like the one you're working with:
>>> import numpy as np
>>> data = np.arange(20).reshape(10, 2)
>>> data
array([[ 0, 1],
[ 2, 3],
[ 4, 5],
[ 6, 7],
[ 8, 9],
[10, 11],
[12, 13],
[14, 15],
[16, 17],
[18, 19]])
We can get the first column:
>>> ts = data[:,0]
>>> ts
array([ 0, 2, 4, 6, 8, 10, 12, 14, 16, 18])
And create a boolean array corresponding to the terms we want:
>>> (ts >= 2) & (ts <= 6)
array([False, True, True, True, False, False, False, False, False, False], dtype=bool)
Then we can use this to select elements of the column we're interested in:
>>> data[:,1][(ts >= 2) & (ts <= 6)]
array([3, 5, 7])
and finally take its mean:
>>> np.mean(data[:,1][(ts >= 2) & (ts <= 6)])
5.0
Or, in one line:
>>> np.mean(data[:,1][(data[:,0] >= 2) & (data[:,0] <= 6)])
5.0
[Edit: data[:,1][(data[:,0] >= 2) & (data[:,0] <= 6)].mean() will work too; I always forget you can use methods.]
