scikit learn Random Forest Classifier probability threshold - python

I'm using sklearn RandomForestClassifier for a prediction task.
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(n_estimators=300, n_jobs=-1)
model.fit(x_train,y_train)
model.predict_proba(x_test)
There are 171 classes to predict.
I want to predict only those classes where predict_proba(class) is at least 90%. Everything below that should be set to 0.
For example, given the following:
     1    2    3    4    5    6    7
0  0.0  0.0  0.1  0.9  0.0  0.0  0.0
1  0.2  0.1  0.1  0.3  0.1  0.0  0.2
2  0.1  0.1  0.1  0.1  0.1  0.4  0.1
3  1.0  0.0  0.0  0.0  0.0  0.0  0.0
my expected output is:
0 4
1 0
2 0
3 1

You can use numpy.argwhere as follows:
from sklearn.ensemble import RandomForestClassifier
import numpy as np
model = RandomForestClassifier(n_estimators=300, n_jobs=-1)
model.fit(x_train,y_train)
preds = model.predict_proba(x_test)
#preds = np.array([[0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0],
# [ 0.2, 0.1, 0.1, 0.3, 0.1, 0.0, 0.2],
# [ 0.1 ,0.1, 0.1, 0.1, 0.1, 0.4, 0.1],
# [ 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
r = np.zeros(preds.shape[0], dtype=int)
t = np.argwhere(preds>=0.9)
r[t[:,0]] = t[:,1]+1
r
array([4, 0, 0, 1])

You can use list comprehensions:
import numpy as np
# dummy predictions - 3 samples, 3 classes
pred = np.array([[0.1, 0.2, 0.7],
[0.95, 0.02, 0.03],
[0.08, 0.02, 0.9]])
# first, keep only entries >= 0.9:
out_temp = np.array([[x[i] if x[i] >= 0.9 else 0 for i in range(len(x))] for x in pred])
out_temp
# result:
array([[0. , 0. , 0. ],
[0.95, 0. , 0. ],
[0. , 0. , 0.9 ]])
out = [0 if not x.any() else x.argmax()+1 for x in out_temp]
out
# result:
[0, 1, 3]
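Both approaches above can also be written as one vectorized step. This is a minimal sketch, assuming (as in the question's example) that column k of predict_proba corresponds to label k + 1. The helper name threshold_predict is hypothetical; with a fitted model you would likely index model.classes_ with the argmax instead of adding 1.

```python
import numpy as np

def threshold_predict(proba, threshold=0.9):
    """Return the 1-based column of the top class, or 0 if it is below threshold."""
    proba = np.asarray(proba)
    best = proba.argmax(axis=1)                   # most probable column per row
    confident = proba.max(axis=1) >= threshold    # rows whose top class is confident
    return np.where(confident, best + 1, 0)       # 0 marks "no confident prediction"

preds = np.array([[0.0, 0.0, 0.1, 0.9, 0.0, 0.0, 0.0],
                  [0.2, 0.1, 0.1, 0.3, 0.1, 0.0, 0.2],
                  [0.1, 0.1, 0.1, 0.1, 0.1, 0.4, 0.1],
                  [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
result = threshold_predict(preds)
print(result)
```

Since the probabilities in a row sum to 1, at most one class can reach 0.9, so checking only the row maximum is enough here.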

Related

Numpy/Pandas correlate multiple arrays of different length

I can correlate two arrays of different length using this method:
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
df = pd.DataFrame(dict(x=a))
CORR_VALS = np.array(b)
def get_correlation(vals):
    return pearsonr(vals, CORR_VALS)[0]
df['correlation'] = df.rolling(window=len(CORR_VALS)).apply(get_correlation)
I get a result like this:
In [1]: df
Out[1]:
     x  correlation
0  0.0          NaN
1  0.4          NaN
2  0.2          NaN
3  0.4          NaN
4  0.2          NaN
5  0.4     0.527932
6  0.2    -0.159167
7  0.5     0.189482
First of all, the Pearson coefficient I am after should just be the highest number in this column.
Secondly, how could I do this for multiple sets of data? I would like an output like the one I would get from df.corr(), with the indices and columns labeled appropriately.
For example, say I have the following datasets:
a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
c = [ 0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
d = [ 0.4, 0.2, 0.5]
I want a correlation matrix of sixteen Pearson coeffs...
import pandas as pd
import numpy as np
from scipy.stats import pearsonr
a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
c = [ 0, 0.4, 0.2, 0.4, 0.2, 0.45, 0.2, 0.52, 0.52, 0.4, 0.21, 0.2, 0.4, 0.51]
d = [ 0.4, 0.2, 0.5]
# To store the data
dict_series = {'a': a,'b': b,'c':c,'d':d}
list_series_names = [i for i in dict_series.keys()]
def get_max_correlation_from_lists(a, b):
    # This is to make sure the longest list is in the dataframe
    if len(b) >= len(a):
        a, b = b, a
    # Taking the body from the original code.
    df = pd.DataFrame(dict(x=a))
    CORR_VALS = np.array(b)
    def get_correlation(vals):
        return pearsonr(vals, CORR_VALS)[0]
    # Collecting the max
    return df.rolling(window=len(CORR_VALS)).apply(get_correlation).max().values[0]
# This is to create the "correlations" matrix
correlations_matrix = pd.DataFrame(index=list_series_names, columns=list_series_names)
for i in list_series_names:
    for j in list_series_names:
        correlations_matrix.loc[i, j] = get_max_correlation_from_lists(dict_series[i], dict_series[j])
print(correlations_matrix)
          a         b         c         d
a       1.0  0.527932  0.995791       1.0
b  0.527932       1.0   0.52229  0.427992
c  0.995791   0.52229       1.0  0.992336
d       1.0  0.427992  0.992336       1.0
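The same "slide the shorter series over the longer one and keep the best Pearson coefficient" idea can be sketched with NumPy alone. The function name max_sliding_pearson is hypothetical, and skipping constant windows (where the coefficient is undefined) is an assumption of this sketch rather than something the answer above does.

```python
import numpy as np

def max_sliding_pearson(x, y):
    """Max Pearson r of the shorter series against every window of the longer one."""
    long_arr = np.asarray(x, dtype=float)
    short_arr = np.asarray(y, dtype=float)
    if len(short_arr) > len(long_arr):
        long_arr, short_arr = short_arr, long_arr
    n = len(short_arr)
    scores = []
    for start in range(len(long_arr) - n + 1):
        window = long_arr[start:start + n]
        # Pearson r is undefined when either side has zero variance
        if window.std() == 0 or short_arr.std() == 0:
            continue
        scores.append(np.corrcoef(window, short_arr)[0, 1])
    return max(scores) if scores else float("nan")

a = [0, 0.4, 0.2, 0.4, 0.2, 0.4, 0.2, 0.5]
b = [25, 40, 62, 58, 53, 54]
print(max_sliding_pearson(a, b))
```

Calling it for every pair of series and writing the results into a labeled DataFrame gives the same kind of matrix as above.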

How can a tensor in tensorflow be sliced using elements of another array as an index?

I'm looking for a similar function to tf.unsorted_segment_sum, but I don't want to sum the segments, I want to get every segment as a tensor.
So for example, I have this code:
(In reality, I have a tensor of shape (10000, 63), and the number of segments would be 2500.)
to_be_sliced = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.3, 0.2, 0.2, 0.6, 0.3],
[0.9, 0.8, 0.7, 0.6, 0.5],
[2.0, 2.0, 2.0, 2.0, 2.0]])
indices = tf.constant([0, 2, 0, 1])
num_segments = 3
tf.unsorted_segment_sum(to_be_sliced, indices, num_segments)
The output here would be:
array([sum(row1+row3), row4, row2])
What I am looking for is 3 tensors with different shapes (maybe a list of tensors): the first containing the first and third rows of the original (shape (2, 5)), the second containing the 4th row (shape (1, 5)), and the third containing the second row, like this:
[array([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.9, 0.8, 0.7, 0.6, 0.5]]),
array([[2.0, 2.0, 2.0, 2.0, 2.0]]),
array([[0.3, 0.2, 0.2, 0.6, 0.3]])]
Thanks in advance!
You can do that like this:
import tensorflow as tf
to_be_sliced = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.3, 0.2, 0.2, 0.6, 0.3],
[0.9, 0.8, 0.7, 0.6, 0.5],
[2.0, 2.0, 2.0, 2.0, 2.0]])
indices = tf.constant([0, 2, 0, 1])
num_segments = 3
result = [tf.boolean_mask(to_be_sliced, tf.equal(indices, i)) for i in range(num_segments)]
with tf.Session() as sess:
    print(*sess.run(result), sep='\n')
Output:
[[0.1 0.2 0.3 0.4 0.5]
[0.9 0.8 0.7 0.6 0.5]]
[[2. 2. 2. 2. 2.]]
[[0.3 0.2 0.2 0.6 0.3]]
For your case, you can use NumPy-style slicing in TensorFlow. So this will work:
sliced_1 = to_be_sliced[:3, :]
# [[0.1 0.2 0.3 0.4 0.5]
#  [0.3 0.2 0.2 0.6 0.3]
#  [0.9 0.8 0.7 0.6 0.5]]
sliced_2 = to_be_sliced[3, :]
# [2. 2. 2. 2. 2.]
Or a more general option, you can do it in the following way:
to_be_sliced = tf.constant([[0.1, 0.2, 0.3, 0.4, 0.5],
[0.3, 0.2, 0.2, 0.6, 0.3],
[0.9, 0.8, 0.7, 0.6, 0.5],
[2.0, 2.0, 2.0, 2.0, 2.0]])
first_tensor = tf.gather_nd(to_be_sliced, [[0], [2]])
second_tensor = tf.gather_nd(to_be_sliced, [[3]])
third_tensor = tf.gather_nd(to_be_sliced, [[1]])
concat = tf.concat([first_tensor, second_tensor, third_tensor], axis=0)
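The boolean-mask answer is essentially a group-by on the segment ids, which can be checked quickly in NumPy before wiring it into a graph. This is only a sketch: split_by_segment is a hypothetical helper name, not a TensorFlow or NumPy function. In TensorFlow itself, tf.dynamic_partition(data, partitions, num_partitions) performs this kind of split in a single op.

```python
import numpy as np

def split_by_segment(data, segment_ids, num_segments):
    """Return a list with one array per segment id, keeping the original row order."""
    data = np.asarray(data)
    segment_ids = np.asarray(segment_ids)
    # One boolean mask per segment; each mask selects the rows with that id
    return [data[segment_ids == i] for i in range(num_segments)]

to_be_sliced = np.array([[0.1, 0.2, 0.3, 0.4, 0.5],
                         [0.3, 0.2, 0.2, 0.6, 0.3],
                         [0.9, 0.8, 0.7, 0.6, 0.5],
                         [2.0, 2.0, 2.0, 2.0, 2.0]])
indices = np.array([0, 2, 0, 1])
parts = split_by_segment(to_be_sliced, indices, num_segments=3)
for part in parts:
    print(part.shape)
```

With 2500 segments this builds 2500 masks over all rows, so a single partitioning op is likely the better fit at that size.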

Getting the location of value from numpy.where() as single value and append it to another array

I have an array in python created from numpy as:
a = [[1. 0.5 0.3 ... 0.71 0.72 0.73]
[0. 0.4 0.6 ... 0.74 0.75 0.76]
[0. 0.3 0. ... 0.72 0.73 0.74]
...
[0. 0.2 0.3 ... 0.56 0.57 0.58]
[0. 0.1 0.3 ... 0.67 0.68 0.69]]
and another array
b = [[1. 0.5 0.6 ... 0.74 0.75 0.76]]
which I got from np.max(a, axis=0). Now I need the index in array 'a' where the value is equal to the corresponding value in 'b', for which I used:
locn = []
for i in range(0, len(b[0])):
    for j in range(0, len(a)):
        fav = np.where(a[j][i] == b[0][j])
        locn.append(fav)
print(locn)
I get the output as
[(array([0]),), (array([0]),), (array([0]),), (array([0]),), (array([], dtype=int64),), (array([], dtype=int64),), (array([], dtype=int64),), (array([], dtype=int64),), (array([0]),), (array([0]),), (array([0]),), (array([0]),), (array([], dtype=int64),), ............
I could have used np.where(a == np.max(a)) to get the location of the maximum, but that is not my problem. I need the exact location (like the 1st element of the 1st array, or something like that) and want to append that index to locn[]. For example, in the first round 1 is the highest value, so I just need to append the index 0 to locn[], since 0 is the index in the first round where the element of the inner array equals the maximum value.
How can I do this? Thanks in advance.
You can use the function argmax instead of just max. For example
a = np.random.randint(10, size=(4, 5))
[[8 9 6 4 7]
 [6 4 0 3 6]
 [7 5 9 1 6]
 [1 4 8 8 9]]
np.max(a, axis=0)
array([8, 9, 9, 8, 9])
np.argmax(a, axis=0)
array([0, 0, 2, 3, 3], dtype=int64)
If you want to print the info the way you are describing then you can do
b = np.argmax(a, axis=0)
print('locn'+str(b))
locn[0 0 2 3 3]
Even if the elements to find are not the maxima but, for example, randomly chosen, we can still use argmax on a==b.
Example:
# generate random data
>>> n = 10
>>> a = np.round(np.random.random((n, n)), 1)
>>> a
array([[0.3, 0.2, 0.2, 0.4, 0.1, 0.6, 0.8, 0.9, 0.8, 0.1],
[0.7, 1. , 0.1, 0.1, 0.4, 1. , 0.7, 0.8, 0.6, 0.5],
[0.1, 0.5, 1. , 0.4, 0.6, 0.8, 0.9, 0.3, 0.2, 0.4],
[0.2, 0.6, 0.2, 0. , 0.7, 0.8, 0.9, 0.6, 0. , 0.1],
[0.4, 0. , 0.8, 0.2, 0.1, 0.8, 0.2, 0.6, 0.1, 0. ],
[0.1, 0.2, 0.4, 0.4, 0. , 0.6, 0.6, 0.9, 0.6, 0.3],
[0.9, 1. , 0.8, 0.8, 0.3, 0.5, 0.5, 0.2, 0.4, 0.7],
[0.5, 0.5, 0.2, 0.8, 0.8, 0.1, 0.7, 0.5, 0.9, 0.5],
[0. , 0.4, 0.5, 0.5, 0.6, 0.2, 0.5, 0.9, 0.6, 0.9],
[0.8, 0.5, 0.1, 0.9, 0.7, 0.1, 0.8, 0. , 0.9, 0.8]])
# randomly pick an index each column
>>> choice = np.random.randint(0, n, (n,))
>>>
# retrieve values at chosen locations
>>> b = a[choice, range(n)]
>>> b
array([0.4, 0.2, 0.8, 0.4, 0.6, 0.6, 0.8, 0.9, 0.6, 0.5])
>>>
# now recover `choice`; if the chosen value also occurs earlier in that
# column, the index of the first occurrence is returned instead.
>>> recover = np.argmax(a==b, axis=0)
>>> recover
array([4, 0, 4, 0, 2, 0, 0, 0, 1, 1])
>>>
# check result:
>>> recover <= choice
array([ True, True, True, True, True, True, True, True, True,
True])
>>> a[recover, range(n)] == b
array([ True, True, True, True, True, True, True, True, True,
True])
As a nice little bonus, this takes advantage of the fact that max/argmax short-circuits on booleans (a==b is, however, still evaluated everywhere):
>>> timeit('np.argmax(x)', globals={'np': np, 'x': np.ones(1000000, bool)}, number=100000)
0.10291801800121902
>>> timeit('np.argmax(x)', globals={'np': np, 'x': np.zeros(1000000, bool)}, number=100000)
4.172021539001435
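One caveat with argmax on a==b: a column with no match at all also yields 0, which is indistinguishable from a match in row 0. A small sketch of a guard, assuming -1 is an acceptable "not found" marker (the helper name first_match_rows is hypothetical):

```python
import numpy as np

def first_match_rows(a, b):
    """For each column, the first row where a equals b, or -1 if no row matches."""
    a = np.asarray(a)
    b = np.asarray(b)
    matches = a == b                   # b of shape (n,) or (1, n) broadcasts over rows
    rows = matches.argmax(axis=0)      # first True per column (0 if there is none)
    found = matches.any(axis=0)        # whether the column has any True at all
    return np.where(found, rows, -1)

a = np.array([[1.0, 0.5, 0.3],
              [0.0, 0.4, 0.6],
              [0.0, 0.3, 0.0]])
b = np.max(a, axis=0, keepdims=True)   # shape (1, 3), like b in the question
locn = first_match_rows(a, b).tolist()
print(locn)
```

When b really is the column-wise maximum, every column has a match and this reduces to np.argmax(a, axis=0).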

how to add value within certain intervals only python

I have a dataframe, and the values in one column range from -1 to 1. I want to add 0.1 to all values between -1 and 0.6 only. Is it possible to do this?
Suppose a is my list:
a = ([-1. , -0.5, 0.1 , 0.2, 0.45, 0.7, 0.64, 1])
and I want this:
([-0.9, -0.4, 0.2, 0.3, 0.55, 0.7, 0.74, 1])
Yes, it's possible:
a = [-1. , -0.5, 0.1 , 0.2, 0.45, 0.7, 0.64, 1]
a = [x + 0.1 if -1 <= x <= 0.6 else x for x in a]
print(a)
Results:
[-0.9, -0.4, 0.2, 0.3, 0.55, 0.7, 0.64, 1]
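For array or column data, the same condition can be expressed as a boolean mask instead of a Python loop. A minimal NumPy sketch, assuming both interval ends are inclusive as in the list comprehension above; the same kind of mask can be used with .loc on a pandas column.

```python
import numpy as np

a = np.array([-1.0, -0.5, 0.1, 0.2, 0.45, 0.7, 0.64, 1.0])

# True where the value lies in the closed interval [-1, 0.6]
in_range = (a >= -1) & (a <= 0.6)

# Add 0.1 only where the mask is True; leave the other values untouched
shifted = np.where(in_range, a + 0.1, a)
print(np.round(shifted, 2))
```

Note that 0.64 and 0.7 stay unchanged because they lie above 0.6, and that sums such as 0.2 + 0.1 carry the usual floating-point rounding error, hence the np.round when printing.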

Finding the Steady State Output of a Linear Recurrent Network

I'm taking a Computational Neuroscience class on Coursera. So far it's been going great! However, I'm getting a little stuck on one of the quiz problems.
I am not taking this class for a certificate or anything, solely for fun. I already took the quiz and, after a while, guessed the answer, so this would not even be answering the quiz for me.
The question is framed as the following:
Suppose that we had a linear recurrent network of 5 input nodes and 5 output nodes. Let us say that our network's weight matrix W is:
W = [0.6 0.1 0.1 0.1 0.1]
[0.1 0.6 0.1 0.1 0.1]
[0.1 0.1 0.6 0.1 0.1]
[0.1 0.1 0.1 0.6 0.1]
[0.1 0.1 0.1 0.1 0.6]
(Essentially, all 0.1, besides 0.6 on the diagonals.)
Suppose that we have a static input vector u:
u = [0.6]
[0.5]
[0.6]
[0.2]
[0.1]
Finally, suppose that we have a recurrent weight matrix M:
M = [-0.25, 0, 0.25, 0.25, 0]
[0, -0.25, 0, 0.25, 0.25]
[0.25, 0, -0.25, 0, 0.25]
[0.25, 0.25, 0, -0.25, 0]
[0, 0.25, 0.25, 0, -0.25]
Which of the following is the steady state output v_ss of the network?
(Hint: See the lecture on recurrent networks, and consider writing some Octave or Matlab code to handle the eigenvectors/values (you may use the "eig" function).)
The notes for the class can be found here. Specifically, the equation for the steady state formula can be found on slides 5 and 6.
I have the following code.
import numpy as np
# Construct W, the network weight matrix
W = np.ones((5,5))
W = W / 10.
np.fill_diagonal(W, 0.6)
# Construct u, the static input vector
u = np.zeros(5)
u[0] = 0.6
u[1] = 0.5
u[2] = 0.6
u[3] = 0.2
u[4] = 0.1
# Construct M, the recurrent weight matrix
M = np.zeros((5,5))
np.fill_diagonal(M, -0.25)
for i in range(3):
    M[2+i][i] = 0.25
    M[i][2+i] = 0.25
for i in range(2):
    M[3+i][i] = 0.25
    M[i][3+i] = 0.25
# We need to matrix multiply W and u together to get h
# NOTE: cannot use W * u, that's going to do a scalar multiply
# it's element wise otherwise
h = W.dot(u)
print('This is h')
print(h)
# Ok then the big deal is:
#                                h dot e_i
# v_ss = sum_(over all eigens) ------------ e_i
#                               1 - lambda_i
eigs = np.linalg.eig(M)
eigenvalues = eigs[0]
eigenvectors = eigs[1]
v_ss = np.zeros(5)
for i in range(5):
    v_ss += (np.dot(h,eigenvectors[:, i]))/((1.0-eigenvalues[i])) * eigenvectors[:,i]
print('This is our steady state v_ss')
print(v_ss)
The correct answer is:
[0.616, 0.540, 0.609, 0.471, 0.430]
This is what I am getting:
This is our steady state v_ss
[ 0.64362264 0.5606784 0.56007018 0.50057043 0.40172501]
Can anyone spot my bug? Thank you so much! I greatly appreciate it and apologize for the long post. Essentially, all you need to look at are slides 5 and 6 at that top link.
I tried your solution with my matrices:
W = np.array([[0.6 , 0.1 , 0.1 , 0.1 , 0.1],
[0.1 , 0.6 , 0.1 , 0.1 , 0.1],
[0.1 , 0.1 , 0.6 , 0.1 , 0.1],
[0.1 , 0.1 , 0.1 , 0.6 , 0.1],
[0.1 , 0.1 , 0.1 , 0.1 , 0.6]])
u = np.array([.6, .5, .6, .2, .1])
M = np.array([[-0.75 , 0 , 0.75 , 0.75 , 0],
[0 , -0.75 , 0 , 0.75 , 0.75],
[0.75 , 0 , -0.75 , 0 , 0.75],
[0.75 , 0.75 , 0.0 , -0.75 , 0],
[0 , 0.75 , 0.75 , 0 , -0.75]])
and your code generated the right solution:
This is h
[ 0.5 0.45 0.5 0.3 0.25]
This is our steady state v_ss
[ 1.663354 1.5762684 1.66344153 1.56488258 1.53205348]
Maybe the problem is with the test on Coursera. Have you tried contacting them on the forum?
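One plausible explanation for the mismatch in the question (an assumption, not something either poster confirmed): M is symmetric with repeated eigenvalues, and np.linalg.eig does not promise orthonormal eigenvectors in that case, while the eigen-expansion in the code comment relies on them being orthonormal. A sketch that avoids the issue in two ways, by solving (I - M) v = h directly and by using np.linalg.eigh, which is intended for symmetric matrices:

```python
import numpy as np

# W, u and M exactly as given in the question
W = np.full((5, 5), 0.1)
np.fill_diagonal(W, 0.6)
u = np.array([0.6, 0.5, 0.6, 0.2, 0.1])
M = np.array([[-0.25, 0.0, 0.25, 0.25, 0.0],
              [0.0, -0.25, 0.0, 0.25, 0.25],
              [0.25, 0.0, -0.25, 0.0, 0.25],
              [0.25, 0.25, 0.0, -0.25, 0.0],
              [0.0, 0.25, 0.25, 0.0, -0.25]])
h = W @ u

# Option 1: the steady state satisfies v = h + M v, i.e. (I - M) v = h
v_direct = np.linalg.solve(np.eye(5) - M, h)

# Option 2: the same eigen-expansion, with orthonormal eigenvectors from eigh
eigenvalues, eigenvectors = np.linalg.eigh(M)
coeffs = (eigenvectors.T @ h) / (1.0 - eigenvalues)   # (h . e_i) / (1 - lambda_i)
v_eigh = eigenvectors @ coeffs                        # sum_i coeffs[i] * e_i

print(np.round(v_direct, 3))
print(np.round(v_eigh, 3))
```

Both variants agree with the quiz answer [0.616, 0.540, 0.609, 0.471, 0.430] up to rounding.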
