Slicing the output of multilabel predict_proba - python

I am using scikit-learn's MultiOutputClassifier to create a multi-label output for 7 distinct classes using:
multilabel_model.predict_proba(X_test)
which gives me an array with shape (7, 14545, 2) that contains both probabilities for the class being 0 and being 1:
[array([[9.7169727e-01, 2.8302711e-02],
[9.9807453e-01, 1.9254771e-03],
[9.9955606e-01, 4.4392250e-04],
...,
[9.9957782e-01, 4.2216384e-04],
[9.9833119e-01, 1.6688267e-03],
[9.9959826e-01, 4.0173010e-04]], dtype=float32),
array([[9.7968739e-01, 2.0312620e-02],
[9.9961036e-01, 3.8966016e-04],
[9.9990100e-01, 9.8974662e-05],
...,
Now I am looking for a way to slice the array such that the output only contains the probability of each of the 7 possible classes being equal to 1, i.e. an output that would look like this:
[[0.3,0.45,0.2,0.1,0.1,0.45,0.2],
[0.1,0.45,0.2,0.3,0.45,0.2,0.1],
...]
Is there a way of using some slicing magic to achieve this or does this require a sophisticated custom function?

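predict_proba returns a list with one (n_samples, 2) array per label, so stack it into a single array first. Then take the probability of 1 (2nd position) and transpose so that each row is one sample. Note that reshape would only reinterpret the memory order and mix up samples and labels, so use .T instead:
import numpy as np
probas = np.array(multilabel_model.predict_proba(X_test))
# probas.shape == (7, 14545, 2)
one_probas = probas[:, :, 1].T
# one_probas.shape == (14545, 7)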
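A small self-contained check of the slicing, using toy data in place of the real model (the shapes and random values here are made up for illustration). It stacks the per-label arrays, takes the column for class 1, and transposes; it also shows that a plain reshape to the same shape does not give the same result.

```python
import numpy as np

# Toy stand-in for predict_proba output: a list with one array per label,
# each of shape (n_samples, 2) holding [P(y=0), P(y=1)].
rng = np.random.default_rng(0)
n_labels, n_samples = 3, 4
p1 = rng.uniform(size=(n_labels, n_samples))        # P(y=1) per label, sample
probas = [np.column_stack([1 - p, p]) for p in p1]

stacked = np.stack(probas)                # shape (n_labels, n_samples, 2)
one_probas = stacked[:, :, 1].T           # shape (n_samples, n_labels)

# reshape keeps the flat memory order, so entries end up in the wrong cells
reshaped = stacked[:, :, 1].reshape(n_samples, n_labels)

print(one_probas.shape)
print(np.array_equal(one_probas, reshaped))
```

Row i of one_probas then holds the class-1 probabilities of sample i for all labels.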

Related

Keras MeanSquaredError calculate loss per individual sample

I'm trying to get the MeanSquaredError of each individual sample in my tensors.
Here is some sample code to show my problem.
src = np.random.uniform(size=(2, 5, 10))
tgt = np.random.uniform(size=(2, 5, 10))
srcTF = tf.convert_to_tensor(src)
tgtTF = tf.convert_to_tensor(tgt)
print(srcTF, tgtTF)
lf = tf.keras.losses.MeanSquaredError(reduction=tf.compat.v1.losses.Reduction.NONE)
flowResults = lf(srcTF, tgtTF)
print(flowResults)
Here are the results:
(2, 5, 10) (2, 5, 10)
(2, 5)
I want to keep all the original dimensions of my tensors, and just calculate loss on the individual samples. Is there a way to do this in Tensorflow?
Note that pytorch's torch.nn.MSELoss(reduction = 'none') does exactly what I want, so is there an alternative that's more like that?
Here is a way to do it:
[ins] In [97]: mse = tf.keras.losses.MSE(tf.expand_dims(srcTF, axis=-1) , tf.expand_dims(tgtTF, axis=-1))
[ins] In [98]: mse.shape
Out[98]: TensorShape([2, 5, 10])
I think the key here is what counts as a sample. MSE is computed over the last axis, so that axis is the one being "reduced" and it disappears. Each entry of the (2, 5) result is the mean squared error over the 10 values along the last axis. To get back the original shape, we have to take the MSE of each scalar on its own, which means adding a trailing axis of length 1. In effect, we are saying that (2, 5, 10) is the batch shape and each scalar is its own sample/prediction, which is what tf.expand_dims(<tensor>, -1) accomplishes.
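The same reasoning can be checked with plain NumPy. This is only a sketch of the arithmetic, not of the Keras implementation: mse_last_axis is a hypothetical helper that averages squared differences over the last axis.

```python
import numpy as np

def mse_last_axis(a, b):
    # Mean of squared differences, reduced over the last axis only.
    return np.mean((a - b) ** 2, axis=-1)

rng = np.random.default_rng(0)
src = rng.uniform(size=(2, 5, 10))
tgt = rng.uniform(size=(2, 5, 10))

reduced = mse_last_axis(src, tgt)                              # shape (2, 5)
# Adding a trailing axis of length 1 means each mean covers a single value.
per_element = mse_last_axis(src[..., None], tgt[..., None])    # shape (2, 5, 10)

print(reduced.shape, per_element.shape)
```

With a trailing axis of length 1 the mean is taken over one value, so the result is just the element-wise squared difference.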

One hot encoding using sklearn preprocessing Label Binarizer

I am trying to use sklearn.preprocessing.LabelBinarizer() to create a one-hot encoding of labels with only two classes, i.e. I only want to categorize two sets of objects. In this case, when I use fit(range(0,2)), it just returns a single column instead of two. This is fine, but when I want to use the result in Tensorflow, it should really have one column per class for dimensional consistency. Please advise how I can resolve this.
Here is the code:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(range(0, 3))
Calling lb.transform([1, 0]), the result is:
[[0 1 0]
[1 0 0]]
whereas when we change 3 to 2, i.e. lb.fit(range(0, 2)), the result would be
[[1]
[0]]
instead of
[[0 1]
[1 0]]
This will create problems in algorithms that expect arrays with a consistent number of dimensions. Is there any way to resolve this issue?
LabelBinarizer's purpose, according to the documentation, is:
Binarize labels in a one-vs-all fashion
Several regression and binary classification algorithms are available in scikit-learn.
A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.
If your data has only two types of labels, then you can directly feed that to binary classifier. Hence, one column is good enough to capture two classes in One-Vs-Rest fashion.
Binary targets transform to a column vector
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit_transform(['yes', 'no', 'no', 'yes'])
array([[1],
[0],
[0],
[1]])
If your intention is just creating one-hot encoding, use the following method.
from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit_transform([['yes'], ['no'], ['no'], ['yes']]).toarray()
array([[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.]])
Hope this clarifies why sklearn's LabelBinarizer does not convert two-class data into a two-column output.
As already said in a comment, this is not a problem with the method. According to the documentation: Binary targets transform to a column vector. When there are only two classes, you can build the array you want from the column vector result.
A direct and simple way to do this is:
import numpy as np
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(range(2))  # range(0, 2) is the same as range(2)
a = lb.transform([1, 0])  # column vector [[1], [0]]
# First column marks class 0, second column marks class 1
result_2d = np.hstack((1 - a, a))  # [[0, 1], [1, 0]]
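If the goal is simply a one-hot array with one column per class, including the two-class case, plain NumPy indexing is another option. This is a different technique from LabelBinarizer, sketched under the assumption that the labels are already integers in 0..n_classes-1; one_hot is a hypothetical helper name.

```python
import numpy as np

def one_hot(labels, n_classes):
    # Row i of the identity matrix is the one-hot vector for class i,
    # so indexing the identity with the labels picks one row per label.
    return np.eye(n_classes, dtype=int)[np.asarray(labels)]

two_class = one_hot([1, 0], 2)     # one column per class, even for 2 classes
three_class = one_hot([1, 0], 3)

print(two_class)
print(three_class)
```

The output always has n_classes columns, so downstream code does not need a special case for binary labels.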

Scale (apply function?) sparse matrix logarithmically

I am using scikit-learn preprocessing scaling for sparse matrices.
My goal is to "scale" each feature column by taking the logarithm with the column's maximum value as the base. My wording may be inexact, so let me try to explain.
Say feature-column has values: 0, 8, 2:
Max value = 8
Log-8 of feature value 0 should be 0.0 = math.log(0+1, 8+1) (the +1 is to cope with zeros; so yes, we are actually taking log-base 9)
Log-8 of feature value 8 should be 1.0 = math.log(8+1, 8+1)
Log-8 of feature value 2 should be 0.5 = math.log(2+1, 8+1)
Yes, I can easily apply any arbitrary function-based transformer with FunctionTransformer, but I want the base of the log to change for each column (in particular, to depend on the column's maximum value). That is, I want to do something like the MaxAbsScaler, only taking logarithms.
I see that MaxAbsScaler first gets a vector (scale) of the maximum values of each column and then multiplies the original matrix by 1 / scale.
However, I don't know what to do if I want to take logarithms with bases given by the scale vector. Is it even possible to turn the logarithm operation into a multiplication, or are there other scipy sparse operations that would be efficient?
I hope my intent is clear (and possible).
Logarithm of x in base b is the same as log(x)/log(b), where logs are natural. So, the process you describe amounts to first applying log(x+1) transformation to everything, and then scaling by max absolute value. Conveniently, log(x+1) is a built-in function, log1p. Example:
from sklearn.preprocessing import FunctionTransformer, maxabs_scale
from scipy.sparse import csc_matrix
import numpy as np
logtran = FunctionTransformer(np.log1p, accept_sparse=True)
X = csc_matrix([[ 1., 0, 8], [ 2., 0, 0], [ 0, 1., 2]])
Y = maxabs_scale(logtran.transform(X))
Output (sparse matrix Y):
(0, 0) 0.630929753571
(1, 0) 1.0
(2, 1) 1.0
(0, 2) 1.0
(2, 2) 0.5
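The change-of-base identity can be checked on the example column from the question (values 0, 8, 2). This is a dense NumPy sketch of the arithmetic only; it does not use the sparse pipeline above.

```python
import math
import numpy as np

col = np.array([0.0, 8.0, 2.0])

# log1p, then divide by the column maximum of the logged values
logged = np.log1p(col)
scaled = logged / logged.max()

# Direct computation with base (max + 1), as described in the question
direct = np.array([math.log(v + 1, col.max() + 1) for v in col])

print(scaled)   # approximately [0, 1, 0.5]
print(direct)
```

Both routes agree because log_b(x) = log(x) / log(b), and log1p is monotonic, so the maximum of the logged column is the log of the original maximum plus one.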

Scipy fitting polynomial model to some data

I am trying to find an appropriate function for the permeability of cells under varying conditions. If I assume constant permeability, I can fit it to the experimental data and use sklearn's PolynomialFeatures together with a LinearModel (as explained in this post) to determine a correlation between the conditions and the permeability. However, the permeability is not constant, and now I am trying to fit my model with the permeability as a function of the process conditions. The PolynomialFeatures class of sklearn is quite nice to use.
Is there an equivalent function within scipy or numpy which allows me to create a polynomial model (including interaction terms, e.g. a*x[0]*x[1] etc.) of varying order without writing the whole function by hand?
The standard polynomial class in numpy does not seem to support interaction terms.
I'm not aware of such a function that does exactly what you need, but you can achieve it using a combination of itertools and numpy.
If you have n_features predictor variables, you essentially must generate all vectors of length n_features whose entries are non-negative integers that sum to the specified order. Each new feature column is the product of the old features raised component-wise to the powers in one of these vectors.
For example, if order = 3 and n_features = 2, one of the new features will be the old features raised to the respective powers [2,1], i.e. x[0]**2 * x[1]. I've written some code below for arbitrary order and number of features. I've modified the generation of vectors that sum to order from this post.
import itertools
import numpy as np
from scipy.special import binom
def polynomial_features_with_cross_terms(X, order):
    """
    X: numpy ndarray
        Matrix of shape, `(n_samples, n_features)`, to be transformed.
    order: integer, default 2
        Order of polynomial features to be computed.
    returns: T, powers.
        `T` is a matrix of shape, `(n_samples, n_poly_features)`.
        Note that `n_poly_features` is equal to:
            `n_features+order-1` Choose `n_features-1`
        See: https://en.wikipedia.org\
        /wiki/Stars_and_bars_%28combinatorics%29#Theorem_two
        `powers` is a matrix of shape, `(n_features, n_poly_features)`.
        Each column specifies the power by row of the respective feature,
        in the respective column of `T`.
    """
    n_samples, n_features = X.shape
    n_poly_features = int(binom(n_features+order-1, n_features-1))
    powers = np.zeros((n_features, n_poly_features))
    T = np.zeros((n_samples, n_poly_features), dtype=X.dtype)
    combos = itertools.combinations(range(n_features+order-1), n_features-1)
    for i,c in enumerate(combos):
        powers[:,i] = np.array([
            b-a-1 for a,b in zip((-1,)+c, c+(n_features+order-1,))
        ])
        T[:,i] = np.prod(np.power(X, powers[:,i]), axis=1)
    return T, powers
Here's some example usage:
>>> X = np.arange(-5,5).reshape(5,2)
>>> T,p = polynomial_features_with_cross_terms(X, order=3)
>>> print(X)
[[-5 -4]
 [-3 -2]
 [-1 0]
 [ 1 2]
 [ 3 4]]
>>> print(p)
[[ 0. 1. 2. 3.]
 [ 3. 2. 1. 0.]]
>>> print(T)
[[ -64 -80 -100 -125]
 [ -8 -12 -18 -27]
 [ 0 0 0 -1]
 [ 8 4 2 1]
 [ 64 48 36 27]]
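As a cross-check on the exponent vectors, here is an alternative way to enumerate them with itertools.combinations_with_replacement. This is a sketch of a different enumeration than the stars-and-bars loop above, and exponent_vectors is a hypothetical helper name; it assumes order is at least 1.

```python
from itertools import combinations_with_replacement
from math import comb

import numpy as np

def exponent_vectors(n_features, order):
    # Each multiset of `order` feature indices is one monomial; counting how
    # often each index appears gives that monomial's exponent vector.
    rows = [
        np.bincount(idx, minlength=n_features)
        for idx in combinations_with_replacement(range(n_features), order)
    ]
    return np.array(rows)

powers = exponent_vectors(n_features=2, order=3)
print(powers)

# Stars and bars: (n_features + order - 1) choose (n_features - 1) monomials
print(len(powers) == comb(2 + 3 - 1, 2 - 1))
```

Here each row (rather than each column) is one exponent vector, and every row sums to order.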
Finally, I should mention that the SVM polynomial kernel achieves a similar effect without explicitly computing the polynomial map. There are of course pros and cons to this, but I figured I should mention it for you to consider if you have not yet.

PyBrain addSample multi-dimensional array

In all of the examples it seems that addSample(input, target) is used with 1 dimensional arrays, such as:
INPUT = 5
OUTPUT = 1
input = [5, 5, 5, 5, 5]
target = [1]
ds = SequentialDataSet(5, 1)
#add data using addSample
How does one do this when the input is multi-dimensional in this way:
input = [[5, 5, 5, 5, 5], [5, 5, 5, 5, 5]]
target = [1]
How does one use addSample with such structures? I tried this:
ds = SequentialDataSet(2, 1)
ds.addSample(input, target)
and get the error message:
Could not broadcast input array from shape (2, 5) into shape 2.
Meaning the SequentialDataSet(2, 1) does not work for this structure, but SequentialDataSet((2, 5), 1) also errors. This should be easy but I cannot find the answer.
It looks like you're trying to train some sort of feed-forward network, perhaps a multi-layer perceptron? Five inputs, one or more hidden layers, and a single output layer, but that's not clear from the question, so this is a leap on my end.
Either way, your input layer should be a single array. If you have a structure or a multi-dimensional array, you'll need to collapse it and feed it in as a single set of data. So for your 5x2 suggestion you'd simply have 10 elements on the input, and you would be responsible for "parsing" your input structures consistently as they're fed into the network. For a 5x5 structure you'd have 25 inputs, etc.
In my experience, a big part of the success/challenge with ANNs is structuring the data so that the input form is normalized and represented in a way that lets the network mathematically find a pattern.
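Collapsing the structure can be sketched with NumPy as follows. The dataset calls are only mentioned in comments, since their exact use here is inferred from the question rather than tested against PyBrain.

```python
import numpy as np

sample = [[5, 5, 5, 5, 5], [5, 5, 5, 5, 5]]   # the 2x5 input from the question
target = [1]

# Flatten row by row into one vector of 10 values; use the same order
# for every sample so each position always means the same thing.
flat = np.asarray(sample).ravel()

print(flat.shape)
# The dataset would then be declared with 10 inputs instead of 2,
# e.g. SequentialDataSet(10, 1), and filled with ds.addSample(flat, target).
```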
According to the post linked beneath you should just input one array:
Pybrain multi dimensional data input
For SequentialDataSet I used this example:
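from itertools import cycle
from pybrain.datasets import SequentialDataSet
data = [(1,2), (1,3), (10,2), (2,0), (2,9), (4,3), (1,2), (10,5)]
ds = SequentialDataSet(2,2)
# Pair each sample with the one after it; cycle() keeps zip from stopping one sample early
for sample, next_sample in zip(data, cycle(data[1:])):
    ds.addSample(sample, next_sample)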
