I am trying to build my first classifier with a PyBrain neural network, along with the specialized ClassificationDataSet, and I am not sure I fully understand how it works.
So I have a pandas dataframe with 6 feature columns and 1 column for class label (Survived, just 0 or 1).
I build a dataset out of it:
ds = ClassificationDataSet(6, 1, nb_classes=2)
for i in df[['Gender', 'Pclass', 'AgeFill', 'FamilySize', 'FarePerPerson', 'Deck', 'Survived']].values:
    ds.addSample(tuple(i[:-1]), i[-1])
ds._convertToOneOfMany()
return ds
OK, I check what the dataset looks like:
for i, m in ds:
    print((i, m))
(array([ 1., 3., 2., 2., 1., 8.]), array([1, 0]))
(array([ 0., 1., 1., 2., 0., 2.]), array([0, 1]))
And I already have a problem: what do [1, 0] and [0, 1] mean? Are they just the '0' or '1' of the original 'Survived' column? How do I get back to the original values?
Later, when I have finished training my network:
net = buildNetwork(6, 6, 2, hiddenclass=TanhLayer, bias=True, outclass=SoftmaxLayer)
trainer = BackpropTrainer(net, ds)
trainer.trainEpochs(10)
I will try to activate it on another dataset (the one I actually want to classify) and I will get a pair of activation results, one for each of the 2 output neurons. But how do I tell which output neuron corresponds to which original class? This is probably obvious, but I am not able to work it out from the docs, unfortunately.
OK, it looks like PyBrain uses the position to determine which class (0, 1) or (1, 0) stands for.
To get back to the original 0 or 1 label you need the argmax() function. So, for example, if I already have a trained network and I want to validate it on the same data I used for training, I could do this:
true = 0
total = 0
for inProp, num in ds:
    out = net.activate(inProp).argmax()
    if out == num.argmax():
        true += 1
    total += 1
res = true / total
inProp will be a tuple of my input values for activation, num a tuple of the expected two-neuron output (either (0, 1) or (1, 0)), and num.argmax() will translate it back into just 0 or 1, the real output.
I might be wrong since this is a pure heuristic, but it works in my example.
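To illustrate the position-based mapping, here is a minimal NumPy sketch of my own (the one-hot rows below are made up, not taken from the dataset above):

import numpy as np

one_hot = np.array([[1, 0],    # position 0 is 'hot' -> class 0
                    [0, 1]])   # position 1 is 'hot' -> class 1
classes = one_hot.argmax(axis=1)
print(classes)  # [0 1]

The same argmax(axis=1) call maps a whole batch of two-neuron activations back to 0/1 class labels in one go.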
Related
I am trying to use sklearn.preprocessing.LabelBinarizer() to create a one-hot encoding of labels with only two classes, i.e. I only want to categorize two sets of objects. In this case, transform after fit(range(0, 2)) just returns a single-column array instead of a two-column one. This is fine on its own, but when I want to use the result in TensorFlow, the second dimension should really be 2 for dimensional consistency. Please advise how I can resolve this.
Here is the code:
from sklearn import preprocessing
lb = preprocessing.LabelBinarizer()
lb.fit(range(0, 3))
Calling lb.transform([1, 0]), the result is:
[[0 1 0]
[1 0 0]]
whereas when we change 3 to 2, i.e. lb.fit(range(0, 2)), the result would be
[[1]
[0]]
instead of
[[0 1]
[1 0]]
This creates problems in algorithms that expect arrays with a consistent number of dimensions. Is there any way to resolve this issue?
LabelBinarizer's purpose, according to the documentation, is to
Binarize labels in a one-vs-all fashion
Several regression and binary classification algorithms are available in scikit-learn. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme.
If your data has only two types of labels, then you can feed it directly to a binary classifier. Hence, one column is enough to capture the two classes in a one-vs-rest fashion.
Binary targets transform to a column vector
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit_transform(['yes', 'no', 'no', 'yes'])
array([[1],
[0],
[0],
[1]])
If your intention is just to create a one-hot encoding, use the following method.
>>> from sklearn.preprocessing import OneHotEncoder
>>> enc = OneHotEncoder()
>>> enc.fit_transform([['yes'], ['no'], ['no'], ['yes']]).toarray()
array([[0., 1.],
[1., 0.],
[1., 0.],
[0., 1.]])
I hope this clarifies your question of why sklearn's LabelBinarizer does not convert two-class data into a two-column output.
As already said in a comment, this is not an issue with the method. According to the documentation: "Binary targets transform to a column vector." You can build the array you want from the column-vector result in the case where the dimension is 2.
A direct and simple way to do this is:
import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
lb.fit(range(2))  # range(0, 2) is the same as range(2)
a = lb.transform([1, 0])
result_2d = np.array([[item[0], 0 if item[0] else 1] for item in a])
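An equivalent, arguably simpler sketch (my own suggestion, not from the answer above) stacks the 0/1 column returned by transform with its complement:

import numpy as np
from sklearn import preprocessing

lb = preprocessing.LabelBinarizer()
a = lb.fit_transform([1, 0])       # column vector: [[1], [0]]
result_2d = np.hstack([1 - a, a])  # two columns: [[0, 1], [1, 0]]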
I am trying to multiply two Gaussian distributions to obtain the posterior for GMM data. To do that, I am trying to use the .prob() function of tf.contrib.distributions.MultivariateNormalDiag, but every time I get the same error, even though I am providing the argument as float64.
I am using TensorFlow 1.8 version.
x = tf.placeholder(tf.float64, [None,2], name="input")
likelihood = tf.contrib.distributions.MultivariateNormalDiag(loc = [0., 0., 0.], scale_diag= [1., 1., 1.])
y_LL = likelihood.prob(x).eval()
TypeError: Input had dtype <dtype: 'float32'> but expected <dtype: 'float64'>.
I am confused about whether I am doing this the wrong way or something else is going on. Can someone please help me with this?
In this example, you are declaring x as tf.float64, but unless you explicitly specify a dtype, TensorFlow will auto-convert list inputs to tf.float32. You want to do something like the following (not executable code, but it demonstrates that you need to signal float64):
import numpy as np
likelihood = tf.contrib.distributions.MultivariateNormalDiag(loc=np.float64([0., 0., 0.]), scale_diag=np.float64([1., 1., 1.]))
y_LL = likelihood.prob(x).eval()
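For completeness, a fully runnable sketch along the same lines with the session/feed_dict plumbing filled in. I am assuming the placeholder's last dimension should be 3 to match the three-component loc; that shape mismatch is separate from the dtype error:

import numpy as np
import tensorflow as tf

x = tf.placeholder(tf.float64, [None, 3], name="input")
likelihood = tf.contrib.distributions.MultivariateNormalDiag(
    loc=np.float64([0., 0., 0.]),
    scale_diag=np.float64([1., 1., 1.]))
y_LL = likelihood.prob(x)

with tf.Session() as sess:
    # density of a standard 3-D normal at the origin, roughly 0.0635
    print(sess.run(y_LL, feed_dict={x: np.zeros((1, 3), dtype=np.float64)}))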
The function torch.nn.functional.softmax takes two parameters: input and dim. According to its documentation, the softmax operation is applied to all slices of input along the specified dim, and will rescale them so that the elements lie in the range (0, 1) and sum to 1.
Let input be:
input = torch.randn((3, 4, 5, 6))
Suppose I want to apply softmax so that every entry of the following sum equals 1:
sum = torch.sum(input, dim = 3) # sum has size (3, 4, 5)
How should I apply softmax?
softmax(input, dim = 0) # Way Number 0
softmax(input, dim = 1) # Way Number 1
softmax(input, dim = 2) # Way Number 2
softmax(input, dim = 3) # Way Number 3
My intuition tells me it is the last one, but I am not sure. English is not my first language, and the use of the word "along" seemed confusing to me because of that.
I am not very clear on what "along" means, so I will use an example that could clarify things: suppose we have a tensor of size (s1, s2, s3, s4), and I want the entries along the last axis to sum to 1, as in the sum above.
Steven's answer is not correct. See the snapshot below; it is actually the other way around.
Image transcribed as code:
>>> x = torch.tensor([[1,2],[3,4]],dtype=torch.float)
>>> F.softmax(x,dim=0)
tensor([[0.1192, 0.1192],
[0.8808, 0.8808]])
>>> F.softmax(x,dim=1)
tensor([[0.2689, 0.7311],
[0.2689, 0.7311]])
The easiest way I can think of to make this clear: say you are given a tensor of shape (s1, s2, s3, s4) and, as you mentioned, you want the sum of all the entries along the last axis to be 1.
sum = torch.sum(input, dim = 3) # input is of shape (s1, s2, s3, s4)
Then you should call the softmax as:
softmax(input, dim = 3)
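For verification, a minimal sketch of my own on a randomly generated tensor of the shape from the question:

import torch
import torch.nn.functional as F

input = torch.randn(3, 4, 5, 6)
out = F.softmax(input, dim=3)
print(torch.sum(out, dim=3))  # a (3, 4, 5) tensor of (approximately) ones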
An easy way to understand this is to consider the 4D tensor of shape (s1, s2, s3, s4) as a 2D tensor or matrix of shape (s1*s2*s3, s4). Now, if you want the values of that matrix to sum to 1 along axis 0 (down each column) or along axis 1 (across each row), you can simply call the softmax function on the 2D tensor as follows:
softmax(input, dim = 0) # normalizes values along axis 0
softmax(input, dim = 1) # normalizes values along axis 1
You can see the example that Steven mentioned in his answer.
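As a hedged aside, the (s1*s2*s3, s4) intuition can be checked directly with a small sketch of my own:

import torch
import torch.nn.functional as F

t = torch.randn(2, 3, 4, 5)
flat = t.reshape(-1, 5)  # view the 4D tensor as a (2*3*4, 5) matrix
print(torch.allclose(F.softmax(t, dim=3).reshape(-1, 5),
                     F.softmax(flat, dim=1)))  # True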
Let's consider the example in two dimensions
x = [[1,2],
[3,4]]
do you want your final result to be
y = [[0.27,0.73],
[0.27,0.73]]
or
y = [[0.12,0.12],
[0.88,0.88]]
If it's the first option then you want dim = 1. If it's the second option you want dim = 0.
Notice that in the second example it is the columns, i.e. the zeroth dimension, that are normalized: each column sums to 1, so the normalization runs along the zeroth dimension.
Updated 2018-07-10: to reflect that the zeroth dimension refers to the columns in PyTorch.
I am not 100% sure what your question means, but I think your confusion is simply that you don't understand what the dim parameter means. So I will explain it and provide examples.
If we have:
m0 = nn.Softmax(dim=0)
what that means is that m0 will normalize elements along the zeroth coordinate of the tensor it receives. Formally, given a tensor b of size (d0, d1), the following holds for the output:
sum_{i0=0}^{d0-1} m0(b)[i0, i1] = 1, for all i1 in {0, ..., d1-1}
you can easily check this with a Pytorch example:
>>> b = torch.arange(0,4,1.0).view(-1,2)
>>> b
tensor([[0., 1.],
[2., 3.]])
>>> m0 = nn.Softmax(dim=0)
>>> b0 = m0(b)
>>> b0
tensor([[0.1192, 0.1192],
[0.8808, 0.8808]])
now, since dim=0 means going through i0 in {0, 1} (i.e. going through the rows), if we choose any column i1 and sum its elements (i.e. sum over the rows) then we should get 1. Check it:
>>> b0[:,0].sum()
tensor(1.0000)
>>> b0[:,1].sum()
tensor(1.0000)
as expected.
Note that we can verify every column sums to 1 by "summing out the rows" with torch.sum(b0, dim=0), check it out:
>>> torch.sum(b0,0)
tensor([1.0000, 1.0000])
We can create a more complicated example to make sure it's really clear.
a = torch.arange(0,24,1.0).view(-1,3,4)
>>> a
tensor([[[ 0., 1., 2., 3.],
[ 4., 5., 6., 7.],
[ 8., 9., 10., 11.]],
[[12., 13., 14., 15.],
[16., 17., 18., 19.],
[20., 21., 22., 23.]]])
>>> a0 = m0(a)
>>> a0[:,0,0].sum()
tensor(1.0000)
>>> a0[:,1,0].sum()
tensor(1.0000)
>>> a0[:,2,0].sum()
tensor(1.0000)
>>> a0[:,1,1].sum()
tensor(1.0000)
>>> a0[:,2,3].sum()
tensor(1.0000)
so, as we expected, if we sum all the elements along the first coordinate, from the first value to the last value, we get 1. So everything is normalized along the first dimension (or first coordinate, i0).
>>> torch.sum(a0,0)
tensor([[1.0000, 1.0000, 1.0000, 1.0000],
[1.0000, 1.0000, 1.0000, 1.0000],
[1.0000, 1.0000, 1.0000, 1.0000]])
Also, "along dimension 0" means that you vary the coordinate of that dimension and consider each element. It is sort of like having a for loop going through the values the first coordinate can take, i.e.
for i0 in range(0, d0):
    a[i0, b, c, d]
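To tie this back to softmax, here is a small sketch of my own that recomputes nn.Softmax(dim=0) by hand, normalizing over the i0 coordinate:

import torch
import torch.nn as nn

a = torch.arange(0, 24, 1.0).view(-1, 3, 4)
m0 = nn.Softmax(dim=0)
manual = torch.exp(a) / torch.exp(a).sum(dim=0, keepdim=True)  # normalize over i0
print(torch.allclose(m0(a), manual))  # True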
import torch
import torch.nn.functional as F

x = torch.tensor([[1, 2], [3, 4]], dtype=torch.float)

s1 = F.softmax(x, dim=0)
# tensor([[0.1192, 0.1192],
#         [0.8808, 0.8808]])

s2 = F.softmax(x, dim=1)
# tensor([[0.2689, 0.7311],
#         [0.2689, 0.7311]])

torch.sum(s1, dim=0)
# tensor([1., 1.])

torch.sum(s2, dim=1)
# tensor([1., 1.])
Think of what softmax is trying to achieve: it outputs the probability of one outcome against the other. Say you are trying to predict between two outcomes: is it A or is it B? If p(A) is greater than p(B), then the next step is to convert the outcome into a Boolean (i.e. the outcome would be A if p(A) > 50%, or B if p(B) > 50%). Since we are dealing with probabilities, they should add up to 1.
Therefore what you want is for the probabilities of each row to sum to 1, so you specify dim=1, i.e. a row-wise sum.
On the other hand, if your model is designed to predict more than two classes, the output tensor will look something like [p(a), p(b), p(c), ..., p(i)].
What matters here is that p(a) + p(b) + p(c) +...p(i) = 1
then you would use dim = 0
It all depends on how you define your output layer.
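As an illustrative sketch of my own: for the common case of a batched output of shape (batch_size, num_classes), dim=1 normalizes across the classes for each sample (when the output is a single 1-D vector of class scores, dim=0 plays the same role):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 3)        # hypothetical batch of 8 samples, 3 classes
probs = F.softmax(logits, dim=1)  # normalize across the class dimension
print(probs.sum(dim=1))           # each of the 8 sums is (approximately) 1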
My dataset contains one numerical feature and one categorical feature. It only has 20 observations (for the purpose of this question).
X is a numpy array of shape (20,1) and is like:
array([[10],
[465],
[3556],
[899],
[90],
....]]
encoded_x is a numpy array of shape (20,4) and is like:
array([[ 0., 1., 0., 0.],
[ 1., 0., 0., 0.],
[ 0., 0., 1., 0.],
[ 0., 0., 1., 0.],
...................]]
Question: Now, how can I merge these arrays to give them as input to XGBoost?
What should the final array look like?
My understanding is that numerical features should not be encoded, which is why I have two distinct arrays.
The XGBoost approach is a bit different from, say, that of neural networks. It requires you to have one numerical matrix for the input, and this makes you think differently about what a feature is.
From your point of view, there are 2 features: one categorical and one numerical. But XGBoost sees 5 features, 4 of which, for some reason, take just two values: 0 or 1. XGBoost doesn't know about one-hot encoding, it sees only numbers.
As a result, no matter how you encode your categorical feature (ordinal or one-hot), you should just concatenate all of the resulting arrays into a single 2D array and fit the model on it.
import numpy as np

x1 = np.arange(20).reshape([-1, 1])          # numerical feature
x2 = np.random.randint(0, 2, size=[20, 4])   # not one-hot, but still ok for XGBoost
x = np.concatenate([x1, x2], axis=1)         # now it's 5 XGBoost features
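As a follow-up usage sketch (my own, with hypothetical random labels y; it assumes the xgboost package and its scikit-learn wrapper):

import numpy as np
import xgboost as xgb

y = np.random.randint(0, 2, size=20)      # hypothetical binary target
model = xgb.XGBClassifier(n_estimators=10)
model.fit(x, y)                           # x is the concatenated (20, 5) matrix above
print(model.predict(x[:3]))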
I would like to implement a KNeighborsClassifier with the scikit-learn module (http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html).
From my images I retrieve the solidity, elongation and Hu moments features.
How can I prepare these data for training and validation?
Must I create a list with the 3 features [Hm, e, s] for every object I retrieved from my images (one image can contain several objects)?
I read this example(http://scikit-learn.org/dev/modules/generated/sklearn.neighbors.KNeighborsClassifier.html):
X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))
print(neigh.predict_proba([[0.9]]))
Are X and y 2 features?
samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
from sklearn.neighbors import NearestNeighbors
neigh = NearestNeighbors(n_neighbors=1)
neigh.fit(samples)
print(neigh.kneighbors([1., 1., 1.]))
Why does the first example use X and y, while this one uses samples?
Your first segment of code defines a classifier on 1d data.
X represents the feature vectors.
[0] is the feature vector of the first data example
[1] is the feature vector of the second data example
....
[[0],[1],[2],[3]] is a list of all data examples,
each example has only 1 feature.
y represents the labels.
The idea, originally illustrated with a graph of the points on a number line:
Green nodes are data with label 0
Red nodes are data with label 1
Grey nodes are data with unknown labels.
print(neigh.predict([[1.1]]))
This is asking the classifier to predict a label for x=1.1.
print(neigh.predict_proba([[0.9]]))
This is asking the classifier to give membership probability estimate for each label.
Since both grey nodes are located closer to the green ones, the outputs below make sense.
[0] # green label
[[ 0.66666667 0.33333333]] # green label has greater probability
The second segment of code is actually explained well in the scikit-learn documentation:
In the following example, we construct a NeighborsClassifier class from an array representing our data set and ask who’s the closest point to [1,1,1]
>>> samples = [[0., 0., 0.], [0., .5, 0.], [1., 1., .5]]
>>> from sklearn.neighbors import NearestNeighbors
>>> neigh = NearestNeighbors(n_neighbors=1)
>>> neigh.fit(samples)
NearestNeighbors(algorithm='auto', leaf_size=30, ...)
>>> print(neigh.kneighbors([1., 1., 1.]))
(array([[ 0.5]]), array([[2]]...))
There is no target value here because this is only a NearestNeighbors class; it is not a classifier, hence no labels are needed.
For your own problem:
Since you need a classifier, you should use KNeighborsClassifier if you want the KNN approach. You might want to construct your feature vectors X and labels y as below:
X = [[h1, e1, s1],
     [h2, e2, s2],
     ...
    ]
y = [label1, label2, ...]
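A minimal end-to-end sketch with made-up numbers (the feature values and labels below are hypothetical, only there to show the shapes):

from sklearn.neighbors import KNeighborsClassifier

# one row per detected object: [Hu moment, elongation, solidity]
X = [[0.91, 1.2, 0.62],
     [0.45, 3.1, 0.35],
     [0.88, 1.1, 0.60],
     [0.40, 2.9, 0.33]]
y = [0, 1, 0, 1]  # one class label per object

clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X, y)
print(clf.predict([[0.90, 1.3, 0.61]]))        # e.g. [0]
print(clf.predict_proba([[0.90, 1.3, 0.61]]))  # per-class membership probabilities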