scikit-learn output interpretation - python

When using sklearn, I sometimes have issues correctly assigning the output to the right label. When calling different methods on the result of a fit, sklearn only returns numpy arrays with no further labeling. For example, fitting a simple LDA that is trying to classify into two different groups will give me this output.
# sklearn_lda is an LDA estimator (e.g. LinearDiscriminantAnalysis)
result = sklearn_lda.fit(X_train, y_train)
print("Prior probabilities are: \n", result.priors_)
print("Group means are: \n", result.means_)
Output
Prior probabilities are:
[0.49198397 0.50801603]
Group means are:
[[ 0.04279022 0.03389409]
[-0.03954635 -0.03132544]]
How do I know which prior probability is associated with which class label? The same goes for the group means. For the coefficients, I know that sklearn outputs them in the same order as the features are passed in, but in this case I am a little confused.

Use result.classes_ to get the array of class labels seen by the model.
All other per-class attributes (priors_, means_, and so on) are in the order of this array.
classes_ is sorted, so string labels come out alphabetically. If you have classes A and B, the order will be:
['A', 'B']
Please see the documentation for the available attributes.
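For example, a minimal sketch with hypothetical toy data showing how to line the per-class attributes up against classes_:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical two-class toy data, only to illustrate the ordering
X_train = np.array([[0.1, 0.2], [0.3, 0.1], [1.1, 0.9], [1.2, 1.0]])
y_train = np.array(['B', 'A', 'A', 'B'])

result = LinearDiscriminantAnalysis().fit(X_train, y_train)

print(result.classes_)  # ['A' 'B'] -- sorted, regardless of the order in y_train
for label, prior, mean in zip(result.classes_, result.priors_, result.means_):
    print(label, prior, mean)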

Related

Is there a way to iterate through the vectors of Gensim's Word2Vec?

I'm trying to perform a simple task which requires iterations and interactions with specific vectors after loading it into gensim's Word2Vec.
Basically, given a txt file of the form:
t1 -0.11307 -0.63909 -0.35103 -0.17906 -0.12349
t2 0.54553 0.18002 -0.21666 -0.090257 -0.13754
t3 0.22159 -0.13781 -0.37934 0.39926 -0.25967
where t1 is the name of the vector and what follows are its components. I load it in using vecs = KeyedVectors.load_word2vec_format(datapath(f), binary=False).
Now, I want to iterate through the vectors I have and make a calculation; take summing up all of the vectors as an example. If this were read in using with open(f), I know I could just use .split(' ') on it, but since this is now a KeyedVectors object, I'm not sure what to do.
I've looked through the word2vec documentation, as well as used dir(KeyedVectors) but I'm still not sure if there is an attribute like KeyedVectors.vectors or something that allows me to perform this task.
Any tips/help/advice would be much appreciated!
There's a list of all words in the KeyedVectors object in its .index_to_key property. So one way to sum all the vectors would be to retrieve each by name in a list comprehension:
np.sum([vecs[key] for key in vecs.index_to_key], axis=0)
But if all you really want to do is sum the vectors, and the keys (word tokens) aren't an important part of your calculation, the set of all the raw word-vectors is available in the .vectors property as a numpy array with one vector per row. So you could also do:
np.sum(vecs.vectors, axis=0)
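Put together as a self-contained sketch (the file name vectors.txt is only an assumption; any word2vec-format text file works, and both .index_to_key and .vectors are gensim 4.x attributes):

import numpy as np
from gensim.models import KeyedVectors

# Hypothetical word2vec-format text file
vecs = KeyedVectors.load_word2vec_format("vectors.txt", binary=False)

# Option 1: look each vector up by its key
total_by_key = np.sum([vecs[key] for key in vecs.index_to_key], axis=0)

# Option 2: operate directly on the raw (n_words, n_dims) array
total = np.sum(vecs.vectors, axis=0)

assert np.allclose(total_by_key, total)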

How to compute auc score manually without using sklearn?

I want to compute the AUC score without using sklearn.
I have a csv file with 2 columns (actual, predicted probability), and I want to compute the AUC score using the numpy.trapz() function.
Here is my code:
from tqdm import tqdm
def AUC_SCORE(x):
    t=[]
    f=[]
    x=x.sort_values(by=["proba"],ascending=False)
    for t in tqdm(x["proba"].unique()):
        x['y_pred'] =np.where( x['proba']>=t,1,0)
        tp=(x["y"]==1)&(x["y_pred"]==1).sum()
        fp=(x["y"]==0)&(x["y_pred"]==1).sum()
        tn=(x["y"]==0)&(x["y_pred"]==0).sum()
        fn=(x["y"]==1)&(x["y_pred"]==0).sum()
        tpr= tp/(fp+fn)
        fpr= fp/(tn+fp)
        t.append(tpr)
        f.append(fpr)
    return np.trapz(t,f)
e=AUC_SCORE(a)
I have around 10,100 points, and it takes over an hour on Google Colab. I didn't get my result, and I am getting errors while modifying my code.
Is there any better way to compute the AUC score without using sklearn?
The problem with your implementation seems to be here:
x=x.sort_values(by=["proba"],ascending=False)
for t in tqdm(x["proba"].unique()):
You seem to go through each unique value of the probabilities, but these are in the range 0-1 (probably) and are most likely nearly all unique, which leads to a very long run. You need to translate the probability into a label. If you are using binary labels (which it seems you are, from your attempt), you can do the following list comprehension:
df["prediction"] = [0 if x<0.5 else 1 for x in df["proba"]]
This way you translate the probability into a label, and you can then sort by prediction and use the unique values in the predictions. If you use multilabel predictions, you can extend the above condition according to your needs.
Regarding the 1 hr runtime, try removing tqdm from the loop:
for t in tqdm(x["proba"].unique()):
so modify it to:
for t in (x["proba"].unique()):
tqdm is used to show a progress bar over a loop and does not affect the calculated results.
I do not know its effect on performance with x["proba"].unique(), though; it was tested against plain sequences like ranges.
I am curious to see how this works out for you; please share your test results in the comments.
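For reference, here is a sketch of the question's threshold-sweep approach with its bugs fixed: the boolean masks are parenthesized before .sum(), TPR is computed as TP/(TP+FN), and the loop variable no longer shadows the result lists. The column names y and proba follow the question's DataFrame:

import numpy as np

def auc_score(x):
    tpr_list, fpr_list = [], []
    pos = (x["y"] == 1).sum()  # total positives = TP + FN
    neg = (x["y"] == 0).sum()  # total negatives = FP + TN
    # sweep thresholds from high to low so FPR and TPR grow monotonically
    for t in np.sort(x["proba"].unique())[::-1]:
        y_pred = (x["proba"] >= t).astype(int)
        tp = ((x["y"] == 1) & (y_pred == 1)).sum()
        fp = ((x["y"] == 0) & (y_pred == 1)).sum()
        tpr_list.append(tp / pos)  # TPR = TP / (TP + FN)
        fpr_list.append(fp / neg)  # FPR = FP / (FP + TN)
    return np.trapz(tpr_list, fpr_list)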

Confused about X in GaussianHMM.fit([X])

With this code:
X = numpy.array(range(0,5))
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
I get
tuple index out of range
self.n_features = obs[0].shape[1]
So what are you supposed to pass .fit() exactly? The hidden states AND emissions in a tuple? If so in what order? The documentation is less than helpful.
I noticed it likes being passed tuples as this does not give an error:
X = numpy.column_stack([range(0,5),range(0,5)])
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
Edit:
Let me clarify a bit: the documentation indicates that the dimensionality of the array must be:
List of array-like observation sequences (shape (n_i, n_features)).
This would almost suggest that you pass a tuple for each sample indicating, in a binary fashion, which observations are present. However, their example indicates otherwise:
# pack diff and volume for training
X = np.column_stack([diff, volume])
hence the confusion
It would appear the GaussianHMM class is for multivariate-emission-only HMM problems, hence the requirement to have more than one emission vector. When the documentation refers to 'n_features', it is not referring to the number of ways emissions can express themselves but to the number of orthogonal emission vectors.
Hence, "features" (the orthogonal emission vectors) are not to be confused with "symbols", which, in sklearn's parlance (likely shared with the greater HMM community), refer to the actual unique values the system is capable of emitting.
For univariate emission-vector problems, use MultinomialHMM.
Hope that clarifies things for anyone else who wants to use this stuff without becoming the world's foremost authority on HMMs :)
I realize this is an old thread, but the problem in the example code is still there. I believe the example is now at this link and it still gives the same error:
tuple index out of range
self.n_features = obs[0].shape[1]
The offending line of code is:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit(X)
Which should be:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit([X])
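As a rough sketch, assuming a recent hmmlearn release (where GaussianHMM now lives), the shape requirement discussed in this thread looks like this; the [X] wrapping above corresponds to the older list-of-sequences API, while current versions take one concatenated array plus sequence lengths:

import numpy as np
from hmmlearn.hmm import GaussianHMM

# Even a univariate sequence must be 2-D: shape (n_samples, n_features)
X = np.arange(5, dtype=float).reshape(-1, 1)

model = GaussianHMM(n_components=3, covariance_type='full', n_iter=100)
model.fit(X)  # current hmmlearn API: one concatenated 2-D array
# For several sequences, stack them and pass their lengths:
# model.fit(np.vstack([X1, X2]), lengths=[len(X1), len(X2)])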

Support Vector Machine training using scikit-learn

I'm trying to train a Support Vector Machine using scikit-learn. It correctly gives the output for trainingdata1, but it never gives the expected result for trainingdata2 (trainingdata2 is what I actually need). What is wrong?
from sklearn import svm
trainingdata1 = [[11.0, 2, 2, 1.235, 5.687457], [11.3, 2, 2, 7.563, 10.107477]]
#trainingdata2 = [[1.70503083, 7.531671404747827, 1.4804916998015452, 3.0767991352604387, 6.5742], [11.3, 2, 2, 7.563, 10.107477]]
clf = svm.OneClassSVM()
clf.fit(trainingdata1)
def alert(data):
    if clf.predict(data) < 0:
        print('\n\nThere is something wrong')
    else:
        print('\nCorrect')
alert([11.3, 2, 2, 7.563, 10.107477])
#alert([1.70503083, 7.531671404747827, 1.4804916998015452, 3.0767991352604387, 6.5742])
Well, I have to admit I actually hadn't heard about one-class SVMs before. As far as I understand, their goal is to find out whether test examples are similar to previously provided examples. Now, the difference between the two cases is that the two training vectors are quite similar in the first, working example, and rather different in the other one (comparing the numeric values of the individual components). Could it be that this is actually behaving as intended? Note that SVM training does not necessarily mean that all training examples are classified as labelled for training, due to generalization.
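To illustrate, a small sketch (the gamma setting is an assumption, not the question's exact configuration): OneClassSVM.predict() returns +1 for points it considers similar to the training data and -1 for points it considers outliers, so how close a test vector is to the training vectors decides the sign.

import numpy as np
from sklearn import svm

# Two fairly similar training vectors, as in trainingdata1
train = np.array([[11.0, 2, 2, 1.235, 5.687457],
                  [11.3, 2, 2, 7.563, 10.107477]])
clf = svm.OneClassSVM(gamma='auto').fit(train)

# Note the 2-D input shape for predict(); the sign of the result tells you
# whether the point is treated as an inlier (+1) or an outlier (-1).
print(clf.predict([[11.3, 2, 2, 7.563, 10.107477]]))
print(clf.predict([[1.705, 7.53, 1.48, 3.08, 6.57]]))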

ml-py svm converges but classifying wrongly

I am trying to do a classification task with Python and SVM.
From the collected data I extracted the feature vectors for each class and created a training set. The feature vectors have n dimensions (39 or more). So, say for 2 classes, I have a set of 39-d feature vectors and a single array of class labels corresponding to each entry in the feature vectors. Currently, I am using mlpy and doing something like this:
import numpy as np
import mlpy
svm=mlpy.Svm('gaussian') #tried a linear kernel too but not having the convergence
instance= np.vstack((featurevector1,featurevector1))
label=np.hstack((np.ones((1,len(featurevector1),dtype=int),-1*np.ones((1,len(featurevector2),dtype=int)))
#Assigning a label(+1/-1) for each entry in instance, (+1 for entries coming from
#featurevector 1 and -1 for featurevector2
svm.compute(instance,label) #it converges and outputs 1
svm.predict(testdata) #This one says all class label are 1 only whereas I ve testing data from both classes
Am I making a mistake here? Or should I use some other library? Please help.
I don't use mlpy, but np.ones((1,len(featurevector1)) should perhaps be just np.ones(len(featurevector1)) --
print the .shape of each to see the difference.
(If you have a link to public data similar to yours, could you post it, please?)
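A quick way to see the shape difference the answer points at (with a hypothetical 10x39 feature array):

import numpy as np

featurevector1 = np.random.rand(10, 39)  # hypothetical: 10 samples, 39 features

print(np.ones((1, len(featurevector1)), dtype=int).shape)  # (1, 10) -- a 2-D row
print(np.ones(len(featurevector1), dtype=int).shape)       # (10,)   -- 1-D label array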
