I am trying to do a classification task with Python and an SVM.
From the collected data I extracted feature vectors for each class and created a training set. The feature vectors are n-dimensional (39 or more dimensions). So, for 2 classes, I have a set of 39-d feature vectors and a single array of class labels, one label per feature vector. Currently, I am using mlpy and doing something like this:
import numpy as np
import mlpy
svm = mlpy.Svm('gaussian')  # tried a linear kernel too, but it did not converge
instance = np.vstack((featurevector1, featurevector2))
label = np.hstack((np.ones((1, len(featurevector1)), dtype=int), -1 * np.ones((1, len(featurevector2)), dtype=int)))
# Assign a label (+1/-1) to each entry in instance: +1 for entries coming from
# featurevector1 and -1 for entries coming from featurevector2
svm.compute(instance, label)  # it converges and outputs 1
svm.predict(testdata)  # this says all class labels are 1, even though I have test data from both classes
Am I making a mistake here? Or should I use some other library? Please help.
I don't use mlpy, but np.ones((1, len(featurevector1))) should perhaps be just np.ones(len(featurevector1)): the first creates a 2-D array of shape (1, n), the second a 1-D array of shape (n,).
Print the .shape of each to see the difference.
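The shape difference can be checked with NumPy alone. This is a minimal sketch: the random arrays are hypothetical stand-ins for the two classes' feature matrices, and it says nothing about what mlpy itself expects.

```python
import numpy as np

# Hypothetical stand-ins for the two classes' 39-d feature matrices.
featurevector1 = np.random.rand(5, 39)
featurevector2 = np.random.rand(7, 39)

two_d = np.ones((1, len(featurevector1)), dtype=int)  # one row, five columns
one_d = np.ones(len(featurevector1), dtype=int)       # flat vector of five labels
print(two_d.shape, one_d.shape)

# Stack the samples row-wise and build one flat label per row.
instance = np.vstack((featurevector1, featurevector2))
label = np.hstack((np.ones(len(featurevector1), dtype=int),
                   -np.ones(len(featurevector2), dtype=int)))
print(instance.shape, label.shape)
```

With the 2-D version, np.hstack still returns a 2-D array with a single row, so the label array would not line up one-to-one with the rows of instance.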
(If you have a link to public data similar to yours, could you post it, please?)
I want to compute the AUC score without using sklearn.
I have a CSV file with 2 columns (actual, predicted probability), and I want to compute the AUC score using the numpy.trapz() function.
Here is my code:
from tqdm import tqdm
def AUC_SCORE(x):
    t=[]
    f=[]
    x=x.sort_values(by=["proba"],ascending=False)
    for t in tqdm(x["proba"].unique()):
        x['y_pred'] =np.where( x['proba']>=t,1,0)
        tp=(x["y"]==1)&(x["y_pred"]==1).sum()
        fp=(x["y"]==0)&(x["y_pred"]==1).sum()
        tn=(x["y"]==0)&(x["y_pred"]==0).sum()
        fn=(x["y"]==1)&(x["y_pred"]==0).sum()
        tpr= tp/(fp+fn)
        fpr= fp/(tn+fp)
        t.append(tpr)
        f.append(fpr)
    return np.trapz(t,f)
e=AUC_SCORE(a)
I have around 10,100 points, and it takes more than 1 hour on Google Colab.
I didn't get a result, and I get errors when I modify my code.
Is there a better way to compute the AUC score without using sklearn?
The problem with your implementation seems to be here:
x=x.sort_values(by=["proba"],ascending=False)
for t in tqdm(x["proba"].unique()):
You loop over each unique probability value, but these lie in the range 0-1 (probably) and almost every one is most likely distinct, so the loop runs nearly once per row and takes very long. You need to translate the probabilities into labels. If you are using binary labels (which your attempt suggests), you can use the following list comprehension:
df["prediction"] = [0 if x<0.5 else 1 for x in df["proba"]]
This way you translate each probability to a label; you can then sort by prediction and use the unique values of the predictions. If you have multilabel predictions, you can extend the condition above according to your needs.
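Separately from the speed question, the function in the question has three problems that would stop it from returning a correct value: the list t is overwritten by the loop variable t, so t.append fails; in (x["y"]==1)&(x["y_pred"]==1).sum() the .sum() applies only to the second comparison because the method call binds tighter than &; and the true positive rate should be tp/(tp+fn), not tp/(fp+fn). The following is a sketch of an alternative, not the approach described above: it keeps every threshold but replaces the loop with cumulative sums over the sorted labels. The function name auc_score and its arguments are made up for illustration, and it assumes binary 0/1 labels with at least one example of each class.

```python
import numpy as np

def auc_score(y_true, proba):
    """Area under the ROC curve via the trapezoidal rule (binary 0/1 labels)."""
    y_true = np.asarray(y_true)
    proba = np.asarray(proba, dtype=float)

    # Sort by predicted probability, highest first.
    order = np.argsort(-proba, kind="mergesort")
    y_sorted = y_true[order]
    p_sorted = proba[order]

    # Cumulative true/false positives when the threshold sits at each sorted score.
    tp = np.cumsum(y_sorted == 1)
    fp = np.cumsum(y_sorted == 0)

    # Keep one ROC point per distinct threshold (the last row of each tie group).
    last_of_group = np.r_[np.nonzero(np.diff(p_sorted))[0], len(p_sorted) - 1]
    tpr = np.r_[0.0, tp[last_of_group] / tp[-1]]
    fpr = np.r_[0.0, fp[last_of_group] / fp[-1]]

    # Trapezoidal rule written out by hand, so the sketch runs whether or not
    # np.trapz is available in the installed NumPy.
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

print(auc_score([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

With the DataFrame from the question, the call would be auc_score(a["y"], a["proba"]). The work is one sort plus a few array operations, so around 10,100 rows should not be a problem.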
As for the 1 hr run time, try removing tqdm from the loop
for t in tqdm(x["proba"].unique()):
so that it becomes:
for t in (x["proba"].unique()):
tqdm shows a progress bar for loops over ranges and has no effect on the calculated results.
However, I do not know how it affects performance with x["proba"].unique(); it was tested against direct sequences like ranges.
I am eager to know how your attempt turns out; please post your test results in the comments.
When using sklearn, I sometimes have issues correctly assigning the output to the right label. When calling different methods on the result of a fit, sklearn only returns numpy arrays with no further labeling. For example, fitting a simple LDA that is trying to classify into two different groups will give me this output.
result = sklearn_lda.fit(X_train, y_train)
print("Prior probabilities are: \n", result.priors_, sep="")
print("Group means are: \n", result.means_, sep="")
Output
Prior probabilities are:
[0.49198397 0.50801603]
Group means are:
[[ 0.04279022 0.03389409]
[-0.03954635 -0.03132544]]
How do I know which prior probability is associated with which class label? Same with the group means. For coefficients I know that sklearn outputs them in the same order as they are put in. In this case I am a little confused.
Use result.classes_ to get the array of classes seen by the model.
All other attributes will be in the order of this array.
The classes are stored in sorted order, so string labels come out alphabetically. If you have classes A and B, the order will be:
['A', 'B']
Please see the documentation for available attributes.
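A short sketch of pairing each label with its prior and group mean, assuming scikit-learn's LinearDiscriminantAnalysis as the estimator behind sklearn_lda. The data and the labels "A" and "B" are synthetic and only for illustration.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Synthetic two-class data; "B" is deliberately listed first in y_train.
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0.0, 1.0, size=(50, 2)),
                     rng.normal(3.0, 1.0, size=(30, 2))])
y_train = np.array(["B"] * 50 + ["A"] * 30)

result = LinearDiscriminantAnalysis().fit(X_train, y_train)

# classes_, priors_ and means_ share the same ordering, so zip pairs them up.
for label, prior, mean in zip(result.classes_, result.priors_, result.means_):
    print(label, prior, mean)
```

Even though "B" appears first in the training labels, classes_ lists "A" first, and the priors and means follow that order.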
I have been given a test set without the response variable. I have already built the model and need to predict the response variable for the test set.
I am having trouble formatting the test design matrix so that it is compatible with the model.
I am using the patsy library to construct the matrix.
I want to do something like this, except that the code below does not work:
X = dmatrices('Response ~ var1 + var2', test, return_type = 'dataframe')
What is the right approach? Thanks.
If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":
# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix
test_X = dmatrix(design_info, test, return_type="dataframe")
Alternatively, you could build a new matrix from scratch:
# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")
The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you. And suppose that there are 10 categories that show up in your training set, but only 5 of them occur in your test set. If you use the first approach, then patsy will remember what the 10 categories were, and generate a test matrix with 10 columns (some of them all-zeros). If you use the second approach, then patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code is probably going to crash because the matrix isn't the shape it expects.
Another case where this matters is if you use patsy's center function to center a variable: with the first approach it will automatically remember what value it subtracted off from the training data and re-use it for the test data, which is what you want. With the second approach it will recompute the center using the test data, which can lead to you silently getting really really wrong results.
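The categorical case can be sketched on a small scale. The data frames, the column names group and var1, and the formula below are hypothetical: the training set has three levels of group, the test set only two.

```python
import pandas as pd
from patsy import dmatrices, dmatrix

# Hypothetical data: "group" has levels a, b, c in training but only a, b in test.
train = pd.DataFrame({
    "Response": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "group": ["a", "b", "c", "a", "b", "c"],
    "var1": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
})
test = pd.DataFrame({"group": ["a", "b"], "var1": [0.7, 0.8]})

train_Y, train_X = dmatrices("Response ~ group + var1", train, return_type="dataframe")

# First approach: reuse the stored design, so the columns match the training matrix.
test_X_reused = dmatrix(train_X.design_info, test, return_type="dataframe")

# Second approach: rebuild from the formula, so only levels present in test appear.
test_X_fresh = dmatrix("~ group + var1", test, return_type="dataframe")

print(list(train_X.columns))
print(list(test_X_reused.columns))
print(list(test_X_fresh.columns))
```

The reused design keeps a column for level c (all zeros in the test matrix), while the rebuilt one drops it and ends up one column short.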
With this code:
X = numpy.array(range(0,5))
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
I get
tuple index out of range
self.n_features = obs[0].shape[1]
So what are you supposed to pass .fit() exactly? The hidden states AND emissions in a tuple? If so in what order? The documentation is less than helpful.
I noticed it likes being passed tuples as this does not give an error:
X = numpy.column_stack([range(0,5),range(0,5)])
model = GaussianHMM(n_components=3,covariance_type='full', n_iter=1000)
model.fit([X])
Edit:
Let me clarify a bit: the documentation indicates that the shape of the array must be:
List of array-like observation sequences (shape (n_i, n_features)).
This would almost indicate that you pass a tuple for each sample that indicates in a binary fashion which observations are present. However their example indicates otherwise:
# pack diff and volume for training
X = np.column_stack([diff, volume])
Hence the confusion.
It would appear the GaussianHMM function is for multivariate-emission-only HMM problems, hence the requirement to have >1 emission vectors. When the documentation refers to 'n_features' they are not referring to the number of ways emissions can express themselves but the number of orthogonal emission vectors.
Hence, "features" (the orthogonal emission vectors) are not to be confused with "symbols" which, in sklearn's parlance (which is likely shared with the greater hmm community for all I know), refer to what actual unique values the system is capable of emitting.
For univariate emission-vector problems, use MultinomialHMM.
Hope that clarifies things for anyone else who wants to use this stuff without becoming the world's foremost authority on HMMs :)
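The traceback in the question can be reproduced with NumPy alone, since it comes from reading shape[1] of a 1-D array. The sketch below only illustrates the (n_i, n_features) layout quoted from the documentation. It does not call any HMM class, so it does not settle which estimator suits single-feature data.

```python
import numpy as np

X_1d = np.array(range(0, 5))  # 1-D array: its shape tuple has a single entry
try:
    n_features = X_1d.shape[1]  # same lookup as obs[0].shape[1] in the traceback
except IndexError as exc:
    print("IndexError:", exc)

# Two ways to get a 2-D (n_i, n_features) layout.
X_col = X_1d.reshape(-1, 1)                          # five observations, one feature
X_two = np.column_stack([range(0, 5), range(0, 5)])  # five observations, two features
print(X_col.shape, X_two.shape)
```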
I realize this is an old thread but the problem in the example code is still there. I believe the example is now at this link and still giving the same error:
tuple index out of range
self.n_features = obs[0].shape[1]
The offending line of code is:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit(X)
Which should be:
model = GaussianHMM(n_components=5, covariance_type="diag", n_iter=1000).fit([X])
I'd like to use sklearn.mixture.GMM to fit a mixture of Gaussians to some data, with results similar to the ones I get using R's "Mclust" package.
The data looks like this:
So here's how I cluster the data using R, it gives me 14 nicely separated clusters and is easy as falling down stairs:
data <- read.table('~/gmtest/foo.csv',sep=",")
library(mclust)
D = Mclust(data,G=1:20)
summary(D)
plot(D, what="classification")
And here's what I say when I try it with python:
from sklearn import mixture
import numpy as np
import os
from matplotlib import pyplot
os.chdir(os.path.expanduser("~/gmtest"))
data = np.loadtxt(open('foo.csv',"rb"),delimiter=",",skiprows=0)
gmm = mixture.GMM( n_components=14,n_iter=5000, covariance_type='full')
gmm.fit(data)
classes = gmm.predict(data)
pyplot.scatter(data[:,0], data[:,1], c=classes)
pyplot.show()
Which assigns all points to the same cluster. I've also noticed that the AIC for the fit is lowest when I tell it to find exactly 1 cluster, and increases linearly with increasing numbers of clusters. What am I doing wrong? Are there additional parameters I need to consider?
Is there a difference in the models used by Mclust and by sklearn.mixture?
But more important: what is the best way in sklearn to cluster my data?
The trick is to set GMM's min_covar. So in this case I get good results from:
mixture.GMM( n_components=14,n_iter=5000, covariance_type='full',min_covar=0.0000001)
The large default value for min_covar assigns all points to one cluster.
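Later scikit-learn releases no longer include mixture.GMM. The sketch below assumes its replacement, GaussianMixture, whose reg_covar parameter plays a similar regularising role to min_covar, though it is not identical. The data is a synthetic stand-in for foo.csv on a small numeric scale, and the two reg_covar values are chosen only to show how the assignment can change.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic stand-in for foo.csv: three tight clusters on a small numeric scale.
rng = np.random.default_rng(0)
centers = np.array([[0.00, 0.00], [0.05, 0.00], [0.00, 0.05]])
data = np.vstack([c + 0.002 * rng.standard_normal((100, 2)) for c in centers])

# Compare a large and a small regularisation term on the covariance diagonal.
for reg in (1e-2, 1e-7):
    gmm = GaussianMixture(n_components=3, covariance_type="full",
                          reg_covar=reg, max_iter=5000, random_state=0)
    classes = gmm.fit_predict(data)
    # Number of points assigned to each component.
    print(reg, np.bincount(classes, minlength=3))
```

When the regularisation term is large relative to the spread of the data, the fitted covariances are dominated by it, which is the effect the answer attributes to the default min_covar.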