Support Vector Machine training using scikit-learn - python

I'm trying to train a Support Vector Machine using scikit-learn. It gives the expected output for trainingdata1, but it never gives the expected result for trainingdata2 (trainingdata2 is what I actually need). What is wrong?
from sklearn import svm

trainingdata1 = [[11.0, 2, 2, 1.235, 5.687457], [11.3, 2, 2, 7.563, 10.107477]]
#trainingdata2 = [[1.70503083, 7.531671404747827, 1.4804916998015452, 3.0767991352604387, 6.5742], [11.3, 2, 2, 7.563, 10.107477]]

clf = svm.OneClassSVM()
clf.fit(trainingdata1)

def alert(data):
    if clf.predict(data) < 0:
        print('\n\nThere is something wrong')
    else:
        print('\nCorrect')

alert([11.3, 2, 2, 7.563, 10.107477])
#alert([1.70503083, 7.531671404747827, 1.4804916998015452, 3.0767991352604387, 6.5742])

Well, I have to admit I actually hadn't heard of one-class SVMs before. As far as I understand, their goal is to determine whether test examples are similar to the previously provided examples. Now, the difference between the two cases is that the two training vectors are quite similar in the first, working example, and rather different in the other one (if we compare the numeric values of their components). Could it be that this is actually behaving as intended? Note that SVM training does not necessarily mean that every training example ends up being classified as belonging to the training class, due to generalization.
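A minimal sketch of that idea, using made-up data for illustration: OneClassSVM.predict returns +1 for points it considers similar to the training data and -1 for points it treats as outliers, and even a training point can come back as -1 because of generalization.

import numpy as np
from sklearn import svm

# Hypothetical training data: two fairly similar 5-dimensional vectors.
train = np.array([[11.0, 2, 2, 1.235, 5.687457],
                  [11.3, 2, 2, 7.563, 10.107477]])

clf = svm.OneClassSVM(gamma='auto')  # default kernel is RBF
clf.fit(train)

# predict expects a 2-D array (n_samples, n_features) and returns +1 / -1.
print(clf.predict(train))                        # the training points themselves
print(clf.predict([[1.7, 7.5, 1.5, 3.1, 6.6]]))  # a rather different vector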

Related

How to compute auc score manually without using sklearn?

I want to compute auc_score without using sklearn.
I have a csv file with 2 columns (actual, predicted probability), and I want to compute the AUC score using the numpy.trapz() function.
Here is my code:
from tqdm import tqdm

def AUC_SCORE(x):
    t = []
    f = []
    x = x.sort_values(by=["proba"], ascending=False)
    for t in tqdm(x["proba"].unique()):
        x['y_pred'] = np.where(x['proba'] >= t, 1, 0)
        tp = (x["y"]==1) & (x["y_pred"]==1).sum()
        fp = (x["y"]==0) & (x["y_pred"]==1).sum()
        tn = (x["y"]==0) & (x["y_pred"]==0).sum()
        fn = (x["y"]==1) & (x["y_pred"]==0).sum()
        tpr = tp/(fp+fn)
        fpr = fp/(tn+fp)
        t.append(tpr)
        f.append(fpr)
    return np.trapz(t, f)

e = AUC_SCORE(a)
I have around 10100 points and it takes more than an hour on Google Colab.
I didn't get my result, and I am getting errors while modifying my code.
Is there any better way to compute the AUC score without using sklearn?
The problem with your implementation seems to be here:
x=x.sort_values(by=["proba"],ascending=False)
for t in tqdm(x["proba"].unique()):
You loop over every unique probability value, but these are in the range 0-1 (presumably) and most of them are likely to be unique, which leads to a very long run. You need to translate the probabilities into labels. If you are using binary labels (which seems to be the case from your attempt), you can use the following list comprehension:
df["prediction"] = [0 if x<0.5 else 1 for x in df["proba"]]
This way you translate each probability into a label, and can then sort by prediction and use the unique values in the predictions. If you use multilabel predictions, you can extend the condition above as needed.
Regarding the performance issue (over 1 hr), try removing tqdm from the loop:
for t in tqdm(x["proba"].unique()):
so modify it to:
for t in (x["proba"].unique()):
tqdm is only used to show a progress bar for loops, and it has no effect on the calculated results.
I do not know how much it costs in performance with x["proba"].unique(), though; I only tested it against plain sequences such as ranges.
I am curious about the outcome, so please post your test results in the comments.
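For reference, here is a minimal vectorized sketch that avoids the per-threshold loop entirely. It assumes a DataFrame with columns y and proba, as in the question, and builds the ROC curve from cumulative sums over the rows sorted by proba before handing it to np.trapz. The column names and the variable a come from the question; everything else here is an assumption, not the asker's code.

import numpy as np
import pandas as pd

def auc_score(df):
    # Sort by predicted probability, highest first.
    df = df.sort_values(by=["proba"], ascending=False)
    y = df["y"].to_numpy()

    # Cumulative true positives / false positives as the threshold is
    # lowered one row at a time.
    tp = np.cumsum(y == 1)
    fp = np.cumsum(y == 0)

    tpr = tp / (y == 1).sum()   # true positive rate at each threshold
    fpr = fp / (y == 0).sum()   # false positive rate at each threshold

    # Prepend the (0, 0) point and integrate with the trapezoidal rule.
    return np.trapz(np.r_[0, tpr], np.r_[0, fpr])

# Hypothetical usage with made-up data:
a = pd.DataFrame({"y": [1, 0, 1, 1, 0], "proba": [0.9, 0.8, 0.7, 0.4, 0.2]})
print(auc_score(a))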

sci-kit learn output interpretation

When using sklearn, I sometimes have issues correctly assigning the output to the right label. When calling different methods on the result of a fit, sklearn only returns numpy arrays with no further labeling. For example, fitting a simple LDA that is trying to classify into two different groups will give me this output.
result = sklearn_lda.fit(X_train, y_train)
print "Prior probabilities are: \n", result.priors_
print "Group means are: \n", result.means_
Output
Prior probabilities are:
[0.49198397 0.50801603]
Group means are:
[[ 0.04279022 0.03389409]
[-0.03954635 -0.03132544]]
How do I know which prior probability is associated with which class label? Same with the group means. For coefficients I know that sklearn outputs them in the same order as they are put in. In this case I am a little confused.
Use result.classes_ to get the array of classes seen by the model.
The other per-class attributes (such as priors_ and means_) follow the order of this array.
classes_ is sorted (alphabetically for string labels), so if you have classes A and B, the order will be:
['A', 'B']
Please see the documentation for available attributes.
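A small sketch of how to pair the attributes up, using hypothetical class labels 'A' and 'B' and made-up data:

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Made-up two-class data for illustration.
X_train = np.array([[0.1, 0.2], [0.0, 0.1], [-0.1, -0.2], [-0.2, -0.1]])
y_train = np.array(["B", "A", "B", "A"])

result = LinearDiscriminantAnalysis().fit(X_train, y_train)

# classes_ gives the label order used by priors_, means_, etc.
for label, prior, mean in zip(result.classes_, result.priors_, result.means_):
    print(label, prior, mean)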

formatting design matrix for regression

I am given a test set without the response variable. I have already built the model and need to predict the response variable for the test set.
I am having trouble formatting the test design matrix so that it is compatible with the fitted model.
I am using the patsy library to construct the matrix.
I want to do something like this, except the code below does not work:
X = dmatrices('Response ~ var1 + var2', test, return_type='dataframe')
What is the right approach? Thanks.
If you used patsy to fit the model in the first place, then you should tell it "hey, you know how you built my first design matrix? build me another the same way":
from patsy import dmatrices, dmatrix

# Set up training data
train_Y, train_X = dmatrices("Response ~ ...", train, return_type="dataframe")
# Save patsy's record of how it built this matrix:
design_info = train_X.design_info
# Re-use it to build the test matrix
test_X = dmatrix(design_info, test, return_type="dataframe")
Alternatively, you could build a new matrix from scratch:
# Use 'dmatrix' and leave out the left-hand-side of the formula
test_X = dmatrix("~ ...", test, return_type="dataframe")
The first approach is better if you can do it. For example, suppose you have a categorical variable that you're letting patsy encode for you, and suppose that 10 categories show up in your training set but only 5 of them occur in your test set. With the first approach, patsy will remember what the 10 categories were and generate a test matrix with 10 columns (some of them all zeros). With the second approach, patsy will generate a training matrix with 10 columns and a test matrix with 5 columns, and then your model code will probably crash because the matrix isn't the shape it expects.
Another case where this matters is if you use patsy's center function to center a variable: with the first approach it will automatically remember what value it subtracted off from the training data and re-use it for the test data, which is what you want. With the second approach it will recompute the center using the test data, which can lead to you silently getting really really wrong results.
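Here is a runnable sketch of the first approach, with a made-up DataFrame in which the category 'c' appears in training but not in the test set; the column names are hypothetical, not from the question.

import pandas as pd
from patsy import dmatrices, dmatrix

train = pd.DataFrame({"Response": [1.0, 2.0, 3.0],
                      "var1": [0.5, 1.5, 2.5],
                      "group": ["a", "b", "c"]})
test = pd.DataFrame({"var1": [0.7, 1.1],
                     "group": ["a", "b"]})   # 'c' never occurs here

train_Y, train_X = dmatrices("Response ~ var1 + group", train, return_type="dataframe")

# Re-use the training design to build the test matrix: same columns,
# including the dummy column for the unseen category 'c' (all zeros).
test_X = dmatrix(train_X.design_info, test, return_type="dataframe")
print(test_X.columns.tolist())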

Python's implementation of Mutual Information

I am having some issues implementing the mutual information function that Python's machine learning libraries provide, in particular:
sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None)
(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html)
I am trying to implement the example from the Stanford NLP tutorial site:
http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2
The problem is that I keep getting different results, and I haven't figured out the reason yet.
I get the concept of mutual information and feature selection; I just don't understand how it is implemented in Python. I provide the mutual_info_score method with two arrays based on the NLP site example, but it outputs different results. The other interesting fact is that however you play around and change the numbers in those arrays, you are most likely to get the same result. Am I supposed to use another data structure specific to Python, or what is the issue behind this? If anyone has used this function successfully in the past, it would be a great help to me. Thank you for your time.
I encountered the same issue today. After a few trials I found the real reason: you take log2 if you strictly followed the NLP tutorial, but sklearn.metrics.mutual_info_score uses the natural logarithm (base e, Euler's number). I couldn't find this detail in the sklearn documentation...
I verified this by:
import numpy as np

def computeMI(x, y):
    sum_mi = 0.0
    x_value_list = np.unique(x)
    y_value_list = np.unique(y)
    Px = np.array([len(x[x == xval]) / float(len(x)) for xval in x_value_list])  # P(x)
    Py = np.array([len(y[y == yval]) / float(len(y)) for yval in y_value_list])  # P(y)
    for i in range(len(x_value_list)):
        if Px[i] == 0.:
            continue
        sy = y[x == x_value_list[i]]
        if len(sy) == 0:
            continue
        pxy = np.array([len(sy[sy == yval]) / float(len(y)) for yval in y_value_list])  # P(x,y)
        t = pxy[Py > 0.] / Py[Py > 0.] / Px[i]  # P(x,y) / (P(x)*P(y))
        sum_mi += sum(pxy[t > 0] * np.log2(t[t > 0]))  # sum( P(x,y) * log(P(x,y)/(P(x)*P(y))) )
    return sum_mi
If you change np.log2 to np.log, I think it will give you the same answer as sklearn. The only difference is that when this method returns 0, sklearn will return a number very near to 0. (And of course, use sklearn if you don't care about the log base; my piece of code is just a demo, and its performance is poor...)
FYI: 1) sklearn.metrics.mutual_info_score takes lists as well as np.array; 2) sklearn.metrics.cluster.entropy also uses log, not log2.
Edit: as for "same result", I'm not sure what you really mean. In general, the values in the vectors don't really matter; it is the "distribution" of values that matters. You care about P(X=x), P(Y=y) and P(X=x,Y=y), not the values x, y themselves.
The code below should produce the result 0.00011053558610110256:
c=np.concatenate([np.ones(49), np.zeros(27652), np.ones(141), np.zeros(774106) ])
t=np.concatenate([np.ones(49), np.ones(27652), np.zeros(141), np.zeros(774106)])
computeMI(c,t)
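As a quick sanity check on the log-base point, here is a small comparison snippet (my own addition, not part of the original answer): multiplying the log2-based value by ln 2 converts it from bits to nats, so it should roughly match sklearn's result.

import numpy as np
from sklearn.metrics import mutual_info_score

c = np.concatenate([np.ones(49), np.zeros(27652), np.ones(141), np.zeros(774106)])
t = np.concatenate([np.ones(49), np.ones(27652), np.zeros(141), np.zeros(774106)])

mi_bits = computeMI(c, t)          # log base 2 (bits), as in the NLP tutorial
mi_nats = mutual_info_score(c, t)  # natural log (nats), as used by sklearn

print(mi_bits, mi_nats, mi_bits * np.log(2))  # the last two should be about equal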

ml-py svm converges but classifying wrongly

I am trying to do a classification task with Python and an SVM.
From the collected data I extracted the feature vectors for each class and created a training set. The feature vectors are n-dimensional (39 or more dimensions). So, say for 2 classes, I have a set of 39-d feature vectors and a single array of class labels corresponding to each entry in the feature vectors. Currently, I am using mlpy and doing something like this:
import numpy as np
import mlpy
svm=mlpy.Svm('gaussian') #tried a linear kernel too but not having the convergence
instance= np.vstack((featurevector1,featurevector1))
label=np.hstack((np.ones((1,len(featurevector1),dtype=int),-1*np.ones((1,len(featurevector2),dtype=int)))
#Assigning a label(+1/-1) for each entry in instance, (+1 for entries coming from
#featurevector 1 and -1 for featurevector2
svm.compute(instance,label) #it converges and outputs 1
svm.predict(testdata) #This one says all class label are 1 only whereas I ve testing data from both classes
Am I making a mistake here? Or should I use some other library? Please help.
I don't use mlpy, but np.ones((1,len(featurevector1)) should perhaps be just np.ones(len(featurevector1)) --
print .shape of each to see the difference.
(If you have a link to public data anything like yours, could you post it please ?)
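A quick illustration of the shape difference the answer is pointing at, using a made-up length of 5:

import numpy as np

n = 5  # hypothetical number of feature vectors
print(np.ones((1, n)).shape)  # (1, 5), a 2-D row vector
print(np.ones(n).shape)       # (5,), a flat 1-D label array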
