Why can't I predict new data using SVM and KNN? - python

I'm new to machine learning and I just learned KNN and SVM with sklearn. How do I make a prediction for new data using SVM or KNN? I have tried both to make prediction. They make good prediction only when the data is already known. But when I try to predict new data, they give an incorrect prediction.
Here is my code:
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
print(clf.predict([[20]]))
print(clf.score(x, y))
0utput:
[12.]
1.0
This code will make a good prediction as long as the data to predict is within the range x_train. But when I try to predict for example 20, or anything above the range x_train, the output will always be 12 which is the last element of y. I don't know what I do wrong in the code.

The code is behaving as mathematically described by a support vector machine.
You must understand how your data are being interpreted by the algorithm. You have 11 data points, and you are giving each one a different class. The SVM ends up basically dividing the number line into 11 segments (for the 11 classes you defined):
data = [(x, clf.predict([[x]])[0]) for x in np.linspace(1, 20, 300)]
plt.scatter([p[0] for p in data], [p[1] for p in data])
plt.show()
The answer by AILearning tells you how to fit your given toy problem, but make sure you also understand why your code wasn't doing what you thought it was. For any finite set of examples there are infinitely many functions that fit the data. Your fundamental issue is you are confusing regression and classification. From the sounds of it, you want a simple regression model to extrapolate a fit function from the data points, but your code is for a classification model.

You have to use a regression model rather than a classification model. For svm based regression use svm.SVR()
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
output:
[50.12]
0.9996

Related

lin_reg.coef_ function returns coefficients that are too high when using StandardScaler to scale the training data

I'm using the StandardScalar() and lin_reg.coef_ function in the following context:
for i in range(100):
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=i)
scaler = StandardScaler().fit(x_train)
x_train = scaler.transform(x_train)
x_test = scaler.transform(x_test)
lin_reg = LinearRegression().fit(x_train, y_train)
if i == 0:
print(lin_reg.coef_)
if i == 1:
print(lin_reg.coef_)
This leads to the following output:
Code Output
So, as have been expected, the coef_ function returns the coefficients for the 22 different features I am passing into the linear regression. However, for the second output, some of the coefficients are way too large (e.g. 1.61e+14). I am pretty sure that the scaling with StandardScaler() works as it should be. However, if I do not scale the training data before applying the coef_ function, I do not get these high coefficients. One important thing that I should mention is that the last 13 features are binary features, whereas the first 9 features are continuous (such as age). I can imagine that the problem is somehow related to this fact, although, for the first binary feature, the coefficients are properly computed (just the last 12 binary features have too large coefficients).
You should use Standardization when the data come from a Gaussian distribution. Using StandardScal() on binary data doesn't make any sense.
You should scale only the first 9 nine variables, and then pass them all in the linear regression.
https://www.atoti.io/when-to-perform-a-feature-scaling/
Avoid scaling binary columns in sci-kit learn StandsardScaler

Python: statsmodels - what does .predict(X) actually predict?

I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import statsmodels.api as sm
# Step 1) Load data into dataframe
df = pd.read_csv('my_data.csv')
# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df["dependent_variable"]
# Step 3) using OLS -fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make predictions
predictions
I'm not sure what predictions is showing? Is it predicting the next x amount of rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model from your data, which is most likely interpreted as an array. The predict method will returns an array of fitted values given the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
Similar to the sk-learn. After model = sm.OLS(y, X).fit(), you will have a model, then predictions = model.predict(X) is not predict next x amount of rows, it will predict from your X, the training dataset. The model using ordinary least squares will be a function of "x" and the output should be:
$$ \hat{y}=f(x) $$
If you want to predict the new X, you need to split X into training and testing dataset.
Actually you are doing it wrong
The predict method is use to predict next values
After separating dependent and I dependent values
You can split the data in two part train and test
From sklearn.model_selection import train_test_split
X_train,X_test,Y_train,Y_test = train_test_split(X,Y,0.2)
This will make X_train ur 80% of total data with only independent variable
And you can put your y_test in predict method to check how well the model is performing

Multiclass classification using Gaussian Mixture Models with scikit learn

I am trying to use sklearn.mixture.GaussianMixture for classification of pixels in an hyper-spectral image. There are 15 classes (1-15). I tried using the method http://scikit-learn.org/stable/auto_examples/mixture/plot_gmm_covariances.html. In here the mean is initialize with means_init,I also tried this but my accuracy is poor (about 10%). I also tried to change type of covariance, threshold, maximum iterations and number of initialization but the results are same.
Am I doing correct? Please provide inputs.
import numpy as np
from sklearn.mixture import GaussianMixture
import scipy.io as sio
from sklearn.model_selection import train_test_split
uh_data =sio.loadmat('/Net/hico/data/users/nikhil/contest_uh_casi.mat')
data = uh_data['contest_uh_casi']
uh_labels = sio.loadmat('/Net/hico/data/users/nikhil/contest_gt_tr.mat')
labels = uh_labels['contest_gt_tr']
reshaped_data = np.reshape(data,(data.shape[0]*data.shape[1],data.shape[2]))
print 'reshaped data :',reshaped_data.shape
reshaped_label = np.reshape(labels,(labels.shape[0]*labels.shape[1],-1))
print 'reshaped label :',reshaped_label.shape
con_data = np.hstack((reshaped_data,reshaped_label))
pre_data = con_data[con_data[:,144] > 0]
total_data = pre_data[:,0:144]
total_label = pre_data[:,144]
train_data, test_data, train_label, test_label = train_test_split(total_data, total_label, test_size=0.30, random_state=42)
classifier = GaussianMixture(n_components = 15 ,covariance_type='diag',max_iter=100,random_state = 42,tol=0.1,n_init = 1)
classifier.means_init = np.array([train_data[train_label == i].mean(axis=0)
for i in range(1,16)])
classifier.fit(train_data)
pred_lab_train = classifier.predict(train_data)
train_accuracy = np.mean(pred_lab_train.ravel() == train_label.ravel())*100
print 'train accuracy:',train_accuracy
pred_lab_test = classifier.predict(test_data)
test_accuracy = np.mean(pred_lab_test.ravel()==test_label.ravel())*100
print 'test accuracy:',test_accuracy
My data has 66485 pixels and 144 features each. I also tried to do after applying some feature reduction techniques like PCA, LDA, KPCA etc, but the results are still the same.
Gaussian Mixture is not a classifier. It is a density estimation method, and expecting that its components will magically align with your classes is not a good idea. You should try out actual supervised techniques, since you clearly do have access to labels. Scikit-learn offers lots of these, including Random Forest, KNN, SVM, ... pick your favourite. GMM simply tries to fit mixture of Gaussians into your data, but there is nothing forcing it to place them according to the labeling (which is not even provided in the fit call). From time to time this will work - but only for trivial problems, where classes are so well separated that even Naive Bayes would work, in general however it is simply invalid tool for the problem.
GMM is not a classifier, but generative model. You can use it to a classification problem by applying Bayes theorem. It's not true that classification based on GMM works only for trivial problems. However it's based on mixture of Gauss components, so fits the best problems with high level features.
Your code incorrectly use GMM as classifier. You should use GMM as a posterior distribution, one GMM per each class.

All zeros when using OneVsRestClassifier

I am trying to use OneCsRestClassifier on my data set. I extracted the features on which model will be trained and fitted Linear SVC on it. After model fitting, when I try to predict on the same data on which the model was fitted, I get all zeros. Is it because of some implementation issues or because my feature extraction is not good enough. I think since I am predicting on the same data on which my model was fitted I should get 100% accuracy. But instead my model predicts all zeros. Here is my code-
#arrFinal contains all the features and the labels. Last 16 columns are labels and features are from 1 to 521. 17th column from the last is not taken
X=np.array(arrFinal[:,1:-17])
X=X.astype(float)
Xtest=np.array(X)
Y=np.array(arrFinal[:,522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear'))
clf.fit(X, Y)
ans=clf.predict(Xtest)
print(ans)
print("\n\n\n")
Is there something wrong with my implementation of OneVsRestClassifier?
After looking at your data, it appears the values may be too small for the C value. Try using a sklearn.preprocessing.StandardScaler.
X=np.array(arrFinal[:,1:-17])
X=X.astype(float)
scaler = StandardScaler()
X = scaler.fit_transform(X)
Xtest=np.array(X)
Y=np.array(arrFinal[:,522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear', C=100))
clf.fit(X, Y)
ans=clf.predict(Xtest)
print(ans)
print("\n\n\n")
From here, you should look at parameter tuning on the C using cross validation. Either with a learning curve or using a grid search.

SVR prediction syntax explanation

I am working on a school project where we are using SVR to predict the next value of a series of values (e.g. stock values). We found some example code on scikit (Python) which we aren't able to understand the syntax for.
Could someone help us decipher this?
X = np.sort(5 * np.random.rand(40, 1), axis=0)
Y = np.sin(X).ravel()
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
y_rbf = svr_rbf.fit(X, Y).predict(X)
I understand the first 4 lines of this code... my issue is more with the y_rbf line... how exactly does this work? Are we doing a curve fit based on the training set and then predicting based on the same input vector?
I am not sure what the syntax means. Any help is appreciated.
Thank you.
The last line can be broken up into:
svr_rbf.fit(X, Y) # 1
y_rbf = svr_rbf.predict(X) # 2
You build a model of how the output y depends on X. According to the documentation you:
Fit the SVM model according to the given training data.
Here you use the model you built previously to predict the values (y) for each point. As the documention puts it:
Perform regression on samples in X.
This is fine for an experiment, but just to make you aware: in general you'll want to test your model on data other than what was used to fit the model to avoid overfitting. You can read about cross-validation if you are not familiar with it.

Categories