SVR prediction syntax explanation - python

I am working on a school project where we are using SVR to predict the next value of a series of values (e.g. stock values). We found some example code for scikit-learn (Python) whose syntax we aren't able to understand.
Could someone help us decipher this?
X = np.sort(5 * np.random.rand(40, 1), axis=0)
Y = np.sin(X).ravel()
from sklearn.svm import SVR
svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)
y_rbf = svr_rbf.fit(X, Y).predict(X)
I understand the first 4 lines of this code... my issue is more with the y_rbf line... how exactly does this work? Are we doing a curve fit based on the training set and then predicting based on the same input vector?
I am not sure what the syntax means. Any help is appreciated.
Thank you.

The last line can be broken up into:
svr_rbf.fit(X, Y) # 1
y_rbf = svr_rbf.predict(X) # 2
In step 1 you build a model of how the output y depends on X. According to the documentation, you:
Fit the SVM model according to the given training data.
In step 2 you use the model you built in step 1 to predict the values (y) for each point. As the documentation puts it:
Perform regression on samples in X.
This is fine for an experiment, but just to make you aware: in general you'll want to test your model on data other than what was used to fit the model to avoid overfitting. You can read about cross-validation if you are not familiar with it.
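To make the point about held-out data concrete, here is a minimal sketch that fits the same SVR on one part of the data and predicts on the rest. The 75/25 split and the fixed random seeds are my own assumptions, not part of the original example. It also shows why the chained call works: fit returns the estimator itself, so calling predict on its result is the same as calling it on svr_rbf.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

# Same toy data as in the question, with a fixed seed (assumption) for repeatability
rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(40, 1), axis=0)
Y = np.sin(X).ravel()

# Hold out 25% of the points (assumed split size) for evaluation
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.25, random_state=0
)

svr_rbf = SVR(kernel='rbf', C=1e3, gamma=0.1)

# fit() returns the estimator itself, which is why fit(...).predict(...) can be chained
fitted = svr_rbf.fit(X_train, Y_train)

# Predict on points the model has not seen during fitting
y_pred = svr_rbf.predict(X_test)

# R^2 on the held-out points
test_r2 = svr_rbf.score(X_test, Y_test)
print("held-out R^2:", test_r2)
```

A large gap between the score on the training points and the score on the held-out points is the usual sign of overfitting.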

Related

Random Forest Regressor Feature Importance all zero

I'm running a random forest regressor using scikit learn, but all the predictions end up being the same.
I realized that when I fit the data, all the feature importances are zero, which is probably why all the predictions are the same.
This is the code that I'm using:
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
import pandas as pd
merged_df = pd.read_csv("/home/jovyan/efs/vliu/combined_data.csv")
target = merged_df["400kmDensity"]
merged_df.drop("400kmDensity", axis = 1, inplace = True)
features_list = list(merged_df.columns)
#Set training and testing groups
train_features, test_features, train_target, test_target = train_test_split(merged_df, target, random_state = 16)
#Train model
rf = RandomForestRegressor(n_estimators = 150, random_state = 16)
ran = rf.fit(train_features, train_target)
print("Feature importances: ", rf.feature_importances_)
#Make predictions and calculate error
predictions = ran.predict(test_features)
print("Predictions: ", predictions)
Here's a link to the data file:
https://drive.google.com/file/d/1ECgKAH82wxIvt2OCv4W5ir1te_Vr3N_r/view?usp=sharing
If anybody can see what I did wrong before fitting the data that would result in the feature importances all being zero, that would be much appreciated.
Both your variables "400kmDensity" and "410kmDensity" share a correlation coefficient of >0.99:
np.corrcoef(merged_df["400kmDensity"], merged_df["410kmDensity"])
This practically means that you can predict "400kmDensity" almost exclusively with "410kmDensity". On a scatter plot they form an almost perfect line.
In order to actually explore what affects the values of "400kmDensity", you should exclude "410kmDensity" as a regressor (an explanatory variable). The feature importance can help to identify explanatory variables afterward.
Note that feature importance may not be a perfect metric to determine actual feature importance. Maybe you want to take a look into other available methods like Boruta Algorithm/Permutation Importance/...
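As a rough illustration of permutation importance, here is a small sketch using sklearn.inspection.permutation_importance. The data is synthetic and made up for this example (three random features, of which only the first drives the target); it is not the asker's dataset, and the number of trees and repeats are arbitrary choices.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): only feature 0 actually influences the target
rng = np.random.RandomState(16)
X = rng.rand(200, 3)
y = 3.0 * X[:, 0] + 0.1 * rng.randn(200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=16)

rf = RandomForestRegressor(n_estimators=50, random_state=16)
rf.fit(X_train, y_train)

# Shuffle each feature on held-out data and measure how much the score drops
result = permutation_importance(rf, X_test, y_test, n_repeats=5, random_state=16)
print("Permutation importances:", result.importances_mean)
```

Because the importances are measured on held-out data, a feature only scores highly if shuffling it really hurts predictions.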
In regard to the initial question: I'm not really sure why, but RandomForestRegressor seems to have a problem with the very small values of your target variable. I was able to get feature importances after I scaled train_target and train_features in rf.fit(). However, scaling should not normally be necessary in order to apply Random Forest! You may want to take a look at the respective documentation or extend your search in this direction. Hope this serves as a hint.
from sklearn.preprocessing import scale
fitted_rf = rf.fit(scale(train_features), scale(train_target))
As mentioned before, the feature importances after this change unsurprisingly look like this:
Also, the column "second" holds only the value zero, which does not explain anything! Your first step should always be EDA (Exploratory Data Analysis) to get a feeling for the data, like checking correlations between columns or generating histograms in order to explore data distributions [...].
There is much more to it, but I hope this gives you a leg-up!
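A couple of the EDA checks mentioned above could look roughly like this. The DataFrame below is a made-up stand-in (the column names "target", "near_copy" and "noise" are hypothetical), and the 0.99 threshold is just an assumption taken from the correlation discussed earlier.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: one near-duplicate of the target, one noise
# column, and one constant column like "second" in the question
rng = np.random.RandomState(0)
base = rng.rand(100)
df = pd.DataFrame({
    "target": base,
    "near_copy": base + 0.001 * rng.randn(100),
    "noise": rng.rand(100),
    "second": np.zeros(100),
})

# Columns with a single unique value carry no information
constant_cols = [c for c in df.columns if df[c].nunique() <= 1]

# Correlation of every remaining column with the target
corr_with_target = (
    df.drop(columns=constant_cols).corr()["target"].drop("target")
)

# Columns that are almost a copy of the target (assumed threshold: 0.99)
suspicious = corr_with_target[corr_with_target.abs() > 0.99].index.tolist()

print("constant columns:", constant_cols)
print("near-duplicates of the target:", suspicious)
```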

Python: statsmodels - what does .predict(X) actually predict?

I'm a bit confused as to what the line model.predict(X) actually predicts. I can't find anything on it with a Google search.
import pandas as pd
import statsmodels.api as sm
# Step 1) Load data into dataframe
df = pd.read_csv('my_data.csv')
# Step 2) Separate dependent and independent variables
X = df['independent_variable']
y = df["dependent_variable"]
# Step 3) using OLS -fit a linear regression
model = sm.OLS(y, X).fit()
predictions = model.predict(X) # make predictions
predictions
I'm not sure what predictions is showing? Is it predicting the next x amount of rows or something? Aren't I just passing in my independent variables?
You are fitting an OLS model from your data, which is most likely interpreted as an array. The predict method will return an array of fitted values given the trained model.
In other words, from statsmodels documentation:
Return linear predicted values from a design matrix.
This is similar to scikit-learn. After model = sm.OLS(y, X).fit() you have a fitted model; predictions = model.predict(X) then does not predict the next x rows, it predicts from your X, i.e. the training dataset. The model fitted by ordinary least squares is a function of "x", and the output is:
$$ \hat{y}=f(x) $$
If you want to predict for new X values, you need to split X into training and testing datasets, fit on the training part, and call predict on the testing part.
Actually, you are doing it wrong.
The predict method is meant to be used to predict values for new data.
After separating the dependent and independent variables,
you can split the data into two parts, train and test:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
This makes X_train 80% of the total data, containing only the independent variable.
You can then fit the model on X_train and y_train, pass X_test to the predict method, and compare the result with y_test to check how well the model is performing.

Why can't I predict new data using SVM and KNN?

I'm new to machine learning and I just learned KNN and SVM with sklearn. How do I make a prediction for new data using SVM or KNN? I have tried both. They make good predictions only when the data is already known, but when I try to predict new data, they give incorrect predictions.
Here is my code:
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVC(kernel='linear')
clf.fit(x, y)
print(clf.predict([[20]]))
print(clf.score(x, y))
Output:
[12.]
1.0
This code will make a good prediction as long as the data to predict is within the range x_train. But when I try to predict for example 20, or anything above the range x_train, the output will always be 12 which is the last element of y. I don't know what I do wrong in the code.
The code is behaving as mathematically described by a support vector machine.
You must understand how your data are being interpreted by the algorithm. You have 11 data points, and you are giving each one a different class. The SVM ends up basically dividing the number line into 11 segments (for the 11 classes you defined):
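import matplotlib.pyplot as plt
# Predict over a fine grid to see which class each region of the number line gets
data = [(x, clf.predict([[x]])[0]) for x in np.linspace(1, 20, 300)]
plt.scatter([p[0] for p in data], [p[1] for p in data])
plt.show()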
The answer by AILearning tells you how to fit your given toy problem, but make sure you also understand why your code wasn't doing what you thought it was. For any finite set of examples there are infinitely many functions that fit the data. Your fundamental issue is you are confusing regression and classification. From the sounds of it, you want a simple regression model to extrapolate a fit function from the data points, but your code is for a classification model.
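You can check the "11 classes" claim directly without a plot. This is a small sketch on the same toy data, with the labels written as integers (my own choice, so they are unambiguously class labels). The classifier stores every distinct y value as a class and can only ever return one of them.

```python
import numpy as np
from sklearn import svm

# Same toy data as in the question; labels as integers (assumption)
x = np.arange(1, 12, dtype=np.float64).reshape(-1, 1)
y = np.arange(2, 13)

clf = svm.SVC(kernel='linear')
clf.fit(x, y)

# Every distinct y value became its own class
print("classes:", clf.classes_)

# A prediction far outside the training range is still one of those classes
pred = clf.predict([[20]])[0]
print("prediction for 20:", pred)
```

A classifier has no notion of a value between or beyond its classes, which is why extrapolation needs a regression model instead.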
You have to use a regression model rather than a classification model. For svm based regression use svm.SVR()
import numpy as np
from sklearn import svm
x=np.array([[1],[2],[3],[4],[5],[6],[7],[8],[9],[10],[11]], dtype=np.float64)
y=np.array([2,3,4,5,6,7,8,9,10,11,12], dtype=np.float64)
clf = svm.SVR(kernel='linear')
clf.fit(x, y)
print(clf.predict([[50]]))
print(clf.score(x, y))
output:
[50.12]
0.9996

Feature selection with sklearn - ValueError: X has a different shape than during fitting

:) Very sorry in advance if my code looks like something a total newbie would write. Down below is a portion of my code in python. I am fiddling with sklearn and machine learning techniques.
I trained several Naive Bayes models based on different datasets and stored them in trained_models.
Prior to this step I created an object chi_squared of the SelectPercentile class using the chi2 function for feature selection. From my understanding, I should write data_feature_reduced = chi_squared.transform(some_data) and then use data_feature_reduced at training time, i.e. nb.fit(data_feature_reduced, data.target)
This is what I did, and I stored the resulting nb objects (and some other information) in the list trained_models.
I am now attempting to apply these models to a different set of data (actually from the same source, if that matters to the question).
for name, model, intra_result, dev, training_data, chi_squarer in trained_models:
    cross_results = []
    new_vect= StemmedVectorizer(ngram_range=(1, 4), stop_words='english', max_df=0.90, min_df=2)
    for data in demframes:
        data_name = data[0]
        X_test_data = new_vect.fit_transform(data[1].values.astype('U'))
        Y_test_data = data[2]
        chi_squared_test_data = chi_squarer.transform(X_test_data)
        final_results.append((name, "applied to", data[0], model.score(X_test_data,Y_test_data)))
I have to admit that I am a bit of a stranger to the feature selection part.
Here is the error that I get:
ValueError: X has a different shape than during fitting.
at line chi_squared_test_data = chi_squarer.transform(X_test_data)
I am assuming I am doing feature selection in an incorrect manner. Where did I go wrong?
Thanks to everyone for their help!
I will just paste the comment from @Vivek-Kumar that helped me solve my problem.
This error is due to this line new_vect.fit_transform(). Like your
trained models, you should use the same StemmedVectorizer which was
used at training time.
The same StemmedVectorizer object will transform X_test_data to the same shape it had during training. Currently, you are creating a different object and fitting it on the test data (fit_transform is fit followed by transform), so the shape is different. Hence the error.
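The shape issue is easy to reproduce. In this sketch I use scikit-learn's CountVectorizer as a stand-in for StemmedVectorizer (which is the asker's own class and not shown), and the documents are made up. Reusing the fitted vectorizer keeps the number of columns the same, while fitting a fresh one on the test documents does not.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up documents for illustration
train_docs = [
    "the cat sat on the mat",
    "the dog barked at the cat",
    "birds fly over the house",
]
test_docs = ["the cat flew", "a dog sat"]

# Fit the vocabulary once, on the training documents
vect = CountVectorizer()
X_train = vect.fit_transform(train_docs)

# Correct: reuse the fitted vectorizer, only transform the test documents
X_test_ok = vect.transform(test_docs)

# Incorrect: a new vectorizer learns a different vocabulary from the test documents
X_test_bad = CountVectorizer().fit_transform(test_docs)

print("train columns:", X_train.shape[1])
print("test columns (reused vectorizer):", X_test_ok.shape[1])
print("test columns (refitted vectorizer):", X_test_bad.shape[1])
```

Any later step fitted on X_train, such as the chi2 selector or the model, expects the first column count and will reject the last one.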
Why not use a pipeline to make it simple? That way you don't have to transform twice and take care of the shapes.
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
chi_squarer = SelectKBest(chi2, k=100) # change accordingly
lr = LogisticRegression() # or naive bayes
clf = Pipeline([('chi_sq', chi_squarer), ('model', lr)])
# for training:
clf.fit(training_data, targets)
# for predictions:
clf.predict(test_data)
You can also add new_vect to the pipeline.
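With the vectorizer included, the whole thing could look roughly like this. It is only a sketch: CountVectorizer stands in for the asker's StemmedVectorizer, MultinomialNB stands in for their Naive Bayes model, and the documents, labels and k=5 are made up for the example.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Made-up training and test documents with binary labels
train_docs = [
    "cheap pills buy now",
    "buy cheap watches now",
    "meeting agenda for monday",
    "monday project meeting notes",
]
train_labels = [1, 1, 0, 0]
test_docs = ["buy pills now", "project meeting on monday"]

# Vectorizer, feature selection and model are fitted together and reused together
clf = Pipeline([
    ("vect", CountVectorizer()),
    ("chi_sq", SelectKBest(chi2, k=5)),  # k is an arbitrary choice here
    ("model", MultinomialNB()),
])

# fit() fits every step on the training documents only
clf.fit(train_docs, train_labels)

# predict() only transforms with the already fitted steps, so shapes always match
predictions = clf.predict(test_docs)
print(predictions)
```

Since raw text goes in at both ends, there is no intermediate matrix whose shape you could get wrong.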

All zeros when using OneVsRestClassifier

I am trying to use OneVsRestClassifier on my data set. I extracted the features on which the model will be trained and fitted a linear SVC on them. After fitting the model, when I try to predict on the same data on which the model was fitted, I get all zeros. Is it because of some implementation issue or because my feature extraction is not good enough? I think that since I am predicting on the same data on which my model was fitted, I should get 100% accuracy. But instead my model predicts all zeros. Here is my code:
#arrFinal contains all the features and the labels. Last 16 columns are labels and features are from 1 to 521. 17th column from the last is not taken
X=np.array(arrFinal[:,1:-17])
X=X.astype(float)
Xtest=np.array(X)
Y=np.array(arrFinal[:,522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear'))
clf.fit(X, Y)
ans=clf.predict(Xtest)
print(ans)
print("\n\n\n")
Is there something wrong with my implementation of OneVsRestClassifier?
After looking at your data, it appears the values may be too small for the C value. Try using a sklearn.preprocessing.StandardScaler.
from sklearn.preprocessing import StandardScaler
X=np.array(arrFinal[:,1:-17])
X=X.astype(float)
scaler = StandardScaler()
X = scaler.fit_transform(X)
Xtest=np.array(X)
Y=np.array(arrFinal[:,522:]).astype(float)
clf = OneVsRestClassifier(SVC(kernel='linear', C=100))
clf.fit(X, Y)
ans=clf.predict(Xtest)
print(ans)
print("\n\n\n")
From here, you should look at tuning the C parameter using cross-validation, either with a learning curve or with a grid search.
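A grid search over C could be sketched like this. The data is generated with make_multilabel_classification as a stand-in for arrFinal, and the candidate C values and the 3-fold CV are arbitrary assumptions. Putting the scaler inside the pipeline means it is refitted on each training fold, and the parameter name ovr__estimator__C reaches through the OneVsRestClassifier to the inner SVC.

```python
from sklearn.datasets import make_multilabel_classification
from sklearn.model_selection import GridSearchCV
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic multilabel data (assumption), standing in for the asker's arrFinal
X, Y = make_multilabel_classification(
    n_samples=120, n_features=20, n_classes=4, random_state=0
)

# Scaling and the one-vs-rest classifier in one estimator
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("ovr", OneVsRestClassifier(SVC(kernel="linear"))),
])

# Candidate values for C (arbitrary choice for this sketch)
param_grid = {"ovr__estimator__C": [0.1, 1, 10, 100]}

# 3-fold cross-validation over the grid
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, Y)

print("best C:", search.best_params_["ovr__estimator__C"])
print("best CV score:", search.best_score_)
```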
