I built a simple linear regression model to predict students' final grade using this dataset: https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be large.
I'm not sure whether I'm misunderstanding what the error metrics mean or whether I made some mistakes in my code. I thought that for an accuracy of 0.92, the errors should be much smaller and closer to 0.
Here's my code:
import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model, metrics

data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
predict = "G3"  # target column: the final grade
X = np.array(data.drop([predict], axis=1))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size=0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_prediction = linear.predict(x_test)  # predictions needed for the error metrics below
linear_accuracy = round(linear.score(x_test, y_test), 5)
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))
Did I make any mistakes in the code, or do the errors make sense in this case?
The accuracy metric in sklearn's linear regression (the .score() method) is the R² metric. It essentially tells you the fraction of the variation in the dependent variable explained by the model's predictors. 0.92 is a very good score, but it does not mean that your errors will be 0. I looked at your work, and it seems that you used all the numeric variables as your predictors and your target was G3. The code seems fine, and the results seem accurate too. In regression tasks it is really hard to get errors of exactly 0. Please let me know if you have any questions. Cheers
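To illustrate the relationship, here is a minimal sketch (the arrays are made-up numbers, not the student data) of how the .score() value relates to the error metrics: R² equals 1 minus MSE divided by the variance of y_test, so even a high R² leaves errors whose size scales with the spread of the grades.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

y_test = np.array([8, 10, 12, 15, 18])    # hypothetical true grades
y_pred = np.array([9, 10, 11, 14, 19])    # hypothetical predictions

mse = mean_squared_error(y_test, y_pred)
r2 = 1 - mse / np.var(y_test)             # the formula behind .score()
print(mse, r2, r2_score(y_test, y_pred))  # the two R^2 values agree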
I'm trying to graph the mean squared error of my data, and I'm having a little difficulty figuring out just how to do it. I know you need both the "true" values and the "predicted" values in order to get the MSE, but the way my project is laid out is quite confusing.
I have a method in which I generate a model like so:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

def fit_curve(X, y, degree):
    poly_features = PolynomialFeatures(degree=degree)
    x_poly = poly_features.fit_transform(X)
    linreg = LinearRegression()
    model = linreg.fit(x_poly, y)
    return model
This returns a model that's already trained.
Then, I'm supposed to find the mean squared error for said model. I'm not sure how I'm supposed to do this since the model has already been trained without returning the predicted values.
Right now my method that calculates mse is:
from sklearn.metrics import mean_squared_error

def mse(X, y, degree, model):
    poly_features = PolynomialFeatures(degree=degree)
    linreg = LinearRegression()
    x_poly = poly_features.fit_transform(X)
    linreg.fit(x_poly, y)
    y_predict = linreg.predict(x_poly)
    mse = mean_squared_error(y, y_predict)
    return mse
I feel like a lot of the code I use in mse is very redundant compared to fit_curve. Unfortunately, the guidelines say this is the way I need to do it (with mse taking X, y, degree, and model).
I think it's also worth noting that my current mse works correctly up to about degree 13-14, where the answer it generates on the graph stops matching the solution I was given. I'm not sure why it isn't working perfectly, because I thought this was the right idea.
Things should be done in the following way:
1) Split your X and y into train and test sets. You can use train_test_split for that. You can choose your test_size (I put 0.33 as an example) and random_state (this one helps with reproducibility).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
2) Fit your model (here, a linear regression) using X_train and y_train. You have a feature-generation method (a polynomial one), which is great; use it on the training data.
poly_features = PolynomialFeatures(degree=degree)
linreg = LinearRegression()
X_train_poly = poly_features.fit_transform(X_train)
linreg.fit(X_train_poly, y_train)
3) Evaluate your fitted model by checking whether it can correctly predict on unseen data (X_test). For that, you can indeed use mean_squared_error with linreg.predict(X_test_poly) and y_test. Caution: you must apply the same transformation to X_test as you did to X_train (that's why we use poly_features.transform here, fitted on the training data, rather than fit_transform).
X_test_poly = poly_features.transform(X_test)
print(mean_squared_error(linreg.predict(X_test_poly), y_test))
Hope that helps.
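If the guidelines force the mse(X, y, degree, model) signature, one way to cut the redundancy you mention is a sketch like the following (assuming the model argument is the fitted regressor returned by fit_curve, and that the caller passes held-out data for a proper evaluation):
from sklearn.preprocessing import PolynomialFeatures
from sklearn.metrics import mean_squared_error

def mse(X, y, degree, model):
    # Rebuild only the deterministic polynomial expansion; reuse the
    # already-trained regressor instead of fitting a second one.
    poly_features = PolynomialFeatures(degree=degree)
    X_poly = poly_features.fit_transform(X)
    y_predict = model.predict(X_poly)
    return mean_squared_error(y, y_predict)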
I am trying to get something magic from the Boston dataset in sklearn. Without making any changes, I did a regression with sklearn and another with statsmodels, so I could easily get the p-value of each of the variables used. However, my results are completely different.
Here it is:
import numpy as np
import pandas as pd
import statsmodels.api as sm
from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_squared_error

boston_houses = load_boston()
boston = pd.DataFrame(data=boston_houses.data, columns=boston_houses.feature_names)
boston['MEDV'] = boston_houses.target
boston.head()
X, y = boston.drop(columns='MEDV'), boston['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
pred = lin_model.predict(X_test)
rSq = r2_score(y_test, pred)
rmse = np.sqrt(mean_squared_error(y_test, pred))
print('The R-squared for this model {}'.format(rSq))
print('The Root mean square error for this model {}'.format(rmse))
out: The R-squared for this model 0.7261570836552478
The Root mean square error for this model 4.55236459846306
###### statsmodels now ###
X_new = sm.tools.tools.add_constant(X_train)
estimator = sm.OLS(y_train, X_new)
results = estimator.fit()
print(results.summary())
I get 0.739 for the R-squared with statsmodels. Why?
If you are wondering why it is not the same as the R-squared you received from sklearn.metrics.r2_score, the reason is that you have used two different realisations of linear regression with different parameters, which produced different predictions.
If, for example, you change your test_size to 0.25 in train_test_split, you will get yet another model with a different result.
I had scored the test sample with sklearn while statsmodels used the training data; when I evaluated the sklearn model on the same data it was trained on, the results finally matched. Silly of me. I should check how the model performs on the training data before checking how it responds to the test sample; keeping those two steps separate would have helped me avoid the mistake.
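For anyone hitting the same discrepancy, here is a minimal check (a sketch using the variables defined above): the statsmodels summary reports R-squared on the data the model was fitted to (X_train), while the sklearn score was computed on X_test, so scoring the sklearn model on its own training data should reproduce the ~0.739 figure.
# Score sklearn on the training split, which is what the statsmodels
# summary's R-squared describes.
train_pred = lin_model.predict(X_train)
print('Train R-squared: {}'.format(r2_score(y_train, train_pred)))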
I wrote the following code to learn the score of machine learning methods, but I get the error below. What could be the reason?
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import f1_score

veri = pd.read_csv("deneme2.csv")
veri = veri.drop(['id'], axis=1)
y = veri[['Rating']]
x = veri.drop(['Rating','Genres'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
ytahmin = DTR.predict(x)
DTR.fit(veri[['Reviews','Size','Installs','Type','Price','Content Rating','Category_c']],veri.Rating)
basari_DTR = DTR.score(X_test,y_test)
#print("DecisionTreeRegressor: Yüzde",basari_DTR*100," oranında:" )
a = np.array([159,19000000.0,10000,0,0.0,0,0]).reshape(1, -1)
predict_DTR = DTR.predict(a)
print(f1_score(y_train, y_test, average='macro'))
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]
There are at least two issues with your code.
The first error you report
print(f1_score(y_train, y_test, average='macro'))
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]
is due to your y_train and y_test having different lengths, as already pointed out in the other answer.
But this is not the main issue here, because, even if you change y_train to y_pred, as suggested, you get a new error:
print(f1_score(y_pred, y_test, average='macro'))
Error: continuous is not supported
This is simply because you are in a regression setting, while the f1 score is a classification metric and, as such, it does not work with continuous predictions.
In other words, the f1 score is inappropriate for your (regression) problem, hence the error.
Check the list of metrics available in scikit-learn, where you can confirm that the f1 score is used only in classification, and pick another metric suitable for regression problems.
For a more detailed exposition about what happens when choosing inappropriate metrics in scikit-learn, see Accuracy Score ValueError: Can't Handle mix of binary and continuous target
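As a concrete illustration, here is a sketch of regression-appropriate metrics, reusing the y_pred from DTR.predict(X_test) as in the other answer:
from sklearn.metrics import mean_squared_error, r2_score

y_pred = DTR.predict(X_test)               # continuous predictions
print(mean_squared_error(y_test, y_pred))  # average squared error
print(r2_score(y_test, y_pred))            # fraction of variance explained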
f1_score needs to take the true y from the test set and the values you predicted on the test set, hence the last lines should be:
DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
y_pred = DTR.predict(X_test)
print(f1_score(y_pred, y_test, average='macro'))
You shouldn't call fit twice, and your predictions have to be the same length as the test set; see some basic sklearn tutorials for more info.
I am building a sentiment analysis classifier with scikit-learn. There are 3 labels: positive, neutral, and negative. The shape of my training data is (14640, 15), with the label counts:
negative 9178
neutral 3099
positive 2363
I have pre-processed the data and applied bag-of-words word vectorization to the twitter text (there are many other attributes too); its size is then (14640, 1000).
Since the label Y is in text form, I applied a LabelEncoder to it. This is how I split my dataset:
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
out: (10248, 1000) (10248,)
(4392, 1000) (4392,)
And this is my classifier
svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, Y_train)
prediction = svc.predict_proba(X_test)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(np.int)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))
out:Precision score: [0.73980398 0.48169243 0. ]
Accuracy Score: 0.6675774134790529
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Now I am not sure why the third value in the precision score is 0. I applied average=None in order to get a separate precision score for every class. Also, I am not sure whether the prediction step is right, because I wrote it for binary classification. Can you please help me debug it and make it better? Thanks in advance.
As the warning explains:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
it seems that one of your 3 classes is missing from your predictions prediction_int (i.e. you never predict it); you can easily check if this is the case with
set(Y_test) - set(prediction_int)
which should return the empty set {} if no class is missing.
If a class is indeed missing, and the above operation gives {1} or {2}, the most probable reason is that your dataset is imbalanced (you have many more negative samples) and you did not ask for a stratified split; modify your train_test_split to
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
and try again.
UPDATE (after comments):
As it turns out, you have a class imbalance problem (and not a coding issue) which prevents your classifier from successfully predicting your 3rd class (positive). Class imbalance is a huge sub-topic in itself, and there are several remedies proposed. Although going into more detail is arguably beyond the scope of a single SO thread, the first thing you should try (on top of the suggestions above) is to use the class_weight='balanced' argument in the definition of your classifier, i.e.:
svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train)
For more options, have a look at the dedicated imbalanced-learn Python library (part of the scikit-learn-contrib projects).
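Putting the suggestions together, a minimal sketch could look like the following; note that getting the 3-class labels directly from svc.predict (instead of thresholding predict_proba in binary style) is an assumption beyond the original code:
# Sketch: stratified split plus a class-weighted SVC; svc.predict returns
# the encoded class labels (0, 1, 2) directly, so no manual threshold.
X_train, X_test, Y_train, Y_test = train_test_split(
    bow, Y, test_size=0.3, stratify=Y, random_state=42)
svc = svm.SVC(kernel='linear', C=1, class_weight='balanced').fit(X_train, Y_train)
prediction_int = svc.predict(X_test)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))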
I am currently working on a project in which, for one portion, we would like to use a filter method of feature evaluation (we did correlation analysis), but for the other portion we thought it would be a good idea to do a wrapper-method feature elimination using cross-validation. I found the recursive feature elimination package in scikit-learn.
I am loading my data into a Pandas data frame 'df':
df = pd.read_csv("/Users/rohinmahesh/Documents/Data_Mining_Project/bank-additional-full.csv")
df = df.reset_index()
After creating dummy variables (producing a new frame, df2), I create the feature and label arrays:
x = df2.iloc[:, :-1]
y = df2.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
After running various models with accuracy as the metric (a project specification), the best-performing model was logistic regression:
clf = LogisticRegression()
clf.fit(x_train, y_train)
I then evaluate the model with 3-fold cross-validation (also a project specification):
pre_score = cross_val_score(clf, x, y, cv=3)
score = pre_score * 100
print(st.mean(score))
Now, when I run the regular recursive feature elimination, it works:
rfe = RFE(clf, 3)
rfe = rfe.fit(x_test, y_test)
print(rfe.support_)
print(rfe.ranking_)
But when I try to run this using RFECV, I get an error message:
rfe = RFECV(clf, cv=3)
rfe = rfe.fit(x_test, y_test)
print(rfe.support_)
print(rfe.ranking_)
I am new to this, so any help would be great!
Edit: I have gotten this to run properly (not sure what I did before that made it wrong), but now I get this output and I don't understand how everything can be of rank 1. Could someone catch my mistake?
A little late to post...
It simply means that the algorithm did not find any performance improvement from dropping features.
All selected features have a rank of 1.
Only dropped features have a rank greater than 1.
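If you want to see whether dropping features ever improved the cross-validated score, you can inspect the fitted selector. A sketch (grid_scores_ exists in older scikit-learn versions; newer ones expose cv_results_ instead, and fitting on the training split rather than the test split is the usual practice):
rfe = RFECV(clf, cv=3)
rfe = rfe.fit(x_train, y_train)  # fit on the training split, not the test split
print(rfe.n_features_)           # number of features RFECV decided to keep
print(rfe.grid_scores_)          # mean CV score for each number of features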