Found input variables with inconsistent numbers of samples error - python

I wrote the following code to compute the score of a machine learning model, but I get the error below. What could be the reason?
veri = pd.read_csv("deneme2.csv")
veri = veri.drop(['id'], axis=1)
y = veri[['Rating']]
x = veri.drop(['Rating','Genres'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.33)
DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
ytahmin = DTR.predict(x)
DTR.fit(veri[['Reviews','Size','Installs','Type','Price','Content Rating','Category_c']],veri.Rating)
basari_DTR = DTR.score(X_test,y_test)
#print("DecisionTreeRegressor: at a rate of", basari_DTR*100, "percent")
a = np.array([159,19000000.0,10000,0,0.0,0,0]).reshape(1, -1)
predict_DTR = DTR.predict(a)
print(f1_score(y_train, y_test, average='macro'))
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]

There are at least two issues with your code.
The first error you report
print(f1_score(y_train, y_test, average='macro'))
Error: Found input variables with inconsistent numbers of samples: [6271, 3089]
is due to your y_train and y_test having different lengths, as already pointed out in the other answer.
But this is not the main issue here, because, even if you change y_train to y_pred, as suggested, you get a new error:
print(f1_score(y_pred, y_test, average='macro'))
Error: continuous is not supported
This is simply because you are in a regression setting, while the f1 score is a classification metric and, as such, it does not work with continuous predictions.
In other words, the f1 score is inappropriate for your (regression) problem, hence the error.
Check the list of metrics available in scikit-learn, where you can confirm that f1 score is used only in classification, and pick up another metric suitable for regression problems.
For a more detailed exposition about what happens when choosing inappropriate metrics in scikit-learn, see Accuracy Score ValueError: Can't Handle mix of binary and continuous target
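As a rough illustration of what "a metric suitable for regression" could look like, here is a minimal sketch. The data is synthetic (an assumption, since deneme2.csv is not available), and the choice of MAE, RMSE and R2 is just one reasonable option, not the only one.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in for the real data (assumption): 200 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = 2.0 * X[:, 0] - X[:, 1] + rng.normal(scale=0.1, size=200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=0
)

model = DecisionTreeRegressor(random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)  # predictions for the test rows only

# Regression metrics accept continuous targets, unlike f1_score.
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"MAE: {mae:.3f}  RMSE: {rmse:.3f}  R2: {r2:.3f}")
```

Note that each metric compares y_test with predictions made on X_test, so both arrays have the same length.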

f1_score needs the true y values from the test set and the values you predicted on that same test set, so the last lines should be:
DTR = DecisionTreeRegressor()
DTR.fit(X_train,y_train)
y_pred = DTR.predict(X_test)
print(f1_score(y_pred, y_test, average='macro'))
You shouldn't call fit twice, and your predictions must have the same length as the test labels; see some basic sklearn tutorials for more info.

Related

Non linear regression using Xgboost

I have a dataframe with 36540 rows. The objective is to predict y (HITS_DAY).
#data
https://github.com/soufMiashs/Predict_Hits
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
What am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter, which doesn't exist in xgboost; the parameter is spelled objective). However, you have to consider how xgboost works. It builds trees, and trees split on the values of the features. From the plot you show here, it looks like very few samples have a value above 0. A purely random train/test split is therefore likely to produce a test set with virtually no samples above 0 (hence the horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
y.sort_values().reset_index()["HITS_DAY"].plot()
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
y_train.sort_values().reset_index()["HITS_DAY"].plot()
y_test.sort_values().reset_index()["HITS_DAY"].plot()
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Output:
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
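One way to judge whether a model is "learning something" is to compare it with a trivial baseline that always predicts the training mean. The sketch below is only an illustration under assumptions: the data is synthetic and skewed toward zero, and sklearn's GradientBoostingRegressor stands in for xgboost so the example has no extra dependency.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic, skewed target (assumption): mostly near 0, a few large values.
rng = np.random.default_rng(42)
X = rng.uniform(size=(1000, 4))
y = np.where(X[:, 0] > 0.9, 5.0 * X[:, 1], 0.0) + rng.normal(scale=0.05, size=1000)

# Stratify on "large vs small" so train and test contain both kinds of rows.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y > 1
)

# Baseline: always predict the mean of the training target.
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
# Stand-in for the boosted-tree model (assumption: not xgboost itself).
model = GradientBoostingRegressor(random_state=42).fit(X_train, y_train)

rmse_baseline = np.sqrt(mean_squared_error(y_test, baseline.predict(X_test)))
rmse_model = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f"RMSE baseline: {rmse_baseline:.3f}  RMSE model: {rmse_model:.3f}")
```

If the model's error is not clearly below the baseline's, it has learned little beyond the mean of the target.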

Root Mean Squared Error vs Accuracy Linear Regression

I built a simple linear regression model to predict students' final grade using this dataset https://archive.ics.uci.edu/ml/datasets/Student+Performance.
While my accuracy is very good, the errors seem to be big.
I'm not sure if I'm just not understanding the meaning of the errors correctly or if I made some errors in my code. I thought for the accuracy of 92, the errors should be way smaller and closer to 0.
Here's my code:
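import numpy as np
import pandas as pd
import sklearn.model_selection
from sklearn import linear_model, metrics
data = pd.read_csv("/Users/.../student/student-por.csv", sep=";")
data = data.select_dtypes(include="number")  # assumption: keep the numeric columns only
predict = "G3"  # target column: the final grade
X = np.array(data.drop(columns=[predict]))
y = np.array(data[predict])
x_train, x_test, y_train, y_test = sklearn.model_selection.train_test_split(X, y, test_size = 0.1, random_state=42)
linear = linear_model.LinearRegression()
linear.fit(x_train, y_train)
linear_accuracy = round(linear.score(x_test, y_test) , 5)  # this is R^2, not a classification accuracy
linear_prediction = linear.predict(x_test)  # predictions used by the error metrics below
linear_mean_abs_error = metrics.mean_absolute_error(y_test, linear_prediction)
linear_mean_sq_error = metrics.mean_squared_error(y_test, linear_prediction)
linear_root_mean_sq_error = np.sqrt(metrics.mean_squared_error(y_test, linear_prediction))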
Did I make any errors in the code, or do the errors make sense in this case?
The value returned by score for sklearn's linear regression is the R^2 metric, not a classification accuracy. It essentially tells you the share of the variation in the dependent variable that is explained by the model's predictors. 0.92 is a very good score, but it does not mean that your errors will be 0: the error metrics are expressed in the units of the target, so they only shrink to 0 for a perfect fit. I looked at your work, and it seems that you used all the numeric variables as your predictors and that your target was G3. The code seems fine and the results seem accurate too. In regression tasks it is really hard to get 0 errors. Please let me know if you have any questions. Cheers
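To make the relationship between R^2 and the error metrics concrete, here is a small sketch on synthetic data (an assumption; it is not the student dataset). It checks that score equals 1 - MSE / Var(y_test), which is why a high R^2 can coexist with an RMSE well above 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): a linear signal plus noise with standard deviation 1.
rng = np.random.default_rng(42)
x = rng.normal(size=(500, 1))
y = 10.0 + 3.0 * x[:, 0] + rng.normal(scale=1.0, size=500)

x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=42
)
linear = LinearRegression().fit(x_train, y_train)
pred = linear.predict(x_test)

r2 = linear.score(x_test, y_test)      # what the question calls "accuracy"
mse = mean_squared_error(y_test, pred)
rmse = np.sqrt(mse)                    # in the same units as y

# R^2 is the error relative to the spread of the target, not the error itself.
r2_manual = 1.0 - mse / np.var(y_test)
print(f"R2: {r2:.3f}  manual R2: {r2_manual:.3f}  RMSE: {rmse:.3f}")
```

Here the RMSE stays close to the noise level no matter how high R^2 gets, because R^2 only measures the error against the variance of the target.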

Statsmodels output different from sklearn regression

I am trying to get something magic from the boston dataset in sklearn. Without making any changes, I ran one regression with sklearn and another with statsmodels, to easily get the p-value of each of the variables used. However, my results are completely different.
Here it is:
boston_houses=load_boston()
boston=pd.DataFrame(data=boston_houses.data, columns=boston_houses.feature_names)
boston['MEDV']=boston_houses.target
boston.head()
X,y=boston.drop(columns='MEDV'),boston['MEDV']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.33, random_state=42)
lin_model = LinearRegression()
lin_model.fit(X_train, y_train)
pred= lin_model.predict(X_test)
from sklearn.metrics import r2_score,mean_squared_error
rSq=r2_score(y_test,pred)
rmse=np.sqrt(mean_squared_error(y_test,pred))
print ('The R-squared for this model {}'.format(rSq))
print ('The Root mean square error for this model {}'.format(rmse))
Output:
The R-squared for this model 0.7261570836552478
The Root mean square error for this model 4.55236459846306
###### statsmodels now ######
X_new=sm.tools.tools.add_constant(X_train)
estimator= sm.OLS(y_train, X_new)
estimator.fit()
print(estimator.fit().summary())
I get 0.739 for the R-squared with statsmodels. Why?
If you are wondering why it is not the same as the R-squared you got from sklearn.metrics.r2_score, then the reason is that you have used two different realisations of linear regression with different parameters, which produced different predictions.
For example, if you change test_size to 0.25 in train_test_split, you will get yet another model with a different result.
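I had been comparing sklearn's R-squared on the test set with the statsmodels R-squared, which is computed on the data the model was fitted on. Once I evaluated sklearn on the same data, the results matched. Silly of me. I should check how the model performs on the training data before checking how it responds to the test sample; keeping those two steps apart would have avoided the mistake.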
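The point about the two steps can be sketched as follows. This is only an illustration under assumptions: the data is synthetic, and NumPy's least-squares solver with an explicit constant column stands in for sm.OLS with add_constant, so that the example has no statsmodels dependency.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): 5 features, noisy linear target.
rng = np.random.default_rng(42)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + rng.normal(scale=2.0, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42
)

lin_model = LinearRegression().fit(X_train, y_train)
r2_train = lin_model.score(X_train, y_train)  # R^2 on the data used for fitting
r2_test = lin_model.score(X_test, y_test)     # R^2 on held-out data

# Ordinary least squares with an explicit intercept column, fitted on the
# training data only (stand-in for add_constant + OLS).
X_const = np.column_stack([np.ones(len(X_train)), X_train])
beta, *_ = np.linalg.lstsq(X_const, y_train, rcond=None)
r2_ols_train = r2_score(y_train, X_const @ beta)

print(f"sklearn train R2: {r2_train:.4f}")
print(f"OLS train R2:     {r2_ols_train:.4f}")
print(f"sklearn test R2:  {r2_test:.4f}")
```

The two training-set values agree, while the test-set value is a different quantity and will generally differ from both.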

How to get the precision score of every class in a Multi class Classification Problem?

I am doing sentiment analysis classification with scikit-learn. There are 3 labels: positive, neutral and negative. The shape of my training data is (14640, 15), with the following class counts:
negative 9178
neutral 3099
positive 2363
I have pre-processed the data and applied the bag-of-words vectorization technique to the tweet text only (there are many other attributes too), which gives a matrix of size (14640, 1000).
Since Y, the label, is in text form, I applied LabelEncoder to it. This is how I split my dataset:
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)
print(X_train.shape,Y_train.shape)
print(X_test.shape,Y_test.shape)
out: (10248, 1000) (10248,)
(4392, 1000) (4392,)
And this is my classifier
svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, Y_train)
prediction = svc.predict_proba(X_test)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)  # np.int was removed in NumPy 1.24; use the builtin int
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))
out:Precision score: [0.73980398 0.48169243 0. ]
Accuracy Score: 0.6675774134790529
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/classification.py:1437: UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Now I am not sure why the third value in the precision score is 0. I used average=None to get a separate precision score for every class. I am also not sure whether the prediction step is right, because I wrote it for binary classification. Can you please help me debug it and make it better? Thanks in advance.
As the warning explains:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
it seems that one of your 3 classes is missing from your predictions prediction_int (i.e. you never predict it); you can easily check if this is the case with
set(Y_test) - set(prediction_int)
which should be the empty set, set(), if this is not the case.
If this is indeed the case, and the above operation gives {1} or {2}, the most probable reason is that your dataset is imbalanced (you have many more negative samples), and you do not ask for a stratified split; modify your train_test_split to
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
and try again.
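Here is a small sketch of both checks on toy data. The labels and the "predictions" are made up for illustration (assumption: classes are encoded as 0, 1, 2, and the hypothetical classifier never outputs class 2).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced labels (assumption): 90 of class 0, 30 of class 1, 20 of class 2.
Y = np.array([0] * 90 + [1] * 30 + [2] * 20)
X = np.arange(len(Y)).reshape(-1, 1)

# stratify=Y keeps the class proportions the same in train and test.
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=42
)

# Hypothetical predictions that never contain class 2.
prediction_int = np.where(Y_test == 2, 0, Y_test)

# Classes present in the true labels but absent from the predictions.
missing = set(Y_test.tolist()) - set(prediction_int.tolist())
print(missing)  # {2}
```

An empty result, printed as set(), would mean every class in Y_test is predicted at least once.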
UPDATE (after comments):
As it turns out, you have a class imbalance problem (and not a coding issue) which prevents your classifier from successfully predicting your 3rd class (positive). Class imbalance is a huge sub-topic in itself, and there are several remedies proposed. Although going into more detail is arguably beyond the scope of a single SO thread, the first thing you should try (on top of the suggestions above) is to use the class_weight='balanced' argument in the definition of your classifier, i.e.:
svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train)
For more options, have a look at the dedicated imbalanced-learn Python library (part of the scikit-learn-contrib projects).
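Putting the suggestions together, a per-class precision could be obtained roughly as below. This is a sketch on synthetic data from make_classification (an assumption, standing in for the bag-of-words matrix). It also calls predict, which returns one of the three labels directly; the thresholded probability column from the question can only take the values 0 and 1.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic imbalanced 3-class data (assumption, not the tweet data).
X, Y = make_classification(
    n_samples=600, n_features=20, n_informative=6, n_classes=3,
    weights=[0.6, 0.25, 0.15], random_state=42,
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=42
)

# class_weight='balanced' reweights classes inversely to their frequency.
svc = SVC(kernel="linear", C=1, class_weight="balanced").fit(X_train, Y_train)
prediction = svc.predict(X_test)  # class labels 0, 1 or 2

# average=None returns one precision value per class, in the order of `labels`.
per_class = precision_score(
    Y_test, prediction, average=None, labels=[0, 1, 2], zero_division=0
)
print(per_class)
print(classification_report(Y_test, prediction, zero_division=0))
```

classification_report prints precision, recall and f1 for every class in one table, which is often easier to read than separate calls.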

Problems running Recursive Feature Elimination using Cross Validation in SkLearn?

I am currently working on a project with two feature-selection parts. In one part we use a Filter method, for which we did correlation analysis. In the other part, we thought it would be a good idea to use more of a Wrapper method: feature elimination with cross validation. I found the recursive feature elimination package in scikit-learn.
I am loading my data into a Pandas data frame 'df':
df = pd.read_csv("/Users/rohinmahesh/Documents/Data_Mining_Project/bank-additional-full.csv")
df = df.reset_index()
After creating dummy variables I create the feature and label arrays:
x = df2.iloc[:, :-1]
y = df2.iloc[:, -1]
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3)
After running various models with accuracy as a metric (project specification), the best model was Logistic Regression:
clf = LogisticRegression()
clf.fit(x_train, y_train)
I am then evaluating the model with 3 fold Cross Validation (project specification):
pre_score = cross_val_score(clf, x, y, cv=3)
score = pre_score * 100
print(st.mean(score))
Now when I run the regular Recursive Feature Elimination, it is working:
rfe = RFE(clf, n_features_to_select=3)  # must be passed by keyword in recent scikit-learn versions
rfe = rfe.fit(x_test, y_test)
print(rfe.support_)
print(rfe.ranking_)
But when I try and run this using the RFECV, I am getting an error message:
rfe = RFECV(clf, cv=3)
rfe = rfe.fit(x_test, y_test)
print(rfe.support_)
print(rfe.ranking_)
I am new to this, so any help would be great!
Edit: I have gotten this to run properly (not sure what I did before that made it wrong) but I am getting this output, and I don't understand how everything is of rank 1. Could someone catch my mistake possibly?
A little late to post...
It simply means that the algorithm did not find any performance improvement from dropping features.
All selected features have a rank of 1.
Only dropped features have a rank greater than 1.
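The relationship between support_ and ranking_ can be seen in a small sketch. The data is synthetic (an assumption, not the bank dataset), and max_iter=1000 is only there to avoid convergence warnings.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data (assumption): 8 features, of which 3 are informative.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=3, n_redundant=0, random_state=0
)

clf = LogisticRegression(max_iter=1000)
rfe = RFECV(clf, cv=3, scoring="accuracy").fit(X, y)

print(rfe.n_features_)  # number of features kept
print(rfe.support_)     # True for kept features
print(rfe.ranking_)     # 1 for kept features, larger values for dropped ones
```

If every entry of ranking_ is 1, then n_features_ equals the total number of features, i.e. cross validation preferred keeping all of them.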
