train_test_split random_state not working; produces different output every time - python

So, I've been using KNN on a set of data, with random_state = 4 during the train_test_split phase. Despite using the random state, the accuracy, classification report, predictions, etc. are different each time. Why is that?
Here's the head of the data: (predicting the position based on all_time_runs and order)
   order  position  all_time_runs
0     10    NO BAT           1304
1      2   CAN BAT           7396
2      3    NO BAT           6938
3      6   CAN BAT           4903
4      6   CAN BAT           3761
And here's the code for the classification and prediction:
#splitting data into features and target
X = posdf.drop('position',axis=1)
y = posdf['position']
knn = KNeighborsClassifier(n_neighbors = 5)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)
#fitting the KNN model
knn.fit(X_train, y_train)
#predicting with the model
prediction = knn.predict(X_test)
#knn score
score = knn.score(X_test, y_test)

Although train_test_split has a random factor associated with it, which has to be fixed to avoid random results, it's not the only factor you should look at.
KNN is a model that takes each row of the test set, finds the k nearest training vectors, and classifies the row by majority vote; in case of ties, the decision is effectively arbitrary (in scikit-learn it depends on the ordering of the training data). Make sure every source of randomness in your script is seeded (set.seed(x) in R; in Python, for example, np.random.seed(x) or the random_state arguments) so that the method is replicable.
Documentation states:
Neighbors-based classification is a type of instance-based learning or non-generalizing learning: it does not attempt to construct a general internal model, but simply stores instances of the training data. Classification is computed from a simple majority vote of the nearest neighbors of each point: a query point is assigned the data class which has the most representatives within the nearest neighbors of the point.
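A quick way to narrow this down is to run the split, fit and score twice on the same data and compare the results. The sketch below is only an illustration: it assumes synthetic data from make_classification in place of posdf, and run_once is a hypothetical helper name, not something from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier


def run_once(X, y, seed=42):
    """Split, fit and score once with a fixed random_state (hypothetical helper)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    knn = KNeighborsClassifier(n_neighbors=5)
    knn.fit(X_train, y_train)
    return knn.score(X_test, y_test), knn.predict(X_test)


# Synthetic stand-in for the real data (an assumption, not the asker's dataset)
X, y = make_classification(
    n_samples=200, n_features=2, n_informative=2, n_redundant=0, random_state=0
)

score_a, pred_a = run_once(X, y)
score_b, pred_b = run_once(X, y)

# Both comparisons should hold when the input rows and their order are identical
print(score_a == score_b, np.array_equal(pred_a, pred_b))
```

If the two runs agree here but not in your own script, the variation is probably introduced before the split, for example in how the dataframe is built or ordered.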


Problem of doing preprocessing for testing set in GridSearchCV

I use 20% of the data set as my testing set and use GridSearchCV to implement K-fold cross-validation to tune hyperparameters.
By using a pipeline, we can put the column transformer and the machine learning algorithm into GridSearchCV together. If I set up 5-fold cross-validation for GridSearchCV, the function will use 5 different training and validation sets to train and validate each combination of hyperparameters. As far as I know, GridSearchCV uses the mean of the 5 fold scores to choose the best model.
Then my question is, how does it transform the testing set?
I'm very confused about this because to avoid data leakage, we should use only the training set to fit the transformer, but in this case, we have 5 different training sets and I don't know which one the GridSearchCV function uses to fit and transform the validation and testing set.
My code is given below
X_other, X_test, y_other, y_test = train_test_split(X, y, test_size = 0.2, random_state = i)
kf = KFold(n_splits = 4, shuffle = True, random_state = i)
pipe = Pipeline(steps = [("preprocessor", preprocessor),("model", ML_algo)])
grid = GridSearchCV(estimator = pipe, param_grid=param_grid,
                    scoring = make_scorer(mean_squared_error, greater_is_better=False,
                                          squared=False),
                    cv=kf, return_train_score = True, n_jobs=-1, verbose=False)
grid.fit(X_other, y_other)
test_score = mean_squared_error(y_test, grid.predict(X_test), squared=False)
Short answer: there is no data leakage; the test set is not used (and should not be used) for training the model in your code.
Long answer: k-fold cross-validation randomly divides your X_other and y_other (the training set) into k folds. In each iteration, k-1 folds are used to fit the whole pipeline (preprocessor and model), and the fitted pipeline is then evaluated on the one remaining fold using the metric you specified in scoring=. (See the cross-validation diagram in the scikit-learn documentation for details.)
After GridSearchCV() finds the best set of hyperparameters, the pipeline is refit on all of the training data (X_other, y_other) with those hyperparameters; this is the default refit=True behaviour. It is this refitted pipeline, including its preprocessor, that transforms X_test when you call grid.predict(X_test). Note that X_test and y_test are not used, and should not be used, anywhere other than in this final evaluation.
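One way to convince yourself of this is to inspect the preprocessor inside grid.best_estimator_ after fitting. The sketch below is an illustration under assumptions: it uses synthetic data from make_regression, a StandardScaler as the preprocessor and a Ridge model with a small alpha grid, none of which come from the question.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, KFold, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption)
X, y = make_regression(n_samples=200, n_features=5, noise=10.0, random_state=0)
X_other, X_test, y_other, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

pipe = Pipeline(steps=[("preprocessor", StandardScaler()), ("model", Ridge())])
kf = KFold(n_splits=4, shuffle=True, random_state=0)
grid = GridSearchCV(
    estimator=pipe,
    param_grid={"model__alpha": [0.1, 1.0, 10.0]},  # illustrative grid
    scoring="neg_root_mean_squared_error",
    cv=kf,
)
grid.fit(X_other, y_other)

# With the default refit=True, the best pipeline is refit on all of X_other,
# so the scaler's learned means should match the column means of X_other.
fitted_scaler = grid.best_estimator_.named_steps["preprocessor"]
uses_all_training_rows = np.allclose(fitted_scaler.mean_, X_other.mean(axis=0))
print("scaler fitted on all of X_other:", uses_all_training_rows)

# X_test is only transformed (never fitted on) when predicting
test_predictions = grid.predict(X_test)
```

The per-fold scalers fitted during the search are discarded; only the refitted one is applied to X_test.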

why am I getting a very high test accuracy even when i test my dataset with a single feature

I am writing a small program and I am training a random forest to predict a binary value. My dataset has around 20,000 entries and each entry has 25 features(continuous and categorical) with a binary target value to predict.
I am getting over 99% test accuracy which is surprisingly high. I tried to reduce the number of my features, even with two features I am still getting such high accuracy. I just want to make sure I am not doing anything wrong in my code, such as the training set leaking into my test set.
Here is the code snippet
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv(r'test.csv')
data = data.drop_duplicates()
#splitting data
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
#preprocessing the dataset by one hot encoding
#(fit the encoder on the training set only, then reuse it on the test set)
l1 = OneHotEncoder(handle_unknown='ignore')
X_train = l1.fit_transform(X_train)
X_test = l1.transform(X_test)
classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train.to_numpy())
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
Additionally, I forgot to add that my dataset is balanced, and the precision and recall scores are 100%!
This is quite a big dataset. How balanced is your dataset? It might be that your test split consists mostly of entries with one label, and the model fails every time an entry has the other label. Therefore, I would say accuracy is not a good measure to rely on here.
Have a look at this:
Difference of model accuracy and performance
Have a look at your confusion matrix and inspect your splits.
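For instance, the label counts per split and the confusion matrix can be checked along these lines. This is only a sketch: it assumes a synthetic, deliberately imbalanced dataset from make_classification with labels 0 and 1, not the asker's test.csv.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (assumption): roughly 90% label 0
X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Inspect the splits: how many samples of each label ended up where?
train_counts = np.bincount(y_train, minlength=2)
test_counts = np.bincount(y_test, minlength=2)
print("train label counts:", train_counts)
print("test label counts: ", test_counts)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train)

# Rows are true labels, columns are predicted labels
cm = confusion_matrix(y_test, classifier.predict(X_test), labels=[0, 1])
print(cm)
```

If one row of the matrix is almost entirely off the diagonal while accuracy is still high, the score is being carried by the majority label.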

Non linear regression using Xgboost

I have a dataframe with 36540 rows. The objective is to predict y (HITS_DAY).
I am trying to train a non-linear regression model, but the model doesn't seem to learn much.
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.20, random_state=42)
data_dmatrix = xgb.DMatrix(data=x,label=y)
xg_reg = xgb.XGBRegressor(learning_rate = 0.1, objectif='reg:linear', max_depth=5,
                          n_estimators = 1000)
xg_reg.fit(X_train,y_train)
preds = xg_reg.predict(X_test)
df=pd.DataFrame({'ACTUAL':y_test, 'PREDICTED':preds})
what am I doing wrong?
You're not doing anything wrong in particular (except maybe the objectif parameter, which doesn't exist in xgboost; the parameter is spelled objective). However, you have to consider how xgboost works. It builds trees, and trees split on the values of the features. From the plot you show, it looks like very few samples have a value above 0. So a purely random train/test split will likely result in a test set with virtually no samples above 0 (so a horizontal line).
Other than that, it seems you want to fit a linear model on non-linear data. Selecting a different objective function is likely to help with this.
Finally, how do you know that your model is not learning anything? I don't see any evaluation metrics to confirm this. Try to think of meaningful evaluation metrics for your model and show them. This will help you determine if your model is "good enough".
To summarize:
Fix the imbalance in your dataset (or at least take it into consideration)
Select an appropriate objective function
Check evaluation metrics that make sense for your model
From this example it looks like your model is indeed learning something, even without parameter tuning (which you should do!).
import pandas
import xgboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
# Read the data
df = pandas.read_excel("./data.xlsx")
# Split in X and y
X = df.drop(columns=["HITS_DAY"])
y = df["HITS_DAY"]
# Show the values of the full dataset in a plot
# Split in test and train, use stratification to make sure the 2 groups look similar
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=[element > 1 for element in y.values]
)
# Show the plots of the test and train set (make sure they look similar!)
# Create the regressor
estimator = xgboost.XGBRegressor(objective="reg:squaredlogerror")
# Fit the regressor
estimator.fit(X_train, y_train)
# Predict on the test set
predictions = estimator.predict(X_test)
df = pandas.DataFrame({"ACTUAL": y_test, "PREDICTED": predictions})
# Show the actual vs predicted
df.sort_values("ACTUAL").reset_index()[["ACTUAL", "PREDICTED"]].plot()
# Show some evaluation metrics
print(f"Mean squared error: {mean_squared_error(y_test.values, predictions)}")
print(f"R2 score: {r2_score(y_test.values, predictions)}")
Mean squared error: 0.01525351142868279
R2 score: 0.07857787102063485
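To judge whether numbers like these are any good, it can help to compare them with a trivial baseline evaluated on the same stratified split. The sketch below is an assumption-laden illustration: it uses a made-up skewed target instead of HITS_DAY and scikit-learn's DummyRegressor (always predicting the training mean) as the baseline, so only the pattern carries over, not the values.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
# Made-up skewed target (assumption): about 5% of rows are large, the rest are 0
y = np.where(rng.random(1000) < 0.05, 5.0, 0.0)

# Stratify on a boolean "is this a high value?" flag, as in the answer above
is_high = y > 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=is_high
)
train_share = (y_train > 1).mean()
test_share = (y_test > 1).mean()
print(f"share of high values, train: {train_share:.3f}, test: {test_share:.3f}")

# Baseline that ignores the features and always predicts the training mean
baseline = DummyRegressor(strategy="mean").fit(X_train, y_train)
baseline_pred = baseline.predict(X_test)
baseline_mse = mean_squared_error(y_test, baseline_pred)
baseline_r2 = r2_score(y_test, baseline_pred)
print(f"Baseline mean squared error: {baseline_mse}")
print(f"Baseline R2 score: {baseline_r2}")
```

A real model is only "learning something" to the extent that it beats this baseline on the same test set.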

How to get the precision score of every class in a Multi class Classification Problem?

I am making Sentiment Analysis Classification and I am doing it with Scikit-learn. This has 3 labels, positive, neutral and negative. The Shape of my training data is (14640, 15), where
negative 9178
neutral 3099
positive 2363
I have pre-processed the data and applied the bag-of-words vectorization technique to the tweet text (there are many other attributes too); the resulting matrix has shape (14640, 1000).
As Y, the label, is in text form, I applied LabelEncoder to it. This is how I split my dataset:
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, random_state=42)
out: (10248, 1000) (10248,)
(4392, 1000) (4392,)
And this is my classifier
svc = svm.SVC(kernel='linear', C=1, probability=True).fit(X_train, Y_train)
prediction = svc.predict_proba(X_test)
prediction_int = prediction[:,1] >= 0.3
prediction_int = prediction_int.astype(int)
print('Precision score: ', precision_score(Y_test, prediction_int, average=None))
print('Accuracy Score: ', accuracy_score(Y_test, prediction_int))
out:Precision score: [0.73980398 0.48169243 0. ]
Accuracy Score: 0.6675774134790529
/usr/local/lib/python3.6/dist-packages/sklearn/metrics/ UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
'precision', 'predicted', average, warn_for)
Now I am not sure why the third value in the precision score is 0. I used average=None to get a separate precision score for every class. Also, I am not sure whether the prediction step is right, because I wrote it for binary classification. Can you please help me debug it and make it better? Thanks in advance.
As the warning explains:
UndefinedMetricWarning: Precision is ill-defined and being set to 0.0 in labels with no predicted samples.
it seems that one of your 3 classes is missing from your predictions prediction_int (i.e. you never predict it); you can easily check if this is the case with
set(Y_test) - set(prediction_int)
which should be the empty set, set(), if this is not the case.
If this is indeed the case, and the above operation gives {1} or {2}, the most probable reason is that your dataset is imbalanced (you have many more negative samples), and you do not ask for a stratified split; modify your train_test_split to
X_train, X_test, Y_train, Y_test = train_test_split(bow, Y, test_size=0.3, stratify=Y, random_state=42)
and try again.
UPDATE (after comments):
As it turns out, you have a class imbalance problem (and not a coding issue) which prevents your classifier from successfully predicting your 3rd class (positive). Class imbalance is a huge sub-topic in itself, and there are several remedies proposed. Although going into more detail is arguably beyond the scope of a single SO thread, the first thing you should try (on top of the suggestions above) is to use the class_weight='balanced' argument in the definition of your classifier, i.e.:
svc = svm.SVC(kernel='linear', C=1, probability=True, class_weight='balanced').fit(X_train, Y_train)
For more options, have a look at the dedicated imbalanced-learn Python library (part of the scikit-learn-contrib projects).
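It may also be worth noting that thresholding prediction[:, 1] can only ever produce the labels 0 and 1, so label 2 never appears in prediction_int regardless of the data. A minimal sketch of getting one precision value per class from the classifier's own multiclass predictions follows; it assumes synthetic three-class data from make_classification in place of the bag-of-words matrix, with encoded labels 0, 1 and 2.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import classification_report, precision_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic, imbalanced three-class stand-in data (assumption)
X, Y = make_classification(
    n_samples=600, n_features=20, n_informative=5, n_classes=3,
    weights=[0.6, 0.25, 0.15], random_state=0,
)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, stratify=Y, random_state=42
)

svc = SVC(kernel="linear", C=1, class_weight="balanced").fit(X_train, Y_train)

# predict() returns one of the three class labels for every sample
prediction = svc.predict(X_test)

# average=None gives one precision value per label, in the order of `labels`
per_class_precision = precision_score(
    Y_test, prediction, average=None, labels=[0, 1, 2], zero_division=0
)
print("Precision per class:", per_class_precision)
print(classification_report(Y_test, prediction, zero_division=0))
```

classification_report prints precision, recall and F1 for each class in one table, which is often more convenient than calling the individual metric functions.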

How can I explain this drop in performance on test data?

I am asking the question here, even though I hesitated to post it on CrossValidated (or DataScience) StackExchange. I have a dataset of 60 labeled objects (to be used for training) and 150 unlabeled objects (for test). The aim of the problem is to predict the labels of the 150 objects (this used to be given as a homework problem). For each object, I computed 258 features. Considering each object as a sample, I have X_train : (60,258), y_train : (60,) (labels of the objects used for training) and X_test : (150,258). Since the solution of the homework problem was given, I also have the true labels of the 150 objects, in y_test : (150,).
In order to predict the labels of the 150 objects, I choose to use a LogisticRegression (the Scikit-learn implementation). The classifier is trained on (X_train, y_train), after the data has been normalized, and used to make predictions for the 150 objects. Those predictions are compared to y_test to assess the performance of the model. For reproducibility, I copy the code I have used.
from sklearn import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score, cross_val_predict
# Fit classifier
LogReg = LogisticRegression(C=1, class_weight='balanced')
scaler = StandardScaler()
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
# Performance on training data
CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')
# Performance on test data
probas = LogReg.predict_proba(X_test)[:, 1]
AUC = metrics.roc_auc_score(y_test, probas)
The matrices X_train, y_train, X_test and y_test are saved in a .mat file available at this link. My problem is the following:
Using this approach, I get good performance on the training data (CV_score = 0.8), but the performance on the test data is much worse: AUC = 0.54 for C=1 in LogReg and AUC = 0.40 for C=0.01. How can I get AUC < 0.5 if a naive classifier should score AUC = 0.5? Is this because I have a small number of training samples?
I have noticed that the performance on test data improves if I change the code to:
y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
AUC = metrics.roc_auc_score(y_test, y_pred)
Indeed, AUC = 0.87 for C=1 and 0.9 for C=0.01. Why is the AUC score so much better using cross-validation predictions? Is it because cross-validation makes predictions on subsets of the test data that do not contain the objects/samples which decrease the AUC?
Looks like you are encountering an overfitting problem, i.e. the classifier trained using the training data is overfitting to the training data. It has poor generalization ability. That is why the performance on the testing dataset isn't good.
cross_val_predict is actually training the classifier using part of your testing data and then predict on the rest. So the performance is much better.
Overall, there seems to be quite some difference between your training and testing datasets. So the classifier with the highest training accuracy doesn't work well on your testing set.
Another point not directly related with your question: since the number of your training samples is much smaller than the feature dimensions, it may be helpful to perform dimension reduction before feeding to classifier.
It looks like your training and test processes are inconsistent. Although your code intends to standardize the data, you fail to do so during testing. What I mean:
clf = make_pipeline(StandardScaler(), LogReg)
LogReg.fit(X_train, y_train)
Although you define a pipeline, you do not fit the pipeline (clf) but only the logistic regression (LogReg). This matters because your cross-validated score is calculated with the pipeline (CV_score = cross_val_score(clf, X_train, y_train, cv=10, scoring='roc_auc')), but at test time, instead of using the pipeline to predict as expected, you use only LogReg, hence the test data are not standardized.
The second point you raise is different. In y_pred = cross_val_predict(clf, X_test, y_test, cv=5)
you get predictions by doing cross-validation on the test data, while ignoring the train data. Here, you do data standardization since you use clf and thus your score is high; this is evidence that the standardization step is important.
To summarize, I believe standardizing the test data will improve your test score.
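A sketch of the consistent version, where the pipeline itself is fitted and then used for the test predictions, might look as follows. It assumes synthetic data from make_classification instead of the .mat file, and max_iter=1000 is added only to keep the solver from stopping early on this made-up data.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (assumption), shifted and rescaled so that
# standardization actually changes the inputs
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X = X * 100.0 + 50.0
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

LogReg = LogisticRegression(C=1, class_weight="balanced", max_iter=1000)
clf = make_pipeline(StandardScaler(), LogReg)

# Fit the pipeline: the scaler is fitted on X_train, then LogReg on the scaled data
clf.fit(X_train, y_train)

# Predict with the pipeline so X_test goes through the same fitted scaler
probas = clf.predict_proba(X_test)[:, 1]
AUC = roc_auc_score(y_test, probas)
print("Test AUC with the fitted pipeline:", AUC)
```

Calling LogReg.predict_proba(X_test) directly here would skip the scaler and feed unscaled features to a model trained on scaled ones, which is the mismatch described above.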
Firstly, it makes no sense to have 258 features for 60 training items. Secondly, cv=10 for 60 items means you split the data into 10 train/test sets, each of which has only 6 items in its test set. So whatever results you obtain will be useless. You need more training data and fewer features.
