I'm having a weird problem that may surprise you all: my classification rate on my test set is too high. I'm using scikit-learn, and I'm very suspicious of these classification rates, as they are very close to 1.
x = []
for nums in range(1, 100):
    X = maxDataset[:, 1:]
    y = maxDataset[:, 0]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    cvIter = ShuffleSplit(X_train.shape[0], n_iter=10, test_size=0.2)
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)
    clf.fit(X, y)
    avg = metrics.accuracy_score(y_test, clf.predict(X_test))
    x.append(avg)
xMean = np.average(x)
print(xMean)
Is something wrong with my evaluation? I suspected this could be due to fitting the model on the entire data set. If that is the case, or something else is the problem, how can I fix it to get an accurate evaluation of the classifier? The xMean ranges from 98.3 to 99.
Thanks
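That second clf.fit(X, y) does look like the culprit: it retrains the forest on all rows, including the ones in X_test, so the test split has been seen during training and the accuracy is inflated. A minimal sketch of the corrected loop (assuming maxDataset is a NumPy array with the label in column 0, as in the question) fits on the training split only:
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X = maxDataset[:, 1:]   # features
y = maxDataset[:, 0]    # labels
scores = []
for _ in range(100):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    clf = RandomForestClassifier()
    clf.fit(X_train, y_train)   # fit on the training split only
    scores.append(accuracy_score(y_test, clf.predict(X_test)))
print(np.mean(scores))          # average held-out accuracy over the repeats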
I have to discretize a continuous target variable into at least 5 bins in order to lower the complexity of a classification model built with the sklearn library.
To do this I've used KBinsDiscretizer, but I don't know how to split the dataset into balanced parts now that I've discretized the target variable.
This is my code:
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

X = df.copy()
y = X.pop('shares')

# scaling the features so all data is in the same range
scaler = preprocessing.MinMaxScaler()
X = scaler.fit_transform(X)

# discretizing the continuous target into 5 ordinal bins
discretizer = preprocessing.KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
y_discretized = discretizer.fit_transform(y.values.reshape(-1, 1))

# is this correct?
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, shuffle=True, stratify=y_discretized)
For completeness, I'm trying to recreate a less complex model than the one shown in: [1] K. Fernandes, P. Vinagre and P. Cortez, "A Proactive Intelligent Decision Support System for Predicting the Popularity of Online News," Proceedings of the 17th EPIA 2015 - Portuguese Conference on Artificial Intelligence, Coimbra, Portugal, September 2015.
Your y_train and y_test are parts of y, which (it seems) still holds the original continuous values. So you end up fitting multiclass classification models with probably lots of different classes, which likely causes the crashes.
I assume what you wanted is
X_train, X_test, y_train, y_test = train_test_split(X, y_discretized, test_size=0.33, shuffle=True, stratify=y_discretized)
Whether discretizing a continuous target to turn a regression into a classification is a good idea is a topic for another site; see e.g. https://datascience.stackexchange.com/q/90297/55122
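One extra detail worth noting (my own addition, not part of the answer above): KBinsDiscretizer.fit_transform returns a 2-D array of shape (n_samples, 1), so if the binned values are also going to serve as the classification target it is cleaner to flatten them first:
# y_discretized has shape (n_samples, 1); flatten it before using it as a target
y_binned = y_discretized.ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y_binned, test_size=0.33, shuffle=True, stratify=y_binned
)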
I am writing a small program in which I train a random forest to predict a binary value. My dataset has around 20,000 entries, and each entry has 25 features (continuous and categorical) with a binary target value to predict.
I am getting over 99% test accuracy, which is surprisingly high. I tried to reduce the number of features; even with two features I still get such high accuracy. I just want to make sure I am not doing anything wrong in my code, such as the training set leaking into my test set.
Here is the code snippet
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder

data = pd.read_csv(r'test.csv')
data = data.drop_duplicates()

# splitting the data
X = data.drop('label', axis=1)
y = data['label']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# preprocessing the dataset by one-hot encoding (fit on the training split only)
l1 = OneHotEncoder(handle_unknown='ignore')
l1.fit(X_train)
X_train = l1.transform(X_train)
X_test = l1.transform(X_test)

classifier = RandomForestClassifier(n_estimators=20, random_state=0)
classifier.fit(X_train, y_train.to_numpy())

# evaluation
y_pred = classifier.predict(X_test)
print(accuracy_score(y_test, y_pred))
Additionally, I forgot to add that my dataset is balanced and the precision and recall scores are 100%!
This is quite a big dataset. How balanced is your dataset? It might be that your test split is filled mostly with entries of one label and the classifier fails every time an entry is from the other label. In that case, accuracy is not a good measure to rely on here.
Have a look at this:
Difference of model accuracy and performance
Have a look at your confusion matrix and inspect your splits.
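As a rough sketch of what inspecting the splits and the confusion matrix could look like (variable names follow the question's snippet; this is an illustration, not part of the original answer):
from sklearn.metrics import classification_report, confusion_matrix

# check the class balance in each split
print("train label distribution:", y_train.value_counts(normalize=True).to_dict())
print("test label distribution: ", y_test.value_counts(normalize=True).to_dict())

# per-class view of the predictions instead of a single accuracy number
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))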
I have a dataset, and I split it into 60%, 20%, and 20%: 60% for training, 20% for testing, and the rest for validation (I made the split because I need it).
I used the following code to split it (I think I found it on Stack Overflow) and applied a Naive Bayes classifier...
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20

# train is now 60% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio)

# split the remaining 40% in half: test and validation are each 20% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio / (test_ratio + validation_ratio))

print(x_train, x_val, x_test)

gnb = GaussianNB()
gnb.fit(x_train, y_train)
predict = gnb.predict(x_test)
print(predict)
print("Accuracy: ", accuracy_score(y_test, predict))
I tried to use scikit-learn to make a learning curve, but according to the scikit-learn documentation the learning_curve function returns train_sizes, train_scores and valid_scores.
This is kind of confusing. I'm new to scikit-learn, and I don't have a clue how to make a learning curve with each portion of the data I split.
Does anyone know how to use the split data with scikit-learn's learning curves?
Thanks in advance.
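No answer is attached to this question here, but a minimal sketch of one possible approach (my assumption, using the variable names from the snippet above) is to let learning_curve run its own internal cross-validation on the training portion and keep the held-out test set for the final check:
import numpy as np
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

# learning_curve does its own splitting internally, so it only gets the training portion;
# x_test/y_test stay untouched for the final evaluation.
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), x_train, y_train,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5
)
print(train_sizes)
print(train_scores.mean(axis=1))   # mean training score per training-set size
print(valid_scores.mean(axis=1))   # mean validation score per training-set size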
Is there any way I can track my model's performance, in terms of its predicted labels, during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of confusion matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective here is to see how well the model is learning (during training). This is similar to analyzing the training loss, which is common practice for DNNs, and libraries such as PyTorch, Keras, and TensorFlow already have such a capability implemented.
I thought a quick browse of the web would give me what I want, but apparently not. I still believe this should be fairly simple, though.
Some ML practitioners like to work with three folds of data: training, validation, and test sets. The last should not be seen during training at all, but the middle one can be. For example, cross-validation uses K different validation folds "during the training phase" to get a less biased performance estimate when training on different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked.
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)

# Fit a classifier with train data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusionm_valid = confusion_matrix(y_valid, y_valid_pred)  # ... here ...

# Refit with all your training data, then evaluate on the untouched test set
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)
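If the goal is literally a list of confusion matrices, one way to extend this (my own sketch, not part of the answer above) is to compute one matrix per cross-validation fold on the training data:
import numpy as np
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import StratifiedKFold
from sklearn.svm import LinearSVC

X_arr, y_arr = np.asarray(X_train), np.asarray(y_train)
conf_matrices = []
for fold_train_idx, fold_valid_idx in StratifiedKFold(n_splits=5).split(X_arr, y_arr):
    clf = LinearSVC(random_state=42).fit(X_arr[fold_train_idx], y_arr[fold_train_idx])
    fold_pred = clf.predict(X_arr[fold_valid_idx])
    conf_matrices.append(confusion_matrix(y_arr[fold_valid_idx], fold_pred))

# conf_matrices now holds one confusion matrix per validation fold
for cm in conf_matrices:
    print(cm)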
I am having trouble with the fit function when applied to MLPClassifier. I carefully read scikit-learn's documentation about it but was not able to determine how validation works.
Is it cross-validation, or is there a split between training and validation data?
Thanks in advance.
The fit function per se does not include cross-validation, and it also does not apply a train/test split.
Fortunately you can do this on your own.
Train/test split:
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33)  # test set size is 0.33

clf = MLPClassifier()
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)  # predict on the test set
K-fold cross-validation:
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

kf = KFold(n_splits=2)
kf.get_n_splits(X)

clf = MLPClassifier()
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)  # predict on the held-out fold
Multiple functions are available for cross-validation; you can read more about them here. The k-fold shown above is just an example.
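For instance (my own illustration, not part of the original answer), cross_val_score wraps the loop above and returns one score per fold:
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

# one independent MLP is trained and scored per fold; nothing is shared between folds
scores = cross_val_score(MLPClassifier(), X, y, cv=5)
print(scores)          # one accuracy value per fold
print(scores.mean())   # average generalization estimate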
EDIT:
Thanks for this answer, but basically how does the fit function work concretely? Does it just train the network on the given data (i.e. the training set) until max_iter is reached and that's it?
I am assuming you are using the default configuration of MLPClassifier. In that case the fit function optimizes the network with the Adam optimizer, and training runs until max_iter is reached or until the loss stops improving by more than tol for several consecutive iterations.
Moreover, in the K-fold cross-validation, is the model improving as the loop goes through, or does it just restart from scratch?
Actually cross-validation is not used to improve the performance of your network; it is a methodology to test how well your algorithm generalizes to different data. For k-fold, k independent classifiers are trained and tested.
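If you want to see what fit actually did under the default settings, MLPClassifier exposes a few attributes after training (a small illustration of the point above, reusing X_train/y_train from the earlier snippet):
from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(max_iter=200)  # 200 is the default cap
clf.fit(X_train, y_train)

print(clf.n_iter_)           # iterations actually run; can be < max_iter if the loss stopped improving
print(clf.loss_)             # final training loss
print(len(clf.loss_curve_))  # per-iteration training loss, similar to a loss curve in Keras/PyTorch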