I would like to track the accuracy per epoch in scikit-learn. However, so far, I have been unsuccessful.
This is the part of my code that classifies with MLPClassifier:
NUM_EPOCHS = 1000
LOG_FOR_EVERY = 10
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
clf = MLPClassifier(hidden_layer_sizes=(18, 175, 256), batch_size=528,
                    learning_rate_init=0.0001, beta_1=0.001,
                    beta_2=0.001, max_iter=1, warm_start=True)
for i in range(NUM_EPOCHS):
    clf.fit(X_train, y_train.ravel())
I have also made a graph with this result, but I need to make it continuous and further increase the accuracy.
Why is the accuracy not increasing?
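MLPClassifier.fit does not accept an epochs argument. Please try partial_fit instead, which performs one pass over the training data per call, and score the model inside the loop:
for i in range(NUM_EPOCHS):
    clf.partial_fit(X_train, y_train.ravel(), classes=np.unique(y_train))
    if (i + 1) % LOG_FOR_EVERY == 0:
        print(i + 1, clf.score(X_test, y_test))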
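As a fuller illustration, here is a minimal sketch of recording train and test accuracy after every epoch with partial_fit. The synthetic dataset, the smaller network, and the list names train_acc and test_acc are assumptions made for the example, not part of the original code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

NUM_EPOCHS = 50
LOG_FOR_EVERY = 10

clf = MLPClassifier(hidden_layer_sizes=(32,), learning_rate_init=0.001, random_state=0)
classes = np.unique(y_train)  # partial_fit needs the full label set on the first call

train_acc, test_acc = [], []
for epoch in range(NUM_EPOCHS):
    # Each partial_fit call performs one pass over the data it is given.
    clf.partial_fit(X_train, y_train, classes=classes)
    train_acc.append(clf.score(X_train, y_train))
    test_acc.append(clf.score(X_test, y_test))
    if (epoch + 1) % LOG_FOR_EVERY == 0:
        print(f"epoch {epoch + 1}: train={train_acc[-1]:.3f} test={test_acc[-1]:.3f}")
```

Plotting train_acc and test_acc against the epoch index then gives a continuous accuracy curve.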
I'm using the train_test_split from sklearn.model_selection. My code looks like the following:
x_train, x_test , y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)
Edit: After this is done, how do I fit these to the linear regression model, and then see how good this model is? That is, which of the four components (x_train, x_test, y_train, or y_test) would I use to calculate MSE or RMSE? And how exactly would I do that?
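One way to approach this, sketched under the assumption of a synthetic regression dataset: fit on x_train and y_train, predict on x_test, and compare those predictions with y_test. The data and the model variable name are illustrative only.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with your own x and y).
x, y = make_regression(n_samples=200, n_features=3, noise=10.0, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=1234)

# Fit on the training split only.
model = LinearRegression()
model.fit(x_train, y_train)

# Evaluate on the held-out test split.
y_pred = model.predict(x_test)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
print("MSE:", mse, "RMSE:", rmse)
```

Computing the same metrics on x_train and y_train as well gives a rough check for overfitting: a training error far below the test error suggests the model does not generalize.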
#split dataset in features and target variable
feature_cols = ['RIAGENDR_0', 'RIDAGEYR', 'RIDRETH3_2', 'RIDRETH3_3', 'RIDRETH3_4', 'RIDRETH3_6', 'RIDRETH3_7', 'INDFMPIR', 'DMDMARTZ_1.0', 'DMDMARTZ_2.0', 'DMDMARTZ_3.0', 'DMDMARTZ_4.0', 'DMDMARTZ_6.0', 'DMDEDUC2', 'RFXT010', 'BMXWT', 'BMXBMI', 'URXUMA', 'LBDHDD', 'LBXFER', 'LBXGH', 'LBXBPB', 'LBXBCD', 'LBXBSE', 'LBXBMN', 'URXUBA', 'URXUCD', 'URXUCO', 'URXUCS', 'URXUMO', 'URXUMN', 'URXUPB', 'URXUSB', 'URXUSN', 'URXUTL', 'URXUTU']
X = data[feature_cols] # Features
scale = StandardScaler()
X = scale.fit_transform(X)
y = data['depre_score'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1) # 70% training and 30% test
clf = DecisionTreeClassifier()
clf = clf.fit(X_train,y_train)
y_pred = clf.predict(X_test)
print(y_test)
print(y_pred)
confusion = metrics.confusion_matrix(y_test, y_pred)
print(confusion)
print("Accuracy:",metrics.accuracy_score(y_test, y_pred))
recall_sensitivity = metrics.recall_score(y_test, y_pred, pos_label=1)
recall_specificity = metrics.recall_score(y_test, y_pred, pos_label=0)
print(recall_sensitivity, recall_specificity)
Why do you think you are doing something wrong? Perhaps your data are such that you can achieve a perfect classification, e.g., see this mushroom classification.
Having said that, it is also possible that there is some leakage in your data, as pointed out by @gtomer. That means an exact point that is present in the training set also appears in your test set. You can run K-fold cross-validation on your data and see whether the accuracy holds up. Secondly, try different classifiers too (a Random Forest is usually more robust than a single Decision Tree).
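A minimal sketch of that check, assuming a synthetic dataset in place of the real features: it runs stratified 5-fold cross-validation for both a decision tree and a random forest and stores the per-fold accuracies in a hypothetical results dictionary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=500, n_features=20, random_state=1)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
models = {
    "decision tree": DecisionTreeClassifier(random_state=1),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=1),
}

results = {}
for name, model in models.items():
    # One accuracy value per held-out fold.
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f} std={scores.std():.3f}")
```

If the accuracy stays near perfect across all folds and for both models, the data may simply be easy to separate; a large spread between folds is a hint to look for duplicated or leaked rows.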
I'm trying to do a multiclass classification with several machine learning models, using this function that I have created:
def model_roc(X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    pipeline1 = imbpipeline(steps=[['pca', PCA()],
                                   # sampling_strategy must be passed by keyword in recent imbalanced-learn versions
                                   ['smote', SMOTE(sampling_strategy='not majority', random_state=11)],
                                   ['scaler', MinMaxScaler()],
                                   # saga supports both the l1 and l2 penalties searched below
                                   ['classifier', LogisticRegression(solver='saga', random_state=11, max_iter=1000)]])
    stratified_kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=11)
    param_grid1 = {'classifier__penalty': ['l1', 'l2'],
                   'classifier__C': [0.001, 0.01, 0.1, 1, 10, 100, 1000]}
    grid_search1 = GridSearchCV(estimator=pipeline1,
                                param_grid=param_grid1,
                                # built-in scorer: weighted one-vs-one ROC AUC computed from predict_proba
                                scoring='roc_auc_ovo_weighted',
                                cv=stratified_kfold,
                                n_jobs=-1)
    print("#" * 100)
    print(pipeline1['classifier'])
    grid_search1.fit(X_train, y_train)
    cv_score1 = grid_search1.best_score_
    test_score1 = grid_search1.score(X_test, y_test)
    print('cv_score', cv_score1, 'test_score', test_score1)
    return
I have 2 questions:
Can I get multiple metrics from the same function(ROC/Accuracy/precision and F1_score) as it is imbalanced data with multiple classes?
I need to plot the learning curve, but I don't know how to do this out of my function.
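For the first question, one option is to pass a dictionary of scorers to GridSearchCV and name the metric used for refitting. The sketch below assumes a synthetic three-class dataset and a plain scikit-learn Pipeline without the SMOTE and PCA steps; the dictionary keys and the metrics_at_best name are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic three-class stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=11)

pipeline = Pipeline([
    ("scaler", MinMaxScaler()),
    ("classifier", LogisticRegression(max_iter=1000)),
])

# Several built-in scorers evaluated in the same search.
scoring = {
    "roc_auc": "roc_auc_ovo_weighted",
    "accuracy": "accuracy",
    "precision": "precision_weighted",
    "f1": "f1_weighted",
}

grid = GridSearchCV(
    pipeline,
    param_grid={"classifier__C": [0.1, 1, 10]},
    scoring=scoring,
    refit="roc_auc",  # metric used to pick and refit the best model
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=11),
)
grid.fit(X, y)

# Mean cross-validated value of every metric for the selected parameters.
best = grid.best_index_
metrics_at_best = {name: grid.cv_results_[f"mean_test_{name}"][best] for name in scoring}
print(grid.best_params_, metrics_at_best)
```

With multi-metric scoring, refit has to name one of the scorers (or be set to False), because best_params_ and best_score_ are defined with respect to a single metric.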
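To plot a learning curve of roc_auc_score:
def plot_learning_curves(model, X, y):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=11)
    train_scores, test_scores = [], []
    n_classes = len(set(y_train))
    for m in range(1, len(X_train) + 1):
        # ROC AUC is undefined until the subset contains every class
        if len(set(y_train[:m])) < n_classes:
            continue
        model.fit(X_train[:m], y_train[:m])
        # multiclass ROC AUC needs class probabilities, so the model must implement predict_proba
        y_train_proba = model.predict_proba(X_train[:m])
        y_test_proba = model.predict_proba(X_test)
        # score the training subset against its own labels, not the full y_train
        train_scores.append(roc_auc_score(y_train[:m], y_train_proba, multi_class='ovo', average='weighted'))
        test_scores.append(roc_auc_score(y_test, y_test_proba, multi_class='ovo', average='weighted'))
    plt.plot(train_scores, "r-+", linewidth=2, label="train")
    plt.plot(test_scores, "b-", linewidth=2, label="test")
    plt.legend()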
We iterate m from 1 to the whole length of the training set, slicing m records from the training data and training the model on this subset.
The model's roc_auc_score is calculated and added to a list, showing the progress of its roc_auc_score as the model trains on more and more instances.
-modified from ageron-handsonml2
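As an alternative to refitting once per sample, scikit-learn's learning_curve helper evaluates a handful of training-set sizes with cross-validation. The sketch below is a swapped-in technique, not the approach above; it assumes a synthetic three-class dataset and a plain LogisticRegression rather than the full pipeline.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, learning_curve

# Synthetic three-class stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=400, n_features=10, n_informative=5,
                           n_classes=3, random_state=11)

train_sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000),
    X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),  # five sizes, as fractions of each training fold
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=11),
    scoring="roc_auc_ovo_weighted",
    shuffle=True,
    random_state=11,
)

# Average over the cross-validation folds for each training size.
train_mean = train_scores.mean(axis=1)
val_mean = val_scores.mean(axis=1)
for size, tr, va in zip(train_sizes, train_mean, val_mean):
    print(f"n={size}: train={tr:.3f} validation={va:.3f}")
```

The arrays train_sizes, train_mean and val_mean can be passed to plt.plot to draw the curve outside of the model function.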
I have split my data into 3 sets train, test and validation as shown below:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)
I wanted to ask where do I put the validation set in this code:
#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))
I suggest you read up some more on why you would split your data into train, validation and test sets.
In the code you show, you can use the validation data in the same way you use your test data, but that doesn't fully make sense.
There is a lot to it; I think this can get you started. Link
In short and very simplified, the general idea is that you use results from the validation data to make adjustments to your model and improve its performance.
The test data you use only at the very end, for your final model evaluation, to make sure it actually performs well on unseen data.
(Worst case, if you only use two sets, you might adjust parameters until the model works well for those two datasets but still not for any other.)
You can show the validation score and accuracy like this:
#Defining Model
model = LogisticRegression()
model.fit(X_train, y_train)
y_pred_val = model.predict(X_val)
print("Accuracy on val is:",accuracy_score(y_val, y_pred_val))
y_pred = model.predict(X_test)
print("Accuracy on test is:",accuracy_score(y_test,y_pred))
#Measure Precision,Recall
print("Precision Score for Val: ",precision_score(y_val, y_pred_val, average='macro'))
print("Recall Score for Val: ",recall_score(y_val, y_pred_val, average='macro'))
print("F1-Score for Val :",f1_score(y_val, y_pred_val, average='macro'))
print("Precision Score: ",precision_score(y_test, y_pred,average='macro'))
print("Recall Score: ",recall_score(y_test, y_pred,average='macro'))
print("F1-Score :",f1_score(y_test, y_pred,average='macro'))
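To make the role of each split concrete, here is a minimal sketch that follows the common convention: the validation set picks a hyperparameter and the test set is scored once at the end. The synthetic data, the candidate values for C, and the names best_C and final_model are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; replace with your own X and y).
X, y = make_classification(n_samples=500, n_features=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=1)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=1)

# Use the validation set to choose the regularization strength C.
best_C, best_val_acc = None, -1.0
for C in [0.01, 0.1, 1, 10]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    val_acc = accuracy_score(y_val, model.predict(X_val))
    if val_acc > best_val_acc:
        best_C, best_val_acc = C, val_acc

# Touch the test set only once, with the chosen setting.
final_model = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
test_acc = accuracy_score(y_test, final_model.predict(X_test))
print("best C:", best_C, "validation accuracy:", best_val_acc, "test accuracy:", test_acc)
```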
I used a random-forest classifier to classify my dataset; I want to use cross-validation; my problem is I could not find a way to know the accuracy of train and test splits, so is this possible?
Here is the code I used:
# train_test_split returns the larger (train) part first, then the test part
X_rem, X_test, y_rem, y_test = train_test_split(X, Y, test_size=0.1)
X_train, X_valid, y_train, y_valid = train_test_split(X_rem, y_rem, train_size=0.80)
cv = KFold(n_splits=10, random_state=1, shuffle=True)
# create model
model = RandomForestClassifier(n_estimators =400)
scoresranP = cross_val_score(model, X_rem, y_rem, cv=cv, n_jobs=-1)
print('Accuracy of Random Forest: %.3f (%.3f)' % (mean(scoresranP), std(scoresranP)))
If I am right, in this case "scoresranP" will give me the training accuracy, so can I get the test accuracy using the test split?
Or am I wrong? Can anyone tell me if I can use cross_val_score with three splits (train, valid, test)?
I really appreciate any help you can provide.
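If I am correct, cross_val_score does not give you a training accuracy: each value in scoresranP is the accuracy on the held-out fold of one split, and Random Forest here is a classification method, not a regression one. To also see the accuracy on the training folds you can use cross_validate with return_train_score=True, and you can score the untouched test split separately after fitting the model.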
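A minimal sketch of getting all three numbers, assuming a synthetic dataset and a smaller forest to keep it quick: cross_validate reports accuracy on the training folds and on the held-out folds, and the separate test split is scored once after a final fit.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate, train_test_split

# Synthetic stand-in data (an assumption; replace with your own X and Y).
X, Y = make_classification(n_samples=400, n_features=10, random_state=1)

# Hold out a test split that cross-validation never sees.
X_rem, X_test, y_rem, y_test = train_test_split(X, Y, test_size=0.1, random_state=1)

cv = KFold(n_splits=5, random_state=1, shuffle=True)
model = RandomForestClassifier(n_estimators=50, random_state=1)

results = cross_validate(model, X_rem, y_rem, cv=cv, scoring="accuracy",
                         return_train_score=True)
train_acc = results["train_score"].mean()  # accuracy on the folds used for fitting
val_acc = results["test_score"].mean()     # accuracy on the held-out folds

# Final check on the untouched test split.
model.fit(X_rem, y_rem)
test_acc = model.score(X_test, y_test)
print(f"train={train_acc:.3f} validation={val_acc:.3f} test={test_acc:.3f}")
```

Because K-fold cross-validation already rotates the validation fold, a separate fixed validation split is not needed here; train and test splits are enough.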