I have 5 different sets of data and want to compute evaluation metrics for each set using neural network regression. I noticed that the R^2 drops monotonically with every loop iteration. I'm fairly sure there is a specific point in my code causing this, but I can't find it.
My code:
import numpy as np
import pandas as pd
from scipy.special import rel_entr
from sklearn.model_selection import train_test_split, KFold, cross_val_predict
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

r_2lol, maelol, rmselol, mbelol, kllol = [], [], [], [], []

for sensor, sensorlol, name in zip(sensors, sensorslol, names):
    x_train, x_test, y_train, y_test = train_test_split(
        sensor, reference, test_size=0.2, random_state=42)
    x_trainl, x_testl, y_trainl, y_testl = train_test_split(
        sensorlol, referencelol, test_size=0.2, random_state=42)
    kf = KFold(7, shuffle=True, random_state=42)
    ann = MLPRegressor(hidden_layer_sizes=int(node), activation='relu',
                       learning_rate='constant', learning_rate_init=initl,
                       shuffle=False)
    ann.fit(x_train, y_train)
    # Note: cross_val_predict clones `ann` and refits it on each fold,
    # so the fit above does not influence these predictions.
    m_predictionlol = cross_val_predict(ann, sensorlol, referencelol, cv=kf)
    R2lol = r2_score(referencelol, m_predictionlol)
    MAElol = mean_absolute_error(referencelol, m_predictionlol)
    # RMSE is the square root of the MSE (without np.sqrt this was MSE)
    RMSElol = np.sqrt(mean_squared_error(referencelol, m_predictionlol))
    MBElol = np.mean(m_predictionlol - referencelol)
    r_2lol.append(R2lol)
    maelol.append(MAElol)
    rmselol.append(RMSElol)
    mbelol.append(MBElol)
    # KL divergence between the normalized reference and prediction distributions
    sumref = np.sum(referencelol)
    probref = referencelol / sumref
    sumtest = np.sum(m_predictionlol)
    probtest = m_predictionlol / sumtest
    KLlol = sum(rel_entr(probtest, probref))
    kllol.append(KLlol)
    del m_predictionlol, sensorlol

dataframe1 = pd.DataFrame(list(zip(lst, r_2lol, maelol, rmselol, mbelol, kllol)),
                          columns=['Sensor', 'R^2', 'MAE', 'RMSE', 'MBE', 'KL'])
And results:
Sensor R^2 MAE RMSE MBE KL
0 I 0.803568 1.776084 5.702426 0.097944 0.044695
1 H 0.739653 2.013070 7.557870 0.102656 0.053525
2 L 0.722556 2.074596 8.054198 -0.143503 0.058237
3 G 0.696291 2.193398 8.816680 0.261528 0.062377
4 J 0.677972 2.251240 9.348475 -0.000313 0.068745
I have set random seeds to reduce the randomness of the model, and I delete the prediction variables at the end of every loop to avoid any carry-over between iterations.
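One detail worth noting: the random_state=42 values above seed train_test_split and KFold, but MLPRegressor has its own random_state controlling weight initialization, which is left unseeded here. A minimal sketch of seeding the regressor as well (keeping the original hyperparameters; `node` and `initl` come from the surrounding code):
# Sketch: also seed the network itself; without random_state each
# MLPRegressor construction starts from a different weight initialization.
ann = MLPRegressor(hidden_layer_sizes=int(node), activation='relu',
                   learning_rate='constant', learning_rate_init=initl,
                   shuffle=False, random_state=42)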
Related
I am using tensorflow and keras for a binary classification problem.
I have only 81 training samples (test size 21), but ~1900 features. I know that is too few samples for so many features, but it is a biological problem (gene-expression data), so I have to deal with it.
My model looks like this (I vary the neurons per layer, the number of hidden layers, regularization, and dropout to deal with the high-dimensional data):
from tensorflow import keras
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Dropout

model = Sequential()
model.add(Input((input_shape,)))
for i in range(num_hidden):
    model.add(Dense(n_neurons, activation="relu",
                    kernel_regularizer=keras.regularizers.l1_l2(l1_reg, l2_reg)))
    model.add(Dropout(dropout_rate))
model.add(Dense(1, activation="sigmoid"))
ann_optimizer = keras.optimizers.Adam()
model.compile(loss="binary_crossentropy",
              optimizer=ann_optimizer, metrics=['accuracy'])
I am using 10-fold nested cross-validation with a grid search in the inner fold, like this:
# fit and evaluate the model
# configure the inner cross-validation procedure (5 splits, 80% inner-training / 20% inner-test)
cv_inner = ShuffleSplit(n_splits=5, test_size=0.2, random_state=1)
# define the model
ann = KerasRegressor(build_fn=regressionModel_sequential, input_shape=X_train.shape[1],
                     batch_size=batch_size)
# use a pipeline to prevent leaky preprocessing (StandardScaler is fit on the 80% inner-training data only)
pipe = Pipeline(steps=[('scaler', StandardScaler()), ('ann', ann)])
# define the grid search with inner cv to get good parameters
grid_search_result = GridSearchCV(
    pipe, param_grid, n_jobs=-1, cv=cv_inner, refit=True, verbose=0)
# refit=True fits a final model on the entire inner-training dataset
# execute search
grid_search_result.fit(X_train, y_train, ann__verbose=0)
logger.info('>>>>> est=%.3f, params=%s' % (grid_search_result.best_score_, grid_search_result.best_params_))
# rebuild the model with the best inner-CV parameters to get a loss curve
ann_val = regressionModel_sequential(input_shape=X_train.shape[1],
                                     n_neurons=grid_search_result.best_params_['ann__n_neurons'],
                                     l1_reg=grid_search_result.best_params_['ann__l1_reg'],
                                     l2_reg=grid_search_result.best_params_['ann__l2_reg'],
                                     num_hidden=grid_search_result.best_params_['ann__num_hidden'],
                                     dropout_rate=grid_search_result.best_params_['ann__dropout_rate'])
# Validation with the outer 20%
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
history = ann_val.fit(X_train, y_train, batch_size=batch_size, verbose=0,
                      validation_split=0.25, shuffle=True,
                      epochs=grid_search_result.best_params_['ann__epochs'])
plot_history(history, directory, i)
# use the best grid search result for predicting on the outer test dataset
y_predicted = ann_val.predict(X_test)
# print predictions
logger.info(y_predicted[:5])
logger.info(y_test[:5])
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_predicted))
mae = metrics.mean_absolute_error(y_test, y_predicted)  # was mean_squared_error, which is MSE, not MAE
r_squared = metrics.r2_score(y_test, y_predicted)
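Since this is a binary classification problem, it may also help to score the thresholded sigmoid outputs with classification metrics rather than RMSE/R^2. A minimal sketch (the 0.5 threshold is an assumption):
import numpy as np
from sklearn import metrics

# Hypothetical: turn the sigmoid probabilities into hard 0/1 labels at 0.5
y_pred_label = (np.ravel(y_predicted) > 0.5).astype(int)
acc = metrics.accuracy_score(y_test, y_pred_label)
auc = metrics.roc_auc_score(y_test, np.ravel(y_predicted))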
My loss curve seems good [loss plot], but accuracy is very bad [accuracy plot, example from one outer fold].
Does anyone have suggestions on what I could do to improve my results?
I also know that the underlying biological question is very hard, and maybe impossible, to solve.
I have a dataset, and I split it into percentages of 60, 20, 20: 60% for training, 20% for testing, and the rest for validation (I made this split because I need it).
I used the following code to split it (I think I found it on Stack Overflow) and applied a Naive Bayes classifier:
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score

train_ratio = 0.60
validation_ratio = 0.20
test_ratio = 0.20

# train is now 60% of the entire data set
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=1 - train_ratio)

# test is now 20% of the initial data set
# validation is now 20% of the initial data set
x_val, x_test, y_val, y_test = train_test_split(
    x_test, y_test, test_size=test_ratio / (test_ratio + validation_ratio))

print(x_train, x_val, x_test)

gnb = GaussianNB()
gnb.fit(x_train, y_train)
predict = gnb.predict(x_test)
print(predict)
print("Accuracy: ", accuracy_score(y_test, predict))
I tried to use scikit-learn to make a learning curve, but according to the scikit-learn documentation the learning_curve function returns train_sizes, train_scores, and valid_scores.
This is kind of confusing; I'm new to scikit-learn, and I don't have a clue how to make a learning curve with the data percentages I split above.
Does anyone know how to use pre-split data with scikit-learn's learning curves?
Thanks in advance.
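For context, a minimal sketch of how a fixed validation split can be fed to learning_curve via PredefinedSplit; the fold layout here is an assumption (-1 marks rows that always stay in training, 0 marks the single validation fold), and the arrays are assumed to be NumPy arrays:
import numpy as np
from sklearn.model_selection import learning_curve, PredefinedSplit
from sklearn.naive_bayes import GaussianNB

# -1 = row is always in the training split, 0 = row belongs to the
# single validation fold, so learning_curve always scores on x_val.
test_fold = np.concatenate([np.full(len(x_train), -1), np.zeros(len(x_val))])
cv = PredefinedSplit(test_fold)

x_all = np.concatenate([x_train, x_val])
y_all = np.concatenate([y_train, y_val])

train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), x_all, y_all, cv=cv, train_sizes=np.linspace(0.1, 1.0, 5))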
Is there any way that I can track my model's performance, in terms of its classified labels, during the training phase? Any classifier from sklearn would work as an example.
To be more specific, I want to get something like a list of Confusion Matrices here:
clf = LinearSVC(random_state=42).fit(X_train, y_train)
# ... here ...
y_pred = clf.predict(X_test)
My objective here is to see how well the model is learning during training. This is similar to analyzing the training loss, which is a common practice for DNNs; libraries such as PyTorch, Keras, and TensorFlow already have this capability implemented.
I thought a quick browse of the web would give me what I want, but apparently not. I still believe this should be fairly simple, though.
Some ML practitioners like to work with three folds of data: training, validation, and testing sets. The last should not be seen during training at all, but the validation set can be. For example, cross-validation uses K different validation folds "during the training phase" to get a less biased performance estimate when training on different parts of the data.
But you can do this on a single validation fold for the purpose of what you asked.
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from sklearn.metrics import confusion_matrix

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
X_train2, X_valid, y_train2, y_valid = train_test_split(X_train, y_train, test_size=0.2)

# Fit a classifier on the reduced training data only
clf = LinearSVC(random_state=42).fit(X_train2, y_train2)
y_valid_pred = clf.predict(X_valid)
confusionm_valid = confusion_matrix(y_valid, y_valid_pred)  # ... here ...

# Refit with all your training data, then evaluate on the held-out test set
clf = LinearSVC(random_state=42).fit(X_train, y_train)
y_pred = clf.predict(X_test)  # X_valid is now part of the training data, so predict on X_test
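If you want a confusion matrix after every training pass, closer to a per-epoch loss curve, one option is an incremental learner. A minimal sketch swapping LinearSVC for SGDClassifier, which supports partial_fit (the epoch count is an arbitrary assumption):
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import confusion_matrix

clf = SGDClassifier(loss="hinge", random_state=42)  # linear SVM-like objective
matrices = []
for epoch in range(10):  # hypothetical number of passes over the data
    clf.partial_fit(X_train2, y_train2, classes=np.unique(y_train2))
    matrices.append(confusion_matrix(y_valid, clf.predict(X_valid)))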
I have 11 rows of data, and my goal is to train the network on 10, and validate on 1 specific row (not random).
The aim is to work through validating on each single row while training on the other 10, until I have a prediction for all 11 rows.
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
The train/test split shown above doesn't seem like it will work, as it is random. Is there a way to specify exactly which rows are used for training and testing?
What you are looking for seems to be k-fold cross-validation. With k equal to the number of rows, each fold uses one row as the validation set and trains on the remaining k - 1 folds, and so forth. I would suggest using sklearn's built-in method.
from sklearn.model_selection import KFold

n_splits = 11  # one row per fold
for train_idx, test_idx in KFold(n_splits).split(x):
    x_train, x_test = x[train_idx], x[test_idx]
    y_train, y_test = y[train_idx], y[test_idx]
    # do your stuff
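Since one row per fold is exactly leave-one-out cross-validation, sklearn also exposes it directly. A minimal sketch collecting a prediction for all 11 rows (`model` is a placeholder for whatever estimator you use):
import numpy as np
from sklearn.model_selection import LeaveOneOut

preds = np.empty(len(y))
for train_idx, test_idx in LeaveOneOut().split(x):
    model.fit(x[train_idx], y[train_idx])        # `model` is your estimator
    preds[test_idx] = model.predict(x[test_idx])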
What is the difference or relationship between the Neural Network (NN) epoch and the max_iter parameter in scikit-learn?
For instance, in the code below I evaluate the NN model for max_iter from 1 up to 10000, computing the mean absolute error at each step. Can each of these iterations be seen as an epoch? See the image/link below, please!
Thank you very much!
for i in range(1, 10000, 10):
    clf = MLPRegressor(max_iter=i, solver='lbfgs', alpha=1e-6,
                       activation='relu',  # relu improved training a lot
                       hidden_layer_sizes=hidden_layer_sizes, random_state=1)
    clf.fit(X_train_scaled, y_train)
    mae_B = cross_val_score(clf, X_train_scaled, y_train,
                            scoring="neg_mean_absolute_error", cv=10)
    print(i, float(-mae_B.mean()), clf.score(X_train_scaled, y_train),
          clf.score(X_test_scaled, y_test))
max_iter is the maximum number of training iterations. For the stochastic solvers ('sgd' and 'adam') this is the maximum number of epochs, i.e. how many times each data point will be used; for 'lbfgs' it counts optimizer steps instead. It is called a maximum because learning can also stop earlier, based on the other termination criteria, tol and n_iter_no_change (the latter only applies to 'sgd' and 'adam'). Hence, do not loop through different max_iter values; tweak tol and n_iter_no_change instead if you want to avoid overfitting.
Try the following: set a reasonably large number of epochs in max_iter and then play with n_iter_no_change and tol. Reference: the MLPRegressor documentation.
clf = MLPRegressor(max_iter=50, solver='adam',  # 'lbfgs' ignores n_iter_no_change
                   alpha=1e-6, activation='relu',
                   hidden_layer_sizes=hidden_layer_sizes, random_state=1,
                   tol=1e-3, n_iter_no_change=5)
clf.fit(X_train_scaled, y_train)
mae_B = cross_val_score(clf, X_train_scaled, y_train,
                        scoring="neg_mean_absolute_error", cv=10)
print(float(-mae_B.mean()), clf.score(X_train_scaled, y_train),
      clf.score(X_test_scaled, y_test))
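As a side note, for the stochastic solvers you can inspect the per-epoch training loss directly instead of refitting with increasing max_iter. A minimal sketch (hyperparameters carried over from above; n_iter_ and loss_curve_ are attributes of the fitted model):
from sklearn.neural_network import MLPRegressor

clf = MLPRegressor(max_iter=500, solver='adam', alpha=1e-6, activation='relu',
                   hidden_layer_sizes=hidden_layer_sizes, random_state=1,
                   tol=1e-3, n_iter_no_change=5)
clf.fit(X_train_scaled, y_train)
print(clf.n_iter_)       # epochs actually run before a stopping criterion hit
print(clf.loss_curve_)   # training loss per epoch ('sgd'/'adam' only)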