How to use cross validation in keras classifier - python

I was practicing Keras classification on imbalanced data. I followed the official example:
and used the scikit-learn API to do cross-validation.
I have tried the model with different parameters.
However, every time one of the 3 folds has a value of 0.
results [0.99242424 0.99236641 0. ]
What am I doing wrong?
How can I get ALL THREE validation recall values to be of the order of 0.8?
import numpy as np
import pandas as pd
import tensorflow as tf
from tensorflow import keras
from sklearn.model_selection import train_test_split
from keras.wrappers.scikit_learn import KerasClassifier
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import StratifiedKFold
import os
import random
SEED = 100
os.environ['PYTHONHASHSEED'] = str(SEED)
# load the data
ifile = ""
df = pd.read_csv(ifile,compression='zip')
# train test split
target = 'Class'
Xtrain,Xtest,ytrain,ytest = train_test_split(df.drop([target],axis=1),
print(f"Xtrain shape: {Xtrain.shape}")
print(f"ytrain shape: {ytrain.shape}")
# build the model
def build_fn(n_feats):
    model = keras.models.Sequential()
    model.add(keras.layers.Dense(256, activation="relu", input_shape=(n_feats,)))
    model.add(keras.layers.Dense(256, activation="relu"))
    model.add(keras.layers.Dense(256, activation="relu"))
    # last layer is dense 1 for binary sigmoid
    model.add(keras.layers.Dense(1, activation="sigmoid"))
    # compile
    return model
# fitting the model
n_feats = Xtrain.shape[-1]
counts = np.bincount(ytrain)
weight_for_0 = 1.0 / counts[0]
weight_for_1 = 1.0 / counts[1]
class_weight = {0: weight_for_0, 1: weight_for_1}
FIT_PARAMS = {'class_weight' : class_weight}
clf_keras = KerasClassifier(build_fn=build_fn,
                            n_feats=n_feats)  # custom argument
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=SEED)
results = cross_val_score(clf_keras, Xtrain, ytrain,
                          cv=skf,
                          scoring='recall',
                          fit_params=FIT_PARAMS,
                          n_jobs=-1)
print('results', results)
Xtrain shape: (227845, 30)
ytrain shape: (227845,)
results [0.99242424 0.99236641 0. ]
CPU times: user 3.62 s, sys: 117 ms, total: 3.74 s
Wall time: 5min 15s
The third recall comes out as 0. I expect it to be of the order of 0.8. How can I make sure all three values are around 0.8 or more?

You have chosen to use the sklearn wrappers for your model. They have benefits, but they hide the model training process. Instead, I trained the model separately with a validation dataset added. The code for this would be:
clf_1 = KerasClassifier(build_fn=build_fn,
                        n_feats=n_feats)
clf_1.fit(Xtrain, ytrain, class_weight=class_weight,
          validation_data=(Xtest, ytest))
In the output it is clearly seen that while the loss goes down, recall is not stable. This leads to poor performance in CV, reflected in the zeros in the CV results that you observed.
I fixed this by reducing the learning rate to just 0.0001. Although that is 100 times lower than yours, it reaches 98% recall on train and 100% (or close) on test in just 10 epochs.
Your code needs just one fix to achieve stable results: change the LR to a much lower one, like 0.0001.
You can experiment with LR in the range < 0.001.
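A minimal sketch of what that fix could look like. The compile step is not shown in the question, so Adam, binary cross-entropy and the recall metric are assumptions here, and the learning_rate argument of build_fn is a hypothetical addition for illustration:

```python
from tensorflow import keras

def build_fn(n_feats, learning_rate=1e-4):
    """Same architecture as in the question, compiled with a low learning rate."""
    model = keras.models.Sequential([
        keras.Input(shape=(n_feats,)),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(256, activation="relu"),
        keras.layers.Dense(1, activation="sigmoid"),
    ])
    # Assumption: Adam + binary cross-entropy; the point is only the learning rate.
    model.compile(
        optimizer=keras.optimizers.Adam(learning_rate=learning_rate),
        loss="binary_crossentropy",
        metrics=[keras.metrics.Recall(name="recall")],
    )
    return model

model = build_fn(n_feats=30)
print(model.optimizer.get_config()["learning_rate"])
```

Exposing the learning rate as a build_fn argument makes it easy to try several values below 0.001 without touching the rest of the model.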
For reference, with LR 0.0001 I got:
results [0.99242424 0.97709924 1. ]
Good luck!
PS: thanks for including a compact and complete MWE


F1 Score metric per class in Tensorflow

I have implemented the following metric to look at Precision and Recall of the classes I deem relevant.
metrics=[tf.keras.metrics.Recall(class_id=1, name='Bkwd_R'),tf.keras.metrics.Recall(class_id=2, name='Fwd_R'),tf.keras.metrics.Precision(class_id=1, name='Bkwd_P'),tf.keras.metrics.Precision(class_id=2, name='Fwd_P')]
How can I implement the same in TensorFlow 2.5 for the F1 score (i.e. specifically for class 1 and class 2, and not class 0), without a custom function?
Using this metric setup:
tfa.metrics.F1Score(num_classes = 3, average = None, name = f1_name)
I get the following during training:
13367/13367 [==============================] 465s 34ms/step - loss: 0.1683 - f1_score: 0.5842 - val_loss: 0.0943 - val_f1_score: 0.3314
and when I do model.evaluate:
224/224 [==============================] - 11s 34ms/step - loss: 0.0665 - f1_score: 0.3325
and the scoring =
Score: [0.06653735041618347, array([0.99740255, 0. , 0. ], dtype=float32)]
The problem is that this is training based on the average, but I would like to train on the F1 score with a sensible averaging of, or on each of, the last two values/classes in the array (which are 0 in this case).
I will accept a non-TensorFlow-specific function that gives the desired result (with the full function and the call during fit), but I was really hoping for something using the existing TensorFlow code, if it exists.
You can have a look at tfa.metrics.F1Score in the tensorflow-addons package.
Specifically, if you need a per-class score, set the average param to None; setting it to 'macro' instead returns the unweighted mean of the per-class scores.
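If a framework-independent check is acceptable, per-class F1 can also be computed directly from predicted labels. The sketch below uses plain NumPy; the helper name per_class_f1 and the toy labels are made up for illustration:

```python
import numpy as np

def per_class_f1(y_true, y_pred, classes):
    """Return {class_id: F1}, computed one-vs-rest for each requested class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = {}
    for c in classes:
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); reported as 0 when the class never appears
        scores[c] = float(2 * tp / denom) if denom > 0 else 0.0
    return scores

y_true = [0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0, 2]
print(per_class_f1(y_true, y_pred, classes=[1, 2]))
```

With softmax outputs, take np.argmax(probabilities, axis=1) first to obtain the predicted labels.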
As is mentioned in David Harris' comment, a neural network model is trained on loss functions, not on metric scores. Losses help drive the model towards a solution to provide accurate labels via backpropagation. Metrics help to provide a comparable evaluation of that model's performance that are a lot more human-legible.
So, that being said, I feel like what you're saying in your question is: "there are three classes, and I want the model to care more about the last two of the three".
IF that's the case, one approach you can take is to weight your samples by label. Let's say that you have labels in an array y_train.
# Which classes do you want to focus on?
classes_i_care_about = [1, 2]
# Initialize all weights to 1.0
sample_weight = np.ones(shape=(len(y_train),))
# Give the classes you care about 50% more weight
sample_weight[np.isin(y_train, classes_i_care_about)] = 1.5
This is the best advice I can offer without knowing more. If you're looking for other ways to make your model do better on certain classes, more details would be useful, such as:
What are the proportions of labels in your dataset?
What is the last layer of your model architecture? Dense(3, activation="softmax")?
What loss are you using?
Here's a more complete, reproducible example that shows what I'm talking about with the sample weights:
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
import tensorflow_addons as tfa
iris_data = load_iris() # load the iris dataset
x = iris_data.data
y_ = iris_data.target.reshape(-1, 1) # Convert data to a single column
# One Hot encode the class labels
encoder = OneHotEncoder(sparse=False)
y = encoder.fit_transform(y_)
# Split the data for training and testing
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.20)
# Build the model
def get_model():
    model = Sequential()
    model.add(Dense(10, input_shape=(4,), activation='relu', name='fc1'))
    model.add(Dense(10, activation='relu', name='fc2'))
    model.add(Dense(3, activation='softmax', name='output'))
    # Adam optimizer with learning rate of 0.001
    optimizer = Adam(learning_rate=0.001)
    # Loss and metrics inferred from the one-hot labels and the three values printed below
    model.compile(
        optimizer,
        loss='categorical_crossentropy',
        metrics=['accuracy', tfa.metrics.F1Score(num_classes=3, average=None)],
    )
    return model
model = get_model()
# The fit call is missing from the post; epochs and batch size are assumptions
model.fit(train_x, train_y, epochs=25, batch_size=5, verbose=2)
results = model.evaluate(test_x, test_y)
print('Final test set loss: {:4f}'.format(results[0]))
print('Final test set accuracy: {:4f}'.format(results[1]))
print('Final test F1 scores: {}'.format(results[2]))
Final test set loss: 0.585964
Final test set accuracy: 0.633333
Final test F1 scores: [1. 0.15384616 0.6206897 ]
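Now, we add weight to classes 1 and 2:
sample_weight = np.ones(shape=(len(train_y),))
sample_weight[
    (train_y[:, 1] == 1) | (train_y[:, 2] == 1)
] = 1.5
model = get_model()
# The fit call is missing from the post; epochs and batch size are assumptions
model.fit(train_x, train_y, sample_weight=sample_weight, epochs=25, batch_size=5, verbose=2)
results = model.evaluate(test_x, test_y)
print('Final test set loss: {:4f}'.format(results[0]))
print('Final test set accuracy: {:4f}'.format(results[1]))
print('Final test F1 scores: {}'.format(results[2]))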
Final test set loss: 0.437623
Final test set accuracy: 0.900000
Final test F1 scores: [1. 0.8571429 0.8571429]
Here, the model has emphasized learning these, and their respective performance is improved.

Classification ANN stuck at 60%

I am trying to create a binary classifier on a data set of 10,000. I have tried multiple Activators and Optimizers, however the results are always between 56.8% and 58.9%. Given the fairly steady results over many dozen iterations, I assume the problem is either:
My dataset is not classifiable
My model is broken
This is the data set: training-set.csv
I may be able to get 2000 more records but that would be it.
My question is: is there something in the way my model is constructed that is preventing it from learning to a higher degree?
Note that I am happy to have as many layers and nodes as needed, and time is not a factor in generating the model.
import numpy as np
import pandas
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.wrappers.scikit_learn import KerasClassifier
dataframe = pandas.read_csv(r"training-set.csv", index_col=None)
dataset = dataframe.values
X = dataset[:,0:48].astype(float)
Y = dataset[:,48]
#count the input variables
col_count = X.shape[1]
#normalize X
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
X_scale = sc_X.fit_transform(X)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_scale, Y, test_size = 0.2)
# define baseline model
activator = 'linear' #'relu' 'sigmoid' 'softmax' 'exponential' 'linear' 'tanh'
#opt = 'Adadelta' #adam SGD nadam RMSprop Adadelta
nodes = 1000
max_layers = 2
max_epochs = 100
max_batch = 32
loss_funct = 'binary_crossentropy' #for binary
last_act = 'sigmoid' # 'softmax' 'sigmoid' 'relu'
def baseline_model():
    # create model
    model = Sequential()
    model.add(Dense(nodes, input_dim=col_count, activation=activator))
    for x in range(0, max_layers):
        model.add(Dense(nodes, input_dim=nodes, activation=activator))
    model.add(Dense(1, activation=last_act))
    # Compile model
    adam = Adam(lr=0.001)
    model.compile(loss=loss_funct, optimizer=adam, metrics=['accuracy'])
    return model
estimator = KerasClassifier(build_fn=baseline_model, epochs=max_epochs, batch_size=max_batch)
estimator.fit(X_train, y_train)
y_pred = estimator.predict(X_test)
#confusion matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
score = np.sum(cm.diagonal())/float(np.sum(cm))
Two points:
There is absolutely no point in stacking dense layers with linear activations, since they collapse into a single linear layer; change to activator = 'relu' (and don't bother with the other candidate activation functions in your commented-out list).
Do not use dropout by default, especially if your model has difficulties in learning (like here); remove the dropout layer(s), and be ready to put (some of) them back only if you see overfitting (you are currently still very far from that point, so this is not something to worry about now).
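The first point can be checked numerically. The sketch below uses plain NumPy with random weights (the shapes are arbitrary, chosen only for illustration) to show that two stacked linear layers compute exactly the same mapping as one linear layer:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))          # 5 samples, 4 features

# Two stacked Dense layers with linear activation: h = x W1 + b1, y = h W2 + b2
W1, b1 = rng.normal(size=(4, 8)), rng.normal(size=8)
W2, b2 = rng.normal(size=(8, 3)), rng.normal(size=3)
stacked = (x @ W1 + b1) @ W2 + b2

# The same mapping written as a single linear layer
W, b = W1 @ W2, b1 @ W2 + b2
single = x @ W + b

print(np.allclose(stacked, single))
```

However many linear layers are stacked, the product of their weight matrices is still one matrix, so the extra layers add parameters but no expressive power.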

Spiral problem, why does my loss increase in this neural network using Keras?

I'm trying to solve the spiral problem using Keras with 3 spirals instead of 2, using a strategy similar to the one I used for 2. The problem is that my loss now grows exponentially instead of decreasing, with the same parameters I used for 2 spirals (the neural network has 3 outputs instead of being binary). I'm not sure what could be causing this; can anyone help? I have tried various epochs, learning rates and batch sizes.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import MinMaxScaler
from tensorflow.keras.optimizers import RMSprop
from Question1.utils import create_neural_network, create_test_data
EPOCHS = 250
def main():
    df = three_spirals(1000)
    # Set-up data
    x_train = df[['x-coord', 'y-coord']].values
    y_train = df['class'].values
    # Don't need y_test, can inspect visually if it worked or not
    x_test = create_test_data()
    # Scale data
    scaler = MinMaxScaler()
    scaler.fit(x_train)
    x_train = scaler.transform(x_train)
    x_test = scaler.transform(x_test)
    relu_model = create_neural_network(layers=3,
                                       neurons=[40, 40, 40],
    # Train networks
    relu_model.fit(x=x_train, y=y_train, epochs=EPOCHS, verbose=1, batch_size=BATCH_SIZE)
    # Predictions on test data
    relu_predictions = relu_model.predict_classes(x_test)
    models = [relu_model]
    test_predictions = [relu_predictions]
    # Plot
    plot_data(models, test_predictions)
And here is the create_neural_network function:
def create_neural_network(layers, neurons, activation, optimizer, loss, outputs=1):
    if layers != len(neurons):
        raise ValueError("Number of layers doesn't match the amount of neuron layers.")
    model = Sequential()
    for i in range(layers):
        model.add(Dense(neurons[i], activation=activation))
    # Output
    if outputs == 1:
        model.add(Dense(outputs, activation='softmax'))
    return model
I have worked it out. The output data isn't like binary classification, where you only need one column. For multi-class classification you need one column for each class. Having a single y that could be 0, 1 or 2 was incorrect. The correct way was to have three columns y0, y1, y2, each of which is 1 if the sample belongs to that class and 0 if it doesn't.
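A minimal sketch of that conversion, using NumPy only (the helper name one_hot and the sample labels are illustrative, not from the original code):

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer class labels into one indicator column per class."""
    labels = np.asarray(labels, dtype=int)
    return np.eye(num_classes, dtype=np.float32)[labels]

y = np.array([0, 2, 1, 2])
y_one_hot = one_hot(y, num_classes=3)
print(y_one_hot)
```

Keras also ships keras.utils.to_categorical for the same purpose; the resulting array is what a 3-unit softmax output with a categorical cross-entropy loss expects as the target.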

Overfitting and data leakage in tensorflow/keras neural network

Good morning, I'm new to machine learning and neural networks. I am trying to build a fully connected neural network to solve a regression problem. The dataset is composed of 18 features and 1 label, and all of these are physical quantities.
You can find the code below. I have uploaded a figure of the loss function's evolution over the epochs (you can find it below). I am not sure whether there is overfitting. Can someone explain to me why there is or is not overfitting?
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectFromModel
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from matplotlib import pyplot as plt
import keras
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout, BatchNormalization
from tensorflow.keras.callbacks import EarlyStopping
from keras import optimizers
from sklearn.metrics import r2_score
from keras import regularizers
from keras import backend
from tensorflow.keras import regularizers
from keras.regularizers import l2
# =============================================================================
# Choose the test size
# =============================================================================
test_size = 0.2
dataset = pd.read_csv('DataSet.csv', decimal=',', delimiter = ";")
label = dataset.iloc[:,-1]
features = dataset.drop(columns = ['Label'])
y_max_pre_normalize = max(label)
y_min_pre_normalize = min(label)
def denormalize(y):
    final_value = y*(y_max_pre_normalize-y_min_pre_normalize)+y_min_pre_normalize
    return final_value
# =============================================================================
# Split
# =============================================================================
X_train1, X_test1, y_train1, y_test1 = train_test_split(features, label, test_size = test_size, shuffle = True)
y_test2 = y_test1.to_frame()
y_train2 = y_train1.to_frame()
# =============================================================================
# Normalize
# =============================================================================
scaler1 = preprocessing.MinMaxScaler()
scaler2 = preprocessing.MinMaxScaler()
X_train = scaler1.fit_transform(X_train1)
X_test = scaler2.fit_transform(X_test1)
scaler3 = preprocessing.MinMaxScaler()
scaler4 = preprocessing.MinMaxScaler()
y_train = scaler3.fit_transform(y_train2)
y_test = scaler4.fit_transform(y_test2)
# =============================================================================
# Build the network
# =============================================================================
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
model = Sequential()
model.add(Dense(60, input_shape = (X_train.shape[1],), activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dense(60, activation = 'relu',kernel_initializer='glorot_uniform'))
model.add(Dense(1,activation = 'linear',kernel_initializer='glorot_uniform'))
model.compile(loss = 'mse', optimizer = optimizer, metrics = ['mse'])
history = model.fit(X_train, y_train, epochs = 100,
                    validation_split = 0.1, shuffle=True, batch_size=250)
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
y_train_pred = denormalize(y_train_pred)
y_test_pred = denormalize(y_test_pred)
plt.plot((y_test1),(y_test_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.plot((y_train1),(y_train_pred),'.', color='darkviolet', alpha=1, marker='o', markersize = 2, markeredgecolor = 'black', markeredgewidth = 0.1)
plt.plot((np.array((-0.1,7))),(np.array((-0.1,7))),'-', color='magenta')
plt.plot(loss_values,'b',label = 'training loss')
plt.plot(val_loss_values,'r',label = 'val training loss')
plt.ylabel('Loss Function')
# r2_score expects (y_true, y_pred) in that order
print("\n\nThe R2 score on the test set is:\t{:0.3f}".format(r2_score(y_test1, y_test_pred)))
print("The R2 score on the train set is:\t{:0.3f}".format(r2_score(y_train1, y_train_pred)))
from sklearn import metrics
# Measure MSE error.
score = metrics.mean_squared_error(y_test_pred,y_test1)
print("\n\nFinal score test (MSE): %0.4f" %(score))
score1 = metrics.mean_squared_error(y_train_pred,y_train1)
print("Final score train (MSE): %0.4f" %(score1))
score2 = np.sqrt(metrics.mean_squared_error(y_test_pred,y_test1))
print(f"Final score test (RMSE): %0.4f" %(score2))
score3 = np.sqrt(metrics.mean_squared_error(y_train_pred,y_train1))
print(f"Final score train (RMSE): %0.4f" %(score3))
I also tried computing feature importances and raising n_epochs; these are the results:
Feature Importance:
No Feature Importance:
Looks like you don't have overfitting! Your training and validation curves are descending together and converging. The clearest sign you could get of overfitting would be a deviation between these two curves, something like this:
Since your two curves are descending and are not diverging, it indicates your NN training is healthy.
HOWEVER! Your validation curve is suspiciously below the training curve. This hints at possible data leakage (train and test data have been mixed somehow). More info in a nice and short blog post. In general, you should split the data before any other preprocessing (normalizing, augmentation, shuffling, etc.).
Another possible cause is some type of regularization (dropout, BN, etc.) that is active while the training loss is computed and deactivated when the validation/test loss is computed.
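The advice to split before preprocessing can be sketched as follows. The data here is synthetic and only stands in for the real features; the point is that the scaler's parameters are learned from the training subset alone and then reused on the test subset:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 3))   # synthetic stand-in for the features
y = X.sum(axis=1)

# 1) Split first
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 2) Learn the scaling parameters from the training set only
scaler = MinMaxScaler().fit(X_train)

# 3) Reuse the same fitted scaler on both subsets
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)

print(X_train_scaled.min(axis=0), X_train_scaled.max(axis=0))
```

Fitting a second scaler on the test set would map train and test to different scales, so scaled test values may legitimately fall slightly outside [0, 1] when the training scaler is reused.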
Overfitting is when the model does not generalize to data other than the training data. When this happens you will have a very (!) low training loss but a high validation loss. You can think of it this way: if you have N points, you can fit a polynomial of degree N-1 such that you have zero training loss (your model hits all your training points perfectly). But if you apply that model to some other data, it will most likely produce a very high error (see the image below). Here the red line is our model and the green is the true data (+ noise), and you can see that in the last picture we get zero training error. In the first, our model is too simple (high train/high validation error), the second is good (low train/low validation error), and the third and last is too complex, i.e. overfitting (very low train/high validation error).
Neural networks can behave in the same way, so by looking at your training vs. validation error, you can conclude whether the model overfits or not.
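The polynomial picture can be reproduced with a few lines of NumPy. This is only an illustrative sketch: the noisy sine data, the noise level and the chosen degrees are assumptions made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def noisy_sine(x):
    """True signal sin(2*pi*x) plus Gaussian noise."""
    return np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=x.shape)

x_train = np.linspace(0.0, 1.0, 8)
y_train = noisy_sine(x_train)
x_val = np.linspace(0.05, 0.95, 50)
y_val = noisy_sine(x_val)

for degree in (1, 3, 7):   # 7 = N - 1 for N = 8 training points
    coeffs = np.polyfit(x_train, y_train, degree)
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    val_mse = np.mean((np.polyval(coeffs, x_val) - y_val) ** 2)
    print(f"degree {degree}: train MSE {train_mse:.2e}, val MSE {val_mse:.2e}")
```

The degree-7 fit passes through every training point, so its training error is numerically zero, while its validation error is not. That gap between the two errors is the signature to look for in a loss plot.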
No, this is not overfitting, as your validation loss isn't increasing.
Nevertheless, if I were you I would be a little bit skeptical. Try to train your model for even more epochs and watch the validation loss.
What you definitely should do is check the following:
- are there duplicates or near-duplicates in the data (this creates information leakage from the train to the validation split)
- are there features that have a causal connection to the target variable
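The duplicate check from the list above can be sketched with pandas. The helper name count_shared_rows and the toy frames are hypothetical; the idea is simply to count validation rows that also occur verbatim in the training set:

```python
import pandas as pd

def count_shared_rows(train: pd.DataFrame, val: pd.DataFrame) -> int:
    """Number of validation rows that also appear verbatim in the training set."""
    train_unique = train.drop_duplicates()
    shared = val.merge(train_unique, how="inner", on=list(val.columns))
    return len(shared)

train = pd.DataFrame(
    {"f1": [1.0, 2.0, 3.0, 3.0], "f2": [0.1, 0.2, 0.3, 0.3], "label": [0, 1, 1, 1]}
)
val = pd.DataFrame({"f1": [3.0, 4.0], "f2": [0.3, 0.4], "label": [1, 0]})

print(count_shared_rows(train, val))
```

This only catches exact duplicates; near-duplicates would need rounding or a distance-based comparison, which is left out here.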
Usually, there is some random component in a real-world dataset, so rules that are observed in the train data aren't 100% true for the validation data.
Your plot shows that the validation loss keeps decreasing as the train loss decreases. Usually, you reach some point in training where the rules you observe in the train data are too specific to describe the whole data. That's when overfitting begins. Hence, it is weird that your validation loss doesn't increase again.
Please check whether your validation loss approaches zero when you train for more epochs. If that's the case, I would check your database very carefully.
Let's assume that there is some kind of information leakage from the train set to the validation set (through duplicate records, for example). Your model would adjust its weights to describe very specific rules. When applied to new data, it would fail miserably, since the observed connections are not really general.
Another common data problem is that features may have a reversed causality.
The fact that the validation loss is generally lower than the train loss probably comes from dropout and regularization, since these are applied during training but not when predicting/testing.
I put some emphasis on this because a tiny bug or an error in the data can "fuck up" your whole model.

How to change a negative r2_score result from Keras code

So I am trying to develop a deep learning program that can predict the quality of wine, treated as a regression problem. The dataset is from I have read some tutorials but it is mainly based on
This code is written and run on It runs without any issues; however, the r2_score is negative, and I don't fully understand why we sometimes use X_test and sometimes X[test], e.g. for computing the r2_score.
import matplotlib.pyplot as plt
import h5py # export models in HDF5 format
from keras.datasets import mnist
from keras.utils import np_utils
from keras.layers import Activation, Dense, Dropout
from keras.models import Sequential
from keras import optimizers
from keras import losses
from keras import metrics
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import train_test_split
import numpy as np
import pandas as pd
# Read in red wine data
# Read a comma-separated values (csv) file into DataFrame.
red = pd.read_csv("", sep=';')
X = red.drop('quality', axis=1) # Isolate data # Drop specified labels from rows or columns. # same as ix[:,0:11]
Y = red.quality
X = StandardScaler().fit_transform(X) # Scale the data with StandardScaler
# StandardScaler transforms data such that its distribution will have a mean value 0 and standard deviation of 1.
# Each value in the dataset will have the sample mean value subtracted, and then divided by the standard deviation of the whole dataset.
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
seed = 7
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
# split up the data into K partitions / K-fold cross-validation
# Generate indices to split data into training and test set
for train, test in kfold.split(X, Y):
    model = Sequential() # Initialize the model
    model.add(Dense(64, input_dim=11, activation='relu')) # Add input layer
    model.add(Dense(1)) # Add output layer
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    #model.compile(loss='categorical_crossentropy', optimizer=optimizers.SGD(), metrics=['accuracy'])
    history = model.fit(X[test], Y[test], validation_split=0.25, epochs=NB_EPOCH, verbose=1)
    #history = model.fit(X_train, y_train, validation_split=0.25, epochs=20, verbose=1)
    mse_value, mae_value = model.evaluate(X[test], Y[test], verbose=0)
    print("Mean Squared Error: "+ str(mse_value)) # quantifies the difference between the estimator and what is estimated
    print("Mean Absolute Error: " + str(mae_value)) #quantifies how close predictions are to the eventual outcomes
score = model.evaluate(X_test, y_test, verbose=1)
print("Test score:", score[0])
print('Test accuracy:', score[1])
# generating the graph through matplotlib
# Plot training & validation mea values
fig= plt.figure(figsize=(20,5))
plt.title('Model MAE')
plt.legend(['train', 'test'], loc='upper left')
from sklearn.metrics import r2_score
y_pred = model.predict(X[test])
print("this is r2:" + str(r2_score(Y[test], y_pred)))
The reason for splitting data into train and test sets is simple: if you train your model on all of your data, it will work well on that data (because it has seen it before), but when you want to see how it works on new data, it may not work well (it doesn't generalize).
So you should put a part of your data aside so that you can test your model and see whether it generalizes well. When you want predictions on held-out data to see how your model performs, you can use X[test].
Note that r2 is not the per-sample difference (Y[test] - y_pred); it is 1 minus the ratio of the model's squared error to the squared error of always predicting the mean of Y[test]. A negative r2 therefore means that your predictions are worse than simply predicting the mean value. You could also check the MSE (a common choice for regression problems),
something like this:
MSE = np.square(np.subtract(Y_true,Y_pred)).mean()
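To see concretely how R² can become negative, here is a small NumPy sketch of the standard definition, 1 - SS_res / SS_tot. The function name r2 and the numbers are made up for illustration:

```python
import numpy as np

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

y_true = np.array([3.0, 5.0, 6.0, 7.0])
mean_baseline = np.full_like(y_true, y_true.mean())
bad_pred = np.array([6.0, 6.0, 4.0, 5.0])

print(r2(y_true, mean_baseline))  # 0.0: no better than predicting the mean
print(r2(y_true, bad_pred))       # negative: worse than predicting the mean
```

A negative score is therefore a sign that the model fits the evaluated data worse than a constant prediction, which is worth checking against how the model was trained and on which subset it was evaluated.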
