Currently I'm trying to train a Keras Sequential network with the pooled output from BERT. The fine-tuned BertForSequenceClassification yields good results, but using the pooled_output in a plain neural network does not work as intended. As input data I have 10,000 samples, each consisting of the 768 floats that my BERT model provides. I'm doing a simple binary classification, so I also have the labels as 1s and 0s.
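For context, a minimal sketch of how the pooled output can be extracted with the transformers library (texts stands for the raw input strings; the checkpoint name and the PyTorch backend are just example assumptions, and the exact output attribute can differ between library versions):
from transformers import BertTokenizer, BertModel
import torch

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
bert = BertModel.from_pretrained('bert-base-uncased')

with torch.no_grad():
    enc = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
    # pooled output: one 768-float vector per input text
    features = bert(**enc).pooler_output.numpy()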
As you can see, my data has a good number of examples for both classes. After shuffling them, I do a normal train/test split and create/fit my model with:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(1536, input_shape=(768,), activation='relu'))
model.add(Dense(1536, activation='relu'))
model.add(Dense(1536, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
opt = Adam(learning_rate=0.0001)
model.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy'])
# Normally with early stopping, so quite a few epochs
history = model.fit(train_features, train_labels, epochs=800, batch_size=68, verbose=1,
                    validation_split=0.2, callbacks=[])
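For reference, a minimal sketch of the early-stopping setup the comment above refers to (the patience value and restore_best_weights are illustrative assumptions, not necessarily what I used):
from tensorflow.keras.callbacks import EarlyStopping

# patience and restore_best_weights are illustrative choices
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
history = model.fit(train_features, train_labels, epochs=800, batch_size=68, verbose=1,
                    validation_split=0.2, callbacks=[early_stop])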
During training the loss decreases and my accuracy increases as expected. BUT the val_loss increases and the val_accuracy stays the same! Sure, I'm overfitting, but I would expect the val_accuracy to increase, at least for a few epochs, and only then decrease once I'm overfitting.
Does anyone have an idea what I'm doing wrong? Perhaps 10,000 samples aren't enough to generalize?
The model is overfitting as expected, but I am surprised it starts overfitting in the early epochs, which makes me wonder if you have some mislabeling in your validation set. At any rate, try changing the model as follows:
from tensorflow.keras.layers import Dropout  # in addition to the imports above

model = Sequential()
model.add(Dense(1536, input_shape=(768,), activation='relu'))
model.add(Dropout(.3))
model.add(Dense(512, activation='relu'))
model.add(Dropout(.3))
model.add(Dense(128, activation='relu'))
model.add(Dropout(.3))
model.add(Dense(1, activation='sigmoid'))
See if this reduces the overfitting problem.
It was not just mislabeling in my validation set, but in my whole dataset.
I take a sample of 100,000 entries:
train_df = train_df.sample(frac=1).reset_index(drop=True)
train_df = train_df.iloc[0:100000]
and delete some rows:
train_df = train_df[train_df['label'] != '-']
After that I set a few values using train_df.at in a loop, but some index labels no longer exist because I deleted those rows. train_df.at only throws warnings, so I did not notice this. I also mixed .loc and .iloc: in my case I selected .iloc[2:3], but the index label 2 does not exist, so it returns the row with label 3, which sits at position 2. After that I make my changes and train_df.at fails at inserting at label 2, but my loop goes on. In the next iteration .iloc returns label 4 at position 3, and my loop then puts the data on label 3 - from then on all my labels are one position off.
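A minimal sketch of the label-vs-position mismatch (the toy DataFrame is made up, but the behaviour is standard pandas):
import pandas as pd

df = pd.DataFrame({'label': ['a', 'b', '-', 'c', 'd']})
df = df[df['label'] != '-']   # keeps index labels 0, 1, 3, 4 -- the index now has a gap

print(df.iloc[2:3])           # position 2 -> the row whose index *label* is 3
print(df.loc[3])              # the same row, selected by label

# Re-aligning labels with positions after filtering avoids the off-by-one writes:
df = df.reset_index(drop=True)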
I am trying to build a regression model, but the MSE and MAE are very high. I filter and normalize the data (both the input and the output, for both the train and test sets). I think the problem comes from having very high values in one column: the minimum is 1 and the maximum is 9,100,000 (before normalizing), but I actually need to predict these high values.
The model looks like this: I have 6 input columns and 800,000 rows. I have tried more neurons and layers, or changing the sigmoid activation, but the loss and the error stay around 0.8 for MSE and 0.3 for MAE. The predictions are also way lower than they should be, never reaching the high values.
model = Sequential()
model.add(Dense(7, input_dim=num_input, activation='relu'))
model.add(Dense(7, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='mse', optimizer='rmsprop', metrics=['mse', 'mae'])
history = model.fit(x_train, y_train, epochs=epochs, batch_size=batch_size, validation_data=(x_val, y_val))
A few remarks and suggestions:
RMSProp is generally not used with fully connected layers; I recommend switching to Adam or SGD.
If you have a skewed distribution with many large values, you might consider using the log of these values instead (see the sketch after this list).
First try a shallow model with few neurons, then gradually increase the number of neurons in order to overfit the dataset. You should be able to reach a perfect score on the train set. At that point you can start decreasing the number of neurons and add layers with dropout to improve generalisation.
As already mentioned in the comments, the output activation for regression should be "linear". Sigmoid is for binary classification.
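A minimal sketch combining the log transform and the linear output (the layer sizes and the Adam optimizer are arbitrary choices for illustration, not a prescription):
import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# log1p compresses the 1 .. 9,100,000 target range; expm1 inverts it after prediction
y_train_log = np.log1p(y_train)
y_val_log = np.log1p(y_val)

model = Sequential()
model.add(Dense(32, input_dim=num_input, activation='relu'))
model.add(Dense(32, activation='relu'))
model.add(Dense(1, activation='linear'))   # linear output for regression
model.compile(loss='mse', optimizer='adam', metrics=['mae'])
model.fit(x_train, y_train_log, epochs=epochs, batch_size=batch_size,
          validation_data=(x_val, y_val_log))

y_pred = np.expm1(model.predict(x_val))    # back to the original scale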
I'm having a problem with a model I want to train.
It's a typical seq-to-seq problem with an attention layer, where the input is a string and the output is a substring of the submitted string.
e.g.
Input           Ground Truth
------------    ------------
helloimchuck    chuck
johnismyname    john
(This is just dummy data, not a real part of the dataset ^^)
And the model looks like this:
model = Sequential()
model.add(Bidirectional(GRU(hidden_size, return_sequences=True), merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder
model.add(Attention())
model.add(RepeatVector(max_out_seq_len))
model.add(GRU(hidden_size * 2, return_sequences=True)) # Decoder
model.add(TimeDistributed(Dense(units=output_size, activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
The problem is shown in my training history plot: as you can see, there is overfitting.
I'm using early stop criteria on the validation loss with patience=8.
self.Early_stop_criteria = keras.callbacks.EarlyStopping(monitor='val_loss', min_delta=0,
                                                         patience=8, verbose=0,
                                                         mode='auto')
And I'm using one-hot vectors to fit the model.
BATCH_SIZE = 64
HIDDEN_DIM = 128
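For context, the character-level one-hot encoding looks roughly like this (a sketch; input_strings, target_strings and max_in_seq_len are placeholder names, and strings are assumed to be at most the padded length):
import numpy as np

chars = sorted(set(''.join(input_strings) + ''.join(target_strings)))
char_to_idx = {c: i for i, c in enumerate(chars)}

def one_hot(s, max_len, vocab_size):
    # one row per character position, zero-padded up to max_len
    x = np.zeros((max_len, vocab_size))
    for t, c in enumerate(s):
        x[t, char_to_idx[c]] = 1.0
    return x

X = np.array([one_hot(s, max_in_seq_len, len(chars)) for s in input_strings])
Y = np.array([one_hot(s, max_out_seq_len, len(chars)) for s in target_strings])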
The thing is, I've tried other batch sizes, other hidden dimensions, and datasets of 10K, 15K, 25K and now 50K rows. However, there is always overfitting, and I don't know why.
The test_size = 0.2 and the validation_split=0.2. Those are the only parameters I haven't changed.
I also made sure that the dataset is properly built.
The only idea that I have is trying with another validation split, maybe 0.33 instead of 0.2.
I don't know if cross-validation would help.
Maybe someone has a better idea of what I could try. Thanks in advance.
As kvish proposed, dropout was a good solution.
I first tried with a dropout of 0.2.
model = Sequential()
model.add(Bidirectional(GRU(hidden_size, return_sequences=True, dropout=0.2), merge_mode='concat',
                        input_shape=(None, input_size)))  # Encoder
model.add(Attention())
model.add(RepeatVector(max_out_seq_len))
model.add(GRU(hidden_size * 2, return_sequences=True)) # Decoder
model.add(TimeDistributed(Dense(units=output_size, activation="softmax")))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop", metrics=['accuracy'])
And with 50K rows, it worked, but still had overfitting.
So, I tried with a dropout of 0.33, and it worked perfectly.
I have a classification problem where the target contains 5 classes and there are 15 features (all continuous),
and I have 1 million samples for training and 0.5 million for validation.
e.g.,
shape of X_train = (1000000,15)
shape of X_validation = (500000,15)
First, I used a Random Forest, which gets 88% average accuracy.
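The Random Forest baseline is just a standard scikit-learn fit (a sketch; n_estimators and the label variable names are assumptions):
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100, n_jobs=-1)
rf.fit(X_train, y_train)
print(rf.score(X_validation, y_validation))   # ~0.88 average accuracy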
After that I tried many neural network architectures; the best one got ~80% average accuracy on both training and validation data, which is worse than the Random Forest.
(I don't know much about designing neural network architectures.)
The following is the best of my NN architectures (~80% avg. accuracy):
model = Sequential()
model.add(Dense(1000, input_dim=15, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(900, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(800, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(700, activation='relu'))
model.add(Dropout(0.1))
model.add(Dense(600, activation='relu'))
model.add(Dense(5, activation='softmax'))#output layer
adadelta = Adadelta()
model.compile(loss='categorical_crossentropy', optimizer=adadelta, metrics=['accuracy'])
Batch Size = 128 and epochs = 100
I have read this question. The answer points out that a NN needs a large amount of data and some regularization. I think my data size is good enough, and I have also tried a higher dropout rate and L2 regularization, but it is still not working.
What could the problem be?
This is biological data for which I have no domain knowledge, so I'm sorry I can't explain it. I've plotted the feature distributions; all features are between 0 and 3.
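For reference, roughly what "higher dropout plus L2 regularization" looks like in Keras (the layer sizes, dropout rate and the 1e-4 penalty are illustrative choices, not the exact values tried above):
from keras import regularizers
from keras.models import Sequential
from keras.layers import Dense, Dropout

model = Sequential()
model.add(Dense(256, input_dim=15, activation='relu',
                kernel_regularizer=regularizers.l2(1e-4)))
model.add(Dropout(0.3))
model.add(Dense(128, activation='relu',
                kernel_regularizer=regularizers.l2(1e-4)))
model.add(Dropout(0.3))
model.add(Dense(5, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adadelta', metrics=['accuracy'])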
I am using TensorFlow with Keras to perform regression on some historical data. The data looks like this:
id,timestamp,ratio
"santalucia","2018-07-04T16:55:59.020000",21.8
"santalucia","2018-07-04T16:50:58.043000",22.2
"santalucia","2018-07-04T16:45:56.912000",21.9
"santalucia","2018-07-04T16:40:56.572000",22.5
"santalucia","2018-07-04T16:35:56.133000",22.5
"santalucia","2018-07-04T16:30:55.767000",22.5
I am reformulating it as a time series problem (25 time steps) so that I can predict (regress) the next values of the series (the variance should not be high). I am also using sklearn.preprocessing's MinMaxScaler to scale the data to the range (-1, 1) or (0, 1), depending on whether I use the LSTM or the Dense model (respectively).
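The scaling and windowing step looks roughly like this (a sketch; df is a placeholder for the loaded DataFrame):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# (-1, 1) for the LSTM variant; (0, 1) for the Dense one
scaler = MinMaxScaler(feature_range=(-1, 1))
values = scaler.fit_transform(df[['ratio']].values)

timesteps = 25
X, y = [], []
for i in range(len(values) - timesteps):
    X.append(values[i:i + timesteps])   # 25 past values
    y.append(values[i + timesteps])     # the next value to predict
X = np.array(X)                         # shape (samples, 25, 1) for the LSTM
y = np.array(y)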
I am training with two different architectures:
Dense is as follows:
def get_model(self, layers, activation='relu'):
    model = Sequential()
    # Input arrays of shape (*, layers[1])
    # Output = arrays of shape (*, layers[1] * 16)
    model.add(Dense(units=int(64), input_shape=(layers[1],), activation=activation))
    model.add(Dense(units=int(64), activation=activation))
    # model.add(Dropout(0.2))
    model.add(Dense(units=layers[3], activation='linear'))
    # activation=activation))
    # opt = optimizers.Adagrad(lr=self.learning_rate, epsilon=None, decay=self.decay_lr)
    opt = optimizers.rmsprop(lr=0.001)
    model.compile(optimizer=opt, loss=self.loss_fn, metrics=['mae'])
    model.summary()
    return model
This more or less gives good results (it's the same architecture as in TensorFlow's tutorial for predicting house prices).
However, the LSTM is not giving good results; it usually ends up stuck around a single value (for example, around 40: 40.0123123, 40.123123, 41.09090, ...), and I do not see why or how to improve it. The architecture is:
def get_model(self, layers, activation='tanh'):
    model = Sequential()
    # Shape = (Samples, Timesteps, Features)
    model.add(LSTM(units=128, input_shape=(layers[1], layers[2]),
                   return_sequences=True, activation=activation))
    model.add(LSTM(64, return_sequences=True, activation=activation))
    model.add(LSTM(layers[2], return_sequences=False, activation=activation))
    model.add(Dense(units=layers[3], activation='linear'))
    # activation=activation))
    opt = optimizers.Adagrad(lr=0.001, decay=self.decay_lr)
    model.compile(optimizer=opt, loss='mean_squared_error', metrics=['accuracy'])
    model.summary()
    return model
I currently train with a batch size of 200 that increases by a factor of 1.5 every fit. Each fit consists of 50 epochs, and I use a Keras EarlyStopping callback with a patience of at least 20 epochs.
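In code, that training scheme looks roughly like this (a sketch; n_rounds, the data variable names and the early_stopping object are placeholders):
batch_size = 200
for fit_round in range(n_rounds):
    model.fit(x_train, y_train, epochs=50, batch_size=int(batch_size),
              validation_data=(x_val, y_val), callbacks=[early_stopping])
    batch_size *= 1.5   # grow the batch size for the next fit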
I have tried adding more layers, more units, reducing layers, units, increasing and decreasing learning rate, etc, but every time it gets stuck around a value. Any reason for this?
Also, do you know any good practices that can be applied to this problem?
Cheers
Have you tried holding back a validation set and checking how well the model's performance on the training set tracks the validation set? This is often how I catch myself overfitting.
A simple function (adapted from here) can help you do that:
import matplotlib.pyplot as plt

hist = model.fit_generator(...)

def gen_graph(history, title):
    plt.plot(history.history['categorical_accuracy'])
    plt.plot(history.history['val_categorical_accuracy'])
    plt.title(title)
    plt.legend(['train', 'validation'])
    plt.show()

gen_graph(hist, "Accuracy, training vs. validation scores")
Also, do you have enough samples? If you're really, really sure that you have done as much as you can in terms of preprocessing and hyperparameter tuning, then generating some synthetic data or doing some data augmentation has occasionally helped me.
I'm trying to make an autoencoder using Keras with a TensorFlow backend. In particular, I have data consisting of a vector of n_components (e.g. 200) sampled n_times (e.g. 20000). It is key that when I train on time t, I compare it only to time t. It appears that the fit is shuffling the sampling times. I removed the bottleneck and found that the network is doing a pretty bad job of predicting the n_components, instead representing something more like the mean of the input scaled by each component.
Here is my network with the bottleneck commented out:
model = keras.models.Sequential()
# Make a 7-layer autoencoder network
model.add(keras.layers.Dense(n_components, activation='relu', input_shape=(n_components,)))
model.add(keras.layers.Dense(n_components, activation='relu'))
# model.add(keras.layers.Dense(50, activation='relu'))
# model.add(keras.layers.Dense(3, activation='relu'))
# model.add(keras.layers.Dense(50, activation='relu'))
model.add(keras.layers.Dense(n_components, activation='relu'))
model.add(keras.layers.Dense(n_components, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='sgd', metrics=['accuracy'])
# act is a numpy matrix of size (n_components, n_times)
model.fit(act.T, act.T, epochs=15, batch_size=100, shuffle=False)
newact = model.predict(act.T).T
I have tested shuffling the second dimension of act (n_times) and passing it as model.fit(act.T, act_shuffled.T), and I see no difference from model.fit(act.T, act.T). Am I doing something wrong? How can I force it to learn from the specific time?
Many thanks,
Arthur
I believe that I have solved the problem, but more knowledgeable users of Keras might be able to correct me. I had tried many different values for the argument batch_size of fit, but I didn't try a value of 1. When I changed it to 1, it did a good job of reproducing the input data.
I believe that the batch size, even if shuffle is set to False, allows the autoencoder to train one input time against an unrelated input time.
So, I have amended my code to:
model.fit(act.T, act.T, epochs=15, batch_size=1, shuffle=False)