How could I increase the accuracy of the training set? - Python

I'm working on a classification problem (human activity classification) and I used a CNN. The model code is:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPool2D, Dense, Dropout, Flatten
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Conv2D(100, (2, 2), activation='relu', input_shape=X_train[0].shape))
model.add(Dropout(0.1))
# pooling layer
model.add(MaxPool2D(2, 2))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Flatten())
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.5))
model.add(Dense(7, activation='softmax'))
Compiling and fitting:
model.compile(optimizer=Adam(learning_rate=0.001), loss='sparse_categorical_crossentropy', metrics=['accuracy'])
history = model.fit(X_train, y_train, epochs=20, validation_data=(X_test, y_test), verbose=1)
The accuracy came out like this [plot of training and validation accuracy].
How could I increase the final accuracy value, and why does the curve increase so quickly?

There are a few avenues you can pursue here, specifically finding answers to the following questions for your particular problem. There's a great video on this (not for TensorFlow, but the question you are asking is general enough for it to apply).
What is the right amount of time to train for? Likely the answer here is somewhere between 20 and 90 epochs; more specifically, it's where the two series in your plot start to diverge. In other words, your model starts to memorize the training data at the point of divergence. TensorFlow has early-stopping mechanisms to help with this (see the sketch after these questions).
What is the performance of a naïve guesser? Is the complexity of your model proportional to the complexity/dimensionality of the problem?
What is the human insight that you can bring to the problem? Are there things you can do to the features that will help the model create separability in higher dimensions? For example, let's say your model is going to predict what activity a person is going to do at a given point in time. In this case, information related to people might be separate from time and activity data. You can create features that represent combinations of other features (assuming you have a lot of data), and encode this and feed it to your model. You can create embeddings in your model to get your model to deal with the sparsity that occurs when you combine such categorical features.
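To make the early-stopping point concrete, here is a minimal sketch using Keras's built-in callback, reusing the model and arrays from the question; the patience value and the 90-epoch upper bound are assumptions you would tune, not prescriptions:
from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 5 consecutive epochs,
# and roll back to the best weights seen so far.
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)

history = model.fit(X_train, y_train,
                    epochs=90,                      # upper bound; the callback usually stops earlier
                    validation_data=(X_test, y_test),
                    callbacks=[early_stop],
                    verbose=1)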
Another aspect of this that I think is very important to answer is "Why am I solving this problem?". In some cases, the answer might be "I want to learn X", in which case you might approach it differently. For example, if it's all tabular data, you might have more interpretable/better results using something like scikit-learn using a tree based model. It also, of course, depends on the amount and type of data you have. Nested cross-validation can give you great insight into what are the combinations of hyperparameters and features that will produce a model that generalizes, and also about the variation you can expect to see on unseen data.
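As a rough sketch of the nested cross-validation idea on synthetic data (the random-forest model and parameter grid here are placeholders, not a recommendation for this specific problem): the inner loop picks hyperparameters, while the outer loop estimates how well that whole procedure generalizes.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

param_grid = {'n_estimators': [100, 300], 'max_depth': [None, 10]}
inner_cv = KFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=0)

# Inner loop tunes hyperparameters; outer loop estimates generalization and its variance.
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, cv=inner_cv)
scores = cross_val_score(search, X, y, cv=outer_cv)
print(scores.mean(), scores.std())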
Best of luck!

Related

Optimal permutations of kernel_initializer, activation function and optimizer for Regression

I am using kernel_initializer='normal' and optimizer='adam' to find an optimal regression solution. I am getting close to 0.94 accuracy on the training data. I would like to test a few other kernel_initializer, activation function, and optimizer combinations, but I am not sure which kernel_initializer and activation function work well for regression. Please suggest.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

# create model
model = Sequential()
model.add(Dense(13, input_dim=13, kernel_initializer='normal', activation='relu'))
model.add(Dense(6, kernel_initializer='normal', activation='relu'))
model.add(Dense(1, kernel_initializer='normal'))
# Compile model ('root_mean_squared_error' is not a built-in Keras loss string; MSE is the standard choice)
model.compile(loss='mean_squared_error', optimizer='adam')
Well, it might not be a good idea. You're after a fairly small margin in performance, and by "fishing" for good results you're essentially exploiting your validation set as a training set, relying on small variations in the data to inform model design.
A few tips:
The Glorot initializer (the default) is usually the best. However, the difference is really small, especially in such a tiny model.
relu activation is helpful to fight vanishing gradients. With three layers in the model, you probably won't run into them. Here it really depends on the nature of the data; even linear activation might make sense.
For regular regression (i.e. predicting a number, not a binary output), you probably want a linear activation on the output layer. It is the default, but it's better to make it explicit (see the sketch after these tips).
Other optimizers might improve the rate of convergence, but usually don't improve the final performance. Adam sounds like a reasonable choice: sgd will do the same but more slowly, and ftrl works best on sparse data such as language input.
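Putting those tips together, a minimal sketch of the model and compile step might look like this, assuming the same 13-feature tabular regression as in the question (MSE stands in for RMSE, which is not a built-in Keras loss string):
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

model = Sequential()
model.add(Dense(13, input_dim=13, activation='relu'))   # Glorot uniform is the Keras default initializer
model.add(Dense(6, activation='relu'))
model.add(Dense(1, activation='linear'))                 # explicit linear output for regression
model.compile(loss='mean_squared_error', optimizer=Adam(learning_rate=0.001))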

Analyzing learning curves for facial expression recognition [closed]

Closed. This question needs to be more focused. It is not currently accepting answers.
I have a neural network set up in TensorFlow (in Python) that is operating on the FER2013 dataset (which can be found on Kaggle). My network architecture is this:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense
from tensorflow.keras.optimizers import Adam

emotion_model = Sequential()
emotion_model.add(Conv2D(32, kernel_size=(3, 3), activation='relu', input_shape=(48, 48, 1)))
emotion_model.add(Conv2D(64, kernel_size=(3, 3), activation='relu'))
emotion_model.add(MaxPooling2D(pool_size=(2, 2)))
emotion_model.add(Dropout(0.25))
emotion_model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
emotion_model.add(MaxPooling2D(pool_size=(2, 2)))
emotion_model.add(Conv2D(128, kernel_size=(3, 3), activation='relu'))
emotion_model.add(MaxPooling2D(pool_size=(2, 2)))
emotion_model.add(Dropout(0.25))
emotion_model.add(Flatten())
emotion_model.add(Dense(1024, activation='relu'))
emotion_model.add(Dropout(0.5))
emotion_model.add(Dense(7, activation='softmax'))

emotion_model.compile(loss='categorical_crossentropy', optimizer=Adam(learning_rate=0.0001, decay=1e-6), metrics=['accuracy'])

emotion_model_info = emotion_model.fit(
    train_generator,
    steps_per_epoch=28709 // 64,
    epochs=50,
    validation_data=validation_generator,
    validation_steps=7178 // 64)
I plotted a learning curve for this model and got the following (plot omitted; training accuracy climbs well above validation accuracy).
Now, I am a beginner at machine learning, but this divergence in training vs. validation accuracy/cost would seem to point to overfitting. However, I was looking at other people's accuracy levels on the same dataset and found that most people get around 62% accuracy on validation (which is what I have currently), and they usually get about the same for training accuracy. So I'm very surprised that my training data is performing so well (indicating overfitting) and yet my validation accuracy is on par with other implementations.

My question is two-fold. First, is there anything wrong with my implementation that may be causing my model to perform so well on training but only average on validation (and not really have any room for improvement), or is this just classic overfitting? If it is overfitting, I would appreciate some advice on how to counter it. My dataset is fixed for the most part (I guess I could try to add more data if I need to), and I've tried adding some regularization, which harmed performance. Basically, I feel like I'm missing something here. It strikes me as suspicious that my training accuracy is so high, and I wanted to check here to make sure I wasn't missing anything before I sink time into trying to correct the overfitting. Any help is appreciated.
You're quite correct: this is the very definition of over-fitting.
Validation and training losses diverge
Validation and training accuracies diverge
Validation loss later increases
In general, we also expect the validation loss to reach a relative minimum at about the same point; this defines the convergence point. Here, it seems that out of the many things the model learns in training, a few useful ones are still being picked up after the divergence point around epoch 8.
The next items to consider are
What makes you think that you need to train for 50 epochs? Is it possible that epoch 8 is a reasonable place to halt training?
Have you checked your data set to ensure that the training set is, indeed, properly representative of the entire data set?
Consider cross-validation to check the viability of your data splits
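As a rough sketch of that cross-validation / representativeness check (the label array here is a synthetic placeholder; substitute the real fer2013 class labels), stratified folds keep the class proportions close between the training and validation splits:
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical label array; replace with the real fer2013 labels (classes 0..6).
y = np.random.randint(0, 7, size=35887)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for fold, (train_idx, val_idx) in enumerate(skf.split(np.zeros((len(y), 1)), y)):
    train_dist = np.bincount(y[train_idx], minlength=7) / len(train_idx)
    val_dist = np.bincount(y[val_idx], minlength=7) / len(val_idx)
    print(f"fold {fold}: max class-frequency gap = {np.abs(train_dist - val_dist).max():.4f}")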
I don't know much about this specific dataset, but, generally speaking, a deep learning model will always converge to best fit the training data if it is expressive enough (has enough neurons) to do so.
You can always use dropout layers in between your main layers; these randomly zero out a fraction of the activations passed on to the next layer. Just do:
emotion_model.add(tensorflow.keras.layers.Dropout(dropout_rate))  # dropout_rate is a fraction between 0 and 1, e.g. 0.25
You could also try using L1 and/or L2 regularization, which adds a penalty on a layer's weights to the final loss; this means the model can't give any single feature a very large weight, which reduces overfitting. Just add a kernel_regularizer argument to layers that contain weights, like:
emotion_model.add(Conv2D(64, kernel_size=(3, 3), activation='relu',
                         kernel_regularizer=tensorflow.keras.regularizers.l1()))

My LSTM model overfits over validation data

Here's my LSTM model to classify hand gestures. Initially, I had 1960 training samples of shape (num_sequences, num_joints, 3) that I reshaped to (num_sequences, num_joints*3).
Here's my model:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, Bidirectional, LSTM, Dropout, BatchNormalization, Dense
from tensorflow.keras.optimizers import Adam

input_shape = (trainx.shape[1], trainx.shape[2])
print("Build LSTM RNN model ...")
model = Sequential()
model.add(Masking(mask_value=0., input_shape=(171, 66)))
model.add(Bidirectional(LSTM(units=256, activation='tanh', return_sequences=True, input_shape=input_shape)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=True)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Bidirectional(LSTM(units=128, activation='tanh', return_sequences=False)))
model.add(Dropout(0.5))
model.add(BatchNormalization())
model.add(Dense(units=trainy.shape[1], activation="softmax"))
print("Compiling ...")
# Keras optimizer defaults:
# Adam   : lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.
# RMSprop: lr=0.001, rho=0.9, epsilon=1e-8, decay=0.
# SGD    : lr=0.01, momentum=0., decay=0.
opt = Adam()
model.compile(loss="categorical_crossentropy", optimizer=opt, metrics=["accuracy"])
I get 90% accuracy on the training set and 50% on the test set.
Overfitting is quite common in deep learning.
To circumvent overfitting with your LSTM architecture, try the following things in this order:
Decrease the learning rate from 0.1 or 0.01 to 0.001, 0.0001, or 0.00001.
Reduce the number of epochs. You can plot the training and validation accuracy as a function of the number of epochs and see where the training accuracy becomes larger than the validation accuracy; that is the number of epochs you should use. Combine this with the first step of decreasing the learning rate.
Then you can try to modify the architecture of the LSTM. You already added dropout (at the maximum common value of 0.5); I would suggest trying 0.2 or 0.3. You have 3 recurrent layers, which is better than 2, and the layer sizes look reasonable. What embedding dimension are you currently using? Since you are overfitting, it is worth trying to reduce the number of recurrent layers from 3 to 2 while keeping the same number of units.
The batch size might be important, as is the distribution of subclasses in your dataset. Is the dataset equally distributed and equally balanced between the training and validation sets? What I mean is that if one hand gesture is over-represented in the training set compared to the validation set, that might be a problem. A good strategy to overcome this is to keep some part of the data aside as a test set, then do a train/validation split cross-validation with sklearn (5 folds), train your architecture on each split separately (5 times), and compare the training and validation accuracy. If there is a big bias in the split or among the sets, you will be able to identify it in this manner.
Last, you can try augmentation, specifically rotation and horizontal/vertical flip. This library might help https://github.com/aleju/imgaug
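If your gestures are available as image frames (rather than only joint coordinates), a minimal imgaug sketch of that rotation/flip augmentation might look like this; the array shape is a placeholder:
import numpy as np
import imgaug.augmenters as iaa

# Hypothetical batch of uint8 frames, shape (batch, height, width, channels).
images = np.random.randint(0, 255, size=(8, 64, 64, 1), dtype=np.uint8)

seq = iaa.Sequential([
    iaa.Fliplr(0.5),               # horizontal flip with 50% probability
    iaa.Flipud(0.2),               # occasional vertical flip
    iaa.Affine(rotate=(-15, 15)),  # random rotation in degrees
])
augmented = seq(images=images)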
Hope this helps!
How do you know the network is overfitting versus having some kind of error in your data set? Does the validation loss improve initially up to some epoch and then plateau or start to increase? Then it is overfitting. If it starts at 50% and stays there, it is not an overfitting problem. With the amount of dropout you have, overfitting does not look very likely.

How did you select your validation set? Was it randomly selected from the overall data set, or did you do the selection by hand? It is always better to select the data randomly so that its probability distribution mirrors that of the training data. As said in the comments, please show your code for model.fit; there could be a problem there. How do you input the data? Did you use generators? A 50% validation accuracy leads me to suspect some error in how your validation data is provided to the network, or some error in the labeling of the validation data.

I would also recommend dynamically adjusting your learning rate based on monitoring of the validation loss. Keras has a callback for this called ReduceLROnPlateau (documentation is here). Set it up to monitor validation loss; I set the parameters patience=3 and factor=.5, which seems to work well. You can think of training as descending into a valley: as you descend, the valley gets narrower, and if the learning rate is too large and remains fixed, you won't be able to reach further down toward the minimum. This should improve your training accuracy, which should in turn improve validation accuracy. As I said, with the level of dropout you have I do not think it is overfitting, but if it is, you can also use Keras regularizers to help avoid overtraining (documentation is here).
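A minimal sketch of that callback, reusing the model and trainx/trainy from the question (the epoch count, batch size, and validation_split are placeholder choices):
from tensorflow.keras.callbacks import ReduceLROnPlateau

# Halve the learning rate whenever validation loss has not improved for 3 epochs.
reduce_lr = ReduceLROnPlateau(monitor='val_loss', factor=0.5, patience=3, verbose=1)

model.fit(trainx, trainy, validation_split=0.2,
          epochs=100, batch_size=32, callbacks=[reduce_lr])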

How do I best optimize my parameters, choices of activation, optimizer, etc. in an LSTM?

I'm training an LSTM neural network in Keras to predict volatility (a time series). At the moment, my network is specified as follows:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.regularizers import l2

model = Sequential()
model.add(LSTM(10, input_shape=(1, 1), kernel_regularizer=l2(0.0001)))
model.add(Dense(1, activation='relu'))
model.compile(loss='mean_squared_error', optimizer='adam')
model.fit(X_train, y_train, validation_split=0.2, epochs=100, batch_size=16)
Here, I have a lot of parameters I could cross-validate:
Units in the LSTM?
More layers?
Regularizer (L1 or L2, and amount)?
Activation function?
Optimizer?
Batch size?
However, cross-validating all of these parameters would result in huge computational time, so how do I determine the correct specifications for all of them?
As far as I know, doing a grid search might be the best approach (a minimal sketch is given at the end of this answer). However, you can lessen your search space by examining your data. If you don't have much data, go for a smaller model and don't go too big (or else it will overfit); this shrinks the search space a bit. Some say fewer layers but more units work well for low-resource data, but still, it is not guaranteed.
Regularization can be good or bad; it depends on the task. You'll never know whether a setting is correct unless you experiment with it.
For batch size, it is recommended to experiment with values from 16 to 512 (or higher if you can). The larger the batch size, the faster it trains and the more memory it consumes. A smaller batch size also means the model will "walk" more randomly; in other words, the loss will decrease at a more erratic pace.
For the optimizer, rather than including it in the grid search, just use Adam. It is quite good for most tasks.
All in all, no one can guarantee that tuning different hyperparameters will result in a performance gain. It all needs to be experimented with and recorded. That's why there is so much research on hyperparameter tuning.
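Here is a minimal sketch of that grid-search idea, reusing the X_train/y_train from the question; the grid values are illustrative assumptions, and the best configuration is picked by the lowest validation loss:
from sklearn.model_selection import ParameterGrid
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, Dense
from tensorflow.keras.regularizers import l2

param_grid = {'units': [5, 10, 20], 'batch_size': [16, 64], 'l2': [0.0, 0.0001]}

results = []
for params in ParameterGrid(param_grid):
    # Rebuild the same small LSTM for every hyperparameter combination.
    model = Sequential()
    model.add(LSTM(params['units'], input_shape=(1, 1), kernel_regularizer=l2(params['l2'])))
    model.add(Dense(1, activation='relu'))
    model.compile(loss='mean_squared_error', optimizer='adam')
    history = model.fit(X_train, y_train, validation_split=0.2,
                        epochs=100, batch_size=params['batch_size'], verbose=0)
    results.append((params, min(history.history['val_loss'])))

best_params, best_loss = min(results, key=lambda r: r[1])
print(best_params, best_loss)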

Keras Dense Net Overfitting

I am attempting to use Keras to build an activity classifier from accelerometer signals. However, I am experiencing extreme overfitting even with the simplest of models.
The input data is of shape (10, 3) and contains roughly 0.1 seconds of data from the accelerometer in 3 dimensions. The model is simply
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Flatten, Dense

model = Sequential()
model.add(Flatten(input_shape=(10, 3)))
model.add(Dense(2, activation='softmax'))
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])
The model should output the label [1,0] for walking activities and [0,1] for non-walking activities. After training I get 99.8% accuracy (if only it was real...). When I attempt to predict on data that wasn't used for training, I get 50% accuracy, verifying that the net isn't really "learning" anything except to predict a single class value.
The data is being prepared from 100 Hz triaxial accelerometer signals. I am not preprocessing the data in any way except for windowing it into bins of length 10 that overlap the previous bin by 50%. What measures can I take to make the network produce actual predictions? I have tried increasing the window size, but the results remain the same. Any advice/general tips are greatly appreciated.
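(For reference, a minimal sketch of that 50%-overlap windowing; the signal here is a synthetic placeholder.)
import numpy as np

def make_windows(signal, window=10, step=5):
    # Slice a (T, 3) accelerometer signal into overlapping (window, 3) bins;
    # step = window // 2 gives the 50% overlap described above.
    return np.array([signal[i:i + window]
                     for i in range(0, len(signal) - window + 1, step)])

signal = np.random.randn(1000, 3)   # hypothetical 10 s of 100 Hz triaxial data
windows = make_windows(signal)      # shape (199, 10, 3)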
Ian
Try adding some hidden layers and dropout layers to your network. You could create a simple Multi Layer Perceptron (MLP) with a couple of extra lines in between your Flatten layer and Dense layer:
model.add(Dense(64, activation='relu', input_dim=30))
model.add(Dropout(0.25))
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.1))
Or check out this guide, which explains how to create a simple MLP.
Without any hidden layers, your model will not really be 'learning' from the input data; rather, it will only be mapping the input features linearly to the output classes.
The more layers you add, the more intermediate features and patterns the model should extract from the input data, which should lead to better predictions on test data. There will be a lot of trial and error in designing the best model, as too many layers can result in overfitting.
You have not provided information about how you train the model, so that may be the cause of the issue as well. You must ensure that the data is split into training, validation, and test sets. Some possible split ratios for training, validation, and test data are 60%:20%:20% or 70%:15%:15%. This is ultimately something that you must also decide.
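A minimal sketch of such a 70%/15%/15% split with scikit-learn (the arrays here are synthetic placeholders; stratification keeps the walking/non-walking balance similar across sets):
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical windowed data: 1000 windows of shape (10, 3) and binary labels.
X = np.random.randn(1000, 10, 3)
y = np.random.randint(0, 2, size=1000)

X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.15, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.15 / 0.85, stratify=y_trainval, random_state=0)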
The problem of overfitting was caused by the input data type. The values passed to the classifier should have been float values with 2 decimal places. Somewhere along the way, some of these values had been augmented and had significantly more than 2 decimal places. That is, the input should have looked like
[9.81, 10.22, 11.3]
but instead looked like
[9.81000000012, 10.220010431, 11.3000000101]
The classifier was making its prediction based on this feature, which is obviously not the desired behavior! Lesson learned: make sure the data preparation is consistent! Thanks to @umutto for the suggestion of random forests; their simple structure was helpful for diagnostic purposes.
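A minimal sketch of that kind of consistency fix, using the example values from the answer: rounding every reading to two decimal places keeps numeric precision from leaking a spurious "feature" into the classifier.
import numpy as np

window = np.array([9.81000000012, 10.220010431, 11.3000000101])

# Round every accelerometer reading to two decimal places.
window = np.round(window, 2)
print(window)   # -> [ 9.81 10.22 11.3 ]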
