DeepSpeech failed to learn Persian language - python

I’m training DeepSpeech from scratch (without a checkpoint) with a language model generated using KenLM, as described in its documentation. The dataset is the Common Voice dataset for Persian.
My configuration is as follows:
Batch size = 2 (due to cuda OOM)
Learning rate = 0.0001
Num. neurons = 2048
Num. epochs = 50
Train set size = 7500
Test and Dev sets size = 5000
Dropout for layers 1 to 5 = 0.2 (0.4 was also tried, with the same results)
Train and val losses decrease during training, but after a few epochs the val loss stops decreasing. Train loss is about 18 and val loss is about 40.
At the end of the process, the predictions are all empty strings. Any ideas on how to improve the model?

The Persian dataset in Common Voice has around 280 hours of validated audio, so this should be enough to create a model that has better accuracy than you're reporting.
It would help to know the CER and WER figures for the model. Seeing these indicates whether the best course of action lies with the hyperparameters of the acoustic model or with the KenLM language model. The difference is explained in the testing section of the DeepSpeech PlayBook.
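For readers unfamiliar with the two metrics, here is a minimal sketch of how WER and CER are usually computed from an edit distance. This is not DeepSpeech's own implementation; the function names `edit_distance`, `wer` and `cer` are made up for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Note that an empty prediction, as described in the question, scores 1.0 on both metrics, because every reference word and character counts as a deletion.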
It is also likely you would need to perform transfer learning on the Persian dataset. I am assuming that the Persian dataset is written in Alefbā-ye Fārsi. This means that you need to drop the alphabet layer in order to learn from the English checkpoints (which use Latin script).
More information on how to perform transfer learning is in the DeepSpeech documentation, but essentially, you need to do two things:
Use the --drop_source_layers 3 flag to drop the source layers, to allow for transfer learning from another alphabet
Use the --load_checkpoint_dir deepspeech-data/deepspeech-0.9.3-checkpoint flag to specify where to load checkpoints from on which to perform transfer learning.

Maybe you need to decrease the learning rate or use a learning rate scheduler.
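To make the scheduler idea concrete, here is a framework-independent sketch of a "reduce on plateau" rule: the learning rate is cut when the validation loss has not improved for a few epochs. The class name `ReduceOnPlateau` and its parameters are hypothetical and not part of DeepSpeech; the loss values in the usage lines are invented for illustration.

```python
class ReduceOnPlateau:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr, factor=0.1, patience=3, min_lr=1e-6):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Call once per epoch with the validation loss; returns the learning rate to use next."""
        if val_loss < self.best:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.bad_epochs = 0
        return self.lr


# Invented validation losses that stall around 40, as in the question.
scheduler = ReduceOnPlateau(lr=1e-4, factor=0.5, patience=2)
for loss in [45.0, 42.0, 40.0, 40.5, 40.2, 40.1]:
    current_lr = scheduler.step(loss)
```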

Related

How to analyze and fix fluctuating validation accuracy and loss curves in binary image classification

I am training and evaluating binary image classifiers with transfer learning through the Keras API, and I'd like to compare the performance of several models (ResNet, Inception, Xception, VGG, EfficientNet). The dataset is split into train (approx. 2000 images), valid (approx. 250 images) and test (approx. 250 images).
I ran into a situation that is unfamiliar to me, so I have a couple of questions.
As shown below, the validation accuracy and loss fluctuate heavily from epoch to epoch.
I wonder what the problem is and what needs to be changed.
[Figures: epoch_acc_loss, loss_epoch, acc_epoch]
If I want to report validation accuracy as a single number, what should I report in the above case?
The average, the maximum or the minimum?
I am using Keras (TensorFlow). There are many examples in the API for train and valid, but code for test (evaluation?) is hard to find. When reporting performance, is it normal to stop at validation, or do I need to show the evaluation result as well?
I currently use the Keras API for transfer learning with these settings:
include_top=False
conv_base.trainable=False
Summary
I wonder whether transfer learning still has an effect without including the top, and if not, whether there is a way to freeze or train from a specific layer of conv_base.
I'm a beginner without much experience, so these may be naive questions, but I would appreciate any advice.
Thanks a lot in advance.
It's hard to pinpoint the problem without any code or model structure. From your loss graph I can see that your model is underfitting (or it has a lot of dropout). Common mistakes that make models underfit are a very high learning rate and an overly simple structure (so the model can't capture the dependencies in your data). And never forget the principle "garbage in, garbage out", so double-check your data for any structural irregularities.
The validation accuracy in your training logs is the mean accuracy over the validation set. Validation is based on statistics: you take a random N% of your set for validation, so the average is always the better choice when we're talking about multiple experiments (or cross-validation).
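As a small illustration of reporting an average over multiple experiments, the sketch below summarizes several runs as mean and standard deviation. The helper name `summarize_runs` is hypothetical, and the accuracy values are invented placeholders, not results from the question.

```python
from statistics import mean, stdev


def summarize_runs(val_accuracies):
    """Return (mean, sample standard deviation) of per-run validation accuracies."""
    if len(val_accuracies) < 2:
        raise ValueError("need at least two runs to estimate the spread")
    return mean(val_accuracies), stdev(val_accuracies)


# Invented final validation accuracies from five repeated runs or CV folds.
runs = [0.91, 0.88, 0.93, 0.90, 0.89]
avg, spread = summarize_runs(runs)
print(f"validation accuracy: {avg:.3f} +/- {spread:.3f}")
```

Reporting the spread next to the mean also shows how much of the fluctuation is just run-to-run noise.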
I'm not sure I've understood your question correctly here, but if you want to evaluate your model after training (the fit() call) with the metric you specified for it, you should use model.evaluate(val_x, val_y). Alternatively, you can use model.predict(val_x) and compare its results to val_y with your metric function.
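The second option can be sketched without any framework. Below, `y_prob` stands in for the output of model.predict(test_x) flattened to a plain list of probabilities, and `binary_accuracy` is a hand-written metric for illustration (not the Keras function of the same name). The values are invented.

```python
def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded probability matches the 0/1 label."""
    if len(y_true) != len(y_prob):
        raise ValueError("y_true and y_prob must have the same length")
    if not y_true:
        raise ValueError("need at least one example")
    correct = sum(int(p > threshold) == int(t) for t, p in zip(y_true, y_prob))
    return correct / len(y_true)


# Invented labels and predicted probabilities for four held-out images.
y_true = [1, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.6]
accuracy = binary_accuracy(y_true, y_prob)
```

The same call works for the test split, which is the number usually reported as final performance because it played no part in model selection.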
If you are using the default weights for Keras pretrained models (ImageNet weights) and you want to use your own fully-connected part with them, you can use ONLY the pretrained feature extractor (conv blocks). That is what include_top=False specifies. Of course there will be a positive effect (I'd say a significant one compared with randomly initialized weights), because the conv blocks have parameters that were trained to extract useful features from images. I would also recommend the so-called "fine-tuning" technique: freeze all layers in the pretrained part except a few at its end (maybe a few layers or even 2-3 conv blocks). Here's an example of fine-tuning EfficientNetB0:
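from tensorflow.keras.applications import EfficientNetB0

effnet = EfficientNetB0(weights="imagenet", include_top=False, input_shape=(540, 960, 3))
effnet.trainable = True
# Freeze every layer whose name contains neither 'block7a' nor 'top'.
for layer in effnet.layers:
    if 'block7a' not in layer.name and 'top' not in layer.name:
        layer.trainable = False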
Here I freeze all pretrained weights except those of the last conv block. I looked into the model with effnet.summary() and picked the names of the blocks that I wanted to unfreeze.

Question about finetuning model to increase number of classes w/additional data using Tensor Flow Custom Object Detection

Using Tensorflow's Custom Object Classification API w/ SSD MobileNet V2 FPNLite 320x320 as the base, I was able to train my model to successfully detect classes A and B using Training Data 1 (about 200 images). This performed well on Test Set 1, which only has images of class A and B.
I wanted to add several classes to the model, so I constructed a separate dataset, Training Data 2 (about 300 images). This dataset contains labeled data for class B and for the new classes C, D and E. However, it does NOT include data for class A. After training the model on this data, it performed well on Test Set 2, which contained only images of B, C, D and E (however, the accuracy on B did not go up despite the extra data).
Concerned, I checked the accuracy of the model on Test Set 1 again, and as I had assumed, the model didn't recognize class A at all. In this case I'm assuming I didn't actually refine the model but instead retrained the model completely.
My Question: Am I correct in assuming I cannot refine the model on a completely separate set of data, and instead if I want to add more classes to my trained model that I must combine Training Set 1 and Training Set 2 and train on the entirety of the data?
Thank you!
It mostly depends on your hyperparameters, namely your learning rate and the number of epochs trained. Higher learning rates will make the model forget the old data faster. Also, make sure you are not overfitting your data, and keep a validation set as well. Models that have overfit the training data tend to be very sensitive to weight (and data) perturbations.
TLDR. If not trained on all data, ML models tend to forget old data in favor of new data.
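One way to avoid the forgetting described above is to train on both sets together. The sketch below merges two datasets and builds one label map covering all classes. It is heavily simplified: each example is an (image_path, class_name) pair rather than a full set of bounding-box annotations, the helper `merge_datasets` and the file names are hypothetical, and the ids start at 1 on the assumption that id 0 is reserved for the background class.

```python
import random


def merge_datasets(*datasets, seed=0):
    """Concatenate datasets of (image_path, class_name) pairs and build a shared label map."""
    merged = [example for dataset in datasets for example in dataset]
    class_names = sorted({class_name for _, class_name in merged})
    # Ids start at 1; 0 is assumed to be reserved for "background".
    label_map = {name: idx for idx, name in enumerate(class_names, start=1)}
    # Shuffle so every batch mixes old and new classes.
    random.Random(seed).shuffle(merged)
    return merged, label_map


# Hypothetical file names standing in for Training Data 1 and Training Data 2.
training_data_1 = [("img_001.jpg", "A"), ("img_002.jpg", "B")]
training_data_2 = [("img_201.jpg", "B"), ("img_202.jpg", "C"),
                   ("img_203.jpg", "D"), ("img_204.jpg", "E")]
all_examples, label_map = merge_datasets(training_data_1, training_data_2)
```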
There are a lot of "moving parts". I propose the following:
Take "SSD MobileNet V2 FPNLite 320x320" as a base model without its last classification layer (argument include_top=False when loading the model), and freeze its parameters with basemodel.trainable=False
Add a new prediction layer with prediction_layer=tf.keras.layers.Dense(1) and complete the other required steps (step-by-step details at https://www.tensorflow.org/tutorials/images/transfer_learning)
After the procedure above, verify that you understand which parameters of the new network (including the "old" convolutional part and your own new prediction layer) are trainable and which are not. Change the hyperparameters if needed.
Next, train the network using the standard procedure.
Use the final number of classes you have in mind (25) from the start. If you do not yet have data for all classes, do not worry: generate some random images as placeholders, and of course keep in mind that the results are not valid for the classes without proper data.
For simplicity, split the data, independently of the number of classes, into training and test sets and nothing more complicated at first. As the amount of data increases, the statistics will reduce sampling problems. While training, monitor how the growing amount of data improves classification performance.
So - in a nutshell - 1) make the network - 2) select which parameters to train - 3) train with one dataset and 4) test with another.
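The simple training/test division mentioned above can be sketched as a seeded random split. The helper name `train_test_split` and the 80/20 ratio are assumptions for illustration; the answer does not prescribe a ratio.

```python
import random


def train_test_split(examples, test_fraction=0.2, seed=0):
    """Shuffle a copy of `examples` and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = int(round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


# 500 placeholder example ids, roughly the combined size of both training sets.
train_set, test_set = train_test_split(range(500))
```

Fixing the seed keeps the split stable between runs, so performance changes can be attributed to the added data rather than to a different split.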
And finally, the direct answer to the question in the title and at the end of the question:
-From experience, first extract all the performance you can from the base model by training only the last layers of the network. Once you are sure no more performance can be gained this way, begin to fine-tune the convolutional layers, tuning the hyperparameters carefully.
-You can refine the model entirely using only your own new data; this is the special benefit and art of transfer learning

How to solve classification problem where Dataset amount is low and features between two classes are similar/confusable

I am currently training a CNN to classify ICs into the classes "scratch" and "no scratch" (binary classification). I am fairly new to deep learning, and after training my CNN for a while I got very good accuracies (good validation accuracy as well). But I quickly learned that my models were not as good as I thought, because on a separate test dataset they produced quite a lot of misclassifications (false positives and false negatives). In my opinion there are 2 problems:
There is too little training data (about 1000 images per class)
The ICs have markings (text) on them, which change with every batch, so my training data has images of ICs with varying markings. And since some batches have more scratched ICs and others have fewer or none, the number of IC images with different markings is unbalanced
Here are two example images of 2 ICs from the training set of the class "scratch":
As you can see, the text varies a lot. Every line has different characters, and the number of characters also varies.
I ask myself how the CNN is supposed to tell a scratch from a character.
Nevertheless, I am trying to train my CNN, and this is, for example, one model I recently trained (the other models look quite similar):
There are some points during training where the validation accuracy goes up and then down again. What could that mean? I think there may be a feature in the validation set that is not covered by my training set. Could this be the cause?
As you can see, data augmentation is not an option (or so I think) because of the text. One thing that came to mind is to separate the marking from the IC (cut out the text region) with preprocessing (I don't know how to do that properly and fast) and then classify them separately, but I don't know if this would be the right approach.
I first used VGG16, ResNet and InceptionV3 (with transfer learning). Now I tried to train my custom cnn (inspired by VGG but with 10 layers similar to this: https://journals.sagepub.com/doi/full/10.1177/1558925019897396)
Do you guys know how I should approach this problem or do you have any tips?

Drop inactive features in Keras

I'm building a Sequential NN model in Keras for binary classification. The training data has about 600,000 rows and 2,000 features, so every epoch and every layer is very time-consuming. I believe many of the features are not relevant to the model and can be dropped altogether to make the model thinner, so that it is faster to work with.
I run a simple model with one hidden-layer of 200 neurons. How can I tell which of the features (which are actually the nodes in the input layer) are meaningless, so I could drop them from the data set and re run the model without them?
Feature selection is a very big topic in machine learning. That said, neural networks are considered to choose the best features for the problem automatically, to an extent, by using their weights to give some features more or less influence. Neural networks also take a lot of experience to tune correctly. I would definitely suggest adding layers to the network, because you have a lot of data and features, and using L1 regularisation to get sparse weights and exclude most of the features. This advice is only indicative, since I do not know anything about your dataset or your network architecture. Finally, I would suggest studying the basics of machine learning further and then continuing with neural networks before practicing with real data.
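Once a model has been trained with L1 regularisation, one rough way to spot unused inputs is to look at the first layer's weights: a feature whose outgoing weights are all close to zero has little influence on the network. The sketch below assumes the weight matrix is given as a nested list with one row per input feature (the layout of a Keras Dense kernel, which could be obtained with something like model.layers[0].get_weights()[0].tolist()). The helper name `inactive_features`, the tolerance and the toy weights are made up for illustration.

```python
def inactive_features(first_layer_weights, tol=1e-3):
    """Return indices of input features whose outgoing weights sum (in absolute value) to < tol.

    first_layer_weights[i][j] is the weight from input feature i to hidden unit j.
    """
    l1_norms = [sum(abs(w) for w in row) for row in first_layer_weights]
    return [i for i, norm in enumerate(l1_norms) if norm < tol]


# Toy 4-feature, 3-unit weight matrix; features 1 and 3 were driven to ~0.
weights = [
    [0.40, -0.25, 0.10],
    [0.00, 0.0001, 0.00],
    [-0.30, 0.05, 0.20],
    [0.0002, 0.00, -0.0001],
]
to_drop = inactive_features(weights)
```

This is only a heuristic: a small weight on a feature with a very large scale can still matter, so the inputs should be standardized first and the reduced model re-validated.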

NASNet-A fine tuning poor validation accuracy

I have a dataset of roughly 34000 images divided into 2 sets: train (30000 images) and validation (4000 images). Each image is the difference between two images taken from a video (the time offset between the images in each pair is about 1 second). The videos have a static background, so the diff images are mostly black with only one or two small colored regions. Each diff image has a label (there has been an action or not: 1 or 0), so this is a form of binary classification.
Briefly, I'm using the slim models pretrained on ImageNet to fine-tune on my dataset. I launched 5 separate training runs using 5 different networks: InceptionV4, InceptionResnetV2, Resnet152, NASNet-mobile and NASNet. I got very good results with the first 4 networks, but not with NASNet. The area under the ROC curve on the validation set is always 0.5, and the logits of the validation images all have roughly the same values, which is really weird. In fact, I got the same kind of results with NASNet-mobile during the first 10000 mini-batches, but after that the model did converge. Here are the hyperparameter values in my script:
batch_size=10
weight_decay = 0.00004
optimizer = rmsprop
rmsprop_momentum = 0.9
rmsprop_decay = 0.9
learning_rate_decay_type = exponential
learning_rate = 0.01
learning_rate_decay_factor = 0.94
num_epochs_per_decay = 2.0 #'Number of epochs after which learning rate
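To make these settings concrete, the sketch below computes the learning rate they would produce at a given training step. It assumes a staircase exponential decay (the rate drops in discrete jumps every num_epochs_per_decay epochs) and uses the 30000 training images and batch size 10 from the question. The function name `exponential_decay_lr` is hypothetical, and whether the training script really applies the decay in staircase form is an assumption.

```python
def exponential_decay_lr(step, base_lr=0.01, decay_factor=0.94,
                         num_train_images=30000, batch_size=10,
                         epochs_per_decay=2.0, staircase=True):
    """Learning rate after `step` mini-batches under exponential decay."""
    # Number of mini-batches between two decay events.
    decay_steps = int(num_train_images * epochs_per_decay / batch_size)
    exponent = step // decay_steps if staircase else step / decay_steps
    return base_lr * decay_factor ** exponent


# Learning rate at the start, after 2 epochs and after 20 epochs.
steps_per_epoch = 30000 // 10
schedule = [exponential_decay_lr(epoch * steps_per_epoch) for epoch in (0, 2, 20)]
```

Under these assumptions the rate only falls from 0.01 to 0.0094 after the first 6000 mini-batches, so the first few epochs run at close to the full initial rate.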
I'm still a newbie in TensorFlow and did not find anything related anywhere else. This is really weird behavior because I'm using the same parameters and the same inputs, yet there seems to be a problem somewhere when using NASNet. I'm not only looking for a solution (if one is possible; I know it is tough to troubleshoot such things without many details about the model): insights about where to look and how to troubleshoot would also be great. Has anybody had this problem with fine-tuning NASNet before, for example the model not converging? Finally, I know it is really hard to get answers to such questions, but I hope to get at least some insights so I can move forward with my investigation.
EDIT:
Here are the plots of the cross entropy and regularization losses:
EDIT:
As proposed in the answer, I did set the drop_path_keep_prob params to 1 and now the model converged and I got good accuracy on the validation set. But now the question is: what does this param mean? Is it one of the params that we should adapt to our dataset (like learning rate etc..)?
The simplest sanity check you can do would be to run the finetuning on a single minibatch. Any deep network should be able to overfit to that, if there aren't any big problems. If you see that it can't do that, then there must be some problem with the definition, or the way you're using the definition.
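The idea of the single-minibatch check can be shown on a toy model. The sketch below is a stand-in, not NASNet or TensorFlow code: it trains a tiny logistic regression on one fixed batch of four examples and records the loss, which should fall towards zero if the training loop is wired correctly. The function name `overfit_single_batch` and the data are invented.

```python
import math


def overfit_single_batch(features, labels, steps=2000, lr=0.5):
    """Run gradient descent on one fixed batch and return the loss after each step."""
    n, n_features = len(features), len(features[0])
    weights, bias = [0.0] * n_features, 0.0
    losses = []
    for _ in range(steps):
        grad_w, grad_b, loss = [0.0] * n_features, 0.0, 0.0
        for x, y in zip(features, labels):
            z = sum(w * xi for w, xi in zip(weights, x)) + bias
            p = 1.0 / (1.0 + math.exp(-z))
            # Binary cross-entropy, with a small epsilon for numerical safety.
            loss -= y * math.log(p + 1e-12) + (1 - y) * math.log(1 - p + 1e-12)
            for k in range(n_features):
                grad_w[k] += (p - y) * x[k]
            grad_b += p - y
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
        losses.append(loss / n)
    return losses


# One fixed mini-batch: the label equals the first feature, so it is learnable.
batch_x = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
batch_y = [0, 0, 1, 1]
losses = overfit_single_batch(batch_x, batch_y)
```

If the real network's loss on a single repeated mini-batch stays flat instead of dropping like this, the problem is more likely in the model definition or input pipeline than in the hyperparameters.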
The only guess I have in your case is that it could be something to do with the drop_path implementation. It's disabled in the mobile version, but it is enabled during training on the large model. It could make the model unstable enough that it wouldn't fine tune, so it may be worth trying to train with it disabled.
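For context on what the drop_path_keep_prob parameter controls, here is a plain-Python sketch of the general drop-path idea: during training, the output of a path is zeroed for a random subset of examples and the surviving ones are scaled up by 1 / keep_prob. This is an illustration of the technique, not the slim NASNet code; the function `drop_path` and its arguments are hypothetical.

```python
import random


def drop_path(path_outputs, keep_prob, training=True, rng=random):
    """Randomly zero whole per-example path outputs; rescale the kept ones by 1 / keep_prob.

    path_outputs is a list with one list of activations per example in the batch.
    """
    if not training or keep_prob >= 1.0:
        # With keep_prob = 1 (or at inference time) the path is left untouched.
        return [list(example) for example in path_outputs]
    result = []
    for example in path_outputs:
        keep = rng.random() < keep_prob
        scale = 1.0 / keep_prob if keep else 0.0
        result.append([value * scale for value in example])
    return result


batch = [[1.0, 2.0], [3.0, 4.0]]
unchanged = drop_path(batch, keep_prob=1.0)
noisy = drop_path(batch, keep_prob=0.5, rng=random.Random(0))
```

Setting the keep probability to 1 turns the function into the identity, which matches the edit in the question where setting drop_path_keep_prob to 1 let the model converge.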
