I'm new to neural networks and I am doing a project in which I have to define a NN and train it. I've defined a NN with 2 hidden layers of 17 units each; the network as a whole has 21 inputs and 3 outputs.
I have a dataset of 10 million samples with 10 million corresponding labels. My first issue is the size of the validation set and the training set. I'm using PyTorch with batches, and from what I've read, the batches shouldn't be too large, but I don't know approximately how large the sets should be.
I've tried larger and smaller numbers, but I cannot find a correlation that tells me whether I'm right to choose a large or a small set for either of them (apart from the time it takes to process a very large set).
My second issue is the training and validation loss, which I've read can tell me whether I'm overfitting or underfitting depending on which one is bigger. Ideally both should have about the same value, and this also depends on the number of epochs. But I am not able to tune the network parameters, such as batch size and learning rate, or to choose how much data to use for training and validation. If I use 80% of the set (8 million samples), it takes hours to finish, and I'm afraid that if I choose a smaller dataset, it won't learn.
If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.
Thanks!
For your first question, about batch size: there is no fixed rule for what value it should have. You have to try different values and see which one works best. Once your NN starts performing badly at some batch size, don't go further above or below that value. There is no hard rule here to follow.
For your second question: first of all, having the same training and validation loss doesn't mean your NN is performing well. It is just an indication that, if that is the case, its performance on a test set will probably be comparable; it also depends on many other things, such as the distributions of your train and test sets.
And with NNs you need to try as many things as you can: different parameter values, different train/validation split sizes, etc. You cannot just assume that something won't work.
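To make the split concrete, here is a minimal PyTorch sketch of an 80/20 train/validation split with DataLoaders. The 0.8 ratio, the batch size of 1024, and the dummy tensors are just illustrative starting points to experiment with, not values from the question.

import torch
from torch.utils.data import DataLoader, TensorDataset, random_split

# Dummy stand-ins for the real 10M-sample dataset (21 features, 3 classes).
X = torch.randn(10_000, 21)
y = torch.randint(0, 3, (10_000,))
dataset = TensorDataset(X, y)

# 80/20 split; adjust the ratio and watch how training and validation loss react.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=1024, shuffle=True)
val_loader = DataLoader(val_set, batch_size=1024)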
I'm trying to build a convolution-based model. I trained two different structures, as shown below. As you can see, for the single-layer model there isn't any obvious change over the epochs. The bi-layer Conv2D model shows improving accuracy and loss on the training set, but the validation curves are a tragedy.
Given that I can't increase my dataset, what should I do to improve the validation metrics?
I've tried L1 and L2 regularization, but neither affected my model.
1) You can use an adaptive learning rate (exponential decay or a step-dependent schedule may work for you); see the sketch below. You can also try a temporarily very high learning rate when your model gets stuck in a local minimum.
2) If you are training on images, you can flip, rotate, or otherwise transform them to increase your dataset size; other augmentation techniques might also work for your case.
3) Try changing the model: deeper, shallower, wider, narrower.
4) If you are building a classification model, make sure you are not using sigmoid as the final activation unless you are doing binary classification.
5) Always check the state of your dataset before a training session:
Your train-test split may not be suitable for your case.
There might be extreme noise in your data.
Some of your data might be corrupted.
Note: I will update this list whenever a new idea comes to mind. I didn't want to repeat the comments and other answers; both contain valuable information for your case.
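As a rough illustration of points 1) and 2), here is a minimal tf.keras sketch of a step-decay learning-rate schedule and simple image augmentation. The decay factor, the every-20-epochs schedule, and the augmentation layers are assumptions for illustration (the augmentation layers need TF 2.6+), not the asker's actual setup.

import tensorflow as tf

# Point 1: step-dependent decay -- divide the learning rate by 10 every 20 epochs.
def step_decay(epoch, lr):
    return lr * 0.1 if epoch > 0 and epoch % 20 == 0 else lr

lr_callback = tf.keras.callbacks.LearningRateScheduler(step_decay)

# Point 2: random flips and rotations to effectively enlarge a small image dataset.
augment = tf.keras.Sequential([
    tf.keras.layers.RandomFlip("horizontal"),
    tf.keras.layers.RandomRotation(0.1),
])

# Hypothetical usage with an existing model and tf.data pipeline:
# model.fit(train_ds.map(lambda x, y: (augment(x, training=True), y)),
#           epochs=100, callbacks=[lr_callback])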
The validation curves become a tragedy because the model is overfitting the training data. You can try the following and see if either helps:
1) Batch normalisation would be a good option to go with (see the sketch below).
2) Try reducing the batch size.
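A minimal tf.keras sketch of suggestion 1), inserting BatchNormalization after a convolution; the layer sizes and the 8-way softmax output are placeholders, not the asker's actual model.

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, padding="same", input_shape=(64, 64, 1)),
    tf.keras.layers.BatchNormalization(),   # normalizes activations over each batch
    tf.keras.layers.ReLU(),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(8, activation="softmax"),
])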
I tried a variety of models known to work well on small datasets, but as I suspected, and as is my ultimate verdict - it is a lost cause.
You don't have nearly enough data to train a good DL model, or even an ML model like an SVM; matters are exacerbated by having eight separate classes. Your dataset would stand some chance with an SVM for binary classification, but none for 8-class. As a last resort, you can try XGBoost, but I wouldn't bet on it.
What can you do? Get more data. There's no way around it. I don't have an exact number, but for 8-class classification I'd say you need anywhere from 50-200x your current data to get reasonable results. Mind also that your validation performance is bound to be worse on a bigger validation set, which is accounted for in this number.
For readers, OP shared his dataset with me; shapes are: X = (1152, 1024, 1), y = (1152, 8)
I already have a model that has been trained on 130,000 sentences.
I want to categorize sentences with a bidirectional LSTM.
We plan to offer this as a service; however, the model must keep being trained while the service is running.
So my plan is this: until the model's accuracy is high enough, I will look at the sentences the model has categorized, correct the answers myself, and train the model on those corrected sentence/answer pairs.
Is there a difference between training on the sentences one by one, each time a new sentence comes in, and merging them into one file and training on them all at once? Does it matter?
Yes, there is a difference. Suppose, you have a dataset of 10,000 sentences.
If you train on one sentence at a time, then an optimization step (backpropagation) takes place for every single sentence. This consumes more time, is not a good choice, and is impractical if you have a large dataset. The gradient computed on a single instance is very noisy, so convergence is slow.
If you train in batches, say with a batch size of 1,000, then you have 10 batches. Each batch goes through the network and the gradient is computed over the whole batch, so there is one update per batch. Mini-batch gradients are far less noisy than single-sample gradients, yet still noisy enough to help escape poor local minima. Batch training is also memory efficient and converges faster.
You can check out answers from here, here and here.
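To make the difference concrete, here is a minimal PyTorch sketch in which the same 10,000 samples are trained once with batch_size=1 (10,000 optimizer steps per epoch) and once with batch_size=1000 (10 steps per epoch). The linear model and random tensors are placeholders, not the asker's bidirectional LSTM.

import torch
from torch.utils.data import DataLoader, TensorDataset

X = torch.randn(10_000, 50)          # 10,000 "sentences" as fixed-size vectors
y = torch.randint(0, 2, (10_000,))   # dummy labels
model = torch.nn.Linear(50, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

for batch_size in (1, 1000):         # one-by-one vs. merged into batches
    loader = DataLoader(TensorDataset(X, y), batch_size=batch_size, shuffle=True)
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()             # one gradient update per batch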
https://github.com/wenxinxu/resnet-in-tensorflow#overall-structure
The link above is the ResNet model for CIFAR-10.
I am modifying the above code to do object detection, using ResNet and CIFAR-10 as the training/validation dataset. (I know the dataset is meant for object classification.) I know it sounds strange, but hear me out: I use CIFAR-10 for training and validation, and then during testing I use a sliding-window approach and classify each window into one of the 10 classes plus a "background" class.
For the background class, I used images from ImageNet. I searched ImageNet with the following keywords: construction, landscape, byway, mountain, sky, ocean, furniture, forest, room, store, carpet, and floor. Then I cleaned out bad images as much as I could, including images that contain CIFAR-10 classes; for example, I deleted a few "floor" images that have dogs in them.
I am currently running this on FloydHub. The total number of steps I am running is 60,000, which is where the "training curve" section of the link above suggests the results start to consolidate and stop converging further (I have run this code myself and can back up that claim).
My question is:
What causes the sudden step down in the training and validation curves, which occurs at about the same step?
Is it possible for the training and validation curves not to converge in a step-like fashion at about the same step? What I mean is, for example, could training step down at around 40,000 while validation just converges smoothly, with no step-down?
The sudden step down is caused by the learning rate decay that happens at 40k steps (you can find this parameter in hyper_parameters.py). The learning rate is suddenly divided by 10, which allows you to tune the parameters more precisely and, in this case, improves your performance a lot. You still need the first part, with a fairly large learning rate, to get into a "good" area of parameter space; the part with a 10x smaller learning rate then refines it and finds a very good spot in that area for your parameters.
That would be surprising, since there is a clear difference before and after 40k that affects training and validation in the same way. You could still see different behavior from that point on: for instance, you might start overtraining because the learning rate is now very small, and see your training error drop while validation error goes up, because the refinements you're making are too specific to the training data.
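For reference, the kind of schedule described above can be written as a piecewise-constant decay. This is a generic tf.keras sketch of dropping the learning rate by 10x at step 40,000, not the repo's own code (which configures the same idea through hyper_parameters.py).

import tensorflow as tf

lr_schedule = tf.keras.optimizers.schedules.PiecewiseConstantDecay(
    boundaries=[40_000],    # the step at which the drop happens
    values=[0.1, 0.01])     # learning rate before and after the drop

optimizer = tf.keras.optimizers.SGD(learning_rate=lr_schedule, momentum=0.9)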
This question is related to convolutional neural networks (especially YOLOv3).
Since one epoch is one forward pass and one backward pass over all the training examples, for the model to converge properly, is it the same (in terms of precision and time to converge) to:
train with n*k images for m epochs, or
train with n images for m*k epochs?
You will generally get a better model using n*k images for m epochs; otherwise you are very prone to overfitting.
There are also many papers that research this area (why more data seems to always be better), e.g. this one.
I'd recommend training on all available data (minus a test and validation set) for as long as the model has not converged and there is no consistent worsening of the test metric (which would indicate that you are probably overfitting the training data).
No, they are not the same.
*The number of examples you show the network determines what it will be looking for: a network trained on more examples will tend to be more general. If there are, for example, 1000 pictures of different dogs and you only show 300 out of 300,000 pictures, the network (on average) will only recognize one specific kind of dog and be unable to pick out the traits common to all dogs.
*An epoch basically modifies the network in small steps, and the key word here is small: taking steps that are too big risks overshooting the target values for the network parameters. Since we're taking small steps, we have to take many of them to get where we want.
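A tiny worked example of the update count in the two schedules (the numbers and the batch size are made up for illustration): both perform the same number of gradient updates, so the difference comes from the variety of images seen, not from extra optimization steps.

n, k, m, batch = 9_600, 4, 30, 64

updates_more_data   = (n * k // batch) * m       # n*k images for m epochs
updates_more_epochs = (n // batch) * (m * k)     # n images for m*k epochs
print(updates_more_data, updates_more_epochs)    # 18000 18000 -- identical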
I have a dataset of 3372149 rows, which I batch in groups of 3751 rows, as shown in the code below:
train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_features_numpy},
    y=train_labels_numpy,
    batch_size=3751,
    num_epochs=1,
    shuffle=False)

# Train
nn.train(input_fn=train_input_fn)  # , steps=10000)
If I set num_epochs=1 as in the code above, it means the training process goes through the dataset once, right? That gives a total of 3372149 / 3751 = 899 steps.
If I instead uncomment the "steps=10000" part and set num_epochs=None, training is forced to run all the way to step 10,000.
I have two questions then:
Since I only have 899 batches of data but set the step count to 10,000, what is TensorFlow training on after step 899? Does it just go back to the top of the dataset and repeat the training?
If I train for more than 899 steps, will that mess up the model relating the features and labels? Or is it redundant, since the training loop just goes over the same dataset again and again?
I asked about the loss not decreasing during training in my other posts, and I am now wondering whether I have too little data to train on, so that all the extra steps are useless.
Iterating over a dataset many times is quite common and normal. Each "step" of your model (that is each batch) takes one gradient update step. In intuitive terms it has taken one step towards the goal in the direction dictated by that mini batch. It does NOT learn everything it can about a particular sample by seeing it once, it just takes a step closer to the goal, and how big a step is dictated by the learning rate (and other more complex factors). If you cut your learning rate in half you'd need twice as many steps to get there. Notice how that had nothing to do with epochs, just "update steps" (aka batches).
The typical way of knowing when it's time to stop is to plot test-set accuracy over time as you train your model. It is certainly possible that your model will begin to overfit at some point. If it does, test accuracy will start to get worse, and that is an obvious stopping point.
Also note that batches of data are not sequential; each batch is randomly selected by permuting the data (when shuffling is enabled). The next pass through the dataset will therefore end up with different batches, and each of those batches will produce a different gradient update. So even going through the dataset twice will not produce the same set of updates in each epoch.
I don't actually know the answer to question #1 because I don't use the estimator API much, but I'm 90% sure it simply permutes the samples and iterates through them again after each epoch. That's the most common approach.
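For completeness, here is a sketch of how the question's own snippet is usually written when you want to iterate over the data many times; it reuses the same TF 1.x Estimator input function and the same variables as in the question, with only num_epochs, shuffle, and steps changed. num_epochs=None repeats the dataset indefinitely, steps caps training, and shuffle=True gives a different permutation on each pass.

train_input_fn = tf.estimator.inputs.numpy_input_fn(
    x={"x": train_features_numpy},
    y=train_labels_numpy,
    batch_size=3751,
    num_epochs=None,   # repeat the dataset until `steps` is reached
    shuffle=True)      # reshuffle so each pass yields different batches

nn.train(input_fn=train_input_fn, steps=10000)   # ~11 passes over the 899 batches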