Keras - Using large numbers of features - python

I'm developing a Keras NN that predicts a label from 20,000 features. I can build the network, but I have to use system RAM because the model is too large to fit on my GPU, so training has taken days on my machine. The input currently has shape (500, 20000, 1) and the output (500, 1, 1).
-I'm using 5,000 nodes in the first fully connected (Dense) layer. Is this sufficient for the number of features?
-Is there a way of reducing the dimensionality so as to run it on my GPU?

I suppose each input entry has size (20000, 1) and you have 500 entries that make up your dataset?
In that case you can start by reducing the batch_size, but I suppose you also mean that even the network weights don't fit in your GPU memory. In that case the only thing (that I know of) that you can do is dimensionality reduction.
You have 20000 features, but it is highly unlikely that all of them are important for the output value. With PCA (Principal Component Analysis) you can check the importance of all your features, and you will probably see that a small combination of them explains 90% or more of the variance of the data. In that case you can disregard the unimportant ones and build a network that predicts the output based on, say, only 1000 (or even fewer) features.
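As a minimal sketch of that idea (assuming scikit-learn, with random data standing in for your real feature matrix):

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data shaped like the question: 500 samples x 20000 features.
X = np.random.rand(500, 20000).astype("float32")

# Keep just enough components to explain 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (500, k) with k << 20000
print(pca.explained_variance_ratio_.sum())  # >= 0.90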
An important note: the only reason I can think of where you would need that many features is if you are dealing with an image, a spectrum (you can see a spectrum as a 1D image), ... In that case I recommend looking into convolutional neural networks. They are not fully connected, which removes a lot of trainable parameters while probably performing even better.
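For instance, a hedged sketch of a small Conv1D network for (20000, 1) inputs; the layer sizes here are placeholders, but the parameter count is orders of magnitude below a first Dense layer with 5000 nodes:

from tensorflow import keras
from tensorflow.keras import layers

# A 1-D convolutional alternative to a Dense-first network, assuming
# inputs of shape (20000, 1) and a single output value per sample.
model = keras.Sequential([
    layers.Input(shape=(20000, 1)),
    layers.Conv1D(16, kernel_size=9, strides=4, activation="relu"),
    layers.MaxPooling1D(4),
    layers.Conv1D(32, kernel_size=9, strides=4, activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.summary()  # a few thousand parameters instead of ~100 million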

Related

Training & Validation loss and dataset size

I'm new to neural networks and I'm doing a project that requires defining a NN and training it. I've defined a NN with 2 hidden layers of 17 nodes each; the network has 21 inputs and 3 outputs.
I have a dataset of 10 million samples and a matching set of 10 million labels. My first issue is the size of the validation and training sets. I'm using PyTorch with batches, and from what I've read, batches shouldn't be too large. But I don't know approximately how large the sets should be.
I've tried larger and smaller numbers, but I cannot find a correlation that tells me whether I'm right to choose a large or a small set for either of them (apart from the time it takes to process a very large set).
My second issue is about the training and validation loss, which I've read can tell me whether I'm overfitting or underfitting depending on which is bigger. Ideally both should have the same value, and this also depends on the epochs. But I'm unable to tune the network parameters like batch size and learning rate, or to choose how much data to use for training and validation. If I use 80% of the set (8 million samples), it takes hours to finish, and I'm afraid that if I choose a smaller dataset, it won't learn.
If anything is badly explained, please feel free to ask me for more information. As I said, the data is given, and I only have to define the network and train it with PyTorch.
Thanks!
For your first question about batch size: there is no fixed rule for what value it should have. You have to try and see which one works best; when your NN starts performing badly, don't push the batch size further above or below that value. There is no hard rule to follow here.
For your second question: first of all, having the same training and validation loss doesn't mean your NN is performing well. It is just an indication that, if that is the case, its performance on a test set will probably be good enough, but it also depends heavily on other things, like your train and test set distributions.
And with NNs you need to try as many things as you can: different parameter values, different train/validation split sizes, etc. You cannot just assume something won't work.
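A minimal sketch of such an experiment with PyTorch (random tensors stand in for the real data, scaled down from 10 million samples; the split ratio and batch sizes are the knobs to vary):

import torch
from torch.utils.data import TensorDataset, DataLoader, random_split

# Stand-in data with 21 inputs and 3 outputs per sample, as in the question
# (shrunk to 100k samples so the sketch runs quickly).
X = torch.randn(100_000, 21)
y = torch.randn(100_000, 3)
dataset = TensorDataset(X, y)

# An 80/20 train/validation split; treat the ratio as a hyperparameter too.
n_train = int(0.8 * len(dataset))
train_set, val_set = random_split(dataset, [n_train, len(dataset) - n_train])

train_loader = DataLoader(train_set, batch_size=256, shuffle=True)
val_loader = DataLoader(val_set, batch_size=1024)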

Can we normalize features extracted from pre-trained models

I am working with features extracted from pre-trained VGG16 and VGG19 models. The features have been extracted from the second fully connected layer (FC2) of those networks.
The resulting feature matrix (of dimensions (8000, 4096)) has values in the range [0, 45]. As a result, when I use this feature matrix in gradient-based optimization algorithms, the loss function, gradients, norms, etc. take very high values.
To do away with such high values, I applied MinMax normalization to this feature matrix, and since then the values are manageable and the optimization algorithm behaves properly. Is my strategy OK, i.e. is it fair to normalize features extracted from a pre-trained model for further processing?
From experience, as long as you are aware that your results come from normalized values, it is okay. If normalization helps you show gradients, norms, etc. better, then I am all for it.
What I would be cautious about, though, is any further analysis on those feature matrices, as they hold normalized rather than true values. If you were just to study the distributions and such, you should be fine, but I am not sure what your next step is and whether this can/will be harmful.
Can you share more details around "further analysis"?
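For reference, a minimal sketch of the MinMax step under discussion (assuming scikit-learn, with random values standing in for the FC2 features); fitting the scaler on the training portion only keeps later analysis honest:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Stand-in FC2 feature matrix like the one in the question:
# 8000 samples x 4096 features with values roughly in [0, 45].
features = np.random.uniform(0, 45, size=(8000, 4096)).astype("float32")

# Fit on the training split only, then reuse the same scaler everywhere,
# so train and test features end up on the same [0, 1] scale.
train, test = features[:6000], features[6000:]
scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)
test_scaled = scaler.transform(test)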

Keras: Overfitting Conv2D

I'm trying to build a convolution-based model. I trained two different structures, as follows. As you can see, for the single-layer model there isn't any obvious change across epochs. The two-layer Conv2D model shows improving accuracy and loss on the training dataset, but the validation metrics are a tragedy.
Given that I can't increase my dataset, what should I do to improve the validation metrics?
I've tried L1 and L2 regularizers, but they didn't affect my model.
1) You can use an adaptive learning rate (exponential decay or a step schedule may work for you). Furthermore, you can try extremely high learning rates when your model gets stuck in a local minimum.
2) If you are training with images, you can flip, rotate, or otherwise transform them to increase your dataset size, and other augmentation techniques might work for your case as well (see the sketch after this list).
3) Try changing the model: deeper, shallower, wider, narrower.
4) If you are building a classification model, make sure you are not using sigmoid as the final activation function unless you are doing binary classification.
5) Always check the state of your dataset before a training session:
Your train-test split may not be suitable for your case.
There might be extreme noise in your data.
Some of your data might be corrupted.
Note: I will update this list whenever a new idea comes to mind. Furthermore, I didn't want to repeat the comments and the other answers; both contain valuable information for your case.
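Here is the augmentation sketch promised in point 2, using Keras' ImageDataGenerator; the image shapes and labels are hypothetical stand-ins, so adjust them to your data:

import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Stand-in training images and one-hot labels.
x_train = np.random.rand(1152, 64, 64, 1)
y_train = np.eye(8)[np.random.randint(0, 8, size=1152)]

# Random rotations, shifts and flips yield new variants every epoch,
# which partly substitutes for a bigger dataset.
datagen = ImageDataGenerator(
    rotation_range=15,
    width_shift_range=0.1,
    height_shift_range=0.1,
    horizontal_flip=True,
)
# model.fit(datagen.flow(x_train, y_train, batch_size=32), epochs=50)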
The validation becomes a tragedy because the model is overfitting the training data. You can try the following and see if any of it works:
1) Batch normalization would be a good option to go with (sketched below).
2) Try reducing the batch size.
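A minimal sketch of point 1; the architecture is hypothetical, the pattern is Conv2D -> BatchNormalization -> activation:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    layers.Input(shape=(64, 64, 1)),
    layers.Conv2D(32, 3, padding="same"),
    layers.BatchNormalization(),   # normalizes activations per batch
    layers.Activation("relu"),
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, padding="same"),
    layers.BatchNormalization(),
    layers.Activation("relu"),
    layers.GlobalAveragePooling2D(),
    layers.Dense(8, activation="softmax"),
])
model.compile(optimizer="adam", loss="categorical_crossentropy")
# model.fit(x_train, y_train, batch_size=16, ...)  # point 2: small batches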
I tried a variety of models known to work well on small datasets, but as I suspected, and as is my ultimate verdict - it is a lost cause.
You don't have nearly enough data to train a good DL model, or even an ML model like an SVM - and matters are exacerbated by having eight separate classes; your dataset would stand some chance with an SVM for binary classification, but none for 8-class. As a last resort, you can try XGBoost, but I wouldn't bet on it.
What can you do? Get more data. There's no way around it. I don't have an exact number, but for 8-class classification I'd say you need anywhere from 50-200x your current data to get reasonable results. Bear in mind also that your validation performance is bound to be much worse on a bigger validation set, which is accounted for in this number.
For readers, OP shared his dataset with me; shapes are: X = (1152, 1024, 1), y = (1152, 8)

Increasing accuracy by changing batch-size and input image size

I am extracting a road network from satellite imagery. The pixel classification is binary (0 = non-road, 1 = road). The mask of the complete satellite image, which is 6400 x 6400 pixels, shows one large road network where each road is connected to another road. For the U-net implementation I divided that large image into 625 images of 256 x 256 pixels.
My question is: can a neural network find structure more easily with an increased batch size (i.e. can it find structure across the different patches in a batch), or can it only find structure if the input image size is enlarged?
If your model is a regular convolutional network (without any weird hacks), the samples in a batch will not be connected to each other.
Depending on which loss function you use, the batch size might matter too. The regular losses ('mse', 'binary_crossentropy', 'categorical_crossentropy', etc.) all keep the samples independent from each other, but some losses consider the entire batch (F1-based metrics, for instance). If you're using a loss function that doesn't treat samples independently, then the batch size is very important.
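As a concrete example of a batch-coupled loss, here is a hedged sketch of a differentiable soft-F1 loss for binary targets; every sample contributes to the same shared counts, so changing the batch size changes the loss landscape:

import tensorflow as tf

def soft_f1_loss(y_true, y_pred):
    # Soft true-positive, false-positive and false-negative counts pooled
    # over the whole batch; samples are not treated independently.
    y_true = tf.cast(y_true, y_pred.dtype)
    tp = tf.reduce_sum(y_true * y_pred)
    fp = tf.reduce_sum((1.0 - y_true) * y_pred)
    fn = tf.reduce_sum(y_true * (1.0 - y_pred))
    f1 = 2.0 * tp / (2.0 * tp + fp + fn + 1e-7)
    return 1.0 - f1  # minimizing this maximizes the batch-level soft F1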
That said, having a bigger batch size may help the net to find its way more easily, since one image might push weights towards one direction, while another may want a different direction. The mean results of all images in the batch should then be more representative of a general weight update.
Now, entering an experimenting field (we never know everything about neural networks until we test them), consider this comparison:
a batch with 1 huge image versus
a batch of patches of the same image
Both will contain the same amount of data, and for a convolutional network it wouldn't make a drastic difference. But in the first case the net will probably be better at finding connections between roads, and may find more segments where the road is covered by something, while the small patches, being full of borders, might focus more on textures and be worse at identifying these gaps.
All of this is, of course, a guess. Testing is the best.
My net on a GPU cannot really use big patches, which is bad for me...
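For reference, a minimal numpy sketch of the tiling described in the question: splitting a 6400 x 6400 mask into 625 non-overlapping 256 x 256 patches and stitching them back:

import numpy as np

mask = np.zeros((6400, 6400), dtype=np.uint8)  # stand-in for the real mask

n = 6400 // 256  # 25 patches per side, 625 in total
patches = (mask.reshape(n, 256, n, 256)
               .transpose(0, 2, 1, 3)
               .reshape(n * n, 256, 256))

# Reassemble to verify the round trip is lossless.
restored = (patches.reshape(n, n, 256, 256)
                   .transpose(0, 2, 1, 3)
                   .reshape(6400, 6400))
assert np.array_equal(mask, restored)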

Can I train a model in steps in Keras?

I've got a model in Keras that I need to train, but it invariably blows through my little 8 GB of memory and freezes my computer.
I've gone to the limit of training a single sample at a time (batch size = 1), and it still blows up.
Please assume my model has no mistakes or bugs and this question is not about "what is wrong with my model". (Yes, smaller models work ok with the same data, but aren't good enough for the task).
How can I split my model in two and train each part separately, while still propagating the gradients between them?
Is there a possibility? (There is no limitation about using theano or tensorflow)
Using CPU only, no GPU.
You can do this, but it will cause your training time to grow to the point where the results will only be useful to future generations.
Let's consider everything we hold in memory when training with a batch size of 1 (assuming you've only read that one sample into memory):
1) the sample itself
2) the weights of your model
3) the activations of each layer # your model stores these for backpropagation
None of this can be dropped during training. However, you could, theoretically, do a forward pass on the first half of the model, dump the weights and activations to disk, load the second half of the model, do its forward pass, then its backward pass, dump those weights and activations to disk, load back the weights and activations of the first half, and complete the backward pass on that. This process could be split up even further, to the point of doing one layer at a time.
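A minimal sketch of that scheme with TensorFlow's GradientTape; part1, part2 and the data are hypothetical stand-ins. The key piece is output_gradients, which feeds the gradient from the second half back into the first:

import tensorflow as tf

part1 = tf.keras.Sequential([tf.keras.layers.Dense(64, activation="relu")])
part2 = tf.keras.Sequential([tf.keras.layers.Dense(1)])
loss_fn = tf.keras.losses.MeanSquaredError()
opt = tf.keras.optimizers.SGD(0.01)

x = tf.random.normal((1, 32))
y_true = tf.random.normal((1, 1))

# Forward through the first half; h could be dumped to disk here and
# part1 unloaded, at the cost of recomputing its forward pass below.
h = part1(x)

# Forward/backward through the second half, also taking the gradient of
# the loss with respect to h so it can flow back into the first half.
with tf.GradientTape() as tape2:
    tape2.watch(h)
    loss = loss_fn(y_true, part2(h))
grads2, grad_h = tape2.gradient(loss, [part2.trainable_variables, h])
opt.apply_gradients(zip(grads2, part2.trainable_variables))

# Recompute the first half's forward pass under a tape, then backpropagate
# grad_h through it via output_gradients (a vector-Jacobian product).
with tf.GradientTape() as tape1:
    h = part1(x)
grads1 = tape1.gradient(h, part1.trainable_variables,
                        output_gradients=grad_h)
opt.apply_gradients(zip(grads1, part1.trainable_variables))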
OTOH, this is akin to what swap space does, without you having to think about it. If you want a slightly less optimized version of this (and optimization is clearly moot at this point), you can just increase your swap space to 500GB and call it a day.
