I am really at my wit's end and don't know where else I can ask, so I am asking here. I know that my question may not be of the best quality, but I am hoping for at least some guidance on the direction I should look in to figure out my problem.
I am replicating scikit-learn's implementation of Elastic Net multiple linear regression in TensorFlow and TensorBoard as a learning exercise, so that I can eventually move on to implementing and visualizing more difficult machine learning algorithms.
I have some code that does a multiple linear regression using Elastic Net regularization as the loss function. With gradient descent, it converges to a suboptimal solution compared to scikit-learn's algorithm. Through some searching, I learned that scikit-learn initializes weights using the Xavier method, so I did that in TensorFlow as well. Performance improved slightly but still was nowhere close to sklearn. My next idea was to change the optimizer to try to match performance, although my research told me that scikit-learn uses coordinate descent, a method that isn't implemented in TensorFlow.
However, this is where I am stuck. Simply swapping one optimizer for another does not seem to work (not that I expected it to, but I'm also having trouble finding material that explains how to set it up properly). Currently I've simply performed the switch the following way; can anyone give me a hint as to why my gradients are 0?
Thanks!
# Declare optimizer (the previous GradientDescentOptimizer line is commented out;
# Adam is used with its default learning rate of 0.001)
# my_opt = tf.train.GradientDescentOptimizer(0.001)
my_opt = tf.train.AdamOptimizer(epsilon=0.1)
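For reference, here is a minimal sketch of how the optimizer is wired into the elastic net loss; the variable names (x_data, y_target, A, b) and the penalty weights alpha and l1_ratio are placeholders for illustration, not my exact code:

import tensorflow as tf

# Placeholder setup for an elastic net linear regression (3 features, 1 target)
x_data = tf.placeholder(shape=[None, 3], dtype=tf.float32)
y_target = tf.placeholder(shape=[None, 1], dtype=tf.float32)

# Xavier (Glorot) initialization of the weights, as described above
A = tf.get_variable("A", shape=[3, 1], initializer=tf.glorot_uniform_initializer())
b = tf.get_variable("b", shape=[1, 1], initializer=tf.zeros_initializer())
model_output = tf.matmul(x_data, A) + b

# Elastic net loss: MSE plus L1 and L2 penalties (alpha and l1_ratio are illustrative)
alpha, l1_ratio = 0.5, 0.5
l1_penalty = tf.reduce_sum(tf.abs(A))
l2_penalty = tf.reduce_sum(tf.square(A))
loss = tf.reduce_mean(tf.square(y_target - model_output)) \
       + alpha * (l1_ratio * l1_penalty + 0.5 * (1.0 - l1_ratio) * l2_penalty)

# Adam with an explicit learning rate; relying on the default (0.001) together with
# a large epsilon makes the effective step size very small
my_opt = tf.train.AdamOptimizer(learning_rate=0.05, epsilon=1e-8)
train_step = my_opt.minimize(loss)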
Histogram of gradients:
Loss function showing that Adam optimizer isn't doing anything:
EDIT:
I have updated my learning rate to be higher, but convergence still doesn't seem that great. I think I will proceed to try to implement coordinate descent in TensorFlow to match scikit-learn's method as closely as possible. I've attached an image of the difference for those curious:
In comparison to SGD:
I would like to implement a cost-sensitive loss function in PyTorch. My two-class training dataset is heavily imbalanced: 75% of the data are labeled '0' and only 25% are labeled '1'.
I am new to PyTorch but my supervisor is adamant that I use it (they have more experience with it).
I found some implementations in Keras, but I am not strong enough in coding to port them over to PyTorch.
I have read around to find some resources to create a cost-sensitive loss function.
This paper uses something which I think might work (https://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=9417097), but I do not understand how the code is implemented despite having access to it here (https://github.com/emadeldeen24/AttnSleep/blob/f993511426900f9fca20594a738bf8bee1116381/utils/util.py).
This website describes the math in great detail, but I do not understand it: https://medium.com/rv-data/how-to-do-cost-sensitive-learning-61848bf4f5e7
Here is an implementation in Keras which I am having trouble converting to PyTorch: https://towardsdatascience.com/fraud-detection-with-cost-sensitive-machine-learning-24b8760d35d9
I also found this implementation in PyTorch, but I have trouble understanding it: https://discuss.pytorch.org/t/dealing-with-imbalanced-datasets-in-pytorch/22596/21
Could you please help me to understand the last link's implementation of the cost-sensitive loss function?
Thank you.
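For concreteness, this is the kind of thing I am trying to write: a class-weighted cross-entropy, which I understand to be one common form of cost-sensitive loss. The weights below are just placeholders derived from my 75/25 split, not taken from any of the linked implementations:

import torch
import torch.nn as nn

# Class weights inversely proportional to class frequency (75% label 0, 25% label 1).
# The exact weighting scheme is an assumption; a full cost matrix would replace it.
class_weights = torch.tensor([1.0 / 0.75, 1.0 / 0.25])
criterion = nn.CrossEntropyLoss(weight=class_weights)

logits = torch.randn(8, 2)           # dummy model outputs for a batch of 8 samples
labels = torch.randint(0, 2, (8,))   # dummy ground-truth labels
loss = criterion(logits, labels)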
I am using an LSTM for time-series prediction in Keras. I am using 3 LSTM layers with dropout=0.3, hence my training loss is higher than my validation loss. To monitor convergence, I plot the training loss and validation loss together. The results look like the following.
After researching the topic, I have seen multiple answers (for example [1], [2]), but I have found several contradictory arguments in various places on the internet, which makes me a little confused. I am listing some of them below:
1) An article by Jason Brownlee suggests that the validation and training curves should meet for convergence, and if they don't, I might be under-fitting the data.
https://machinelearningmastery.com/diagnose-overfitting-underfitting-lstm-models/
https://machinelearningmastery.com/learning-curves-for-diagnosing-machine-learning-model-performance/
2) However, the following answer here suggests that my model has just converged:
How do we analyse a loss vs epochs graph?
Hence, I am just a bit confused about the whole concept in general. Any help will be appreciated.
Convergence implies you have something to converge to. For a learning system to converge, you would need to know the right model beforehand. Then you would train your model until it was the same as the right model. At that point you could say the model converged! ... but the whole point of machine learning is that we don't know the right model to begin with.
So when do you stop training? In practice, you stop when the model works well enough to do what you want it to do. This might be when validation error drops below a certain threshold. It might just be when you can't afford any more computing power. It's really up to you.
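As a concrete example of "stop when it works well enough", here is a rough Keras sketch using early stopping on the validation loss; the model and data variables are assumed to already exist, and the patience value is arbitrary:

from tensorflow.keras.callbacks import EarlyStopping

# Stop once validation loss has not improved for 10 epochs and keep the best weights
early_stop = EarlyStopping(monitor='val_loss', patience=10, restore_best_weights=True)
model.fit(X_train, y_train,
          validation_data=(X_val, y_val),
          epochs=200,
          callbacks=[early_stop])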
I am trying to build a recommendation system using non-negative matrix factorization. Using scikit-learn's NMF as the model, I fit my data, resulting in a certain loss (i.e., reconstruction error). Then I generate recommendations for new data using the inverse_transform method.
Now I do the same using another model I built in TensorFlow. The reconstruction error after training is close to that obtained using sklearn's approach earlier.
However, neither the latent factors nor the final recommendations are similar between the two.
One difference between the 2 approaches that I am aware of is:
In sklearn, I am using the Coordinate Descent solver whereas in TensorFlow, I am using the AdamOptimizer which is based on Gradient Descent.
Everything else seems to be the same:
Loss function used is the Frobenius Norm
No regularization in either case
Tested on the same data using same number of latent dimensions
Relevant code that I am using:
1. scikit-learn approach:
from sklearn.decomposition import NMF

# Coordinate descent solver, no regularization (alpha=0.0), 2 latent dimensions
model = NMF(alpha=0.0, init='random', l1_ratio=0.0, max_iter=200,
            n_components=2, random_state=0, shuffle=False, solver='cd',
            tol=0.0001, verbose=0)
model.fit(data)

# Reconstruct the data matrix from the latent factors
result = model.inverse_transform(model.transform(data))
2. TensorFlow approach:
import tensorflow as tf

# Non-negative factors W and H, kept >= 0 via a projection constraint
w = tf.get_variable("w", initializer=tf.abs(tf.random_normal((data.shape[0], 2))),
                    constraint=lambda p: tf.maximum(0., p))
h = tf.get_variable("h", initializer=tf.abs(tf.random_normal((2, data.shape[1]))),
                    constraint=lambda p: tf.maximum(0., p))

# Frobenius-norm reconstruction error; x is a placeholder holding the data matrix
loss = tf.sqrt(tf.reduce_sum(tf.squared_difference(x, tf.matmul(w, h))))
My question is: if the recommendations generated by these two approaches do not match, how can I determine which are the right ones?
Based on my use case, sklearn's NMF is giving me good results, but the TensorFlow implementation is not. How can I achieve the same with my custom implementation?
The choice of optimizer has a big impact on the quality of the training. Some very simple models (I'm thinking of GloVe, for example) work with some optimizers and not at all with others. To answer your questions:
How can I determine which are the right ones?
The evaluation is as important as the design of your model, and it is just as hard: you can try these two models on several available datasets and use some metrics to score them. You could also use A/B testing on a real application to estimate the relevance of your recommendations.
How can I achieve the same using my custom implementation ?
First, try to find a coordinate descent optimizer for TensorFlow and make sure every step you implement exactly matches the one in scikit-learn. Then, if you can't reproduce the same results, try different approaches (why not start with a plain gradient descent optimizer, as sketched below?) and take advantage of the great modularity that TensorFlow offers.
Finally, if the recommendations provided by your implementation are that bad, I suspect there is an error in it. Try comparing with some existing implementations.
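As a starting point for the "plain gradient descent first" suggestion, here is a rough sketch that reuses the w, h and loss tensors from the question (TF1-style; x is assumed to be a placeholder fed with the data matrix, and the learning rate and step count are arbitrary):

opt = tf.train.GradientDescentOptimizer(learning_rate=0.01)
train_step = opt.minimize(loss, var_list=[w, h])

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(1000):
        _, current_loss = sess.run([train_step, loss], feed_dict={x: data})
    # Analogous to sklearn's inverse_transform(transform(data))
    reconstruction = sess.run(tf.matmul(w, h))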
As the title says: after training and testing my neural network model in Python,
can I use the SQP function in scipy to optimize a neural network regression problem?
For example, I am using three features as input (temperature, humidity, and wind speed) to predict energy usage in some area.
So I use a neural network to model the relationship between these inputs and the output, and now I want to find the point of lowest energy usage and the input features at that point (i.e., what the temperature, humidity, and wind speed are). This is just an example, so it may sound unrealistic.
As far as I know, not many people use scipy for neural network optimization, but given my constraints scipy is the most suitable optimization tool I have available (p.s.: I can't use cvxopt).
Can someone give me some advice? I would really appreciate it!
Sure, that's possible, but your question is too broad to give a complete answer, as all the details are missing.
But: SLSQP is not the right tool!
There is a reason NN training is dominated by first-order methods like SGD and all its variants:
Fast calculation of gradients and easy to do in mini-batch mode (not paying for the full gradient; less memory)
Very different convergence theory for Stochastic-Gradient-Descent which is usually much better for large-scale problems
In general: fast iteration speed (e.g. time per epoch) while possibly needing more epochs (for full convergence)
NN training is unconstrained continuous optimization
SLSQP is a very general optimizer able to handle constraints, and you will pay for that (in performance and robustness)
L-BFGS is actually the only tool I have seen sometimes used to do that (and it is also available in scipy)
It's a bound-constrained optimizer (no general constraints, unlike SLSQP)
It approximates the inverse Hessian, so memory usage is greatly reduced compared to BFGS and also SLSQP
Both methods are full-batch methods (as opposed to the online/minibatch nature of SGD)
They also use line searches or something similar, which results in fewer hyper-parameters to tune: no learning rates!
I think you should stick to SGD and its variants.
If you want to go for the second-order approach: learn from sklearn's implementation, which can use L-BFGS.
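And if the actual goal is just to find the inputs that minimize the predicted energy usage of an already-trained network, scipy's bound-constrained L-BFGS-B can be run directly on the prediction function. A rough sketch (the predict_energy wrapper, the starting point and the bounds are all assumptions for illustration):

import numpy as np
from scipy.optimize import minimize

# Hypothetical wrapper around the trained network: maps
# [temperature, humidity, wind_speed] to the predicted energy usage
def predict_energy(x):
    return float(model.predict(x.reshape(1, -1)))   # `model` is the trained NN (assumed)

x0 = np.array([20.0, 0.5, 3.0])          # illustrative starting point
bounds = [(-10, 40), (0, 1), (0, 20)]    # plausible physical ranges (assumptions)

res = minimize(predict_energy, x0, method='L-BFGS-B', bounds=bounds)
print(res.x, res.fun)                    # inputs giving the (locally) lowest predicted usage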
I am using TensorFlow to train a model that has 1 output for 4 inputs. The problem is a regression problem.
I found that when I use a RandomForest to train the model, it quickly converges and also runs well on the test data. But when I use a simple neural network for the same problem, the loss (root mean square error) does not converge. It gets stuck at a particular value.
I tried increasing/decreasing the number of hidden layers and increasing/decreasing the learning rate. I also tried multiple optimizers and tried to train the model on both normalized and non-normalized data.
I am new to this field, but the literature that I have read so far strongly asserts that the neural network should work better than the random forest.
What could be the reason behind non-convergence of the model in this case?
If your model is not converging, it means that the optimizer is stuck in a local minimum of your loss function.
I don't know which optimizer you are using, but try increasing the momentum or even the learning rate slightly.
Another strategy often employed is learning rate decay, which reduces your learning rate by a factor every few epochs. This can also help you avoid getting stuck in a local minimum early in the training phase, while still reaching maximum accuracy towards the end of training.
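For example, a minimal TF1-style sketch of exponential learning rate decay (the initial rate, decay steps and decay rate are illustrative, and loss is assumed to be your model's loss tensor):

import tensorflow as tf

global_step = tf.Variable(0, trainable=False)
# Multiply the learning rate by 0.96 every 1000 training steps
learning_rate = tf.train.exponential_decay(0.1, global_step,
                                           decay_steps=1000, decay_rate=0.96,
                                           staircase=True)
train_step = tf.train.GradientDescentOptimizer(learning_rate).minimize(
    loss, global_step=global_step)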
Otherwise you could try selecting an adaptive optimizer (Adam, Adagrad, Adadelta, etc.) that takes care of the hyperparameter selection for you.
This is a very good post comparing different optimization techniques.
Deep neural networks need a significant amount of data to perform adequately. Make sure you have lots of training data, or your model will overfit.
A useful rule when beginning to train models is not to start with the more complex methods; begin with, for example, a linear model, which you will be able to understand and debug more easily.
In case you continue with the current methods, some ideas:
Check the initial weight values (initialize them from a normal distribution)
As a previous poster said, reduce the learning rate
Do some additional checks on the data: look for NaNs and outliers; the current models could be more sensitive to noise. Remember: garbage in, garbage out.
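A quick sketch of the kind of data check I mean (X is assumed to be your input feature matrix as a NumPy array; the 3-sigma rule is just one simple outlier heuristic):

import numpy as np

print("NaNs:", int(np.isnan(X).sum()))
print("Infs:", int(np.isinf(X).sum()))

# Flag values more than 3 standard deviations from the column mean as potential outliers
z_scores = np.abs((X - X.mean(axis=0)) / X.std(axis=0))
print("Potential outliers:", int((z_scores > 3).sum()))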