I was trying to implement a paper I read. Basically, it uses 3 neural network classifiers with different parameters to work on the same loan-default data with 9 different training-to-testing ratios.
In order to find the best parameters, we use the following criterion: when (1) max_iteration = 25000 and (2) the loss value is less than 0.008, we measure the accuracy and pick the best.
However, when I try to use Python's sklearn.neural_network.MLPClassifier to do this, I run into a problem: as the training-to-testing ratio increases, the number of iterations the program runs drops dramatically, while the loss value increases, as shown below.
Classifier Performance Table.
This is clearly not what I want; the iteration count should keep rising to 25000 before stopping.
This is how I defined the classifiers:
from sklearn.neural_network import MLPClassifier

clf1 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(18,), momentum=0.7,
                     learning_rate_init=0.0081, tol=0, random_state=3)
clf2 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(23,), momentum=0.69,
                     learning_rate_init=0.0095, tol=0, random_state=3)
clf3 = MLPClassifier(activation='relu', solver='sgd', early_stopping=False, alpha=1e-5,
                     max_iter=25000, hidden_layer_sizes=(27,), momentum=0.79,
                     learning_rate_init=0.0075, tol=0, random_state=3)
As you can see, I already set tol=0, so every iteration should keep going as long as the loss can still decrease. I have tried other values too, but the iteration count is still far smaller than I expected.
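For reference, here is a minimal sketch of how I check the iteration count and loss after fitting (X_train, X_test, y_train, y_test stand for one of the training-to-testing splits):

for name, clf in [('clf1', clf1), ('clf2', clf2), ('clf3', clf3)]:
    clf.fit(X_train, y_train)
    print(name,
          'iterations:', clf.n_iter_,                   # how many epochs SGD actually ran
          'final loss:', clf.loss_,                     # expected to end up below 0.008
          'test accuracy:', clf.score(X_test, y_test))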
Hope someone can help me, thanks!
I implemented a CNN with 3 convolutional layers, with max-pooling and dropout after each layer.
I noticed that when I trained the model the first time it gave me 88% test accuracy, but after retraining it a second time in succession, with the same training dataset, it gave me 92% test accuracy.
I cannot understand this behavior. Is it possible that the model overfit during the second training run?
Thank you in advance for any help!
It is quite possible if you have not set a random seed, e.g. set.seed() in the R language or tf.random.set_seed(any_number) in Python.
Well, I am no expert when it comes to machine learning, but I do know the math behind it. When you train a neural network, you basically find a local minimum of the loss function. This means that the end result will heavily depend on the initial guess for all of the internal variables.
Usually the variables are randomized as an initial estimate, so you can reach quite different results from running the training process multiple times.
That being said, from when I studied the subject I was told that you usually reach similar results regardless of the initial guess of the parameters. However, it is hard to say whether 0.88 and 0.92 would be considered similar or not.
Hope this gives a somewhat possible answer to your question.
As mentioned in another answer, you could remove the randomization, both in the initialization of the parameters and in the shuffling of the data used for each epoch of training, by introducing a seed. This ensures that when you run it twice, everything gets "randomized" in exactly the same order. In TensorFlow this is done using, for example, tf.random.set_seed(1); the number 1 can be changed to any number to get a new seed.
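For example, a minimal sketch of fixing the seeds before building and training the model (the value 1 is arbitrary):

import random
import numpy as np
import tensorflow as tf

SEED = 1
random.seed(SEED)         # Python's built-in RNG
np.random.seed(SEED)      # NumPy (often used for shuffling and weight init)
tf.random.set_seed(SEED)  # TensorFlow initializers and ops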
The question is about vanilla, non-batched reinforcement learning. Basically what is defined here in Sutton's book.
My model trains (woohoo!), though there is an element that confuses me.
Background:
In an environment where duration is rewarded (like pole-balancing), we have rewards of (say) 1 per step. After an episode, before sending this array of 1's to the train step, we do the standard discounting and normalization to get returns:
returns = self.discount_rewards(rewards)
returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)  # usual normalization
The discount_rewards function is the usual method, but here is a gist if curious.
So an array of rewards [1,1,1,1,1,1,1,1,1] becomes an array of returns [1.539, 1.160, 0.777, 0.392, 0.006, -0.382, -0.773, -1.164, -1.556].
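For reference, the usual discounting looks roughly like this (a sketch, not my exact gist; gamma = 0.99 is just an assumed value):

import numpy as np

def discount_rewards(rewards, gamma=0.99):
    # Standard discounted returns: walk backwards and accumulate.
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns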
Given that basic background I can ask my question:
If positive returns are enforced, and negative returns are discouraged (in the optimize step), then no matter the length of the episode, roughly the first half of the actions will be encouraged, and the latter half will be discouraged. Is that true, or am I misunderstanding something?
If it's NOT true, I would love to understand what I got wrong.
If it IS true, then I don't understand why the model trains, since even a good-performing episode will have the latter half of its actions discouraged.
To reiterate, this is non-batched learning (so the returns are not relative to returns in another episode in the training step). After each episode, the model trains, and again, it trains well :)
Hoping this makes sense, and is short enough to feel like a proper clear question.
Background
Yes, positive rewards are better than negative rewards
No, positive rewards are not good on an absolute scale
No, negative rewards are not bad on an absolute scale
If you increase or decrease all rewards (good and bad) equally, nothing changes really.
The optimizer tries to minimize the loss (maximize the reward), that means it's interested only in the delta between values (the gradient), not their absolute value or their sign.
Reinforcement Learning
Let's say your graph looks something like this:
...
logits = ...  # raw (pre-softmax) network outputs; softmax_cross_entropy applies the softmax itself
labels = tf.one_hot(q_actions, n_actions)
loss = tf.losses.softmax_cross_entropy(labels, logits, weights=q_rewards)
The loss for each individual sample gets scaled by weights, which in this case are q_rewards:
loss[i] = -q_rewards[i] * tf.log( tf.nn.softmax( logits[i] ) )
The loss is a linear function of the reward; the gradient stays monotonic under a linear transformation.
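A toy illustration of that linearity (the probabilities below are made up, not from any actual graph):

import numpy as np

probs = np.array([0.7, 0.2, 0.1])   # softmax output for one step
action = 0                          # the action that was actually taken
for reward in (1.0, 2.0, -1.0):
    loss = -reward * np.log(probs[action])
    print(reward, loss)             # the loss (and hence its gradient) scales linearly with the reward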
Reward Normalization
doesn't mess with the sign of the gradient
makes the gradient steeper for rewards far from the mean
makes the gradient shallower for rewards near the mean
When the agent performs rather badly, it receives many more bad rewards than good rewards. Normalization makes the gradient steeper for (puts more weight on) the good rewards and shallower for (puts less weight on) the bad rewards.
When the agent performs rather well, it's the other way around.
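A quick numeric illustration with made-up rewards for an agent that performs rather badly:

import numpy as np

rewards = np.array([-1.0, -1.0, -1.0, -1.0, 2.0, 3.0])   # mostly bad, a few good
normalized = (rewards - rewards.mean()) / (rewards.std() + 1e-10)
print(normalized)   # approx [-0.70 -0.70 -0.70 -0.70  1.09  1.69]
# The many bad rewards end up close to zero (shallow gradient),
# the few good rewards end up far from zero (steep gradient).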
Your questions
If positive returns are enforced, and negative returns are discouraged (in the optimize step) ...
It's not the sign or the absolute value that matters, but the delta (the relative values).
... then no matter the length of the episode, roughly the first half of the actions will be encouraged, and the latter half will be discouraged.
If there are either much more high or much more low reward values, then you have a smaller half with a steeper gradient (more weight) and a larger half with a shallower gradient (less weight).
If it IS true, then I don't understand why the model trains, since even a good-performing episode will have the latter half of its actions discouraged.
Your loss value is actually expected to stay about constant at some point. So you have to measure your progress by running the program and looking at the (un-normalized) rewards.
For reference, see the example network from Google IO:
github.com/GoogleCloudPlatform/tensorflow-without-a-phd/.../tensorflow-rl-pong/... and search for _rollout_reward
This isn't a bad thing, however. It's just that your loss is (more or less) "normalized" as well. But the network keeps improving anyway by looking at the gradient at each training step.
Classification problems usually have a "global" loss which keeps falling over time. Some optimizers keep a history of the gradient to adapt the learning rate (effectively scaling the gradient) which means that internally, they also kinda "normalize" the gradient and thus don't care if we do either.
If you want to learn more about behind-the-scenes gradient scaling, I suggest taking a look at ruder.io/optimizing-gradient-descent
To reiterate, this is non-batched learning (so the returns are not relative to returns in another episode in the training step). After each episode, the model trains, and again, it trains well :)
The larger your batch size, the more stable your distribution of rewards, the more reliable the normalization. You could even normalize rewards across multiple episodes.
In my opinion, the accepted answer is wrong.
I read it, and I thought it was plausible, and then I stopped worrying about gradient normalization and checked something else. Only much later did I notice that it was precisely the gradient normalization breaking my training process.
First off, "Reward Normalization doesn't mess with the sign of the gradient" is just plain wrong.
returns = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)
Obviously, if you subtract the mean, that'll flip some signs. So yes, reward normalization does affect the sign of the gradient.
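A quick demonstration with made-up, all-positive returns:

import numpy as np

returns = np.array([1.0, 2.0, 3.0, 4.0])    # all positive before normalization
normalized = (returns - np.mean(returns)) / (np.std(returns) + 1e-10)
print(normalized)   # approx [-1.34 -0.45  0.45  1.34]: half the signs have flipped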
Second, tf.losses.softmax_cross_entropy is, in everyday words, a measurement of how many plausible options the AI had when choosing what it did. Select 1 out of 10 actions randomly? Your cross-entropy is very high. Always select the exact same item? Your cross-entropy is low, because the other choices are irrelevant if you statistically never take them.
In line with that, what
loss[i] = -q_rewards[i] * tf.log( tf.nn.softmax( logits[i] ) )
actually does is this:
If your reward is positive, it will minimize the cross-entropy, meaning it will increase the chance that the AI will take the exact same action again when it sees a similar situation in the future.
If your reward is negative, it will maximize the cross-entropy, meaning it will make the AI choose more randomly when it sees a similar situation in the future.
And that's the purpose of reward normalization: yes, after normalization, half of the items in your trajectory have a positive sign and the other half have a negative sign. What you are basically saying is: do more of the things that worked, and try something random for the things that didn't.
And that leads to very actionable advice:
If your model is behaving too randomly, make sure you have enough positive rewards (after normalization).
If your model is always doing the same and not exploring, make sure you have enough negative rewards (after normalization).
I want to build my first neural net for recognizing hand-written digits in Python, but I can't find a good, simple termination rule.
What I mean by "termination rule" is when to stop updating my weights and biases, or how to know I've reached a local minimum.
Let me be clear. I'm not looking for the best-performing, most advanced, most sophisticated rule. On the contrary, I want the simplest, easiest-to-implement, good-enough-to-get-started rule that will just get the job reasonably done.
If there is any more information required for you to answer, please do tell and I'll add it here.
Though the question is somewhat too broad, I'll try to provide you with general guidance.
Neural network training is the process of optimizing a high-dimensional, (almost always) non-convex loss function. As a result, it's very rare to have a formal proof about its global or local minima or convergence speed. There are merely observations that, for instance, all local minima are approximately the same in terms of test accuracy (loss), which makes the learning process easier since finding the global minimum is no longer mandatory.
The "termination rule" you're asking about is in the same category: it's a general rule that seems to be working in most cases. When you're doing cross-validation, you should stop training when the validation accuracy (loss) stops improving and goes flat or gets worse for some period of time. The result model can be reasonably selected as the best over the whole training process. One can also apply early-stopping (see this and this question), to save training time and still avoid overfitting. Essentially, in practice, the researches let the network train as long as the time limit allows and increase the number of epochs only if the accuracy (loss) still does not look flat, which is rare.
For instance, on the chart below, 10 epochs is too early to stop, because there's a lot of potential for improvement. It's still unclear after 15 epochs. It's OK to stop after 20 epochs if there's a lack of time, but I'd let it run until epoch 25 to be sure. At this point, the training score is almost 1.0 and the validation score is flat, i.e., there's no sign it could improve further.
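As a concrete sketch of that rule (early stopping with patience): train_one_epoch() and validation_accuracy() below are hypothetical placeholders for your own training and evaluation code, and the patience of 5 epochs is just a starting guess.

best_acc = 0.0
patience, bad_epochs = 5, 0
for epoch in range(1000):                  # generous upper bound on epochs
    train_one_epoch()                      # hypothetical: one pass over the training data
    acc = validation_accuracy()            # hypothetical: accuracy on a held-out validation set
    if acc > best_acc:
        best_acc, bad_epochs = acc, 0      # improvement: remember it and reset the counter
    else:
        bad_epochs += 1                    # no improvement this epoch
    if bad_epochs >= patience:
        break                              # validation accuracy has gone flat: stop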
I implemented a simple neural network for classification (one class) of images in Python. The layers are simple: (image_matrix, 5, 1), using relu and sigmoid activations.
I am iterating 5000 times. At first it looks like the cost goes down gradually in a sensible way.
However, no matter how many training examples I use, or what my learning_rate is, the cost starts behaving erratically after around 3000 iterations every time...
(cost plot image)
Can someone help me understand what's going on?
Thanks
When training models, you should remember that there are multiple local minima for the cost. Your graph shows that your cost is moving around a local minimum while searching for the global minimum, which is the goal when looking for the best performance of a model.
1st - you should probably check accuracy, F1-score, or loss per iteration/epoch to see whether the performance is actually improving (see the sketch after this list).
2nd - do cross-validation and check the same metrics above on the validation set.
3rd - implement an early-stopping function that checks whether your model is still improving or not.
*note: find the best alpha, which will help you approach the global minimum better.
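A minimal sketch of the 1st and 3rd points; train_one_iteration(), predict(), X_val and y_val are hypothetical stand-ins for your own training step, forward pass, and a held-out validation split:

from sklearn.metrics import accuracy_score, f1_score

history = []
for it in range(5000):
    train_one_iteration()                            # hypothetical: your existing update step
    preds = (predict(X_val) > 0.5).astype(int)       # threshold the sigmoid output
    history.append((accuracy_score(y_val, preds),    # validation accuracy per iteration
                    f1_score(y_val, preds)))         # validation F1 per iteration
    if len(history) > 10 and history[-1][0] <= history[-11][0]:
        break                                        # crude early stop: no accuracy gain over the last 10 iterations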
I am attempting to curve fit the training error of a neural network as a function of the number of training iterations. An example is shown in red in the image below. Here I've trained for 3000 iterations. What I'm interested in is whether I can find a function that I can fit on the first 1000 (or so) iterations to extrapolate out to 3000 iterations with some reasonable accuracy.
However, I don't know what functional form would be best for me to use. At first I tried an exponential of the form f(x)=A+Bexp(-Cx), which is shown in blue. Obviously this doesn't work too well. The exponential dies off way too fast and then basically just becomes the constant term.
Perhaps it's just difficult, since the beginning of the training shows a very sharp drop off of the error but then transitions to something much more gradual for higher iterations. But maybe someone with experience in neural network training and/or experience in fitting unknown functions might have some ideas. I've been trying various exponential forms and polynomial fits within scipy/numpy but with no success. I've varied the number of iterations used in the fit as well (including throwing out the small iteration numbers).
Any thoughts?
I think exponential fitting may work. In your f(x) = A + B*exp(-C*x), if I choose A = 0.005, B = 0.045, and C = 1/250, I get a reasonable fit.
It's just a matter of parameter tuning. Still, I am trying to understand your motivation for fitting the learning curve. Interpolation methods often include an 'extrapolation' option that you can use to predict the error after more epochs. If you want to learn the curve precisely, you could use another neural network with a linear hidden layer and output to 'learn' the curve, though I haven't tried whether that works.
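For what it's worth, a minimal sketch of fitting this form with scipy.optimize.curve_fit on the first 1000 iterations and extrapolating to 3000; the file name and the starting values are assumptions:

import numpy as np
from scipy.optimize import curve_fit

def f(x, A, B, C):
    return A + B * np.exp(-C * x)

errors = np.loadtxt('training_error.txt')            # assumed: your recorded training-error curve
x = np.arange(len(errors))
popt, _ = curve_fit(f, x[:1000], errors[:1000],
                    p0=(0.005, 0.045, 1.0 / 250.0))  # start from the hand-tuned values above
extrapolated = f(np.arange(3000), *popt)             # predicted error out to iteration 3000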
Check out this page: http://www.astroml.org/sklearn_tutorial/practical.html
The useful thing it describes for your situation is how to diagnose whether your algorithm suffers from high bias or high variance on your data set, and it offers specific directions to take in either case.