sklearn setting learning rate of SGDClassifier vs LogisticRegression - python

Since sklearn's LogisticRegression (LR for short) has no direct way to solve weighted LR, I switched to SGDClassifier (SGD).
In my experiment I generate data from a logistic model with intercept=0 and beta=0.2 (see the code below), then run LR and SGD to estimate these parameters.
To compare the two, I set an equivalent penalty for both (my original idea was to set the penalty to 0, but since that is not allowed, I use a large C for LR and a small alpha for SGD).
LR estimates the parameters well, but it is hard to tune SGD. The main problem is the choice of eta0 and of the learning_rate schedule: 'constant' (too slow), 'optimal', or 'invscaling'.
My idea is to watch the loss function: if it is still decreasing, increase n_iter; if it decreases too slowly, increase eta0.
But:
1. How can I get the value of the loss function for each epoch? I can see it by setting verbose to 1, but I don't know how to retrieve the value programmatically (maybe with partial_fit?).
2. Is there a more intelligent (automatic) way to do this? Otherwise I have to relaunch the training process many times, and it gets even more complicated if I use cross-validation.
Thank you for any advice. If anything is unclear, please let me know.
P.S. The Python code requires indented blocks; I'm new to Stack Overflow, so if the indentation doesn't come through, please re-add it after each def before running the code.
import random
import numpy as np
from sklearn.linear_model import LogisticRegression, SGDClassifier

def simule_logistic(n):
    beta = 0.2
    x = []
    seuil = []
    for i in range(n):
        x.append(random.normalvariate(1, 2))
        seuil.append(random.uniform(0, 1))
    x = np.array(x)
    seuil = np.array(seuil)
    p = 1.0 / (1 + np.exp(-x * beta))
    y = []
    for i in range(n):
        if p[i] < seuil[i]:
            y.append(0)
        else:
            y.append(1)
    y = np.array(y)
    return x, p, y

if __name__ == '__main__':
    n = 100000
    x, p, y = simule_logistic(n)
    x = x.reshape((n, 1))
    print(x.shape)
    print(y.shape)
    l = LogisticRegression(C=1000000, penalty='l1')
    l.fit(x, y)
    sgd = SGDClassifier(n_iter=100, n_jobs=1, loss='log', alpha=1.0 / 1000000,
                        l1_ratio=1, learning_rate='optimal', eta0=0.01)
    print(sgd)
    sgd.fit(x, y)
    # estimated coefficients and intercepts for both methods
    print('l', l.coef_)
    print(l.intercept_)
    print('sgd', sgd.coef_)
    print(sgd.intercept_)
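A minimal sketch of the per-epoch loss tracking idea described above, assuming the x and y produced by simule_logistic and the older loss='log' API used in the question; the log loss is computed manually, since SGDClassifier does not return it:

from sklearn.linear_model import SGDClassifier
from sklearn.metrics import log_loss
import numpy as np

sgd = SGDClassifier(loss='log', alpha=1.0 / 1000000, l1_ratio=1,
                    learning_rate='optimal', eta0=0.01)
losses = []
for epoch in range(100):
    # each partial_fit call makes one pass over the data
    sgd.partial_fit(x, y, classes=np.unique(y))
    # probability estimates are available because loss='log'
    losses.append(log_loss(y, sgd.predict_proba(x)))
# inspect losses to decide whether to raise n_iter or adjust eta0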

Related

How to make neural network training less dependent on initial conditions?

In this simple toy example, the network learns the XOR operation:
import tensorflow as tf
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

model = tf.keras.Sequential(layers=[
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(1)
])
model.compile(
    loss=tf.keras.losses.binary_crossentropy,
    optimizer=tf.keras.optimizers.SGD(learning_rate=0.01)
)

x_train = np.random.uniform(-1, 1, (10000, 2))
tmp = x_train > 0
y_train = (tmp[:, 0] ^ tmp[:, 1])

model.fit(x=x_train, y=y_train, epochs=10)

x_test = np.random.uniform(-1, 1, (1000, 2))
tmp = x_test > 0
y_test = (tmp[:, 0] ^ tmp[:, 1])
prediction = model.predict(x_test) > 0.5

print(f'Accuracy: {accuracy_score(y_pred=prediction, y_true=y_test)}')
print(f'recall: {recall_score(y_pred=prediction, y_true=y_test)}')
print(f'precision: {precision_score(y_pred=prediction, y_true=y_test)}')
This example can also be found in the tensorflow playground
When the initial loss is <3, this will quickly converge (in 2-3 epochs). But sometimes the initial conditions lead it to have ~7 loss, in which case it never converges (not even after 1000 epochs).
It's easy to tell right after the first epoch whether it is going to work, but this makes searching for hyperparameters very difficult, since you never know whether a run converged by chance thanks to the initial conditions or because of the hyperparameters.
Is there a way to make this network less dependent on the initial conditions? A different optimizer? Some optimizer hyperparameter? Weight regularization?
I've tried changing these, but didn't get consistent improvements.
In the playground example, it never gets stuck at this kind of high loss.
Edit: If you make the training long enough, it might jump to loss 7 even after settling on a good solution with loss < 0.03.
Theoretically, there's no way to be 100% sure whether it's the hyperparameters or the initial configuration. You'll need to implement something to handle the case where training diverges.
Practically though, you could:
Train multiple times, and incorporate how often the network converges into your strategy for selecting the best hyperparameters.
Find some ranges for which you feel the model behaves consistently.
Incorporate the initialization of your weights into your hyperparameter optimization. Right now they are randomly initialized and are the cause of the problem. There are a number of ways to do this; try playing around with different initializers, but there is no one best initializer for every ML problem.
Just fix the initial conditions. Fix the random seed that TensorFlow uses for initialization using tf.random.set_seed, but that will of course strongly affect your results, so I don't think that's really what you want. You could claim that you are now sure a network performs well because of the architecture, but that is only true for that specific random seed, not for all of them.
According to this blog, adding batch normalization should make the network less sensitive to the initialization approach; a rough sketch combining both ideas follows below.
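As a rough sketch only (standard Keras/TensorFlow calls; the seed value and the placement of the BatchNormalization layer are assumptions, not taken from the answer), fixing the seed and adding batch normalization to the question's model might look like:

import tensorflow as tf

tf.random.set_seed(42)  # assumed seed value; pins the weight initialization

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(2,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.BatchNormalization(),  # the blog's suggestion for reducing sensitivity to initialization
    tf.keras.layers.Dense(1)               # output layer kept as in the question
])
model.compile(loss=tf.keras.losses.binary_crossentropy,
              optimizer=tf.keras.optimizers.SGD(learning_rate=0.01))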

Why does a GPflow model not seem to learn anything with TensorFlow optimizers such as tf.optimizers.Adam?

My inducing points are set to trainable but do not change when I call opt.minimize(). Why is that, and what does it mean? Does it mean the model is not learning?
What is the difference between tf.optimizers.Adam(lr) and gpflow.optimizers.Scipy?
The following is the simple classification example adapted from the documentation. When I run this example with gpflow's Scipy optimizer, I get the trained results and the values of the inducing variables keep changing. But when I use the Adam optimizer, I only get a straight-line prediction, and the values of the inducing points stay the same. It looks like the model is not learning with the Adam optimizer.
plot of data before training
plot of data after training with Adam
plot of data after training with gpflow optimizer Scipy
The link for the example is https://gpflow.readthedocs.io/en/develop/notebooks/advanced/multiclass_classification.html
import numpy as np
import tensorflow as tf
import warnings
warnings.filterwarnings('ignore') # ignore DeprecationWarnings from tensorflow
import matplotlib.pyplot as plt
import gpflow
from gpflow.utilities import print_summary, set_trainable
from gpflow.ci_utils import ci_niter
from tensorflow2_work.multiclass_classification import plot_posterior_predictions, colors
np.random.seed(0) # reproducibility
# Number of functions and number of data points
C = 3
N = 100
# RBF kernel lengthscale
lengthscale = 0.1
# Jitter
jitter_eye = np.eye(N) * 1e-6
# Input
X = np.random.rand(N, 1)
kernel_se = gpflow.kernels.SquaredExponential(lengthscale=lengthscale)
K = kernel_se(X) + jitter_eye
# Latents prior sample
f = np.random.multivariate_normal(mean=np.zeros(N), cov=K, size=(C)).T
# Hard max observation
Y = np.argmax(f, 1).reshape(-1,).astype(int)
print(Y.shape)
# One-hot encoding
Y_hot = np.zeros((N, C), dtype=bool)
Y_hot[np.arange(N), Y] = 1
data = (X, Y)
plt.figure(figsize=(12, 6))
order = np.argsort(X.reshape(-1,))
print(order.shape)
for c in range(C):
    plt.plot(X[order], f[order, c], '.', color=colors[c], label=str(c))
    plt.plot(X[order], Y_hot[order, c], '-', color=colors[c])
plt.legend()
plt.xlabel('$X$')
plt.ylabel('Latent (dots) and one-hot labels (lines)')
plt.title('Sample from the joint $p(Y, \mathbf{f})$')
plt.grid()
plt.show()
# sum kernel: Matern32 + White
kernel = gpflow.kernels.Matern32() + gpflow.kernels.White(variance=0.01)
# Robustmax Multiclass Likelihood
invlink = gpflow.likelihoods.RobustMax(C) # Robustmax inverse link function
likelihood = gpflow.likelihoods.MultiClass(C, invlink=invlink) # Multiclass likelihood
Z = X[::5].copy() # inducing inputs
#print(Z)
m = gpflow.models.SVGP(kernel=kernel, likelihood=likelihood,
                       inducing_variable=Z, num_latent_gps=C, whiten=True, q_diag=True)
# Only train the variational parameters
set_trainable(m.kernel.kernels[1].variance, True)
set_trainable(m.inducing_variable, True)
print(m.inducing_variable.Z)
print_summary(m)
training_loss = m.training_loss_closure(data)
opt.minimize(training_loss, m.trainable_variables)
print(m.inducing_variable.Z)
print_summary(m.inducing_variable.Z)
print(m.inducing_variable.Z)
# %%
plot_posterior_predictions(m, X, Y)
The example given in the question isn't copy-and-pasteable, but it seems like you simply exchanged opt = gpflow.optimizers.Scipy() for opt = tf.optimizers.Adam(). The minimize() method of gpflow's Scipy optimizer runs one call of scipy.optimize.minimize, which by default runs to convergence (you can also specify a maximum number of iterations by passing, e.g., options=dict(maxiter=100) to the minimize() call), for example:
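A sketch, assuming the m and training_loss defined in the question's code:

opt = gpflow.optimizers.Scipy()
# a single call to scipy.optimize.minimize, here capped at 100 iterations via the options dict
opt.minimize(training_loss, m.trainable_variables, options=dict(maxiter=100))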
In contrast, the minimize() method of TensorFlow optimizers runs only a single optimization step. To run more steps, say iter = 100, you need to manually write a loop:
for _ in range(iter):
    opt.minimize(model.training_loss, model.trainable_variables)
For this to actually run fast, you also need to wrap the optimization step in tf.function:
@tf.function
def optimization_step():
    opt.minimize(model.training_loss, model.trainable_variables)

for _ in range(iter):
    optimization_step()
This runs exactly iter steps; in TensorFlow you have to handle convergence checks yourself, and your model may or may not have converged after this many steps.
So in your usage, you only ran one step - this did change the parameters, but presumably too little to notice the difference. (You could see a larger effect in one step by making the learning rate much higher, though that would not be a good idea for actually optimizing the model with many steps.)
Usage of the Adam optimizer with GPflow models is demonstrated in the notebook on stochastic variational inference, though it also works for non-stochastic optimization.
Note that, in any case, all parameters such as inducing point locations are set trainable by default, so your call to set_trainable(..., True) doesn't affect what's going on here.

Memory error when using logistic regression

I have an array with shape 57159x924 which I will use as training data. 896 of these 924 columns are features and the remaining are labels. I want to use logistic regression on this, but when I use the fit function from logistic regression I get a memory error. I guess it is because there is too much data for my computer's memory to handle. Is there any way to get around this problem?
The code I want to use is
lr = LogisticRegression(random_state=1)
lr.fit(train_set, train_label)
lr.predict_proba(x_test)
And the following is the error
line 21, in main
lr.fit(train_set, train_label)
....
return array(a, dtype, copy=False, order=order)
MemoryError
You haven't given enough details to really understand the problem or give a definite answer, but here are a few options I hope will help:
The amount of memory available might be configurable.
Training over all the data at the same time would raise OOM problems in many contexts, which is why the common practice is to use SGD (stochastic gradient descent): training over batches, i.e. introducing only subsets of the data at every iteration and obtaining a global optimization solution in a stochastic sense. If I'm guessing correctly, you're using sklearn.linear_model.LogisticRegression, which has different "solvers". Maybe the saga solver will handle your situation better (a minimal sketch follows after these options).
There are other implementations out there, and some of them definitely have batching options built in, in a highly configurable way. And if worst comes to worst, implementing a logistic-regression model yourself is fairly simple, and then you can batch easy as pie.
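A minimal sketch of the saga suggestion (the max_iter value is illustrative, and the train_set and train_label names are reused from the question):

from sklearn.linear_model import LogisticRegression

# 'saga' is a stochastic solver that often copes better with large datasets
lr = LogisticRegression(solver='saga', max_iter=1000, random_state=1)
lr.fit(train_set, train_label)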
Edit (due to discussion in comments):
Here's a practical way to go about it, with a very very simple (and easy) example -
from sklearn.linear_model import SGDClassifier
import numpy as np
import random
X1 = np.random.multivariate_normal(mean=[10, 5], cov = np.diag([3, 8]), size=1000) # diagonal covariance for simplicity
Y1 = np.zeros((1000, 1))
X2 = np.random.multivariate_normal(mean=[-4, 55], cov = np.diag([5, 1]), size=1000) # diagonal covariance for simplicity
Y2 = np.ones((1000, 1))
X = np.vstack([X1, X2])
Y = np.vstack([Y1, Y2]).reshape([2000,])
sgd = SGDClassifier(loss='log', warm_start=True) # as mentioned in answer. note that shuffle is defaulted to True.
sgd.partial_fit(X, Y, classes = [0, 1]) # first time you need to say what your classes are
for k in range(1000):
    batch_indexs = random.sample(range(2000), 20)
    sgd.partial_fit(X[batch_indexs, :], Y[batch_indexs])
In practice you should be looking at the loss and accuracy and using a suitable while instead of for, but that much is left for the reader ;-)
Note that you can control more than I've shown (like the number of iterations etc.), so you should read the documentation of SGDClassifier properly.
Another thing to note is that there are different practices of batching. I just took a random subset every iteration, but some prefer to make sure every point in the data is seen an equal number of times (e.g. shuffle the data and then take batches in ordered index ranges; a sketch of that variant follows below).
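A sketch of that epoch-style batching, illustrative only; it reuses the X, Y, np, and sgd objects from the example above:

n_samples, batch_size = X.shape[0], 20
for epoch in range(10):
    # reshuffle once per epoch, then walk through the data in order,
    # so every point is seen the same number of times
    order = np.random.permutation(n_samples)
    for start in range(0, n_samples, batch_size):
        idx = order[start:start + batch_size]
        sgd.partial_fit(X[idx, :], Y[idx])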

Trying to understand this simple TensorFlow code

I'm interested in Deep Learning and recently found out about TensorFlow. I installed it and followed the tutorial at https://www.tensorflow.org/get_started/get_started .
This is the code I came up with by following that tutorial:
import tensorflow as tf
W = tf.Variable(0.3, tf.float32)
b = tf.Variable(-0.3, tf.float32)
x = tf.placeholder(tf.float32)
y = tf.placeholder(tf.float32)
linear_model = W * x + b
squared_deltas = tf.square(linear_model - y)
loss = tf.reduce_sum(squared_deltas)
sess = tf.Session()
init = tf.global_variables_initializer()
sess.run(init)
optimizer = tf.train.GradientDescentOptimizer(0.01)
train = optimizer.minimize(loss)
sess.run(init)
for i in range(1000):
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
print(sess.run([W, b]))
For the time being, I'm only interested in the code before it does the training, as to not get overwhelmed.
Now, I understand (or at least I think I do) parts of this code. It produces the result as expected following the tutorial, but most lines in this code are confusing to me. It might be because I'm not familiar with the mathematics involved, but I don't know how much math is actually involved here so it's hard to tell if that's the problem.
Anyway, I understand the first 6 lines.
Then there's this line:
squared_deltas = tf.square(linear_model - y)
As I understand it, it simply returns the square of (linear_model - y)
However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
I sort of understand tf.Session() and tf.global_variables_initializer() so I'm not too concerned with those two functions right now.
Bonus question: changing the value of the argument of tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1 and 0.001 don't?
I appreciate any help I can get!
Thanks
As I understand it, it simply returns the square of (linear_model - y) However, y has no value yet.
Then, loss is assigned the value of tf.reduce_sum(squared_deltas). I understand that loss needs to be as low as possible.
How do I even interpret these last two lines?
You clearly need to go through the TensorFlow documentation. You are missing the core idea behind TF: it defines a computational graph. At this point no computation is being performed, and you are right that there is no value for y yet; it is just a symbolic variable (a placeholder). We are saying that our loss will be the sum of squared differences between the predictions and the true values y, but we are not providing the values yet. Actual values only start "living" inside a session; before that, this is just a graph of computations, instructions so TF knows "what to anticipate".
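A tiny illustration using the names already defined in the question's code: nothing is evaluated until concrete values are fed in through a session run.

# x and y only receive concrete values here; the graph is then evaluated with them
print(sess.run(squared_deltas, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))
print(sess.run(loss, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]}))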
Bonus question: changing the value in the argument of tf.train.GradientDescentOptimizer() in either direction (increase or decrease) gives me the wrong result. How come 0.01 works when 0.1, 0.001 doesn't?
Linear regression (which is what you are working with) converges if and only if the learning rate is small enough and you run enough iterations. 0.1 is probably just too big; 0.01 is fine and so is 0.001, you simply need more than 1000 iterations for 0.001, but it will work (and so will any smaller value, just more slowly).
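As an illustrative sketch only (same graph as in the question; the iteration count is a rough guess, not a computed value):

optimizer = tf.train.GradientDescentOptimizer(0.001)  # smaller learning rate
train = optimizer.minimize(loss)
sess.run(init)
for i in range(10000):  # more iterations to compensate for the smaller steps
    sess.run(train, {x: [1, 2, 3, 4], y: [0, -1, -2, -3]})
print(sess.run([W, b]))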

What is the difference between partial fit and warm start?

Context:
I am using the Passive Aggressive classifier from the scikit-learn library and am confused about whether to use warm_start or partial_fit.
Efforts hitherto:
I referred to this thread discussion:
https://github.com/scikit-learn/scikit-learn/issues/1585
Gone through the scikit code for _fit and _partial_fit.
My observations:
_fit in turn calls _partial_fit.
When warm_start is set, _fit calls _partial_fit with self.coef_
When _partial_fit is called without coef_init parameter and self.coef_ is set, it continues to use self.coef_
Question:
I feel both ultimately provide the same functionality. So what is the basic difference between them? In which contexts is each of them used?
Am I missing something evident? Any help is appreciated!
I don't know about the Passive Aggressive classifier, but at least when using the SGDRegressor, partial_fit will only fit for 1 epoch, whereas fit will fit for multiple epochs (until the loss converges or max_iter is reached). Therefore, when fitting new data to your model, partial_fit will only correct the model one step towards the new data, but with fit and warm_start it will act as if you combined your old data and your new data together and fit the model once until convergence.
Example:
from sklearn.linear_model import SGDRegressor
import numpy as np
np.random.seed(0)
X = np.linspace(-1, 1, num=50).reshape(-1, 1)
Y = (X * 1.5 + 2).reshape(50,)
modelFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
                        shuffle=True, max_iter=2000, tol=1e-3, warm_start=True)
modelPartialFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
                               shuffle=True, max_iter=2000, tol=1e-3, warm_start=False)
# first fit some data
modelFit.fit(X, Y)
modelPartialFit.fit(X, Y)
# for both: Convergence after 50 epochs, Norm: 1.46, NNZs: 1, Bias: 2.000027, T: 2500, Avg. loss: 0.000237
print(modelFit.coef_, modelPartialFit.coef_) # for both: [1.46303288]
# now fit new data (zeros)
newX = X
newY = 0 * Y
# fits only for 1 epoch, Norm: 1.23, NNZs: 1, Bias: 1.208630, T: 50, Avg. loss: 1.595492:
modelPartialFit.partial_fit(newX, newY)
# Convergence after 49 epochs, Norm: 0.04, NNZs: 1, Bias: 0.000077, T: 2450, Avg. loss: 0.000313:
modelFit.fit(newX, newY)
print(modelFit.coef_, modelPartialFit.coef_) # [0.04245779] vs. [1.22919864]
newX = np.reshape([2], (-1, 1))
print(modelFit.predict(newX), modelPartialFit.predict(newX)) # [0.08499296] vs. [3.66702685]
If warm_start = False, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) re-initialises the model's trainable parameters before fitting. If warm_start = True, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) retains the values of the model's trainable parameters from the previous run and uses those as the starting point.
Regardless of the value of warm_start, each call to partial_fit() will retain the previous run's model parameters and use those initially.
Example using MLPRegressor:
import sklearn.neural_network
import numpy as np
np.random.seed(0)
x = np.linspace(-1, 1, num=50).reshape(-1, 1)
y = (x * 1.5 + 2).reshape(50,)
cold_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=False, max_iter=1)
warm_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=True, max_iter=1)
cold_model.fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[0.17009494]])] [array([0.74643783])]
cold_model.fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60819342]])] [array([-1.21256186])]
#after second run of .fit(), values are completely different
#because they were re-initialised before doing the second run for the cold model

warm_model.fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39815616]])] [array([1.651504])]
warm_model.fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39715616]])] [array([1.652504])]
#this time with the warm model, params change relatively little, as params were
#not re-initialised during second call to .fit()

cold_model.partial_fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60719343]])] [array([-1.21156187])]
cold_model.partial_fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60619347]])] [array([-1.21056189])]
#with partial_fit(), params barely change even for cold model,
#as no re-initialisation occurs

warm_model.partial_fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39615617]])] [array([1.65350392])]
warm_model.partial_fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39515619]])] [array([1.65450372])]
#and of course the same goes for the warm model
First, let us look at the difference between .fit() and .partial_fit().
.fit() lets you train from scratch. Hence you can think of it as an option that is used only once for a model: if you call .fit() again with a new set of data, the model will be built on the new data and will carry no influence from the previous dataset.
.partial_fit() lets you update the model with incremental data. Hence this option can be used more than once for a model. This is useful when the whole dataset cannot be loaded into memory; refer here.
If both .fit() and .partial_fit() are going to be used only once, then it makes no difference.
warm_start can only be used with .fit(); it lets you start the learning from the coefficients of the previous fit(). Now it might sound similar in purpose to partial_fit(), but the recommended way is partial_fit(). You might do partial_fit() with the same incremental data a few times to improve the learning; a rough sketch of that pattern follows below.
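An illustrative sketch of that incremental pattern (the PassiveAggressiveClassifier choice, the stream_of_chunks generator, the label set, and the three repeats are all hypothetical, not taken from the thread):

from sklearn.linear_model import PassiveAggressiveClassifier
import numpy as np

model = PassiveAggressiveClassifier()
classes = np.array([0, 1])                  # assumed label set
for X_chunk, y_chunk in stream_of_chunks:   # hypothetical source of incremental data
    for _ in range(3):                      # a few passes over the same chunk, as suggested above
        model.partial_fit(X_chunk, y_chunk, classes=classes)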
About the difference: warm_start is just an attribute of the class, while partial_fit is a method of that class. They are basically different things.
About the same functionality: yes, partial_fit will use self.coef_ because it still needs some values to update during training, and for an empty coef_init we just put zero values into self.coef_ and go to the next training step.
Description:
For the first start:
Whichever way you start (with or without warm start), we train from zero coefficients, and in the result we save the average of our coefficients.
For the (N+1)-th start:
With warm start: we fetch our previous coefficients via the method _allocate_parameter_mem and use them for training. In the result we save our average coefficients.
Without warm start: we put in zero coefficients (as on the first start) and go to the training step. In the result we still write the average coefficients to memory.
