I am currently learning ML on Coursera through Andrew Ng's Machine Learning course. I am doing the assignments in Python because I am more used to it than Matlab. I have recently run into a problem with my understanding of regularization. My understanding is that regularization lets us keep less important features that still contribute something to the prediction, while shrinking their influence. But while implementing it, I don't understand why the first element of theta (the parameters), i.e. theta[0], is skipped when calculating the cost. I have looked at other solutions, but they do the same skipping without explanation.
Here is the code:
term1 = np.dot(-np.array(y).T, np.log(h(theta, X)))
term2 = np.dot((1 - np.array(y)).T, np.log(1 - h(theta, X)))
regterm = (lambda_ / 2) * np.sum(np.dot(theta[1:].T, theta[1:]))  # Skip theta[0]. Explain this line
J = float((1 / m) * (np.sum(term1 - term2) + regterm))

grad = np.dot((sigmoid(np.dot(X, theta)) - y), X) / m
grad_reg = grad + ((lambda_ / m) * theta)
grad_reg[0] = grad[0]  # the bias gradient is left unregularized
And here is the formula (the regularized logistic-regression cost):
J(theta) = (1/m) * sum( -y*log(h(x)) - (1-y)*log(1-h(x)) ) + (lambda/(2*m)) * sum(theta[j]^2 for j >= 1)
Here J(theta) is the cost function,
h(x) is the sigmoid function, i.e. the hypothesis,
lambda is the regularization parameter.
Theta0 refers to the bias.
Bias comes into the picture when we want our decision boundaries to be positioned properly. Just consider the example of
Y1 = w1 * X and Y2 = w2 * X
When the values of X come close to zero, it can become very hard to separate the two lines; this is where the bias comes into play.
Y1 = w1 * X + b1 and Y2 = w2 * X + b2
Now, through learning, the decision boundaries stay clearly separated.
Now let's consider why we use regularization: so that we don't over-fit, and so the curve stays smooth. As you can see from the equations, it is the slopes w1 and w2 that need smoothing; the biases are just the intercepts of the separating lines. So there is no point in including them in the regularization term.
We can include the bias, and in the case of neural networks it usually won't make much difference. But we risk shrinking the bias so much that the decision boundary misplaces data points. Thus, it's better not to include the bias in regularization.
Hope this answers your question.
Originally published: https://medium.com/#shrutijadon10104776/why-we-dont-use-bias-in-regularization-5a86905dfcd6
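To make the "skip theta[0]" point concrete, here is a minimal, self-contained sketch of a regularized cost and gradient that follows the formula above (the function name costFunctionReg, the sigmoid helper and the toy data are my own illustration, not the course's exact starter code):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def costFunctionReg(theta, X, y, lambda_):
    """Regularized logistic-regression cost and gradient.

    theta[0] multiplies the constant column of ones (the bias/intercept),
    so it is excluded from the penalty term and from the regularization
    part of the gradient.
    """
    m = len(y)
    h = sigmoid(X @ theta)

    # Cross-entropy part uses all parameters.
    J = (1 / m) * np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h))
    # Penalty part deliberately starts at theta[1].
    J += (lambda_ / (2 * m)) * np.sum(theta[1:] ** 2)

    grad = (X.T @ (h - y)) / m
    grad[1:] += (lambda_ / m) * theta[1:]  # bias gradient left unpenalized
    return J, grad

# Tiny usage example with made-up data; the first column of X is the bias column of ones.
X = np.array([[1.0, 0.5], [1.0, -1.2], [1.0, 2.3]])
y = np.array([1.0, 0.0, 1.0])
J, grad = costFunctionReg(np.zeros(2), X, y, lambda_=1.0)
print(J, grad)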
I have a Logistic Regression model. There are around 10 features, 3 of which are highly correlated (let's call them x_5, x_6, x_7). In fact x_5 + x_6 = x_7. But they are all important in a business sense.
I did a log transformation on the data, and since there are quite a number of zeros, I also added 1 to all values. That means:
1) x_5 + x_6 = x_7
2) I did log(1 + x_5), log(1 + x_6) and log(1 + x_7) (and also other features)
And then I fit a Logistic Regression in the different cases and checked the coefficients (let's call them beta_5, beta_6, beta_7 for x_5, x_6, x_7 respectively). The cases are summarized below (zero means I omit the variable, i.e. in case 2 I omitted x_7).
There are some things that I find confusing.
1) The signs of beta_5 and beta_6 change from case 1 to case 2. I understand this is because of multicollinearity, but does it affect the predictive performance of my logistic model?
2) The value of beta_7 drops quite significantly from case 1 to case 3. Does case 3 better reflect the importance of x_7?
3) Based on these findings, which case should I use? Or how should I make the decision?
Thanks for your help!
Since you have the governing equation x5 + x6 = x7, you may drop one of them from the beginning.
To be confident about the final choice, you could apply L1 (Lasso-style) regularization to see which feature(s) can be removed.
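As a sketch of that suggestion (the synthetic data and column names below are made up for illustration; only the redundancy x7 = x5 + x6 mirrors the question), an L1-penalized logistic regression in scikit-learn tends to drive redundant coefficients to zero:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic features with the same redundancy as in the question: x7 = x5 + x6.
x5 = rng.exponential(2.0, n)
x6 = rng.exponential(3.0, n)
x7 = x5 + x6
other = rng.normal(size=(n, 2))
X = np.column_stack([np.log1p(x5), np.log1p(x6), np.log1p(x7), other])

# Made-up target that depends mainly on x5, x6 and one extra feature.
logits = 0.8 * np.log1p(x5) - 0.5 * np.log1p(x6) + other[:, 0]
y = (logits + rng.normal(scale=0.5, size=n) > 0).astype(int)

# L1 penalty (Lasso-style) zeroes out columns that carry no extra information.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)
print(dict(zip(["log1p_x5", "log1p_x6", "log1p_x7", "o1", "o2"], clf.coef_[0].round(3))))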
I'm learning to train a Linear Regression model via TensorFlow.
It's quite a simple formula:
y = W * x + b
I have generated some sample data:
After training the model, I can see in TensorBoard that "W" is learned correctly while "b" goes in a completely wrong direction, so the loss stays quite high.
Here is my code.
QUESTION
Why is "b" being trained a wrong way?
Shall I do something with the optimizer?
On line 16, you are adding Gaussian noise with a standard deviation of 300!
noise = np.random.normal(scale=n, size=(N, 1))
Try using:
noise = np.random.normal(size=(N, 1))
That's using mean=0 and std=1 (standard Gaussian noise).
Also, 20k iterations is more than enough (in this problem) for training.
For a more comprehensive explanation of what is happening, look at your plot. Given an x value, the possible values for y span thousands of units. That means that there are a lot of lines that explain your data. Hence a lot of values for b are possible, but no matter which one you choose (even the true b value), all of them are going to have a big loss.
The optimization is working correctly, but the problem is with the b parameter, whose estimate is much more heavily influenced by the initial "roll of the dice" of the noise (which has a standard deviation of 300) than by the actual value of b_true (which is much smaller than that).
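To see this numerically, here is a small plain-NumPy sketch (the values W_true = 2, b_true = 0.5, N = 1000 and the x range are made up for illustration): with noise of scale 300 the fitted intercept wanders far from b_true, while with unit noise both parameters are recovered closely.

import numpy as np

rng = np.random.default_rng(42)
N, W_true, b_true = 1000, 2.0, 0.5
x = rng.uniform(0, 100, size=N)

for noise_scale in (300.0, 1.0):
    y = W_true * x + b_true + rng.normal(scale=noise_scale, size=N)
    # Ordinary least-squares fit of y = W*x + b; polyfit returns [slope, intercept].
    W_hat, b_hat = np.polyfit(x, y, deg=1)
    print(f"noise std {noise_scale:>6}:  W ~ {W_hat:6.3f}   b ~ {b_hat:8.3f}")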
I am playing with Rule 110 of Wolfram's cellular automata. Given a line of zeroes and ones, you can calculate the next line with these rules:
Starting with 00000000....1 (a single 1 at the end), you get this sequence:
Just out of curiosity, I decided to approximate these rules with a polynomial, so that cells could be not only 0 and 1, but also any gray value in between:
def triangle(x, y, z, v0):
    v = (y + y * y + y * y * y - 3. * (1. + x) * y * z + z * (1. + z + z * z)) / 3.
    return (v - v0) * (v - v0)
So if x, y, z and v0 match any of the rules from the table, it returns 0, and a positive nonzero value otherwise.
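As a quick sanity check (assuming the standard Rule 110 truth table: 111→0, 110→1, 101→1, 100→0, 011→1, 010→1, 001→1, 000→0), the triangle function above does return exactly zero on every rule entry and a positive value on every violation:

from itertools import product

# Standard Rule 110 lookup: (left, center, right) -> next state of the center cell.
RULE110 = {(1, 1, 1): 0, (1, 1, 0): 1, (1, 0, 1): 1, (1, 0, 0): 0,
           (0, 1, 1): 1, (0, 1, 0): 1, (0, 0, 1): 1, (0, 0, 0): 0}

for left, center, right in product((0, 1), repeat=3):
    v0 = RULE110[(left, center, right)]
    assert triangle(left, center, right, v0) == 0.0     # matches the rule -> zero
    assert triangle(left, center, right, 1 - v0) > 0.0  # violates the rule -> positive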
Next, I added all possible groups of 4 neighbors into a single sum, which will be zero for integer solutions:
def eval():
    s = 0.
    for i in range(W - 1):
        for j in range(1, W + 1):
            xx = x[i, (j - 1) % W]
            yy = x[i, j % W]
            zz = x[i, (j + 1) % W]
            r = x[i + 1, j % W]
            s += triangle(xx, yy, zz, r)
    # Conditions for the first row: every cell should be 0 except the last one, which should be 1.
    for j in range(W - 1):
        s += x[0, j] * x[0, j]
    s += (1 - x[0, W - 1]) * (1 - x[0, W - 1])
    return torch.sqrt(s)
At the bottom of this function I also add the conditions for the first row, so that all elements are 0 except the last one, which is 1. Finally, I minimize this sum of squares over a W*W matrix with PyTorch:
x = Variable(torch.DoubleTensor(W, W).zero_(), requires_grad=True)
opt = torch.optim.LBFGS([x], lr=.1)
for i in range(15500):
    def closure():
        opt.zero_grad()
        s = eval()
        s.backward()
        return s
    opt.step(closure)
Here is the full code; you can try it yourself. The problem is that for a 10*10 board it converges to a correct solution in ~20 steps:
But if I take a 15*15 board, it never finishes converging:
The graph on the right shows how the sum of squares changes with each iteration, and you can see that it never reaches zero. My question is why this happens and how I can fix it. I tried different PyTorch optimisers, but all of them perform worse than LBFGS. I also tried different learning rates. Any ideas why this happens and how I can reach the final point during optimisation?
UPD: improved convergence graph, log of SOS:
UPD2: I also tried doing the same in C++ with dlib, and I don't have any convergence issues there; it goes much deeper in much less time:
I am using this code for optimisation in C++:
find_min_using_approximate_derivatives(bfgs_search_strategy(),
                                       objective_delta_stop_strategy(1e-87),
                                       s, x, -1);
What you're trying to do here is non-convex optimisation, and this is a notoriously difficult problem. Once you think about it, it makes sense, because just about any practical mathematical problem can be formulated as an optimisation problem.
1. Prelude
So, before giving you hints as to where to find a solution to your particular problem, I want to illustrate why certain optimisation problems are easy to solve.
I'm going to start by discussing convex problems. These are easy to solve even in the constrained case, and the reason for this is that when you compute the gradient you actually get a lot of information about where the minimum cannot be (the first-order Taylor expansion of a convex function f is always an underestimate of f); additionally, there is only one minimum and no saddle points. If you're interested in learning more about convex optimisation, I recommend watching Stephen Boyd's class on convex optimisation on YouTube.
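For reference, that underestimate property is the first-order condition for convexity: for a differentiable convex function f,

$$ f(y) \ge f(x) + \nabla f(x)^\top (y - x) \quad \text{for all } x, y, $$

so the linear approximation built from the gradient at any point lies below the function everywhere, which is what makes the gradient so informative.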
Now then, if non-convex optimisation is so difficult, how come we are able to solve it in deep learning? The answer is simply that the non-convex functions we minimise in deep learning are quite well-behaved, as demonstrated by Henaff et al.
It is therefore important that machine learning practitioners realise that the procedures used in deep learning will most likely not yield a good minimum on other non-convex problems, if they converge to a minimum at all.
2. Answer to your question
Now then, to answer your question: you're probably not going to find a fast solution, as non-convex optimisation is NP-hard in general. But fear not, SciPy has a few global optimisation algorithms to choose from. Here is a link to another Stack Overflow thread with a good answer to your question.
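As a minimal sketch of what calling one of those SciPy global optimisers looks like (the objective below is a stand-in toy function with many local minima, not your Rule 110 loss):

import numpy as np
from scipy.optimize import differential_evolution

# Rastrigin-like toy objective: highly non-convex, global minimum 0 at the origin.
def objective(v):
    return 10 * len(v) + np.sum(v ** 2 - 10 * np.cos(2 * np.pi * v))

bounds = [(-5.12, 5.12)] * 4  # one (low, high) pair per variable
result = differential_evolution(objective, bounds, seed=0, tol=1e-8)
print(result.x, result.fun)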
3. Moral of the story
Finally, I want to remind you that convergence guarantees are important; forgetting this has led to an oil rig collapsing.
PS. Please forgive typos, I'm using my phone for this.
Update: As to why BFGS works with dlib, there might be two reasons: firstly, BFGS is better at using curvature information than L-BFGS, and secondly, it uses a line search to find an optimal step size. I'd recommend checking whether PyTorch allows line searches and, if not, setting a decreasing step size (or just a really low one).
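For what it's worth, newer PyTorch releases do expose a line search for LBFGS through the line_search_fn argument. A hedged sketch of how the optimiser from the question could be configured (the lr and history_size values here are only examples):

import torch

# x stands in for the W*W parameter tensor from the question.
x = torch.zeros(10, 10, dtype=torch.float64, requires_grad=True)

opt = torch.optim.LBFGS(
    [x],
    lr=1.0,                         # with a line search, lr only sets the initial step scale
    history_size=100,               # keep more curvature information, closer to full BFGS
    line_search_fn="strong_wolfe",  # enable the strong Wolfe line search
)
# opt.step(closure) is then used exactly as before.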
I am trying to predict the quality of a metal coil. I have metal coils with a width of 10 meters and a length from 1 to 6 kilometers. As training data I have ~600 parameters measured every 10 meters, plus the final quality-control mark for the whole coil: good/bad. Bad means there is at least one place where the coil is bad, but there is no data on where exactly. I have data for approximately 10,000 coils.
Let's imagine we want to train a logistic regression for this data (with 2 factors).
X = [[0, 0],
     ...
     [0, 0],
     [1, 1],  # coil is actually broken here, but we don't know it yet.
     [0, 0],
     ...
     [0, 0]]
Y = ?????
I can't just put all "bad" in Y and run the classifier, because that would be confusing for the classifier. I can't put all "good" and one "bad" in Y, because I don't know where the bad position is.
The solution I have in mind is the following: I could define the loss function as sum( (Y - min(F(x1,x2)))^2 ) (the min taken over all F values belonging to one coil) instead of sum( (Y - F(x1,x2))^2 ). In that case F would probably be trained to point at the bad place. I need a gradient for that, and it is impossible to calculate it at every point because min is not differentiable everywhere, but I could use a weak (sub)gradient instead (using the value of the function that is minimal within the coil at each place).
I more or less know how to implement this myself; the question is what the simplest way to do it in Python with scikit-learn is. Ideally it should be the same (or easily adaptable) for several learning methods (many methods are based on a loss function and a gradient). Is it possible to write some wrapper for learning methods that works this way?
Update: looking at gradient_boosting.py, there is an internal abstract class LossFunction with the ability to calculate loss and gradient; it looks promising. It looks like there is no common solution.
What you are considering here is known in the machine learning community as superset learning, meaning that instead of the typical supervised setting where you have a training set of the form {(x_i, y_i)}, you have {({x_1, ..., x_N}, y_1)}, such that you know that at least one element of the set has property y_1. This is not a very common setting, but it exists and there is some research available; google for papers in the domain.
In terms of your own loss functions, scikit-learn is a no-go. Scikit-learn is about simplicity; it provides you with a small set of ready-to-use tools with very little flexibility. It is not a research tool, and your problem is research-y. What can you use instead? I suggest you go for an automatic-differentiation solution, for example autograd, which gives you the ability to differentiate through Python code; simply apply scipy.optimize.minimize on top of it and you are done! Any custom loss function will work just fine.
As a side note, the minimum operator is not differentiable, so the model might have a hard time figuring out what is going on. You could instead try sum((Y - prod_x F(x_1, x_2))^2), since multiplication is nicely differentiable, and you will still get a similar effect: if at least one element is predicted to be 0, it removes any "1" answer from the remaining ones. You can even go one step further, to make it more numerically stable, and do:
if Y==0 then loss = sum_x log(F(x_1, x_2 ) )
if Y==1 then loss = sum_x log(1-F(x_1, x_2))
which translates to
Y * sum_x log(1-F(x_1, x_2)) + (1-Y) * sum_x log( F(x_1, x_2) )
You can notice the similarity with the cross-entropy cost, which makes perfect sense since your problem is indeed a classification. And now you have a proper probabilistic loss: you are attaching probabilities to each segment being "bad" or "good", so the probability of the whole object being bad is either high (if Y==0) or low (if Y==1).
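A minimal sketch of that recipe with autograd plus scipy.optimize.minimize (the toy data and the logistic per-segment model are my own illustration; the loss is written as the negative log-likelihood of "the coil is good iff every segment is good", which is the product-based idea above in a numerically convenient form):

import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize

# Toy data: each coil is (segments, label); label 1 = good coil, 0 = bad coil.
# Segments are rows of an (n_segments, 2) array, matching the 2-factor example.
rng = np.random.RandomState(0)
coils = []
for k in range(200):
    segs = 0.05 * rng.randn(50, 2)
    label = 1
    if k % 2 == 0:                    # half the coils have one hidden bad spot
        segs[rng.randint(50)] += [1.0, 1.0]
        label = 0
    coils.append((segs, label))

def seg_good_prob(params, segs):
    w, b = params[:2], params[2]
    return 1.0 / (1.0 + np.exp(np.dot(segs, w) + b))  # P(segment is good)

def nll(params):
    # Coil is good iff every segment is good => P(coil good) = product of segment probs.
    total = 0.0
    for segs, label in coils:
        log_p_good = np.sum(np.log(seg_good_prob(params, segs) + 1e-12))
        if label == 1:
            total = total - log_p_good                                   # -log P(coil good)
        else:
            total = total - np.log(1.0 - np.exp(log_p_good) + 1e-12)     # -log P(coil bad)
    return total

res = minimize(nll, x0=np.zeros(3), jac=grad(nll), method="L-BFGS-B")
print(res.x, res.fun)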
I'm teaching myself PyMC but got stuck with the following problem:
I have a model whose parameters should be determined from successive measurements. In the beginning the parameter's prior is uninformative, but should be updated after each measurement (i.e. replaced by the posterior). In short, I want to do sequential updating with PyMC.
Consider the following (somewhat constructed) example:
Measurement 1: 10 questions, 9 correct answers
Measurement 2: 5 questions, 3 correct answers
Of course, this can be solved analytically with beta/binomial conjugate priors, but this is not the point here :)
Alternatively, both measurements could be combined to n=15 and k=12. However, this is too simple. I want to take the hard way for educational purposes.
I found a solution in this answer, where new priors are sampled from the posterior. This is almost what I want, but sampling the prior feels a bit messy because the result depends on the number of samples and other settings.
My attempted solution puts both measurements and their priors separately into one model, like this:
n1, k1 = 10, 9
n2, k2 = 5, 3
theta1 = pymc.Beta('theta', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=n1, p=theta1, value=k1, observed=True)
theta2 = ? # should be the posterior of theta1
outcome2 = pymc.Binomial('outcome2', n=n2, p=theta2, value=k2, observed=True)
How can I get the posterior of theta1 as the prior of theta2?
Is this even possible, or did I just demonstrate ultimate ignorance about Bayesian statistics?
The only way sequential updating works sensibly is in two different models. Specifying them in the same model does not make any sense, since we have no posteriors until after MCMC has completed.
In principle, you would examine the distribution of theta1 and specify a prior that best resembles it. In this simple case it is easy -- it would be:
theta2 = pymc.Beta('theta2', alpha=10, beta=2)
since you don't need MCMC to determine what the posterior of theta is. More generally, you could fit a Beta distribution to the posterior, say using scipy.stats.beta.fit.
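A hedged sketch of that more general approach, assuming the PyMC 2 style API used in the question (the variable names and MCMC sample counts are arbitrary): sample the posterior of theta1 from a first model, fit a Beta distribution to those samples with scipy.stats.beta.fit, and use the fitted parameters as the prior of theta2 in a second model.

import pymc
from scipy import stats

# Model 1: first measurement (10 questions, 9 correct).
theta1 = pymc.Beta('theta1', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=10, p=theta1, value=9, observed=True)
mcmc1 = pymc.MCMC([theta1, outcome1])
mcmc1.sample(20000, burn=5000)

# Fit a Beta to the posterior samples of theta1 (floc/fscale pin the support to [0, 1]).
samples = mcmc1.trace('theta1')[:]
alpha2, beta2, _, _ = stats.beta.fit(samples, floc=0, fscale=1)

# Model 2: the fitted posterior becomes the prior for the second measurement (5 questions, 3 correct).
theta2 = pymc.Beta('theta2', alpha=alpha2, beta=beta2)
outcome2 = pymc.Binomial('outcome2', n=5, p=theta2, value=3, observed=True)
mcmc2 = pymc.MCMC([theta2, outcome2])
mcmc2.sample(20000, burn=5000)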