I am trying to predict the quality of a metal coil. Each coil is 10 meters wide and 1 to 6 kilometers long. As training data I have ~600 parameters measured every 10 meters along the coil, plus a final quality-control mark for the whole coil: good/bad. Bad means there is at least one place where the coil is bad, but there is no data on where exactly. I have data for approximately 10000 coils.
Let's imagine we want to train a logistic regression on this data (with 2 features).
X = [[0, 0],
...
[0, 0],
[1, 1], # coil is actually broken here, but we don't know it yet.
[0, 0],
...
[0, 0]]
Y = ?????
I can't just put all "bad" in Y and run a classifier, because that would be confusing for the classifier. I can't put all "good" and one "bad" in Y because I don't know where the bad position is.
The solution I have in mind is the following: I could define the loss function as sum((Y - min(F(x1, x2)))^2) (with the min taken over all points belonging to one coil) instead of sum((Y - F(x1, x2))^2). In that case F would probably be trained correctly to point at the bad place. I need a gradient for that; it is impossible to calculate it everywhere, since min is not differentiable at all points, but I could use a weak gradient instead, passing the gradient through the point whose value of F is minimal within the coil.
I more or less know how to implement this myself. The question is: what is the simplest way to do it in Python with scikit-learn? Ideally it should be the same (or easily adaptable) for several learning methods (many methods are based on a loss function and a gradient). Is it possible to make some wrapper for learning methods that works this way?
Update: looking at gradient_boosting.py, there is an internal abstract class LossFunction with the ability to calculate the loss and its gradient, which looks promising. It seems there is no general solution, though.
What you are considering here is known in the machine learning community as superset learning: instead of the typical supervised setting, where you have a training set of the form {(x_i, y_i)}, you have {({x_1, ..., x_N}, y_1)}, meaning that you know that at least one element of the set has property y_1. This is not a very common setting, but it exists and some research is available; google for papers in the domain.
In terms of your own loss functions - scikit-learn is a no-go. Scikit-learn is about simplicity: it provides a small set of ready-to-use tools with very little flexibility. It is not a research tool, and your problem is research-y. What can you use instead? I suggest you go for any automatic-differentiation solution, for example autograd, which gives you the ability to differentiate through Python code. Simply apply scipy.optimize.minimize on top of it and you are done! Any custom loss function will work just fine.
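For instance, a minimal sketch of that workflow (the data and the logistic model here are made up for illustration):

import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize

# Made-up data: 100 points with 2 features each, binary labels.
X = np.random.randn(100, 2)
y = (X[:, 0] + X[:, 1] > 0).astype(float)

def loss(w):
    p = 1.0 / (1.0 + np.exp(-np.dot(X, w)))  # logistic model
    return np.sum((y - p) ** 2)              # any custom loss works here

# autograd differentiates the plain-Python loss for us.
result = minimize(loss, x0=np.zeros(2), jac=grad(loss), method='L-BFGS-B')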
As a side note - the minimum operator is not differentiable, so the model might have a hard time figuring out what is going on. You could instead try sum((Y - prod_x F(x_1, x_2))^2), since multiplication is nicely differentiable, and you still get a similar effect - if at least one element is predicted to be 0, it zeroes out the product regardless of the remaining answers. You can even go one step further to make it more numerically stable and do:
if Y==0 then loss = sum_x log(F(x_1, x_2 ) )
if Y==1 then loss = sum_x log(1-F(x_1, x_2))
which translates to
Y * sum_x log(1-F(x_1, x_2)) + (1-Y) * sum_x log( F(x_1, x_2) )
You can notice the similarity with the cross-entropy cost, which makes perfect sense, since your problem is indeed a classification. And now you have a perfectly probabilistic loss - you are attaching probabilities of each segment being "bad" or "good", so the probability of the whole object being bad is either high (if Y==0) or low (if Y==1).
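A minimal sketch of this loss with autograd (the coils data structure and the logistic form of F are illustrative assumptions):

import autograd.numpy as np
from autograd import grad
from scipy.optimize import minimize

# Illustrative data: a list of (segment_features, coil_label) pairs,
# one feature row per 10-meter segment.
coils = [(np.random.randn(50, 2), 1.0), (np.random.randn(80, 2), 0.0)]

def neg_log_likelihood(w):
    total = 0.0
    for segs, Y in coils:
        F = 1.0 / (1.0 + np.exp(-np.dot(segs, w)))  # per-segment probability
        # Y * sum log(1 - F) + (1 - Y) * sum log(F), negated for minimization;
        # the small epsilon guards against log(0).
        eps = 1e-12
        total -= Y * np.sum(np.log(1 - F + eps)) + (1 - Y) * np.sum(np.log(F + eps))
    return total

w_opt = minimize(neg_log_likelihood, np.zeros(2), jac=grad(neg_log_likelihood)).x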
Related
I have a Logistic Regression model. There are around 10 features, 3 of which are highly correlated (let's call them x_5, x_6, x_7). In fact x_5 + x_6 = x_7. But they are all of some importance in the business sense.
I did a log transformation on the data, and since there are quite a number of zeros, I also added 1 to all data. That means:
1) x_5 + x_6 = x_7
2) I did log(1 + x_5), log(1 + x_6) and log(1 + x_7) (and also other features)
And then I fit a Logistic Regression in several cases and checked the coefficients (let's call them beta_5, beta_6, beta_7 for x_5, x_6, x_7 respectively). The cases are summarized below (zero means I omitted the variable, i.e. in case 2 I omitted x_7).
There are some things that I find confusing.
1) The signs of beta_5 and beta_6 change from case 1 to case 2. I understand this is because of the multicollinearity issue. But does it affect the predictive power of my logistic model?
2) The value of beta_7 drops quite significantly from case 1 to case 3. Does case 3 better explain the importance of x_7?
3) Based on these findings, which case should I use? And how should I make the decision?
Thanks for your help!
As you have the governing equation x_5 + x_6 = x_7, you may drop one of the three variables from the beginning.
To be confident about the final solution, you could apply L1 (Lasso) regularization to learn which feature(s) could be removed.
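A minimal scikit-learn sketch (the data here is fabricated to reproduce the collinearity); with an L1 penalty, redundant coefficients are driven to exactly zero:

import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.random.rand(1000, 3)
X[:, 2] = X[:, 0] + X[:, 1]  # x_7 = x_5 + x_6, perfectly collinear
y = (X[:, 2] > 1.0).astype(int)

# The L1 penalty performs feature selection: expect some coefficients at 0.
clf = LogisticRegression(penalty='l1', solver='liblinear', C=0.1).fit(X, y)
print(clf.coef_)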
I'm learning to train a Linear Regression model via TensorFlow.
It's quite a simple formula:
y = W * x + b
I have generated sample data:
After training the model I can see in TensorBoard that "W" is correct, while "b" goes completely the wrong way. So the loss is quite high.
Here is my code.
QUESTION
Why is "b" being trained a wrong way?
Shall I do something with the optimizer?
On line 16, you are adding Gaussian noise with a standard deviation of 300!
noise = np.random.normal(scale=n, size=(N, 1))
Try using:
noise = np.random.normal(size=(N, 1))
That's using mean=0 and std=1 (standard Gaussian noise).
Also, 20k iterations is more than enough (in this problem) for training.
For a more comprehensive explanation of what is happening, look at your plot. Given an x value, the possible values for y span thousands of units. That means there are a lot of lines that explain your data; hence a lot of values for b are possible, but no matter which one you choose (even the true b value), all of them are going to have a big loss.
The optimization is working correctly; the problem is with the b parameter, whose estimate is much more heavily influenced by the initial "roll of the dice" of the noise (which has a standard deviation of 300) than by the actual value of b_true (which is much smaller than that).
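A quick illustration (made-up data, assuming w_true = 5 and b_true = 2): fitting the same line with noise of standard deviation 300 versus 1 shows how the intercept estimate degrades:

import numpy as np

N = 1000
x = np.random.rand(N) * 10
w_true, b_true = 5.0, 2.0

for noise_std in (300.0, 1.0):
    y = w_true * x + b_true + np.random.normal(scale=noise_std, size=N)
    w_hat, b_hat = np.polyfit(x, y, deg=1)  # least-squares line fit
    print(noise_std, w_hat, b_hat)          # b_hat is far off when std is 300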
I am currently learning ML on Coursera with the help of the ML course by Andrew Ng. I am doing the assignments in Python because I am more used to it than MATLAB. I have recently run into a problem regarding my understanding of regularization. My understanding is that regularization lets one include additional, less important features that still contribute enough to prediction. But while implementing it, I don't understand why the first element of theta (the parameters), i.e. theta[0], is skipped when calculating the cost. I have looked at other solutions, but they also skip it without explanation.
Here is the code:
term1 = np.dot(-np.array(y).T, np.log(h(theta, X)))
term2 = np.dot((1 - np.array(y)).T, np.log(1 - h(theta, X)))
regterm = (lambda_ / 2) * np.sum(np.dot(theta[1:].T, theta[1:]))  # Skip theta[0]. Explain this line
J = float((1 / m) * (np.sum(term1 - term2) + regterm))
grad = np.dot((sigmoid(np.dot(X, theta)) - y), X) / m
grad_reg = grad + ((lambda_ / m) * theta)
grad_reg[0] = grad[0]  # the bias gradient is restored, undoing its regularization
And here is the formula:
J(theta) = (1/m) * sum_{i=1..m} [ -y_i * log(h(x_i)) - (1 - y_i) * log(1 - h(x_i)) ] + (lambda/(2m)) * sum_{j=1..n} theta_j^2
Here J(theta) is the cost function,
h(x) is the sigmoid function, or hypothesis,
lambda is the regularization parameter, and
theta_0 refers to the bias. Note that the regularization sum runs from j = 1, skipping theta_0.
Bias comes into the picture when we want our decision boundaries to be separated properly. Consider the example of
Y1 = w1 * X and Y2 = w2 * X
When the values of X come close to zero, there can be cases where it is tough to separate the two; this is where bias comes into play:
Y1 = w1 * X + b1 and Y2 = w2 * X + b2
Now, via learning, the decision boundaries will be clear all the time.
Let's consider why we use regularization: so that we don't over-fit, and to smooth the curve. As you can see from the equations, it is the slopes w1 and w2 that need smoothing; the biases are just the intercepts of the separation. So there is no point in including them in regularization.
Although we can include the bias, in the case of neural networks it won't make any difference. We might, however, face the issue of shrinking the bias value so much that it confuses the data points. Thus, it's better not to use the bias in regularization.
Hope it answers your question.
Originally published: https://medium.com/#shrutijadon10104776/why-we-dont-use-bias-in-regularization-5a86905dfcd6
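A small illustration of this point (a closed-form ridge-regression sketch on made-up data; penalizing the intercept pulls it toward 0 and biases the fit when the data is not centered):

import numpy as np

rng = np.random.RandomState(0)
X = np.c_[np.ones(200), rng.rand(200)]  # column 0 is the bias feature
y = 10.0 + 3.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

lam = 50.0
for penalize_bias in (True, False):
    P = lam * np.eye(2)
    if not penalize_bias:
        P[0, 0] = 0.0  # exclude the bias from the penalty
    theta = np.linalg.solve(X.T @ X + P, X.T @ y)
    print(penalize_bias, theta)  # the penalized intercept is pulled toward 0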
I'm working with extremely noisy data occasionally peppered with outliers, so I'm relying mostly on correlation as a measure of accuracy in my NN.
Is it possible to explicitly use something like rank correlation (the Spearman correlation coefficient) as my cost function? Up to now, I've relied mostly on MSE as a proxy for correlation.
I have three major stumbling blocks right now:
1) The notion of ranking becomes much fuzzier with mini-batches.
2) How do you dynamically perform rankings? Will TensorFlow not have a gradient error/be unable to track how a change in a weight/bias affects the cost?
3) How do you determine the size of the tensors you're looking at during runtime?
For example, the code below is roughly what I'd like to do if I were to just use (Pearson) correlation. In practice, length needs to be passed in rather than determined at runtime.
length = tf.cast(tf.shape(x)[1], tf.float32)  ## Example code. This line not meant to work.
# Negative Pearson numerator, so that minimizing the loss maximizes correlation.
original_loss = -1 * (length * tf.reduce_sum(tf.multiply(x, y)) - tf.reduce_sum(x) * tf.reduce_sum(y))
divisor = tf.sqrt(
    (length * tf.reduce_sum(tf.square(x)) - tf.square(tf.reduce_sum(x))) *
    (length * tf.reduce_sum(tf.square(y)) - tf.square(tf.reduce_sum(y)))
)
original_loss = tf.truediv(original_loss, divisor)
Here is the code for the Spearman correlation:
predictions_rank = tf.nn.top_k(predictions_batch, k=samples, sorted=True, name='prediction_rank').indices
real_rank = tf.nn.top_k(real_outputs_batch, k=samples, sorted=True, name='real_rank').indices
rank_diffs = predictions_rank - real_rank
rank_diffs_squared_sum = tf.reduce_sum(rank_diffs * rank_diffs)
six = tf.constant(6)
one = tf.constant(1.0)
numerator = tf.cast(six * rank_diffs_squared_sum, dtype=tf.float32)
divider = tf.cast(samples * samples * samples - samples, dtype=tf.float32)
spearman_batch = one - numerator / divider
The problem with the Spearman correlation is that you need to use a sorting algorithm (top_k in my code), and there is no way to translate that into a loss value: a sorting algorithm has no derivative. You could use a normal (Pearson) correlation instead, but I think there is mathematically no difference from using the mean squared error.
I am working on this right now for images. What I have read in papers is that, to add ranking into the loss function, you compare 2 or 3 images (where I say images, it can be anything you want to rank).
Comparing two elements, you use a pairwise margin ranking loss, roughly of the form
loss = (1/N) * sum_{i,j} max(0, alpha - (F(x_i) - F(x_j))), taken over pairs where x_i should rank above x_j,
where N is the total number of elements and alpha is a margin value. I got this equation from Photo Aesthetics Ranking Network with Attributes and Content Adaptation.
You can also use losses with 3 elements, where you compare two elements of similar rank with one of a different rank - a triplet loss, roughly max(0, alpha + d(F(x_a), F(x_p)) - d(F(x_a), F(x_n))). But in this equation you also need to add the direction of the ranking; more details in Will People Like Your Image?. In the case of that paper they use a vector encoding instead of a real value, but you can do it for just a number too.
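Here is a minimal TensorFlow sketch of the pairwise version (score_i, score_j and the margin value are illustrative names, assuming element i should rank above element j):

import tensorflow as tf

# Scores predicted by the network for paired elements, where element i
# should rank above element j.
score_i = tf.placeholder(tf.float32, [None])
score_j = tf.placeholder(tf.float32, [None])
margin = 0.5

# Pairwise margin ranking (hinge) loss: zero once score_i beats score_j by the margin.
ranking_loss = tf.reduce_mean(tf.maximum(0.0, margin - (score_i - score_j)))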
In the case of images, the comparison between images makes more sense when those images are related. So it is a good idea to run a clustering algorithm to create (say) 10 clusters, so that you can use elements of the same cluster to make comparisons instead of comparing very different things. This will help the network, as the inputs are related somehow and not completely different.
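For example, a scikit-learn sketch of that preprocessing step (the 128-dimensional embeddings are a made-up stand-in for image features):

import numpy as np
from sklearn.cluster import KMeans

features = np.random.rand(1000, 128)  # illustrative image embeddings
labels = KMeans(n_clusters=10).fit_predict(features)

# Build training pairs only from images that share a cluster.
pairs = [(i, j) for i in range(len(labels)) for j in range(i + 1, len(labels))
         if labels[i] == labels[j]]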
As a side note, you should decide what is more important to you: the final rank order or the rank value. If it is the value, you should go with mean squared error; if it is the rank order, you can use the losses described above. Or you can even combine them.
How do you determine the size of the tensors you're looking at during runtime?
tf.shape(tensor) returns a tensor holding the shape, evaluated at runtime. Then you can use tf.gather(tensor, index) to get the value you want.
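For example (a tiny sketch with an assumed 2-D placeholder):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, None])
shape = tf.shape(x)               # dynamic shape, known only at runtime
batch_size = tf.gather(shape, 0)  # size of the first dimension
length = tf.gather(shape, 1)      # size of the second dimension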
I'm teaching myself PyMC but got stuck with the following problem:
I have a model whose parameters should be determined from successive measurements. In the beginning the parameter's prior is uninformative, but should be updated after each measurement (i.e. replaced by the posterior). In short, I want to do sequential updating with PyMC.
Consider the following (somewhat constructed) example:
Measurement 1: 10 questions, 9 correct answers
Measurement 2: 5 questions, 3 correct answers
Of course, this can be solved analytically with beta/binomial conjugate priors, but this is not the point here :)
Alternatively, both measurements could be combined into n=15 and k=12. However, this is too simple; I want to take the hard way for educational purposes.
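(For reference, the analytic answer: a Beta(1, 1) prior updated with k=12 successes out of n=15 trials gives the posterior Beta(1 + 12, 1 + 3) = Beta(13, 4).)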
I found a solution in this answer, where new priors are sampled from the posterior. This is almost what I want, but sampling the prior feels a bit messy because the result depends on the number of samples and other settings.
My attempted solution puts both measurement and priors separately in one model, like this:
import pymc

n1, k1 = 10, 9
n2, k2 = 5, 3

theta1 = pymc.Beta('theta1', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=n1, p=theta1, value=k1, observed=True)

theta2 = ?  # should be the posterior of theta1
outcome2 = pymc.Binomial('outcome2', n=n2, p=theta2, value=k2, observed=True)
How can I get the posterior of theta1 as the prior of theta2?
Is this even possible, or did I just demonstrate ultimate ignorance about Bayesian statistics?
The only way sequential updating works sensibly is with two different models. Specifying both in the same model does not make sense, since we have no posteriors until after MCMC has completed.
In principle, you would examine the posterior distribution of theta1 and specify a prior that best resembles it. In this simple case it is easy -- it would be:
theta2 = pymc.Beta('theta2', alpha=10, beta=2)
since you don't need MCMC to determine the posterior of theta1: a Beta(1, 1) prior with 9 successes out of 10 trials gives Beta(10, 2) exactly. More generally, you could fit a Beta distribution to the posterior samples, say using scipy.stats.beta.fit.
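A rough sketch of that two-model workflow (assuming the PyMC2-style API used in the question; the iteration counts are arbitrary):

import pymc
import scipy.stats

# Model 1: uninformative prior, first measurement.
theta1 = pymc.Beta('theta1', alpha=1, beta=1)
outcome1 = pymc.Binomial('outcome1', n=10, p=theta1, value=9, observed=True)
mcmc1 = pymc.MCMC([theta1, outcome1])
mcmc1.sample(iter=20000, burn=5000)

# Fit a Beta to the posterior samples; fixing loc=0 and scale=1 keeps it on [0, 1].
samples = mcmc1.trace('theta1')[:]
alpha2, beta2, _, _ = scipy.stats.beta.fit(samples, floc=0, fscale=1)

# Model 2: the fitted posterior becomes the new prior.
theta2 = pymc.Beta('theta2', alpha=alpha2, beta=beta2)
outcome2 = pymc.Binomial('outcome2', n=5, p=theta2, value=3, observed=True)
mcmc2 = pymc.MCMC([theta2, outcome2])
mcmc2.sample(iter=20000, burn=5000)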