Tensorflow: How to train a Linear Regression? - python

I'm learning to train a Linear Regression model via TensorFlow.
It's quite a simple formula:
y = W * x + b
I have generated some sample data.
After training the model I can see in TensorBoard that "W" is correct while "b" goes a completely wrong way. So the loss stays quite high.
Here is my code.
QUESTION
Why is "b" being trained a wrong way?
Shall I do something with the optimizer?

On line 16, you are adding Gaussian noise with a standard deviation of 300!
noise = np.random.normal(scale=n, size=(N, 1))
Try using:
noise = np.random.normal(size=(N, 1))
That's using mean=0 and std=1 (standard Gaussian noise).
Also, 20k iterations are more than enough (for this problem) to train.
For a more comprehensive explanation of what is happening, look at your plot. Given an x value, the possible values for y differ by thousands of units. That means there are a lot of lines that explain your data. Hence a lot of values for b are possible, but no matter which one you choose (even the true b value), all of them are going to have a big loss.

The optimization is working correctly, but the problem is with the b parameter, whose estimate is much more heavily influenced by the initial "roll of the dice" of the noise (which has a standard deviation of 300) than by the actual value of b_true (which is much smaller than that noise scale).
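For reference, here is a minimal sketch of the setup under discussion, not the asker's original code: the values of W_true, b_true and the noise scale are illustrative, and the training loop uses TensorFlow 2's GradientTape.

import numpy as np
import tensorflow as tf

# Illustrative ground truth; the asker's actual values are not shown in the question.
N, W_true, b_true, noise_std = 1000, 3.0, 2.0, 1.0

x_data = np.random.rand(N, 1).astype(np.float32)
noise = np.random.normal(scale=noise_std, size=(N, 1)).astype(np.float32)  # keep the noise small
y_data = W_true * x_data + b_true + noise

W = tf.Variable(tf.zeros([1]))
b = tf.Variable(tf.zeros([1]))
optimizer = tf.keras.optimizers.SGD(learning_rate=0.1)

for step in range(2000):
    with tf.GradientTape() as tape:
        y_pred = W * x_data + b
        loss = tf.reduce_mean(tf.square(y_pred - y_data))
    grads = tape.gradient(loss, [W, b])
    optimizer.apply_gradients(zip(grads, [W, b]))

print(W.numpy(), b.numpy(), loss.numpy())

With noise_std on the order of 1 rather than 300, b should converge close to b_true within a few thousand steps.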

Related

How does the alpha parameter change SGDRegressor behavior for outliers?

I am using SGDRegressor with a constant learning rate and default loss function. I am curious to know how changing the alpha parameter in the function from 0.0001 to 100 will change regressor behavior. Below is the sample code I have:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor

# a, b, phi and the abline() helper are defined earlier in the original question
# (they generate the base dataset and draw a regression line, respectively).

out = [(0, 2), (21, 13), (-23, -15), (22, 14), (23, 14)]  # outliers to add one by one
alpha = [0.0001, 1, 100]
N = len(out)
plt.figure(figsize=(20, 15))
j = 1
for i in alpha:
    # For every alpha we want to start with the original dataset, so X and Y are rebuilt here
    X = b * np.sin(phi)
    Y = a * np.cos(phi)
    for num in range(N):
        plt.subplot(3, N, j)
        X = np.append(X, out[num][0])  # appending outlier to main X
        Y = np.append(Y, out[num][1])  # appending outlier to main Y
        j = j + 1  # increasing j so we move on to the next plot
        model = SGDRegressor(alpha=i, eta0=0.001, learning_rate='constant', random_state=0)
        model.fit(X.reshape(-1, 1), Y)  # fitting the model
        plt.scatter(X, Y)
        plt.title("alpha = " + str(i) + " | " + "Slope: " + str(round(model.coef_[0], 4)))  # adding a title to each plot
        abline(model.coef_[0], model.intercept_)  # plotting the line using the abline function
plt.show()
As shown above, I have the main dataset of X and Y, and in each iteration I add a point as an outlier to the main dataset, train the model, and plot the regression line (hyperplane). Below you can see the result for different values of alpha:
I am looking at the results and am still confused; I can't draw a solid conclusion about how the alpha parameter changes the model. What is the effect of alpha? Is it causing overfitting? Underfitting?
From scikit-learn:
alpha : float, default=0.0001
Constant that multiplies the regularization term. The higher the value, the stronger the regularization. Also used to compute the learning rate when learning_rate is set to 'optimal'.
As for regularization, this technique discourages learning a more complex or flexible model, so as to avoid the risk of overfitting. If there is noise (not "true" data) in the training data, then the model's estimated coefficients won’t generalize well to the future (test) data. This is where regularization comes in and shrinks or regularizes these learned estimates towards zero.
From Towards Data Science (paraphrased):
A standard least squares model tends to have some variance in it, i.e. the model won't generalize well for a data set different from its training data. Regularization significantly reduces the variance of the model without a substantial increase in its bias. The tuning parameter alpha controls the trade-off between bias and variance. As the value of alpha rises, it reduces the magnitude of the coefficients, thus reducing the variance. Up to a point, this increase in alpha is beneficial, as it only reduces the variance (hence avoiding overfitting) without losing any important properties in the data. But after a certain value, the model starts losing important properties, giving rise to bias in the model and thus underfitting.
In your example, comparing the rows of the third column highlights this effect: as alpha increases, the fitted slope shrinks.
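To see the shrinkage in isolation, here is a minimal sketch on synthetic data (the data-generation numbers are assumptions, not taken from the question) that fits SGDRegressor with increasing alpha and prints the learned slope:

import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))                 # illustrative synthetic data
y = 2.0 * X.ravel() + rng.normal(scale=0.5, size=200)  # true slope = 2

for a in [0.0001, 1, 100]:
    model = SGDRegressor(alpha=a, eta0=0.001, learning_rate='constant', random_state=0)
    model.fit(X, y)
    # The (default L2) penalty grows with alpha, pulling the slope toward zero.
    print(a, model.coef_[0], model.intercept_[0])

The printed slope should shrink as alpha grows, which is the same pattern visible across the rows of your plots.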

Interpreting logistic regression coefficients of scaled features

I'm using a logistic regression to estimate the probability of scoring a goal in soccer/footbal. I've got 5 features. My target values are 1 (goal) or 0 (no goal).
As is always a must, I've scaled my features before fitting my model. I've used the MinMaxScaler, which scales all features into the range [0, 1] as follows:
X_scaled = (x - x_min)/(x_max - x_min)
The coefficients of my logistic regression model are the following:
coef = [[-2.26286643 4.05722387 0.74869811 0.20538172 -0.49969841]]
My first thought is that the second feature is the most important, followed by the first. Is this always true?
I read "In other words, for a one-unit increase in the second feature, the expected change in log odds is 4.05722387." on this site, but there the features were normalized with a mean of 50 and some standard deviation.
If I do not scale my features, the coefficients of the model are the following:
coef = [[-0.04743728 0.04394143 -0.00247654 0.23769469 -0.55051824]]
And now it seems that the first feature is more important than the second one. I read in the literature about my topic that this is indeed true. So this confuses me, of course.
My questions are:
Which of my features is the most important and what/why is the best methodology to find it?
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one-unit increase'?
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Which of my features is the most important and what/why is the best methodology to find it?
Look at the several versions of marginal-effects calculations. For example, see the overview/discussion in the blog post "Stata's example resources for R".
How can I interpret the meaning of the scaled coefficients? E.g. what does an increase of 1 meter in feature 1 mean? Can I throw 1 meter into the MinMaxScaler, see what comes out, and use that as 'the one-unit increase'?
The interpretation depends on which marginal effects you calculate. You just need to account for scaling when you talk about one unit of X increasing/decreasing the change in probability or odds ratio etc.
Is it true that the final probability will be computed as y = 1/(1 + exp(-fx)) with fx = intercept + feature1*coef1 + feature2*coef2 + ... (with all features scaled)?
Yes, it's just that features x are in scaled measures.
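As a concrete illustration of the unit-conversion point, here is a minimal sketch (the feature ranges and effect sizes are hypothetical, not your data): a coefficient learned on MinMax-scaled data corresponds to coef / (x_max - x_min) per original unit on the log-odds scale.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.RandomState(0)
X = rng.uniform([0, 0], [40, 90], size=(500, 2))   # hypothetical ranges, e.g. distance (m) and angle (deg)
logits = 0.1 * X[:, 0] - 0.05 * X[:, 1] - 1.0      # hypothetical true effects
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-logits))).astype(int)

scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
model = LogisticRegression().fit(X_scaled, y)

# A one-unit change of a *scaled* feature spans its whole original range, so the
# per-original-unit effect on the log odds is coef / (x_max - x_min).
per_unit = model.coef_[0] / (scaler.data_max_ - scaler.data_min_)
print(model.coef_[0], per_unit)

So an increase of 1 meter in feature 1 changes the log odds by roughly coef_1 / (x_1_max - x_1_min); note that passing the raw value 1 through the scaler also subtracts x_min, so it is the scaled difference of 1 unit, not the scaled value, that matters.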

How do I make decision/ interpret results from the feature coefficients when their signs change in Logistic Regression?

I have a Logistic Regression model. There are around 10 features, 3 of which are highly correlated (let's call them x_5, x_6, x_7). In fact x_5 + x_6 = x_7. But they are all somewhat important in the business sense.
I did a log transformation on the data, and since there are quite a number of zeros, I also added 1 to all data. That means:
1) x_5 + x_6 = x_7
2) I did log(1 + x_5), log(1 + x_6) and log(1 + x_7) (and also other features)
And then I fit a Logistic Regression in different cases and checked the coefficients (let's call them beta_5, beta_6, beta_7 for x_5, x_6, x_7 respectively). The cases are summarized below (zero means I omitted the variable, i.e. in case 2 I omitted x_7).
There are some things I find confusing.
1) The signs of beta_5 and beta_6 change from case 1 to case 2. I understand this is because of the multicollinearity issue. But does it affect the predictive power of my Logistic Model?
2) The value of beta_7 drops quite significantly from case 1 to case 3. Does case 3 explain the importance of x_7 better?
3) Based on these findings, which case should I use? Or how should I make the decision?
Thanks for your help!
Since you have the governing equation x_5 + x_6 = x_7, you may drop one of them from the beginning.
To be confident about the final solution, you could apply L1 (Lasso) regularization to see which feature(s) could be removed.
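A minimal sketch of that suggestion (the data below is synthetic and only illustrates the x_5 + x_6 = x_7 situation): with an L1 penalty, the coefficient of a redundant collinear feature tends to be driven to exactly zero.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
x5 = rng.exponential(size=500)
x6 = rng.exponential(size=500)
x7 = x5 + x6                                   # the governing equation x_5 + x_6 = x_7
X = np.log1p(np.column_stack([x5, x6, x7]))    # log(1 + x) transform, as in the question
logits = 1.5 * X[:, 0] - 1.0 * X[:, 1]
y = (rng.uniform(size=500) < 1 / (1 + np.exp(-logits))).astype(int)

# liblinear supports penalty='l1'; a sparse solution zeroes out redundant coefficients.
model = LogisticRegression(penalty='l1', solver='liblinear', C=1.0).fit(X, y)
print(model.coef_)

Whichever coefficient ends up at (or near) zero is a candidate to drop before refitting the plain model.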

loss function as min of several points, custom loss function and gradient

I am trying to predict the quality of a metal coil. The coil is 10 meters wide and from 1 to 6 kilometers long. As training data I have ~600 parameters measured every 10 meters, and a final quality-control mark, good/bad (for the whole coil). Bad means there is at least one place where the coil is bad; there is no data on where exactly. I have data for approximately 10000 coils.
Let's imagine we want to train a logistic regression on this data (with 2 factors).
X = [[0, 0],
...
[0, 0],
[1, 1], # coil is actually broken here, but we don't know it yet.
[0, 0],
...
[0, 0]]
Y = ?????
I can't just put all "bad" in Y and run classifier, because I will be confusing for classifier. I can't put all "good" and one "bad" in Y becuase I don't know where is the bad position.
The solution I have in mind is the following, I could define loss function as sum( (Y-min(F(x1,x2)))^2 ) (min calculated by all F belonging to one coil) not sum( (Y-F(x1,x2))^2 ). In this case probably I get F trained correctly to point bad place. I need gradient for that, it there is impossible to calculate it in all points, the min is not differentiable in all places, but I could use weak gradient instead(using values of functions which is minimal in coil in every place).
I more or less know how to implement it myself, the question is what is the simplest way to do it in python with scikit-learn. Ideally it should be same (or easily adaptable) with several learning method(a lot of methods based on loss function and gradient), is where possible to make some wrapper for learning methods which works this way?
update: looking at gradient_boosting.py - there is internal abstract class LossFunction with ability to calculate loss and gradient, looks perspective. Looks like there is no common solution.
What you are considering here is known in the machine learning community as superset learning, meaning that instead of the typical supervised setting where you have a training set of the form {(x_i, y_i)}, you have {({x_1, ..., x_N}, y_1)} such that you know that at least one element from the set has property y_1. This is not a very common setting, but it exists and there is some research available; google for papers in the domain.
In terms of your own loss functions, scikit-learn is a no-go. Scikit-learn is about simplicity: it provides you with a small set of ready-to-use tools with very little flexibility. It is not a research tool, and your problem is research-y. What can you use instead? I suggest you go for any automatic-differentiation solution, for example autograd, which gives you the ability to differentiate through python code; simply apply scipy.optimize.minimize on top of it and you are done! Any custom loss function will work just fine.
As a side note, the minimum operator is not differentiable, so the model might have a hard time figuring out what is going on. You could instead try sum((Y - prod_x F(x_1, x_2))^2), since multiplication is nicely differentiable, and you will still get a similar effect: if at least one element is predicted to be 0, it will remove any "1" answer from the remaining ones. You can even go one step further to make it more numerically stable and do:
if Y==0 then loss = sum_x log(F(x_1, x_2 ) )
if Y==1 then loss = sum_x log(1-F(x_1, x_2))
which translates to
Y * sum_x log(1-F(x_1, x_2)) + (1-Y) * sum_x log( F(x_1, x_2) )
You can notice the similarity with the cross-entropy cost, which makes perfect sense since your problem is indeed a classification one. And now you have a properly probabilistic loss: you are attaching probabilities of each segment being "bad" or "good", so the probability of the whole object being bad is either high (if Y==0) or low (if Y==1).
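A minimal sketch of the autograd + scipy.optimize.minimize route for a per-coil loss; the data shapes, the simple linear-logistic segment model, and the label convention (F(x) = probability that a segment is bad, y = 1 for a bad coil) are my own assumptions for illustration:

import autograd.numpy as np            # autograd's thin numpy wrapper enables automatic differentiation
from autograd import grad
from scipy.optimize import minimize

# Illustrative data: 100 coils, 50 segments each, 2 features per segment (all made up).
rng = np.random.RandomState(0)
n_coils, n_seg, n_feat = 100, 50, 2
X = rng.normal(size=(n_coils * n_seg, n_feat))
w_true = np.array([2.0, -1.0])
p_seg_bad = 1 / (1 + np.exp(-(np.dot(X, w_true) - 3.0)))
seg_bad = rng.uniform(size=n_coils * n_seg) < p_seg_bad
y = seg_bad.reshape(n_coils, n_seg).any(axis=1).astype(float)   # 1 = coil marked bad

def neg_log_likelihood(params):
    w, b = params[:n_feat], params[n_feat]
    f = 1 / (1 + np.exp(-(np.dot(X, w) + b)))                    # F(x): P(segment is bad)
    f = f.reshape(n_coils, n_seg)
    log_p_good = np.sum(np.log(1 - f + 1e-12), axis=1)           # log P(all segments good)
    p_coil_bad = 1 - np.exp(log_p_good)                          # P(coil bad) = 1 - prod(1 - F)
    return -np.sum(y * np.log(p_coil_bad + 1e-12) + (1 - y) * log_p_good)

params0 = np.zeros(n_feat + 1)
res = minimize(neg_log_likelihood, params0, jac=grad(neg_log_likelihood), method='L-BFGS-B')
print(res.x)                                                      # learned [w_1, w_2, b]

The same pattern works for any custom per-coil loss, including the min- or product-based variants above: write it as a plain numpy function of the parameters and let autograd supply the gradient.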

I have some image data and I am training my SVM classifier on it, and I see that the support vectors are all zero. What does it mean?

Here is the code that I was using.
import glob
import cv2
import numpy as np
from sklearn import svm

# im2double and rgb2gray are the asker's own helper functions (not shown here).

nfiles = 40
i = 0
y = [1] * 20
y.extend([-1] * 20)
traindata = np.zeros((40, 1024))
# Note: the original glob pattern had a trailing space after ".png", which would match nothing.
for im in glob.glob("/home/name/Desktop/database/20trainset/*.png"):
    img = cv2.imread(im)
    img = im2double(img)
    img = rgb2gray(img)
    traindata[i] = img.reshape((1, 1024))
    i += 1      # i keeps count of the number of files read
clf = svm.SVC(kernel='rbf', C=0.05)
clf.fit(traindata, y)
print(clf.support_vectors_)
The i is to keep count of the number of files.
Is there anything wrong?
This means that, with your parameters, the optimal SVM is to do nothing (to build a trivial model which simply answers with one label).
cl(x) = sign( SUM_i alpha_i y_i K(x_i, x) + b ) = sign( b ) = const.
So yes, there is something wrong, as your model does not use your training data at all.
What are the reasons? In order to use an RBF-SVM you need to fit two hyperparameters, gamma and C; in particular C=0.05 is (probably) orders of magnitude too small to work well. Also, remember to normalize your data, because images are often represented as pixel intensity values (0-255) while SVM works better for normally distributed values. Gamma also has to be carefully fitted (its default value in scikit-learn is 1/n_features, which is a really rough guess and rarely a good one).
As an example, refer to the libsvm website and the plot showing the relation of the C and gamma values to the accuracy obtained on the heart dataset.
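A minimal sketch of that advice, assuming the traindata matrix and y labels from the code above are already built: standardize the pixel values, then grid-search C and gamma before trusting the model (the grid values are just a starting point).

from sklearn import svm
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV

# traindata (40 x 1024 pixel matrix) and y (+1/-1 labels) come from the code above.
X = StandardScaler().fit_transform(traindata)    # zero mean, unit variance per pixel

param_grid = {
    "C": [0.1, 1, 10, 100, 1000],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1, 1],
}
search = GridSearchCV(svm.SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
print(search.best_params_, search.best_score_)
print(len(search.best_estimator_.support_vectors_))   # should now be a non-trivial set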
