Fitting a higher degree function using PolynomialFeatures and LinearRegression - python

In a book I have found the following code which fits a LinearRegression to quadratic data:
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)
poly_features = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly_features.fit_transform(X)
lin_reg = LinearRegression()
lin_reg.fit(X_poly, y)
But how could that be? I know from the documentation that PolynomialFeatures(degree=2, include_bias=False) is creating an array which looks like:
[[X[0],X[0]**2]
[X[1],X[1]**2]
.....
[X[n],X[n]**2]]
BUT: How is the LinearRegression able to fit this data? That is, WHAT is the LinearRegression doing, and what is the concept behind this?
I am grateful for any explanations!

PolynomialFeatures with degree two (and the default include_bias=True) will create an array that looks like:
[[1, X[0], X[0]**2]
[1, X[1], X[1]**2]
.....
[1, X[n], X[n]**2]]
With include_bias=False, as in your code, the column of ones is dropped and LinearRegression fits the intercept itself; the reasoning below is the same either way.
Let's call the matrix above X. Then LinearRegression is looking for 3 numbers a, b, c so that the vector
X * [[a], [b], [c]] - Y
has the smallest possible mean squared error (which is just the mean of the squared entries of the vector above).
Note that X * [[a], [b], [c]] is just the product of the matrix X with the column vector [a, b, c].T. The result is a vector of the same dimension as Y.
Regarding the questions in your comment:
This function is linear in the new set of features: x, x**2. Just think about x**2 as an additional feature in your model.
For the particular array mentioned in your question, the LinearRegression method is looking for numbers a,b,c that minimize the sum
(a*1 + b*X[0] + c*X[0]**2 - Y[0])**2 + (a*1 + b*X[1] + c*X[1]**2 - Y[1])**2 + ... + (a*1 + b*X[n] + c*X[n]**2 - Y[n])**2
So it will find a set of such numbers a,b,c. Hence the suggested function y=a+b*x+c*x**2 is not based only on the first row. Instead, it is based on all the rows, because the parameters a,b,c that are chosen are those that minimize the sum above, and this sum involves elements from all the rows.
Once you created the vector x**2, the linear regression just regards it as an additional feature. You can give it a new name v=x**2. Then the linear regression is of the form y=a+b*x+c*v, which means, it is linear in x and v. The algorithm does not care how you created v. It just treats v as an additional feature.
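To make that concrete, here is a minimal sketch (a toy check, not from the book) that fits the model from the question and prints the learned parameters, which should land roughly near the true values used to generate the data (intercept around 2, coefficients around 1 and 0.5):
import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

np.random.seed(0)
m = 100
X = 6 * np.random.rand(m, 1) - 3
y = 0.5 * X**2 + X + 2 + np.random.randn(m, 1)

# Feature matrix with columns [x, x**2]; LinearRegression adds the intercept a itself.
X_poly = PolynomialFeatures(degree=2, include_bias=False).fit_transform(X)
lin_reg = LinearRegression().fit(X_poly, y)

print(lin_reg.intercept_)  # a, roughly 2
print(lin_reg.coef_)       # [b, c], roughly [1, 0.5]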

Related

How do I perform polynomial features and linear regression on 3 dimensional data (x,y,z)?

I need help performing polynomial features on 3 dimensional data and performing linear regression to create a line of best fit on the 3 dimensional polynomial.
I have a random dataframe with x, y, and z as the columns that forms a polynomial scatterplot.
X and Y are similar values while z is vastly different.
Example:
X=(-3,9,-20,-8,-14)
Y=(-2,8,-19,-8,-13)
Z=(-960,110,4867,-149,1493)
I have done this for 2 dimensional data but not 3d.
poly=PolynomialFeatures(degree=2,include_bias=False)
X_poly=poly.fit_transform(X.reshape(-1,1))
X_poly[0]
However, how do I handle the data when I have x, y, and z? Do I need to perform poly.fit_transform on both x and y?
Next I did linear regression
from sklearn.linear_model import LinearRegression
LinReg = LinearRegression()
LinReg.fit(X_poly,z)
Then when I create test data for x and y and call the predict method to get z, the resulting line is linear instead of a polynomial.
Any help would be much appreciated.
I finally figured it out. I needed to pass a DataFrame containing only x and y through the polynomial features and then use the XY_poly and z in the linreg.fit(). This trains the model for my next steps to create the line of best fit for the polynomial.
When the PolynomialFeatures documentation says that the fit_transform() method takes X and y, X is an n-dimensional array of features and y holds the (optional) target values. In your case only x and y are features; z is the target, so it should not go through the polynomial expansion. I would do the following:
X=(-3,9,-20,-8,-14)
Y=(-2,8,-19,-8,-13)
Z=(-960,110,4867,-149,1493)
foo = np.array([X, Y])
foo = foo.transpose()  # transpose to bring the feature array to shape (n, 2)
poly = PolynomialFeatures(degree=2, include_bias=False)
foo_poly = poly.fit_transform(foo)
Once this is done, you can fit LinearRegression on foo_poly together with Z.
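For completeness, a minimal sketch of the remaining steps (continuing from foo_poly, poly and Z above; the new grid values are purely illustrative): fit the regression on the expanded features, and transform any new (x, y) pairs with the same fitted poly object before calling predict, so the predictions follow the polynomial surface instead of a straight line:
import numpy as np
from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression().fit(foo_poly, Z)

# New (x, y) pairs must go through the SAME fitted transformer before predict().
new_xy = np.column_stack([np.linspace(-20, 10, 5), np.linspace(-19, 9, 5)])
print(lin_reg.predict(poly.transform(new_xy)))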

Is there a way to get the values for the explanatory variables from a prediction using sklearn for multiple linear regression?

I'm trying to understand if there is a way, using sklearn.linear_model.LinearRegression(), to get the values of the explanatory variables given a value of the prediction.
For example, looking at the MPG of a car: I can build the model using multiple explanatory variables and then predict (successfully) the MPG for a given set of X. However, can I do the reverse and give Y and then get the predicted X values?
Sorry if not very clear!
When approximating some values Y associated to some points X using linear regression, what we are looking for is the linear function (f(x) = ax + k) nearest to the points in a least-squares sense. So you are not getting Y back; rather, you are getting the linear function that best approximates your input.
Can you do the opposite, that is, build a function that predicts X rather than Y? Consider the example in the sklearn.linear_model.LinearRegression documentation; we are going to tune it a bit.
>>> import numpy as np
>>> from sklearn.linear_model import LinearRegression
>>> X = np.array([[1, 1], [1, 2], [2, 2], [2, 3]])
>>> # y = 1 * x_0 + 2 * x_1 + 3
>>> y = np.dot(X, np.array([1, 2])) + 3
>>> # Swapping the roles of y and X (y must be reshaped to 2-D to serve as the feature matrix)
>>> reg = LinearRegression().fit(y.reshape(-1, 1), X)
>>> reg.score(y.reshape(-1, 1), X)
0.8269230769230769
>>> reg.coef_
array([[0.23076923],
[0.38461538]])
>>> reg.intercept_
array([-0.46153846, -1.26923077])
>>> reg.predict(np.array([[16.]]))
array([[3.23076923, 4.88461538]])
This would let you predict your explanatory variables from the MPG of your vehicles. Although this works, take into account that this approach will probably give you very bad results, as you are trying to approximate a cloud of points inside a multidimensional space with a line in that space. Take a look at your score before trying to predict anything with your model.

How to do least squares to find the relationship or the equation when X is a 2D array and Y is a 1D array (Python)

I have an X array of shape (40, 100) and a Y array containing 40 elements.
Is it possible to do OLS or WLS, and how?
After making the scatter plot, how do I apply least squares to find the relationship between X and Y, i.e. generate the equation relating X and Y?
Here I give the simple example.
X=[[0.0,0.03,0.04,0.0,0.1,0.1,0.7,0.5,0.3,0.6],
[0.0,0.0,0.4,0.5,0.1,0.1,0.03,0.04,0.0,0.1],
[0.6,0.7,0.0,0.8,0.1,0.1,0.1,0.1,0.7,0.5],
[0.3,0.6,0.1,0.5,0.6,0.1,0.4,0.5,0.1,0.1]]
Y=[1,4,2,5]
Whether or not OLS or WLS is appropriate is one question (e.g. linear dependence among features requires a different approach, or if your response (the Y variable) is discrete then you wouldn't use OLS but instead logistic regression or something else), but performing it in Python using your data is as follows:
import numpy as np
import numpy.linalg as la
X = np.array([[0.0,0.03,0.04,0.0,0.1,0.1,0.7,0.5,0.3,0.6],
[0.0,0.0,0.4,0.5,0.1,0.1,0.03,0.04,0.0,0.1],
[0.6,0.7,0.0,0.8,0.1,0.1,0.1,0.1,0.7,0.5],
[0.3,0.6,0.1,0.5,0.6,0.1,0.4,0.5,0.1,0.1]])
Y = np.array([1,4,2,5])
OLS = la.lstsq(X, Y, rcond=None)[0]
print(OLS)
[-0.60940892 0.19707325 3.94166269 4.06073677 2.76949291
0.90614714 0.92161768 1.5417828 -1.87887552 -0.63917305]
Note that this yields a perfect solution:
np.allclose(X.dot(OLS),Y)
True
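As for the WLS part of the question: a common formulation is to scale each row of X and each entry of Y by the square root of its observation weight and then solve the ordinary least-squares problem on the scaled data. A minimal sketch, reusing X, Y and la from above, with purely illustrative weights w:
w = np.array([1.0, 2.0, 1.0, 0.5])   # illustrative per-observation weights
sw = np.sqrt(w)
# Scale rows of X and entries of Y by sqrt(weight), then solve OLS on the scaled system.
WLS = la.lstsq(X * sw[:, None], Y * sw, rcond=None)[0]
print(WLS)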

Fitting a bias in a weight decay regression using least-squares

I'm calculating the weights for a linear regression with weight decay, i.e. I am trying to find beta = (X'X + lambda*I)^-1 X'Y, where X has n rows of D features each and Y is a vector of outputs for each row of X.
I've been fitting without a bias term by using:
def wd_fit(A, y, lamb=0):
    n_col = A.shape[1]
    return np.linalg.lstsq(A.T.dot(A) + lamb * np.identity(n_col), A.T.dot(y))
I'd like to also calculate a bias or intercept term for the fit, instead of having it pass through the origin. I'd like to keep the same call to lstsq, so if there's some matrix transform I can carry out, that would be ideal. My inclination is to append a column of 1s somewhere, so that X_mod, say, would then have D+1 features, where the last one relates to the intercept value, but I'm not quite sure where that column should go or even if this is correct.
If you don't want to mean-center your variables, adding a column of ones will work and is a perfectly acceptable solution.
The bias term will just be the coefficient at the position of the added column.
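A minimal sketch of that idea (illustrative names, not the original poster's code), appending the ones column as the last column so the last coefficient becomes the bias; whether to also exempt that bias from the weight-decay penalty is a separate choice, shown here as an option since ridge-style penalties conventionally leave the intercept unshrunk:
import numpy as np

def wd_fit_with_bias(A, y, lamb=0.0, penalize_bias=False):
    # Append a column of ones; the last coefficient then acts as the intercept.
    A_b = np.hstack([A, np.ones((A.shape[0], 1))])
    reg = lamb * np.identity(A_b.shape[1])
    if not penalize_bias:
        reg[-1, -1] = 0.0  # common convention: do not shrink the intercept
    beta = np.linalg.lstsq(A_b.T.dot(A_b) + reg, A_b.T.dot(y), rcond=None)[0]
    return beta  # beta[:-1] are the weights, beta[-1] is the bias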

Standard errors for multivariate regression coefficients

I've done a multivariate regression using sklearn.linear_model.LinearRegression and obtained the regression coefficients doing this:
import numpy as np
from sklearn import linear_model
clf = linear_model.LinearRegression()
TST = np.vstack([x1,x2,x3,x4])
TST = TST.transpose()
clf.fit(TST, y)
clf.coef_
Now, I need the standard errors for these same coefficients. How can I do that?
Thanks a lot.
Based on this stats question and wikipedia, my best guess is:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
However, my linear algebra and stats are both quite poor, so I could be missing something important. Another option might be to bootstrap the variance estimate.
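A minimal sketch of that bootstrap option (assuming TST and y are NumPy arrays as defined in the question; the number of resamples is arbitrary):
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n = TST.shape[0]
boot_coefs = []
for _ in range(1000):
    idx = rng.integers(0, n, size=n)  # resample rows with replacement
    boot_coefs.append(LinearRegression().fit(TST[idx], y[idx]).coef_)
# Standard error estimate = spread of the coefficients across resamples.
SE_boot = np.asarray(boot_coefs).std(axis=0)
print(SE_boot)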
Quoting the code from the answer above:
MSE = np.mean((y - clf.predict(TST).T)**2)
var_est = MSE * np.diag(np.linalg.pinv(np.dot(TST.T,TST)))
SE_est = np.sqrt(var_est)
I think this answer is not entirely correct. In particular, if I am not wrong, sklearn adds the constant term by default when computing your coefficients, so you need to include the column of ones in your matrix TST. With that change the code is correct and it will give you an array with all the standard errors.
This code has been tested with data; it is correct.
Find the X matrix for each data set, where n is the length of the dataset and m is the number of variables:
X, n, m = arrays(data)
y = ***.reshape((n, 1))
linear = linear_model.LinearRegression()
linear.fit(X, y, n_jobs=-1)  # delete n_jobs=-1 if it's one variable only
Sum of squares divided by the degrees of freedom:
s = np.sum((linear.predict(X) - y) ** 2) / (n - (m - 1) - 1)
Standard deviation, i.e. the square root of the diagonal of the variance-covariance matrix (the pseudo-inverse is computed via singular value decomposition):
sd_alpha = np.sqrt(s * (np.diag(np.linalg.pinv(np.dot(X.T, X)))))
t-statistic (use linear.intercept_ for one variable):
t_stat_alpha = linear.intercept_[0] / sd_alpha[0]
I found that the accepted answer had some mathematical glitches that in total would require edits beyond the recommended etiquette for modifying posts. So here is a solution to compute the standard error estimate for the coefficients obtained through the linear model (using an unbiased estimate as suggested here):
# preparation
X = np.concatenate((np.ones((TST.shape[0], 1)), TST), axis=1)
y_hat = clf.predict(TST).T
m, n = X.shape
# computation
MSE = np.sum((y_hat - y)**2) / (m - n)
coef_var_est = MSE * np.diag(np.linalg.pinv(np.dot(X.T, X)))
coef_SE_est = np.sqrt(coef_var_est)
Note that we have to add a column of ones to TST as the original post used the linear_model.LinearRegression in a way that will fit the intercept term. Furthermore, we need to compute the mean squared error (MSE) as in ANOVA. That is, we need to divide the sum of squared errors (SSE) by the degrees of freedom for the error, i.e., df_error = df_observations - df_features.
The resulting array coef_SE_est contains the standard error estimates of the intercept and all other coefficients in coef_SE_est[0] and coef_SE_est[1:] resp. To print them out you could use
print('intercept: coef={:.4f} / std_err={:.4f}'.format(clf.intercept_[0], coef_SE_est[0]))
for i, coef in enumerate(clf.coef_[0, :]):
    print('x{}: coef={:.4f} / std_err={:.4f}'.format(i + 1, coef, coef_SE_est[i + 1]))
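If you want to sanity-check these numbers against an established implementation, statsmodels reports the same classical standard errors directly. A minimal sketch, assuming TST and y as in the question:
import statsmodels.api as sm

# add_constant prepends the intercept column, matching the manual computation above.
ols_res = sm.OLS(y, sm.add_constant(TST)).fit()
print(ols_res.bse)      # standard errors: intercept first, then the coefficients
print(ols_res.summary())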
The example from the documentation shows how to get the mean square error and explained variance score:
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
regr.fit(diabetes_X_train, diabetes_y_train)
# The coefficients
print('Coefficients: \n', regr.coef_)
# The mean squared error
print("Mean squared error: %.2f"
      % np.mean((regr.predict(diabetes_X_test) - diabetes_y_test) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(diabetes_X_test, diabetes_y_test))
Does this cover what you need?
