Multivariate linear regression in python - python

I have a data set as follows arranged in X and Y matrices as follows:
I want to find a 2*2matrix A such that y_i=A x_i for all i=1,...,n. So I am using the following code for linear regression in python:
import numpy as np
import matplotlib.pyplot as plt
import sklearn
from sklearn import datasets, linear_model
#n=5
X=np.random.uniform(0,1,(2,5))
A=np.random.uniform(0,1,(2,2))
y=np.dot(A,X)
print(y)
# Create linear regression object
regr = linear_model.LinearRegression()
# Train the model using the training sets
model=regr.fit(X, y)
#model.predict(X)
model.coef_
However my model.coef_ command is printing a 5*5 matrix instead of a 2*2 matrix that I want for A. How do I achieve this?

Fit the model on a transpose of the samples and outcome - the second dimensions will be used to create the model - to get a 2x2 array:
model=regr.fit(X.T, y.T)
# test
np.testing.assert_allclose(y.T, model.predict(X.T))

Related

how python calculates predictions with linear regression?

I'm having trouble getting the formula that python use for linear predictions. I did a linear regression using:
from sklearn.linear_model import LinearRegression
lm = LinearRegression()
lm.fit(X_tr_pre_close,Y_tr_pre_close)
then I made predictions using:
predictions=lm.predict(X_te_pre_close)
I had great results with this model but now the problem is that I can't figure out how the lm.predict() formula works, the model should be ordinary least squares as I read in the documentation
in this case, the predictions formula supposes to be x'b (vector of coefficients * vector of explanatory variables) but it doesn't fit my results.
LinearRegression doesn't store the intercept as one of the coefficients, but as intercept_.
So you can reproduce the predict function like that:
# using sklearn
pred_sklearn = lm.predict(X_te_pre_close)
# using coefficients directly:
pred_coef = X_te_pre_close # lm.coef_.T + lm.intercept_
assert all(pred_coef == pred_sklearn)

SKLearn Linear Regression but setting certain coefficients before starting

I'd like to run a linear regression using SKLearn on a dataset with say 50 variables. However, I'd like to set the coefficients for say 2 of the variables before it starts training. Is that possible?
You are looking to provide an initial value or guess for the coefficients, and this is not possible for LinearRegression because it calls scipy.linalg.lstsq from scipy.
Not very sure what is the purpose for providing initial guess because for linear regression, you can fit the model, that is find the least square solution by using QR decomposition or SVD, there's no need to provide an initial guess or so.
If you want to try it for some purpose, I think you can try something like lsmr or curve_fit, but bear in mind, it's not really the commonly known linear regression from here:
from sklearn import datasets, linear_model
from scipy.optimize import curve_fit
from sklearn.preprocessing import StandardScaler
X, y = datasets.load_diabetes(return_X_y=True)
X = StandardScaler().fit_transform(X)
regr = linear_model.LinearRegression()
regr.fit(X,y)
regr.coef_
array([ -0.47623169, -11.40703082, 24.72625713, 15.42967916,
-37.68035801, 22.67648701, 4.80620008, 8.422084 ,
35.73471316, 3.21661161])
#lmsr
lsmr(X,y,x0 = np.repeat(2.0,X.shape[1]))
(array([ -0.4762317 , -11.40703083, 24.72625712, 15.42967915,
-37.68035803, 22.67648699, 4.8062001 , 8.42208398,
35.73471314, 3.21661159])
#non linear least square
def func(x,*params):
return x # params
coef_, cov_ = curve_fit(func,X,y,p0 = np.repeat(2,X.shape[1]))
coef_
array([ -0.47623371, -11.40702964, 24.72625986, 15.42967394,
-37.68022801, 22.67639202, 4.8061298 , 8.42205138,
35.73466837, 3.21661273])

Multiple Linear Regression. Coeffs don't match

So I have this small dataset and ı want to perform multiple linear regression on it.
first I drop the deliveries column for it's high correlation with miles. Although gasprice is supposed to be removed, I don't remove it so that I can perform multiple linear regression and not simple linear regression.
finally I removed the outliers and did the following:
Dataset
import math
import numpy as np
import pandas as pd
import seaborn as sns
from scipy import stats
import matplotlib.pyplot as plt
import statsmodels.api as sm
from statsmodels.stats import diagnostic as diag
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score, mean_absolute_error
from sklearn import linear_model
%matplotlib inline
X = dfafter
Y = dfafter[['hours']]
# Split X and y into X_
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
# create a Linear Regression model object
regression_model = LinearRegression()
# pass through the X_train & y_train data set
regression_model.fit(X_train, y_train)
y_predict = regression_model.predict(X_train)
#lets find out what are our coeffs of the multiple linear regression and olso find intercept
intercept = regression_model.intercept_[0]
coefficent = regression_model.coef_[0][0]
print("The intercept for our model is {}".format(intercept))
print('-'*100)
# loop through the dictionary and print the data
for coef in zip(X.columns, regression_model.coef_[0]):
print("The Coefficient for {} is {}".format(coef[0],coef[1]))
#Coeffs here don't match the ones that will appear later
#Rebuild the model using Statsmodel for easier analysis
X2 = sm.add_constant(X)
# create a OLS model
model = sm.OLS(Y, X2)
# fit the data
est = model.fit()
# calculate the mean squared error
odel_mse = mean_squared_error(y_train, y_predict)
# calculate the mean absolute error
model_mae = mean_absolute_error(y_train, y_predict)
# calulcate the root mean squared error
model_rmse = math.sqrt(model_mse)
# display the output
print("MSE {:.3}".format(model_mse))
print("MAE {:.3}".format(model_mae))
print("RMSE {:.3}".format(model_rmse))
print(est.summary())
#????????? something is wrong
X = df[['miles', 'gasprice']]
y = df['hours']
regr = linear_model.LinearRegression()
regr.fit(X, y)
print(regr.coef_)
So the code ends here. I found different coeffs every time I printed them out. what did I do wrong and is any of them correct?
I see you are trying 3 different things here, so let me summarize:
sklearn.linear_model.LinearRegression() with train_test_split(X, Y, test_size=0.2, random_state=1), so only using 80% of the data (but the split should be the same every time you run it since you fixed the random state)
statsmodels.api.OLS with the full dataset (you're passing X2 and Y, which are not cut up into train-test)
sklearn.linear_model.LinearRegression() with the full dataset, as in n2.
I tried to reproduce with the iris dataset, and I am getting identical results for cases #2 and #3 (which are trained on the same exact data), and only slightly different coefficients for case 1.
In order to evaluate if any of them are "correct", you will need to evaluate the model on unseen data and look at adjusted R^2 score, etc (hence you need the holdout (test) set). If you want to further improve the model you can try to understand better the interactions of the features in the linear model. Statsmodels has a neat "R-like" formula way to specify your model: https://www.statsmodels.org/dev/example_formulas.html

Expected 2D Array,got 1D array instead.Where's the mistake?

I am beginning to learn SVM and PCA.I tried to apply SVM on the Sci-Kit Learn 'load_digits' dataset.
When i apply the .fit method to SVC,i get an error:
"Expected 2D array, got 1D array instead:
array=[ 1.9142151 0.58897807 1.30203491 ... 1.02259477 1.07605691
-1.25769703].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature
or array.reshape(1, -1) if it contains a single sample."
Here is the code i wrote:**
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
from sklearn.svm import SVC
clf = SVC(kernel='rbf', C=1E6)
X=[reduced_data[:,0]
y=reduced_data[:,1]
clf.fit(X, y)
Can someone help me out?Thank you in advance.
Your error results from the fact that clf.fit() requires the array X to be of dimension 2 (currently it is 1 dimensional), and by using X.reshape(-1, 1), X becomes a (N,1) (2D - as we would like) array, as opposed to (N,) (1D), where N is the number of samples in the dataset. However, I also believe that your interpretation of reduced_data may be incorrect (from my limited experience of sklearn):
The reduced_data array that you have contains two principle components (the two most important features in the dataset, n_components=2), which you should be using as the new "data" (X).
Instead, you have taken the first column of reduced_data to be the samples X, and the second column to be the target values y. It is to my understanding that a better approach would be to make X = reduced_data since the sample data should consist of both PCA features, and make y = y_digits, since the labels (targets) are unchanged by PCA.
(I also noticed you defined pca = PCA(n_components=10).fit_transform(data), but did not go on to use it, so I have removed it from the code in my answer).
As a result, you would have something like this:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.preprocessing import scale
from sklearn.svm import SVC
X_digits, y_digits = load_digits(return_X_y=True)
data = scale(X_digits)
# pca=PCA(n_components=10).fit_transform(data)
reduced_data = PCA(n_components=2).fit_transform(data)
clf = SVC(kernel='rbf', C=1e6)
clf.fit(reduced_data, y_digits)
I hope this has helped!

Difference in PCA implementation between numpy only vs sklearn

from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets('data/MNIST/', one_hot=True)
numpy implementation
# Entire Data set
Data=np.array(mnist.train.images)
#centering the data
mu_D=np.mean(Data, axis=0)
Data-=mu_D
COV_MA = np.cov(Data, rowvar=False)
eigenvalues, eigenvec=scipy.linalg.eigh(COV_MA, eigvals_only=False)
together = zip(eigenvalues, eigenvec)
together = sorted(together, key=lambda t: t[0], reverse=True)
eigenvalues[:], eigenvec[:] = zip(*together)
n=3
pca_components=eigenvec[:,:n]
print(pca_components.shape)
data_reduced = Data.dot(pca_components)
print(data_reduced.shape)
data_original = np.dot(data_reduced, pca_components.T) # inverse_transform
print(data_original.shape)
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
sklearn implementation
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(Data)
data_reduced = np.dot(Data, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
I'd like to implement PCA algorithms by using numpy. However I don't know how to reconstruct the images from that and I don't even know if this code is correct.
Actually, when I used sklearn.decomposition.PCA, the result is different from the numpy implementation.
Can you explain the differences?
I can spot a few differences already.
For one:
n=300
projections = only_2.dot(eigenvec[:,:n])
Xhat = np.dot(projections, eigenvec[:,:n].T)
Xhat += mu_D
plt.imshow(Xhat[5].reshape(28,28),cmap='Greys',interpolation='nearest')
The point I'm trying to make is, if my understanding is correct n = 300, you are trying to fit 300 eigen vectors whose eigen values go from high to low.
But in sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(only_2)
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # invers
It seems to me you are fitting just the FIRST component (the component that maximizes variance) and you're not taking all 300.
Further more:
One thing I can clearly, say is that you seem to understand what's happening in PCA but you're having trouble implementing it. Correct me if I'm wrong but:
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
In this part, you are trying to PROJECT your eigenvectors to your data which is what you should go about doing in PCA, but in sklearn, what you should do is the following:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=300)
pca.fit_transform(only_2)
If you could tell me how you created only_2, I can give you a much more specific answer tomorrow.
Here is what sklearn says about fit_transform for PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform:
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
y : Ignored
Returns:
X_new : array-like, shape (n_samples, n_components)

Categories