I want to interpret the regression weights of a model whose input data has been pre-processed with PCA. In reality I have hundreds of input dimensions that are highly correlated, so I know that PCA is useful; for the sake of illustration, however, I will use the Iris dataset.
The sklearn code below illustrates my question:
import numpy as np
import sklearn.datasets, sklearn.decomposition
from sklearn.linear_model import LinearRegression
# load data
X = sklearn.datasets.load_iris().data
w = np.array([0.3, 10, -0.1, -0.01])
Y = np.dot(X, w)
# set number of components to keep from PCA
n_components = 4
# reconstruct w
reg = LinearRegression().fit(X, Y)
w_hat = reg.coef_
print(w_hat)
# apply PCA
pca = sklearn.decomposition.PCA(n_components=n_components)
pca.fit(X)
X_trans = pca.transform(X)
# reconstruct w
reg_trans = LinearRegression().fit(X_trans, Y)
w_trans_hat = np.dot(reg_trans.coef_, pca.components_)
print(w_trans_hat)
Running this code, one can see that the weights are reproduced fine.
However, if I set the number of components to 3 (i.e. n_components = 3), then the printed weights deviate substantially from the true ones.
Am I misunderstanding how I can transform back these weights? Or is it because of PCA's information loss moving from 4 to 3 components?
I think this was working fine; it's just that I was looking at w_trans_hat instead of the reconstructed Y:
import numpy as np
import sklearn.datasets, sklearn.decomposition
from sklearn.linear_model import LinearRegression
# load data
X = sklearn.datasets.load_iris().data
# create fake loadings
w = np.array([0.3, 10, -0.1, -0.01])
# centre X
X = np.subtract(X, np.mean(X, 0))
# calculate Y
Y = np.dot(X, w)
# set number of components to keep from PCA
n_components = 3
# reconstruct w using linear regression
reg = LinearRegression().fit(X, Y)
w_hat = reg.coef_
print(w_hat)
# apply PCA
pca = sklearn.decomposition.PCA(n_components=n_components)
pca.fit(X)
X_trans = pca.transform(X)
# regress Y on principal components
reg_trans = LinearRegression().fit(X_trans, Y)
# reconstruct Y using regressed weights and transformed X
Y_trans = np.dot(X_trans, reg_trans.coef_)
# show MSE to original Y
print(np.mean((Y - Y_trans) ** 2))
# show w implied by reduced model in original space
w_trans_hat = np.dot(reg_trans.coef_, pca.components_)
print(w_trans_hat)
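Incidentally, w_trans_hat has a clean interpretation when components are dropped: because Y is an exact linear function of the centred X, the coefficients of the reduced model mapped back to the original space are simply the projection of the true w onto the subspace spanned by the retained components. A quick check, reusing the variables defined in the block above (a sketch, not part of the original code):
# projection of the true w onto the span of the kept principal components;
# with Y an exact linear function of the centred X, this should match w_trans_hat
w_projected = np.dot(np.dot(pca.components_.T, pca.components_), w)
print(w_projected)
print(np.allclose(w_projected, w_trans_hat))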
My linear regression model has a negative coefficient of determination, R².
How can this happen? Any idea is helpful.
Here is my dataset:
year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
The code of the LinearRegression model is as follows:
import pandas as pd
from sklearn.linear_model import LinearRegression
data = pd.read_csv("data.csv", header=None)
data = data.drop(0, axis=0)
X = data[0]
Y = data[1]
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.1, shuffle=False)
lm = LinearRegression()
lm.fit(X_train.values.reshape(-1,1), Y_train.values.reshape(-1,1))
Y_pred = lm.predict(X_test.values.reshape(-1,1))
accuracy = lm.score(Y_test.values.reshape(-1,1),Y_pred)
print(accuracy)
Output:
-3592622948027972.5
Here is the formula of the R² score:
R^2 = 1 - \frac{\sum_i (y_i - \hat{y_i})^2}{\sum_i (y_i - \bar{y})^2}
\hat{y_i} is the prediction for the i-th observation y_i and \bar{y} is the mean of all observations.
Therefore, a negative R² means that if someone knew the mean of your y_test sample and always used it as a "prediction", this "prediction" would be more accurate than your model.
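To make this concrete, here is a minimal illustration with made-up numbers (not your dataset): always predicting the mean gives exactly R² = 0, and anything systematically worse than the mean gives a negative R².
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0])
# predicting the mean for every observation gives R^2 = 0
print(r2_score(y_true, np.full_like(y_true, y_true.mean())))   # 0.0
# a model that is systematically worse than the mean gives a negative R^2
print(r2_score(y_true, np.array([4.0, 3.0, 2.0, 1.0])))        # -3.0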
Moving on to your dataset (thanks to @Prayson W. Daniel for the convenient loading script), let us have a quick look at your data.
df.population.plot()
It looks like a logarithmic transformation could help.
import numpy as np
df_log = df.copy()
df_log.population = np.log(df.population)
df_log.population.plot()
Now let us perform a linear regression using OpenTURNS.
import openturns as ot
sam = ot.Sample(np.array(df_log)) # convert DataFrame to openturns Sample
sam.setDescription(['year', 'logarithm of the population'])
linreg = ot.LinearModelAlgorithm(sam[:, 0], sam[:, 1])
linreg.run()
linreg_result = linreg.getResult()
coeffs = linreg_result.getCoefficients()
print("Best fitting line = {} + year * {}".format(coeffs[0], coeffs[1]))
print("R2 score = {}".format(linreg_result.getRSquared()))
ot.VisualTest_DrawLinearModel(sam[:, 0], sam[:, 1], linreg_result)
Output:
Best fitting line = -38.35148311467912 + year * 0.028172928802559845
R2 score = 0.9966261033648469
This is an almost exact fit.
EDIT
As suggested by @Prayson W. Daniel, here is the model fit after it is transformed back to the original scale.
# Get the original data in openturns Sample format
orig_sam = ot.Sample(np.array(df))
orig_sam.setDescription(df.columns)
# Compute the prediction in the original scale
predicted = ot.Sample(orig_sam) # start by copying the original data
predicted[:, 1] = np.exp(linreg_result.getMetaModel()(predicted[:, 0])) # overwrite with the predicted values
error = np.array((predicted - orig_sam)[:, 1]) # compute error
r2 = 1.0 - (error**2).mean() / df.population.var() # compute the R2 score in the original scale
print("R2 score in original scale = {}".format(r2))
# Plot the model
graph = ot.Graph("Original scale", "year", "population", True, '')
curve = ot.Curve(predicted)
graph.add(curve)
points = ot.Cloud(orig_sam)
points.setColor('red')
graph.add(points)
graph
Output:
R2 score in original scale = 0.9979032805107133
Scikit-learn's LinearRegression score uses the R² score. A negative R² means that the model fits your data extremely badly. Since R² compares the fit of the model with that of the null hypothesis (a horizontal straight line), R² is negative when the model fits worse than a horizontal line.
R² = 1 - (SUM((y - ypred)**2) / SUM((y - AVG(y))**2))
So if SUM((y - ypred)**2) is greater than SUM((y - AVG(y))**2), then R² will be negative.
Reasons and ways to correct it:
Problem 1: You are performing a random split of time-series data. A random split ignores the temporal dimension.
Solution: Preserve the time flow (see code below).
Problem 2: The target values are very large.
Solution: Unless we use tree-based models, you would have to do some target feature engineering to scale the data into a range that models can learn.
Here is a code example. Using the default parameters of LinearRegression and a log/exp transformation of the target values, my attempt yields an R² score of ~87%:
import pandas as pd
import numpy as np
# we need to transform/feature engineer our target
# I will use log from numpy. The np.log and np.exp to make the value learnable
from sklearn.linear_model import LinearRegression
from sklearn.compose import TransformedTargetRegressor
# your data, df
# transform year to reference
df = df.assign(ref_year = lambda x: x.year - 1960)
df.population = df.population.astype(int)
split = int(df.shape[0] *.9) #split at 90%, 10%-ish
df = df[['ref_year', 'population']]
train_df = df.iloc[:split]
test_df = df.iloc[split:]
X_train = train_df[['ref_year']]
y_train = train_df.population
X_test = test_df[['ref_year']]
y_test = test_df.population
# regressor
regressor = LinearRegression()
lr = TransformedTargetRegressor(
regressor=regressor,
func=np.log, inverse_func=np.exp)
lr.fit(X_train,y_train)
print(lr.score(X_test,y_test))
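Note that TransformedTargetRegressor applies inverse_func automatically at prediction time, so the predictions from lr above are already back in the original population scale; a quick check:
# predictions are mapped back through np.exp by the wrapper itself
print(lr.predict(X_test))
print(y_test.values)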
For those interested in making it better, here is a way to read that dataset:
import pandas as pd
import io
df = pd.read_csv(io.StringIO('''year,population
1960,22151278.0
1961,22671191.0
1962,23221389.0
1963,23798430.0
1964,24397022.0
1965,25013626.0
1966,25641044.0
1967,26280132.0
1968,26944390.0
1969,27652709.0
1970,28415077.0
1971,29248643.0
1972,30140804.0
1973,31036662.0
1974,31861352.0
1975,32566854.0
1976,33128149.0
1977,33577242.0
1978,33993301.0
1979,34487799.0
1980,35141712.0
1981,35984528.0
1982,36995248.0
1983,38142674.0
1984,39374348.0
1985,40652141.0
1986,41965693.0
1987,43329231.0
1988,44757203.0
1989,46272299.0
1990,47887865.0
1991,49609969.0
1992,51423585.0
1993,53295566.0
1994,55180998.0
1995,57047908.0
1996,58883530.0
1997,60697443.0
1998,62507724.0
1999,64343013.0
2000,66224804.0
2001,68159423.0
2002,70142091.0
2003,72170584.0
2004,74239505.0
2005,76346311.0
2006,78489206.0
2007,80674348.0
2008,82916235.0
2009,85233913.0
2010,87639964.0
2011,90139927.0
2012,92726971.0
2013,95385785.0
2014,98094253.0
2015,100835458.0
2016,103603501.0
2017,106400024.0
2018,109224559.0
'''))
from tensorflow.examples.tutorials.mnist import input_data
mnist=input_data.read_data_sets('data/MNIST/', one_hot=True)
numpy implementation
import numpy as np
import scipy.linalg
import matplotlib.pyplot as plt
# Entire data set
Data = np.array(mnist.train.images)
#centering the data
mu_D=np.mean(Data, axis=0)
Data-=mu_D
COV_MA = np.cov(Data, rowvar=False)
eigenvalues, eigenvec=scipy.linalg.eigh(COV_MA, eigvals_only=False)
together = zip(eigenvalues, eigenvec)
together = sorted(together, key=lambda t: t[0], reverse=True)
eigenvalues[:], eigenvec[:] = zip(*together)
n=3
pca_components=eigenvec[:,:n]
print(pca_components.shape)
data_reduced = Data.dot(pca_components)
print(data_reduced.shape)
data_original = np.dot(data_reduced, pca_components.T) # inverse_transform
print(data_original.shape)
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
sklearn implementation
from sklearn.decomposition import PCA
pca = PCA(n_components=3)
pca.fit(Data)
data_reduced = np.dot(Data, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
plt.imshow(data_original[10].reshape(28,28),cmap='Greys',interpolation='nearest')
I'd like to implement the PCA algorithm using numpy. However, I don't know how to reconstruct the images from it, and I don't even know whether this code is correct.
Actually, when I used sklearn.decomposition.PCA, the result was different from my numpy implementation.
Can you explain the differences?
I can spot a few differences already.
For one:
n=300
projections = only_2.dot(eigenvec[:,:n])
Xhat = np.dot(projections, eigenvec[:,:n].T)
Xhat += mu_D
plt.imshow(Xhat[5].reshape(28,28),cmap='Greys',interpolation='nearest')
The point I'm trying to make is, if my understanding is correct, that with n = 300 you are fitting 300 eigenvectors whose eigenvalues go from high to low.
But in sklearn
from sklearn.decomposition import PCA
pca = PCA(n_components=1)
pca.fit(only_2)
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
It seems to me you are fitting just the FIRST component (the component that maximizes variance) and you're not taking all 300.
Furthermore:
One thing I can clearly say is that you seem to understand what's happening in PCA, but you're having trouble implementing it. Correct me if I'm wrong, but:
data_reduced = np.dot(only_2, pca.components_.T) # transform
data_original = np.dot(data_reduced, pca.components_) # inverse_transform
In this part, you are PROJECTING your data onto your eigenvectors, which is what you should be doing in PCA, but in sklearn what you should do is the following:
import numpy as np
from sklearn.decomposition import PCA
pca = PCA(n_components=300)
pca.fit_transform(only_2)
If you could tell me how you created only_2, I can give you a much more specific answer tomorrow.
Here is what sklearn says about fit_transform for PCA: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA.fit_transform:
fit_transform(X, y=None)
Fit the model with X and apply the dimensionality reduction on X.
Parameters:
X : array-like, shape (n_samples, n_features)
Training data, where n_samples is the number of samples and n_features is the number of features.
y : Ignored
Returns:
X_new : array-like, shape (n_samples, n_components)
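For reference, here is a small self-contained check on random data (not MNIST, so just a sketch) showing that the eigendecomposition route and sklearn's PCA agree when the data is centred, the eigenvector columns are sorted by decreasing eigenvalue, and the same number of components is kept:
import numpy as np
import scipy.linalg
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(50, 10)
Xc = X - X.mean(axis=0)                      # centre the data

# eigendecomposition route
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = scipy.linalg.eigh(cov)
order = np.argsort(eigvals)[::-1]            # sort columns by decreasing eigenvalue
components = eigvecs[:, order[:3]]           # top 3 eigenvectors as columns
manual_proj = Xc.dot(components)             # shape (50, 3)

# sklearn route
sk_proj = PCA(n_components=3).fit_transform(Xc)

# the projections agree up to the sign of each component
for k in range(3):
    print(np.allclose(np.abs(manual_proj[:, k]), np.abs(sk_proj[:, k])))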
I'm trying to implement a custom kernel, specifically the exponential chi-squared kernel, to pass as a parameter to sklearn's SVM, but when I run it the following error is raised:
ValueError: X.shape[0] should be equal to X.shape[1]
I read about the broadcasting performed by numpy's functions to speed up the computation, but I can't resolve the error.
The code is:
import numpy as np
from sklearn import svm, datasets
# import the iris dataset (http://en.wikipedia.org/wiki/Iris_flower_data_set)
iris = datasets.load_iris()
train_features = iris.data[:, :2] # Here we only use the first two features.
train_labels = iris.target
def my_kernel(x, y):
    gamma = 1
    return np.exp(-gamma * np.divide((x - y) ** 2, x + y))
classifier = svm.SVC(kernel=my_kernel)
classifier = classifier.fit(train_features, train_labels)
print "Train Accuracy : " + str(classifier.score(train_features, train_labels))
Any help?
I believe the chi-squared kernel is already implemented for you (from sklearn.metrics.pairwise import chi2_kernel). Like so:
from functools import partial
from sklearn import svm, datasets
from sklearn.metrics.pairwise import chi2_kernel
# import the iris dataset (http://en.wikipedia.org/wiki/Iris_flower_data_set)
iris = datasets.load_iris()
train_features = iris.data[:, :2] # Here we only use the first two features.
train_labels = iris.target
my_chi2_kernel = partial(chi2_kernel, gamma=1)
classifier = svm.SVC(kernel=my_chi2_kernel)
classifier = classifier.fit(train_features, train_labels)
print("Train Accuracy : " + str(classifier.score(train_features, train_labels)))
====================
EDIT:
So it turns out the question is really about how one can implement the chi-squared kernel. My shot at this would be:
def my_chi2_kernel(X, Y):
    gamma = 1
    # pairwise (x_i - y_j)**2 and (x_i + y_j), shape (n_X, n_Y, n_features)
    nom = np.power(X[:, np.newaxis] - Y, 2)
    denom = X[:, np.newaxis] + Y
    # NOTE: We need to fix some entries, since division by 0 is an issue here.
    # So we take all the indices of the would-be 0 denominators and fix them.
    zero_denom_idx = denom == 0
    nom[zero_denom_idx] = 0
    denom[zero_denom_idx] = 1
    # sum over the feature dimension to get the (n_X, n_Y) Gram matrix
    return np.exp(-gamma * np.sum(nom / denom, axis=2))
So in essence, x - y and x + y in the OP's attempt are wrong, since they are not pairwise subtraction or addition.
Curiously, the custom version seems to be faster than sklearn's cythonised version (at least for small datasets?).
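Reusing train_features and train_labels from above, a hypothetical way to plug this into the SVC setup would be:
# the two-argument version can be passed directly as a custom kernel callable
classifier = svm.SVC(kernel=my_chi2_kernel)
classifier.fit(train_features, train_labels)
print("Train Accuracy : " + str(classifier.score(train_features, train_labels)))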
Please take a look at the two code snippets below. They show two setups for the same NN system: the first (with results) uses data that is not scaled, while the second shows the results when the data is scaled. I am worried because the dataset is small and categorical, and I cannot find a scaling approach that works. Now imagine the features and the label had continuous values; the result would be even worse. Is there something I can do to improve the results of the scaled version?
Setup for the NN in Python, with no scaler:
import numpy as np
X = np.array([[1,0,0], [1,1,0], [0,0,1]])
y = np.array([[0,1,0]]).T
def relu(x):
    return np.maximum(x, 0, x)  # relu activation
def relu_d(x):  # derivative of relu
    x[x < 0] = 0
    return x
np.random.seed(0)
w0 = np.random.normal(size=(3,5), scale=0.1)
w1 = np.random.normal(size=(5,1), scale=0.1)
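The training loop itself is not shown above; a minimal hypothetical loop for this 3-5-1 ReLU network, trained with plain batch gradient descent on the mean squared error (a sketch, not necessarily the exact code that produced the results below), could look like this:
# hypothetical training loop: batch gradient descent on the MSE
lr = 0.01
for epoch in range(400001):
    # forward pass
    h = relu(np.dot(X, w0))        # hidden activations, shape (3, 5)
    y_hat = np.dot(h, w1)          # network output, shape (3, 1)
    err = y_hat - y
    if epoch % 100000 == 0:
        print("epoch nr:{} results in mean square error: {}".format(epoch, np.mean(err ** 2)))
    # backward pass
    grad_out = 2 * err / len(X)
    grad_w1 = np.dot(h.T, grad_out)
    grad_h = np.dot(grad_out, w1.T)
    grad_w0 = np.dot(X.T, grad_h * (h > 0))   # (h > 0) is the ReLU derivative mask
    # gradient descent update
    w0 -= lr * grad_w0
    w1 -= lr * grad_w1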
Result:
epoch nr:0 results in mean square error: 0.572624041985418
epoch nr:100000 results in mean square error: 0.1883460901967186
epoch nr:200000 results in mean square error: 0.08173913195938957
epoch nr:300000 results in mean square error: 0.04658778224325014
epoch nr:400000 results in mean square error: 0.03058257621363338
Scaled data code:
import numpy as np
X = np.array([[1,0,0], [1,1,0], [0,0,1]])
y = np.array([[0,1,0]]).T
from sklearn.preprocessing import StandardScaler
sx = StandardScaler()
X = sx.fit_transform(X)
sy = StandardScaler()
y = sy.fit_transform(y)
def relu(x):
    return np.maximum(x, 0, x)
def relu_d(x):
    x[x < 0] = 0
    return x
np.random.seed(0)
w0 = np.random.normal(size=(3,5), scale=0.1)
w1 = np.random.normal(size=(5,1), scale=0.1)
Result:
epoch nr:0 results in mean square error: 1.0039400468232
epoch nr:100000 results in mean square error: 0.5778610517002227
epoch nr:200000 results in mean square error: 0.5773502691896257
epoch nr:300000 results in mean square error: 0.5773502691896257
epoch nr:400000 results in mean square error: 0.5773502691896257
In general, scaling is applied to the features. Here you apply it to the targets as well.
Try to remove:
sy = StandardScaler()
y = sy.fit_transform(y)
and use the raw y = np.array([[0,1,0]]).T and see what happens.
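To see what that scaling actually does to the binary target, a quick check:
import numpy as np
from sklearn.preprocessing import StandardScaler

y = np.array([[0, 1, 0]]).T
print(StandardScaler().fit_transform(y).ravel())
# [-0.70710678  1.41421356 -0.70710678]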
EDIT 1
You can try to binarize the labels using, for example, LabelBinarizer.
If you have y values like 80, 140, 180, ... you could use this to binarize them; then, after scaling the X features, you can train the NN. See the small example below.
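For example, with the values mentioned above:
from sklearn.preprocessing import LabelBinarizer

lb = LabelBinarizer()
print(lb.fit_transform([80, 140, 180]))
# [[1 0 0]
#  [0 1 0]
#  [0 0 1]]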
EDIT 2
Simple example using multi-layer perceptron regressor and without scaling:
from sklearn.neural_network import MLPRegressor
import numpy as np
X = np.array([[0,100,200], [1,22,44], [0,40,50] ])
y = np.array([200, 60, 20])
nn= MLPRegressor()
nn.fit(X,y)
X_new = np.array([[21,10,22]])
y_pred = nn.predict(X_new)
print(y_pred)
Result:
[ 29.28949475]
P.S.: You can normalize/scale the data and use the same approach, but this time using X_scaled (and y_scaled if applicable). See below.
EDIT 3
Same but using scaling
from sklearn.neural_network import MLPRegressor
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[0,100,200], [1,22,44], [0,40,50] ])
y = np.array([200, 60, 20])
nn = MLPRegressor()
sc_x = StandardScaler()
X_scaled = sc_x.fit_transform(X)
sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y.reshape(-1, 1)).ravel()  # StandardScaler expects a 2D array
nn.fit(X_scaled, y_scaled)
X_new = np.array([[21,10,22]])
X_new_scaled = sc_x.transform(X_new)  # reuse the scaler fitted on the training data
y_pred = nn.predict(X_new_scaled)     # prediction is in the scaled target space
print(y_pred)
Result:
[ 10.03179535]
EDIT 4
If you want to binarize the values, you can use the following:
Replace
sc_y = StandardScaler()
y_scaled = sc_y.fit_transform(y)
With
from sklearn.preprocessing import LabelBinarizer
sc_y = LabelBinarizer()
y_scaled = sc_y.fit_transform(y)
Important:
If you use LabelBinarizer, then y = np.array([200, 60, 20]) will become y_scaled:
[[0 0 1]
[0 1 0]
[1 0 0]]
Without any information about the architecture and the parameters, it is very difficult to pinpoint the problem.
But in general you don't need to scale binary variables. Scaling is used so that all features have similar bounds. You already have them.
I want to do a linear regression on survey data with survey weights.
The survey data is from the EU and each observation has a weight (0.4 for one respondent, 1.5 for another).
This weight is described as:
"The European Weight, variable 6, produces a representative sample of
the European Community as a whole when used in analysis. This variable
adjusts the size of each national sample according to each nation's
contribution to the population of the European Community."
To do my calculation I'm using sklearn.
from sklearn import linear_model
regr = linear_model.LinearRegression()
regr.fit(X,y, sample_weight = weights)
X is a pandas DataFrame. y is a numpy.ndarray. weights is a pandas Series.
Am I using sample_weight correctly, and is this the correct way to handle survey weights in scikit-learn?
TL;DR: Yes.
Here is a very simple example of it working:
import numpy as np
import matplotlib.pylab as plt
from sklearn import linear_model
regr = linear_model.LinearRegression()
X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60]).reshape(-1, 1)
weights = np.array([1, 1, 1])
def weighted_lr(X, y, weights):
    """Quick function to run weighted linear regression and return a
    plot and some predictions"""
    regr.fit(X, y, sample_weight=weights)
    y_pred = regr.predict(X)
    plt.scatter(X, y)
    plt.plot(X, y_pred)
    plt.title('Weights: %s' % ', '.join(str(i) for i in weights))
    plt.show()
    return y_pred
y_pred = weighted_lr(X, y, weights)
print(y_pred)
weights = np.array([1000, 1000, 1])
y_pred = weighted_lr(X, y, weights)
print(y_pred)
[[ 7.14285714]
[ 24.28571429]
[ 58.57142857]]
[[ 9.96051333]
[ 20.05923001]
[ 40.25666338]]
On the first linear regression model, with even weights, we see the model behave as expected from a normal linear regression model.
The second model, however, with a low weight on the last value, almost ignores that value; the majority of the training weight has been placed on the other two observations.
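For completeness, sample_weight turns the fit into weighted least squares, i.e. it minimizes the weighted sum of squared residuals SUM(w * (y - ypred)**2). A small sketch of one consequence: an integer weight of k on an observation gives the same coefficients as repeating that observation k times.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([1, 2, 4]).reshape(-1, 1)
y = np.array([10, 20, 60])

# an integer weight of 3 on the last observation ...
coef_weighted = LinearRegression().fit(X, y, sample_weight=[1, 1, 3]).coef_

# ... gives the same fit as repeating that observation 3 times
X_rep = np.vstack([X, X[[2, 2]]])
y_rep = np.concatenate([y, y[[2, 2]]])
coef_repeated = LinearRegression().fit(X_rep, y_rep).coef_

print(coef_weighted, coef_repeated)   # identical up to numerical precision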