Pandas/sklearn: Vectorize large number of LinearRegression calculations - python

I have a Pandas DataFrame for which I need to calculate a large number of regression coefficients. Each regression is a simple one-variable fit. The independent variable is always the ['Base'] column; the dependent variables are stored in the other columns of the DataFrame.
This is easy to accomplish with a for loop but in my real life DataFrame I have thousands of columns on which to run the regression, so it takes forever. Is there a vectorized way to accomplish this?
Below is a MRE:
import pandas as pd
import numpy as np
from sklearn import linear_model
import time
df_data = {
'Base':np.random.randint(1, 100, 1000),
'Adder':np.random.randint(-3, 3, 1000)}
df = pd.DataFrame(data=df_data)
result_df = pd.DataFrame()
df['Thing1'] = df['Base'] * 3 + df['Adder']
df['Thing2'] = df['Base'] * 6 + df['Adder']
df['Thing3'] = df['Base'] * 12 + df['Adder']
df['Thing4'] = df['Base'] * 4 + df['Adder']
df['Thing5'] = df['Base'] * 2.67 + df['Adder']
things = ['Thing1', 'Thing2', 'Thing3', 'Thing4', 'Thing5']
for t in things:
    reg = linear_model.LinearRegression()
    X, y = df['Base'].values.reshape(-1,1), df[t].values.reshape(-1,1)
    reg.fit(X, y)
    b = reg.coef_[0][0]
    result_df.loc[t, 'Beta'] = b
print(result_df.to_string())

You can use np.polyfit for linear regression:
pd.DataFrame(np.polyfit(df['Base'], df.filter(like='Thing'), deg=1)).T
Output:
0 1
0 3.002379 -0.714256
1 6.002379 -0.714256
2 12.002379 -0.714256
3 4.002379 -0.714256
4 2.672379 -0.714256
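For a single shared predictor, the same numbers can also be obtained without polyfit. The following is a minimal sketch, assuming synthetic data shaped like the MRE; the names base, Y, slopes and intercepts are illustrative and not from the original post. It uses the closed-form slope cov(x, y) / var(x) for all columns at once and checks the result against np.polyfit:

```python
import numpy as np

# Synthetic data in the spirit of the MRE (values are illustrative).
rng = np.random.default_rng(0)
base = rng.integers(1, 100, size=1000).astype(float)
adder = rng.integers(-3, 3, size=1000)
# One column per "Thing"; each is a multiple of base plus the same noise.
Y = np.column_stack([base * k + adder for k in (3, 6, 12, 4, 2.67)])

# Closed-form simple regression, vectorized over all columns of Y.
xc = base - base.mean()                       # centered predictor
slopes = xc @ (Y - Y.mean(axis=0)) / (xc @ xc)
intercepts = Y.mean(axis=0) - slopes * base.mean()

# Reference: np.polyfit accepts a 2-D y and fits each column.
ref = np.polyfit(base, Y, deg=1)              # shape (2, n_columns)
print(np.allclose(slopes, ref[0]), np.allclose(intercepts, ref[1]))
```

Both approaches avoid the Python-level loop; the closed form only works because every regression shares the same single predictor.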

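@Quang-Hoang's idea of using df.filter solves the problem. If you really want to use sklearn, this also works, because LinearRegression accepts a 2-D y and fits every column in a single call:
from sklearn.metrics import r2_score
reg = linear_model.LinearRegression()
X = df['Base'].values.reshape(-1,1)
y = df.filter(items=things).values
reg.fit(X, y)
# coef_ has shape (n_targets, 1); take the single feature column
result_df['Betas'] = reg.coef_[:, 0]
y_predict = reg.predict(X)
# multioutput='raw_values' returns one R^2 per column instead of their average
result_df['Rsq'] = r2_score(y, y_predict, multioutput='raw_values')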

Related

linear regression: my plotting doesn't show the line

I am implementing a linear regression model from scratch, that is, without using the sklearn package.
Everything was working fine until I tried plotting the result.
My fit line isn't showing.
I looked at a bunch of solutions, but none of them applied to my problem.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
data = pd.read_csv(r'C:\Salary.csv')
x=data['Salary']
y=data['YearsExperience']
#y= mx+b
m = 0
b = 0
Learning_Rate = .01
epochs = 5000
n = float(x.shape[0])
error = []
for i in range(epochs):
    Y_hat = m*x+b
    #error
    mse= (1/n)*np.sum((y-Y_hat)**2)
    error.append(mse)
    #gradient descent
    db = (-2/n) * np.sum(x*(y-Y_hat))
    dm = (-2/n) * np.sum((y-Y_hat))
    m = m - Learning_Rate * dm
    b = b - Learning_Rate * db
#tracing x and y line
x_line = np.linspace(0, 15, 100)
y_line = (m*x_line)+ b
#ploting result
plt.figure(figsize=(8,6))
plt.title('LR result')
plt.plot(x_line,y_line) #the problem is apparently here
# i just don't know what to do
plt.scatter(x,y)
plt.show()
Apart from that, there is no problem with the code.
Your code has multiple problems:
1. You are plotting the line from 0 to 15, while the x data range from about 40000 to 140000. Even if the line is computed correctly, it is drawn in a region far away from your data.
2. In the loop, the expressions for dm and db are swapped. The corrected expressions are:
dm = (-2/n)*np.sum(x*(y - Y_hat))
db = (-2/n)*np.sum((y - Y_hat))
3. Your x and y data are on very different scales: x is of magnitude ~10⁴, while y is of magnitude ~10¹. As a result, m and b will also differ by several orders of magnitude. This is why you should use two different learning rates for the two quantities you are optimizing: Learning_Rate_m for m and Learning_Rate_b for b.
4. Finally, gradient descent is affected by the initial guess. The MSE of a linear model is convex, so there are no spurious local minima here, but with badly scaled data a poor starting point can make convergence very slow. For this reason, you should start from initial guesses for m and b that are close to their estimated values:
m = 0
b = -2
Complete Code
import numpy as np
import matplotlib.pyplot as plt
N = 40
np.random.seed(42)
x = np.random.randint(low = 38000, high = 145000, size = N)
y = (13 - 1)/(140000 - 40000)*(x - 40000) + 1 + 0.5*np.random.randn(N)
# initial guess
m = 0
b = -2
Learning_Rate_m = 1e-10
Learning_Rate_b = 1e-2
epochs = 5000
n = float(x.shape[0])
error = []
for i in range(epochs):
    Y_hat = m*x + b
    mse = 1/n*np.sum((y - Y_hat)**2)
    error.append(mse)
    dm = -2/n*np.sum(x*(y - Y_hat))
    db = -2/n*np.sum((y - Y_hat))
    m = m - Learning_Rate_m*dm
    b = b - Learning_Rate_b*db
x_line = np.linspace(x.min(), x.max(), 100)
y_line = (m*x_line) + b
plt.figure(figsize=(8,6))
plt.title('LR result')
plt.plot(x_line,y_line, 'red')
plt.scatter(x,y)
plt.show()
The problem is not in the plotting call itself but in the values passed to plt.plot(x_line, y_line). I tested your code and found that y_line contains only NaN values, so double-check the calculations of y_line, m, and dm.
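The NaN values mentioned above can be reproduced without the original Salary.csv. This is a rough sketch on synthetic data (the data, the helper run_gd, and the specific learning rates are assumptions for illustration; lr_m = 5e-11 is deliberately smaller than the 1e-10 used in the other answer so that the update stays stable for any sample in this x range). With a single learning rate of 0.01 and x around 10⁵, the slope update overshoots by many orders of magnitude per step, overflows to infinity, and then turns into NaN:

```python
import numpy as np

# Synthetic stand-in for the salary data (assumption: x in the 40000..140000 range).
rng = np.random.default_rng(0)
x = rng.uniform(40000, 140000, size=30)
y = 1e-4 * x + rng.normal(0, 0.5, size=30)

def run_gd(x, y, lr_m, lr_b, epochs=5000):
    """Plain gradient descent on MSE with separate learning rates (hypothetical helper)."""
    m, b = 0.0, 0.0
    n = float(len(x))
    # Silence overflow warnings so the divergence shows up as inf/NaN values.
    with np.errstate(over='ignore', invalid='ignore'):
        for _ in range(epochs):
            resid = y - (m * x + b)
            dm = (-2 / n) * np.sum(x * resid)
            db = (-2 / n) * np.sum(resid)
            m -= lr_m * dm
            b -= lr_b * db
    return m, b

m_bad, b_bad = run_gd(x, y, lr_m=0.01, lr_b=0.01)    # one shared rate: diverges
m_ok, b_ok = run_gd(x, y, lr_m=5e-11, lr_b=1e-2)     # scaled rates: stays finite

print(np.isfinite(m_bad), np.isfinite(m_ok))
```

A quick np.isfinite check on m and b before plotting is a cheap way to catch this kind of divergence.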

Linear regression implementation from scratch

I'm trying to understand the gradient descent algorithm.
Can someone please explain why I'm getting high MSE values using the following code, or if I missed some concept can you please clarify?
import numpy as np
import pandas as pd
my_data = pd.DataFrame({'x': np.arange(0,100),
'y': np.arange(0,100)})
X = my_data.iloc[:,0:1].values
y = my_data.iloc[:,1].values
def gradientDescent(X, y, lr = 0.001, n = 1000):
    n_samples, n_features = X.shape
    cost = []
    weight = np.zeros([n_features])
    b = 0
    for _ in range(n):
        # predict
        y_hat = np.dot(X, weight) + b # y = ax + b
        residual = y - y_hat
        db = -(2/n_samples) * np.sum(residual)
        dw = -(2/n_samples) * np.sum(X.T * residual, axis = 1)
        # update weights
        weight -= (lr * dw)
        b -= (lr * db)
        cost.append(((y-y_hat) **2).mean())
    return weight, b, cost
gradientDescent(X,y)
Not an expert, but I think your updates are diverging, much like the exploding gradient problem. If you step through your code you will notice that the weight swings between positive and negative values in ever larger steps. Your x and y range up to 100, so the gradient of the MSE is large enough that each step with lr = 0.001 overshoots the minimum; the weight jumps back and forth without converging, and the cost blows up.
If you want to use MSE with your current x and y values, you should normalize your data. You can do this by subtracting the mean and dividing by the standard deviation, or simply by scaling both x and y to a maximum of 1.
For example (do this before extracting X and y from my_data):
my_data['x'] = my_data['x'] / my_data['x'].max()
my_data['y'] = my_data['y'] / my_data['y'].max()
If you do this you should see your cost converge to ~0 with enough iterations.
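The effect of scaling can be checked directly. The sketch below is an assumption-laden illustration rather than the asker's exact code: gradient_descent is a lightly restructured copy of the function from the question, and the learning rate of 0.1 with 5000 iterations for the scaled run is my choice so that the run finishes quickly.

```python
import numpy as np

def gradient_descent(X, y, lr=0.001, n_iter=1000):
    """Batch gradient descent on MSE for a linear model y = X @ weight + b."""
    n_samples, n_features = X.shape
    weight = np.zeros(n_features)
    b = 0.0
    cost = []
    for _ in range(n_iter):
        residual = y - (X @ weight + b)
        dw = -(2 / n_samples) * (X.T @ residual)
        db = -(2 / n_samples) * residual.sum()
        weight -= lr * dw
        b -= lr * db
        cost.append((residual ** 2).mean())
    return weight, b, cost

x = np.arange(100, dtype=float)
y = np.arange(100, dtype=float)

# Raw data: the default step is too large for features of this size, so the cost grows.
with np.errstate(over='ignore', invalid='ignore'):
    _, _, cost_raw = gradient_descent(x.reshape(-1, 1), y)

# Scaled to [0, 1]: the same algorithm converges (lr and n_iter are illustrative).
X_scaled = (x / x.max()).reshape(-1, 1)
y_scaled = y / y.max()
w, b, cost_scaled = gradient_descent(X_scaled, y_scaled, lr=0.1, n_iter=5000)

print(cost_raw[10] > cost_raw[0], cost_scaled[-1])
```

Note that with the original lr = 0.001 the scaled problem still converges, just much more slowly, which is why "enough iterations" matters.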

Multiple Regression Python

I really don't understand what's wrong with my (simple) code...
I just want to test a multiple linear regression.
import pandas as pd
import numpy as np
import scipy.stats as st
import sklearn.linear_model
n = 1000
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size = n, loc = 0, scale = 0.6)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression()
#Train:
my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
And I get some stupid results: the coefficients are [ 0.25127347 0.26673645 0.65717676] ...
I also tried the OLS way, but I still get nonsense coefficients (slightly different, but still wrong).
(It works with a one-variable linear regression: for something like Y = 2*X + 5, I get a coefficient and intercept really close to the real ones.)
Thanks all!
I hadn't performed a linear regression for a while, and of course it's because X'X is not invertible (in R, it gives me 'nan').
So it wasn't a smart question...
Thanks again!
The fact that the coefficients do not at all resemble the "true" ones that you have set indicates that multicollinearity might be a problem. The issue with your code is that your X matrix is near-singular, which makes the numerical results unstable. As can be seen in @R.yan's graphs, X1, X2 and X3 are all straight lines over the same evenly spaced grid, so each one is just a shifted and rescaled copy of the others. This is corroborated by the fact that your X matrix, which has 1000 rows and three columns, only has a rank of 2. See:
np.linalg.matrix_rank(X)
Out[26]: 2
Try the following instead:
import pandas as pd
import numpy as np
import scipy.stats as st
import sklearn
from sklearn.linear_model import LinearRegression
n = 1000
# adding noise to your data:
X1 = np.linspace(2, 8.5, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X2 = np.linspace(-4, 2.9, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X3 = np.linspace(-1, 16, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size=1000 ,loc = 0, scale = 1)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression(fit_intercept = True)
#Train:
res = my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
Coefficients:
[[ 1.99273588 3.20068392 -1.19688422]]
Constant:
[ 4.02296003]
Now, we get the right coefficients, and a matrix of full rank:
np.linalg.matrix_rank(X)
Out[32]: 3
Note that in linear regression, X needs to have a rank equal to the number of columns (or rows, if that is less). If it does not, there is multicollinearity, which makes the numerical results for the inverse of X'X unstable (depending on which algorithm is used). See this description for more information on multicollinearity.
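The rank argument can be checked with NumPy alone. The following is a small sketch under the same setup as the question (the random generator, its seed, and the names X_noisy, rank_exact and rank_noisy are my own illustrative choices): the noiseless columns are all affine functions of one grid, so they span only two dimensions, while independent noise on each column restores full column rank.

```python
import numpy as np

n = 1000
# Same construction as in the question: three straight lines over one grid.
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.column_stack([X1, X2, X3])

# Every column is a_i + b_i * t for the same t, so the columns span {1, t} only.
rank_exact = np.linalg.matrix_rank(X)

# Independent noise per column (illustrative seed) breaks the exact dependence.
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(size=X.shape)
rank_noisy = np.linalg.matrix_rank(X_noisy)

print(rank_exact, rank_noisy)
```

Checking np.linalg.matrix_rank on the design matrix before fitting is a quick way to spot exact collinearity like this.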
I guess the code gives you the correct answer. I plotted the predicted Y based on the coef_ and intercept_ from your regression and got the following graph.
import pandas as pd
import numpy as np
import scipy.stats as st
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
n = 1000
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size=1000 ,loc = 0, scale = 0.6)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = LinearRegression()
plt.plot(Y, color='blue', label='Y')
#Train:
res = my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
plt.scatter(X.index.values,X['X1'], c='black')
plt.scatter(X.index.values,X['X2'], c='black')
plt.scatter(X.index.values,X['X3'], c='black')
Y_pred = my_reg.coef_[0][0]*X['X1'] + my_reg.coef_[0][1]*X['X2'] +my_reg.coef_[0][2]*X['X3'] + my_reg.intercept_
plt.plot(Y_pred, color="red", label='predict')
plt.legend()
Out[]: ('Coefficients: \n', array([[ 3.13842691e+12, 1.01316187e+13, -5.31223199e+12]]))
('Constant: \n', array([ 2.89373889e+13]))

How to use least squares with weight matrix?

I know how to solve A.X = B by least squares using Python:
Example:
import numpy
A=[[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]]
B=[1,1,1,1,1]
X=numpy.linalg.lstsq(A, B)
print(X[0])
# [ 5.00000000e-01 5.00000000e-01 -1.66533454e-16 -1.11022302e-16]
But what about solving this same equation with a weight matrix not being Identity:
A.X = B (W)
Example:
A=[[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]]
B=[1,1,1,1,1]
W=[1,2,3,4,5]
I found another approach (using W as a diagonal matrix and matrix products):
A=[[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]]
B = [1,1,1,1,1]
W = [1,2,3,4,5]
W = np.sqrt(np.diag(W))
Aw = np.dot(W,A)
Bw = np.dot(B,W)
X = np.linalg.lstsq(Aw, Bw)
Same values and same results.
I don't know how you have defined your weights, but you could try this if appropriate:
import numpy as np
A=np.array([[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]])
B = np.array([1,1,1,1,1])
W = np.array([1,2,3,4,5])
Aw = A * np.sqrt(W[:,np.newaxis])
Bw = B * np.sqrt(W)
X = np.linalg.lstsq(Aw, Bw)
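Scaling each row by the square root of its weight works because minimizing the sum of w_i * (A_i x - b_i)^2 is the same as ordinary least squares on the scaled system, whose normal equations are A^T W A x = A^T W b. A minimal sketch that checks this numerically follows; the small full-rank matrix A and the vector b here are made-up illustration data, not the matrices from the question (which are rank deficient, so the normal equations could not be solved directly).

```python
import numpy as np

# Hypothetical full-rank example: fit a line b ~ c0 + c1 * t with weights w.
A = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0],
              [1.0, 3.0],
              [1.0, 4.0]])
b = np.array([1.0, 2.0, 2.0, 4.0, 7.0])
w = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Approach from the answer: scale rows of A and entries of b by sqrt(w).
sw = np.sqrt(w)
x_scaled, *_ = np.linalg.lstsq(A * sw[:, np.newaxis], b * sw, rcond=None)

# Reference: solve the weighted normal equations A^T W A x = A^T W b.
x_normal = np.linalg.solve(A.T @ (w[:, np.newaxis] * A), A.T @ (w * b))

print(np.allclose(x_scaled, x_normal))
```

The lstsq route is usually preferable in practice, since it avoids forming A^T W A and also handles rank-deficient systems.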
scikit-learn offers weighted regression directly, through the sample_weight argument of LinearRegression.fit:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit
import numpy as np
# generate random data
N = 25
xp = [-5.0, 5.0]
x = np.random.uniform(xp[0],xp[1],(N,1))
e = 2*np.random.randn(N,1)
y = 2*x+e
w = np.ones(N)
# make the 3rd one outlier
y[2] += 30.0
w[2] = 0.0
from sklearn.linear_model import LinearRegression
# fit WLS using sample_weights
WLS = LinearRegression()
WLS.fit(x, y, sample_weight=w)
from matplotlib import pyplot as plt
plt.plot(x,y, '.')
plt.plot(xp, xp*WLS.coef_[0])
plt.show()

Python: Sklearn.linear_model.LinearRegression working weird

I am trying to do multiple variables linear regression. But I find that the sklearn.linear_model working very weird. Here's my code:
import numpy as np
from sklearn import linear_model
b = np.array([3,5,7]).transpose() ## the right answer I am expecting
x = np.array([[1,6,9], ## 1*3 + 6*5 + 7*9 = 96
[2,7,7], ## 2*3 + 7*5 + 7*7 = 90
[3,4,5]]) ## 3*3 + 4*5 + 5*7 = 64
y = np.array([96,90,64]).transpose()
clf = linear_model.LinearRegression()
clf.fit([[1,6,9],
[2,7,7],
[3,4,5]], [96,90,64])
print(clf.coef_) ## <== it gives me [-2.2 5 4.4] NOT [3, 5, 7]
print(np.dot(x, clf.coef_)) ## <== it gives me [ 67.4 61.4 35.4]
To recover your initial coefficients, you need to pass the keyword fit_intercept=False when constructing the linear regression.
import numpy as np
from sklearn import linear_model
b = np.array([3,5,7])
x = np.array([[1,6,9],
[2,7,7],
[3,4,5]])
y = np.array([96,90,64])
clf = linear_model.LinearRegression(fit_intercept=False)
clf.fit(x, y)
print(clf.coef_)
print(np.dot(x, clf.coef_))
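With fit_intercept=False, LinearRegression fits y = xb directly. Otherwise it centers the data first, working with x - x.mean(axis=0), and captures the means in a constant offset c (y = xb + c), which is equivalent to adding a column of ones to x. With only three samples, that extra parameter makes the problem underdetermined, which is why the original fit returned different coefficients.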
As a side remark, calling transpose on a 1D array doesn't have any effect (it reverses the order of your axes, and you only have one).
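The two behaviours can be imitated with NumPy alone. This is only a sketch of the idea described above, not a statement about scikit-learn's internals: without an intercept the 3x3 system is solved exactly, while centering the data first leaves a rank-deficient system whose minimum-norm least-squares solution reproduces the coefficients reported in the question. The explicit rcond=1e-10 is my own choice so that the near-zero singular value left by centering is reliably discarded.

```python
import numpy as np

x = np.array([[1.0, 6.0, 9.0],
              [2.0, 7.0, 7.0],
              [3.0, 4.0, 5.0]])
y = np.array([96.0, 90.0, 64.0])

# No intercept: three equations, three unknowns, solved exactly.
coef_no_intercept = np.linalg.solve(x, y)

# With an intercept: center x and y, then take the minimum-norm solution.
xc = x - x.mean(axis=0)
yc = y - y.mean()
coef_centered, *_ = np.linalg.lstsq(xc, yc, rcond=1e-10)
intercept = y.mean() - x.mean(axis=0) @ coef_centered

print(coef_no_intercept)   # close to [3, 5, 7]
print(coef_centered)       # close to [-2.2, 5, 4.4]
print(intercept)           # close to 28.6
```

The intercept of about 28.6 also accounts for the gap between y and the np.dot(x, clf.coef_) values printed in the question.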
