I want to fit a function to the independent (X) and dependent (y) variables:
import numpy as np
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
1.02251471, 1.01268524, 0.98535659, 0.97400591])
X = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602, 17.904761904850602,
22.904761904850602, 31.238095237873495, 37.95833333302289,
44.67857142863795, 51.39880952378735, 64.83928571408615,
71.5595238097012, 85., 98.55357142863795, 112.1071428572759])
I already tried the scipy package, like this:
from scipy.optimize import curve_fit
def func(x, a, b, c):
    return 1/(a*(x**2) + b*(x**1) + c)
g = [1, 1, 1]
c, cov = curve_fit (func, X.flatten(), y.flatten(), g)
test_ar = np.arange(min(X), max(X), 0.25)
pred = np.empty(len(test_ar))
for i in range(len(test_ar)):
    pred[i] = func(test_ar[i], c[0], c[1], c[2])
I can add higher-order polynomial terms to make my func more accurate, but I want to keep it simple. I would very much appreciate it if anyone can give me some help on how to find another function or make my prediction better. The figure also shows the result of the prediction:
The first thing to do is specify how you measure "accuracy", which is not really the appropriate term in your case.
What you are essentially doing is regression. Suitable metrics for it are the mean squared error (MSE), root mean squared error (RMSE), and mean absolute error (MAE). It is up to you to decide which metric to use and what threshold counts as "acceptable".
The image you show above (where you fitted the curve) looks fine, BUT please expand your X-axis to run from -100 to 300 and show us the image again: poor behavior outside the data range is a known problem with high-degree polynomials.
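To make the extrapolation concern concrete without replotting, here is a minimal sketch. It assumes a plain cubic fitted with `np.polyfit` (not the asker's rational function) and simply evaluates that fit far outside the measured range:

```python
import numpy as np

# Data from the question
x = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602,
              17.904761904850602, 22.904761904850602, 31.238095237873495,
              37.95833333302289, 44.67857142863795, 51.39880952378735,
              64.83928571408615, 71.5595238097012, 85.0,
              98.55357142863795, 112.1071428572759])
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
              1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
              1.02251471, 1.01268524, 0.98535659, 0.97400591])

# Fit a cubic polynomial (coefficients come back highest power first)
cubic = np.poly1d(np.polyfit(x, y, 3))

# Largest absolute error on the points the fit has actually seen
inside_error = np.max(np.abs(cubic(x) - y))

# Predictions far to the left and right of the measured range
far_left, far_right = cubic(-100.0), cubic(300.0)

print("max error inside the data range:", inside_error)
print("prediction at x = -100:", far_left)
print("prediction at x =  300:", far_right)
```

Inside the data range the error stays small, while the two far-away predictions land well outside the range of any observed y. That is the behavior the wider X-axis would reveal.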
This is a textbook example of how to use regression in scikit-learn. In your case, if you want to use x^2 or x^3 to predict y, you just need to add them to the data. Currently your X variable is an array (a vector); you need to expand it into a matrix in which each column is a feature (x, x^2, x^3, ...).
Here is some code:
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error, r2_score
y = [1.45952016, 1.36947283, 1.31433227, 1.24076599,
1.20577963, 1.14454815, 1.13068077, 1.09638278,
1.08121406, 1.04417094, 1.02251471, 1.01268524, 0.98535659,
0.97400591]
x = [4.571428571362048, 8.771428571548313, 12.404761904850602,
17.904761904850602, 22.904761904850602, 31.238095237873495,
37.95833333302289, 44.67857142863795, 51.39880952378735,
64.83928571408615, 71.5595238097012, 85., 98.55357142863795, 112.1071428572759]
df = pd.DataFrame({
'x' : x,
'x^2': [i**2 for i in x],
'x^3': [i**3 for i in x],
'y': y
})
X = df[['x','x^2','x^3']]
y = df['y']
model = linear_model.LinearRegression()
model.fit(X, y)
y1 = model.predict(X)
coef = model.coef_
intercept = model.intercept_
You can see the coefficients in the coef variable:
array([-1.67456732e-02, 2.03899728e-04, -8.70976426e-07])
and the intercept in the intercept variable:
1.5042389677980577
which in your case means: y1 = -1.67e-2*x + 2.04e-4*x^2 - 8.71e-7*x^3 + 1.50
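For the metrics mentioned earlier (MSE, RMSE, MAE), here is a small sketch that rebuilds the same cubic feature matrix with NumPy only and scores the fit on the training points. Using `np.linalg.lstsq` instead of scikit-learn is my substitution to keep the example dependency-free; it solves the same ordinary least-squares problem.

```python
import numpy as np

# Data from the question
x = np.array([4.571428571362048, 8.771428571548313, 12.404761904850602,
              17.904761904850602, 22.904761904850602, 31.238095237873495,
              37.95833333302289, 44.67857142863795, 51.39880952378735,
              64.83928571408615, 71.5595238097012, 85.0,
              98.55357142863795, 112.1071428572759])
y = np.array([1.45952016, 1.36947283, 1.31433227, 1.24076599, 1.20577963,
              1.14454815, 1.13068077, 1.09638278, 1.08121406, 1.04417094,
              1.02251471, 1.01268524, 0.98535659, 0.97400591])

# Design matrix: a column of ones (intercept) plus the features x, x^2, x^3
X = np.column_stack([np.ones_like(x), x, x**2, x**3])

# Ordinary least squares; coef[0] is the intercept, coef[1:] the slopes
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
y_pred = X @ coef

residuals = y - y_pred
mse = np.mean(residuals**2)          # mean squared error
rmse = np.sqrt(mse)                  # root mean squared error
mae = np.mean(np.abs(residuals))     # mean absolute error

print("intercept and coefficients:", coef)
print(f"MSE={mse:.3g}  RMSE={rmse:.3g}  MAE={mae:.3g}")
```

These are training-set errors, so they only say how well the curve follows the points it was fitted to, not how it behaves on new x values.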
I am learning data science and machine learning. I have written code for gradient descent optimization of the linear regression cost function without using a built-in Python library. To check that my code is correct and to verify the results, I have also implemented the same thing using a built-in Python library.
The coefficient and intercept values I obtain from my code do not match those obtained with the built-in Python module. Could you suggest what the error is in my gradient descent optimization of linear regression?
My method:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
Data=pd.DataFrame({'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]})
Data.head()
sb.scatterplot(x ='X', y = 'Y', data = Data)
plt.show()
#generating column of ones
X0 = np.ones(len(Data)).reshape(-1,1)
#print(X0.shape)
X = Data.drop(['Y'], axis = 1).values
X_new = np.concatenate((X0,X), axis = 1)
#print(X_new)
#print(X_new.shape)
Y = Data.loc[:,['Y']].values
#print(Y)
#print(Y.shape)
# initial theta
theta =np.random.randint(low=0, high=1, size= X_new.shape[1]).reshape(-1,1)
#print(theta.shape)
J_history = []
theta_history = [list(theta.flatten())]
#gradient descent implementation
iterations = 1000
alpha = 0.01
m = len(Y)
for iter in range(1,iterations):
    H = X_new.dot(theta)
    loss = (H-Y)
    J = loss/(2*m)
    J_history.append(J)
    G = X_new.T.dot(loss)/m
    theta_new = theta - alpha*G
    theta_history.append(list(theta_new.flatten()))
    theta = theta_new
# collecting costs (J) and coefficients (theta_0,theta_1)
theta_history.pop()
J_history = [i[0] for i in J_history]
params = pd.DataFrame()
params['J']=J_history
for i in range(len(theta_history[0])):
    params['theta_'+str(i)]=[k[i] for k in theta_history]
idx = params[params['J']==min(params['J'])].index
values = params.iloc[idx[0]][1:params.shape[1]].tolist()
print('intercept: {}, coeff: {}'.format(values[0],values[1]))
Using the built-in library:
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
Data=pd.DataFrame({'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]})
Data.head()
sb.scatterplot(x ='X', y = 'Y', data = Data)
plt.show()
# note: scikit-learn 1.0+ calls this loss 'squared_error'; older releases used 'squared_loss'
model = SGDRegressor(loss = 'squared_error', learning_rate = 'constant', eta0 = 0.01, max_iter= 1000)
model.fit(Data['X'].values.reshape(-1,1), Data['Y'].values.reshape(-1,1))
print('coeff: {}, intercept: {}'.format(model.coef_, model.intercept_))
First of all, I appreciate your effort to understand and implement the SGD algorithm yourself.
Now, back to your code. There are some minor errors that need to be corrected:
Your Js are not scalars but numpy arrays, yet the way you use them assumes they are scalars, hence the error raised when your code is executed.
After running your chain, you must take the theta that has the lowest error, and this error is actually J^2 and not J, as J may be negative as well.
The scikit-learn SGDRegressor that you are using is, as its name suggests, stochastic by definition, and given the small size of your dataset you need to run it many times and average its estimates if you want to get something reliable from it.
Your learning rate of 0.01 seems to be a little big.
With those changes made, I get results from your code that are "comparable" with SGDRegressor.
import pandas as pd
import numpy as np
import seaborn as sb
import matplotlib.pyplot as plt
from sklearn.linear_model import SGDRegressor
Data=pd.DataFrame({'X': list(np.arange(0,10,1)), 'Y': [1,3,2,5,7,8,8,9,10,12]})
Data.head()
sb.scatterplot(x ='X', y = 'Y', data = Data)
plt.show()
#generating column of ones
X0 = np.ones(len(Data)).reshape(-1,1)
#print(X0.shape)
X = Data.drop(['Y'], axis = 1).values
X_new = np.concatenate((X0,X), axis = 1)
#print(X_new)
#print(X_new.shape)
Y = Data.loc[:,['Y']].values
#print(Y)
#print(Y.shape)
# initial theta
theta =np.random.randint(low=0, high=1, size= X_new.shape[1]).reshape(-1,1)
#print(theta.shape)
J_history = []
theta_history = [list(theta.flatten())]
#gradient descent implementation
iterations = 2000
alpha = 0.001
m = len(Y)
for iter in range(1,iterations):
    H = X_new.dot(theta)
    loss = (H-Y)
    J = loss/(2*m)
    J_history.append(J[0]**2)
    G = X_new.T.dot(loss)/m
    theta_new = theta - alpha*G
    theta_history.append(list(theta_new.flatten()))
    theta = theta_new
theta_history.pop()
J_history = [i[0] for i in J_history]
# collecting costs (J) and coefficients (theta_0,theta_1)
params = pd.DataFrame()
params['J']=J_history
for i in range(len(theta_history[0])):
    params['theta_'+str(i)]=[k[i] for k in theta_history]
idx = params[params['J']== params['J'].min()].index
values = params.iloc[idx[0]][1:params.shape[1]].tolist()
print('intercept: {}, coeff: {}'.format(values[0],values[1]))
#> intercept: 0.654041555750147, coeff: 1.2625626277290982
Now let's see the scikit-learn model:
from sklearn.linear_model import SGDRegressor
intercepts = []
coefs = []
for _ in range(500):
    # note: scikit-learn 1.0+ calls this loss 'squared_error'; older releases used 'squared_loss'
    model = SGDRegressor(loss = 'squared_error', learning_rate = 'constant', eta0 = 0.01, max_iter= 1000)
    model.fit(Data['X'].values.reshape(-1,1), Data['Y'].values.reshape(-1))
    intercepts.append(model.intercept_)
    coefs.append(model.coef_)
intercept = np.concatenate(intercepts).mean()
coef = np.vstack(coefs).mean(0)
print('intercept: {}, coeff: {}'.format( intercept, coef))
#> intercept: 0.6912403374422401, coeff: [1.24932246]
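A useful reference point for both implementations is the exact least-squares solution. The sketch below is not the asker's code and not what SGDRegressor does internally: it is plain batch gradient descent with a scalar cost (mean of squared residuals over 2), and the step size and iteration count are my own assumptions, chosen so that the loop converges on this particular dataset.

```python
import numpy as np

# Same toy data as in the question
x = np.arange(10, dtype=float)
y = np.array([1, 3, 2, 5, 7, 8, 8, 9, 10, 12], dtype=float)

X = np.column_stack([np.ones_like(x), x])   # first column is the intercept term
m = len(y)

theta = np.zeros(2)   # [intercept, slope]
alpha = 0.05          # assumed step size, small enough to be stable here
n_iter = 5000         # assumed iteration count
cost_history = []

for _ in range(n_iter):
    residual = X @ theta - y
    # Scalar cost J = sum(residual^2) / (2m), recorded before the update
    cost_history.append(residual @ residual / (2 * m))
    gradient = X.T @ residual / m
    theta = theta - alpha * gradient

# Closed-form least-squares solution for comparison
theta_exact, *_ = np.linalg.lstsq(X, y, rcond=None)

print("gradient descent [intercept, slope]:", theta)
print("least squares    [intercept, slope]:", theta_exact)
```

If a hand-written loop and the closed-form solution disagree noticeably, the usual suspects are too few iterations or too small a step size rather than a bug in the update rule.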
I really don't understand what's wrong with my (simple) code...
I just want to test a multiple linear regression.
import pandas as pd
import numpy as np
import scipy.stats as st
import sklearn.linear_model
n = 1000
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size = n, loc = 0, scale = 0.6)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression()
#Train:
my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
And I get nonsensical results: for example, the coefficients are [ 0.25127347 0.26673645 0.65717676] ...
I also tried the OLS way, but I still get nonsensical coefficients (slightly different, but still wrong).
(It works with a one-variable linear regression: for something like Y = 2*X + 5, I get a coefficient and intercept really close to the real ones.)
Thanks all!
I hadn't performed a linear regression for a while, and of course it's because X is not invertible (in R, it gives me 'nan').
So it wasn't a smart question...
Thanks again!
The fact that the coefficients do not at all resemble the "true" ones that you have set indicates that multicollinearity might be a problem. The issue with your code is that your X matrix is near-singular, which renders numerical results unstable. As can be seen in @R.yan's graphs, your X1 and X2 are almost identical except for a linear shift. This is corroborated by the fact that your X matrix, which has 1000 rows and three columns, only has a rank of 2. See:
np.linalg.matrix_rank(X)
Out[26]: 2
Try the following instead:
import pandas as pd
import numpy as np
import scipy.stats as st
import sklearn
from sklearn.linear_model import LinearRegression
n = 1000
# adding noise to your data:
X1 = np.linspace(2, 8.5, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X2 = np.linspace(-4, 2.9, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X3 = np.linspace(-1, 16, n) + st.norm.rvs(size=n ,loc = 0, scale = 1)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size=1000 ,loc = 0, scale = 1)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = sklearn.linear_model.LinearRegression(fit_intercept = True)
#Train:
res = my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
Coefficients:
[[ 1.99273588 3.20068392 -1.19688422]]
Constant:
[ 4.02296003]
Now, we get the right coefficients, and a matrix of full rank:
np.linalg.matrix_rank(X)
Out[32]: 3
Note that in linear regression, X needs to have a rank equal to the number of columns (or rows, if that is less). If it does not, there is multicollinearity, which renders numerical results for the inverse of X'X unstable (depending on which algorithm is used). See this description for more information on multicollinearity.
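A quick way to check for this before fitting is to look at the rank and the condition number of the feature matrix. The sketch below uses NumPy only; the seed and the noise scale are my own choices for illustration and are not taken from the original post.

```python
import numpy as np

n = 1000
# Columns from the question: each one is a straight ramp over the same index
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.column_stack([X1, X2, X3])

# Same columns with independent noise added (seed fixed only for reproducibility)
rng = np.random.default_rng(0)
X_noisy = X + rng.normal(loc=0.0, scale=1.0, size=X.shape)

rank_original = np.linalg.matrix_rank(X)
rank_noisy = np.linalg.matrix_rank(X_noisy)

# Ratio of largest to smallest singular value; huge values signal near-singularity
cond_original = np.linalg.cond(X)
cond_noisy = np.linalg.cond(X_noisy)

print("rank without noise:", rank_original, " condition number:", cond_original)
print("rank with noise:   ", rank_noisy, " condition number:", cond_noisy)
```

A rank below the number of columns, or a condition number many orders of magnitude above 1, means the individual coefficients cannot be pinned down even when the predictions themselves look fine.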
I guess the code gives you the correct answer. I plotted the predicted Y based on the coef_ and intercept_ from your regression and got the following graph.
import pandas as pd
import numpy as np
import scipy.stats as st
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
n = 1000
X1 = np.linspace(2, 8.5, n)
X2 = np.linspace(-4, 2.9, n)
X3 = np.linspace(-1, 16, n)
X = np.transpose( [X1, X2, X3] )
Y = 2*X1 + 3.2*X2 -1.2*X3 + 4 + st.norm.rvs(size=1000 ,loc = 0, scale = 0.6)
X = pd.DataFrame( X , columns = ["X1", "X2", "X3"])
Y = pd.DataFrame(Y, columns = ["Y"])
#Create linear regression object:
my_reg = LinearRegression()
plt.plot(Y, color='blue', label='Y')
#Train:
res = my_reg.fit(X, Y)
print('Coefficients: \n', my_reg.coef_)
print('Constant: \n', my_reg.intercept_)
plt.scatter(X.index.values,X['X1'], c='black')
plt.scatter(X.index.values,X['X2'], c='black')
plt.scatter(X.index.values,X['X3'], c='black')
Y_pred = my_reg.coef_[0][0]*X['X1'] + my_reg.coef_[0][1]*X['X2'] +my_reg.coef_[0][2]*X['X3'] + my_reg.intercept_
plt.plot(Y_pred, color="red", label='predict')
plt.legend()
Out[]: ('Coefficients: \n', array([[ 3.13842691e+12, 1.01316187e+13, -5.31223199e+12]]))
('Constant: \n', array([ 2.89373889e+13]))
Sorry for the noob question...here's my code:
from __future__ import division
import sklearn
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
X =np.array([6,8,10,14,18])
Y = np.array([7,9,13,17.5,18])
X = np.reshape(X,(1,5))
Y = np.reshape(Y,(1,5))
print X
print Y
plt.figure()
plt.title('Pizza Price as a function of Pizza Diameter')
plt.xlabel('Pizza Diameter (Inches)')
plt.ylabel('Pizza Price (Dollars)')
axis = plt.axis([0, 25, 0 ,25])
m, b = np.polyfit(X,Y,1)
plt.grid(True)
plt.plot(X,Y, 'k.')
plt.plot(X, m*X + b, '-')
#plt.show()
#training data
#x= [[6],[8],[10],[14],[18]]
#y= [[7],[9],[13],[17.5],[18]]
# create and fit linear regression model
model = LinearRegression()
model.fit(X,Y)
print 'A 12" pizza should cost $% .2f' % model.predict(19)
#work out cost function, which is residual sum of squares
print 'Residual sum of squares: %.2f' % np.mean((model.predict(x)- y) ** 2)
#work out variance (AKA Mean squared error)
xMean = np.mean(x)
print 'Variance is: %.2f' %np.var([x], ddof=1)
#work out covariance (this is whether the x axis data and y axis data correlate with eachother)
#When a and b are 1-dimensional sequences, numpy.cov(x,y)[0][1] calculates covariance
print 'Covariance is: %.2f' %np.cov(X, Y, ddof = 1)[0][1]
#test the model on new test data, printing the r squared coefficient
X_test = [[8], [9], [11], [16], [12]]
y_test = [[11], [8.5], [15], [18], [11]]
print 'R squared for model on test data is: %.2f' %model.score(X_test,y_test)
Basically, some of these functions work for the variables I have called X and Y and some don't.
For example, as the code is, it throws up this error:
TypeError: expected 1D vector for x
for the line
m, b = np.polyfit(X,Y,1)
However, when I comment out the two lines reshaping the variables like this:
#X = np.reshape(X,(1,5))
#Y = np.reshape(Y,(1,5))
I get the error:
ValueError: Found input variables with inconsistent numbers of samples: [1, 5]
on the line
model.fit(X,Y)
So, how do I get the array to work for all the functions in my script, without having different arrays of the same data with slightly different structures?
Thanks for your help!
Change these lines to
X = np.reshape(X,(5))
Y = np.reshape(Y,(5))
or just remove them both.
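One way to avoid keeping two copies of the data is to store it once as 1-D arrays and build the 2-D column shape only at the call that needs it. The sketch below uses NumPy only. My understanding is that scikit-learn estimators expect X as an (n_samples, n_features) matrix, so the same `X.reshape(-1, 1)` view should be what `model.fit` wants, but that call is not exercised here.

```python
import numpy as np

# Keep a single 1-D copy of the data, shape (5,)
X = np.array([6, 8, 10, 14, 18], dtype=float)
Y = np.array([7, 9, 13, 17.5, 18], dtype=float)

# np.polyfit wants 1-D vectors, so the arrays can be passed as they are
m, b = np.polyfit(X, Y, 1)

# Code that expects a 2-D (n_samples, n_features) matrix gets a column view;
# reshape(-1, 1) gives shape (5, 1) without copying the data
X_col = X.reshape(-1, 1)

# Same straight-line fit through the 2-D route, as a consistency check
design = np.hstack([np.ones_like(X_col), X_col])   # columns: intercept, slope
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
b2, m2 = coef

print("polyfit slope and intercept:", m, b)
print("lstsq   slope and intercept:", m2, b2)
```

Both routes describe the same line, so only the shape handed to each function changes, not the data.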
I know how to solve A.X = B by least squares using Python:
Example:
import numpy
A=[[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]]
B=[1,1,1,1,1]
X=numpy.linalg.lstsq(A, B)
print(X[0])
# [ 5.00000000e-01 5.00000000e-01 -1.66533454e-16 -1.11022302e-16]
But what about solving the same equation with a weight matrix that is not the identity?
A.X = B (W)
Example:
A=[[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]]
B=[1,1,1,1,1]
W=[1,2,3,4,5]
I found another approach (using W as a diagonal matrix, and matrix products):
import numpy as np
A=[[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]]
B = [1,1,1,1,1]
W = [1,2,3,4,5]
W = np.sqrt(np.diag(W))
Aw = np.dot(W,A)
Bw = np.dot(B,W)
X = np.linalg.lstsq(Aw, Bw)
Same values and same results.
I don't know how you have defined your weights, but you could try this if appropriate:
import numpy as np
A=np.array([[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,1,1],[1,1,0,0]])
B = np.array([1,1,1,1,1])
W = np.array([1,2,3,4,5])
Aw = A * np.sqrt(W[:,np.newaxis])
Bw = B * np.sqrt(W)
X = np.linalg.lstsq(Aw, Bw)
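The reason the square root appears is that minimizing the weighted sum of squares, sum_i w_i * r_i^2, is the same as minimizing the ordinary sum of squares of sqrt(w_i) * r_i. A small sketch can confirm that this row scaling agrees with the weighted normal equations (A^T W A) x = A^T W b. The data here are hypothetical and were chosen to be full rank so that the normal equations can be solved directly.

```python
import numpy as np

# Hypothetical data: a straight line with one badly off observation
t = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
b = np.array([1.0, 3.1, 4.9, 7.2, 20.0])    # last point is an outlier
w = np.array([1.0, 1.0, 1.0, 1.0, 0.01])    # so it gets a small weight

A = np.column_stack([np.ones_like(t), t])   # columns: intercept, slope

# Route 1: scale each row by sqrt(w) and run ordinary least squares
sw = np.sqrt(w)
x_scaled, *_ = np.linalg.lstsq(A * sw[:, np.newaxis], b * sw, rcond=None)

# Route 2: solve the weighted normal equations (A^T W A) x = A^T W b
W = np.diag(w)
x_normal = np.linalg.solve(A.T @ W @ A, A.T @ W @ b)

# Unweighted fit, to show how much the outlier matters
x_plain, *_ = np.linalg.lstsq(A, b, rcond=None)

print("sqrt-weight scaling:", x_scaled)
print("normal equations:   ", x_normal)
print("unweighted:         ", x_plain)
```

The normal-equations route breaks down when A^T W A is singular, which is the case for the rank-deficient A in the question, so the row-scaling route with `lstsq` is the safer of the two there.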
The scikit-learn package offers weighted regression directly, through the sample_weight argument of LinearRegression.fit:
https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit
import numpy as np
# generate random data
N = 25
xp = [-5.0, 5.0]
x = np.random.uniform(xp[0],xp[1],(N,1))
e = 2*np.random.randn(N,1)
y = 2*x+e
w = np.ones(N)
# make the 3rd one outlier
y[2] += 30.0
w[2] = 0.0
from sklearn.linear_model import LinearRegression
# fit WLS using sample_weights
WLS = LinearRegression()
WLS.fit(x, y, sample_weight=w)
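from matplotlib import pyplot as plt
plt.plot(x, y, '.')
# draw the fitted line over the range xp, including the fitted intercept
xp_col = np.array(xp).reshape(-1, 1)
plt.plot(xp_col, WLS.predict(xp_col))
plt.show()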