Is there any good library to calculate linear least squares OLS (Ordinary Least Squares) in python?
Thanks.
Edit:
Thanks for the SciKits and Scipy.
#ars: Can X be a matrix? An example:
y(1) = a(1)*x(11) + a(2)*x(12) + a(3)*x(13)
y(2) = a(1)*x(21) + a(2)*x(22) + a(3)*x(23)
...........................................
y(n) = a(1)*x(n1) + a(2)*x(n2) + a(3)*x(n3)
Then how do I pass the parameters for Y and X matrices in your example?
Also, I don't have much background in algebra; I would appreciate it if you could point me to a good tutorial for this kind of problem.
Thanks much.
Try the statsmodels package. Here's a quick example:
import pylab
import numpy as np
import statsmodels.api as sm
x = np.arange(-10, 10)
y = 2*x + np.random.normal(size=len(x))
# model matrix with intercept
X = sm.add_constant(x)
# least squares fit
model = sm.OLS(y, X)
fit = model.fit()
print(fit.summary())
pylab.scatter(x, y)
pylab.plot(x, fit.fittedvalues)
Update: In response to the updated question, yes, it works with matrices. Note that the code above has the x data in array form, but we build a matrix X (capital X) to pass to OLS. The add_constant function simply builds that matrix with an extra column of ones for the intercept. In your case, you would simply pass your X matrix without that intermediate step and it would work.
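For example, a minimal sketch with made-up data (your real y and X would come from your own measurements, and there is no add_constant call because your model has no intercept term):
import numpy as np
import statsmodels.api as sm

n = 100
X = np.random.normal(size=(n, 3))                # an (n, 3) matrix: one column per coefficient a(1)..a(3)
y = X @ np.array([1.0, 2.0, 3.0]) + np.random.normal(size=n)

fit = sm.OLS(y, X).fit()
print(fit.params)                                # estimates of a(1), a(2), a(3)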
Have you looked at SciPy? I don't know if it does that, but I would imagine it will.
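For what it's worth, plain NumPy/SciPy can do an ordinary least-squares fit directly; a minimal sketch using scipy.linalg.lstsq with made-up data:
import numpy as np
from scipy.linalg import lstsq

X = np.random.normal(size=(50, 3))               # made-up design matrix
y = X @ np.array([1.0, 2.0, 3.0]) + np.random.normal(size=50)

coeffs, residues, rank, sv = lstsq(X, y)
print(coeffs)                                    # least-squares coefficient estimates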
Related
I wrote a Python script that imports some data, which I then manipulate to get a non-square matrix A. I then used the following code to solve the matrix equation.
from scipy.optimize import lsq_linear
X = lsq_linear(A_normalized, Y, bounds=(0, np.inf), method='bvls')
As you can see I used this particular method because I require all the X coefficients to be positive. However, I realized that lsq_linear from scipy.optimize minimizes the L2 norm of AX - Y to solve the equation. I was wondering if anyone knows of an alternative to lsq_linear that solves the equation by minimizing the L1 norm instead. I have looked in the scipy documentation, but so far I haven't had any luck finding such an alternative myself.
(Note that I actually know what Y is and what I am trying to solve for is X).
Edit: After trying various suggestions from the comments section, and after much frustration, I finally managed to sort of make it work using cvxpy. However, there is a problem with it. First of all, the elements of X are supposed to all be positive, but that is not the case. Moreover, when I multiply the matrix A_normalized by X, the result is not equal to Y. My code is below. Any suggestions on what I can do to fix it would be highly appreciated. (By the way, my original use of lsq_linear in the code above gave me an X that satisfied A_normalized*X = Y.)
import cvxpy as cp
from cvxpy.atoms import norm
import numpy as np
Y = specificity
x = cp.Variable(22)
objective = cp.Minimize(norm((A_normalized @ x - Y), 1))
constraints = [0 <= x]
prob = cp.Problem(objective, constraints)
result = prob.solve()
print("Optimal value", result)
print("Optimal var")
X = x.value # A numpy ndarray.
print(X)
A_normalized @ X == Y
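A quick way to sanity-check the cvxpy result (a sketch: tiny negative entries in X are usually just solver tolerance, and A_normalized @ X will only approximate Y when no exact nonnegative solution exists):
import numpy as np

print(X.min())                                       # tiny negatives are typically numerical noise
X_clipped = np.clip(X, 0, None)                      # enforce nonnegativity exactly
print(np.max(np.abs(A_normalized @ X_clipped - Y)))  # worst-case residual
print(np.sum(np.abs(A_normalized @ X_clipped - Y)))  # the L1 objective that was minimized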
I have data that I want to fit with polynomials. I have 200,000 data points, so I want an efficient algorithm. I want to use the numpy.polynomial package so that I can try different families and degrees of polynomials. Is there some way I can formulate this as a system of equations like Ax=b? Is there a better way to solve this than with scipy.minimize?
import numpy as np
from scipy.optimize import minimize as mini
x1 = np.random.random(2000)
x2 = np.random.random(2000)
y = 20 * np.sin(x1) + x2 - np.sin(30 * x1 - x2 / 10)
def fitness(x, degree=5):
    poly1 = np.polynomial.polynomial.polyval(x1, x[:degree])
    poly2 = np.polynomial.polynomial.polyval(x2, x[degree:])
    return np.sum((y - (poly1 + poly2)) ** 2)
# It seems like I should be able to solve this as a system of equations
# x = np.linalg.solve(np.concatenate([x1, x2]), y)
# minimize the sum of the squared residuals to find the optimal polynomial coefficients
x = mini(fitness, np.ones(10))
print(fitness(x.x))
Your intuition is right. You can solve this as a system of equations of the form Ax = b.
However:
The system is overdetermined and you want the least-squares solution, so you need to use np.linalg.lstsq instead of np.linalg.solve.
You can't use polyval because you need to separate the coefficients and powers of the independent variable.
This is how to construct the system of equations and solve it:
A = np.stack([x1**0, x1**1, x1**2, x1**3, x1**4, x2**0, x2**1, x2**2, x2**3, x2**4]).T
xx = np.linalg.lstsq(A, y)[0]
print(fitness(xx)) # test the result with original fitness function
Of course you can generalize over the degree:
A = np.stack([x1**p for p in range(degree)] + [x2**p for p in range(degree)]).T
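For instance, putting the generalized construction together with the solve step (a quick sketch; rcond=None just silences lstsq's default-rcond warning):
degree = 5
A = np.stack([x1**p for p in range(degree)] + [x2**p for p in range(degree)]).T
xx = np.linalg.lstsq(A, y, rcond=None)[0]
print(fitness(xx, degree=degree))  # same check as before, with the generalized matrix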
With the example data, the least squares solution runs much faster than the minimize solution (800µs vs 35ms on my laptop). However, A can become quite large, so if memory is an issue minimize might still be an option.
Update:
Without any knowledge about the internals of the polynomial function things become tricky, but it is possible to separate terms and coefficients. Here is a somewhat ugly way to construct the system matrix A from a function like polyval:
def construct_A(valfunc, degree):
    columns1 = []
    columns2 = []
    for p in range(degree):
        c = np.zeros(degree)
        c[p] = 1
        columns1.append(valfunc(x1, c))
        columns2.append(valfunc(x2, c))
    return np.stack(columns1 + columns2).T
A = construct_A(np.polynomial.polynomial.polyval, 5)
xx = np.linalg.lstsq(A, y)[0]
print(fitness(xx)) # test the result with original fitness function
I want to use the KernelRidge class of the scikit-learn library to fit a nonlinear regression model to my data, but I am getting confused about how to do that.
from sklearn.kernel_ridge import KernelRidge
import numpy as np
n_samples, n_features = 20,1
rng = np.random.RandomState(0)
y = rng.randn(n_samples)
X = rng.randn(n_samples, n_features)
Krr = KernelRidge(alpha=1.0, kernel='linear', degree=4)
Krr.fit(X, y)
I am expecting 5 coefficients to be set for this model; how can I get them?
The above code will transform the 1-D data to a 4-D space and fit the model to the data. I think it should find the best c0, c1, c2, c3, c4 according to the training data. My question is: how can I access c0, c1, c2, c3, c4?
EDIT:
I made a mistake in my code above: the kernel parameter should be "polynomial" instead of "linear" in line 7.
Krr = KernelRidge(alpha=1.0, kernel='polynomial', degree=4)
But my question is same as before.
http://scikit-learn.org/stable/modules/generated/sklearn.kernel_ridge.KernelRidge.html#sklearn.kernel_ridge.KernelRidge
dual_coef_ : array, shape = [n_samples] or [n_samples, n_targets]
so
Krr.dual_coef_
should do it.
EDIT:
OK, so dual_coef_ is the coefficient vector in kernel space. For a linear kernel, the kernel K(X, X') is X @ X'.T (the Gram matrix of the samples), so it is an N x N matrix, and hence the number of dual coefficients equals the dimension of y (the number of samples).
There are 3 equations we need to understand.
The first is the standard ridge regression weight estimate: w = (X.T @ X + lambda*I)^-1 @ X.T @ y (lambda being the regularization strength, i.e. the model's alpha parameter).
The second is the partially kernelised version, alpha = (K + lambda*I)^-1 @ y with K = X @ X.T, and the relation linking the two is the third equation, w = X.T @ alpha.
dual_coef_ returns the alpha of equation 2. Therefore, to get the weight vector in the 'normal' feature space, rather than in the kernel space where it is returned, you need to compute X.T @ Krr.dual_coef_.
We can check this is correct because KRR and Ridge Regression are the same if the kernel is linear.
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Ridge
rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel()
Krr = KernelRidge(alpha=1.0, kernel='linear', coef0=0)
R = Ridge(alpha=1.0,fit_intercept=False)
Krr.fit(X, y)
R.fit(X, y)
print(np.dot(X.transpose(), Krr.dual_coef_))
print(R.coef_)
I see this output:
[-0.03997686]
[-0.03997686]
This shows they are equivalent (you have to change the intercept options, as the defaults differ between the two models).
As the degree parameter is ignored for a linear kernel (as I mentioned in the comments), the coefficient should be 1x1 in this case, and it is.
If you want to know exactly what a particular model returns, I recommend looking at the source code on github, which I think is the only way to gain a deeper understanding of how this stuff works. https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/kernel_ridge.py
Additionally, for a non-linear kernel, the intuition of the weights can easily be lost, so always start from first principles if you do this.
Illustration of how KernelRidge prediction works. Hope it will help someone to understand the model.
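Since the illustration itself is not reproduced here, a rough sketch of what (as far as I understand it) KernelRidge does at prediction time, namely K(X_new, X_train) @ dual_coef_ with the chosen kernel:
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.metrics.pairwise import polynomial_kernel

rng = np.random.RandomState(0)
X = rng.randn(20, 1)
y = rng.randn(20)

krr = KernelRidge(alpha=1.0, kernel='polynomial', degree=4, coef0=1)
krr.fit(X, y)

X_new = rng.randn(5, 1)
# kernel between the new points and the training points, times the dual coefficients
K = polynomial_kernel(X_new, X, degree=4, coef0=1)
print(np.allclose(K @ krr.dual_coef_, krr.predict(X_new)))  # should print True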
I have a set of data and I would like to fit a power law function given as
y=a*x**b
using Python libraries. Another issue is that I have errors in both the x and y directions, and I don't know which library would allow me to fit the function taking both errors into account. The data is here. I also tried to use gnuplot to do the fit, but it doesn't look promising, plus I cannot use the error information there.
Any suggestion?
Actually, scipy has an Orthogonal distance regression package.
Here is their example, adapted from a linear fit to your power law:
import numpy as np
from scipy.odr import Model, Data, ODR

def f(p, x):
    '''Power-law function y = a*x**b'''
    # p is a vector of the parameters [a, b].
    # x is an array of the current x values,
    # in the same format as the x passed to Data or RealData.
    # Return an array in the same format as the y passed to Data or RealData.
    return p[0] * x ** p[1]

powerlaw = Model(f)
# sx, sy are arrays of error estimates
mydata = Data(x, y, wd=1./np.power(sx, 2), we=1./np.power(sy, 2))
# beta0 are the initial parameter estimates
myodr = ODR(mydata, powerlaw, beta0=[10, -1.0])
myoutput = myodr.run()
myoutput.pprint()
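If it helps, the fitted parameters and their standard errors should then be available on the output object (going by the scipy.odr Output attributes):
a, b = myoutput.beta              # fitted parameters of y = a*x**b
a_err, b_err = myoutput.sd_beta   # their standard errors
print(a, b, a_err, b_err)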
The leastsq method in the scipy library fits a curve to some data. This method assumes that the Y values in the data depend on some X argument, and it minimizes only the distance between the curve and the data points along the Y axis (dy).
But what if I need to minimize the distance along both axes (dy and dx)?
Is there some way to implement this calculation?
Here is a code sample using the one-axis calculation:
import numpy as np
from scipy.optimize import leastsq
xData = [some data...]
yData = [some data...]
def mFunc(p, x, y):
    return y - (p[0]*x**p[1])  # takes into account only the y axis

plsq, pcov = leastsq(mFunc, [1, 1], args=(xData, yData))
print(plsq)
I recently tried the scipy.odr library, and it returns proper results only for a linear function. For other functions like y = a*x^b it returns wrong results. This is how I use it:
from scipy.odr import Model, Data, ODR

def f(p, x):
    return p[0]*x**p[1]

myModel = Model(f)
myData = Data(xData, yData)
myOdr = ODR(myData, myModel, beta0=[1, 1])
myOdr.set_job(fit_type=0)  # if fit_type=2 is set, it returns the same as leastsq
out = myOdr.run()
out.pprint()
This returns wrong results, not the desired ones, and for some input data they are not even close to the real values.
Maybe there is some special way of using it; what am I doing wrong?
I've found the solution. SciPy's odrpack works normally, but it needs a good initial guess for correct results. So I divided the process into two steps.
First step: find the initial guess by using the ordinary least squares method.
Second step: substitute this initial guess into ODR as the beta0 parameter.
And it works very well, with an acceptable speed.
Thank you guys, your advice directed me to the right solution.
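For reference, a minimal sketch of that two-step idea for y = a*x**b (it assumes positive data so the log-log trick works; xErr and yErr are placeholder names for your error estimates):
import numpy as np
from scipy.odr import Model, RealData, ODR

def f(p, x):
    return p[0] * x**p[1]

# step 1: rough initial guess from ordinary least squares in log-log space
b0, loga0 = np.polyfit(np.log(xData), np.log(yData), 1)
beta0 = [np.exp(loga0), b0]

# step 2: refine with orthogonal distance regression, starting from that guess
odr = ODR(RealData(xData, yData, sx=xErr, sy=yErr), Model(f), beta0=beta0)
odr.set_job(fit_type=0)
out = odr.run()
print(out.beta)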
scipy.odr implements the Orthogonal Distance Regression. See the instructions for basic use in the docstring and documentation.
If/when you are able to invert the function described by p, you may just include x - pinverted(y) in mFunc, combined I guess as sqrt(a^2 + b^2), so (pseudocode):
return sqrt( (y - (p[0]*x**p[1]))^2 + (x - pinverted(y))^2 )
For example, for
y = k*x + m, p = [m, k]
pinv = [-m/k, 1/k]
return sqrt( (y - (p[0] + x*p[1]))^2 + (x - (pinv[0] + y*pinv[1]))^2 )
But what you ask for is problematic in some cases. For example, if a polynomial (or your x^b) curve has a minimum ym and you have a data point (x, y) with y lower than ym, what kind of value do you want to return? There is not always a solution.
You can use the onls package in R.