calculation of residuals with numpy lstsq - python

I have x,y data:
import numpy as np
x = np.array([ 2.5, 1.25, 0.625, 0.3125, 0.15625, 0.078125])
y = np.array([ 2448636.,1232116.,617889.,310678.,154454.,78338.])
X = np.vstack((x, np.zeros(len(x))))
popt,res,rank,val = np.linalg.lstsq(X.T,y)
popt,res,rank,val
Gives me:
(array([ 981270.29919414, 0. ]),
array([], dtype=float64),
1,
array([ 2.88639894, 0. ]))
Why are the residuals empty? If I add ones instead of zeros, the residuals are calculated:
X = np.vstack((x, np.ones(len(x)))) # added ones instead of zeros
popt,res,rank,val = np.linalg.lstsq(X.T,y)
popt,res,rank,val
(array([ 978897.28500355, 4016.82089552]),
array([ 42727293.12864216]),
2,
array([ 3.49623683, 1.45176681]))
Additionally, if I calculate the sum of squared residuals in Excel, I get 9261214 if the intercept is set to zero and 5478137 if ones are added to x.

lstsq is going to have a tough time fitting to that column of zeros: any value of the corresponding parameter (presumably the intercept) will do. The zero column makes the matrix rank-deficient (note the reported rank of 1 for 2 columns), and lstsq returns an empty residuals array whenever the coefficient matrix is rank-deficient.
To fix the intercept to 0, if that's what you need to do, just send the x array, but make sure that it's the right shape for lstsq:
In [214]: popt,res,rank,val = np.linalg.lstsq(np.atleast_2d(x).T,y)
In [215]: popt
Out[215]: array([ 981270.29919414])
In [216]: res
Out[216]: array([ 92621214.2278382])
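If you need the residual sum of squares even when lstsq leaves res empty, it is easy to recompute by hand. Here is a minimal sketch reusing the arrays from the question (rcond=None just silences a warning on newer NumPy; drop it on older versions):
import numpy as np
x = np.array([2.5, 1.25, 0.625, 0.3125, 0.15625, 0.078125])
y = np.array([2448636., 1232116., 617889., 310678., 154454., 78338.])
X = np.vstack((x, np.zeros(len(x)))).T  # rank-deficient: one column is all zeros
popt, res, rank, sv = np.linalg.lstsq(X, y, rcond=None)
print(res)                        # array([], dtype=float64): rank (1) < number of columns (2)
print(np.sum((y - X @ popt)**2))  # ~92621214.23, the same residual as the one-column fit above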

Why is my SVD calculation different than numpy's SVD calculation of this matrix?

I'm trying to manually compute the SVD of the matrix A defined below but I am having some problems. Computing it manually and with the svd method in numpy yields two different results.
Computed manually below:
import numpy as np
A = np.array([[3,2,2], [2,3,-2]])
V = np.linalg.eig(A.T @ A)[1]
U = np.linalg.eig(A @ A.T)[1]
S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
print(A)
print(U @ S @ V.T)
And computed via numpy's svd method:
X,Y,Z = np.linalg.svd(A)
Y = np.c_[np.diag(Y), [0,0]]
print(A)
print(X @ Y @ Z)
When these two pieces of code are run, the manual calculation doesn't match the svd method. Why is there a discrepancy between these two calculations?
Look at the eigenvalues returned by np.linalg.eig(A.T @ A):
In [57]: evals, evecs = np.linalg.eig(A.T @ A)
In [58]: evals
Out[58]: array([2.50000000e+01, 3.61082692e-15, 9.00000000e+00])
So (ignoring the normal floating point imprecision), it computed [25, 0, 9]. The eigenvectors associated with those eigenvalues are in the columns of evecs, in the same order. But your construction of S doesn't match that order; here's your S:
In [60]: S
Out[60]:
array([[5., 0., 0.],
[0., 3., 0.]])
When you compute U @ S @ V.T, the values in S @ V.T are not correctly aligned.
As a quick fix, you can rerun your code with S set explicitly as follows:
S = np.array([[5, 0, 0],
[0, 0, 3]])
With that change, your code outputs
[[ 3 2 2]
[ 2 3 -2]]
[[-3. -2. -2.]
[-2. -3. 2.]]
That's better, but why are the signs wrong? Now the problem is that you have independently computed U and V. Eigenvectors are not unique; they form a basis of an eigenspace, and such a basis is not unique. If the eigenvalue is simple, and if the vector is normalized to have length one (which numpy.linalg.eig does), there is still a choice of sign to be made. That is, if v is an eigenvector, then so is -v. The choices made by eig when computing U and V won't necessarily result in restoring the sign of A when U @ S @ V.T is computed.
It turns out that you can get the result that you expect by simply reversing all the signs in either U or V. Here is a modified version of your script that generates the output that you expected:
import numpy as np
A = np.array([[3, 2, 2],
              [2, 3, -2]])
U = np.linalg.eig(A @ A.T)[1]
V = -np.linalg.eig(A.T @ A)[1]
#S = np.c_[np.diag(np.sqrt(np.linalg.eig(A @ A.T)[0])), [0,0]]
S = np.array([[5, 0, 0],
              [0, 0, 3]])
print(A)
print(U @ S @ V.T)
Output:
[[ 3 2 2]
[ 2 3 -2]]
[[ 3. 2. 2.]
[ 2. 3. -2.]]
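As an alternative to flipping signs by hand, you can avoid the ambiguity altogether by computing only V with eig and deriving U from it via u_i = A v_i / sigma_i. Here is a sketch of that approach (the sorting step is needed because eig returns eigenvalues in no particular order):
import numpy as np
A = np.array([[3, 2, 2],
              [2, 3, -2]])
evals, V = np.linalg.eig(A.T @ A)
order = np.argsort(evals)[::-1]         # sort eigenvalues in descending order
evals, V = evals[order], V[:, order]
sigma = np.sqrt(evals.clip(min=0)[:2])  # the two nonzero singular values: [5., 3.]
U = (A @ V[:, :2]) / sigma              # u_i = A v_i / sigma_i keeps the signs consistent
S = np.c_[np.diag(sigma), [0, 0]]       # the 2x3 "diagonal" matrix of singular values
print(U @ S @ V.T)                      # recovers A up to floating point error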

Using two np.linspace, how to fill 2D array with complex values?

I'm trying to fill a 2D array with complex(x, y), where x and y come from two arrays:
xstep = np.linspace(xmin, xmax, Nx)
ystep = np.linspace(ymin, ymax, Ny)
However I can't figure out how to "spread" these values out on a 2D array.
So far my attempts are not really working out. I was hoping for something along the lines of:
result = np.array(xstep + (1j * ystep))
Maybe something from fromfunction, meshgrid or full, but I can't quite make it work.
As an example, say I do this:
xstep = np.linspace(0, 1, 2) # array([0., 1.])
ystep = np.linspace(0, 1, 3) # array([0. , 0.5, 1. ])
I'm trying to construct an answer:
array([
[0+0j, 0+0.5j, 0+1j],
[1+0j, 1+0.5j, 1+1j]
])
Note that I am not married to the linspace, so any quicker method would also do; it is just my natural starting point for creating this array, being new to Numpy.
In [4]: xstep = np.linspace(0, 1, 2)
In [5]: ystep = np.linspace(0, 1, 3)
In [6]: xstep[:, None] + 1j*ystep
Out[6]:
array([[0.+0.j , 0.+0.5j, 0.+1.j ],
[1.+0.j , 1.+0.5j, 1.+1.j ]])
xstep[:, None] is equivalent to xstep[:, np.newaxis] and its purpose is to add a new axis to xstep on the right. Thus, xstep[:, None] is a 2D array of shape (2, 1).
In [19]: xstep[:, None].shape
Out[19]: (2, 1)
xstep[:, None] + 1j*ystep is thus the sum of a 2D array of shape (2, 1) and a 1D array of shape (3,).
NumPy broadcasting resolves this apparent shape conflict by automatically adding new axes (of length 1) on the left. So, by NumPy broadcasting rules, 1j*ystep is promoted to an array of shape (1, 3).
(Notice that xstep[:, None] is required to explicitly add new axes on the right, but broadcasting will automatically add axes on the left. This is why 1j*ystep[None, :] was unnecessary though valid.)
Broadcasting further promotes both arrays to the common shape (2, 3) (but in a memory-efficient way, without copying the data). The values along the axes of length 1 are broadcasted repeatedly:
In [15]: X, Y = np.broadcast_arrays(xstep[:, None], 1j*ystep)
In [16]: X
Out[16]:
array([[0., 0., 0.],
[1., 1., 1.]])
In [17]: Y
Out[17]:
array([[0.+0.j , 0.+0.5j, 0.+1.j ],
[0.+0.j , 0.+0.5j, 0.+1.j ]])
You can use np.ogrid with imaginary "step" to obtain linspace semantics:
y, x = np.ogrid[0:1:2j, 0:1:3j]
y + 1j*x
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
Here the ogrid line means: make an open 2D grid, with axis 0 running from 0 to 1 in 2 steps and axis 1 from 0 to 1 in 3 steps. The type of the slice's "step" acts as a switch: if it is imaginary (in fact, anything of complex type), its absolute value is taken and the expression is treated like a linspace; otherwise range semantics apply.
The return values
y, x
# (array([[0.],
# [1.]]), array([[0. , 0.5, 1. ]]))
are "broadcast ready", so in the example we can simply add them and obtain a full 2D grid.
If we allow ourselves an imaginary "stop" parameter in the second slice (which only works with linspace semantics, so depending on your style you may prefer to avoid it) this can be condensed to one line:
sum(np.ogrid[0:1:2j, 0:1j:3j])
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
A similar but potentially more performant method would be preallocation and then broadcasting:
out = np.empty((y.size, x.size), complex)
out.real[...], out.imag[...] = y, x
out
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
And another one using outer sum:
np.add.outer(np.linspace(0,1,2), np.linspace(0,1j,3))
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
Use reshape(-1,1) on xstep, as follows:
xstep = np.linspace(0, 1, 2) # array([0., 1.])
ystep = np.linspace(0, 1, 3) # array([0. , 0.5, 1. ])
result = np.array(xstep.reshape(-1,1) + (1j * ystep))
result
array([[0.+0.j , 0.+0.5j, 0.+1.j ],
[1.+0.j , 1.+0.5j, 1.+1.j ]])
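Since the question mentions meshgrid, that route works too; with indexing='ij', xstep runs along the rows, matching the desired (2, 3) layout:
X, Y = np.meshgrid(xstep, ystep, indexing='ij')  # X and Y both have shape (2, 3)
result = X + 1j * Y
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
#        [1.+0.j , 1.+0.5j, 1.+1.j ]])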

Vectorizing A Function With Array Parameter

I'm currently trying to apply Chi-Squared analysis to some data.
I want to plot a colourmap of varying values depending on the two coefficients of a model
def f(x, coeff):
    return coeff[0] + numpy.exp(coeff[1] * x)

def chi_squared(coeff, x, y, y_err):
    return numpy.sum(((y - f(x, coeff) / y_err)**2)

us = numpy.linspace(u0, u1, n)
vs = numpy.linspace(v0, v1, n)
rs = numpy.meshgrid(us, vs)
chi = numpy.vectorize(chi_squared)
chi(rs, x, y, y_error)
I tried vectorizing the function to be able to pass a meshgrid of the varying coefficents to produce the colormap.
The values of x, y, y_err are all 1D arrays of length n.
And u, v are the various changing coefficients.
However this doesn't work, resulting in
IndexError: invalid index to scalar variable.
This is because coeff is passed as a scalar rather than a vector; however, I don't know how to correct this.
Update
My aim is to take an array of coordinates
rs = [[[u0, v0], [u1, v0], ..., [un, v0]], ..., [[u0, vm], ..., [un, vm]]]
Where each coordinate is the coefficient parameters to be passed to the chi-squared method.
This should return a 2D array populated with Chi-Squared values for the appropriate coordinate
chi = [[c00, c10, ..., cn0], ..., [c0m, c1m, ..., cnm]]
I can then use this data to plot a colormap using imshow
Here's my first attempt to run your code:
In [44]: def f(x, coeff):
    ...:     return coeff[0] + numpy.exp(coeff[1] * x)
    ...:
    ...: def chi_squared(coeff, x, y, y_err):
    ...:     return numpy.sum((y - f(x, coeff) / y_err)**2)
(I had to remove the ( in that last line)
First guess at possible array values:
In [45]: x = np.arange(3)
In [46]: y = x
In [47]: y_err = x
In [48]: us = np.linspace(0,1,3)
In [49]: rs = np.meshgrid(us,us)
In [50]: rs
Out[50]:
[array([[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ]]),
array([[ 0. , 0. , 0. ],
[ 0.5, 0.5, 0.5],
[ 1. , 1. , 1. ]])]
In [51]: chi_squared(rs, x, y, y_err)
/usr/local/bin/ipython3:5: RuntimeWarning: divide by zero encountered in true_divide
import sys
Out[51]: inf
oops, y_err shouldn't have a 0. Try again:
In [52]: y_err = np.array([1,1,1])
In [53]: chi_squared(rs, x, y, y_err)
Out[53]: 53.262865105526018
It also works if I turn the rs list into an array:
In [55]: np.array(rs).shape
Out[55]: (2, 3, 3)
In [56]: chi_squared(np.array(rs), x, y, y_err)
Out[56]: 53.262865105526018
Now, what was the purpose of vectorize?
The f function returns a (n,n) array:
In [57]: f(x, rs)
Out[57]:
array([[ 1. , 1.5 , 2. ],
[ 1. , 2.14872127, 3.71828183],
[ 1. , 3.21828183, 8.3890561 ]])
Let's modify chi_squared to give sum an axis parameter:
In [61]: def chi_squared(coeff, x, y, y_err, axis=None):
    ...:     return numpy.sum((y - f(x, coeff) / y_err)**2, axis=axis)
In [62]: chi_squared(np.array(rs), x, y, y_err)
Out[62]: 53.262865105526018
In [63]: chi_squared(np.array(rs), x, y, y_err, axis=0)
Out[63]: array([ 3. , 6.49033483, 43.77253028])
In [64]: chi_squared(np.array(rs), x, y, y_err, axis=1)
Out[64]: array([ 1.25 , 5.272053 , 46.74081211])
I'm tempted to change the coeff to coeff0, coeff1, to give more control from the start on how this parameter is passed, but it probably doesn't make a difference.
update
Now that you've been more specific about how the coeff values relate to x, y etc, I see that this can be solved with simple broadcasting. No need to use np.vectorize.
First, define a grid that has a different size; that way we, and the code, won't think that each dimension of the coeff grid has anything to do with the x,y values.
In [134]: rs = np.meshgrid(np.linspace(0,1,4), np.linspace(0,1,5), indexing='ij')
In [135]: coeff=np.array(rs)
In [136]: coeff.shape
Out[136]: (2, 4, 5)
Now look at what f looks like when given this coeff and x.
In [137]: f(x, coeff[...,None]).shape
Out[137]: (4, 5, 3)
Each coeff[i] is effectively (4,5,1), while x is (1,1,3), resulting in a (4,5,3) array (by broadcasting rules).
The same thing happens inside chi_squared, with the final step of sum on the last axis (size 3):
In [138]: chi_squared(coeff[...,None], x, y, y_err, axis=-1)
Out[138]:
array([[ 2. , 1.20406718, 1.93676807, 8.40646968,
32.99441808],
[ 2.33333333, 2.15923164, 3.84810347, 11.80559574,
38.73264336],
[ 3.33333333, 3.78106277, 6.42610554, 15.87138846,
45.13753532],
[ 5. , 6.06956056, 9.67077427, 20.60384785,
52.20909393]])
In [139]: _.shape
Out[139]: (4, 5)
One value for each coeff pair of values, the (4,5) grid.
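From here, producing the colormap the question asked about is one more step. This is a hypothetical plotting snippet, assuming matplotlib is available and using the coeff, x, y, y_err defined above (the extent values simply echo the linspace bounds used to build the grid):
import matplotlib.pyplot as plt
chi = chi_squared(coeff[..., None], x, y, y_err, axis=-1)  # shape (4, 5)
# rows of chi follow the first linspace, columns the second
plt.imshow(chi, origin='lower', extent=(0, 1, 0, 1), aspect='auto')
plt.colorbar(label='chi-squared')
plt.show()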

efficient numpy array creation

Given x, I want to produce a numpy array containing x and log(x), such that if x has shape s, the result has shape (*s, 2). What's the neatest way to do this? x may just be a float, in which case I want a result with shape (2,).
An ugly way to do this is:
import numpy as np
x = np.asarray(x)
result = np.empty((*x.shape, 2))
result[..., 0] = x
result[..., 1] = np.log(x)
It's important to separate aesthetics from performance. Sometimes ugly code is fast. In fact, that's the case here: although creating an empty array and then assigning values to slices may not look beautiful, it is fast.
import numpy as np
import timeit
import itertools as IT
import pandas as pd

def using_empty(x):
    x = np.asarray(x)
    result = np.empty(x.shape + (2,))
    result[..., 0] = x
    result[..., 1] = np.log(x)
    return result

def using_concat(x):
    x = np.asarray(x)
    return np.concatenate([x, np.log(x)], axis=-1).reshape(x.shape + (2,), order='F')

def using_stack(x):
    x = np.asarray(x)
    return np.stack([x, np.log(x)], axis=x.ndim)

def using_ufunc(x):
    return np.array([x, np.log(x)])
using_ufunc = np.vectorize(using_ufunc, otypes=[np.ndarray])

tests = [np.arange(600),
         np.arange(600).reshape(20, 30),
         np.arange(960).reshape(8, 15, 8)]

# check that all implementations return the same result
for x in tests:
    assert np.allclose(using_empty(x), using_concat(x))
    assert np.allclose(using_empty(x), using_stack(x))

timing = []
funcs = ['using_empty', 'using_concat', 'using_stack', 'using_ufunc']
for test, func in IT.product(tests, funcs):
    timing.append(timeit.timeit(
        '{}(test)'.format(func),
        setup='from __main__ import test, {}'.format(func), number=1000))

timing = pd.DataFrame(np.array(timing).reshape(-1, len(funcs)), columns=funcs)
print(timing)
yields the following timeit results on my machine:
   using_empty  using_concat  using_stack  using_ufunc
0     0.024754      0.025182     0.030244     2.414580
1     0.025766      0.027692     0.031970     2.408344
2     0.037502      0.039644     0.044032     3.907487
So using_empty is the fastest of the options tested.
Note that np.stack does exactly what you want, so
np.stack([x, np.log(x)], axis=x.ndim)
looks reasonably pretty, but it is also the slowest of the three options tested.
Note that along with being much slower, using_ufunc returns an array of object dtype:
In [236]: x = np.arange(6)
In [237]: using_ufunc(x)
Out[237]:
array([array([ 0., -inf]), array([ 1., 0.]),
array([ 2. , 0.69314718]),
array([ 3. , 1.09861229]),
array([ 4. , 1.38629436]), array([ 5. , 1.60943791])], dtype=object)
which is not the same as the desired result:
In [240]: using_empty(x)
Out[240]:
array([[ 0. , -inf],
[ 1. , 0. ],
[ 2. , 0.69314718],
[ 3. , 1.09861229],
[ 4. , 1.38629436],
[ 5. , 1.60943791]])
In [238]: using_ufunc(x).shape
Out[238]: (6,)
In [239]: using_empty(x).shape
Out[239]: (6, 2)
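As for the scalar case mentioned in the question: np.asarray turns a float into a 0-d array, so using_empty already returns shape (2,) there with no extra handling:
using_empty(3.0)        # array([ 3.        ,  1.09861229])
using_empty(3.0).shape  # (2,)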

numpy interpolation to increase a vector size

Hi, I have to increase the number of points inside a vector, enlarging the vector to a fixed size. For example:
for this simple vector
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> len(a)
# 6
now I want to get a vector of size 11, taking the vector a as a base; the result will be
# array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
EDIT 1
what I need is a function that takes the base vector and the size that the resulting vector must have, and returns a new vector of that size. Something like:
def enlargeVector(vector, size):
    .....
    return newVector
to use like:
>>> a = np.array([0, 1, 2, 3, 4, 5])
>>> b = enlargeVector(a, 200)
>>> len(b)
# 200
and b contains the results of linear, cubic, or whatever other interpolation method
There are many methods to do this within scipy.interpolate. My favourite is UnivariateSpline, which produces an order-k spline guaranteed to be (k-1) times continuously differentiable.
To use it:
import numpy as np
from scipy.interpolate import UnivariateSpline
old_indices = np.arange(0, len(a))
new_length = 11
new_indices = np.linspace(0, len(a) - 1, new_length)
spl = UnivariateSpline(old_indices, a, k=3, s=0)
new_array = spl(new_indices)
The s is a smoothing factor that you should set to 0 in this case (since the data are exact).
Note that for the problem you have specified (since a just increases monotonically by 1), this is overkill, since the np.linspace used for new_indices already gives the desired output.
EDIT: clarified that the length is arbitrary
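To package this as the enlargeVector function sketched in the question, a minimal wrapper around the same recipe might look like this (the k parameter is an addition, defaulting to the cubic case used above):
import numpy as np
from scipy.interpolate import UnivariateSpline

def enlargeVector(vector, size, k=3):
    # map the new sample positions onto the old index range
    old_indices = np.arange(len(vector))
    new_indices = np.linspace(0, len(vector) - 1, size)
    spl = UnivariateSpline(old_indices, vector, k=k, s=0)  # s=0: interpolate the data exactly
    return spl(new_indices)

a = np.array([0, 1, 2, 3, 4, 5])
b = enlargeVector(a, 200)
len(b)  # 200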
As AGML pointed out, there are tools to do this, but how about a pure numpy solution:
In [20]: a = np.arange(6)
In [21]: temp = np.dstack((a[:-1], a[:-1] + np.diff(a) / 2.0)).ravel()
In [22]: temp
Out[22]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5])
In [23]: np.hstack((temp, [a[-1]]))
Out[23]: array([ 0. , 0.5, 1. , 1.5, 2. , 2.5, 3. , 3.5, 4. , 4.5, 5. ])
