I'm currently trying to apply Chi-Squared analysis to some data. I want to plot a colourmap of Chi-Squared values as a function of the two coefficients of a model:
def f(x, coeff):
return coeff[0] + numpy.exp(coeff[1] * x)
def chi_squared(coeff, x, y, y_err):
return numpy.sum(((y - f(x, coeff) / y_err)**2)
us = numpy.linspace(u0, u1, n)
vs = numpy.linspace(v0, v1, n)
rs = numpy.meshgrid(us, vs)
chi = numpy.vectorize(chi_squared)
chi(rs, x, y, y_err)
I tried vectorizing the function so that I could pass a meshgrid of the varying coefficients and produce the colormap.
The values of x, y, and y_err are all 1D arrays of length n, and u, v are the two changing coefficients.
However, this doesn't work, resulting in:
IndexError: invalid index to scalar variable.
This is because coeff is passed as a scalar rather than a vector; however, I don't know how to correct this.
Update
My aim is to take an array of coordinates
rs = [[[u0, v0], [u1, v0], ..., [un, v0]], ..., [[u0, vm], ..., [un, vm]]]
Where each coordinate is the coefficient parameters to be passed to the chi-squared method.
This should return a 2D array populated with Chi-Squared values for the appropriate coordinate
chi = [[c00, c10, ..., cn0], ..., [c0m, c1m, ..., cnm]]
I can then use this data to plot a colormap using imshow
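For context, the brute-force version of this would be a plain double loop over the grid (a sketch, assuming the arrays above are all defined):
chi = numpy.zeros((len(vs), len(us)))
for i, v in enumerate(vs):
    for j, u in enumerate(us):
        chi[i, j] = chi_squared([u, v], x, y, y_err)
That loop is what I'd like to replace with a vectorized call.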
Here's my first attempt to run your code:
In [44]: def f(x, coeff):
...: return coeff[0] + numpy.exp(coeff[1] * x)
...:
...: def chi_squared(coeff, x, y, y_err):
...: return numpy.sum((y - f(x, coeff) / y_err)**2)
(I had to remove the stray ( in that last line to make it parse)
First guess at possible array values:
In [45]: x = np.arange(3)
In [46]: y = x
In [47]: y_err = x
In [48]: us = np.linspace(0,1,3)
In [49]: rs = np.meshgrid(us,us)
In [50]: rs
Out[50]:
[array([[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ]]),
array([[ 0. , 0. , 0. ],
[ 0.5, 0.5, 0.5],
[ 1. , 1. , 1. ]])]
In [51]: chi_squared(rs, x, y, y_err)
/usr/local/bin/ipython3:5: RuntimeWarning: divide by zero encountered in true_divide
import sys
Out[51]: inf
oops, y_err shouldn't have a 0. Try again:
In [52]: y_err = np.array([1,1,1])
In [53]: chi_squared(rs, x, y, y_err)
Out[53]: 53.262865105526018
It also works if I turn the rs list into an array:
In [55]: np.array(rs).shape
Out[55]: (2, 3, 3)
In [56]: chi_squared(np.array(rs), x, y, y_err)
Out[56]: 53.262865105526018
Now, what was the purpose of vectorize?
The f function returns an (n,n) array:
In [57]: f(x, rs)
Out[57]:
array([[ 1. , 1.5 , 2. ],
[ 1. , 2.14872127, 3.71828183],
[ 1. , 3.21828183, 8.3890561 ]])
Let's modify chi_squared to give sum an axis parameter:
In [61]: def chi_squared(coeff, x, y, y_err, axis=None):
...: return numpy.sum((y - f(x, coeff) / y_err)**2, axis=axis)
In [62]: chi_squared(np.array(rs), x, y, y_err)
Out[62]: 53.262865105526018
In [63]: chi_squared(np.array(rs), x, y, y_err, axis=0)
Out[63]: array([ 3. , 6.49033483, 43.77253028])
In [64]: chi_squared(np.array(rs), x, y, y_err, axis=1)
Out[64]: array([ 1.25 , 5.272053 , 46.74081211])
I'm tempted to change the coeff to coeff0, coeff1, to give more control from the start on how this parameter is passed, but it probably doesn't make a difference.
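As an aside: if you really wanted np.vectorize here, numpy 1.12+ lets you declare coeff as a core 1-d argument via the signature parameter, so it is no longer passed in element by element. A sketch (this still loops in Python, so the broadcasting below is preferable):
chi_v = np.vectorize(chi_squared, signature='(p),(n),(n),(n)->()')
coeff_grid = np.stack(rs, axis=-1)  # (3, 3, 2): one (u, v) pair per grid point
chi_v(coeff_grid, x, y, y_err)      # (3, 3) array of chi-squared values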
update
Now that you've been more specific about how the coeff values relate to x, y etc, I see that this can be solved with simple broadcasting. No need to use np.vectorize.
First, define a grid that has a different size; that way we, and the code, won't think that each dimension of the coeff grid has anything to do with the x,y values.
In [134]: rs = np.meshgrid(np.linspace(0,1,4), np.linspace(0,1,5), indexing='ij')
In [135]: coeff=np.array(rs)
In [136]: coeff.shape
Out[136]: (2, 4, 5)
Now look at what f looks like when given this coeff and x.
In [137]: f(x, coeff[...,None]).shape
Out[137]: (4, 5, 3)
Each coeff[i] is effectively (4,5,1), while x is effectively (1,1,3), so the result is (4,5,3) (by broadcasting rules).
The same thing happens inside chi_squared, with the final step of sum on the last axis (size 3):
In [138]: chi_squared(coeff[...,None], x, y, y_err, axis=-1)
Out[138]:
array([[ 2. , 1.20406718, 1.93676807, 8.40646968,
32.99441808],
[ 2.33333333, 2.15923164, 3.84810347, 11.80559574,
38.73264336],
[ 3.33333333, 3.78106277, 6.42610554, 15.87138846,
45.13753532],
[ 5. , 6.06956056, 9.67077427, 20.60384785,
52.20909393]])
In [139]: _.shape
Out[139]: (4, 5)
One value for each pair of coeff values: the (4,5) grid.
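Putting it together for the colormap (a sketch on my part: it assumes matplotlib is available and reuses coeff, x, y, y_err from above):
import matplotlib.pyplot as plt
chi_grid = chi_squared(coeff[..., None], x, y, y_err, axis=-1)  # the (4, 5) grid from above
# with indexing='ij', axis 0 of chi_grid varies the first coefficient and
# axis 1 the second, so transpose to put the first coefficient on the x axis
plt.imshow(chi_grid.T, origin='lower', extent=(0, 1, 0, 1), aspect='auto')
plt.colorbar()
plt.show()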
Related
Though the linalg.lstsq documentation is available, I still find it hard to understand, since it is not very detailed.
x : {(N,), (N, K)} ndarray
Least-squares solution. If b is two-dimensional, the solutions are in
the K columns of x.
residuals : {(1,), (K,), (0,)} ndarray
Sums of residuals; squared Euclidean 2-norm for each column in b -
a*x. If the rank of a is < N or M <= N, this is an empty array. If b
is 1-dimensional, this is a (1,) shape array. Otherwise the shape is
(K,).
rank : int
Rank of matrix a.
s : (min(M, N),) ndarray
Singular values of a.
I tried examining the output, but I can only figure out that the rank is 2. For the rest, I don't understand why the values are what they are.
x = np.array([0, 1, 2, 3])
y = np.array([-1, 0.2, 0.9, 2.1])
A = np.vstack([x, np.ones(len(x))]).T
print(A)
print('-------------------------')
print(np.linalg.lstsq(A, y, rcond=None))
Gives
[[0. 1.]
[1. 1.]
[2. 1.]
[3. 1.]]
-------------------------
(array([ 1. , -0.95]), array([0.05]), 2, array([4.10003045, 1.09075677]))
I don't understand what the tuples "(N,), (N, K), (1,), (K,), (0,), (M, N)" represent in the documentation.
For example, np.linalg.lstsq(A, y, rcond=None)[0] is array([ 1. , -0.95]). How does it relate to {(N,), (N, K)}?
Those tuples are the possible shapes of inputs and outputs.
In your example, A.shape = (4, 2) and y.shape = (4,).
Looking at the documentation, M = 4, N = 2, and we are dealing with the cases without K.
So the output's shapes should be x.shape = (N,) = (2,), residuals.shape = (1,), s.shape = (min(M, N),) = (2,).
Let's look at the outputs one at a time
>>> x, residuals, rank, s = np.linalg.lstsq(A, y, rcond=None)
x is the least-squares solution of A @ x = y, so it minimises np.linalg.norm(A @ x - y)**2:
>>> A.T @ (A @ x - y)
array([1.72084569e-15, 2.16493490e-15])
(That A.T @ (A @ x - y) is numerically zero is just the normal-equations condition, A.T @ A @ x = A.T @ y, satisfied up to floating-point noise.) The other outputs are there to tell you how good this solution is and how susceptible it is to numerical errors.
residuals is the squared norm of the mismatch between A @ x and y:
>>> np.linalg.norm(A @ x - y)**2
0.04999999999999995
>>> residuals[0]
0.04999999999999971
rank is the rank of A:
>>> np.linalg.matrix_rank(A)
2
>>> rank
2
s contains the singular values of A:
>>> np.linalg.svd(A, compute_uv=False)
array([4.10003045, 1.09075677])
>>> s
array([4.10003045, 1.09075677])
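To see where K comes in, stack two right-hand sides into a two-column b (a quick sketch reusing the A and y from above):
>>> B = np.column_stack([y, 2*y])  # shape (4, 2), so K = 2
>>> x2, res2, rank2, s2 = np.linalg.lstsq(A, B, rcond=None)
>>> x2.shape, res2.shape           # (N, K) and (K,)
((2, 2), (2,))
Each column of x2 is the least-squares solution for the corresponding column of B.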
Are you familiar with the mathematical concepts?
I'm trying to fill a 2D array with complex(x, y), where x and y come from two 1D arrays:
xstep = np.linspace(xmin, xmax, Nx)
ystep = np.linspace(ymin, ymax, Ny)
However I can't figure out how to "spread" these values out on a 2D array.
So far my attempts are not really working out. I was hoping for something along the lines of:
result = np.array(xstep + (1j * ystep))
Maybe something using fromfunction, meshgrid, or full, but I can't quite make it work.
As an example, say I do this:
xstep = np.linspace(0, 1, 2) # array([0., 1.])
ystep = np.linspace(0, 1, 3) # array([0. , 0.5, 1. ])
I'm trying to construct an answer:
array([
[0+0j, 0+0.5j, 0+1j],
[1+0j, 1+0.5j, 1+1j]
])
Note that I am not married to linspace, so any quicker method would also do; it is just my natural starting point for creating this array, being new to NumPy.
In [4]: xstep = np.linspace(0, 1, 2)
In [5]: ystep = np.linspace(0, 1, 3)
In [6]: xstep[:, None] + 1j*ystep
Out[6]:
array([[0.+0.j , 0.+0.5j, 0.+1.j ],
[1.+0.j , 1.+0.5j, 1.+1.j ]])
xstep[:, None] is equivalent to xstep[:, np.newaxis] and its purpose is to add a new axis to xstep on the right. Thus, xstep[:, None] is a 2D array of shape (2, 1).
In [19]: xstep[:, None].shape
Out[19]: (2, 1)
xstep[:, None] + 1j*ystep is thus the sum of a 2D array of shape (2, 1) and a 1D array of shape (3,).
NumPy broadcasting resolves this apparent shape conflict by automatically adding new axes (of length 1) on the left. So, by NumPy broadcasting rules, 1j*ystep is promoted to an array of shape (1, 3).
(Notice that xstep[:, None] is required to explicitly add new axes on the right, but broadcasting will automatically add axes on the left. This is why 1j*ystep[None, :] was unnecessary though valid.)
Broadcasting further promotes both arrays to the common shape (2, 3) (but in a memory-efficient way, without copying the data). The values along the axes of length 1 are broadcasted repeatedly:
In [15]: X, Y = np.broadcast_arrays(xstep[:, None], 1j*ystep)
In [16]: X
Out[16]:
array([[0., 0., 0.],
[1., 1., 1.]])
In [17]: Y
Out[17]:
array([[0.+0.j , 0.+0.5j, 0.+1.j ],
[0.+0.j , 0.+0.5j, 0.+1.j ]])
You can use np.ogrid with an imaginary "step" to obtain linspace semantics:
y, x = np.ogrid[0:1:2j, 0:1:3j]
y + 1j*x
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
Here the ogrid line means: make an open 2D grid, with axis 0 running from 0 to 1 in 2 steps and axis 1 from 0 to 1 in 3 steps. The type of the slice "step" acts as a switch: if it is imaginary (in fact, anything of complex type), its absolute value is taken and the expression is treated like a linspace; otherwise, range semantics apply.
The return values
y, x
# (array([[0.],
# [1.]]), array([[0. , 0.5, 1. ]]))
are "broadcast ready", so in the example we can simply add them and obtain a full 2D grid.
If we allow ourselves an imaginary "stop" parameter in the second slice (which only works with linspace semantics, so depending on your style you may prefer to avoid it) this can be condensed to one line:
sum(np.ogrid[0:1:2j, 0:1j:3j])
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
A similar but potentially more performant method would be preallocation and then broadcasting:
out = np.empty((y.size, x.size), complex)
out.real[...], out.imag[...] = y, x
out
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
And another one using outer sum:
np.add.outer(np.linspace(0,1,2), np.linspace(0,1j,3))
# array([[0.+0.j , 0.+0.5j, 0.+1.j ],
# [1.+0.j , 1.+0.5j, 1.+1.j ]])
Use reshape(-1,1) for xstep as:
xstep = np.linspace(0, 1, 2) # array([0., 1.])
ystep = np.linspace(0, 1, 3) # array([0. , 0.5, 1. ])
result = np.array(xstep.reshape(-1,1) + (1j * ystep))
result
array([[0.+0.j , 0.+0.5j, 0.+1.j ],
[1.+0.j , 1.+0.5j, 1.+1.j ]])
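Note that reshape(-1, 1) is doing the same job as the [:, None] indexing in the first answer: both turn the (2,) xstep into a (2, 1) column that broadcasts against ystep:
np.array_equal(xstep.reshape(-1, 1), xstep[:, None])
# True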
Given x, I want to produce x, log(x) as a numpy array, where x has shape s and the result has shape (*s, 2). What's the neatest way to do this? x may just be a float, in which case I want a result with shape (2,).
An ugly way to do this is:
import numpy as np
x = np.asarray(x)
result = np.empty((*x.shape, 2))
result[..., 0] = x
result[..., 1] = np.log(x)
It's important to separate aesthetics from performance. Sometimes ugly code is
fast. In fact, that's the case here. Although creating an empty array and then
assigning values to slices may not look beautiful, it is fast.
import numpy as np
import timeit
import itertools as IT
import pandas as pd
def using_empty(x):
x = np.asarray(x)
result = np.empty(x.shape + (2,))
result[..., 0] = x
result[..., 1] = np.log(x)
return result
def using_concat(x):
x = np.asarray(x)
return np.concatenate([x, np.log(x)], axis=-1).reshape(x.shape+(2,), order='F')
def using_stack(x):
x = np.asarray(x)
return np.stack([x, np.log(x)], axis=x.ndim)
def using_ufunc(x):
return np.array([x, np.log(x)])
using_ufunc = np.vectorize(using_ufunc, otypes=[np.ndarray])
tests = [np.arange(600),
np.arange(600).reshape(20,30),
np.arange(960).reshape(8,15,8)]
# check that all implementations return the same result
for x in tests:
assert np.allclose(using_empty(x), using_concat(x))
assert np.allclose(using_empty(x), using_stack(x))
timing = []
funcs = ['using_empty', 'using_concat', 'using_stack', 'using_ufunc']
for test, func in IT.product(tests, funcs):
timing.append(timeit.timeit(
'{}(test)'.format(func),
setup='from __main__ import test, {}'.format(func), number=1000))
timing = pd.DataFrame(np.array(timing).reshape(-1, len(funcs)), columns=funcs)
print(timing)
yields the following timeit results on my machine:
using_empty using_concat using_stack using_ufunc
0 0.024754 0.025182 0.030244 2.414580
1 0.025766 0.027692 0.031970 2.408344
2 0.037502 0.039644 0.044032 3.907487
So using_empty is the fastest of the options tested, on these test arrays.
Note that np.stack does exactly what you want, so
np.stack([x, np.log(x)], axis=x.ndim)
looks reasonably pretty, but it is also the slowest of the three options tested.
Note that along with being much slower, using_ufunc returns an array of object dtype:
In [236]: x = np.arange(6)
In [237]: using_ufunc(x)
Out[237]:
array([array([ 0., -inf]), array([ 1., 0.]),
array([ 2. , 0.69314718]),
array([ 3. , 1.09861229]),
array([ 4. , 1.38629436]), array([ 5. , 1.60943791])], dtype=object)
which is not the same as the desired result:
In [240]: using_empty(x)
Out[240]:
array([[ 0. , -inf],
[ 1. , 0. ],
[ 2. , 0.69314718],
[ 3. , 1.09861229],
[ 4. , 1.38629436],
[ 5. , 1.60943791]])
In [238]: using_ufunc(x).shape
Out[238]: (6,)
In [239]: using_empty(x).shape
Out[239]: (6, 2)
When I try to calculate the Mahalanobis distance with the following Python code I get some NaN entries in the result. Do you have any insight into why this happens?
My data.shape = (181, 1500)
import numpy as np
from scipy.spatial.distance import pdist, squareform
data_log = np.log2(data + 1)  # a log transform that I usually apply to my data
data_centered = data_log - data_log.mean(0)  # zero centering
D = squareform(pdist(data_centered, 'mahalanobis'))
I also tried:
data_standard = data_centered / data_centered.std(0, ddof=1)
D = squareform( pdist(data_standard, 'mahalanobis' ) )
This also gave NaNs.
The input is not corrupted and other distances, such as correlation distance, can be computed just fine.
For some reason, when I reduce the number of features I stop getting NaNs. E.g., the following examples do not produce any NaNs:
D = squareform( pdist(data_centered[:,:200], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:480], 'mahalanobis' ) )
while these others produce NaNs:
D = squareform( pdist(data_centered[:,:300], 'mahalanobis' ) )
D = squareform( pdist(data_centered[:,180:600], 'mahalanobis' ) )
Any clue? Is this expected behaviour when some condition on the input is not satisfied?
You have fewer observations than features, so the covariance matrix V computed by the scipy code is singular. The code doesn't check this, and blindly computes the "inverse" of the covariance matrix. Because this numerically computed inverse is basically garbage, the quadratic form (x-y)^T inv(V) (x-y) (where x and y are observations) might turn out to be negative. Then the square root of that value results in nan.
For example, this array also results in a nan:
In [265]: x
Out[265]:
array([[-1. , 0.5, 1. , 2. , 2. ],
[ 2. , 1. , 2.5, -1.5, 1. ],
[ 1.5, -0.5, 1. , 2. , 2.5]])
In [266]: squareform(pdist(x, 'mahalanobis'))
Out[266]:
array([[ 0. , nan, 1.90394328],
[ nan, 0. , nan],
[ 1.90394328, nan, 0. ]])
Here's the Mahalanobis calculation done "by hand":
In [279]: V = np.cov(x.T)
In theory, V is singular; the following value is effectively 0:
In [280]: np.linalg.det(V)
Out[280]: -2.968550671342364e-47
But inv doesn't see the problem, and returns an inverse:
In [281]: VI = np.linalg.inv(V)
Let's compute the distance between x[0] and x[2] and verify that we get the same non-nan value (1.9039) returned by pdist when we use VI:
In [295]: delta = x[0] - x[2]
In [296]: np.dot(np.dot(delta, VI), delta)
Out[296]: 3.625
In [297]: np.sqrt(np.dot(np.dot(delta, VI), delta))
Out[297]: 1.9039432764659772
Here's what happens when we try to compute the distance between x[0] and x[1]:
In [300]: delta = x[0] - x[1]
In [301]: np.dot(np.dot(delta, VI), delta)
Out[301]: -1.75
Then the square root of that value gives nan.
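If you still need finite values despite the singular covariance, one workaround (my suggestion; pdist does not do this for you) is to compute a pseudo-inverse yourself and pass it in through the VI argument:
VI = np.linalg.pinv(np.cov(x.T))  # pseudo-inverse instead of inv
D = squareform(pdist(x, 'mahalanobis', VI=VI))
Because the pseudo-inverse of a symmetric positive semidefinite matrix is itself positive semidefinite, the quadratic form can no longer go negative. Whether the resulting "distance" is meaningful for your analysis is another question.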
In scipy 0.16 (to be released in June 2015), you will get an error instead of nan or garbage. The error message describes the problem:
In [4]: x = array([[-1. , 0.5, 1. , 2. , 2. ],
...: [ 2. , 1. , 2.5, -1.5, 1. ],
...: [ 1.5, -0.5, 1. , 2. , 2.5]])
In [5]: pdist(x, 'mahalanobis')
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-5-a3453ff6fe48> in <module>()
----> 1 pdist(x, 'mahalanobis')
/Users/warren/local_scipy/lib/python2.7/site-packages/scipy/spatial/distance.pyc in pdist(X, metric, p, w, V, VI)
1298 "singular. For observations with %d "
1299 "dimensions, at least %d observations "
-> 1300 "are required." % (m, n, n + 1))
1301 V = np.atleast_2d(np.cov(X.T))
1302 VI = _convert_to_double(np.linalg.inv(V).T.copy())
ValueError: The number of observations (3) is too small; the covariance matrix is singular. For observations with 5 dimensions, at least 6 observations are required.
I have x,y data:
import numpy as np
x = np.array([ 2.5, 1.25, 0.625, 0.3125, 0.15625, 0.078125])
y = np.array([ 2448636.,1232116.,617889.,310678.,154454.,78338.])
X = np.vstack((x, np.zeros(len(x))))
popt,res,rank,val = np.linalg.lstsq(X.T,y)
popt,res,rank,val
Gives me:
(array([ 981270.29919414, 0. ]),
array([], dtype=float64),
1,
array([ 2.88639894, 0. ]))
Why is the residuals array empty? If I add ones instead of zeros, the residuals are calculated:
X = np.vstack((x, np.ones(len(x)))) # added ones instead of zeros
popt,res,rank,val = np.linalg.lstsq(X.T,y)
popt,res,rank,val
(array([ 978897.28500355, 4016.82089552]),
array([ 42727293.12864216]),
2,
array([ 3.49623683, 1.45176681]))
Additionally, if I calculate the sum of squared residuals in Excel I get 9261214 if the intercept is set to zero and 5478137 if ones are added to x.
lstsq is going to have a tough time fitting to that column of zeros: any value of the corresponding parameter (presumably intercept) will do. That is also why the residuals come back empty: with the zeros column, A has rank 1, which is less than N = 2, and lstsq returns an empty residuals array whenever a is rank-deficient.
To fix the intercept to 0, if that's what you need to do, just send the x array, but make sure that it's the right shape for lstsq:
In [214]: popt,res,rank,val = np.linalg.lstsq(np.atleast_2d(x).T,y)
In [215]: popt
Out[215]: array([ 981270.29919414])
In [216]: res
Out[216]: array([ 92621214.2278382])
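You can check that this res is just the sum of squared residuals of the one-parameter fit:
In [217]: np.allclose(res, np.sum((x*popt[0] - y)**2))
Out[217]: True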