scipy stats geometric mean returns NaN - python

I am using scipy's gmean() function to determine the geometric mean of a numpy array that contains voltage outputs. The range of the numbers is between -80.0 and 30.0. Currently, the numpy array is two dimensional, giving the voltage for two different measurements.
array([[-60.0924, -60.0882],
[-80. , -80. ],
[-80. , -80. ],
...,
[-60.9221, -66.0748],
[-61.0971, -65.9637],
[-61.2706, -65.8803]])
However, I get NaN when I take the geometric mean:
>>> from scipy import stats as scistats
>>> scistats.gmean(voltages)
array([ NaN, NaN])
Does anybody have an idea what might be causing this? Am I doing something wrong?
Thanks in advance!

The geometric mean is not defined for negative values. scipy.stats.gmean computes it via logarithms, and the logarithm of a negative number is NaN, so the NaN propagates through the mean and the final exponential.
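A minimal sketch reproducing the behaviour, plus one possible workaround (an assumption on my part: if all of your voltages are negative, you could take the geometric mean of the magnitudes and restore the sign afterwards):
import numpy as np
from scipy import stats as scistats

voltages = np.array([[-60.0924, -60.0882],
                     [-80.0,    -80.0]])

# gmean is computed as exp(mean(log(x))); log of a negative value is nan,
# so any column containing a negative number yields nan.
print(np.log(voltages[:, 0]))    # [nan nan] (with a RuntimeWarning)
print(scistats.gmean(voltages))  # [nan nan]

# Possible workaround (only sensible if a signed geometric mean is what you want):
print(-scistats.gmean(np.abs(voltages)))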

Calculate squared deviation from the mean for each element in array

I have an array with shape (128, 116, 116, 1), where the 1st dimension is the number of subjects and the 2nd and 3rd are the data.
I am trying to calculate the variance (squared deviation from the mean) at each position (i.e. (0,0), (0,1), (1,0), ..., up to (116,116)) across all 128 subjects, resulting in an array with shape (116, 116).
Can anyone tell me how to accomplish this?
Thank you!
Let's say we have a multidimensional list a of shape (3, 2, 2):
import numpy as np

a = [
    [[1, 1],
     [1, 1]],
    [[2, 2],
     [2, 2]],
    [[3, 3],
     [3, 3]],
]
np.var(a, axis = 0) # results in:
> array([[0.66666667, 0.66666667],
> [0.66666667, 0.66666667]])
If you want to compute the variance across all 128 subjects (which would be axis 0) efficiently, I don't see a way to do it with the statistics package, since it doesn't accept nested lists as input; you would have to write your own loops over the subjects.
With the numpy.var function, however, we can easily calculate the variance at each 'datapoint' (tuple of indices) across all 128 subjects, as sketched below.
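A minimal sketch for the shape described in the question (the data here is random and only stands in for the real array):
import numpy as np

data = np.random.random((128, 116, 116, 1))  # (subjects, 116, 116, 1)

variance = np.var(data, axis=0)       # variance across subjects -> shape (116, 116, 1)
variance = variance.squeeze(axis=-1)  # drop the trailing singleton axis -> (116, 116)
print(variance.shape)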
Side note: you mentioned statistics.variance. However, that is only meant for a sample drawn from a population, as mentioned in the documentation you linked. If you were to go the manual route, you would use statistics.pvariance instead, since we are calculating the variance of the whole dataset.
The difference can be seen here:
statistics.pvariance([1,2,3])
> 0.6666666666666666 # (correct)
statistics.variance([1,2,3])
> 1 # (incorrect)
np.var([1,2,3])
> 0.6666666666666666 # (np.var also gives the correct output)

scipy.optimize.bisect on a numpy array

I have a numpy array of floats which, when plotted, looks like this: the red circles are the original values, the blue crosses are a linear interpolation using numpy.interp.
I would like to find the abscissa of the zero crossing of this numpy array (red circle) using, for example, scipy.optimize.bisect. Since this is a numpy array (and not a function), I cannot pass it directly to scipy.optimize.bisect, so I was thinking of passing a function that interpolates the numpy array to bisect. Here is the code I am using at the moment:
def Inter_F(x, xp, fp):
    return np.interp(x, xp, fp)

Numpyroot = scp.optimize.bisect(Inter_F, 0, 9, args=(XNumpy, YNumpy))
I find a value that seems correct, Numpyroot = 3.376425289196618.
I am wondering:
whether this is the correct way to use scipy.optimize.bisect on a numpy array, especially since I am going to do this 10^6 times on different sets of numpy values;
whether forcing a linear interpolation influences the root that bisect finds, and if so, whether there are better choices.
Here are the two numpy arrays:
XNumpy = array([ 0. , 1.125, 2.25 , 3.375, 4.5 , 5.625, 6.75 , 7.875, 9. ])
YNumpy = array([ -2.70584242e+04, -2.46925289e+04, -1.53211676e+04,
-2.30000000e+01, 1.81312104e+04, 3.41662461e+04,
4.80466863e+04, 5.75113178e+04, 6.41718009e+04])
I think what you are doing is correct. However, there is a more concise way.
import numpy as np
from scipy.interpolate import interp1d
XNumpy = np.array([0., 1.125, 2.25, 3.375, 4.5, 5.625, 6.75, 7.875, 9.])
YNumpy = np.array([
-2.70584242e+04, -2.46925289e+04, -1.53211676e+04,
-2.30000000e+01, 1.81312104e+04, 3.41662461e+04,
4.80466863e+04, 5.75113178e+04, 6.41718009e+04
])
invf = interp1d(YNumpy, XNumpy)
print(invf(0))
Result:
array(3.376425289199028)
Here I use scipy.interpolate.interp1d to get a function back. I also interpolate the inverse relation (x as a function of y) so that the abscissa of the zero crossing is obtained directly. Of course you can do the same trick with np.interp; I just like scipy.interpolate.interp1d because it returns a function, so I can calculate the x value for any given y value.
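If the 10^6 repetitions mentioned in the question become a bottleneck, one further simplification (a sketch, assuming the y-values are monotonically increasing, as they are in this example) is to skip the root finder entirely and evaluate the inverse interpolation with np.interp:
import numpy as np

XNumpy = np.array([0., 1.125, 2.25, 3.375, 4.5, 5.625, 6.75, 7.875, 9.])
YNumpy = np.array([-2.70584242e+04, -2.46925289e+04, -1.53211676e+04,
                   -2.30000000e+01, 1.81312104e+04, 3.41662461e+04,
                   4.80466863e+04, 5.75113178e+04, 6.41718009e+04])

# np.interp requires its xp argument (here the y-values) to be increasing.
root = np.interp(0.0, YNumpy, XNumpy)
print(root)  # same zero crossing as the bisect/interp1d approaches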

Creating density estimate in numpy

import numpy as np
np.random.random((5,5))
array([[ 0.26045197, 0.66184973, 0.79957904, 0.82613958, 0.39644677],
[ 0.09284838, 0.59098542, 0.13045167, 0.06170584, 0.01265676],
[ 0.16456109, 0.87820099, 0.79891448, 0.02966868, 0.27810629],
[ 0.03037986, 0.31481138, 0.06477025, 0.37205248, 0.59648463],
[ 0.08084797, 0.10305354, 0.72488268, 0.30258304, 0.230913 ]])
I would like to create a 2D density estimate from this 2D array such that similar values imply higher density. Is there a way to do this in numpy?
I agree, it is indeed not entirely clear what you mean.
The numpy.histogram function gives you a density estimate for the values in an array when you pass density=True.
import numpy as np

array = np.random.random((5, 5))
print(array)

density = np.histogram(array, density=True)
print(density)
You can then plot the density, for example with Matplotlib.
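For example, a minimal plotting sketch (this assumes Matplotlib is available; np.histogram returns a (counts, bin_edges) tuple):
import numpy as np
import matplotlib.pyplot as plt

array = np.random.random((5, 5))
counts, bin_edges = np.histogram(array, density=True)
centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])  # bin midpoints
plt.bar(centers, counts, width=np.diff(bin_edges))
plt.xlabel("value")
plt.ylabel("density")
plt.show()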
There is a great discussion on this here: How does numpy.histogram() work?

Construct 2 time series random variables with fixed correlation

Is there an easy way to generate two time-series with a fixed correlation? For instance 0.5.
Does anyone know a solution in R or Python?
Thanks!
This question is quite general, I think; it is not limited to time series. What you are asking for is a 2D random variable with known covariance: r = 0.5, std1 = 1 and std2 = 2 translate to the covariance matrix [[1, 1], [1, 4]] (the off-diagonal entry is r * std1 * std2). Therefore, if we assume the data is multivariate normally distributed, we can generate such a random variable:
In [42]:
import numpy as np
val=np.random.multivariate_normal((0,0),[[1,1],[1,4]],1000)
In [43]:
np.corrcoef(val.T)
Out[43]:
array([[ 1. , 0.488883],
[ 0.488883, 1. ]])
In [44]:
np.cov(val.T)
Out[44]:
array([[ 1.03693888, 0.96490767],
[ 0.96490767, 3.75671707]])
In [45]:
val=np.random.multivariate_normal((0,0),[[1,1],[1,4]],10)
In [46]:
np.corrcoef(val.T)
Out[46]:
array([[ 1. , 0.56807297],
[ 0.56807297, 1. ]])
In [48]:
val[:,0]
Out[48]:
array([-0.77425116, 0.35758601, -1.21668939, -0.95127533, -0.5714381 ,
0.87530824, 0.9594394 , 1.30123373, 1.92511929, 0.98070711])
In [49]:
val[:,1]
Out[49]:
array([-1.75698285, 2.24011423, -3.5129411 , -1.33889305, 2.32720257,
0.53750133, 3.23935645, 2.96819425, -0.72551024, 3.0743096 ])
As this example shows, if your sample size is small, the resulting random variables may deviate considerably from r = 0.5.
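A small self-contained sketch of the same idea, wrapped in a hypothetical helper that builds the covariance matrix from r and the two standard deviations:
import numpy as np

def correlated_series(r, std1, std2, n, seed=None):
    # Draw two correlated normal series of length n.
    rng = np.random.default_rng(seed)
    cov = [[std1**2, r * std1 * std2],
           [r * std1 * std2, std2**2]]
    x, y = rng.multivariate_normal((0, 0), cov, size=n).T
    return x, y

x, y = correlated_series(0.5, 1.0, 2.0, 100_000)
print(np.corrcoef(x, y)[0, 1])  # close to 0.5 for a large sample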

Euclidean Distances between points

I have an array of points in numpy:
points = rand(dim, n_points)
And I want to:
Calculate the L2 norm (Euclidean distance) between a certain point and all other points;
Calculate all pairwise distances;
preferably all in numpy and with no for loops. How can one do this?
If you're willing to use SciPy, the scipy.spatial.distance module (the functions cdist and/or pdist) does exactly what you want, with all the looping done in C (a sketch follows). You can also do it with broadcasting, but there is some extra memory overhead.
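A minimal sketch of the SciPy route (assuming the points are stored one per row, i.e. shape (n_points, dim), so transpose the column-wise array from the question first):
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

pts = np.random.rand(5, 3)           # 5 points in 3 dimensions, one point per row
one = pts[0][np.newaxis, :]          # a single query point, shape (1, 3)

d_to_one = cdist(one, pts)           # distances from that point to all others, shape (1, 5)
d_pairwise = squareform(pdist(pts))  # full pairwise distance matrix, shape (5, 5)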
This might help with the second part:
import numpy as np

p = np.random.rand(3, 4)  # column-wise, so each of the 4 points has length 3
np.sqrt(np.sum((p[:, np.newaxis, :] - p[:, :, np.newaxis])**2, axis=0))
which gives
array([[ 0. , 0.37355868, 0.64896708, 1.14974483],
[ 0.37355868, 0. , 0.6277216 , 1.19625254],
[ 0.64896708, 0.6277216 , 0. , 0.77465192],
[ 1.14974483, 1.19625254, 0.77465192, 0. ]])
if p was
array([[ 0.46193242, 0.11934744, 0.3836483 , 0.84897951],
[ 0.19102709, 0.33050367, 0.36382587, 0.96880535],
[ 0.84963349, 0.79740414, 0.22901247, 0.09652746]])
and you can check one of the entries via
np.sqrt(np.sum((p[:, 0] - p[:, 2])**2))
0.64896708223796884
The trick is to insert np.newaxis and then let broadcasting do the work.
Good luck!
