Empirical CDF function in Python with reasonable NaN behavior

I'm looking to compute the ECDF and am using this statsmodels function:
from statsmodels.distributions.empirical_distribution import ECDF
Looks good at first:
ECDF(np.array([0,1,2,3, 3, 3]))(np.array([0,1,2,3, 3,3]))
array([0.16666667, 0.33333333, 0.5       , 1.        , 1.        ,
       1.        ])
However, nan seems to be treated as infinity:
>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        ,
       1.        ])
Same result as with infinity:
>>> x = np.array([0, 1, 2, 3, np.inf, np.inf])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        ,
       1.        ])
Comparing with R:
> x <- c(0,1,2,3,NA,NA)
> x
[1] 0 1 2 3 NA NA
> ecdf(x)(x)
[1] 0.25 0.50 0.75 1.00 NA NA
What's the standard Python function for an ECDF that is NaN-aware?
Hot-wiring it like so does not seem to work:
def ecdf(x):
    return np.where(~np.isfinite(x),
                    np.full_like(x, np.nan),
                    ECDF(x[np.isfinite(x)])(x[np.isfinite(x)]))

>>> ecdf(x)
Traceback (most recent call last):
  ...
    ECDF(x[np.isfinite(x)])(x[np.isfinite(x)]))
  File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (7,) (7,) (4,)

The source code of statsmodels' ECDF is pleasantly brief (after stripping comments):
class ECDF(StepFunction):
    def __init__(self, x, side='right'):
        x = np.array(x, copy=True)
        x.sort()
        nobs = len(x)
        y = np.linspace(1./nobs, 1, nobs)
        super(ECDF, self).__init__(x, y, side=side, sorted=True)
Sorting the input samples via x.sort() moves all the np.nan-valued elements to the end, even past np.inf, which is why they appear to be treated as infinity:
bar=np.array([1, np.nan, 2, np.inf, 3])
bar.sort()
# bar is now array([ 1., 2., 3., inf, nan])
The reason np.nan isn't propagated is that ECDF's parent class, StepFunction, uses np.searchsorted to find the correct index into x and then looks up the corresponding value in y. For np.nan the search points past the last element of the array, so the subsequent lookup of self.y returns 1.
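You can see this directly with np.searchsorted (a minimal illustration; side='right' is ECDF's default):
import numpy as np

xs = np.array([1., 2., 3., np.inf, np.nan])  # sorted; nan lands last
# searchsorted uses the same total ordering as sort, so nan counts as
# the largest value and a nan query maps past the end of the array;
# StepFunction then reads off the last (largest) y value, which is 1.
print(np.searchsorted(xs, np.inf, side='right'))  # 4
print(np.searchsorted(xs, np.nan, side='right'))  # 5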
You can make it propagate np.nan with a simple change, which you can realize as a subclass or sibling.
from statsmodels.distributions.empirical_distribution import StepFunction
import numpy as np
class MyECDF(StepFunction):
    def __init__(self, x, side='right'):
        x = np.sort(x)
        # count the number of non-nan values instead of the full length
        nobs = np.count_nonzero(~np.isnan(x))
        # fill the y values corresponding to np.nan with np.nan
        y = np.full_like(x, np.nan)
        y[:nobs] = np.linspace(1./nobs, 1, nobs)
        super(MyECDF, self).__init__(x, y, side=side, sorted=True)
This small change will make the function behave in a way similar to R:
>>> from foobar import MyECDF
>>> from statsmodels.distributions.empirical_distribution import ECDF
>>> import numpy as np
>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        ,
       1.        ])
>>> MyECDF(x)(x)
array([0.25, 0.5 , 0.75, 1. , nan, nan])

You can use a masked array:
import numpy as np
import numpy.ma as ma
from statsmodels.distributions.empirical_distribution import ECDF

def ecdf(x):
    masked = ma.array(x, mask=np.isnan(x))
    return np.where(np.isnan(x),
                    np.full_like(x, np.nan),
                    ECDF(masked.compressed(), "right")(masked))
>>> ecdf(x)
array([0.25, 0.5 , 0.75, 1. , nan, nan])
Matches what R does natively.
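Incidentally, the hot-wired version from the question fails because np.where needs all three operands to broadcast, but the ECDF there is evaluated only at the finite points, so its result is shorter than the masks. A sketch that repairs this by writing the finite results into a nan-filled output (ecdf_nan is a hypothetical name):
import numpy as np
from statsmodels.distributions.empirical_distribution import ECDF

def ecdf_nan(x):
    # hypothetical helper: fit and evaluate the ECDF on the finite
    # values only, leaving nan inputs as nan in the output
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, np.nan)
    finite = ~np.isnan(x)
    out[finite] = ECDF(x[finite])(x[finite])
    return out

print(ecdf_nan(np.array([0, 1, 2, 3, np.nan, np.nan])))
# [0.25 0.5  0.75 1.    nan  nan]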

Related

How to divide an array by another array element-wise in numpy?

I have two arrays, and I want all the elements of one to be divided by the second. For example,
In [24]: a = np.array([1,2,3])
In [25]: b = np.array([1,2,3])
In [26]: a/b
Out[26]: array([1., 1., 1.])
In [27]: 1/b
Out[27]: array([1. , 0.5 , 0.33333333])
This is not the answer I want. The output I want is the following, where each element of a is divided by each element of b:
In [28]: c = []
In [29]: for i in a:
...: c.append(i/b)
...:
In [30]: c
Out[30]:
[array([1.        , 0.5       , 0.33333333]),
 array([2.        , 1.        , 0.66666667]),
 array([3.        , 1.5       , 1.        ])]
In [34]: np.array(c)
Out[34]:
array([[1.        , 0.5       , 0.33333333],
       [2.        , 1.        , 0.66666667],
       [3.        , 1.5       , 1.        ]])
But I don't like for loops; they're too slow for big data. Is there a function included in the numpy package, or any other good (faster) way, to solve this problem?
It is simple to do in pure numpy; you can use broadcasting to calculate the outer product (or any other outer operation) of two vectors:
import numpy as np
a = np.arange(1, 4)
b = np.arange(1, 4)
c = a[:,np.newaxis] / b
# array([[1. , 0.5 , 0.33333333],
# [2. , 1. , 0.66666667],
# [3. , 1.5 , 1. ]])
This works, since a[:,np.newaxis] increases the dimension of the (3,) shaped array a into a (3, 1) shaped array, which can be used for the desired broadcasting operation.
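For what it's worth, the same pairwise result can be obtained via the ufunc's outer method, which every binary ufunc exposes:
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# ufunc.outer applies the operation to all pairs (a[i], b[j]),
# equivalent to a[:, np.newaxis] / b
c = np.divide.outer(a, b)
# array([[1.        , 0.5       , 0.33333333],
#        [2.        , 1.        , 0.66666667],
#        [3.        , 1.5       , 1.        ]])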
First you need to reshape a into a 2D array (with the same shape as the output), then repeat it along the dimension you want to loop over. Vectorized division will then work.
>>> a.reshape(-1,1)
array([[1],
[2],
[3]])
>>> a.reshape(-1,1).repeat(b.shape[0], axis=1)
array([[1, 1, 1],
[2, 2, 2],
[3, 3, 3]])
>>> a.reshape(-1,1).repeat(b.shape[0], axis=1) / b
array([[1. , 0.5 , 0.33333333],
[2. , 1. , 0.66666667],
[3. , 1.5 , 1. ]])
# Transpose will let you do it the other way around, but then you just get 1 for everything
>>> a.reshape(-1,1).repeat(b.shape[0], axis=1).T
array([[1, 2, 3],
[1, 2, 3],
[1, 2, 3]])
>>> a.reshape(-1,1).repeat(b.shape[0], axis=1).T / b
array([[1., 1., 1.],
[1., 1., 1.],
[1., 1., 1.]])
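As a side note, np.tile builds the same repeated matrix if you prefer it over .repeat (a small sketch):
import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# tile the column vector b.shape[0] times along axis 1, same as
# a.reshape(-1, 1).repeat(b.shape[0], axis=1)
print(np.tile(a.reshape(-1, 1), (1, b.shape[0])) / b)
# same result as the repeat-based division above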
This should do the job:
import numpy as np
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
print(a.reshape(-1, 1) / b)
Output:
[[1.         0.5        0.33333333]
 [2.         1.         0.66666667]
 [3.         1.5        1.        ]]

Numpy apply function to array

For example, I have a function:
f1 = lambda x: x % 2
If I want to modify array = np.linspace(0, 5, 6) I can do f1(array). Everything works as expected:
[0. 1. 0. 1. 0. 1.]
If I change the function to:
f2 = lambda x: 0
print(f2(array))
gives me 0, while I expected [0. 0. 0. 0. 0. 0.]. How can I achieve consistent behavior?
You can use the code below to achieve the desired output:
import numpy as np
array = np.linspace(0, 5, 6)
f2 = lambda x: x-x
print(f2(array))
Slightly more explicit than the previous answer:
import numpy as np
array = np.linspace(0, 5, 6)
f2 = lambda x: np.zeros_like(x)
print(f2(array))
Documentation for numpy.zeros_like: Return an array of zeros with the same shape and type as a given array.
To iterate over an array, evaluating the function for every element and storing the results, a list comprehension works consistently:
import numpy as np
array = np.linspace(0, 5, 6)
f1 = lambda x: x % 2
f2 = lambda x: 0
print ([f1(x) for x in array])
[0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
print ([f2(x) for x in array])
[0, 0, 0, 0, 0, 0]
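Alternatively, if you want f2 to keep its scalar form, np.vectorize wraps it so it is applied element-wise and always returns an array (note the output dtype is inferred from the first call):
import numpy as np

array = np.linspace(0, 5, 6)
f2 = np.vectorize(lambda x: 0)  # wraps the scalar function
print(f2(array))  # [0 0 0 0 0 0]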

Re-assign values with multiple if statements Numpy

I have a large Numpy ndarray, here is a sample of that:
myarray = np.array([[1.01,9.4,0.0,6.9,5.7],[1.9,2.6,np.nan,4.7,-2.45],[np.nan,0.2,0.3,4.2,15.1]])
myarray
array([[ 1.01, 9.4 , 0.0 , 6.9 , 5.7 ],
[ 1.9 , 2.6 , nan, 4.7 , -2.45],
[ nan, 0.2 , 0.3 , 4.2 , 15.1 ]])
As you can see, my array contains floats: positive, negative, zeros and NaNs. I would like to re-assign (re-classify) the values in the array based on multiple if statements. I've read many answers and docs, but all of those refer to just one or two conditions, which can easily be resolved using np.where, for example.
I have multiple condition, for the sake of simplicity let's say I have four conditions (the desired solution should be able to handle more conditions). My conditions are:
if x > 6*y:
    x = 3
elif x < 4*z:
    x = 2
elif x == np.nan:
    x = np.nan  # maybe pass is better?
else:
    x = 0
where x is a value in the array, and y and z are variables that change between arrays. For example, array #1 might have y=5, z=2, while array #2 might have y=0.9, z=0.5, etc. The condition for np.nan just means that if a value is nan, do not alter it; keep it nan.
Note that this needs to be executed at the same time, because if I use several np.where calls one after the other, then condition #2 will overwrite condition #1.
I tried to create a function and then apply it to the array, but with no success. It seems that in order to apply a function to an array, the function must take only one argument (the array), whereas the function I want would need three arguments: the array, and the y and z values.
What would be the most efficient way to achieve my goal?
In [11]: myarray = np.array([[1.01,9.4,0.0,6.9,5.7],[1.9,2.6,np.nan,4.7,-2.45],[np.nan,0.2,0.3,4.2,15.1]])
In [13]: y, z = 0.9, 0.5
If I perform one of your tests on the whole array:
In [14]: mask1 = myarray >6*y
/usr/local/bin/ipython3:1: RuntimeWarning: invalid value encountered in greater
It's the np.nan values that cause this warning.
So let's first identify those nan values (and replace them):
In [25]: mask0 = np.isnan(myarray)
In [26]: mask0
Out[26]:
array([[False, False, False, False, False],
[False, False, True, False, False],
[ True, False, False, False, False]])
In [27]: arr = myarray.copy()
In [28]: arr[mask0] = 0 # temp replace the nan with 0
myarray == np.nan does not work; it produces False everywhere.
arr = np.nan_to_num(myarray) also works, replacing the nan with 0.
Now find the masks for the y and z tests. It doesn't matter how these handle the original nan (now 0). Calculate both masks first to reduce mutual interference.
In [29]: mask1 = arr > 6*y
In [30]: mask2 = arr < 4*z
In [31]: arr[mask1]
Out[31]: array([ 9.4, 6.9, 5.7, 15.1])
In [32]: arr[mask2]
Out[32]: array([ 1.01, 0. , 1.9 , 0. , -2.45, 0. , 0.2 , 0.3 ])
In [33]: arr[mask0]
Out[33]: array([0., 0.])
Since you want everything else to be 0, let's initialize an array of zeros:
In [34]: res = np.zeros_like(arr)
Now apply the 3 masks:
In [35]: res[mask1] = 3
In [36]: res[mask2] = 2
In [37]: res[mask0] = np.nan
In [38]: res
Out[38]:
array([[ 2., 3., 2., 3., 3.],
[ 2., 0., nan, 0., 2.],
[nan, 2., 2., 0., 3.]])
I could have applied the masks to arr:
In [40]: arr[mask1] = 3 # np.where(mask1, 3, arr) should also work
In [41]: arr[mask2] = 2
In [42]: arr[mask0] = np.nan
In [43]: arr
Out[43]:
array([[2. , 3. , 2. , 3. , 3. ],
[2. , 2.6, nan, 4.7, 2. ],
[nan, 2. , 2. , 4.2, 3. ]])
With this variant I still have to use some logic to combine the masks to identify the slots that are supposed to be 0.
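For the record, the question's if/elif chain maps naturally onto np.select, which checks the conditions in order just like if/elif. A sketch using the sample array and the y, z values from this answer:
import numpy as np

myarray = np.array([[1.01, 9.4, 0.0, 6.9, 5.7],
                    [1.9, 2.6, np.nan, 4.7, -2.45],
                    [np.nan, 0.2, 0.3, 4.2, 15.1]])
y, z = 0.9, 0.5

nan_mask = np.isnan(myarray)
arr = np.where(nan_mask, 0, myarray)  # neutralize nan before comparing

# conditions are evaluated in order, like an if/elif chain; putting
# the nan test first keeps nan entries untouched
res = np.select([nan_mask, arr > 6 * y, arr < 4 * z],
                [np.nan, 3, 2],
                default=0)
# res matches Out[38] above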

Python - Binning x,y,z values on a 2D grid

I have a list of z points associated to pairs x,y, meaning that for example
x y z
3.1 5.2 1.3
4.2 2.3 9.3
5.6 9.8 3.5
and so on. The total number of z values is relatively high, around 10000.
I would like to bin my data, in the following sense:
1) I would like to split the x and y values into cells, so as to make a 2-dimensional grid in x and y. If I have Nx cells for the x axis and Ny for the y axis, I would then have Nx*Ny cells on the grid. For example, the first bin for x could range from 1.0 to 2.0, the second from 2.0 to 3.0, and so on.
2) For each of these cells in the 2-dimensional grid, I would then need to count how many points fall into it and sum all their z values. This gives me a numerical value associated with each cell.
I thought about using binned_statistic from scipy.stats, but I have no idea how to set the options to accomplish this task. Any suggestions? Tools other than binned_statistic are also welcome.
Assuming I understand you correctly, you can get what you need by exploiting the expand_binnumbers parameter of binned_statistic_2d, as follows.
from scipy.stats import binned_statistic_2d
import numpy as np
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
z = [2.,3.,5.,7.]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = binned_statistic_2d(x, y, None, 'count', bins=[binx, biny],
                          expand_binnumbers=True)
print(ret.statistic)
print(ret.binnumber)
sums = np.zeros((len(binx) - 1, len(biny) - 1))
for i in range(len(x)):
    m = ret.binnumber[0][i] - 1
    n = ret.binnumber[1][i] - 1
    sums[m][n] += z[i]
print(sums)
This is just an expansion of one of the examples. Here's the output.
[[2. 1.]
 [1. 0.]]
[[1 1 1 2]
 [1 2 1 1]]
[[7. 3.]
 [7. 0.]]
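As an aside, the counting-then-looping step can likely be skipped: binned_statistic_2d also accepts a 'sum' statistic, so passing z directly should give the per-cell sums in one call (np.histogram2d with weights=z is an equivalent numpy-only route):
from scipy.stats import binned_statistic_2d
import numpy as np

x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
z = [2., 3., 5., 7.]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]

# sum the z values falling into each (x, y) cell in one call
sums = binned_statistic_2d(x, y, z, 'sum', bins=[binx, biny]).statistic
# equivalently: sums, _, _ = np.histogram2d(x, y, bins=[binx, biny], weights=z)
print(sums)
# [[7. 3.]
#  [7. 0.]]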
Establish the edges of the cells, iterate over cell edges and use boolean indexing to extract the z values in each cell, keep the sums in a list, convert the list and reshape it.
import itertools
import numpy as np

x = np.array([0.1, 0.1, 0.1, 0.6, 1.2, 2.1])
y = np.array([2.1, 2.6, 2.1, 2.1, 3.4, 4.7])
z = np.array([2., 3., 5., 7., 10, 20])

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2, s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

minx, maxx = int(min(x)), int(max(x)) + 1
miny, maxy = int(min(y)), int(max(y)) + 1

result = []
x_edges = pairwise(range(minx, maxx + 1))
for xleft, xright in x_edges:
    xmask = np.logical_and(x >= xleft, x < xright)
    y_edges = pairwise(range(miny, maxy + 1))
    for yleft, yright in y_edges:
        ymask = np.logical_and(y >= yleft, y < yright)
        cell = z[np.logical_and(xmask, ymask)]
        result.append(cell.sum())
result = np.array(result).reshape((maxx - minx, maxy - miny))
>>> result
array([[17.,  0.,  0.],
       [ 0., 10.,  0.],
       [ 0.,  0., 20.]])
Unfortunately, there's no numpy vectorization magic here.

numpy.meshgrid explanation

Would someone care to explain the meshgrid method? I cannot wrap my mind around it. The example is from the [SciPy][1] site:
import numpy as np
nx, ny = (3, 2)
x = np.linspace(0, 1, nx)
print ("x =", x)
y = np.linspace(0, 1, ny)
print ("y =", y)
xv, yv = np.meshgrid(x, y)
print ("xv_1 =", xv)
print ("yv_1 =", yv)
xv, yv = np.meshgrid(x, y, sparse=True) # make sparse output arrays
print ("xv_2 =", xv)
print ("yv_2 =", yv)
The printout is:
x = [ 0. 0.5 1. ]
y = [ 0. 1.]
xv_1 = [[ 0. 0.5 1. ]
[ 0. 0.5 1. ]]
yv_1 = [[ 0. 0. 0.]
[ 1. 1. 1.]]
xv_2 = [[ 0. 0.5 1. ]]
yv_2 = [[ 0.]
[ 1.]]
Why are arrays xv_1 and yv_1 formed like this? Thanks :)
[1]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html#numpy.meshgrid
In [214]: nx, ny = (3, 2)
In [215]: x = np.linspace(0, 1, nx)
In [216]: x
Out[216]: array([ 0. , 0.5, 1. ])
In [217]: y = np.linspace(0, 1, ny)
In [218]: y
Out[218]: array([ 0., 1.])
Using unpacking to better see the 2 arrays produced by meshgrid:
In [225]: X,Y = np.meshgrid(x, y)
In [226]: X
Out[226]:
array([[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ]])
In [227]: Y
Out[227]:
array([[ 0., 0., 0.],
[ 1., 1., 1.]])
And for the sparse version: notice that X1 looks like one row of X (but 2D), and Y1 like one column of Y.
In [228]: X1,Y1 = np.meshgrid(x, y, sparse=True)
In [229]: X1
Out[229]: array([[ 0. , 0.5, 1. ]])
In [230]: Y1
Out[230]:
array([[ 0.],
[ 1.]])
When used in calculations like addition and multiplication, both forms behave the same, thanks to numpy's broadcasting.
In [231]: X+Y
Out[231]:
array([[ 0. , 0.5, 1. ],
[ 1. , 1.5, 2. ]])
In [232]: X1+Y1
Out[232]:
array([[ 0. , 0.5, 1. ],
[ 1. , 1.5, 2. ]])
The shapes might also help:
In [235]: X.shape, Y.shape
Out[235]: ((2, 3), (2, 3))
In [236]: X1.shape, Y1.shape
Out[236]: ((1, 3), (2, 1))
The X and Y have more values than are actually needed for most uses, but usually there isn't much of a penalty for using them instead of the sparse versions.
Your linear spaced vectors x and y defined by linspace use 3 and 2 points respectively.
These linearly spaced vectors are then used by the meshgrid function to create a 2D grid of points covering every combination of the x and y coordinates; with nx = 3 and ny = 2 that is 3*2 = 6 points, stored in arrays of shape (2, 3).
The output of meshgrid is a pair of coordinate matrices that hold, in each cell, the x and y coordinates of one point of your space.
This is created as follows:
# dummy implementation
def meshgrid_custom(x, y):
    xv = np.zeros((len(x), len(y)))
    yv = np.zeros((len(x), len(y)))
    for i, ix in enumerate(x):
        for j, jy in enumerate(y):
            xv[i, j] = ix
            yv[i, j] = jy
    return xv.T, yv.T
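A quick sanity check that this dummy matches numpy's output (using the x and y from the question, and assuming meshgrid_custom as defined above):
import numpy as np

x = np.linspace(0, 1, 3)
y = np.linspace(0, 1, 2)
xv, yv = meshgrid_custom(x, y)
Xv, Yv = np.meshgrid(x, y)
print(np.array_equal(xv, Xv), np.array_equal(yv, Yv))  # True True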
So, for example the point at the location (1,1) has the coordinates:
x = xv_1[1,1] = 0.5
y = yv_1[1,1] = 1.0
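The usual payoff of these coordinate matrices is evaluating a function of two variables over the whole grid at once, e.g. (a small sketch):
import numpy as np

x = np.linspace(0, 1, 3)
y = np.linspace(0, 1, 2)
X, Y = np.meshgrid(x, y)

# Z[j, i] == f(x[i], y[j]); the shape (2, 3) follows (ny, nx)
Z = X**2 + Y**2
# array([[0.  , 0.25, 1.  ],
#        [1.  , 1.25, 2.  ]])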
