Binning x, y, z values on a 2D grid

I have a list of z values associated with pairs (x, y), meaning that for example
x y z
3.1 5.2 1.3
4.2 2.3 9.3
5.6 9.8 3.5
and so on. The total number of z values is relatively high, around 10000.
I would like to bin my data in the following sense:
1) I would like to split the x and y values into cells, so as to make a 2-dimensional grid in x, y. If I have Nx cells for the x axis and Ny for the y axis, I would then have Nx*Ny cells on the grid. For example, the first bin for x could range from 1. to 2., the second from 2. to 3., and so on.
2) For each cell in this 2-dimensional grid, I would then need to count how many points fall into it and sum their z values. This gives me a numerical value associated with each cell.
I thought about using binned_statistic from scipy.stats, but I have no idea how to set the options to accomplish my task. Any suggestions? Tools other than binned_statistic are also welcome.

Assuming I understand correctly, you can get what you need by exploiting the expand_binnumbers parameter of binned_statistic_2d, as follows.
from scipy.stats import binned_statistic_2d
import numpy as np
x = [0.1, 0.1, 0.1, 0.6]
y = [2.1, 2.6, 2.1, 2.1]
z = [2.,3.,5.,7.]
binx = [0.0, 0.5, 1.0]
biny = [2.0, 2.5, 3.0]
ret = binned_statistic_2d(x, y, None, 'count', bins=[binx, biny],
                          expand_binnumbers=True)
print(ret.statistic)
print(ret.binnumber)
sums = np.zeros((len(binx) - 1, len(biny) - 1))
for i in range(len(x)):
    # binnumber is 1-based, so shift to 0-based cell indices
    m = ret.binnumber[0][i] - 1
    n = ret.binnumber[1][i] - 1
    sums[m][n] += z[i]
print(sums)
This is just an expansion of one of the examples. Here's the output.
[[ 2.  1.]
 [ 1.  0.]]
[[1 1 1 2]
 [1 2 1 1]]
[[ 7.  3.]
 [ 7.  0.]]
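Note that the loop can be skipped entirely: binned_statistic_2d also accepts the z values directly, with 'sum' as the statistic. A minimal sketch, continuing the snippet above:
ret_sum = binned_statistic_2d(x, y, z, 'sum', bins=[binx, biny])
print(ret_sum.statistic)  # the same per-cell z sums that the loop computes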

Establish the edges of the cells, iterate over cell edges and use boolean indexing to extract the z values in each cell, keep the sums in a list, convert the list and reshape it.
import itertools
import numpy as np

x = np.array([0.1, 0.1, 0.1, 0.6, 1.2, 2.1])
y = np.array([2.1, 2.6, 2.1, 2.1, 3.4, 4.7])
z = np.array([2., 3., 5., 7., 10., 20.])

def pairwise(iterable):
    "s -> (s0,s1), (s1,s2), (s2,s3), ..."
    a, b = itertools.tee(iterable)
    next(b, None)
    return zip(a, b)

minx, maxx = int(min(x)), int(max(x)) + 1
miny, maxy = int(min(y)), int(max(y)) + 1
result = []
x_edges = pairwise(range(minx, maxx + 1))
for xleft, xright in x_edges:
    xmask = np.logical_and(x >= xleft, x < xright)
    y_edges = pairwise(range(miny, maxy + 1))
    for yleft, yright in y_edges:
        ymask = np.logical_and(y >= yleft, y < yright)
        cell = z[np.logical_and(xmask, ymask)]
        result.append(cell.sum())
result = np.array(result).reshape((maxx - minx, maxy - miny))
>>> result
array([[ 17.,   0.,   0.],
       [  0.,  10.,   0.],
       [  0.,   0.,  20.]])
Unfortunately, no numpy vectorization magic in this approach; it falls back to an explicit Python loop over the cells.
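That said, NumPy can do the whole thing in one call: np.histogram2d accepts a weights parameter, and passing z as the weights makes each cell accumulate the sum of its z values. A minimal sketch, reusing the arrays above (the integer bin edges mirror the unit-width cells of the loop):
binx = np.arange(0, 4)  # x edges 0, 1, 2, 3
biny = np.arange(2, 6)  # y edges 2, 3, 4, 5
sums2, _, _ = np.histogram2d(x, y, bins=[binx, biny], weights=z)
print(sums2)  # same 3x3 array of per-cell z sums as `result` above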

Empirical CDF function in python with reasonable NaN behavior

I'm looking to compute the ECDF and am using this statsmodels function:
from statsmodels.distributions.empirical_distribution import ECDF
Looks good at first:
ECDF(np.array([0, 1, 2, 3, 3, 3]))(np.array([0, 1, 2, 3, 3, 3]))
array([0.16666667, 0.33333333, 0.5       , 1.        , 1.        , 1.        ])
However, nan seems to be treated as infinity:
>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        , 1.        ])
The result is the same as with infinity:
>>> x = np.array([0, 1, 2, 3, np.inf, np.inf])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        , 1.        ])
Comparing with R:
> x <- c(0,1,2,3,NA,NA)
> x
[1] 0 1 2 3 NA NA
> ecdf(x)(x)
[1] 0.25 0.50 0.75 1.00 NA NA
What's the standard python function for ecdf that is nan aware?
Hot-wiring it like so does not seem to work:
def ecdf(x):
    return np.where(~np.isfinite(x),
                    np.full_like(x, np.nan),
                    ECDF(x[np.isfinite(x)])(x[np.isfinite(x)]))

>>> ecdf(x)
Traceback (most recent call last):
  ...
  File "<__array_function__ internals>", line 6, in where
ValueError: operands could not be broadcast together with shapes (7,) (7,) (4,)
The source code of statsmodel's ECDF is pleasantly brief (after stripping comments):
class ECDF(StepFunction):
    def __init__(self, x, side='right'):
        x = np.array(x, copy=True)
        x.sort()
        nobs = len(x)
        y = np.linspace(1./nobs, 1, nobs)
        super(ECDF, self).__init__(x, y, side=side, sorted=True)
Sorting the input samples via x.sort() moves all np.nan-valued elements to the end, even past np.inf, which is why they appear to be treated as infinity:
bar = np.array([1, np.nan, 2, np.inf, 3])
bar.sort()
# bar is now array([ 1.,  2.,  3., inf, nan])
The reason np.nan isn't propagated is that ECDF's parent class uses np.searchsorted to find the correct index, which it then looks up in y. For np.nan this is simply the last position of the array, and the subsequent lookup of self.y returns 1 for this case.
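A quick sketch of that behavior in isolation: np.searchsorted places np.nan past every finite value (and past np.inf), so the lookup lands on the final y value of 1.
import numpy as np

xs = np.array([1., 2., 3., np.inf])
np.searchsorted(xs, np.nan, side='right')  # -> 4: NaN sorts past everything, even inf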
You can make it propagate np.nan with a simple change, which you can realize as a subclass or sibling.
from statsmodels.distributions.empirical_distribution import StepFunction
import numpy as np
class MyECDF(StepFunction):
    def __init__(self, x, side='right'):
        x = np.sort(x)
        # count number of non-nan's instead of length
        nobs = np.count_nonzero(~np.isnan(x))
        # fill the y values corresponding to np.nan with np.nan
        y = np.full_like(x, np.nan)
        y[:nobs] = np.linspace(1./nobs, 1, nobs)
        super(MyECDF, self).__init__(x, y, side=side, sorted=True)
This small change will make the function behave in a way similar to R:
>>> from foobar import MyECDF
>>> from statsmodels.distributions.empirical_distribution import ECDF
>>> import numpy as np
>>> x = np.array([0,1,2,3, np.nan, np.nan])
>>> ECDF(x)(x)
array([0.16666667, 0.33333333, 0.5       , 0.66666667, 1.        , 1.        ])
>>> MyECDF(x)(x)
array([0.25, 0.5 , 0.75, 1. , nan, nan])
You can use a masked array:
import numpy.ma as ma

def ecdf(x):
    return np.where(np.isnan(x),
                    np.full_like(x, np.nan),
                    ECDF(ma.array(x, mask=np.isnan(x)).compressed(), "right")(ma.array(x, mask=np.isnan(x))))
>>> ecdf(x)
array([0.25, 0.5 , 0.75, 1. , nan, nan])
Matches what R does natively.
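If pulling in statsmodels is undesirable altogether, the same R-like behavior is only a few lines of plain NumPy. A minimal sketch (ecdf_nan is a made-up name; it uses the same right-continuous convention as ECDF):
import numpy as np

def ecdf_nan(x):
    # NaN in, NaN out; each finite value gets its rank among the finite values
    x = np.asarray(x, dtype=float)
    out = np.full(x.shape, np.nan)
    finite = ~np.isnan(x)
    xs = np.sort(x[finite])
    out[finite] = np.searchsorted(xs, x[finite], side='right') / xs.size
    return out

>>> ecdf_nan(np.array([0, 1, 2, 3, np.nan, np.nan]))
array([0.25, 0.5 , 0.75, 1.  ,  nan,  nan])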

numpy: multiply arbitrary shape array along first axis

I want to multiply an array along its first axis by some vector.
For instance, if a is 2D, b is 1D, and a.shape[0] == b.shape[0], we can do:
a *= b[:, np.newaxis]
What if a has an arbitrary shape? In numpy, the ellipsis "..." can be interpreted as "fill the remaining indices with ':'". Is there an equivalent for filling the remaining axes with None/np.newaxis?
The code below generates the desired result, but I would prefer a general vectorized way to accomplish this without falling back to a for loop.
from __future__ import print_function
import numpy as np

def foo(a, b):
    """
    Multiply a along its first axis by b
    """
    if len(a.shape) == 1:
        a *= b
    elif len(a.shape) == 2:
        a *= b[:, np.newaxis]
    elif len(a.shape) == 3:
        a *= b[:, np.newaxis, np.newaxis]
    else:
        n = a.shape[0]
        for i in range(n):
            a[i, ...] *= b[i]

n = 10
b = np.arange(n)
a = np.ones((n, 3))
foo(a, b)
print(a)
a = np.ones((n, 3, 3))
foo(a, b)
print(a)
Just reverse the order of the axes:
transpose = a.T
transpose *= b
a.T is a transposed view of a, where "transposed" means reversing the order of the dimensions for arbitrary-dimensional a. We assign a.T to a separate variable so the *= doesn't try to set the a.T attribute; the results still apply to a, since the transpose is a view.
Demo:
In [55]: a = numpy.ones((2, 2, 3))

In [56]: a
Out[56]:
array([[[1., 1., 1.],
        [1., 1., 1.]],

       [[1., 1., 1.],
        [1., 1., 1.]]])

In [57]: transpose = a.T

In [58]: transpose *= [2, 3]

In [59]: a
Out[59]:
array([[[2., 2., 2.],
        [2., 2., 2.]],

       [[3., 3., 3.],
        [3., 3., 3.]]])
Following the idea of the accepted answer, you could skip the variable assignment to the transpose as follows:
arr = np.tile(np.arange(10, dtype=float), 3).reshape(3, 10)
print(arr)
factors = np.array([0.1, 1, 10])
arr.T[:, :] *= factors
print(arr)
Which would print
[[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]
[0. 1. 2. 3. 4. 5. 6. 7. 8. 9.]]
[[ 0. 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9]
[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. ]
[ 0. 10. 20. 30. 40. 50. 60. 70. 80. 90. ]]
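Coming back to the literal question (an equivalent of "..." that fills the remaining axes with np.newaxis): there is no such shorthand, but you can build the trailing singleton axes yourself and let broadcasting do the rest. A minimal sketch of that idea:
import numpy as np

a = np.ones((10, 3, 3))
b = np.arange(10.0)

# reshape b to (10, 1, 1): one length-1 axis per remaining dimension of a,
# so the product broadcasts along a's first axis for any a.ndim
a *= b.reshape(b.shape + (1,) * (a.ndim - 1))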

numpy.meshgrid explanation

Could someone explain the meshgrid method? I cannot wrap my mind around it. The example is from the [SciPy][1] site:
import numpy as np
nx, ny = (3, 2)
x = np.linspace(0, 1, nx)
print ("x =", x)
y = np.linspace(0, 1, ny)
print ("y =", y)
xv, yv = np.meshgrid(x, y)
print ("xv_1 =", xv)
print ("yv_1 =", yv)
xv, yv = np.meshgrid(x, y, sparse=True) # make sparse output arrays
print ("xv_2 =", xv)
print ("yv_2 =", yv)
Printout is :
x = [ 0. 0.5 1. ]
y = [ 0. 1.]
xv_1 = [[ 0. 0.5 1. ]
[ 0. 0.5 1. ]]
yv_1 = [[ 0. 0. 0.]
[ 1. 1. 1.]]
xv_2 = [[ 0. 0.5 1. ]]
yv_2 = [[ 0.]
[ 1.]]
Why are the arrays xv_1 and yv_1 formed like this? Thanks :)
[1]: http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html#numpy.meshgrid
In [214]: nx, ny = (3, 2)
In [215]: x = np.linspace(0, 1, nx)
In [216]: x
Out[216]: array([ 0. , 0.5, 1. ])
In [217]: y = np.linspace(0, 1, ny)
In [218]: y
Out[218]: array([ 0., 1.])
Using unpacking to better see the 2 arrays produced by meshgrid:
In [225]: X,Y = np.meshgrid(x, y)
In [226]: X
Out[226]:
array([[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ]])
In [227]: Y
Out[227]:
array([[ 0., 0., 0.],
[ 1., 1., 1.]])
and for the sparse version. Notice that X1 looks like one row of X (but 2-d), and Y1 like one column of Y.
In [228]: X1,Y1 = np.meshgrid(x, y, sparse=True)
In [229]: X1
Out[229]: array([[ 0. , 0.5, 1. ]])
In [230]: Y1
Out[230]:
array([[ 0.],
[ 1.]])
When used in calculations like addition and multiplication, both forms behave the same, because of numpy's broadcasting.
In [231]: X+Y
Out[231]:
array([[ 0. , 0.5, 1. ],
[ 1. , 1.5, 2. ]])
In [232]: X1+Y1
Out[232]:
array([[ 0. , 0.5, 1. ],
[ 1. , 1.5, 2. ]])
The shapes might also help:
In [235]: X.shape, Y.shape
Out[235]: ((2, 3), (2, 3))
In [236]: X1.shape, Y1.shape
Out[236]: ((1, 3), (2, 1))
The X and Y forms hold more values than are actually needed for most uses, but usually there isn't much of a penalty for using them instead of the sparse versions.
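If the row/column convention above feels backwards for your use case, meshgrid also accepts indexing='ij', which produces matrix-style coordinate arrays with X[i,j] = x[i] and Y[i,j] = y[j] (the In/Out numbers below are illustrative):
In [240]: Xi, Yi = np.meshgrid(x, y, indexing='ij')

In [241]: Xi
Out[241]:
array([[ 0. ,  0. ],
       [ 0.5,  0.5],
       [ 1. ,  1. ]])

In [242]: Yi
Out[242]:
array([[ 0.,  1.],
       [ 0.,  1.],
       [ 0.,  1.]])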
Your linearly spaced vectors x and y, defined by linspace, use 3 and 2 points respectively.
These vectors are then used by the meshgrid function to create a 2D grid of points: one point for each combination of an x and a y coordinate, 3 x 2 points in total.
The output of meshgrid is a pair of coordinate matrices that hold, in each cell, the x and y coordinates of one point of your space.
This is created as follows:
# dummy
def meshgrid_custom(x, y):
    xv = np.zeros((len(x), len(y)))
    yv = np.zeros((len(x), len(y)))
    for i, ix in zip(range(len(x)), x):
        for j, jy in zip(range(len(y)), y):
            xv[i, j] = ix
            yv[i, j] = jy
    return xv.T, yv.T
So, for example, the point at grid location (1,1) has the coordinates:
x = xv_1[1,1] = 0.5
y = yv_1[1,1] = 1.0
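As a quick sanity check (a sketch, using the x and y from the question), the dummy version reproduces numpy's output for the example above:
xv, yv = meshgrid_custom(x, y)
# xv now equals xv_1 and yv equals yv_1 from np.meshgrid(x, y)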

confused by numpy meshgrid output

Using Python 2.7 with the Miniconda interpreter. I am confused by what "N-D coordinate arrays" means in the following statement; could anyone explain how xv and yv are calculated in the sample below?
"Make N-D coordinate arrays for vectorized evaluations of N-D scalar/vector fields over N-D grids, given one-dimensional coordinate arrays x1, x2,..., xn."
http://docs.scipy.org/doc/numpy/reference/generated/numpy.meshgrid.html
>>> nx, ny = (3, 2)
>>> x = np.linspace(0, 1, nx)
>>> y = np.linspace(0, 1, ny)
>>> xv, yv = meshgrid(x, y)
>>> xv
array([[ 0. , 0.5, 1. ],
[ 0. , 0.5, 1. ]])
>>> yv
array([[ 0., 0., 0.],
[ 1., 1., 1.]])
regards,
Lin
xv,yv are simply defined as:
xv = np.array([x for _ in y])
yv = np.array([y for _ in x]).T
so that for every index pair (i,j), you have
xv[i,j] = x[j]
yv[i,j] = y[i]
which is useful especially for plotting 2D maps.
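That equivalence is easy to verify end to end; a self-contained sketch:
import numpy as np

x = np.linspace(0, 1, 3)
y = np.linspace(0, 1, 2)
xv = np.array([x for _ in y])
yv = np.array([y for _ in x]).T
XV, YV = np.meshgrid(x, y)
assert (xv == XV).all() and (yv == YV).all()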

Resample a numpy array

It's easy to resample an array like
a = numpy.array([1,2,3,4,5,6,7,8,9,10])
with an integer resampling factor. For instance, with a factor of 2:
b = a[::2]  # [1 3 5 7 9]
But with a non-integer resampling factor, it doesn't work so easily:
c = a[::1.5]  # raises an error in recent NumPy (old versions truncated the step to 1, returning the full array) => not what is needed...
It should be (with linear interpolation):
[1 2.5 4 5.5 7 8.5 10]
or (by taking the nearest neighbour in the array)
[1 3 4 6 7 9 10]
How to resample a numpy array with a non-integer resampling factor?
Example of application: audio signal resampling / repitching
NumPy has numpy.interp which does linear interpolation:
In [1]: np.interp(np.arange(0, len(a), 1.5), np.arange(0, len(a)), a)
Out[1]: array([ 1. ,  2.5,  4. ,  5.5,  7. ,  8.5, 10. ])
SciPy has scipy.interpolate.interp1d which can do linear and nearest interpolation (though which point is nearest might not be obvious):
In [2]: from scipy.interpolate import interp1d
In [3]: xp = np.arange(0, len(a), 1.5)
In [4]: lin = interp1d(np.arange(len(a)), a)
In [5]: lin(xp)
Out[5]: array([ 1. , 2.5, 4. , 5.5, 7. , 8.5, 10. ])
In [6]: nearest = interp1d(np.arange(len(a)), a, kind='nearest')
In [7]: nearest(xp)
Out[7]: array([ 1., 2., 4., 5., 7., 8., 10.])
As scipy.signal.resample can be very slow, I searched for other algorithms adapted for audio.
It seems that Erik de Castro Lopo's SRC (a.k.a. Secret Rabbit Code a.k.a. libsamplerate) is one of the best resampling algorithms available.
It is used by scikits.samplerate, but that library seems to be complicated to install (I gave up on Windows).
Fortunately, there is an easy-to-use and easy-to-install Python wrapper for libsamplerate, made by Tino Wagner: https://pypi.org/project/samplerate/. Installation with pip install samplerate. Usage:
import samplerate
from scipy.io import wavfile
sr, x = wavfile.read('input.wav')  # 48 kHz file
y = samplerate.resample(x, 44100 * 1.0 / 48000, 'sinc_best')
Interesting reading / comparison of many resampling solutions:
http://signalsprocessed.blogspot.com/2016/08/audio-resampling-in-python.html
Addendum: comparison of spectrograms of a resampled frequency sweep (20 Hz to 20 kHz), images omitted: (1) original; (2) resampled with libsamplerate / the samplerate module; (3) resampled with numpy.interp ("one-dimensional linear interpolation").
Since you mention this being data from an audio .WAV file, you might look at scipy.signal.resample.
Resample x to num samples using Fourier method along the given axis.
The resampled signal starts at the same value as x but is sampled
with a spacing of len(x) / num * (spacing of x). Because a
Fourier method is used, the signal is assumed to be periodic.
Your linear array a is not a good one to test this on, since it isn't periodic in appearance. But consider sin data:
from scipy import signal
import matplotlib.pyplot as plt

x = np.arange(10)
y = np.sin(x)
y1, x1 = signal.resample(y, 15, x)  # 10 pts resampled at 15
Compare these with either
y1 - np.sin(x1)  # or
plt.plot(x, y, x1, y1)
In signal processing, you can think of resampling as rescaling the array and interpolating the missing values, or the values with non-integer index, using nearest, linear, cubic, etc. methods.
Using scipy.interpolate.interp1d, you can achieve one-dimensional resampling with the following function:
from scipy.interpolate import interp1d

def resample(x, factor, kind='linear'):
    n = int(np.ceil(x.size / factor))
    f = interp1d(np.linspace(0, 1, x.size), x, kind)
    return f(np.linspace(0, 1, n))
e.g.:
a = np.array([1,2,3,4,5,6,7,8,9,10])
resample(a, factor=1.5, kind='linear')
yields
array([ 1. , 2.5, 4. , 5.5, 7. , 8.5, 10. ])
and
a = np.array([1,2,3,4,5,6,7,8,9,10])
resample(a, factor=1.5, kind='nearest')
yields
array([ 1., 2., 4., 5., 7., 8., 10.])
And if you want integer (nearest-index) sampling:
a = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
factor = 1.5
idx = np.round(np.arange(0, len(a), factor)).astype(int)
sampled = a[idx]  # array([ 1,  3,  4,  5,  7,  9, 10])
