Python: binned_statistic_2d mean calculation ignoring NaNs in data

I am using scipy.stats.binned_statistic_2d to bin irregular data onto a uniform grid by finding the mean of points within every bin.
x, y = np.meshgrid(np.sort(np.random.uniform(0, 1, 100)), np.sort(np.random.uniform(0, 1, 100)))
z = np.sin(x*y)
statistic, xedges, yedges, binnumber = sp.stats.binned_statistic_2d(x.ravel(), y.ravel(), values=z.ravel(), statistic='mean',bins=[np.arange(0,1.1,.1), np.arange(0,1.1,.1)])
plt.figure(1)
plt.pcolormesh(x,y,z, vmin = 0, vmax = 1)
plt.figure(2)
plt.pcolormesh(xedges,yedges,statistic, vmin = 0, vmax = 1)
Produces these plots, as expected:
Scattered data:
Gridded data:
But the data I want to grid has NaNs in it. This is what the result is like when I add NaNs:
x, y = np.meshgrid(np.sort(np.random.uniform(0, 1, 100)), np.sort(np.random.uniform(0, 1, 100)))
z = np.sin(x*y)
z[50:55,50:55] = np.nan
statistic, xedges, yedges, binnumber = sp.stats.binned_statistic_2d(x.ravel(), y.ravel(), values=z.ravel(), statistic='mean', bins=[np.arange(0, 1.1, .1), np.arange(0, 1.1, .1)])
plt.figure(3)
plt.pcolormesh(x,y,z, vmin = 0, vmax = 1)
plt.figure(4)
plt.pcolormesh(xedges,yedges,statistic, vmin = 0, vmax = 1)
Scattered:
Gridded:
Obviously, if a bin is entirely filled with NaNs, the resulting mean of that bin should still be NaN. However, I would like bins that are only partially filled with NaNs to yield the mean of the non-NaN values.
I've tried replacing the "statistic" argument in sp.stats.binned_statistic_2d with np.nanmean. This works, but it is very slow on large datasets. I've also tried digging into the underlying code of `sp.stats.binned_statistic_2d`, but I can't figure out exactly how it calculates the mean, or how to make it ignore NaNs in its calculation.
Any ideas?

I had the same problem, so I changed the definition of binned_statistic_dd in scipy.stats and saved a local copy, so that it won't be overwritten when scipy is updated.
I added 'nanmean' to the list of known_stats and added this branch:
elif statistic == 'nanmean':
    result.fill(np.nan)
    for i in np.unique(binnumbers):
        for vv in builtins.range(Vdim):
            result[vv, i] = np.nanmean(values[vv, binnumbers == i])
Full new definition:
import builtins
from operator import index

import numpy as np
from numpy.testing import suppress_warnings
# NOTE: these private helpers are reused from scipy's own implementation;
# the module path below matches recent scipy versions but may change.
from scipy.stats._binned_statistic import (BinnedStatisticddResult,
                                           _bin_edges, _bin_numbers)


def binned_statistic_dd(sample, values, statistic='mean',
                        bins=10, range=None, expand_binnumbers=False,
                        binned_statistic_result=None):
"""
Compute a multidimensional binned statistic for a set of data.
This is a generalization of a histogramdd function. A histogram divides
the space into bins, and returns the count of the number of points in
each bin. This function allows the computation of the sum, mean, median,
or other statistic of the values within each bin.
Parameters
----------
sample : array_like
Data to histogram passed as a sequence of N arrays of length D, or
as an (N,D) array.
values : (N,) array_like or list of (N,) array_like
The data on which the statistic will be computed. This must be
the same shape as `sample`, or a list of sequences - each with the
same shape as `sample`. If `values` is such a list, the statistic
will be computed on each independently.
statistic : string or callable, optional
The statistic to compute (default is 'mean').
The following statistics are available:
* 'mean' : compute the mean of values for points within each bin.
Empty bins will be represented by NaN.
* 'median' : compute the median of values for points within each
bin. Empty bins will be represented by NaN.
* 'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
referenced.
* 'sum' : compute the sum of values for points within each bin.
This is identical to a weighted histogram.
* 'std' : compute the standard deviation within each bin. This
is implicitly calculated with ddof=0. If the number of values
within a given bin is 0 or 1, the computed standard deviation value
will be 0 for the bin.
* 'min' : compute the minimum of values for points within each bin.
Empty bins will be represented by NaN.
* 'max' : compute the maximum of values for point within each bin.
Empty bins will be represented by NaN.
* function : a user-defined function which takes a 1D array of
values, and outputs a single numerical statistic. This function
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
bins : sequence or positive int, optional
The bin specification must be in one of the following forms:
* A sequence of arrays describing the bin edges along each dimension.
* The number of bins for each dimension (nx, ny, ... = bins).
* The number of bins for all dimensions (nx = ny = ... = bins).
range : sequence, optional
A sequence of lower and upper bin edges to be used if the edges are
not given explicitly in `bins`. Defaults to the minimum and maximum
values along each dimension.
expand_binnumbers : bool, optional
'False' (default): the returned `binnumber` is a shape (N,) array of
linearized bin indices.
'True': the returned `binnumber` is 'unraveled' into a shape (D,N)
ndarray, where each row gives the bin numbers in the corresponding
dimension.
See the `binnumber` returned value, and the `Examples` section of
`binned_statistic_2d`.
binned_statistic_result : binnedStatisticddResult
Result of a previous call to the function in order to reuse bin edges
and bin numbers with new values and/or a different statistic.
To reuse bin numbers, `expand_binnumbers` must have been set to False
(the default)
.. versionadded:: 0.17.0
Returns
-------
statistic : ndarray, shape(nx1, nx2, nx3,...)
The values of the selected statistic in each two-dimensional bin.
bin_edges : list of ndarrays
A list of D arrays describing the (nxi + 1) bin edges for each
dimension.
binnumber : (N,) array of ints or (D,N) ndarray of ints
This assigns to each element of `sample` an integer that represents the
bin in which this observation falls. The representation depends on the
`expand_binnumbers` argument. See `Notes` for details.
See Also
--------
numpy.digitize, numpy.histogramdd, binned_statistic, binned_statistic_2d
Notes
-----
Binedges:
All but the last (righthand-most) bin is half-open in each dimension. In
other words, if `bins` is ``[1, 2, 3, 4]``, then the first bin is
``[1, 2)`` (including 1, but excluding 2) and the second ``[2, 3)``. The
last bin, however, is ``[3, 4]``, which *includes* 4.
`binnumber`:
This returned argument assigns to each element of `sample` an integer that
represents the bin in which it belongs. The representation depends on the
`expand_binnumbers` argument. If 'False' (default): The returned
`binnumber` is a shape (N,) array of linearized indices mapping each
element of `sample` to its corresponding bin (using row-major ordering).
If 'True': The returned `binnumber` is a shape (D,N) ndarray where
each row indicates bin placements for each dimension respectively. In each
dimension, a binnumber of `i` means the corresponding value is between
(bin_edges[D][i-1], bin_edges[D][i]), for each dimension 'D'.
.. versionadded:: 0.11.0
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits.mplot3d import Axes3D
Take an array of 600 (x, y) coordinates as an example.
`binned_statistic_dd` can handle arrays of higher dimension `D`. But a plot
of dimension `D+1` is required.
>>> mu = np.array([0., 1.])
>>> sigma = np.array([[1., -0.5],[-0.5, 1.5]])
>>> multinormal = stats.multivariate_normal(mu, sigma)
>>> data = multinormal.rvs(size=600, random_state=235412)
>>> data.shape
(600, 2)
Create bins and count how many arrays fall in each bin:
>>> N = 60
>>> x = np.linspace(-3, 3, N)
>>> y = np.linspace(-3, 4, N)
>>> ret = stats.binned_statistic_dd(data, np.arange(600), bins=[x, y],
... statistic='count')
>>> bincounts = ret.statistic
Set the volume and the location of bars:
>>> dx = x[1] - x[0]
>>> dy = y[1] - y[0]
>>> x, y = np.meshgrid(x[:-1]+dx/2, y[:-1]+dy/2)
>>> z = 0
>>> bincounts = bincounts.ravel()
>>> x = x.ravel()
>>> y = y.ravel()
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> with np.errstate(divide='ignore'): # silence random axes3d warning
... ax.bar3d(x, y, z, dx, dy, bincounts)
Reuse bin numbers and bin edges with new values:
>>> ret2 = stats.binned_statistic_dd(data, -np.arange(600),
... binned_statistic_result=ret,
... statistic='mean')
"""
    known_stats = ['mean', 'median', 'count', 'sum', 'std', 'min', 'max',
                   'nanmean']
    if not callable(statistic) and statistic not in known_stats:
        raise ValueError('invalid statistic %r' % (statistic,))

    try:
        bins = index(bins)
    except TypeError:
        # bins is not an integer
        pass
    # If bins was an integer-like object, now it is an actual Python int.

    # NOTE: for _bin_edges(), see e.g. gh-11365
    if isinstance(bins, int) and not np.isfinite(sample).all():
        raise ValueError('%r contains non-finite values.' % (sample,))

    # `Ndim` is the number of dimensions (e.g. `2` for `binned_statistic_2d`)
    # `Dlen` is the length of elements along each dimension.
    # This code is based on np.histogramdd
    try:
        # `sample` is an ND-array.
        Dlen, Ndim = sample.shape
    except (AttributeError, ValueError):
        # `sample` is a sequence of 1D arrays.
        sample = np.atleast_2d(sample).T
        Dlen, Ndim = sample.shape

    # Store initial shape of `values` to preserve it in the output
    values = np.asarray(values)
    input_shape = list(values.shape)
    # Make sure that `values` is 2D to iterate over rows
    values = np.atleast_2d(values)
    Vdim, Vlen = values.shape

    # Make sure `values` match `sample`
    if statistic != 'count' and Vlen != Dlen:
        raise AttributeError('The number of `values` elements must match the '
                             'length of each `sample` dimension.')

    try:
        M = len(bins)
        if M != Ndim:
            raise AttributeError('The dimension of bins must be equal '
                                 'to the dimension of the sample x.')
    except TypeError:
        bins = Ndim * [bins]

    if binned_statistic_result is None:
        nbin, edges, dedges = _bin_edges(sample, bins, range)
        binnumbers = _bin_numbers(sample, nbin, edges, dedges)
    else:
        edges = binned_statistic_result.bin_edges
        nbin = np.array([len(edges[i]) + 1 for i in builtins.range(Ndim)])
        # +1 for outlier bins
        dedges = [np.diff(edges[i]) for i in builtins.range(Ndim)]
        binnumbers = binned_statistic_result.binnumber

    result = np.empty([Vdim, nbin.prod()], float)

    if statistic == 'mean':
        result.fill(np.nan)
        flatcount = np.bincount(binnumbers, None)
        a = flatcount.nonzero()
        for vv in builtins.range(Vdim):
            flatsum = np.bincount(binnumbers, values[vv])
            result[vv, a] = flatsum[a] / flatcount[a]
    elif statistic == 'std':
        result.fill(0)
        flatcount = np.bincount(binnumbers, None)
        a = flatcount.nonzero()
        for vv in builtins.range(Vdim):
            for i in np.unique(binnumbers):
                # NOTE: take std dev by bin, np.std() is 2-pass and stable
                binned_data = values[vv, binnumbers == i]
                # calc std only when binned data is 2 or more for speed up.
                if len(binned_data) >= 2:
                    result[vv, i] = np.std(binned_data)
    elif statistic == 'count':
        result.fill(0)
        flatcount = np.bincount(binnumbers, None)
        a = np.arange(len(flatcount))
        result[:, a] = flatcount[np.newaxis, :]
    elif statistic == 'sum':
        result.fill(0)
        for vv in builtins.range(Vdim):
            flatsum = np.bincount(binnumbers, values[vv])
            a = np.arange(len(flatsum))
            result[vv, a] = flatsum
    elif statistic == 'median':
        result.fill(np.nan)
        for i in np.unique(binnumbers):
            for vv in builtins.range(Vdim):
                result[vv, i] = np.median(values[vv, binnumbers == i])
    elif statistic == 'min':
        result.fill(np.nan)
        for i in np.unique(binnumbers):
            for vv in builtins.range(Vdim):
                result[vv, i] = np.min(values[vv, binnumbers == i])
    elif statistic == 'max':
        result.fill(np.nan)
        for i in np.unique(binnumbers):
            for vv in builtins.range(Vdim):
                result[vv, i] = np.max(values[vv, binnumbers == i])
    elif statistic == 'nanmean':
        result.fill(np.nan)
        for i in np.unique(binnumbers):
            for vv in builtins.range(Vdim):
                result[vv, i] = np.nanmean(values[vv, binnumbers == i])
    elif callable(statistic):
        with np.errstate(invalid='ignore'), suppress_warnings() as sup:
            sup.filter(RuntimeWarning)
            try:
                null = statistic([])
            except Exception:
                null = np.nan
        result.fill(null)
        for i in np.unique(binnumbers):
            for vv in builtins.range(Vdim):
                result[vv, i] = statistic(values[vv, binnumbers == i])

    # Shape into a proper matrix
    result = result.reshape(np.append(Vdim, nbin))

    # Remove outliers (indices 0 and -1 for each bin-dimension).
    core = tuple([slice(None)] + Ndim * [slice(1, -1)])
    result = result[core]

    # Unravel binnumbers into an ndarray, each row the bins for each dimension
    if expand_binnumbers and Ndim > 1:
        binnumbers = np.asarray(np.unravel_index(binnumbers, nbin))

    if np.any(result.shape[1:] != nbin - 2):
        raise RuntimeError('Internal Shape Error')

    # Reshape to have output (`result`) match input (`values`) shape
    result = result.reshape(input_shape[:-1] + list(nbin-2))

    return BinnedStatisticddResult(result, edges, binnumbers)
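With the local copy in place, the original 2-D problem can go through binned_statistic_dd directly (binned_statistic_2d is only a thin wrapper around it). A minimal usage sketch, assuming the definition above was saved in a local module named nan_binned_statistic (the module name is hypothetical):
import numpy as np
from nan_binned_statistic import binned_statistic_dd  # hypothetical local module

x, y = np.meshgrid(np.sort(np.random.uniform(0, 1, 100)),
                   np.sort(np.random.uniform(0, 1, 100)))
z = np.sin(x * y)
z[50:55, 50:55] = np.nan

edges = [np.arange(0, 1.1, .1), np.arange(0, 1.1, .1)]
# sample is passed as an (N, D) array; statistic='nanmean' ignores the NaNs
result = binned_statistic_dd(np.column_stack([x.ravel(), y.ravel()]),
                             z.ravel(), statistic='nanmean', bins=edges)
statistic = result.statistic  # bins containing only NaNs stay NaN
If patching scipy is undesirable, masking the NaN points first (good = ~np.isnan(z.ravel())) and calling the fast built-in 'mean' on only the valid points gives the same result: bins whose points were all NaN come out empty and are reported as NaN anyway.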

Related

Values of each bin

I've got the following problem:
hist, edges = np.histogram(data, bins=50)
How can I access the values of each bin? I want to calculate the average of the values in each bin.
Thanks
I think this function does what you want:
import numpy as np

def binned_mean(values, edges):
    values = np.asarray(values)
    # Classify values into bins
    dig = np.digitize(values, edges)
    # Mask values out of bins
    m = (dig > 0) & (dig < len(edges))
    values = values[m]
    dig = dig[m] - 1
    # Binned sum of values
    nbins = len(edges) - 1
    s = np.zeros(nbins, dtype=values.dtype)
    np.add.at(s, dig, values)
    # Binned count of values
    count = np.zeros(nbins, dtype=np.int32)
    np.add.at(count, dig, 1)
    # Means
    return s / count.clip(min=1)
Example:
print(binned_mean([1.2, 1.8, 2.1, 2.4, 2.7], [1, 2, 3]))
# [1.5 2.4]
There is a slight difference from np.histogram in this function, though: np.digitize considers all bins to be half-open (either on the right or on the left), unlike np.histogram, which treats the last bin as closed (it includes its right edge).
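For example, a value that lands exactly on the last edge is dropped by this function but counted by np.histogram. A small sketch using the binned_mean defined above:
import numpy as np

vals = [2.5, 3.0]
edges = [1, 2, 3]
# 3.0 sits exactly on the last edge: np.digitize pushes it out of range,
# while np.histogram counts it in the (closed) last bin
print(binned_mean(vals, edges))           # [0.  2.5]
print(np.histogram(vals, bins=edges)[0])  # [0 2]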

Python: Dendrogram with SciPy doesn't work

I want to use the dendrogram function of scipy.
I have the following data:
I have a list with seven different means. For example:
Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756, 31.594949722819372, 34.823881975554166, 28.36368420190157]
Each mean is calculated for a different user. For example:
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]
My aim is to display the data described above with the help of a dendrogram.
I tried the following:
Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756, 31.594949722819372, 34.823881975554166, 28.36368420190157]
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]
# Attempt with a matrix
#X = np.concatenate((X, Y),)
#Z = linkage(X)
Z = linkage(Y)
# Plot the dendrogram with the results above
dendrogram(Z, leaf_rotation=45., leaf_font_size=12., show_contracted=True)
plt.style.use("seaborn-whitegrid")
plt.title("Dendogram to find clusters")
plt.ylabel("Distance")
plt.show()
But it says:
ValueError: Length n of condensed distance matrix 'y' must be a binomial coefficient, i.e. there must be a k such that (k \choose 2)=n)!
I have already tried to convert my data into a matrix, with:
# Attempt with a matrix
#X = np.concatenate((X, Y),)
#Z = linkage(X)
But that doesn't work either!
Are there any suggestions?
Thanks :-)
The first argument of linkage is either an n x m array, representing n points in m-dimensional space, or a one-dimensional array containing the condensed distance matrix. These are two very different meanings! The first is the raw data, i.e. the observations. The second format assumes that you have already computed all the distances between your observations, and you are providing these distances to linkage, not the original points.
It looks like you want the first case (raw data), with m = 1. So you must reshape the input to have shape (n, 1).
Replace this:
Z = linkage(Y)
with:
Z = linkage(np.reshape(Y, (len(Y), 1)))
You are passing 7 observations in Y (len(Y) = 7). But as per the documentation of linkage, when a one-dimensional array is passed it is treated as a condensed distance matrix, whose length must be a binomial coefficient:

{n choose 2} = len(Y)

which means

n * (n - 1) / 2 = len(Y)

so the length of Y must be such that n is a valid integer.
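Putting the fix together, a minimal runnable sketch (assuming the standard scipy.cluster.hierarchy imports; the labels argument is optional):
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

Y = [71.407452200146807, 0, 33.700136456196823, 1112.3757110973756,
     31.594949722819372, 34.823881975554166, 28.36368420190157]
X = ["user1", "user2", "user3", "user4", "user5", "user6", "user7"]

# Reshape the 7 means into 7 one-dimensional observations of shape (7, 1),
# so linkage treats them as raw data rather than a condensed distance matrix.
Z = linkage(np.reshape(Y, (len(Y), 1)))

dendrogram(Z, labels=X, leaf_rotation=45., leaf_font_size=12.)
plt.title("Dendrogram to find clusters")
plt.ylabel("Distance")
plt.show()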

numpy histogram: boolean index did not match indexed array along dimension 0

I am trying to sky-subtract astronomical images by creating a histogram of the pixel intensities in each image and then setting the sky value equal to the intensity of the bin with the highest frequency. The idea is then to subtract this sky value from each pixel within that frame. However, I get:
IndexError: boolean index did not match indexed array along dimension 0; dimension is 3651469 but corresponding boolean dimension is 3651468
#sciFlat is a list containing three images in array form.
sciFlat = np.asarray(sciFlat)
minpix = min(sciFlat.flatten())
maxpix = max(sciFlat.flatten())
rng = int(maxpix - minpix)
#These are histogram ranges, now loop through each image.
#Sky subtract science images.
sciSky = []
for i in range(3):
    hf = np.histogram(sciFlat[i].flatten(), bins=rng, range=(minpix, maxpix))
    skyval = hf[1][hf[0] == max(hf[0])]
    print(skyval)
    skySub = sciFlat[i] - skyval
    sciSky.append(skySub)
I expect the code to complete successfully as numpy.histogram should return hist (a flattened array of size n) and bin_edges (1D array of length n).
IndexError                                Traceback (most recent call last)
    142     hf = np.histogram(sciFlat[i].flatten(), bins=rng, range=(minpix,maxpix))
--> 143     skyval = hf[1][hf[0] == max(hf[0])]
    144     print(skyval)
    145     skySub = sciFlat[i] - skyval
IndexError: boolean index did not match indexed array along dimension 0; dimension is 3651469 but corresponding boolean dimension is 3651468
I believe it is simply your indexing logic that fails in the loop. Unpacking the histogram output makes this clearer:
for i in range(3):
    hist, edges = np.histogram(sciFlat[i].flatten(), bins=rng, range=(minpix, maxpix))
    skyval = edges[hist == max(hist)]
    print(skyval)
    skySub = sciFlat[i] - skyval
    sciSky.append(skySub)
np.histogram provides you with the bin edges; what you most likely want is the midpoint of each bin:
for i in range(3):
    hist, edges = np.histogram(sciFlat[i].flatten(), bins=rng, range=(minpix, maxpix))
    mids = edges[:-1] + np.diff(edges)/2
    skyval = mids[hist.argmax()]
    print(skyval)
    skySub = sciFlat[i] - skyval
    sciSky.append(skySub)
By using mids instead of edges, the array you index has the same length as the histogram counts. To illustrate the difference between edges and mids:
sciFlat = np.random.uniform(0,15,100)
hist, edges = np.histogram(sciFlat, bins=(sciFlat.max()-sciFlat.min()).astype(int), range=(sciFlat.min(), sciFlat.max()))
mids = edges[:-1] + np.diff(edges)/2
hist.size
Out[33]: 14
edges.size
Out[34]: 15
mids.size
Out[35]: 14
plt.hist(sciFlat, bins=(sciFlat.max()-sciFlat.min()).astype(int), range=(sciFlat.min(), sciFlat.max()))
plt.plot(mids[hist.argmax()], hist.max(), marker='*', ms=20, c='C3', zorder=1)
plt.plot(mids, hist, 'o', zorder=2, c='C1')
The star marks the midpoint of the bin with the highest count; as you can see, it lies between the edges:
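One more subtlety, not raised in the answer above: hist.argmax() always yields a single bin, while the boolean-mask indexing from the first attempt can return several values when bins tie for the maximum count, and subtracting that array from an image would broadcast in unintended ways. A small sketch with made-up numbers:
import numpy as np

hist = np.array([3, 5, 5, 1])
mids = np.array([0.5, 1.5, 2.5, 3.5])
print(mids[hist.argmax()])       # 1.5 -- a scalar, safe to subtract from an image
print(mids[hist == hist.max()])  # [1.5 2.5] -- an array when two bins tie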

Pyplot truth value of an array with more than one element is ambiguous

I am trying to implement a 1D k-nearest-neighbour density estimate:
# nearest neighbors estimate
def nearest_n(x, k, data):
    # Order dataset
    #data = np.sort(data, kind='mergesort')
    nnb = []
    # iterate over all data and get k nearest neighbours around x
    for n in data:
        if nnb.__len__() < k:
            nnb.append(n)
        else:
            for nb in np.arange(0, k):
                if np.abs(x - n) < np.abs(x - nnb[nb]):
                    nnb[nb] = n
                    break
    nnb = np.array(nnb)
    # get volume(distance) v of k nearest neighbours around x
    v = nnb.max() - nnb.min()
    v = k / (data.__len__() * v)
    return v

interval = np.arange(-4.0, 8.0, 0.1)
plt.figure()
for k in (2, 8, 35):
    plt.plot(interval, nearest_n(interval, k, train_data), label=str(k))
plt.legend()
plt.show()
Which throws:
File "x", line 55, in nearest_n
if np.abs(x-n) < np.abs(x-nnb[nb]):
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
I know the error comes from the array input to plot(), but I am not sure how to avoid this in a function that uses the comparison operators >, == and <.
'data' comes from a 1D txt file containing floats.
I tried using vectorize:
nearest_n = np.vectorize(nearest_n)
which results in:
line 50, in nearest_n
for n in data:
TypeError: 'numpy.float64' object is not iterable
Here is an example. Let's say:
data = [0.5, 1.7, 2.3, 1.2, 0.2, 2.2]
k = 2
nearest_n(1.5, k, data) should then lead to
nnb = [1.2, 1.7]
v = 0.5
and return 2/(6*0.5) = 2/3.
The function runs fine on a scalar; for example, nearest_n(2.0, 4, data) gives 0.0741586011463.
You're passing in np.arange(-4.0, 8.0, 0.1) as your x, which is an array of values. So x - n is an array of the same length as x, in this case 120 elements, since subtracting a scalar from an array is done element-wise. The same goes for x - nnb[nb]. So the result of your comparison is a 120-element array of boolean values, depending on whether each element of np.abs(x-n) is less than the corresponding element of np.abs(x-nnb[nb]). This can't be used directly as a conditional; you would need to collapse it to a single boolean (using all(), any(), or simply rethinking your code).
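To make that concrete, a minimal sketch (with made-up comparison values) of what the condition actually evaluates to:
import numpy as np

x = np.arange(-4.0, 8.0, 0.1)           # 120 elements, like the interval above
cmp = np.abs(x - 0.5) < np.abs(x - 1.7)
print(cmp.shape, cmp.dtype)             # (120,) bool -- an array, so `if cmp:` is ambiguous
print(cmp.any(), cmp.all())             # reductions that do give a single bool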
plt.figure()
X = np.arange(-4.0, 8.0, 0.1)
for k in [2, 8, 35]:
    Y = []
    for n in X:
        Y.append(nearest_n(n, k, train_data))
    plt.plot(X, Y, label=str(k))
plt.show()
This works fine. I thought pyplot.plot would do this exact thing for me already, but I guess it does not...

numpy - do operation along specified axis

So I want to implement a matrix standardisation method.
To do that, I've been told to
subtract the mean and divide by the standard deviation for each dimension
And to verify:
after this processing, each dimension has zero mean and unit variance.
That sounds simple enough ...
import numpy as np
def standardize(X: np.ndarray, inplace=True, verbose=False, check=False):
    ret = X
    if not inplace:
        ret = X.copy()
    ndim = np.ndim(X)
    for d in range(ndim):
        m = np.mean(ret, axis=d)
        s = np.std(ret, axis=d)
        if verbose:
            print(f"m{d} =", m)
            print(f"s{d} =", s)
        # TODO: handle zero s
        # TODO: subtract m along the correct axis
        # TODO: divide by s along the correct axis
    if check:
        means = [np.mean(X, axis=d) for d in range(ndim)]
        stds = [np.std(X, axis=d) for d in range(ndim)]
        if verbose:
            print("means=\n", means)
            print("stds=\n", stds)
        assert all(all(m < 1e-15 for m in mm) for mm in means)
        assert all(all(s == 1.0 for s in ss) for ss in stds)
    return ret
e.g. for ndim == 2, we could get something like
A =
[[ 0.40923704  0.91397416  0.62257397]
 [ 0.15614258  0.56720836  0.80624135]]
m0 = [ 0.28268981  0.74059126  0.71440766]  # can broadcast with ret -= m0
s0 = [ 0.12654723  0.1733829   0.09183369]  # can broadcast with ret /= s0
m1 = [ 0.33333333 -0.33333333]  # ???
s1 = [ 0.94280904  0.94280904]  # ???
How do I do that?
Judging by Broadcast an operation along specific axis in python, I thought I might be looking for a way to create
m[None, None, None, .., None, :, None, None, .., None]
where there is exactly one : at index d.
But even if I knew how to do that, I'm not sure it would work.
You can swap your axes so that the first axis is the one you want to normalise. This also works in place, since swapaxes just returns a view on your data.
Using the numpy function swapaxes:
for d in range(ndim):
    m = np.mean(ret, axis=d)
    s = np.std(ret, axis=d)
    ret = np.swapaxes(ret, 0, d)
    # Perform normalisation along this axis
    ret -= m
    ret /= s
    ret = np.swapaxes(ret, 0, d)
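As an alternative sketch, not part of the answer above: the keepdims argument of NumPy's reductions keeps the reduced axis in place with size 1, so the statistics broadcast back against the array for any axis d, with no swapaxes bookkeeping (assumes a float array):
import numpy as np

def standardize_axis(ret, d):
    # keepdims=True leaves axis d as size 1, so m and s broadcast
    # over ret directly, whichever axis d refers to
    m = ret.mean(axis=d, keepdims=True)
    s = ret.std(axis=d, keepdims=True)
    ret -= m
    ret /= s
    return ret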
