Values of each bin

Values of each bin - python

i got following problem:
hist, edges = np.histogram(data, bins=50)
How can i access the values of each bin? I wanted to calculate the avg of each bin.
Thanks

I think this function does what you want:
import numpy as np
def binned_mean(values, edges):
values = np.asarray(values)
# Classify values into bins
dig = np.digitize(values, edges)
# Mask values out of bins
m = (dig > 0) & (dig < len(edges))
values = values[m]
dig = dig[m] - 1
# Binned sum of values
nbins = len(edges) - 1
s = np.zeros(nbins, dtype=values.dtype)
np.add.at(s, dig, values)
# Binned count of values
count = np.zeros(nbins, dtype=np.int32)
np.add.at(count, dig, 1)
# Means
return s / count.clip(min=1)
Example:
print(binned_mean([1.2, 1.8, 2.1, 2.4, 2.7], [1, 2, 3]))
# [1.5 2.4]
There is a slight difference with the histogram in this function though, as np.digitize considers all bins to be half-closed (either right or left), unlike np.histogram which considers the last edge to be closed.

Related

How to group/round/quantize similar values in a numpy array

Is there a numpy method that let's me recover a numpy array's quantized structure if I don't know in advance what the quantized values/levels are, but do know for example that they are spaced > 1.0 apart?
For example:
import numpy as np
x = np.array([0.5, 0.5, 1.75, 1.75, 1.75,6.45,6.45,0.5, 11.1, 0.5, 6.45])
x_noise = x + np.random.randn(len(x))/100
Is there a way to solve for x given just x_noise?

If you don't know anything else about the original quantized values, I think the best you can do is to average over the noise:
sort_idx = np.argsort(x_noise)
# group values that are less than 1.0 apart
splits = np.where(np.diff(x_noise[sort_idx]) > 1.0)[0] + 1
groups = np.split(x_noise[sort_idx], splits)
# reconstruct x with the average values
x_approx = np.empty_like(x_noise)
for idx, group in zip(np.split(sort_idx, splits), groups):
x_approx[idx] = np.mean(group)

Python: define individual binning

I am trying to define my own binning and calculate the mean value of some other columns of my dataframe over these bins. Unfortunately, it only works with integer inputs as you can see below. In this particular case "step_size" defines the step of one bin and I would like to use float values like 0.109 which corresponds to 0.109 seconds. Do you have any idea how I can do this? I think the problem is in the definition of "create_bins" but I cannot fix it...
The goal should be to get this: [(0,0.109),(0.109,0,218),(0.218,0.327) ......]
Greets
# =============================================================================
# Define parameters
# =============================================================================
seconds_min = 0
seconds_max = 9000
step_size = 1
bin_number = int((seconds_max-seconds_min)/step_size)
# =============================================================================
# Define function to create your own individual binning
# lower_bound defines the lowest value of the binning interval
# width defines the width of the binning interval
# quantity defines the number of bins
# =============================================================================
def create_bins(lower_bound, width, quantity):
bins = []
for low in range(lower_bound,
lower_bound + quantity * width + 1, width):
bins.append((low, low+width))
return bins
# =============================================================================
# Create binning list
# =============================================================================
bin_list = create_bins(lower_bound=seconds_min,
width=step_size,
quantity=bin_number)
print(bin_list)

The problem lies in the fact that the range function does not allow for float ranges.
You can use the numeric_range function in more_itertools for this:
from more_itertools import numeric_range
seconds_min = 0
seconds_max = 9
step_size = 0.109
bin_number = int((seconds_max-seconds_min)/step_size)
def create_bins(lower_bound, width, quantity):
bins = []
for low in numeric_range(lower_bound,
lower_bound + quantity * width + 1, width):
bins.append((low, low+width))
return bins
bin_list = create_bins(lower_bound=seconds_min,
width=step_size,
quantity=bin_number)
print(bin_list)
# (0.0, 0.109), (0.109, 0.218), (0.218, 0.327) ... ]

Here's a simple way to do it using zip and numpy's arange. I've put the upper limit at 5, but you can, of course, choose other numbers.
list(zip(np.arange(0, 5, .109), np.arange(.109, 5, .109)))
The result is:
[(0.0, 0.109),
(0.109, 0.218),
(0.218, 0.327),
(0.327, 0.436),
(0.436, 0.545),
(0.545, 0.654),
(0.654, 0.763),
...

no numpy:
max_bin=100
min_bin=0
step_size=0.109
number_of_bins = int(1+((max_bin-min_bin)/step_size)) # +1 to cover the whole interval
bins= []
for a in range(number_of_bins):
bins.append((a*step_size, (a+1)*step_size))

How can binned events be identified based on a common condition for the bins? (scipy.binned_statistic)

I am using scipy.binned_statistic to get the frequency of points within a bin, such that:
h, xedge, yedge, binindex = scipy.stats.binned_statistic_2d(X, Y, Y, statistic='mean', bins=160)
I am able to filter out certain bins using the following:
filter = list(np.argwhere(h > 5).flatten())
From this I can get the bin edges/center from edges and yedges for the data I am interested in.
What is the most pythonic way to get the original data from these bins of interest? For example, how do I get the original data that is contained within the bins which have more than 5 points?

Yes, that's possible with some indexing magic. I am not sure if it is the most Pythonic way, but it should be close.
The solution for 1d using stats.binned_statistic:
from scipy import stats
import numpy as np
values = np.array([1.0, 1.0, 2.0, 1.5, 3.0]) # not used with 'count'
x = np.array([1, 1, 1, 4, 7, 7, 7])
statistic, bin_edges, binnumber = stats.binned_statistic(x, values, 'count', bins=3)
print(statistic)
print(bin_edges)
print(binnumber)
# find the bins with equal or more than three events
# if you are using custom bins where events can be lower or
# higher than your specified bins -> handle this
# get the bin numbers according to some condition
idx_bin = np.where(statistic >= 3)[0]
print(idx_bin)
# A binnumber of i means the corresponding value is
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one
idx_bin += 1
print(idx_bin)
# the rest is easy, get the boolean mask and apply it
is_event = np.in1d(binnumber, idx_bin)
events = x[is_event]
print(events)
For 2d or nd you could use the solution above multiple times and combine the is_event masks for each dimension using np.logical_and (2d) or np.logical_and.reduce((x, y, z)) (nd, see here).
The solution for 2d using stats.binned_statistic_2d is basically the same:
from scipy import stats
import numpy as np
x = np.array([1, 1.5, 2.0, 4, 5.5, 1.5, 7, 1])
y = np.array([1.0, 7.0, 1.0, 3, 7, 7, 7, 1])
values = np.ones_like(x) # not used with 'count'
# check keyword expand_binnumbers, use non-linearized
# as they can be used as indices without flattening
ret = stats.binned_statistic_2d(x,
y,
values,
'count',
bins=2,
expand_binnumbers=True)
print(ret.statistic)
print('binnumber', ret.binnumber)
binnumber = ret.binnumber
statistic = ret.statistic
# find the bins with equal or more than three events
# if you are using custom bins where events can be lower or
# higher than your specified bins -> handle this
# get the bin numbers according to some condition
idx_bin_x, idx_bin_y = np.where(statistic >= 3)#[0]
print(idx_bin_x)
print(idx_bin_y)
# A binnumber of i means the corresponding value is
# between (bin_edges[i-1], bin_edges[i]).
# -> increment the bin indices by one
idx_bin_x += 1
idx_bin_y += 1
print(idx_bin_x)
print(idx_bin_y)
# the rest is easy, get the boolean mask and apply it
is_event_x = np.in1d(binnumber[0], idx_bin_x)
is_event_y = np.in1d(binnumber[1], idx_bin_y)
is_event_xy = np.logical_and(is_event_x, is_event_y)
events_x = x[is_event_xy]
events_y = y[is_event_xy]
print('x', events_x)
print('y', events_y)

Finding anomalous values from sinusoidal data

How can I find anomalous values from following data. I am simulating a sinusoidal pattern. While I can plot the data and spot any anomalies or noise in data, but how can I do it without plotting the data. I am looking for simple approaches other than Machine learning methods.
import random
import numpy as np
import matplotlib.pyplot as plt
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
print("in_array : ", in_array)
out_array = np.sin(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Inject random noise
noise_input = random.uniform(-.5, .5); print("Noise : ",noise_input)
in_array[random.randint(0,len(in_array)-1)] = noise_input
print(in_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
Data with noise

I've thought of the following approach to your problem, since you have only some values that are anomalous in the time vector, it means that the rest of the values have a regular progression, which means that if we gather all the data points in the vector under clusters and calculate the average step for the biggest cluster (which is essentially the pool of values that represent the real deal), then we can use that average to do a triad detection, in a given threshold, over the vector and detect which of the elements are anomalous.
For this we need two functions: calculate_average_step which will calculate that average for the biggest cluster of close values, and then we need detect_anomalous_values which will yield the indexes of the anomalous values in our vector, based on that average calculated earlier.
After we detected the anomalous values, we can go ahead and replace them with an estimated value, which we can determine from our average step value and by using the adjacent points in the vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
def calculate_average_step(array, threshold=5):
"""
Determine the average step by doing a weighted average based on clustering of averages.
array: our array
threshold: the +/- offset for grouping clusters. Aplicable on all elements in the array.
"""
# determine all the steps
steps = []
for i in range(0, len(array) - 1):
steps.append(abs(array[i] - array[i+1]))
# determine the steps clusters
clusters = []
skip_indexes = []
cluster_index = 0
for i in range(len(steps)):
if i in skip_indexes:
continue
# determine the cluster band (based on threshold)
cluster_lower = steps[i] - (steps[i]/100) * threshold
cluster_upper = steps[i] + (steps[i]/100) * threshold
# create the new cluster
clusters.append([])
clusters[cluster_index].append(steps[i])
# try to match elements from the rest of the array
for j in range(i + 1, len(steps)):
if not (cluster_lower <= steps[j] <= cluster_upper):
continue
clusters[cluster_index].append(steps[j])
skip_indexes.append(j)
cluster_index += 1 # increment the cluster id
clusters = sorted(clusters, key=lambda x: len(x), reverse=True)
biggest_cluster = clusters[0] if len(clusters) > 0 else None
if biggest_cluster is None:
return None
return sum(biggest_cluster) / len(biggest_cluster) # return our most common average
def detect_anomalous_values(array, regular_step, threshold=5):
"""
Will scan every triad (3 points) in the array to detect anomalies.
array: the array to iterate over.
regular_step: the step around which we form the upper/lower band for filtering
treshold: +/- variation between the steps of the first and median element and median and third element.
"""
assert(len(array) >= 3) # must have at least 3 elements
anomalous_indexes = []
step_lower = regular_step - (regular_step / 100) * threshold
step_upper = regular_step + (regular_step / 100) * threshold
# detection will be forward from i (hence 3 elements must be available for the d)
for i in range(0, len(array) - 2):
a = array[i]
b = array[i+1]
c = array[i+2]
first_step = abs(a-b)
second_step = abs(b-c)
first_belonging = step_lower <= first_step <= step_upper
second_belonging = step_lower <= second_step <= step_upper
# detect that both steps are alright
if first_belonging and second_belonging:
continue # all is good here, nothing to do
# detect if the first point in the triad is bad
if not first_belonging and second_belonging:
anomalous_indexes.append(i)
# detect the last point in the triad is bad
if first_belonging and not second_belonging:
anomalous_indexes.append(i+2)
# detect the mid point in triad is bad (or everything is bad)
if not first_belonging and not second_belonging:
anomalous_indexes.append(i+1)
# we won't add here the others because they will be detected by
# the rest of the triad scans
return sorted(set(anomalous_indexes)) # return unique indexes
if __name__ == "__main__":
N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
noisy_out_array = np.sin(in_array)
# display noisy sin
plt.figure()
plt.plot(in_array, noisy_out_array, color = 'red', marker = "o");
plt.title("noisy numpy.sin()")
# detect anomalous values
average_step = calculate_average_step(in_array)
anomalous_indexes = detect_anomalous_values(in_array, average_step)
# replace anomalous points with an estimated value based on our calculated average
for anomalous in anomalous_indexes:
# try forward extrapolation
try:
in_array[anomalous] = in_array[anomalous-1] + average_step
# else try backwward extrapolation
except IndexError:
in_array[anomalous] = in_array[anomalous+1] - average_step
# generate sine wave
out_array = np.sin(in_array)
plt.figure()
plt.plot(in_array, out_array, color = 'green', marker = "o");
plt.title("cleaned numpy.sin()")
plt.show()
Noisy sine:
Cleaned sine:

Your problem relies in the time vector (which is of 1 dimension). You will need to apply some sort of filter on that vector.
First thing that came to mind was medfilt (median filter) from scipy and it looks something like this:
from scipy.signal import medfilt
l1 = [0, 10, 20, 30, 2, 50, 70, 15, 90, 100]
l2 = medfilt(l1)
print(l2)
the output of this will be:
[ 0. 10. 20. 20. 30. 50. 50. 70. 90. 90.]
the problem with this filter though is that if we apply some noise values to the edges of the vector like [200, 0, 10, 20, 30, 2, 50, 70, 15, 90, 100, -50] then the output would be something like [ 0. 10. 10. 20. 20. 30. 50. 50. 70. 90. 90. 0.] and obviously this is not ok for the sine plot since it will produce the same artifacts for the sine values array.
A better approach to this problem is to treat the time vector as an y output and it's index values as the x input and do a linear regression on the "time linear function", not the quotes, it just means we're faking the 2 dimensional model by applying a fake X vector. The code implies the use of scipy's linregress (linear regression) function:
from scipy.stats import linregress
l1 = [5, 0, 10, 20, 30, -20, 50, 70, 15, 90, 100]
l1_x = range(0, len(l1))
slope, intercept, r_val, p_val, std_err = linregress(l1_x, l1)
l1 = intercept + slope * l1_x
print(l1)
whose output will be:
[-10.45454545 -1.63636364 7.18181818 16. 24.81818182
33.63636364 42.45454545 51.27272727 60.09090909 68.90909091
77.72727273]
Now let's apply this to your time vector.
import random
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from scipy.stats import linregress
N = 20
# N = 10 # Set signal sample length
t1 = -np.pi # Simulation begins at t1
t2 = np.pi; # Simulation ends at t2
in_array = np.linspace(t1, t2, N)
# add some noise
noise_input = random.uniform(-.5, .5);
in_array[random.randint(0, len(in_array)-1)] = noise_input
# apply filter on time array
in_array_x = range(0, len(in_array))
slope, intercept, r_val, p_val, std_err = linregress(in_array_x, in_array)
in_array = intercept + slope * in_array_x
# generate sine wave
out_array = np.sin(in_array)
print("OUT ARRAY")
print(out_array)
plt.plot(in_array, out_array, color = 'red', marker = "o") ; plt.title("numpy.sin()")
plt.show()
the output will be:
the resulting signal will be an approximation of the original, as it is with any form of extrapolation/interpolation/regression filtering.

Python: binned_statistic_2d mean calculation ignoring NaNs in data

I am using scipy.stats.binned_statistic_2d to bin irregular data onto a uniform grid by finding the mean of points within every bin.
x,y = np.meshgrid(sort(np.random.uniform(0,1,100)),sort(np.random.uniform(0,1,100)))
z = np.sin(x*y)
statistic, xedges, yedges, binnumber = sp.stats.binned_statistic_2d(x.ravel(), y.ravel(), values=z.ravel(), statistic='mean',bins=[np.arange(0,1.1,.1), np.arange(0,1.1,.1)])
plt.figure(1)
plt.pcolormesh(x,y,z, vmin = 0, vmax = 1)
plt.figure(2)
plt.pcolormesh(xedges,yedges,statistic, vmin = 0, vmax = 1)
Produces these plots, as expected:
Scattered data:
Gridded data:
But the data I want to grid has NaNs in it. This is what the result is like when I add NaNs:
x,y = np.meshgrid(sort(np.random.uniform(0,1,100)),sort(np.random.uniform(0,1,100)))
z = np.sin(x*y)
z[50:55,50:55] = np.nan
statistic, xedges, yedges, binnumber = binned_statistic_2d(x.ravel(), y.ravel(), values=z.ravel(), statistic='mean',bins=[np.arange(0,1.1,.1), np.arange(0,1.1,.1)])
plt.figure(3)
plt.pcolormesh(x,y,z, vmin = 0, vmax = 1)
plt.figure(4)
plt.pcolormesh(xedges,yedges,statistic, vmin = 0, vmax = 1)
Scattered:
Gridded:
Obviously if a bin is entirely filled with NaNs, the the resulting mean of that bin should still be NaN. However, I would like bins that are not entirely filled with NaNs to just result in the mean of the non-NaN numbers.
I've tried replacing the "statistic" argument in sp.stats.binned_statistic_2d with np.nanmean. This works, but it goes very very slowly when I use it on large datasets. I've tried digging into the underlying code of `sp.stats.binned_statistic_2d', but I can't figure out exactly how it is calculating the mean, or how to make it ignore NaNs in it's calculation.
Any ideas?

I had the same problem and changed the definition of binned_statistic_dd in scipy.stats and saved a local copy so that it won't be changed if scipy is updated.
I added 'nanmean' to the list of known_stats and
elif statistic == 'nanmean':
result.fill(np.nan)
for i in np.unique(binnumbers):
for vv in builtins.range(Vdim):
result[vv, i] = np.nanmean(values[vv, binnumbers == i])
Full new definition:
def binned_statistic_dd(sample, values, statistic='mean',
bins=10, range=None, expand_binnumbers=False,
binned_statistic_result=None):
"""
Compute a multidimensional binned statistic for a set of data.
This is a generalization of a histogramdd function. A histogram divides
the space into bins, and returns the count of the number of points in
each bin. This function allows the computation of the sum, mean, median,
or other statistic of the values within each bin.
Parameters
----------
sample : array_like
Data to histogram passed as a sequence of N arrays of length D, or
as an (N,D) array.
values : (N,) array_like or list of (N,) array_like
The data on which the statistic will be computed. This must be
the same shape as `sample`, or a list of sequences - each with the
same shape as `sample`. If `values` is such a list, the statistic
will be computed on each independently.
statistic : string or callable, optional
The statistic to compute (default is 'mean').
The following statistics are available:
* 'mean' : compute the mean of values for points within each bin.
Empty bins will be represented by NaN.
* 'median' : compute the median of values for points within each
bin. Empty bins will be represented by NaN.
* 'count' : compute the count of points within each bin. This is
identical to an unweighted histogram. `values` array is not
referenced.
* 'sum' : compute the sum of values for points within each bin.
This is identical to a weighted histogram.
* 'std' : compute the standard deviation within each bin. This
is implicitly calculated with ddof=0. If the number of values
within a given bin is 0 or 1, the computed standard deviation value
will be 0 for the bin.
* 'min' : compute the minimum of values for points within each bin.
Empty bins will be represented by NaN.
* 'max' : compute the maximum of values for point within each bin.
Empty bins will be represented by NaN.
* function : a user-defined function which takes a 1D array of
values, and outputs a single numerical statistic. This function
will be called on the values in each bin. Empty bins will be
represented by function([]), or NaN if this returns an error.
bins : sequence or positive int, optional
The bin specification must be in one of the following forms:
* A sequence of arrays describing the bin edges along each dimension.
* The number of bins for each dimension (nx, ny, ... = bins).
* The number of bins for all dimensions (nx = ny = ... = bins).
range : sequence, optional
A sequence of lower and upper bin edges to be used if the edges are
not given explicitly in `bins`. Defaults to the minimum and maximum
values along each dimension.
expand_binnumbers : bool, optional
'False' (default): the returned `binnumber` is a shape (N,) array of
linearized bin indices.
'True': the returned `binnumber` is 'unraveled' into a shape (D,N)
ndarray, where each row gives the bin numbers in the corresponding
dimension.
See the `binnumber` returned value, and the `Examples` section of
`binned_statistic_2d`.
binned_statistic_result : binnedStatisticddResult
Result of a previous call to the function in order to reuse bin edges
and bin numbers with new values and/or a different statistic.
To reuse bin numbers, `expand_binnumbers` must have been set to False
(the default)
.. versionadded:: 0.17.0
Returns
-------
statistic : ndarray, shape(nx1, nx2, nx3,...)
The values of the selected statistic in each two-dimensional bin.
bin_edges : list of ndarrays
A list of D arrays describing the (nxi + 1) bin edges for each
dimension.
binnumber : (N,) array of ints or (D,N) ndarray of ints
This assigns to each element of `sample` an integer that represents the
bin in which this observation falls. The representation depends on the
`expand_binnumbers` argument. See `Notes` for details.
See Also
--------
numpy.digitize, numpy.histogramdd, binned_statistic, binned_statistic_2d
Notes
-----
Binedges:
All but the last (righthand-most) bin is half-open in each dimension. In
other words, if `bins` is ``[1, 2, 3, 4]``, then the first bin is
``[1, 2)`` (including 1, but excluding 2) and the second ``[2, 3)``. The
last bin, however, is ``[3, 4]``, which *includes* 4.
`binnumber`:
This returned argument assigns to each element of `sample` an integer that
represents the bin in which it belongs. The representation depends on the
`expand_binnumbers` argument. If 'False' (default): The returned
`binnumber` is a shape (N,) array of linearized indices mapping each
element of `sample` to its corresponding bin (using row-major ordering).
If 'True': The returned `binnumber` is a shape (D,N) ndarray where
each row indicates bin placements for each dimension respectively. In each
dimension, a binnumber of `i` means the corresponding value is between
(bin_edges[D][i-1], bin_edges[D][i]), for each dimension 'D'.
.. versionadded:: 0.11.0
Examples
--------
>>> from scipy import stats
>>> import matplotlib.pyplot as plt
>>> from mpl_toolkits.mplot3d import Axes3D
Take an array of 600 (x, y) coordinates as an example.
`binned_statistic_dd` can handle arrays of higher dimension `D`. But a plot
of dimension `D+1` is required.
>>> mu = np.array([0., 1.])
>>> sigma = np.array([[1., -0.5],[-0.5, 1.5]])
>>> multinormal = stats.multivariate_normal(mu, sigma)
>>> data = multinormal.rvs(size=600, random_state=235412)
>>> data.shape
(600, 2)
Create bins and count how many arrays fall in each bin:
>>> N = 60
>>> x = np.linspace(-3, 3, N)
>>> y = np.linspace(-3, 4, N)
>>> ret = stats.binned_statistic_dd(data, np.arange(600), bins=[x, y],
... statistic='count')
>>> bincounts = ret.statistic
Set the volume and the location of bars:
>>> dx = x[1] - x[0]
>>> dy = y[1] - y[0]
>>> x, y = np.meshgrid(x[:-1]+dx/2, y[:-1]+dy/2)
>>> z = 0
>>> bincounts = bincounts.ravel()
>>> x = x.ravel()
>>> y = y.ravel()
>>> fig = plt.figure()
>>> ax = fig.add_subplot(111, projection='3d')
>>> with np.errstate(divide='ignore'): # silence random axes3d warning
... ax.bar3d(x, y, z, dx, dy, bincounts)
Reuse bin numbers and bin edges with new values:
>>> ret2 = stats.binned_statistic_dd(data, -np.arange(600),
... binned_statistic_result=ret,
... statistic='mean')
"""
known_stats = ['mean', 'median', 'count', 'sum', 'std', 'min', 'max',
'nanmean']
if not callable(statistic) and statistic not in known_stats:
raise ValueError('invalid statistic %r' % (statistic,))
try:
bins = index(bins)
except TypeError:
# bins is not an integer
pass
# If bins was an integer-like object, now it is an actual Python int.
# NOTE: for _bin_edges(), see e.g. gh-11365
if isinstance(bins, int) and not np.isfinite(sample).all():
raise ValueError('%r contains non-finite values.' % (sample,))
# `Ndim` is the number of dimensions (e.g. `2` for `binned_statistic_2d`)
# `Dlen` is the length of elements along each dimension.
# This code is based on np.histogramdd
try:
# `sample` is an ND-array.
Dlen, Ndim = sample.shape
except (AttributeError, ValueError):
# `sample` is a sequence of 1D arrays.
sample = np.atleast_2d(sample).T
Dlen, Ndim = sample.shape
# Store initial shape of `values` to preserve it in the output
values = np.asarray(values)
input_shape = list(values.shape)
# Make sure that `values` is 2D to iterate over rows
values = np.atleast_2d(values)
Vdim, Vlen = values.shape
# Make sure `values` match `sample`
if(statistic != 'count' and Vlen != Dlen):
raise AttributeError('The number of `values` elements must match the '
'length of each `sample` dimension.')
try:
M = len(bins)
if M != Ndim:
raise AttributeError('The dimension of bins must be equal '
'to the dimension of the sample x.')
except TypeError:
bins = Ndim * [bins]
if binned_statistic_result is None:
nbin, edges, dedges = _bin_edges(sample, bins, range)
binnumbers = _bin_numbers(sample, nbin, edges, dedges)
else:
edges = binned_statistic_result.bin_edges
nbin = np.array([len(edges[i]) + 1 for i in builtins.range(Ndim)])
# +1 for outlier bins
dedges = [np.diff(edges[i]) for i in builtins.range(Ndim)]
binnumbers = binned_statistic_result.binnumber
result = np.empty([Vdim, nbin.prod()], float)
if statistic == 'mean':
result.fill(np.nan)
flatcount = np.bincount(binnumbers, None)
a = flatcount.nonzero()
for vv in builtins.range(Vdim):
flatsum = np.bincount(binnumbers, values[vv])
result[vv, a] = flatsum[a] / flatcount[a]
elif statistic == 'std':
result.fill(0)
flatcount = np.bincount(binnumbers, None)
a = flatcount.nonzero()
for vv in builtins.range(Vdim):
for i in np.unique(binnumbers):
# NOTE: take std dev by bin, np.std() is 2-pass and stable
binned_data = values[vv, binnumbers == i]
# calc std only when binned data is 2 or more for speed up.
if len(binned_data) >= 2:
result[vv, i] = np.std(binned_data)
elif statistic == 'count':
result.fill(0)
flatcount = np.bincount(binnumbers, None)
a = np.arange(len(flatcount))
result[:, a] = flatcount[np.newaxis, :]
elif statistic == 'sum':
result.fill(0)
for vv in builtins.range(Vdim):
flatsum = np.bincount(binnumbers, values[vv])
a = np.arange(len(flatsum))
result[vv, a] = flatsum
elif statistic == 'median':
result.fill(np.nan)
for i in np.unique(binnumbers):
for vv in builtins.range(Vdim):
result[vv, i] = np.median(values[vv, binnumbers == i])
elif statistic == 'min':
result.fill(np.nan)
for i in np.unique(binnumbers):
for vv in builtins.range(Vdim):
result[vv, i] = np.min(values[vv, binnumbers == i])
elif statistic == 'max':
result.fill(np.nan)
for i in np.unique(binnumbers):
for vv in builtins.range(Vdim):
result[vv, i] = np.max(values[vv, binnumbers == i])
elif statistic == 'nanmean':
result.fill(np.nan)
for i in np.unique(binnumbers):
for vv in builtins.range(Vdim):
result[vv, i] = np.nanmean(values[vv, binnumbers == i])
elif callable(statistic):
with np.errstate(invalid='ignore'), suppress_warnings() as sup:
sup.filter(RuntimeWarning)
try:
null = statistic([])
except Exception:
null = np.nan
result.fill(null)
for i in np.unique(binnumbers):
for vv in builtins.range(Vdim):
result[vv, i] = statistic(values[vv, binnumbers == i])
# Shape into a proper matrix
result = result.reshape(np.append(Vdim, nbin))
# Remove outliers (indices 0 and -1 for each bin-dimension).
core = tuple([slice(None)] + Ndim * [slice(1, -1)])
result = result[core]
# Unravel binnumbers into an ndarray, each row the bins for each dimension
if(expand_binnumbers and Ndim > 1):
binnumbers = np.asarray(np.unravel_index(binnumbers, nbin))
if np.any(result.shape[1:] != nbin - 2):
raise RuntimeError('Internal Shape Error')
# Reshape to have output (`result`) match input (`values`) shape
result = result.reshape(input_shape[:-1] + list(nbin-2))
return BinnedStatisticddResult(result, edges, binnumbers)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Values of each bin - python

i got following problem: hist, edges = np.histogram(data, bins=50) How can i access the values of each bin? I wanted to calculate the avg of each bin. Thanks

Related

How to group/round/quantize similar values in a numpy array

Python: define individual binning

How can binned events be identified based on a common condition for the bins? (scipy.binned_statistic)

Finding anomalous values from sinusoidal data

Python: binned_statistic_2d mean calculation ignoring NaNs in data

Categories

Resources