Matplotlib: How to make a histogram with bins of equal area?


Given some list of numbers following some arbitrary distribution, how can I define bin positions for matplotlib.pyplot.hist() so that the area in each bin is equal to (or close to) some constant area, A? The area should be calculated by multiplying the number of items in the bin by the width of the bin and its value should be no greater than A.
Here is a MWE to display a histogram with normally distributed sample data:
import matplotlib.pyplot as plt
import numpy as np
x = np.random.randn(100)
plt.hist(x, bin_pos)
plt.show()
Here bin_pos is a list representing the positions of the boundaries of the bins (see the related question here).

I found this question intriguing. The solution depends on whether you want to plot a density function, or a true histogram. The latter case turns out to be quite a bit more challenging. Here is more info on the difference between a histogram and a density function.
Density Functions
This will do what you want for a density function:
def histedges_equalN(x, nbin):
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))
x = np.random.randn(1000)
n, bins, patches = plt.hist(x, histedges_equalN(x, 10), density=True)
Note the use of density=True (spelled normed=True in Matplotlib versions before 3.1), which specifies that we're calculating and plotting a density function. In this case the areas are identically equal (you can check by looking at n * np.diff(bins)). Also note that this solution works by finding bins that contain the same number of points.
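As a quick check of the equal-area claim, here is a minimal sketch that recomputes the density with np.histogram and multiplies each bar height by its width. Using np.histogram in place of plt.hist is an assumption made here so that no figure is needed; the seed and sample size are arbitrary.

```python
import numpy as np

def histedges_equalN(x, nbin):
    """Bin edges such that every bin holds the same number of points."""
    npt = len(x)
    return np.interp(np.linspace(0, npt, nbin + 1),
                     np.arange(npt),
                     np.sort(x))

rng = np.random.default_rng(0)  # arbitrary seed, for reproducibility only
x = rng.standard_normal(1000)

edges = histedges_equalN(x, 10)
density, edges = np.histogram(x, bins=edges, density=True)

# Area of each bar = height * width. With equal counts per bin,
# every area is the same fraction of the total, i.e. about 1 / nbin.
areas = density * np.diff(edges)
print(areas)
```

Because each bar's area is just the fraction of points in its bin, equal counts and equal areas are the same requirement in the density case.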
Histograms
Here is a solution that gives approximately equal area boxes for a histogram:
def histedges_equalA(x, nbin):
    pow = 0.5
    dx = np.diff(np.sort(x))
    tmp = np.cumsum(dx ** pow)
    tmp = np.pad(tmp, (1, 0), 'constant')
    return np.interp(np.linspace(0, tmp.max(), nbin + 1),
                     tmp,
                     np.sort(x))
n, bins, patches = plt.hist(x, histedges_equalA(x, 10), density=False)
These boxes, however, are not all equal area. The first and last, in particular, tend to be about 30% larger than the others. This is an artifact of the sparse distribution of the data at the tails of the normal distribution, and I believe it will persist any time there is a sparsely populated region in a data set.
Side note: I played with the value pow a bit, and found that a value of about 0.56 had a lower RMS error for the normal distribution. I stuck with the square-root because it performs best when the data is tightly-spaced (relative to the bin-width), and I'm pretty sure there is a theoretical basis for it that I haven't bothered to dig into (anyone?).
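To see how uneven the areas actually are, a sketch like the following can be used. It re-implements the function above with the exponent exposed as a parameter (the name `power` is introduced here for illustration) and reports the spread of count * width across the bins. The seed and sample size are arbitrary, so the numbers will differ from run to run with other seeds.

```python
import numpy as np

def histedges_equalA(x, nbin, power=0.5):
    """Approximately equal-area bin edges (spacing**power heuristic)."""
    xs = np.sort(x)
    # Cumulative sum of the point spacings raised to `power`, starting at 0
    cum = np.concatenate([[0.0], np.cumsum(np.diff(xs) ** power)])
    return np.interp(np.linspace(0, cum.max(), nbin + 1), cum, xs)

rng = np.random.default_rng(0)  # arbitrary seed
x = rng.standard_normal(1000)

for power in (0.5, 0.56):
    edges = histedges_equalA(x, 10, power)
    counts, _ = np.histogram(x, bins=edges)
    areas = counts * np.diff(edges)   # histogram area = count * width
    rel = areas / areas.mean()
    print(f"power={power}: relative areas from {rel.min():.2f} to {rel.max():.2f}")
```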
The issue with equal-area histograms
As far as I can tell it is not possible to obtain an exact solution to this problem. This is because it is sensitive to the discretization of the data. For example, suppose the first point in your dataset is an outlier at -13 and the next value is at -3, as depicted by the red dots in this image:
Now suppose the total "area" of your histogram is 150 and you want 10 bins. In that case the area of each histogram bar should be about 15, but you can't get there because as soon as your bar includes the second point, its area jumps from 10 to 20. That is, the data does not allow this bar to have an area between 10 and 20. One solution for this might be to adjust the lower-bound of the box to increase its area, but this starts to become arbitrary and does not work if this 'gap' is in the middle of the data set.
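The jump described above can be reproduced with a few lines. This is only an illustration with made-up data (an outlier at -13 followed by points from -3 upward); `first_bar_area` is a hypothetical helper, not part of any library.

```python
import numpy as np

# Hypothetical data: an outlier at -13, then the bulk of the data from -3 upward
x = np.array([-13.0, -3.0, -2.5, -2.0, -1.5])

def first_bar_area(right_edge):
    """Area (count * width) of a bar spanning from min(x) to right_edge."""
    count = np.count_nonzero(x <= right_edge)
    return count * (right_edge - x.min())

eps = 1e-9
area_before = first_bar_area(-3.0 - eps)  # just short of the second point
area_after = first_bar_area(-3.0)         # now includes the second point
print(area_before)  # about 10
print(area_after)   # 20.0
```

No choice of right edge gives this bar an area strictly between 10 and 20, which is why a target of 15 cannot be met.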

Related

Seaborn KDEPlot - not enough variation in data?

I have a data frame containing ~900 rows; I'm trying to plot KDEplots for some of the columns. In some columns, a majority of the values are the same, minimum value. When I include too many of the minimum values, the KDEPlot abruptly stops showing the minimums. For example, the following includes 600 values, of which 450 are the minimum, and the plot looks fine:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:600]
sb.kdeplot(y)
But including 451 of the minimum values gives a very different output:
y = df.sort_values(by='col1', ascending=False)['col1'].values[:601]
sb.kdeplot(y)
Eventually I would like to plot bivariate KDEPlots of different columns against each other, but I'd like to understand this first.

The problem is the default algorithm that is chosen for the "bandwidth" of the kde. The default method is 'scott', which isn't very helpful when there are many equal values.
The bandwidth is the width of the gaussians that are positioned at every sample point and summed up. Lower bandwidths are closer to the data, higher bandwidths smooth everything out. The sweet spot is somewhere in the middle. In this case bw=0.3 could be a good option. In order to compare different kde's it is recommended to each time choose exactly the same bandwidth.
Here is some sample code to show the difference between bw='scott' and bw=0.3. The example data are 150 values from a standard normal distribution together with either 400, 450 or 500 fixed values.
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns; sns.set()
fig, axs = plt.subplots(nrows=2, ncols=3, figsize=(10,5), gridspec_kw={'hspace':0.3})
for i, bw in enumerate(['scott', 0.3]):
    for j, num_same in enumerate([400, 450, 500]):
        y = np.concatenate([np.random.normal(0, 1, 150), np.repeat(-3, num_same)])
        sns.kdeplot(y, bw=bw, ax=axs[i, j])
        axs[i, j].set_title(f'bw:{bw}; fixed values:{num_same}')
plt.show()
The third plot gives a warning that the KDE cannot be drawn using Scott's suggested bandwidth.
PS: As mentioned by @mwascom in the comments, in this case statsmodels.nonparametric.kde is used (not scipy.stats.gaussian_kde). There the default is "scott" - 1.059 * A * nobs ** (-1/5.), where A is min(std(X),IQR/1.34). The min() explains the abrupt change in behavior: once the 25th and 75th percentiles both fall on the repeated value (here, when more than about 75% of the values are identical), the IQR is zero, and so is the bandwidth. IQR is the "interquartile range", the difference between the 75th and 25th percentiles.
Edit: Since Seaborn 0.11, the statsmodels backend has been dropped, so KDEs are only calculated via scipy.stats.gaussian_kde.
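The collapse of the bandwidth can be checked directly from the formula quoted above. The sketch below is a plain NumPy transcription of that formula, not the statsmodels source; `scott_bandwidth` is a hypothetical helper name, and an evenly spaced grid stands in for the 150 random values so that the result is deterministic.

```python
import numpy as np

def scott_bandwidth(x):
    """Bandwidth from the quoted rule: 1.059 * A * n**(-1/5),
    with A = min(std(x), IQR / 1.34)."""
    x = np.asarray(x, dtype=float)
    q25, q75 = np.percentile(x, [25, 75])
    a = min(np.std(x), (q75 - q25) / 1.34)
    return 1.059 * a * x.size ** (-1 / 5)

# 150 varied values (a deterministic stand-in for the random normal draws)
varied = np.linspace(-2, 2, 150)

results = {}
for num_same in (400, 450, 500):
    y = np.concatenate([varied, np.repeat(-3.0, num_same)])
    results[num_same] = scott_bandwidth(y)
    print(num_same, results[num_same])
```

With 500 repeated values out of 650, both quartiles equal -3, so the IQR and therefore the bandwidth are zero; with 400 or 450 the bandwidth is still positive.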

If the sample has repeated values, this implies that the underlying distribution is not continuous. In the data that you show to illustrate the issue, we can see a Dirac distribution on the left. Kernel smoothing can be applied to such data, but with care. Indeed, to approximate such data, we might use a kernel smoothing where the bandwidth associated with the Dirac is zero. However, in most KDE methods, there is only one single bandwidth for all kernel atoms. Moreover, the various rules used to compute the bandwidth are based on some estimate of the roughness of the second derivative of the PDF of the distribution. This cannot be applied to a discontinuous distribution.
We can, however, try to separate the sample into two sub-samples:
the sub-sample(s) with replications,
the sub-sample with unique realizations.
(This idea has already been mentioned by johanc.)
Below is an attempt to perform this classification. The np.unique method is used to count the occurrences of the replicated realizations. The replicated values are associated with Diracs, and the weight in the mixture is estimated from the fraction of these replicated values in the sample. The remaining realizations, the unique ones, are then used to estimate the continuous distribution with KDE.
The following function will be useful in order to overcome a limitation with the current implementation of the draw method of Mixtures with OpenTURNS.
def DrawMixtureWithDiracs(distribution):
    """Draw a distribution which has Diracs.
    https://github.com/openturns/openturns/issues/1489"""
    graph = distribution.drawPDF()
    graph.setLegends(["Mixture"])
    for atom in distribution.getDistributionCollection():
        if atom.getName() == "Dirac":
            curve = atom.drawPDF()
            curve.setLegends(["Dirac"])
            graph.add(curve)
    return graph
The following script creates a use case with a Mixture containing a Dirac and a Gaussian distribution.
import openturns as ot
import numpy as np
distribution = ot.Mixture([ot.Dirac(-3.0),
ot.Normal()], [0.5, 0.5])
DrawMixtureWithDiracs(distribution)
This is the result.
Then we create a sample.
sample = distribution.getSample(100)
This is where your problem begins. We count the number of occurrences of each realization.
array = np.array(sample)
unique, index, count = np.unique(array, axis=0, return_index=True,
return_counts=True)
For all realizations, replicated values are associated with Diracs and unique values are put in a separate list.
sampleSize = sample.getSize()
listOfDiracs = []
listOfWeights = []
uniqueValues = []
for i in range(len(unique)):
    if count[i] == 1:
        uniqueValues.append(unique[i][0])
    else:
        atom = ot.Dirac(unique[i])
        listOfDiracs.append(atom)
        w = count[i] / sampleSize
        print("New Dirac =", unique[i], " with weight =", w)
        listOfWeights.append(w)
The weight of the continuous atom is the complementary of the sum of the weights of the Diracs. This way, the sum of the weights will be equal to 1.
complementaryWeight = 1.0 - sum(listOfWeights)
weights = list(listOfWeights)
weights.append(complementaryWeight)
The easy part comes: the unique realizations can be used to fit a kernel smoothing. The KDE is then added to the list of atoms.
sampleUniques = ot.Sample(uniqueValues, 1)
factory = ot.KernelSmoothing()
kde = factory.build(sampleUniques)
atoms = list(listOfDiracs)
atoms.append(kde)
Et voilà: the Mixture is ready.
mixture_estimated = ot.Mixture(atoms, weights)
The following script compares the initial Mixture and the estimated one.
graph = DrawMixtureWithDiracs(distribution)
graph.setColors(["dodgerblue3", "dodgerblue3"])
curve = DrawMixtureWithDiracs(mixture_estimated)
curve.setColors(["darkorange1", "darkorange1"])
curve.setLegends(["Est. Mixture", "Est. Dirac"])
graph.add(curve)
graph
The figure seems satisfactory, given that the continuous distribution is estimated from a sub-sample whose size is only 50, i.e. one half of the full sample.
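For readers without OpenTURNS, the same split into repeated and unique values can be sketched with NumPy and SciPy alone. This is an illustrative alternative, not the OpenTURNS API used above: the sample is made up (half of it sits at -3), the variable names are introduced here, and scipy.stats.gaussian_kde with its default bandwidth stands in for the kernel smoothing.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)  # arbitrary seed
# Hypothetical sample: 50 copies of -3 (the "Dirac") and 50 normal draws
sample = np.concatenate([np.full(50, -3.0), rng.standard_normal(50)])

# Count how often each distinct value occurs
values, counts = np.unique(sample, return_counts=True)
repeated = counts > 1

# Repeated values become point masses, weighted by their share of the sample
dirac_locations = values[repeated]
dirac_weights = counts[repeated] / sample.size

# The remaining weight goes to a KDE fitted on the unique values only
continuous_weight = 1.0 - dirac_weights.sum()
kde = gaussian_kde(values[~repeated])

def continuous_part_pdf(t):
    """Density of the continuous component, scaled by its mixture weight."""
    return continuous_weight * kde(np.atleast_1d(t))

print(dirac_locations, dirac_weights, continuous_weight)
```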

Distance between two group of values in a numpy array

I have a very basic question which in theory is easy to do (with fewer points and a lot of manual labour in ArcGIS), but I am not able to start at all with the coding to solve this problem (also I am new to complicated python coding).
I have 2 variables, 'Root zone' (RZS) and 'Tree cover' (TC). Both are arrays of 250x186 values (basically grids, with each grid cell having a specific value). The values in TC vary from 0 to 100. Each grid cell is 0.25 degrees in size (this might be helpful in understanding the distance).
My problem is "I want to calculate the distance of each TC value ranging between 50-100 (so each value of TC value greater than 50 at each lat and lon) from the points where nearest TC ranges between 0-30 (less than 30)."
Just take into consideration that we are not looking at the np.nan part of the TC. So the white part in TC is also white in RZS.
What I want to do is create a 2-dimensional scatter plot with X-axis denoting the 'distance of 50-100 TC from 0-30 values', Y-axis denoting 'RZS of those 50-100 TC points'. The above figure might make things more clear.
I wish I could provide some code for this, but I am not even able to get started on the distance part.
Please provide any suggestion on how should I proceed with this.
Let's consider an example:
If you look at the x: 70 and y:70, one can see a lot of points with values from 0-30 of the tree cover all across the dataset. But I only want the distance from the nearest value to my point which falls between 0-30.

The following code might work, with random example data:
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
# Create some completely random data, and include an area of NaNs as well
rzs = np.random.uniform(0, 100, size=(250, 168))
tc = np.random.lognormal(3.0, size=(250, 168))
tc = np.clip(tc, 0, 100)
rzs[60:80,:] = np.nan
tc[60:80,:] = np.nan
plt.subplot(2,2,1)
plt.imshow(rzs)
plt.colorbar()
plt.subplot(2,2,2)
plt.imshow(tc)
plt.colorbar()
Now do the real work:
# Select the indices of the low- and high-valued points
# This may result in warnings here because of NaNs;
# the NaNs should be filtered out in the indices, since they will
# compare to False in all the comparisons, and thus not be
# indexed by 'low' and 'high'
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)
# Get the coordinates for the low- and high-valued points,
# combine and transpose them to be in the correct format
y, x = np.where(low)
low_coords = np.array([x, y]).T
y, x = np.where(high)
high_coords = np.array([x, y]).T
# We now calculate the distances between *all* low-valued points and *all* high-valued points.
# Both the run time and the memory cost (of the output) scale as
# O(n_low * n_high), so be wary when using it with large input sizes.
from scipy.spatial.distance import cdist
distances = cdist(low_coords, high_coords)
# For each high-valued point, find the distance to the nearest low-valued point.
# The low-valued points run along the first axis of `distances`,
# so the minimum is taken along axis 0.
mindistances = distances.min(axis=0)
# np.where lists the points in row-major order, which is the same order
# that boolean indexing uses, so rzs[high] lines up with high_coords.
minrzs = rzs[high]
plt.scatter(mindistances, minrzs)
The resulting plot looks a bit weird, since there are rather discrete distances because of the grid (1, sqrt(1^2+1^2), 2, sqrt(1^2+2^2), sqrt(2^2+2^2), 3, sqrt(1^2+3^2), ...); this is because the TC values are randomly distributed, and thus low values may end up directly adjacent to high values (and because we're looking for minimum distances, most plotted points are for these cases). The vertical distribution is because the RZS values were uniformly distributed between 0 and 100.
This is simply a result of the input example data, which is not too representative of the real data.
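If the full distance matrix is too large to hold in memory, a nearest-neighbour tree avoids building it. The following is a sketch of that alternative using scipy.spatial.cKDTree on a smaller random grid; it is a swapped-in technique rather than the approach of the answer above. The grid size and seed are arbitrary, and converting grid steps to degrees by multiplying with 0.25 is an assumption that treats the grid as flat.

```python
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)  # arbitrary seed
# Small random stand-ins for the real TC and RZS grids, with a band of NaNs
tc = np.clip(rng.lognormal(3.0, size=(50, 40)), 0, 100)
rzs = rng.uniform(0, 100, size=tc.shape)
tc[10:15, :] = np.nan
rzs[10:15, :] = np.nan

# NaNs compare as False, so they end up in neither mask
low = (tc >= 0) & (tc <= 30)
high = (tc >= 50) & (tc <= 100)

# (row, col) coordinates of the selected cells, in row-major order
low_coords = np.argwhere(low)
high_coords = np.argwhere(high)

# For every high-valued cell, look up the single nearest low-valued cell
tree = cKDTree(low_coords)
mindistances, nearest_low = tree.query(high_coords)

high_rzs = rzs[high]               # same row-major order as high_coords
distance_deg = mindistances * 0.25  # assumed conversion: 0.25 degrees per cell
print(mindistances[:5], high_rzs[:5])
```

The arrays `distance_deg` and `high_rzs` have one entry per high-valued cell and can be passed straight to a scatter plot.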

Standard deviation of binned values with `scipy.stats.binned_statistic`

When I bin my data with scipy.stats.binned_statistic (see here for example), how do I get the error (that is, the standard deviation) on the average binned values?
For example, if I bin my data as follows:
windspeed = 8 * np.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * np.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
boatspeed, statistic='median', bins=[1,2,3,4,5,6,7])
plt.figure()
plt.plot(windspeed, boatspeed, 'b.', label='raw data')
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
label='binned statistic of data')
plt.legend()
how do I get the standard deviation on the bin_means?

The way to go about this is to construct a probability density estimate from the histogram (this is just a question of normalizing the histogram appropriately), and then computing the standard deviation or any other statistic for the estimated density.
The appropriate normalization is whatever is needed to get the area under the histogram to be 1. As for computing statistics for the density estimate, work from the definition of the statistic as integral(p(x)*f(x), x, -infinity, +infinity), substituting the density estimate for p(x) and whatever is needed for f(x), e.g. x and x^2 to get the first and second moments, from which you calculate the variance and then the standard deviation.
I'll post some formulas tomorrow, or maybe someone else wants to give it a try in the meantime. You might be able to look up some formulas, but my advice is to always try to work out the answer before resorting to looking it up.
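In the meantime, the recipe above can be sketched numerically. The example below is one possible reading of it, under the assumption that the integral is approximated by treating each bin's probability as concentrated at the bin centre; the data, bin count, and seed are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
data = rng.normal(10.0, 5.0, 100_000)

# density=True normalizes the histogram so that its area is 1
density, edges = np.histogram(data, bins=200, density=True)
widths = np.diff(edges)
centers = 0.5 * (edges[:-1] + edges[1:])

# Probability mass per bin: p(x) * dx
prob = density * widths

# First and second moments: integral of p(x)*x and p(x)*x^2
mean = np.sum(prob * centers)
second_moment = np.sum(prob * centers**2)
std = np.sqrt(second_moment - mean**2)
print(mean, std)
```

With narrow bins the result is close to the mean and standard deviation computed directly from the data.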

Maybe I'm a bit late to answer, but I was wondering how to do the same thing and came across this question. I think calculating it with stats.binned_statistic_2d should be possible, but I haven't figured it out yet. For now I calculated it manually, like so (note that in my code I use a fixed number of equally spaced bins):
import numpy
from matplotlib import pyplot
from scipy import stats
windspeed = 8 * numpy.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * numpy.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
                boatspeed, statistic='median', bins=10)
stds = []
# Match each value to the bin number it belongs to
# (wrapped in list() so it can be iterated once per bin in Python 3)
pairs = list(zip(boatspeed, binnumber))
# Calculate stdev for all elements inside each bin
for n in sorted(set(binnumber)):  # Iterate over each non-empty bin, in order
    in_bin = [x for x, nbin in pairs if nbin == n]  # Get all elements inside bin n
    stds.append(numpy.std(in_bin))
# Calculate the locations of the bins' centers, for plotting
bin_centers = []
for i in range(len(bin_edges) - 1):
    center = bin_edges[i] + (float(bin_edges[i + 1]) - float(bin_edges[i]))/2.
    bin_centers.append(center)
# Plot means
pyplot.figure()
pyplot.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
              label='binned statistic of data')
# Plot stdev as vertical lines, probably can also be done with errorbar
pyplot.vlines(bin_centers, bin_means - stds, bin_means + stds)
pyplot.legend()
pyplot.show()
Resulting plot (minus the data points):
You have to be careful with the bins. In the code I'm working on using this, one of the bins has no points and I have to adjust my calculations of the stdev accordingly.

Just change the statistic in this line to 'std':
bin_std, bin_edges, binnumber = stats.binned_statistic(windspeed,
boatspeed, statistic='std', bins=[1,2,3,4,5,6,7])
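A slightly fuller sketch of this approach follows, using the same bin edges for the mean, the standard deviation, and the count. The standard error of the mean computed at the end (std divided by the square root of the count) is an addition made here as one common way to express the error on a bin average; the seed is arbitrary.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # arbitrary seed
windspeed = 8 * rng.random(500)
boatspeed = 0.3 * windspeed**0.5 + 0.2 * rng.random(500)
edges = [1, 2, 3, 4, 5, 6, 7]

# Same binning three times, with a different statistic each time
bin_means, bin_edges, binnumber = stats.binned_statistic(
    windspeed, boatspeed, statistic='mean', bins=edges)
bin_std, _, _ = stats.binned_statistic(
    windspeed, boatspeed, statistic='std', bins=edges)
bin_count, _, _ = stats.binned_statistic(
    windspeed, boatspeed, statistic='count', bins=edges)

# Spread of the values in each bin vs. uncertainty of each bin mean
bin_sem = bin_std / np.sqrt(bin_count)
print(bin_means, bin_std, bin_sem)
```

The arrays can be drawn with plt.errorbar, using the bin centres as x, bin_means as y, and bin_std or bin_sem as yerr.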

Need help weighting (scaling) each of the bins in a histogram by a different factor

I'm trying to make a histogram of the radial distribution of a circular scattering of particles, and I'm trying to scale the histogram so that the radial distribution is in particles per unit area.
Disclaimer: If you don't care about the math behind what I'm talking about, just skip over this section:
I'm splitting the radial distribution into annuli of equal width, going out from the center. So, in the center, I will have a circle of some radius, a. The area of this innermost portion will be $\pi a^{2}$.
Now if we want to know the area of the annulus going from radial distance a to 2a, we do $$ \int_{a}^{2a} 2 \pi r \ dr = 3 \pi a^{2} $$
Continuing in a similar fashion (going from 2a to 3a, 3a to 4a, etc.) we see that the areas increase as follows: $$ Areas = \pi a^{2}, 3 \pi a^{2}, 5 \pi a^{2}, 7 \pi a^{2}, ... $$
So, when I weight the histogram for the radial distribution of my scatter, going out from the center, each bin will have to be weighted so that the count of first bin is left alone, the count of the second bin is divided by 3, the count of the third bin is divided by 5, etc, etc.
So: Here's my try at the code:
import numpy as np
import matplotlib.pyplot as plt
# making random sample of 100000 points between -2.5 and 2.5
y_vec = 5*np.random.random(100000) - 2.5
z_vec = 5*np.random.random(100000) - 2.5
# blank canvasses for the y, z, and radial arrays
y_vec2 = []
z_vec2 = []
R_vec = []
# number of bins I want in the ending histogram
bns = 40
# cutting out the random samplings that aren't in a circular distribution
# and making the radial array
for i in range(0, 100000):
    if np.sqrt((y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i])) <= 2.5:
        y_vec2.append(y_vec[i])
        z_vec2.append(z_vec[i])
        R_vec.append(np.sqrt(y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i]))
# setting up the figures and plots
fig, ax = plt.subplots()
fig2, hst = plt.subplots()
# creating a weighting array for the histogram
wghts = []
i = 0
c = 1
# making the weighting array so that each of the bins will be weighted correctly
# (splitting the radial array up evenly in to groups of the size the bins will be
# and weighting them appropriately). I assumed that, because the documentation says
# the "weights" array has to be the same size as the "x" initial input, the
# weights act on each point individually...
while i < bns:
    wghts.extend((1/c)*np.ones(len(R_vec)/bns))
    c = c + 2
    i = i + 1
# Making the plots
ax.scatter(y_vec2, z_vec2)
hst.hist(R_vec, bins = bns, weights = wghts)
# plotting
plt.show()
The scatter plot looks great:
But the radial plot suggests that I got the weighting wrong. It should be constant across all annuli, but it is increasing, as though it were not weighted at all:
The erratic look of the Radial Distribution suggests to me that the weighting function in the "hist" operator weights each member of R_vec individually instead of weighting the bins.
How would I weight the bins by the factors I need to scale them by? Any help?

You are correct when you surmise that the weights weight the individual values and not the bins. This is documented:
Each value in x only contributes its associated weight towards the bin count (instead of 1).
Therefore the basic problem is that, in calculating the weights, you aren't taking account of the order of the points. You created points at random, but then you create the weights in sequence from greatest to least. This means you're not assigning the right weights to the right points.
The way you should create the weights is by directly computing each point's weight from its radius. The way you seem to want to do this is by discretizing the radius into a binned radius, then weighting inversely by that. Instead of what you're doing for the weights, try this:
R_vec = np.array(R_vec)
wghts = 1 / (2*(R_vec//(2.5/bns))+1)
This gives me the right result:
You can also get essentially the same result without doing the binning in the weighting, that is, by directly weighting each point by the reciprocal of its radius:
R_vec = np.array(R_vec)
wghts = 1 / R_vec
The advantage of doing this is that you can then plot a histogram with a different number of bins without recomputing the weights. It also makes somewhat more conceptual sense to weight each point by how far out it is in a continuous sense, not by whether it falls on one side or the other of a discrete bin boundary.
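As a self-contained check of the per-point weighting, the sketch below generates uniform points in a disc with vectorized NumPy calls and compares the weighted bin counts. The point count, the number of bins, and the explicit range=(0, radius) passed to np.histogram (so that the bins coincide with the annuli used for the weights) are choices made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
radius, bns = 2.5, 20

# Uniform points in a square, then keep only those inside the disc
y = rng.uniform(-radius, radius, 400_000)
z = rng.uniform(-radius, radius, 400_000)
r = np.hypot(y, z)
r = r[r < radius]

# Index of the annulus each point falls in: 0, 1, 2, ...
bin_index = r // (radius / bns)
# Annulus k has area (2k + 1) * pi * a^2, so weight each point by 1 / (2k + 1)
weights = 1.0 / (2 * bin_index + 1)

counts, edges = np.histogram(r, bins=bns, range=(0, radius), weights=weights)
rel = counts / counts.mean()
print(rel.round(2))
```

The relative counts scatter around 1, with the innermost bins noisiest because they contain the fewest points.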

When you want to plot something "per unit area", use area as your independent variable.
This way, you can still use a histogram if you like, but you don't have to worry about non-uniform binning or weighting.
I replaced your line:
hst.hist(R_vec, bins = bns, weights = wghts)
with:
hst.hist(np.pi*np.square(R_vec),bins=bns)
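A quick way to convince yourself that this works is to histogram the enclosed area for uniformly scattered points and check that the bars come out flat. The sketch below does that with np.histogram instead of a plot; the seed and point count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
radius = 2.5

# Uniform points in a square, restricted to the disc of the given radius
y = rng.uniform(-radius, radius, 100_000)
z = rng.uniform(-radius, radius, 100_000)
r = np.hypot(y, z)
r = r[r <= radius]

# Area enclosed by each point's radius; equal-width bins in this
# variable correspond to annuli of equal area
area = np.pi * np.square(r)
counts, edges = np.histogram(area, bins=40)
rel = counts / counts.mean()
print(rel.round(2))
```

Since every bin now covers the same area, the raw counts are already proportional to particles per unit area and no weights are needed.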

probability density function from histogram in python to fit another histogram

I have a question concerning fitting and getting random numbers.
Situation is as such:
Firstly I have a histogram from data points.
import numpy as np
"""create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu,sigma,n)
""" create normalized histogram of the data """
bins = np.linspace(0,20,21)
H, bins = np.histogram(datapoints,bins,density=True)
I would like to interpret this histogram as probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers AND also I would like to use that function to fit another histogram.
Thanks for your help

You can use a cumulative distribution function to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth cumulative distribution function is not entirely trivial; you can use interpolation, for example scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), or any other form of tail appropriate to your problem, or simply truncate the distribution.
Example:
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len( b )):
    u = random.uniform( x[0], x[-1] )
    b[i] = inverse_density_function( u )
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()
This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
P.S. You could also try to fit a specific known distribution described by a few values (which I think is what you had mentioned in the question) but the above non-parametric approach is more general-purpose.
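For the histogram from the question, the same idea can be written without a Python loop. The sketch below is a vectorized variant that uses np.interp in place of interp1d and anchors the cumulative distribution at 0 on the left bin edge; both are choices made here, not part of the example above. It truncates the distribution to the binned range [0, 20], so the tails are ignored.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed
datapoints = rng.normal(10, 5, 1000)

# Normalized histogram, as in the question
bins = np.linspace(0, 20, 21)
H, bins = np.histogram(datapoints, bins, density=True)

# Cumulative distribution at the bin edges: 0 at the first edge, 1 at the last
cdf = np.concatenate([[0.0], np.cumsum(H * np.diff(bins))])

# Inverse transform sampling: map uniform numbers through the inverse CDF
u = rng.random(10_000)
samples = np.interp(u, cdf, bins)

print(samples.min(), samples.max(), samples.mean())
```

Interpolating linearly between bin edges amounts to treating the density as constant within each bin, so a histogram of `samples` with the same bins reproduces H up to sampling noise.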
