find mean bin values using histogram2d python [duplicate] - python

How do you calculate the mean values for bins with a 2D histogram in python? I have temperature ranges for the x and y axis and I am trying to plot the probability of lightning using bins for the respective temperatures. I am reading in the data from a csv file and my code is such:
filename = 'Random_Events_All_Sorted_85GHz.csv'
df = pd.read_csv(filename)
min37 = df.min37
min85 = df.min85
verification = df.five_min_1
#Numbers
x = min85
y = min37
H = verification
#Estimate the 2D histogram
nbins = 4
H, xedges, yedges = np.histogram2d(x,y,bins=nbins)
#Rotate and flip H
H = np.rot90(H)
H = np.flipud(H)
#Mask zeros
Hmasked = np.ma.masked_where(H==0,H)
#Plot 2D histogram using pcolor
fig1 = plt.figure()
plt.pcolormesh(xedges,yedges,Hmasked)
plt.xlabel('min 85 GHz PCT (K)')
plt.ylabel('min 37 GHz PCT (K)')
cbar = plt.colorbar()
cbar.ax.set_ylabel('Probability of Lightning (%)')
plt.show()
This makes a nice looking plot, but the data that is plotted is the count, or number of samples that fall into each bin. The verification variable is an array that contains 1's and 0's, where a 1 indicates lightning and a 0 indicates no lightning. I want the data in the plot to be the probability of lightning for a given bin based on the data from the verification variable - thus I need bin_mean*100 in order to get this percentage.
I tried using an approach similar to what is shown here (binning data in python with scipy/numpy), but I was having difficulty getting it to work for a 2D histogram.

There is an elegant and fast way to do this: use the weights parameter to sum the values in each bin:
denominator, xedges, yedges = np.histogram2d(x, y, bins=nbins)
numerator, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=verification)
Then all you need to do is divide the sum of values in each bin by the number of events in that bin:
result = numerator / denominator.clip(1)
The clip(1) avoids division by zero: empty bins have a sum of 0, so their result is 0. Multiply result by 100 to get a percentage.
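As a self-contained check of the weights approach, here is a minimal sketch on synthetic data. The arrays `x`, `y`, and `flag` are made up for illustration and merely stand in for `min85`, `min37`, and `verification`; marking empty bins as NaN instead of clipping is a choice of this sketch, not part of the answer above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for the two temperature columns and the 0/1 flag.
x = rng.uniform(150.0, 300.0, size=2000)
y = rng.uniform(150.0, 300.0, size=2000)
flag = (rng.random(2000) < 0.3).astype(float)

nbins = 4
counts, xedges, yedges = np.histogram2d(x, y, bins=nbins)
sums, _, _ = np.histogram2d(x, y, bins=[xedges, yedges], weights=flag)

# Per-bin mean of the flag as a percentage; empty bins become NaN.
with np.errstate(invalid="ignore", divide="ignore"):
    percent = 100.0 * sums / counts
percent[counts == 0] = np.nan
```

When plotting with `plt.pcolormesh(xedges, yedges, ...)`, pass `percent.T`, because `np.histogram2d` puts x along the first axis.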

This is doable at least with the following method
# xedges, yedges as returned by 'histogram2d'
# create an array for the output quantities
avgarr = np.zeros((nbins, nbins))
# determine the X and Y bins each sample coordinate belongs to
xbins = np.digitize(x, xedges[1:-1])
ybins = np.digitize(y, yedges[1:-1])
# calculate the bin sums (note: if you have very many samples, 'bincount'
# is more efficient, but it requires some index arithmetic)
for xb, yb, v in zip(xbins, ybins, verification):
    avgarr[yb, xb] += v
# replace 0s in H by NaNs (avoids divide-by-zero complaints)
# if you do not have any further use for H after plotting, the
# copy operation is unnecessary, and this will then also take care
# of the masking (NaNs are plotted transparent)
divisor = H.copy()
divisor[divisor==0.0] = np.nan
# calculate the average
avgarr /= divisor
# now 'avgarr' contains the averages (NaNs for no-sample bins)
If you know the bin edges beforehand, you can do the histogram part in the same loop just by adding one line.
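The `bincount` variant mentioned in the code comment could look roughly like the sketch below. The function name `binned_mean_2d` is hypothetical, and the layout (rows index y, columns index x) is chosen to match `avgarr[yb, xb]` above.

```python
import numpy as np

def binned_mean_2d(x, y, values, xedges, yedges):
    """Mean of `values` per 2D bin; rows index y, columns index x."""
    nx, ny = len(xedges) - 1, len(yedges) - 1
    # Bin index of every sample along each axis (0 .. n-1). Using only the
    # inner edges means out-of-range samples land in the outermost bins.
    xb = np.digitize(x, xedges[1:-1])
    yb = np.digitize(y, yedges[1:-1])
    # Flatten each (row, column) pair into a single index for bincount.
    flat = yb * nx + xb
    sums = np.bincount(flat, weights=values, minlength=nx * ny)
    counts = np.bincount(flat, minlength=nx * ny)
    # Empty bins give 0/0, which is reported as NaN.
    with np.errstate(invalid="ignore", divide="ignore"):
        mean = sums / counts
    return mean.reshape(ny, nx)
```

This avoids the Python-level loop entirely, which matters mostly when the number of samples is large.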

Related

Finding corresponding bins between two data sets

So I have two data sets which overlap in their parameter space:
I want to bin up the red set and find the standard deviation of each bin. Then for each point in the blue set, I want to find which red bin that point corresponds to and grab the standard deviation calculated for that bin.
So far, I've been using scipy.stats.binned_statistic_2d, but I'm not sure where to go from here:
import scipy.stats
import numpy as np
# given numpy recarrays red_set and blue_set with columns x,y,values
nbins = 50
red_bins = scipy.stats.binned_statistic_2d(red_set['x'],
                                           red_set['y'],
                                           red_set['values'],
                                           statistic=np.std,
                                           bins=nbins)
blue_bins = scipy.stats.binned_statistic_2d(blue_set['x'],
                                            blue_set['y'],
                                            blue_set['values'],
                                            statistic='count',
                                            bins=[red_bins[1], red_bins[2]])
Now, I don't know how to get the value of the corresponding red bin for each blue point. I know that the fourth return value of scipy.stats.binned_statistic_2d is a binnumber for each input data point, but I don't know how to translate that to the actual calculated statistic (standard deviation in this example).
I know that the blue set is getting binned exactly the same as the red (a quick plot will confirm this). It seems like it should be totally straightforward to grab the corresponding red bin, but I can't figure it out.
Let me know if I can make my question clearer
You need to make sure you specify the same range when binning the data. That way, the corresponding indices of the bins will be consistent. I've used the lower-level NumPy function np.histogram2d; the extension to standard deviations can be done in the same way using scipy.stats.binned_statistic_2d:
import numpy as np
import matplotlib.pyplot as plt
#Setup random data
red = np.random.randn(100,2)
blue = np.random.randn(100,2)
#plot
plt.plot(red[:,0],red[:,1],'r.')
plt.plot(blue[:,0],blue[:,1],'b.')
#Specify limits of binned data
xmin = -3.; xmax = 3.
ymin = -3.; ymax = 3.
#Bin data using histogram2d
rbins, xrb, yrb = np.histogram2d(red[:,0],red[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
bbins, xbb, ybb = np.histogram2d(blue[:,0],blue[:,1],bins=10,range=[[xmin,xmax],[ymin,ymax]])
#Check that bins correspond to the same positions in space
assert all(xrb == xbb)
assert all(yrb == ybb)
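#Obtain centers of the bins and plot the difference
xc = xrb[:-1] + 0.5 * (xrb[1:] - xrb[:-1])
yc = yrb[:-1] + 0.5 * (yrb[1:] - yrb[:-1])
#histogram2d puts x along the first axis, contourf expects it along the second, hence the transpose
plt.contourf(xc, yc, (rbins-bbins).T, alpha=0.4)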
plt.colorbar()
plt.show()
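For the lookup the question actually asks about (the red statistic for each blue point), one possible sketch is shown below. The function name `red_stat_for_blue` and its argument names are hypothetical; it bins the red set with `scipy.stats.binned_statistic_2d` and then locates each blue point on the same edges with `np.digitize`.

```python
import numpy as np
from scipy import stats

def red_stat_for_blue(red_x, red_y, red_values, blue_x, blue_y, nbins=50):
    """Standard deviation of the red bin that each blue point falls into."""
    red_std, x_edges, y_edges, _ = stats.binned_statistic_2d(
        red_x, red_y, red_values, statistic="std", bins=nbins
    )
    blue_x = np.asarray(blue_x)
    blue_y = np.asarray(blue_y)
    # Zero-based bin index of every blue point along each axis.
    ix = np.digitize(blue_x, x_edges) - 1
    iy = np.digitize(blue_y, y_edges) - 1
    # Blue points outside the red grid have no matching bin. A point lying
    # exactly on the outermost upper edge is also treated as outside here.
    inside = (ix >= 0) & (ix < nbins) & (iy >= 0) & (iy < nbins)
    out = np.full(blue_x.shape, np.nan)
    out[inside] = red_std[ix[inside], iy[inside]]
    return out
```

Blue points that land in a red bin with no red samples also come back as NaN, because the statistic of an empty bin is NaN. The `binnumber` output of `binned_statistic_2d` could serve the same purpose, but its indices also count the outlier bins, so the sketch uses `np.digitize` to keep the indexing explicit.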

Need help weighting (scaling) each of the bins in a histogram by a different factor

I'm trying to make a histogram of the radial distribution of a circular scattering of particles, and I'm trying to scale the histogram so that the radial distribution is in particles per unit area.
Disclaimer: If you don't care about the math behind what I'm talking about, just skip over this section:
I'm splitting the radial distribution into annuli of equal width, going out from the center. So, in the center, I will have a circle of some radius, a. The area of this innermost portion will be $\pi a^{2}$.
Now if we want to know the area of the annulus going from radial distance a to 2a, we do $$ \int_{a}^{2a} 2 \pi r \ dr = 3 \pi a^{2} $$
Continuing in a similar fashion (going from 2a to 3a, 3a to 4a, etc.) we see that the areas increase as follows: $$ Areas = \pi a^{2}, 3 \pi a^{2}, 5 \pi a^{2}, 7 \pi a^{2}, ... $$
So, when I weight the histogram for the radial distribution of my scatter, going out from the center, each bin will have to be weighted so that the count of first bin is left alone, the count of the second bin is divided by 3, the count of the third bin is divided by 5, etc, etc.
So: Here's my try at the code:
import numpy as np
import matplotlib.pyplot as plt
# making random sample of 100000 points between -2.5 and 2.5
y_vec = 5*np.random.random(100000) - 2.5
z_vec = 5*np.random.random(100000) - 2.5
# blank canvasses for the y, z, and radial arrays
y_vec2 = []
z_vec2 = []
R_vec = []
# number of bins I want in the ending histogram
bns = 40
# cutting out the random samplings that aren't in a circular distribution
# and making the radial array
for i in range(0, 100000):
    if np.sqrt((y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i])) <= 2.5:
        y_vec2.append(y_vec[i])
        z_vec2.append(z_vec[i])
        R_vec.append(np.sqrt(y_vec[i]*y_vec[i] + z_vec[i]*z_vec[i]))
# setting up the figures and plots
fig, ax = plt.subplots()
fig2, hst = plt.subplots()
# creating a weighting array for the histogram
wghts = []
i = 0
c = 1
# making the weighting array so that each of the bins will be weighted correctly
# (splitting the radial array up evenly into groups of the size the bins will be
# and weighting them appropriately). I assumed that because the documentation says
# the "weights" array has to be the same size as the "x" initial input, the
# weights act on each point individually...
while i < bns:
    wghts.extend((1.0/c)*np.ones(len(R_vec)//bns))
    c = c + 2
    i = i + 1
# Making the plots
ax.scatter(y_vec2, z_vec2)
hst.hist(R_vec, bins = bns, weights = wghts)
# plotting
plt.show()
The scatter plot looks great:
But the radial plot suggests that I got the weighting wrong. It should be constant across all annuli, but it is increasing, as though it were not weighted at all:
The erratic look of the Radial Distribution suggests to me that the weighting function in the "hist" operator weights each member of R_vec individually instead of weighting the bins.
How would I weight the bins by the factors I need to scale them by? Any help?
You are correct when you surmise that the weights weight the individual values and not the bins. This is documented:
Each value in x only contributes its associated weight towards the bin count (instead of 1).
Therefore the basic problem is that, in calculating the weights, you aren't taking account of the order of the points. You created points at random, but then you create the weights in sequence from greatest to least. This means you're not assigning the right weights to the right points.
The way you should create the weights is by directly computing each point's weight from its radius. The way you seem to want to do this is by discretizing the radius into a binned radius, then weighting inversely by that. Instead of what you're doing for the weights, try this:
R_vec = np.array(R_vec)
wghts = 1 / (2*(R_vec//(2.5/bns))+1)
This gives me the right result:
You can also get essentially the same result without doing the binning in the weighting, that is, by directly weighting each point by the reciprocal of its radius:
R_vec = np.array(R_vec)
wghts = 1 / R_vec
The advantage of doing this is that you can then plot a histogram with a different number of bins without recomputing the weights. It also makes somewhat more conceptual sense to weight each point by how far out it is in a continuous sense, not by whether it falls on one side or the other of a discrete bin boundary.
When you want to plot something "per unit area", use area as your independent variable.
This way, you can still use a histogram if you like, but you don't have to worry about non-uniform binning or weighting.
I replaced your line:
hst.hist(R_vec, bins = bns, weights = wghts)
with:
hst.hist(np.pi*np.square(R_vec),bins=bns)
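If the goal is literally to scale each bin by its own factor, a third option is to compute the counts with `np.histogram` and divide them by the annulus areas afterwards. The sketch below assumes equal-width radial bins from 0 to `r_max`; the function name `radial_density` is hypothetical.

```python
import numpy as np

def radial_density(r, n_bins, r_max):
    """Counts per unit area in equal-width annuli from 0 to r_max."""
    counts, edges = np.histogram(r, bins=n_bins, range=(0.0, r_max))
    # Area of each annulus: pi * (r_outer^2 - r_inner^2).
    areas = np.pi * (edges[1:] ** 2 - edges[:-1] ** 2)
    return counts / areas, edges

rng = np.random.default_rng(1)
# Uniform points in a square, keeping only those inside the circle.
yz = rng.uniform(-2.5, 2.5, size=(100_000, 2))
r = np.hypot(yz[:, 0], yz[:, 1])
r = r[r <= 2.5]

density, edges = radial_density(r, n_bins=40, r_max=2.5)
```

The result can be drawn with `plt.bar(edges[:-1], density, width=np.diff(edges), align='edge')`. For a uniform scatter the bars should be roughly flat, with more noise in the innermost bins because they contain few points.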

Calculate the Cumulative Distribution Function (CDF) in Python

How can I calculate in python the Cumulative Distribution Function (CDF)?
I want to calculate it from an array of points I have (discrete distribution), not with the continuous distributions that, for example, scipy has.
(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF into a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
If you have a discrete array of samples, and you would like to know the CDF of the sample, then you can just sort the array. If you look at the sorted result, you'll realize that the smallest value represents 0% , and largest value represents 100 %. If you want to know the value at 50 % of the distribution, just look at the array element which is in the middle of the sorted array.
Let us have a closer look at this with a simple example:
import matplotlib.pyplot as plt
import numpy as np
# create some randomly distributed data:
data = np.random.randn(10000)
# sort the data:
data_sorted = np.sort(data)
# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)
# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')
This gives the following plot, where the right-hand plot is the traditional cumulative distribution function. It should reflect the CDF of the process behind the points, but naturally it is only an approximation as long as the number of points is finite.
This function is easy to invert, and it depends on your application which form you need.
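Both directions can be sketched with standard NumPy calls. The helper name `ecdf_at` is hypothetical, and it uses the k/n convention (fraction of samples less than or equal to x), which differs slightly from the k/(n-1) scaling of `p` above.

```python
import numpy as np

def ecdf_at(sample_sorted, x):
    """Fraction of samples less than or equal to x."""
    return np.searchsorted(sample_sorted, x, side="right") / len(sample_sorted)

data_sorted = np.sort(np.array([3.0, 1.0, 4.0, 1.0, 5.0]))

# Forward direction: probability for a given value.
p_at_3 = ecdf_at(data_sorted, 3.0)

# Inverse direction: value for a given probability (here the median).
median = np.quantile(data_sorted, 0.5)
```

Which of the two forms is more useful depends on whether the application starts from a value or from a probability.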
Assuming you know how your data is distributed (i.e. you know the pdf of your data), then scipy does support discrete data when calculating cdf's
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete
# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()
We can even print the first few values of the cdf to show they are discrete
print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])
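The same call also accepts multi-dimensional input; we use 2D data below to illustrate. Note that scipy.stats.norm.cdf is applied elementwise, so each column holds the one-dimensional standard normal CDF of that coordinate, not the joint CDF.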
mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm() - there are multiple distributions scipy supports. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use any distribution to calculate the cdf, you most likely will get incorrect results.
The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF for a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the sum of the masses must be 1, these constraints determine the location and height of each jump in the empirical CDF.
Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies, but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function, and divide by the total sum. The following function returns the values in sorted order and the corresponding cumulative distribution:
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]
To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that jumps occur at the right place. However, you need to force a jump at the smallest data value, so it's necessary to insert an additional element in front of x and y.
import matplotlib.pyplot as plt
def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
Example usages:
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
plot_ecdf(xvec)
import pandas as pd
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
plot_ecdf(df['x'])
with output:
To calculate the CDF for an array of discrete numbers:
import numpy as np
pdf, bin_edges = np.histogram(
data, # array of data
bins=500, # specify the number of bins for distribution function
density=True # True to return probability density function (pdf) instead of count
)
cdf = np.cumsum(pdf*np.diff(bin_edges))
Note that the returned array pdf has length equal to the number of bins (500 here) and bin_edges has length bins+1 (501 here).
So, to calculate the CDF, which is just the area under the PDF curve, we take the cumulative sum of the bin widths (np.diff(bin_edges)) times pdf using NumPy's cumsum function.
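A runnable version of the same idea, with a made-up normal sample standing in for `data`, might look like this:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(size=5000)  # made-up sample for illustration

pdf, bin_edges = np.histogram(data, bins=500, density=True)
cdf = np.cumsum(pdf * np.diff(bin_edges))

# cdf[i] is the estimated probability of a value <= bin_edges[i + 1],
# so the curve can be plotted against bin_edges[1:].
```

The last element of `cdf` is 1 up to floating-point rounding, which is a quick sanity check on the normalization.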
Here's an alternative pandas solution to calculating the empirical CDF, using pd.cut to sort the data into evenly spaced bins first, and then cumsum to compute the distribution.
def empirical_cdf(s: pd.Series, n_bins: int = 100):
    # Sort the data into `n_bins` evenly spaced bins:
    discretized = pd.cut(s, n_bins)
    # Count the number of datapoints in each bin:
    bin_counts = discretized.value_counts().sort_index().reset_index()
    # Calculate the locations of each bin as just the mean of the bin start and end:
    bin_counts["loc"] = (pd.IntervalIndex(bin_counts["index"]).left + pd.IntervalIndex(bin_counts["index"]).right) / 2
    # Compute the CDF with cumsum:
    return bin_counts.set_index("loc").iloc[:, -1].cumsum()
Below is an example use of the function to discretize the distribution of 10000 datapoints into 100 evenly spaced bins:
s = pd.Series(np.random.randn(10000))
cdf = empirical_cdf(s, n_bins=100)
fig, ax = plt.subplots()
ax.scatter(cdf.index, cdf.values)
import random
import numpy as np
import matplotlib.pyplot as plt
def get_discrete_cdf(values):
    values = (values - np.min(values)) / (np.max(values) - np.min(values))
    values_sort = np.sort(values)
    values_sum = np.sum(values)
    values_sums = []
    cur_sum = 0
    for it in values_sort:
        cur_sum += it
        values_sums.append(cur_sum)
    cdf = [values_sums[np.searchsorted(values_sort, it)]/values_sum for it in values]
    return cdf
rand_values = [np.random.normal(loc=0.0) for _ in range(1000)]
_ = plt.hist(rand_values, bins=20)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("nums")
cdf = get_discrete_cdf(rand_values)
x_p = list(zip(rand_values, cdf))
x_p.sort(key=lambda it: it[0])
x = [it[0] for it in x_p]
y = [it[1] for it in x_p]
_ = plt.plot(x, y)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("prob")

Frequency distribution in python - not adding to 100% [duplicate]

Taking a tip from another thread (@EnricoGiampieri's answer to cumulative distribution plots python), I wrote:
# plot cumulative density function of nearest nbr distances
# evaluate the histogram
values, base = np.histogram(nearest, bins=20, density=1)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, label='data')
I put in the density=1 from the documentation on np.histogram, which says:
"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. "
Well, indeed, when plotted, they don't sum to 1. But, I do not understand the "bins of unity width." When I set the bins to 1, of course, I get an empty chart; when I set them to the population size, I don't get a sum to 1 (more like 0.2). When I use the 40 bins suggested, they sum to about .006.
Can anybody give me some guidance? Thanks!
You can simply normalize your values variable yourself like so:
unity_values = values / values.sum()
A full example would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(size=37)
# the old 'normed' argument has been removed from NumPy; 'density' replaces it
density, bins = np.histogram(x, density=True)
unity_density = density / density.sum()
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(8,4))
# draw each bar from its left edge, with a positive width
widths = np.diff(bins)
ax1.bar(bins[:-1], density, width=widths, align='edge')
ax2.bar(bins[:-1], density.cumsum(), width=widths, align='edge')
ax3.bar(bins[:-1], unity_density, width=widths, align='edge')
ax4.bar(bins[:-1], unity_density.cumsum(), width=widths, align='edge')
ax1.set_ylabel('Not normalized')
ax3.set_ylabel('Normalized')
ax3.set_xlabel('PDFs')
ax4.set_xlabel('CDFs')
fig.tight_layout()
You need to make sure your bins are all width 1. That is:
np.all(np.diff(base)==1)
To achieve this, you have to manually specify your bins:
bins = np.arange(np.floor(nearest.min()), np.ceil(nearest.max()) + 1)  # +1 because arange excludes the stop value
values, base = np.histogram(nearest, bins=bins, density=1)
And you get:
In [18]: np.all(np.diff(base)==1)
Out[18]: True
In [19]: np.sum(values)
Out[19]: 0.99999999999999989
Actually, the statement
"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. "
means that the output we get is the probability density function (PDF) evaluated for each bin.
For a PDF, the probability of a value falling between 'a' and 'b' is the area under the PDF curve between 'a' and 'b'.
Therefore, to get the probability for a given bin, we multiply the PDF value of that bin by its bin width. The resulting sequence of probabilities can be used directly to calculate cumulative probabilities, since it is already normalized.
Note that the newly calculated probabilities sum to 1, as the total probability must.
See the code below,
where I have used bins of different widths: some of width 1 and some of width 2.
import numpy as np
import math
rng = np.random.RandomState(10) # deterministic random data
a = np.hstack((rng.normal(size=1000),
rng.normal(loc=5, scale=2, size=1000))) # 'a' is our distribution of data
mini=math.floor(min(a))
maxi=math.ceil(max(a))
print(mini)
print(maxi)
ar1=np.arange(mini,maxi/2)
ar2=np.arange(math.ceil(maxi/2),maxi+2,2)
ar=np.hstack((ar1,ar2))
print(ar) # ar is the array of unequal widths, which is used below to generate the bin_edges
counts, bin_edges = np.histogram(a, bins=ar,
density = True)
print(counts) # the pdf values of the respective bins
print(bin_edges) # the corresponding bin_edges
print(np.sum(counts*np.diff(bin_edges))) # total sum of probabilities, equal to 1
print(np.cumsum(counts*np.diff(bin_edges))) # the cumulative sum; note that the last value is 1
I think the reason the documentation mentions bins of width 1 is that in this case the PDF value and the probability of each bin are equal: the area of a bin is its PDF value multiplied by 1, which is the PDF value itself.
So in this case the PDF values equal the bin probabilities and are therefore already normalized.
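For equal-width bins, the two normalizations discussed in this thread coincide, which the following sketch checks on made-up data. The variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(3)
sample = rng.normal(size=1000)  # made-up data

counts, edges = np.histogram(sample, bins=20)
density, _ = np.histogram(sample, bins=20, density=True)
widths = np.diff(edges)

# Three routes to the per-bin probability:
prob_from_counts = counts / counts.sum()
prob_from_density = density * widths     # valid for any bin widths
prob_rescaled = density / density.sum()  # valid only for equal-width bins
```

With unequal bin widths, only the first two agree; dividing by `density.sum()` still yields values that sum to 1, but they are no longer the bin probabilities.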

