Frequency distribution in python - not adding to 100% [duplicate]

Taking a tip from another thread (@EnricoGiampieri's answer to cumulative distribution plots python), I wrote:
# plot cumulative density function of nearest nbr distances
# evaluate the histogram
values, base = np.histogram(nearest, bins=20, density=1)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, label='data')
I put in the density=1 from the documentation on np.histogram, which says:
"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. "
Well, indeed, when plotted, they don't sum to 1. But, I do not understand the "bins of unity width." When I set the bins to 1, of course, I get an empty chart; when I set them to the population size, I don't get a sum to 1 (more like 0.2). When I use the 40 bins suggested, they sum to about .006.
Can anybody give me some guidance? Thanks!

You can simply normalize your values variable yourself like so:
unity_values = values / values.sum()
A full example would look something like this:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(size=37)
density, bins = np.histogram(x, density=True)
unity_density = density / density.sum()
fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, sharex=True, figsize=(8,4))
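widths = np.diff(bins)  # positive bin widths
# draw each bar from its left edge so it covers exactly its bin
ax1.bar(bins[:-1], density, width=widths, align='edge')
ax2.bar(bins[:-1], density.cumsum(), width=widths, align='edge')
ax3.bar(bins[:-1], unity_density, width=widths, align='edge')
ax4.bar(bins[:-1], unity_density.cumsum(), width=widths, align='edge')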
ax1.set_ylabel('Not normalized')
ax3.set_ylabel('Normalized')
ax3.set_xlabel('PDFs')
ax4.set_xlabel('CDFs')
fig.tight_layout()

You need to make sure your bins are all width 1. That is:
np.all(np.diff(base)==1)
To achieve this, you have to manually specify your bins:
bins = np.arange(np.floor(nearest.min()), np.ceil(nearest.max()) + 1)  # +1 so the last edge covers the maximum
values, base = np.histogram(nearest, bins=bins, density=1)
And you get:
In [18]: np.all(np.diff(base)==1)
Out[18]: True
In [19]: np.sum(values)
Out[19]: 0.99999999999999989
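For equal-width bins that are not of width 1, the density values sum to 1 divided by the bin width, which is one way sums such as 0.2 or .006 can arise. A minimal sketch, with made-up uniform data standing in for the question's `nearest` array (an assumption, since the real data is not shown):

```python
import numpy as np

rng = np.random.default_rng(0)
nearest = rng.uniform(0, 100, size=500)  # hypothetical stand-in for the question's data

values, base = np.histogram(nearest, bins=20, density=True)
width = base[1] - base[0]  # all 20 bins share this width

print(values.sum())          # roughly 1 / width, not 1
print(values.sum() * width)  # 1.0 up to floating-point rounding
```

Multiplying by the width recovers a total of 1 without forcing the bins to unit width.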

The statement
"Note that the sum of the histogram values will not be equal to 1 unless bins of unity width are chosen; it is not a probability mass function. "
means that the values returned are probability densities for the respective bins, not probabilities.
For a PDF, the probability of a value falling between 'a' and 'b' is the area under the PDF curve between 'a' and 'b'.
Therefore, to get the probability for a bin, multiply the density of that bin by its width. The resulting probabilities can be used directly to calculate cumulative probabilities, because they are already normalized.
The sum of these probabilities is 1, as a total probability must be.
See the code below.
Here I have used bins of different widths: some are of width 1 and some are of width 2.
import numpy as np
import math
rng = np.random.RandomState(10) # deterministic random data
a = np.hstack((rng.normal(size=1000),
rng.normal(loc=5, scale=2, size=1000))) # 'a' is our distribution of data
mini=math.floor(min(a))
maxi=math.ceil(max(a))
print(mini)
print(maxi)
ar1=np.arange(mini,maxi/2)
ar2=np.arange(math.ceil(maxi/2),maxi+2,2)
ar=np.hstack((ar1,ar2))
print(ar) # ar is the array of unequal widths, which is used below to generate the bin_edges
counts, bin_edges = np.histogram(a, bins=ar,
density = True)
print(counts) # the density (pdf) value of each bin
print(bin_edges) # the corresponding bin edges
print(np.sum(counts*np.diff(bin_edges))) # total probability, equal to 1
print(np.cumsum(counts*np.diff(bin_edges))) # cumulative probabilities; the last value is 1
I think the documentation mentions bins of width 1 because in that case the density and the probability of each bin are equal: the area of a bin is its density multiplied by 1, which is the density itself.
So in this case the densities are the bin probabilities and already sum to 1.
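Applied to the original question, the same idea gives a cumulative curve that ends at 1 for any bin layout. This is a sketch only: `histogram_cdf` is a hypothetical helper name and the sample is made up.

```python
import numpy as np

def histogram_cdf(data, bins=20):
    """Return right bin edges and cumulative probabilities (hypothetical helper)."""
    density, edges = np.histogram(data, bins=bins, density=True)
    probabilities = density * np.diff(edges)  # area of each bar
    return edges[1:], np.cumsum(probabilities)

rng = np.random.default_rng(1)
sample = rng.normal(size=1000)  # made-up data
right_edges, cdf = histogram_cdf(sample, bins=20)
print(cdf[-1])  # 1.0 up to floating-point rounding
```

The cumulative values are paired with the right edge of each bin, since each one is the probability accumulated up to that edge; plt.plot(right_edges, cdf) would draw the curve.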

Related

Standard deviation of binned values with `scipy.stats.binned_statistic`

When I bin my data according to scipy.stats.binned_statistic (see here for example), how do I get the error (that is, the standard deviation) on the average binned values?
For example, if I bin my data as follows:
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats
windspeed = 8 * np.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * np.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
    boatspeed, statistic='median', bins=[1,2,3,4,5,6,7])
plt.figure()
plt.plot(windspeed, boatspeed, 'b.', label='raw data')
plt.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
    label='binned statistic of data')
plt.legend()
how do I get the standard deviation on the bin_means?
The way to go about this is to construct a probability density estimate from the histogram (this is just a question of normalizing the histogram appropriately), and then compute the standard deviation or any other statistic for the estimated density.
The appropriate normalization is whatever is needed to get the area under the histogram to be 1. As for computing statistics for the density estimate, work from the definition of the statistic as integral(p(x)*f(x), x, -infinity, +infinity), substituting the density estimate for p(x) and whatever is needed for f(x), e.g. x and x^2 to get the first and second moments, from which you calculate the variance and then the standard deviation.
I'll post some formulas tomorrow, or maybe someone else wants to give it a try in the meantime. You might be able to look up some formulas, but my advice is to always try to work out the answer before resorting to looking it up.
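The recipe above can be sketched by treating each bin's center as the value of x inside that bin. That midpoint approximation is an assumption of this sketch (the answer does not say how to evaluate the integrals), and `histogram_mean_std` is a hypothetical helper name.

```python
import numpy as np

def histogram_mean_std(data, bins=30):
    """Mean and standard deviation of a histogram density estimate (hypothetical helper)."""
    density, edges = np.histogram(data, bins=bins, density=True)
    widths = np.diff(edges)
    centers = edges[:-1] + widths / 2
    p = density * widths               # probability mass of each bin, sums to 1
    mean = np.sum(p * centers)         # first moment
    second = np.sum(p * centers ** 2)  # second moment
    return mean, np.sqrt(second - mean ** 2)

rng = np.random.default_rng(2)
sample = rng.normal(loc=3.0, scale=2.0, size=5000)  # made-up data
mean, std = histogram_mean_std(sample)
print(mean, std)  # close to np.mean(sample) and np.std(sample), up to binning error
```

The result differs slightly from the statistics of the raw sample because all points in a bin are treated as sitting at its center.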
Maybe I'm a bit late to answer, but I was wondering how to do the same thing and came across this question. I think calculating it with stats.binned_statistic_2d should be possible, but I haven't figured it out yet. For now I calculated it manually, like so (note that in my code I use a fixed number of equally spaced bins):
import numpy
from matplotlib import pyplot
from scipy import stats
windspeed = 8 * numpy.random.rand(500)
boatspeed = .3 * windspeed**.5 + .2 * numpy.random.rand(500)
bin_means, bin_edges, binnumber = stats.binned_statistic(windspeed,
    boatspeed, statistic='median', bins=10)
stds = []
# Match each value to the bin number it belongs to.
# A list is needed here: in Python 3 a bare zip is exhausted after one pass.
pairs = list(zip(boatspeed, binnumber))
# Calculate stdev for all elements inside each bin
for n in sorted(set(binnumber)): # Iterate over each occupied bin, in order
    in_bin = [x for x, nbin in pairs if nbin == n] # Get all elements inside bin n
    stds.append(numpy.std(in_bin))
# Calculate the locations of the bins' centers, for plotting
bin_centers = []
for i in range(len(bin_edges) - 1):
    center = bin_edges[i] + (float(bin_edges[i + 1]) - float(bin_edges[i]))/2.
    bin_centers.append(center)
# Plot means
pyplot.figure()
pyplot.hlines(bin_means, bin_edges[:-1], bin_edges[1:], colors='g', lw=5,
    label='binned statistic of data')
# Plot stdev as vertical lines, probably can also be done with errorbar
pyplot.vlines(bin_centers, bin_means - stds, bin_means + stds)
pyplot.legend()
pyplot.show()
Resulting plot (minus the data points):
You have to be careful with the bins. In the code I'm working on using this, one of the bins has no points and I have to adjust my calculations of the stdev accordingly.
Just change the statistic in this line to 'std':
bin_std, bin_edges, binnumber = stats.binned_statistic(windspeed,
    boatspeed, statistic='std', bins=[1,2,3,4,5,6,7])
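The caution about empty bins can be made concrete with a NumPy-only sketch that leaves NaN wherever a bin has no points. `binned_std` is a hypothetical helper, not part of scipy, and unlike binned_statistic it treats every bin as half-open, so a value equal to the last edge is not counted.

```python
import numpy as np

def binned_std(x, y, edges):
    """Std of y in each bin of x; NaN for bins with no points (hypothetical helper)."""
    edges = np.asarray(edges)
    idx = np.digitize(x, edges) - 1  # bin i holds edges[i] <= x < edges[i + 1]
    result = np.full(len(edges) - 1, np.nan)
    for i in range(len(edges) - 1):
        in_bin = y[idx == i]
        if in_bin.size > 0:
            result[i] = in_bin.std()
    return result

x = np.array([1.2, 1.8, 3.5, 3.6, 3.9])
y = np.array([1.0, 3.0, 2.0, 2.0, 2.0])
result = binned_std(x, y, [1, 2, 3, 4])
print(result)  # bin [2, 3) is empty and stays NaN
```

Keeping one entry per bin, even when it is NaN, keeps the result aligned with the bin edges for plotting.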

Cumulative Distribution Function from arbitrary Probability Distribution Function

I'm trying to plot a Probability Distribution Function for a given set of data from a csv file
import numpy as np
import math
import matplotlib.pyplot as plt
data=np.loadtxt('data.csv',delimiter=',',skiprows=1)
x_value1= data[:,1]
x_value2= data[:,2]
weight1= data[:,3]
weight2= data[:,4]
where weight1 is an array of data that represents the weight for data in x_value1 and weight2 represents the same for x_value2. I produce a histogram where I put the weights in the parameter
plt.hist(x_value1, bins=40, color='r', density=True, weights=weight1, alpha=0.8, label='x_value1')
plt.hist(x_value2, bins=40, color='b', density=True, weights=weight2, alpha=0.6, label='x_value2')
My problem now is converting this PDF to CDF. I read from one of the posts here that you can use numpy.cumsum() to convert a set of data to CDF, so I tried it together with np.histogram()
values1,base1= np.histogram(x_value1, bins=40)
values2,base2= np.histogram(x_value2, bins=40)
cumulative1=np.cumsum(values1)
cumulative2=np.cumsum(values2)
plt.plot(base1[:-1],cumulative1,c='red',label='x_value1')
plt.plot(base2[:-1],cumulative2,c='blue',label='x_value2')
plt.title("CDF for x_value1 and x_value2")
plt.xlabel("x")
plt.ylabel("y")
plt.show()
I don't know if this plot is right because I didn't include the weights (weight1 and weight2) while doing the CDF. How can I include the weights while plotting the CDF?
If I understand your data correctly, you have a number of samples which have some weight associated with them. Maybe what you want is the empirical CDF of the sample.
The samples are in vector x and weights in vector w. Let us first construct an Nx2 array of them:
arr = np.column_stack((x,w))
Then we will sort this array by the samples:
arr = arr[arr[:,0].argsort()]
This sorting may look a bit odd, but argsort gives the sorted order (0 for the smallest, 1 for the second smallest, etc.). When the two-column array is indexed by this result, the rows are arranged so that the first column is ascending. (Using only sort with axis=0 does not work, as it sorts both columns independently.)
Now we can create the cumulative fraction by taking the cumulative sum of weights:
cum = np.cumsum(arr[:,1])
This must be normalized so that the full scale is 1.
cum /= cum[-1]
Now we can plot the cumulative distribution:
plt.plot(arr[:,0], cum)
Now the X axis is the input value and the Y axis is the weighted fraction of samples at or below each level.
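The steps above can be collected into one function. This is a minimal sketch; `weighted_ecdf` is a hypothetical name and the three-point input is made up to keep the result easy to check by hand.

```python
import numpy as np

def weighted_ecdf(x, w):
    """Sorted samples and the cumulative fraction of total weight (hypothetical helper)."""
    x = np.asarray(x, dtype=float)
    w = np.asarray(w, dtype=float)
    order = np.argsort(x)      # indices that sort the samples
    cum = np.cumsum(w[order])  # running total of weights in sample order
    return x[order], cum / cum[-1]

xs, cdf = weighted_ecdf([3, 1, 2], [1, 2, 1])
print(xs)   # samples in ascending order: 1, 2, 3
print(cdf)  # cumulative weight fractions: 0.5, 0.75, 1.0
```

Sorting the weights with the same index array avoids building the intermediate two-column array.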

Calculate the Cumulative Distribution Function (CDF) in Python

How can I calculate in python the Cumulative Distribution Function (CDF)?
I want to calculate it from an array of points I have (discrete distribution), not with the continuous distributions that, for example, scipy has.
(It is possible that my interpretation of the question is wrong. If the question is how to get from a discrete PDF into a discrete CDF, then np.cumsum divided by a suitable constant will do if the samples are equispaced. If the array is not equispaced, then np.cumsum of the array multiplied by the distances between the points will do.)
If you have a discrete array of samples and would like to know the CDF of the sample, you can just sort the array. In the sorted result, the smallest value represents 0% and the largest value represents 100%. If you want the value at 50% of the distribution, look at the element in the middle of the sorted array.
Let us have a closer look at this with a simple example:
import matplotlib.pyplot as plt
import numpy as np
# create some randomly distributed data:
data = np.random.randn(10000)
# sort the data:
data_sorted = np.sort(data)
# calculate the proportional values of samples
p = 1. * np.arange(len(data)) / (len(data) - 1)
# plot the sorted data:
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax1.plot(p, data_sorted)
ax1.set_xlabel('$p$')
ax1.set_ylabel('$x$')
ax2 = fig.add_subplot(122)
ax2.plot(data_sorted, p)
ax2.set_xlabel('$x$')
ax2.set_ylabel('$p$')
This gives a plot where the right-hand panel is the traditional cumulative distribution function. It approximates the CDF of the process behind the points, but it is not exact as long as the number of points is finite.
This function is easy to invert, and it depends on your application which form you need.
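One way to use both forms is linear interpolation over the sorted sample. A sketch under that assumption: `cdf_at` and `quantile_at` are hypothetical helper names, and np.interp is only one of several reasonable choices.

```python
import numpy as np

rng = np.random.default_rng(3)
data_sorted = np.sort(rng.normal(size=10000))  # made-up sample
p = np.arange(len(data_sorted)) / (len(data_sorted) - 1)

def cdf_at(x):
    """Fraction of the sample below x, by linear interpolation."""
    return np.interp(x, data_sorted, p)

def quantile_at(q):
    """Inverse of cdf_at: the value below which a fraction q of the sample lies."""
    return np.interp(q, p, data_sorted)

print(cdf_at(0.0))       # near 0.5 for standard normal samples
print(quantile_at(0.5))  # near 0.0, the sample median
```

Swapping the two arrays in np.interp is all it takes to go from the CDF to its inverse, because both are increasing.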
Assuming you know how your data is distributed (i.e. you know the pdf of your data), then scipy does support discrete data when calculating cdf's
import numpy as np
import scipy.stats
import matplotlib.pyplot as plt
import seaborn as sns
x = np.random.randn(10000) # generate samples from normal distribution (discrete data)
norm_cdf = scipy.stats.norm.cdf(x) # calculate the cdf - also discrete
# plot the cdf
sns.lineplot(x=x, y=norm_cdf)
plt.show()
We can even print the first few values of the cdf to show they are discrete
print(norm_cdf[:10])
>>> array([0.39216484, 0.09554546, 0.71268696, 0.5007396 , 0.76484329,
0.37920836, 0.86010018, 0.9191937 , 0.46374527, 0.4576634 ])
The same call also works elementwise on multi-dimensional arrays. With the 2d data below it returns the standard normal CDF of each coordinate separately (the marginals), not the joint CDF:
mu = np.zeros(2) # mean vector
cov = np.array([[1,0.6],[0.6,1]]) # covariance matrix
# generate 2d normally distributed samples using 0 mean and the covariance matrix above
x = np.random.multivariate_normal(mean=mu, cov=cov, size=1000) # 1000 samples
norm_cdf = scipy.stats.norm.cdf(x)
print(norm_cdf.shape)
>>> (1000, 2)
In the above examples, I had prior knowledge that my data was normally distributed, which is why I used scipy.stats.norm() - there are multiple distributions scipy supports. But again, you need to know how your data is distributed beforehand to use such functions. If you don't know how your data is distributed and you just use any distribution to calculate the cdf, you most likely will get incorrect results.
The empirical cumulative distribution function is a CDF that jumps exactly at the values in your data set. It is the CDF for a discrete distribution that places a mass at each of your values, where the mass is proportional to the frequency of the value. Since the sum of the masses must be 1, these constraints determine the location and height of each jump in the empirical CDF.
Given an array a of values, you compute the empirical CDF by first obtaining the frequencies of the values. The numpy function unique() is helpful here because it returns not only the frequencies, but also the values in sorted order. To calculate the cumulative distribution, use the cumsum() function, and divide by the total sum. The following function returns the values in sorted order and the corresponding cumulative distribution:
import numpy as np
def ecdf(a):
    x, counts = np.unique(a, return_counts=True)
    cusum = np.cumsum(counts)
    return x, cusum / cusum[-1]
To plot the empirical CDF you can use matplotlib's plot() function. The option drawstyle='steps-post' ensures that jumps occur at the right place. However, you need to force a jump at the smallest data value, so it's necessary to insert an additional element in front of x and y.
import matplotlib.pyplot as plt
def plot_ecdf(a):
    x, y = ecdf(a)
    x = np.insert(x, 0, x[0])
    y = np.insert(y, 0, 0.)
    plt.plot(x, y, drawstyle='steps-post')
    plt.grid(True)
    plt.savefig('ecdf.png')
Example usages:
import pandas as pd
xvec = np.array([7,1,2,2,7,4,4,4,5.5,7])
plot_ecdf(xvec)
df = pd.DataFrame({'x':[7,1,2,2,7,4,4,4,5.5,7]})
plot_ecdf(df['x'])
with output:
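To read the empirical CDF at arbitrary points rather than only at the data values, a sorted copy and np.searchsorted are enough. `ecdf_at` is a hypothetical helper added here for illustration; it is not part of the answer above.

```python
import numpy as np

def ecdf_at(a, points):
    """Fraction of values in a that are <= each query point (hypothetical helper)."""
    a_sorted = np.sort(np.asarray(a))
    # side='right' counts values equal to the query point as well
    counts = np.searchsorted(a_sorted, points, side='right')
    return counts / a_sorted.size

xvec = np.array([7, 1, 2, 2, 7, 4, 4, 4, 5.5, 7])
fractions = ecdf_at(xvec, [0, 4, 6, 7])
print(fractions)  # fractions 0.0, 0.6, 0.7, 1.0
```

Between two data values the result stays constant, matching the step shape of the plot.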
For calculating the CDF of an array of discrete numbers:
import numpy as np
pdf, bin_edges = np.histogram(
    data, # array of data
    bins=500, # specify the number of bins for distribution function
    density=True # True to return probability density function (pdf) instead of count
)
cdf = np.cumsum(pdf*np.diff(bin_edges))
Note that the returned array pdf has length equal to the number of bins (500 here) and bin_edges has one more element (501 here).
So, to calculate the CDF, which is the area under the PDF curve, we take the cumulative sum of the bin widths (np.diff(bin_edges)) times pdf using NumPy's cumsum function.
Here's an alternative pandas solution to calculating the empirical CDF, using pd.cut to sort the data into evenly spaced bins first, and then cumsum to compute the distribution.
def empirical_cdf(s: pd.Series, n_bins: int = 100):
    # Sort the data into `n_bins` evenly spaced bins:
    discretized = pd.cut(s, n_bins)
    # Count the number of datapoints in each bin:
    bin_counts = discretized.value_counts().sort_index().reset_index()
    # Calculate the locations of each bin as just the mean of the bin start and end:
    bin_counts["loc"] = (pd.IntervalIndex(bin_counts["index"]).left + pd.IntervalIndex(bin_counts["index"]).right) / 2
    # Compute the CDF with cumsum, dividing by the total so that it ends at 1:
    counts = bin_counts.set_index("loc").iloc[:, -1]
    return counts.cumsum() / counts.sum()
Below is an example use of the function to discretize the distribution of 10000 datapoints into 100 evenly spaced bins:
s = pd.Series(np.random.randn(10000))
cdf = empirical_cdf(s, n_bins=100)
fig, ax = plt.subplots()
ax.scatter(cdf.index, cdf.values)
import random
import numpy as np
import matplotlib.pyplot as plt
def get_discrete_cdf(values):
    values = (values - np.min(values)) / (np.max(values) - np.min(values))
    values_sort = np.sort(values)
    values_sum = np.sum(values)
    values_sums = []
    cur_sum = 0
    for it in values_sort:
        cur_sum += it
        values_sums.append(cur_sum)
    cdf = [values_sums[np.searchsorted(values_sort, it)]/values_sum for it in values]
    return cdf
rand_values = [np.random.normal(loc=0.0) for _ in range(1000)]
_ = plt.hist(rand_values, bins=20)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("nums")
cdf = get_discrete_cdf(rand_values)
x_p = list(zip(rand_values, cdf))
x_p.sort(key=lambda it: it[0])
x = [it[0] for it in x_p]
y = [it[1] for it in x_p]
_ = plt.plot(x, y)
_ = plt.xlabel("rand_values")
_ = plt.ylabel("prob")

Hist in matplotlib: Bins are not centered and proportions not correct on the axis

Take a look at this example:
import matplotlib.pyplot as plt
l = [3,3,3,2,1,4,4,5,5,5,5,5,5,5,5,5]
plt.hist(l, density=True)
plt.show()
The output is posted as a picture. I have two questions:
a) Why are only the 4 and 5 bins centered around their values? Shouldn't the others be centered as well? Is there a trick to get them centered?
b) Why are the bins not normalised to proportions? I want the y values of all the bins to sum up to one.
Note that my real example contains much more values in the list, but they are all discrete.
You should adjust the keyword arguments of the plt.hist function. There are many of them, and the documentation can help you answer many of these questions.
a.) You can pass the keywords bins=range(1,7) and align='left'. Setting the bins keyword to a sequence gives the borders of each bin. For example, [1,2], [2,3], [3,4], ..., [5, 6].
b.) Check your bin widths. From the matplotlib.pyplot.hist documentation:
If True, the first element of the return tuple will be the counts
normalized to form a probability density, i.e., n/(len(x)*dbin). In a
probability density, the integral of the histogram should be 1; you
can verify that with a trapezoidal integration of the probability
density function:
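This means that the area under your bins sums to one. With the default 10 bins over the range 1 to 5, each bin has width 0.4, so the heights do not add up to 1. With bins=range(1,7) every bin has width 1, so the heights are the proportions and sum to 1. (rwidth only sets the drawn bar width relative to the bin width and does not affect the normalization.) You then get a good looking plot:
plt.hist(l, bins=range(1,7), align='left', rwidth=1, density=True)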
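For purely discrete data like the list in the question, the proportions can also be computed directly, without relying on bin placement. This is a sketch of an alternative, not what the answer above does.

```python
import numpy as np

l = [3,3,3,2,1,4,4,5,5,5,5,5,5,5,5,5]
values, counts = np.unique(l, return_counts=True)
proportions = counts / counts.sum()  # fraction of the list taken by each value

for v, p in zip(values, proportions):
    print(v, p)
```

Passing values and proportions to plt.bar would then draw one bar centered on each distinct value, with heights that sum to 1.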
