How does matplotlib calculate the density for a histogram? - python

Reading through the matplotlib plt.hist documentation, there is a density parameter that can be set to True. The documentation says:
density : bool, optional
If ``True``, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
the area (or integral) under the histogram will sum to 1.
This is achieved by dividing the count by the number of
observations times the bin width and not dividing by the total
number of observations. If *stacked* is also ``True``, the sum of
the histograms is normalized to 1.
The line that confuses me is: "This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations."
I tried replicating this with the sample data.
**Using matplotlib's built-in calculation:**
import numpy as np
import pandas as pd

ser = pd.Series(np.random.normal(size=1000))
ser.hist(density=True, bins=100)
**Manual calculation of the density:**
import matplotlib.pyplot as plt

arr_hist, edges = np.histogram(ser, bins=100)
samp = arr_hist / ser.shape[0] * np.diff(edges)
plt.bar(edges[0:-1], samp)
plt.grid()
The two plots have completely different y-axis scales. Could someone point out what exactly is going wrong and how to replicate the density calculation manually?

That is an ambiguity in the language. The sentence
This is achieved by dividing the count by the number of observations times the bin width
needs to be read as
This is achieved by dividing (the count) by (the number of observations times the bin width)
i.e.
count / (number of observations * bin width)
Complete code:
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(size=1000)
fig, (ax1, ax2) = plt.subplots(2)
ax1.hist(arr, density=True, bins=100)
ax1.grid()
arr_hist, edges = np.histogram(arr, bins=100)
samp = arr_hist / (arr.shape[0] * np.diff(edges))
ax2.bar(edges[0:-1], samp, width=np.diff(edges))
ax2.grid()
plt.show()
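As a quick sanity check (reusing arr, samp and edges from the code above), the areas of the manually normalized bars sum to one, just as they do for the density=True histogram:
print(np.sum(samp * np.diff(edges)))   # -> 1.0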

Related

Cannot understand matplotlib pyplot histogram

I am just learning some basics of Data Analysis.
I have a simple csv data file like the one below.
START,FIRST,SECOND,ITEM
1,100,200,A
2,100,200,B
2,100,300,C
2,200,300,D
3,200,100,E
3,200,100,F
3,200,100,G
3,200,100,H
3,200,100,I
3,200,100,J
I wrote this small program to read this csv file and then print a histogram using matplotlib for the three columns START, FIRST, and SECOND. I also print a scatter plot for FIRST vs SECOND columns.
#!/exp/anaconda3/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
file_name = 'junk.csv'
data = pd.read_csv(file_name)
print(data.describe())
plt.rcParams['axes.grid'] = True
fig, axs = plt.subplots(2, 2, figsize=(15, 10))
axs[0, 0].hist(data['START'], 100, density=True, facecolor='g', alpha=0.8)
axs[1, 0].scatter(data['FIRST'], data['SECOND'], facecolor='violet')
axs[0, 1].hist(data['FIRST'], 100, density=True, facecolor='r', alpha=0.8)
axs[1, 1].hist(data['SECOND'], 100, density=True, facecolor='b', alpha=0.8)
plt.show()
What I do not understand is, in the histogram plots, for example the bottom-right image with the blue bars in the attached picture: why does it not simply plot how many times the number 200 occurs, instead of showing that 200 occurs 0.10 times? How is that possible? The same goes for 300.
Can someone help me understand what these Y-axis values mean and how matplotlib comes up with them? They do not make sense to me.
Thank you.
Ruby Drew
Try density=False. The density parameter tells matplotlib whether to normalise the bar heights so that they represent a probability density rather than raw counts.
First note that a histogram is primarily meant to count continuous samples in small bins. For discrete data, the bins should be carefully chosen so that the boundaries fall nicely in-between the values. When you add bins=N, matplotlib assumes a continuous distribution and subdivides the space from the smallest to the largest sample into N equally-sized bins. For discrete data this can have unexpected side effects, such as samples that sit exactly on a bin boundary ending up in either of the two adjacent bins.
With density=True, the heights of the bars are recalculated such that the total area of all bins sums to 1. For a continuous distribution with many samples, this resembles the probability density function and makes it possible to draw a kde plot with the same y-axis.
So, what's happening in the blue histogram:
100 bins are created between 100 and 300. Each bin will be 2 wide.
3 bins get values: the bin 100-102 gets a count of 6, either the bin 198-200 or the bin 200-202 gets a count of 2, and the bin 298-300 also gets a count of 2.
The total count is 10. As the bins are 2 wide, each count needs to be divided by (total_count * bin_width), i.e. by 20, to obtain a total area of 1.
Indeed, the sum of height times width of the bars is 1: 0.3*2 + 0.1*2 + 0.1*2 = 1.
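To see the same numbers outside of matplotlib, here is a small check (a sketch with the SECOND column typed in by hand) that reproduces the density calculation with np.histogram:
import numpy as np

second = np.array([200, 200, 300, 300, 100, 100, 100, 100, 100, 100])
counts, edges = np.histogram(second, bins=100)        # 100 bins of width 2 between 100 and 300
density = counts / (len(second) * np.diff(edges))     # same formula matplotlib uses for density=True
print(density.max())                                  # 0.3, the bar holding the six 100s
print(np.sum(density * np.diff(edges)))               # 1.0, the total area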
Seaborn's histplot (since version 0.11) has a discrete= parameter to indicate that a distribution is discrete, and a stat= parameter where you can choose between 'count' for bin heights indicating the usual counts and 'probability' for heights relative to their probability, mimicking a probability mass function. The blue histogram could be drawn as:
import seaborn as sns
sns.histplot(data, x='SECOND', discrete=True, stat='probability', facecolor='b', alpha=0.8, ax=axs[1, 1])

Scipy.stats.gaussian_kde gives a pdf that is outside the range (0,1) [duplicate]

Sometimes when I create a histogram, using say seaborn's distplot function with norm_hist=True, the y-axis is less than 1, as I expect for a PDF. Other times it takes on values greater than one.
For example if I run
sns.set();
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
The rule isn't that all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, their heights can sum to something quite large even though their areas sum to one. The height of a bar times its width is the probability that a value falls in that range. For the height to be equal to the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot on the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is about 40*0.001 = 0.04.
The plot on the right uses exactly the same data, but measures in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is again about 0.04, because the bin width is 1.
PS: As an example of a distribution for which the probability density function exceeds 1 over part of its range, see the Pareto distribution with α = 3.
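To make that concrete, here is a quick check with scipy (assuming scipy is available; the shape parameter b plays the role of α):
from scipy.stats import pareto

# Pareto with shape b=3 and scale 1: pdf(x) = 3 / x**4 for x >= 1
print(pareto.pdf(1.0, b=3))   # 3.0 -- a perfectly valid density value above 1
print(pareto.pdf(1.5, b=3))   # ~0.59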

Get bin width used for seaborn plot

How do I find out what bin width was used when doing a distplot in Seaborn? I have two datasets whose bin widths I would like to match, but I don't know how to retrieve the default value used for the first dataset. For something like the simple example below, how would I find out the bin width used?
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

f, axs = plt.subplots(1, 1)
distribution = np.random.rand(1000)
sns.distplot(distribution, hist=True, kde_kws={"shade": True}, ax=axs)
Seaborn uses the Freedman-Diaconis rule to calculate the bin width if the bins parameter is not specified in seaborn.distplot().
The equation is as follows (from Wikipedia): bin width = 2 * IQR(x) / n^(1/3), where n is the number of observations.
We can calculate IQR and the cube-root of n with the following code.
Q1 = np.quantile(distribution, 0.25)
Q3 = np.quantile(distribution, 0.75)
IQR = Q3 - Q1
cube = np.cbrt(len(distribution))
The bin width is:
In[] : 2*IQR/cube
Out[]: 0.10163947994817446
Finally, we can now calculate the number of bins.
In[] : 1/(2*IQR/cube) # '1' is the range of the array for this example
Out[]: 9.838696543015526
When we round up the result, it amounts to 10. That's our number of bins. We can now pass the bins parameter to get the same number of bins (or the same bin width for the same range).
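If you'd rather not do the arithmetic by hand, numpy exposes the same Freedman-Diaconis estimator; a quick sketch (note that seaborn's exact internals may differ between versions):
import numpy as np

distribution = np.random.rand(1000)

# 'fd' applies the Freedman-Diaconis rule: width = 2 * IQR / cbrt(n)
edges = np.histogram_bin_edges(distribution, bins='fd')
print(np.diff(edges)[0])   # bin width, close to the hand-computed value above
print(len(edges) - 1)      # number of bins, ~10 for this example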
Graph w/o specifying bins:
f, axs = plt.subplots(1,1)
distribution=np.random.rand(1000)
sns.distplot(distribution, hist=True, kde_kws={"shade": True}, ax=axs)
Graph w/ specifying the parameter bins=10:
f, axs = plt.subplots(1,1)
sns.distplot(distribution, bins=10, hist=True, kde_kws={"shade": True}, ax=axs)
Update:
The seaborn 0.9 documentation mentioned the Freedman-Diaconis rule as the way the bin size is calculated:
Specification of hist bins, or None to use Freedman-Diaconis rule.
The description changed in version 0.10 as follows:
Specification of hist bins. If unspecified, as reference rule is used that tries to find a useful default.

making the y-axis of a histogram probability, python

I have plotted a histogram in Python using matplotlib, and I need the y-axis to be the probability; I cannot find how to do this. For example, I want it to look similar to this: http://www.mathamazement.com/images/Pre-Calculus/10_Sequences-Series-and-Summation-Notation/10_07_Probability/10-coin-toss-histogram.JPG
Here is my code; I will attach my plot as well if needed.
plt.figure(figsize=(10,10))
mu = np.mean(a) #mean of distribution
sigma = np.std(a) # standard deviation of distribution
n, bins, patches = plt.hist(a, bin, normed=True, facecolor='white')
y = mlab.normpdf(bins, mu, sigma)
plt.plot(bins, y, 'r--')
print np.sum(n*np.diff(bins))  # check that the integral over the bars is unity
plt.show()
Just divide all your sample counts by the total number of samples. This gives the probability rather than the count.
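If you'd rather let plt.hist do that division for you, the weights argument works too; a minimal sketch (assuming a binomial sample like the coin-toss plot you linked to, with one bin per integer):
import numpy as np
import matplotlib.pyplot as plt

a = np.random.binomial(10, 0.5, size=10000)

# each sample contributes 1/N to its bin, so bar heights are per-bin probabilities
plt.hist(a, bins=np.arange(-0.5, 11.5, 1), weights=np.ones_like(a) / len(a),
         facecolor='white', edgecolor='k')
plt.ylabel("Probability")
plt.show()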
As @SteveBarnes points out, divide the sample counts by the total number of samples to get the probabilities for each bin. To get a plot like the one you linked to, your "bins" should just be the integers from 0 to 10. A simple way to compute the histogram for a sample from a discrete distribution is np.bincount.
Here's a snippet that creates a plot like the one you linked to:
import numpy as np
import matplotlib.pyplot as plt
n = 10
num_samples = 10000
# Generate a random sample.
a = np.random.binomial(n, 0.5, size=num_samples)
# Count the occurrences in the sample.
b = np.bincount(a, minlength=n+1)
# p is the array of probabilities.
p = b / float(b.sum())
plt.bar(np.arange(len(b)) - 0.5, p, width=1, facecolor='white')
plt.xlim(-0.5, n + 0.5)
plt.xlabel("Number of heads (k)")
plt.ylabel("P(k)")
plt.show()

matplotlib reducing y axis by a factor to represent percent frequency

So I have 3 lists of fractions and I used a histogram to show how often each fraction showed up. The problem is that there are 100000 values in each, and I need to reduce the y values by that much to get a frequency percentage. Here is my code now:
bins = numpy.linspace(0, 1, 50)
z = np.linspace(0,1,50)
g = (lambda z: 2 * np.exp((-2)*(z**2)*(1000000000)))
w = g(z)
plt.plot(z,w)
pyplot.hist(Vrand, bins, alpha=0.5)
pyplot.hist(Vfirst, bins, alpha=0.5)
pyplot.hist(Vmin, bins, alpha=0.2)
pyplot.show()
It is the last chunk of code where I need the y-axis divided by 100000.
Update:
When I try to divide by 100000 using np.histogram, all the values become 0 and only the line plot above still shows:
bins = numpy.linspace(0, 1, 50)
z = np.linspace(0,1,50)
g = (lambda z: 2 * np.exp((-2)*(z**2)*(100000)))
w = g(z)
plt.plot(z,w)
hist, bins = np.histogram(Vrand, bins)
hist /= 100000.0
widths = np.diff(bins)
pyplot.bar(bins[:-1], hist, widths)
matplotlib's hist has a "normed" parameter that you can use to normalise the histogram so that the total area of the bars is 1:
pyplot.hist(Vrand, bins, normed=1)
or use the weights parameter to scale it by a different coefficient.
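For example, a sketch of the weights approach (using random data as a stand-in for Vrand, which the question says has 100000 entries):
import numpy as np
from matplotlib import pyplot

Vrand = np.random.rand(100000)              # stand-in for the question's data
bins = np.linspace(0, 1, 50)

# each sample contributes 1/100000, so the bar heights are fractions of the total
pyplot.hist(Vrand, bins, weights=np.ones(len(Vrand)) / 100000.0, alpha=0.5)
pyplot.show()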
You can also use the return value of numpy's histogram and scale it however you want (tested in Python 3.x):
hist, bins = np.histogram(Vrand, bins)
hist /= 100000.0
widths = np.diff(bins)
pyplot.bar(bins[:-1], hist, widths)
The first two solutions are, in my opinion, better, as we should not "reinvent the wheel" and implement by hand what is already done in the library.
Firstly, I would recommend that you think about your style: use either plt or pyplot, not both, and include your imports and some fake data in the example code to illustrate the problem.
So, the issue is that in the following example the counts are very large:
bins = np.linspace(0, 1, 50)
data = np.random.normal(0.5, 0.1, size=100000)
plt.hist(data, bins)
plt.show()
You tried to fix this by dividing the bin counts by an integer:
hist, bins = np.histogram(data, bins)
hist_divided = hist/100000
The issue here is that hist is an array of ints and, in Python 2 (and with integer arrays in older numpy), dividing integers performs floor division. For example:
>>> 2/3
0
>>> 3/2
1
This is what gives you a row of 0's if you pick too large a value to divide by. Instead you can divide by a float, as suggested by @lejlot; notice you need to divide by 100000.0 and not 100000.
Or, as @lejlot also suggested, just use the normed argument in the call to hist. This rescales the values in hist so that the total area under the histogram is 1, which is very useful when comparing distributions.
I also notice you appear to be having this issue because you're plotting a line plot on the same axes as the histogram; if this line plot is outside of the [0, 1] range you will again encounter the same issue. Instead of rescaling the histogram, you should twin the x-axis so the line plot gets its own y-scale, as sketched below.
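A sketch of that twin-axes idea (again with random stand-in data; ax.twinx() gives the line its own y-scale while sharing the x-axis):
import numpy as np
import matplotlib.pyplot as plt

Vrand = np.random.rand(100000)
bins = np.linspace(0, 1, 50)
z = np.linspace(0, 1, 50)
w = 2 * np.exp(-2 * z**2 * 100000)

fig, ax = plt.subplots()
ax.hist(Vrand, bins, alpha=0.5)    # raw counts on the left y-axis
ax2 = ax.twinx()                   # second y-axis sharing the same x-axis
ax2.plot(z, w, 'r')                # the bound gets its own scale on the right
plt.show()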
