Cannot understand matplotlib pyplot histogram - python

I am just learning some basics of Data Analysis.
I have a simple csv data file like the one below.
START,FIRST,SECOND,ITEM
1,100,200,A
2,100,200,B
2,100,300,C
2,200,300,D
3,200,100,E
3,200,100,F
3,200,100,G
3,200,100,H
3,200,100,I
3,200,100,J
I wrote this small program to read this csv file and then print a histogram using matplotlib for the three columns START, FIRST, and SECOND. I also print a scatter plot for FIRST vs SECOND columns.
#!/exp/anaconda3/bin/python3
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
file_name = 'junk.csv'
data = pd.read_csv(file_name)
print(data.describe())
plt.rcParams['axes.grid'] = True
fix, axs = plt.subplots(2,2, figsize=(15,10))
axs[0, 0].hist(data['START'], 100, density=True, facecolor='g', alpha=0.8)
axs[1, 0].scatter(data['FIRST'], data['SECOND'], facecolor='violet')
axs[0, 1].hist(data['FIRST'], 100, density=True, facecolor='r', alpha=0.8)
axs[1, 1].hist(data['SECOND'], 100, density=True, facecolor='b', alpha=0.8)
plt.show()
What I do not understand is in the histogram plots, for example, the bottom right hand image with blue bars in the attached picture, why does it not simply plot how many times the number 200 is occurring instead of showing that 200 occurs 0.10 times. How is that possible? Same goes for the 300 as well.
Can someone help me understand what and how matplot is coming up with the Y-axis values? These values do not make sense to me.
Thank you.
Ruby Drew

Try density = False. The density parameter tells matplotlib whether you want it to normalise the heights or not such that it represents a probability density.

First note that a histogram is primarily meant to count continuous samples in small bins. For discrete data, the bins should be carefully chosen to have boundaries nicely in-between the values. When you add bins=N, matplotlib supposes a continuous distribution and subdivides the space from smallest to largest sample into N equally-sized bins. For discrete data this can have unexpected side effects, such as putting samples in either one bin for the values on the bin boundaries.
With density=True, the heights of the bars is recalculated such that the total area of all bins sums to 1. For a continuous distribution with many samples, this resembles the probability density function and can be used to draw a kde plot with the same y-axis.
So, what's happening in the blue histogram:
100 bins are created between 100 and 300. Each bin will be 2 wide.
3 bins get values: the bin 100-102 gets count 3, either the bin 198-200 or the bin 200-202 get a count of 1, the bin 298-300 also gets a count of one.
The total height of the bins is now 5. As their width is 2, the histogram counts need to be divided by (total_height * bin_width) to obtain a total area of 1.
Clearly, the sum of height times width of the bars sums to 1: 0.3*2 + 0.1*2 + 0.1*2 = 1.
The latest version (0.11) of Seaborn's histplot has a parameter to indicate that a distribution is discrete. And a parameter stat= where you can choose between 'count' for bin heights indicating the usual counts and 'probability' for heights relative to their probability, mimicking a probability mass function. The blue histogram could be drawn as:
import seaborn as sns
sns.histplot(data, x='SECOND', discrete=True, stat='probability', facecolor='b', alpha=0.8, ax=axs[1, 1])

Related

How to Generate Two Separate Y-Axes For A Histogram on the Same Figure In Seaborn

I'd like to generate a single figure that has two y axes: Count (from the histogram) and Density (from the KDE).
I want to use sns.displot in Seaborn >= v 0.11.
import seaborn as sns
df = sns.load_dataset('tips')
# graph 1: This should be the Y-Axis on the left side of the figure
sns.displot(df['total_bill'], kind='hist', bins=10)
# graph 2: This should be the Y-axis on the right side of the figure
sns.displot(df['total_bill'], kind='kde')
The code I've written generates two separate graphs; I could just use a facet grid for two separate graphs, but I want to be more concise and place the two y-axes on the two separate grids into a single figure sharing the same x-axis.
displot() is a figure-level function, which can create multiple subplots inside a figure. As such, you don't have control over individual axes.
To create combined plots, you can use the underlying axes-level functions: histplot() and kdeplot() for Seaborn v.0.11. These functions accept an ax= parameter. twinx() creates a second y-axis.
import matplotlib.pyplot as plt
import seaborn as sns
df = sns.load_dataset('tips')
fig, ax = plt.subplots()
sns.histplot(df['total_bill'], bins=10, ax=ax)
ax2 = ax.twinx()
sns.kdeplot(df['total_bill'], ax=ax2)
plt.tight_layout()
plt.show()
Edit:
As mentioned in the comments, the y-axes aren't aligned. The left axis only tells something about the histogram. E.g. the highest bin having height 68 means that there are exactly 68 total bills between 12.618 and 17.392. The right axis only tells something about the kde. E.g. a y-value of 0.043 for x=20 would mean there is about 4.3 % probability that the total bill would be between 19.5 and 20.5.
To align both similar to sns.histplot(..., kde=True), the area of the histogram can be calculated (bin width times number of data values) and used as a scaling factor. Such scaling would make the area of the histogram and the area below the kde curve equal when measured in pixels:
num_bins = 10
bin_width = (df['total_bill'].max() - df['total_bill'].min()) / num_bins
hist_area = len(df) * bin_width
ax2.set_ylim(ymax=ax.get_ylim()[1] / hist_area)
Note that the right axis would be more similar to a percentage if the histogram would use a bin width with a power of ten (e.g. sns.histplot(..., bins=np.arange(0, df['total_bill'].max()+10, 10)). Which bins would be most suitable strongly depends on how you want to interpret your data.

Scipy.stats.gaussian_kde gives a pdf that is outside the range (0,1) [duplicate]

Sometimes when I create a histogram, using say seaborn's displot function, with norm_hist = True, the y-axis is less than 1 as expected for a PDF. Other times it takes on values greater than one.
For example if I run
sns.set();
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
The rule isn't that all the bars should sum to one. The rule is that all the areas of all the bars should sum to one. When the bars are very narrow, their sum can be quite large although their areas sum to one. The height of a bar times its width is the probability that a value would all in that range. To have the height being equal to the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in milimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot at the left uses bins of 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot at the right uses exactly the same data, but measures in milimeter. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because of the bin width of 1.
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.

Histogram bin size

I have a code like this and I am wondering why my bin size of the two plotted graphs is different?
import matplotlib.pyplot as pyplot
bins=15
pyplot.rcParams["figure.figsize"] = (10,10)
#echte_Ladezeit
pyplot.hist(Y_test, bins, alpha=1, label='Y_test; orange Dateien',
color='orange', weights = np.ones_like(Y_test)/float(len(Y_test)))
pyplot.hist(Y_train, bins, alpha=1, label='Y_train; grüne Dateien',
color='green', weights = np.ones_like(Y_train)/float(len(Y_train)))
pyplot.title('Verteilung echte_Ladezeit')
pyplot.xlabel('echte_Ladezeit')
pyplot.ylabel('Häufigkeit [%]')
pyplot.legend(loc='upper right')
pyplot.show()
actually the marked width of the orange and the green one should be the same right? Do I have any mistake in my code?
Your code contains pyplot.hist(..., bins, ...) where bins = 15. This means 15 bins equally spaced between max and min values. Max and min values are different for two datasets so you get different sets of 15 bins. If you want to get bins of equal width for every dataset then you have at least two options.
Normalize datasets - max and min values should be the same for both datasets.
Define bins as a sequence (for example, list(range(0, 40000 + 1, 5000))) as described here.

Plot 2 histograms with different length of data points in one graph using matplotlib

I have two set of data with one containing around 11 million data points and the another around 5000. I would like to plot them both on one histogram. But because of the difference in size I need to normalise the frequency so I can plot them on the same figure. Below I have simulated what I have done with my data to be able to plot them. I have used the normed=True.
from numpy.random import randn
import matplotlib.pyplot as plt
import random
datalist1=[]
for x in range(1,50000):
datalist1.append(random.uniform(1,2))
datalist2=randn(5000000)
fig= plt.figure(1)
plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', normed=True)
plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',normed=True)
plt.xlabel("Value")
plt.ylabel("Normalised Frequency")
plt.legend()
plt.show()
Can you please tell me if this is a good way to get around this issue? I would like to match the tallest hight between the two histogram frequencies to be 1 (or 100%).
The normed=True setting normalizes the histogram to an area of 1. That gives the histogram an interpretation as estimates of probability density functions.
In short, it actually makes sense not to normalize on the peak but on the area.
But if you really want to normalize by height you can modify the polygon data of the histogram:
h = plt.hist(datalist1,bins=20,color='b',alpha=0.3,label='theoretical',histtype='stepfilled', normed=True)
p = h[2][0]
p.xy[:,1] /= p.xy[:, 1].max()
h = plt.hist(datalist2,bins=20,alpha=0.5,color='g',label='experimental',histtype='stepfilled',normed=True)
p = h[2][0]
p.xy[:,1] /= p.xy[:, 1].max()
This solution feels a bit hackish, but at least it's quick and dirty :)

Plot a histogram such that bar heights sum to 1 (probability)

I'd like to plot a normalized histogram from a vector using matplotlib. I tried the following:
plt.hist(myarray, normed=True)
as well as:
plt.hist(myarray, normed=1)
but neither option produces a y-axis from [0, 1] such that the bar heights of the histogram sum to 1.
If you want the sum of all bars to be equal unity, weight each bin by the total number of values:
weights = np.ones_like(myarray) / len(myarray)
plt.hist(myarray, weights=weights)
Note for Python 2.x: add casting to float() for one of the operators of the division as otherwise you would end up with zeros due to integer division
It would be more helpful if you posed a more complete working (or in this case non-working) example.
I tried the following:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(1000)
fig = plt.figure()
ax = fig.add_subplot(111)
n, bins, rectangles = ax.hist(x, 50, density=True)
fig.canvas.draw()
plt.show()
This will indeed produce a bar-chart histogram with a y-axis that goes from [0,1].
Further, as per the hist documentation (i.e. ax.hist? from ipython), I think the sum is fine too:
*normed*:
If *True*, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
``n/(len(x)*dbin)``. In a probability density, the integral of
the histogram should be 1; you can verify that with a
trapezoidal integration of the probability density function::
pdf, bins, patches = ax.hist(...)
print np.sum(pdf * np.diff(bins))
Giving this a try after the commands above:
np.sum(n * np.diff(bins))
I get a return value of 1.0 as expected. Remember that normed=True doesn't mean that the sum of the value at each bar will be unity, but rather than the integral over the bars is unity. In my case np.sum(n) returned approx 7.2767.
I know this answer is too late considering the question is dated 2010 but I came across this question as I was facing a similar problem myself. As already stated in the answer, normed=True means that the total area under the histogram is equal to 1 but the sum of heights is not equal to 1. However, I wanted to, for convenience of physical interpretation of a histogram, make one with sum of heights equal to 1.
I found a hint in the following question - Python: Histogram with area normalized to something other than 1
But I was not able to find a way of making bars mimic the histtype="step" feature hist(). This diverted me to : Matplotlib - Stepped histogram with already binned data
If the community finds it acceptable I should like to put forth a solution which synthesises ideas from both the above posts.
import matplotlib.pyplot as plt
# Let X be the array whose histogram needs to be plotted.
nx, xbins, ptchs = plt.hist(X, bins=20)
plt.clf() # Get rid of this histogram since not the one we want.
nx_frac = nx/float(len(nx)) # Each bin divided by total number of objects.
width = xbins[1] - xbins[0] # Width of each bin.
x = np.ravel(zip(xbins[:-1], xbins[:-1]+width))
y = np.ravel(zip(nx_frac,nx_frac))
plt.plot(x,y,linestyle="dashed",label="MyLabel")
#... Further formatting.
This has worked wonderfully for me though in some cases I have noticed that the left most "bar" or the right most "bar" of the histogram does not close down by touching the lowest point of the Y-axis. In such a case adding an element 0 at the begging or the end of y achieved the necessary result.
Just thought I'd share my experience. Thank you.
Here is another simple solution using np.histogram() method.
myarray = np.random.random(100)
results, edges = np.histogram(myarray, normed=True)
binWidth = edges[1] - edges[0]
plt.bar(edges[:-1], results*binWidth, binWidth)
You can indeed check that the total sums up to 1 with:
> print sum(results*binWidth)
1.0
The easiest solution is to use seaborn.histplot, or seaborn.displot with kind='hist', and specify stat='probability'
probability: or proportion: normalize such that bar heights sum to 1
density: normalize such that the total area of the histogram equals 1
data: pandas.DataFrame, numpy.ndarray, mapping, or sequence
seaborn is a high-level API for matplotlib
Tested in python 3.8.12, matplotlib 3.4.3, seaborn 0.11.2
Imports and Data
import seaborn as sns
import matplotlib.pyplot as plt
# load data
df = sns.load_dataset('penguins')
sns.histplot
axes-level plot
# create figure and axes
fig, ax = plt.subplots(figsize=(6, 5))
p = sns.histplot(data=df, x='flipper_length_mm', stat='probability', ax=ax)
sns.displot
figure-level plot
p = sns.displot(data=df, x='flipper_length_mm', stat='probability', height=4, aspect=1.5)
Since matplotlib 3.0.2, normed=True is deprecated. To get the desired output I had to do:
import numpy as np
data=np.random.randn(1000)
bins=np.arange(-3.0,3.0,51)
counts, _ = np.histogram(data,bins=bins)
if density: # equivalent of normed=True
counts_weighter=counts.sum()
else: # equivalent of normed=False
counts_weighter=1.0
plt.hist(bins[:-1],bins=bins,weights=counts/counts_weighter)
Trying to specify weights and density simultaneously as arguments to plt.hist() did not work for me. If anyone know of a way to get that working without having access to the normed keyword argument then please let me know in the comments and I will delete/modify this answer.
If you want bin centres then don't use bins[:-1] which are the bin edges - you need to choose a suitable scheme for how to calculate the centres (which may or may not be trivially derived).

Categories