Get bin width used for seaborn plot - python

How do I find out what bin width was used when doing a distplot in Seaborn? I have two datasets that I would like to share bin widths, but I don't know how to return the default value used for the first dataset. For something like the simple example below, how would I find out the bin width used?
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
f, axs = plt.subplots(1, 1)
distribution = np.random.rand(1000)
sns.distplot(distribution, hist=True, kde_kws={"shade": True}, ax=axs)

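Seaborn uses the Freedman-Diaconis rule to calculate the bin width if the bins parameter is not specified in the function seaborn.distplot().
The equation is as follows (from Wikipedia):
bin width = 2 * IQR(x) / n^(1/3)
where IQR(x) is the interquartile range of the data and n is the number of observations.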
We can calculate IQR and the cube-root of n with the following code.
Q1 = np.quantile(distribution, 0.25)
Q3 = np.quantile(distribution, 0.75)
IQR = Q3 - Q1
cube = np.cbrt(len(distribution))
The bin width is:
In[] : 2*IQR/cube
Out[]: 0.10163947994817446
Finally, we can now calculate the number of bins.
In[] : 1/(2*IQR/cube) # '1' is the range of the array for this example
Out[]: 9.838696543015526
When we round up the result, it amounts to 10. That's our number of bins. We can now specify bins parameter to get the same number of bins (or same bin width for the same range)
Graph w/o specifying bins:
f, axs = plt.subplots(1,1)
distribution=np.random.rand(1000)
sns.distplot(distribution, hist=True , kde_kws={"shade": True},ax=axs)
Graph w/ specifying the parameter bins=10:
f, axs = plt.subplots(1,1)
sns.distplot(distribution, bins=10, hist=True , kde_kws={"shade": True},ax=axs)
Update:
The documentation for Seaborn version 0.9 mentioned the Freedman-Diaconis rule as the way to calculate the bin size:
Specification of hist bins, or None to use Freedman-Diaconis rule.
The description changed in version 0.10 as follows:
Specification of hist bins. If unspecified, as reference rule is used that tries to find a useful default.
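Because the default rule depends on the seaborn version, one way to make two datasets share bins is to compute the edges once and pass them to both plotting calls. The following is a minimal sketch using NumPy only. The second dataset `second` and the names `fd_width` and `shared_edges` are illustrative assumptions, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(0)
first = rng.random(1000)                 # stands in for the first dataset
second = rng.random(500) * 1.5 - 0.25    # hypothetical second dataset

# Let NumPy apply the Freedman-Diaconis rule to the first dataset.
edges = np.histogram_bin_edges(first, bins="fd")
fd_width = edges[1] - edges[0]           # width actually used after rounding the bin count

# The same rule by hand, as in the answer above.
q1, q3 = np.quantile(first, [0.25, 0.75])
manual_width = 2 * (q3 - q1) / np.cbrt(len(first))
n_bins = int(np.ceil((first.max() - first.min()) / manual_width))

# Build edges with that width that span both datasets.
lo = min(first.min(), second.min())
hi = max(first.max(), second.max())
n_shared = int(np.ceil((hi - lo) / fd_width))
shared_edges = lo + fd_width * np.arange(n_shared + 1)

print(fd_width, n_bins, len(shared_edges) - 1)
```

Passing `bins=shared_edges` to both plotting calls should then give the two histograms identical bin widths.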


Scipy.stats.gaussian_kde gives a pdf that is outside the range (0,1) [duplicate]

Sometimes when I create a histogram, using say seaborn's distplot function, with norm_hist = True, the y-axis is less than 1 as expected for a PDF. Other times it takes on values greater than one.
For example if I run
sns.set();
x = np.random.randn(10000)
ax = sns.distplot(x)
Then the y-axis on the histogram goes from 0.0 to 0.4 as expected, but if the data is not normal the y-axis can be as large as 30 even if norm_hist = True.
What am I missing about the normalization arguments for histogram functions, e.g. norm_hist for sns.distplot? Even if I normalize the data myself by creating a new variable thus:
new_var = data/sum(data)
so that the data sums to 1, the y-axis will still show values far larger than 1 (like 30 for example) whether the norm_hist argument is True or not.
What interpretation can I give when the y-axis has such a large range?
I think what is happening is my data is concentrated closely around zero so in order for the data to have an area equal to 1 (under the kde for example) the height of the histogram has to be larger than 1...but since probabilities can't be above 1 what does the result mean?
Also, how can I get these functions to show probability on the y-axis?
The rule isn't that the heights of all the bars should sum to one. The rule is that the areas of all the bars should sum to one. When the bars are very narrow, the sum of their heights can be quite large even though their areas sum to one. The height of a bar times its width is the probability that a value would fall in that range. For the height to equal the probability, you need bars of width one.
Here is an example to illustrate what's going on.
import numpy as np
from matplotlib import pyplot as plt
import seaborn as sns
fig, axs = plt.subplots(ncols=2, figsize=(14, 3))
a = np.random.normal(0, 0.01, 100000)
sns.distplot(a, bins=np.arange(-0.04, 0.04, 0.001), ax=axs[0])
axs[0].set_title('Measuring in meters')
axs[0].containers[0][40].set_color('r')
a *= 1000
sns.distplot(a, bins=np.arange(-40, 40, 1), ax=axs[1])
axs[1].set_title('Measuring in millimeters')
axs[1].containers[0][40].set_color('r')
plt.show()
The plot at the left uses bins 0.001 meter wide. The highest bin (in red) is about 40 high. The probability that a value falls into that bin is 40*0.001 = 0.04.
The plot at the right uses exactly the same data, but measured in millimeters. Now the bins are 1 mm wide. The highest bin is about 0.04 high. The probability that a value falls into that bin is also 0.04, because the bin width is 1.
PS: As an example of a distribution for which the probability density function has zones larger than 1, see the Pareto distribution with α = 3.
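The relationship between bar height, bin width and probability can also be checked numerically without plotting. This is a small sketch that reuses the same kind of data as the example above; the variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(0, 0.01, 100_000)  # values concentrated closely around zero

# density=True returns bar heights, not probabilities.
heights, edges = np.histogram(a, bins=np.arange(-0.04, 0.04, 0.001), density=True)
widths = np.diff(edges)

area = np.sum(heights * widths)    # total area under the histogram
probabilities = heights * widths   # probability mass of each bin

print(heights.max())        # well above 1 for such narrow bins
print(area)                 # the areas sum to one
print(probabilities.max())  # each bin's probability stays below 1
```

Multiplying each height by its bin width is what turns the density axis back into a probability per bin.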

How does matplotlib calculate the density for histogram

Reading through the matplotlib plt.hist documentation, there is a density parameter that can be set to True. The documentation says:
density : bool, optional
If ``True``, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
the area (or integral) under the histogram will sum to 1.
This is achieved by dividing the count by the number of
observations times the bin width and not dividing by the total
number of observations. If *stacked* is also ``True``, the sum of
the histograms is normalized to 1.
I am looking at the line "This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations".
I tried replicating this with sample data.
**Using matplotlib inbuilt calculations** .
ser = pd.Series(np.random.normal(size=1000))
ser.hist(density = 1, bins=100)
**Manual calculation of the density** :
arr_hist , edges = np.histogram( ser, bins =100)
samp = arr_hist / ser.shape[0] * np.diff(edges)
plt.bar(edges[0:-1] , samp )
plt.grid()
The two plots have completely different y-axis scales. Could someone point out what exactly is going wrong and how to replicate the density calculation manually?
That is an ambiguity in the language. The sentence
This is achieved by dividing the count by the number of observations times the bin width
needs to be read like
This is achieved by dividing (the count) by (the number of observations times the bin width)
i.e.
count / (number of observations * bin width)
Complete code:
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(size=1000)
fig, (ax1, ax2) = plt.subplots(2)
ax1.hist(arr, density = True, bins=100)
ax1.grid()
arr_hist , edges = np.histogram(arr, bins =100)
samp = arr_hist / (arr.shape[0] * np.diff(edges))
ax2.bar(edges[0:-1] , samp, width=np.diff(edges) )
ax2.grid()
plt.show()
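The two readings of the sentence can also be compared numerically, which avoids judging the plots by eye. This is a sketch using np.histogram only; `manual`, `misread` and `builtin` are illustrative names.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.normal(size=1000)

counts, edges = np.histogram(arr, bins=100)
widths = np.diff(edges)

manual = counts / (arr.size * widths)   # count / (number of observations * bin width)
misread = counts / arr.size * widths    # the reading used in the question

builtin, _ = np.histogram(arr, bins=100, density=True)

print(np.allclose(manual, builtin))     # the parenthesised reading matches
print(np.allclose(misread, builtin))    # the other reading does not
```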

normalize histogram with bin size [duplicate]

I'd like to plot a normalized histogram from a vector using matplotlib. I tried the following:
plt.hist(myarray, normed=True)
as well as:
plt.hist(myarray, normed=1)
but neither option produces a y-axis from [0, 1] such that the bar heights of the histogram sum to 1.
If you want the sum of all bars to be equal to unity, weight each value by the reciprocal of the total number of values:
weights = np.ones_like(myarray) / len(myarray)
plt.hist(myarray, weights=weights)
Note for Python 2.x: cast one of the operands of the division to float(), as otherwise you would end up with zeros due to integer division.
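The effect of the weights can be verified with np.histogram, which accepts the same weights argument. This is a minimal sketch with made-up data standing in for myarray.

```python
import numpy as np

rng = np.random.default_rng(0)
myarray = rng.normal(size=1000)  # placeholder data for illustration

# Each value contributes 1/N, so each bar holds the fraction of values in its bin.
weights = np.ones_like(myarray) / len(myarray)
heights, edges = np.histogram(myarray, bins=10, weights=weights)

# The same fractions computed from raw counts.
counts, _ = np.histogram(myarray, bins=10)

print(heights.sum())
print(np.allclose(heights, counts / len(myarray)))
```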
It would be more helpful if you posted a more complete working (or in this case non-working) example.
I tried the following:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.randn(1000)
fig = plt.figure()
ax = fig.add_subplot(111)
n, bins, rectangles = ax.hist(x, 50, density=True)
fig.canvas.draw()
plt.show()
This will indeed produce a bar-chart histogram with a y-axis that goes from [0,1].
Further, as per the hist documentation (i.e. ax.hist? from ipython), I think the sum is fine too:
*normed*:
If *True*, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
``n/(len(x)*dbin)``. In a probability density, the integral of
the histogram should be 1; you can verify that with a
trapezoidal integration of the probability density function::
pdf, bins, patches = ax.hist(...)
print np.sum(pdf * np.diff(bins))
Giving this a try after the commands above:
np.sum(n * np.diff(bins))
I get a return value of 1.0 as expected. Remember that normed=True doesn't mean that the sum of the values at each bar will be unity, but rather that the integral over the bars is unity. In my case np.sum(n) returned approx 7.2767.
I know this answer is too late considering the question is dated 2010 but I came across this question as I was facing a similar problem myself. As already stated in the answer, normed=True means that the total area under the histogram is equal to 1 but the sum of heights is not equal to 1. However, I wanted to, for convenience of physical interpretation of a histogram, make one with sum of heights equal to 1.
I found a hint in the following question - Python: Histogram with area normalized to something other than 1
But I was not able to find a way of making bars mimic the histtype="step" feature of hist(). This diverted me to: Matplotlib - Stepped histogram with already binned data
If the community finds it acceptable I should like to put forth a solution which synthesises ideas from both the above posts.
import numpy as np
import matplotlib.pyplot as plt
# Let X be the array whose histogram needs to be plotted.
nx, xbins, ptchs = plt.hist(X, bins=20)
plt.clf() # Get rid of this histogram since not the one we want.
nx_frac = nx / float(len(X)) # Each bin count divided by total number of objects.
width = xbins[1] - xbins[0] # Width of each bin.
# zip() returns an iterator in Python 3, so wrap it in list() before np.ravel.
x = np.ravel(list(zip(xbins[:-1], xbins[:-1] + width)))
y = np.ravel(list(zip(nx_frac, nx_frac)))
plt.plot(x, y, linestyle="dashed", label="MyLabel")
#... Further formatting.
This has worked wonderfully for me, though in some cases I have noticed that the leftmost "bar" or the rightmost "bar" of the histogram does not close down by touching the lowest point of the Y-axis. In such a case adding an element 0 at the beginning or the end of y achieved the necessary result.
Just thought I'd share my experience. Thank you.
Here is another simple solution using np.histogram() method.
myarray = np.random.random(100)
results, edges = np.histogram(myarray, density=True)  # density replaces the removed normed argument
binWidth = edges[1] - edges[0]
plt.bar(edges[:-1], results*binWidth, binWidth)
You can indeed check that the total sums up to 1 with:
> print(sum(results*binWidth))
1.0
The easiest solution is to use seaborn.histplot, or seaborn.displot with kind='hist', and specify stat='probability'
probability: or proportion: normalize such that bar heights sum to 1
density: normalize such that the total area of the histogram equals 1
data: pandas.DataFrame, numpy.ndarray, mapping, or sequence
seaborn is a high-level API for matplotlib
Tested in python 3.8.12, matplotlib 3.4.3, seaborn 0.11.2
Imports and Data
import seaborn as sns
import matplotlib.pyplot as plt
# load data
df = sns.load_dataset('penguins')
sns.histplot
axes-level plot
# create figure and axes
fig, ax = plt.subplots(figsize=(6, 5))
p = sns.histplot(data=df, x='flipper_length_mm', stat='probability', ax=ax)
sns.displot
figure-level plot
p = sns.displot(data=df, x='flipper_length_mm', stat='probability', height=4, aspect=1.5)
Since matplotlib 3.0.2, normed=True is deprecated. To get the desired output I had to do:
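import numpy as np
import matplotlib.pyplot as plt
density = True  # set to False to plot raw counts
data = np.random.randn(1000)
bins = np.linspace(-3.0, 3.0, 51)  # 51 edges, i.e. 50 bins
counts, _ = np.histogram(data, bins=bins)
if density:  # normalize so that the bar heights sum to 1
    counts_weighter = counts.sum()
else:  # keep the raw counts
    counts_weighter = 1.0
# One dummy value per bin (its left edge), weighted by the normalized count.
plt.hist(bins[:-1], bins=bins, weights=counts / counts_weighter)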
Trying to specify weights and density simultaneously as arguments to plt.hist() did not work for me. If anyone knows of a way to get that working without having access to the normed keyword argument then please let me know in the comments and I will delete/modify this answer.
If you want bin centres then don't use bins[:-1] which are the bin edges - you need to choose a suitable scheme for how to calculate the centres (which may or may not be trivially derived).
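As an illustration of two possible schemes for bin centres, here is a short sketch: arithmetic midpoints for linearly spaced edges and geometric midpoints for logarithmically spaced edges. The log-spaced edges are a hypothetical example, not something used in the answer above.

```python
import numpy as np

# Linearly spaced edges: the centre is the arithmetic mean of adjacent edges.
bins = np.linspace(-3.0, 3.0, 51)
centres = 0.5 * (bins[:-1] + bins[1:])

# Log-spaced edges (hypothetical): the geometric mean sits in the visual
# middle of each bin on a logarithmic axis.
log_bins = np.logspace(0, 2, 5)
log_centres = np.sqrt(log_bins[:-1] * log_bins[1:])

print(centres[:3])
print(log_centres)
```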

matplotlib: disregard outliers when plotting

I'm plotting some data from various tests. Sometimes in a test I happen to have one outlier (say 0.1), while all other values are three orders of magnitude smaller.
With matplotlib, I plot against the range [0, max_data_value]
How can I just zoom into my data and not display outliers, which would mess up the x-axis in my plot?
Should I simply take the 95 percentile and have the range [0, 95_percentile] on the x-axis?
There's no single "best" test for an outlier. Ideally, you should incorporate a-priori information (e.g. "This parameter shouldn't be over x because of blah...").
Most tests for outliers use the median absolute deviation, rather than the 95th percentile or some other variance-based measurement. Otherwise, the variance/stddev that is calculated will be heavily skewed by the outliers.
Here's a function that implements one of the more common outlier tests.
import numpy as np

def is_outlier(points, thresh=3.5):
    """
    Returns a boolean array with True if points are outliers and False
    otherwise.

    Parameters:
    -----------
        points : An numobservations by numdimensions array of observations
        thresh : The modified z-score to use as a threshold. Observations with
            a modified z-score (based on the median absolute deviation) greater
            than this value will be classified as outliers.

    Returns:
    --------
        mask : A numobservations-length boolean array.

    References:
    ----------
        Boris Iglewicz and David Hoaglin (1993), "Volume 16: How to Detect and
        Handle Outliers", The ASQC Basic References in Quality Control:
        Statistical Techniques, Edward F. Mykytka, Ph.D., Editor.
    """
    # Treat a 1-D input as a column of single-dimension observations.
    if len(points.shape) == 1:
        points = points[:,None]
    median = np.median(points, axis=0)
    # Euclidean distance of each observation from the median.
    diff = np.sum((points - median)**2, axis=-1)
    diff = np.sqrt(diff)
    med_abs_deviation = np.median(diff)

    modified_z_score = 0.6745 * diff / med_abs_deviation

    return modified_z_score > thresh
As an example of using it, you'd do something like the following:
import numpy as np
import matplotlib.pyplot as plt
# The function above... In my case it's in a local utilities module
from sci_utilities import is_outlier
# Generate some data
x = np.random.random(100)
# Append a few "bad" points
x = np.r_[x, -3, -10, 100]
# Keep only the "good" points
# "~" operates as a logical not operator on boolean numpy arrays
filtered = x[~is_outlier(x)]
# Plot the results
fig, (ax1, ax2) = plt.subplots(nrows=2)
ax1.hist(x)
ax1.set_title('Original')
ax2.hist(filtered)
ax2.set_title('Without Outliers')
plt.show()
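To make the modified z-score concrete, here is a small worked sketch on six 1-D values. The helper `modified_z_scores` is a simplified stand-in written for this illustration, and its handling of a zero median absolute deviation is an assumption that is not part of the function above.

```python
import numpy as np

def modified_z_scores(values):
    """Modified z-score of each value, based on the median absolute deviation."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    abs_dev = np.abs(values - median)
    mad = np.median(abs_dev)
    if mad == 0:
        # Assumption for this sketch: with no spread at all, flag nothing.
        return np.zeros_like(values)
    return 0.6745 * abs_dev / mad

x = np.array([1.0, 1.1, 0.9, 1.05, 0.95, 100.0])
scores = modified_z_scores(x)
mask = scores > 3.5   # same default threshold as above

print(scores.round(3))
print(x[~mask])       # the values kept after filtering
```

Here the median is 1.025 and the median absolute deviation is 0.075, so only the value 100.0 scores above the threshold.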
If you aren't fussed about rejecting outliers as mentioned by Joe and it is purely aesthetic reasons for doing this, you could just set your plot's x axis limits:
plt.xlim(min_x_data_value,max_x_data_value)
Where the values are your desired limits to display.
plt.ylim(min,max) works to set limits on the y axis also.
I think using pandas quantile is useful and much more flexible.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
fig = plt.figure()
ax1 = fig.add_subplot(121)
ax2 = fig.add_subplot(122)
pd_series = pd.Series(np.random.normal(size=300))
pd_series_adjusted = pd_series[pd_series.between(pd_series.quantile(.05), pd_series.quantile(.95))]
ax1.boxplot(pd_series)
ax1.set_title('Original')
ax2.boxplot(pd_series_adjusted)
ax2.set_title('Adjusted')
plt.show()
I usually pass the data through the function np.clip. If you have some reasonable estimate of the maximum and minimum value of your data, just use that. If you don't have a reasonable estimate, the histogram of clipped data will show you the size of the tails, and if the outliers are really just outliers the tail should be small.
What I run is something like this:
import numpy as np
import matplotlib.pyplot as plt
data = np.random.normal(3, size=100000)
plt.hist(np.clip(data, -15, 8), bins=333, density=True)
You can compare the results if you change the min and max in the clipping function until you find the right values for your data.
In this example, you can see immediately that the max value of 8 is not good because you are removing a lot of meaningful information. The min value of -15 should be fine since the tail is not even visible.
Based on this, you could probably write some code that finds good bounds which keep the sizes of the tails below some tolerance.
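A starting point for such code could be a helper that reports how much of the data a given pair of clip bounds would cut off. This is only a sketch; `clipped_fraction` is a hypothetical name and no search over bounds is attempted.

```python
import numpy as np

def clipped_fraction(data, lower, upper):
    """Fraction of values that np.clip(data, lower, upper) would alter on each side."""
    data = np.asarray(data)
    below = float(np.mean(data < lower))
    above = float(np.mean(data > upper))
    return below, above

rng = np.random.default_rng(0)
data = rng.normal(3, size=100_000)

below, above = clipped_fraction(data, -15, 8)
print(below, above)
```

Bounds could then be widened or narrowed until both fractions fall under a chosen tolerance.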
In some cases (e.g. in histogram plots such as the one in Joe Kington's answer) rescaling the plot could show that the outliers exist but that they have been partially cropped out by the zoom scale. Removing the outliers would not have the same effect as just rescaling. Automatically finding appropriate axes limits seems generally more desirable and easier than detecting and removing outliers.
Here's an autoscale idea using percentiles and data-dependent margins to achieve a nice view.
# xdata = some x data points ...
# ydata = some y data points ...
# Finding limits for y-axis
ypbot = np.percentile(ydata, 1)
yptop = np.percentile(ydata, 99)
ypad = 0.2*(yptop - ypbot)
ymin = ypbot - ypad
ymax = yptop + ypad
Example usage:
fig = plt.figure(figsize=(6, 8))
ax1 = fig.add_subplot(211)
ax1.scatter(xdata, ydata, s=1, c='blue')
ax1.set_title('Original')
ax1.axhline(y=0, color='black')
ax2 = fig.add_subplot(212)
ax2.scatter(xdata, ydata, s=1, c='blue')
ax2.axhline(y=0, color='black')
ax2.set_title('Autoscaled')
ax2.set_ylim([ymin, ymax])
plt.show()
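The limit calculation can be wrapped in a small reusable helper. This is a sketch under the same idea of percentiles plus a data-dependent margin; the function name `percentile_limits` and its parameters are illustrative choices.

```python
import numpy as np

def percentile_limits(values, lower=1, upper=99, pad_fraction=0.2):
    """Axis limits from two percentiles, widened by a fraction of their span."""
    bottom, top = np.percentile(values, [lower, upper])
    pad = pad_fraction * (top - bottom)
    return float(bottom - pad), float(top + pad)

rng = np.random.default_rng(0)
# Mostly well-behaved data plus two extreme points.
ydata = np.r_[rng.normal(size=1000), 50.0, -80.0]

ymin, ymax = percentile_limits(ydata)
print(ymin, ymax)
```

The result can be passed straight to ax.set_ylim(ymin, ymax); the extreme points stay in the data and are simply outside the view.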
