Plot a histogram such that the total height equals 1 - python

This is a follow-up question to this answer. I'm trying to plot a normed histogram, but instead of getting 1 as the maximum value on the y axis, I'm getting different numbers.
For array k=(1,4,3,1)
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k=(1,4,3,1)
    plt.hist(k, normed=1)
    plt.xticks( np.arange(10) ) # 10 ticks on x axis
    plt.show()
plotGraph()
I get this histogram, which doesn't look normed.
For a different array k=(3,3,3,3)
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k=(3,3,3,3)
    plt.hist(k, normed=1)
    plt.xticks( np.arange(10) ) # 10 ticks on x axis
    plt.show()
plotGraph()
I get this histogram, whose maximum y-value is 10.
For different k I get a different maximum value of y, even though normed=1 or normed=True.
Why does the normalization (if it works) change based on the data, and how can I make the maximum value of y equal to 1?
UPDATE:
I am trying to implement Carsten König's answer from "plotting histograms whose bar heights sum to 1 in matplotlib" and am getting a very weird result:
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k=(1,4,3,1)
    weights = np.ones_like(k)/len(k)
    plt.hist(k, weights=weights)
    plt.xticks( np.arange(10) ) # 10 ticks on x axis
    plt.show()
plotGraph()
Result:
What am I doing wrong?

When plotting a normalized histogram, the area under the curve should sum to 1, not the height.
In [44]:
import matplotlib.pyplot as plt
import numpy as np
k=(3,3,3,3)
x, bins, p=plt.hist(k, density=True) # used to be normed=True in older versions
plt.xticks( np.arange(10) ) # 10 ticks on x axis
plt.show()
In [45]:
print(bins)
[2.5 2.6 2.7 2.8 2.9 3.  3.1 3.2 3.3 3.4 3.5]
In this example the bin width is 0.1 and the single occupied bar has height 10, so the area under the histogram sums to one (0.1*10).
x stores the height of each bin. p stores the individual bar objects (they are actually patches). So we just sum up x and rescale the height of each patch.
To make the heights sum to 1, add the following before plt.show():
for item in p:
    item.set_height(item.get_height()/sum(x))
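The difference between "area sums to 1" and "heights sum to 1" can be checked numerically without drawing anything. The following is a minimal sketch, assuming np.histogram is used in place of plt.hist (both accept density=True); the variable names are illustrative only.

```python
import numpy as np

k = np.array([3, 3, 3, 3])

# density=True scales the bars so that sum(height * width) == 1
density, edges = np.histogram(k, bins=10, density=True)
widths = np.diff(edges)
area = float(np.sum(density * widths))

# Multiplying each height by its bin width gives the fraction of
# samples in that bin, so these rescaled heights sum to 1 instead
fractions = density * widths
total_height = float(fractions.sum())

print(area, total_height, density.max())
```

With ten bins of width 0.1, the single occupied bin has to reach a height of about 10 for the area to be 1, which matches the maximum seen in the question.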

You could use the solution outlined here:
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)
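A likely reason the same idea produced a strange plot in the question's update: under Python 2, np.ones_like(k)/len(k) on integer data is an integer division, so every weight is truncated to 0. The float(...) call above avoids that. The sketch below, which assumes np.histogram as a stand-in for plt.hist, forces a float dtype and checks that the weighted heights sum to 1.

```python
import numpy as np

k = np.array([1, 4, 3, 1])

# Force a float dtype so the weights cannot be truncated to integers
weights = np.ones_like(k, dtype=float) / len(k)

# Each sample contributes 1/len(k), so the bar heights add up to 1
heights, edges = np.histogram(k, bins=10, weights=weights)

print(heights.sum(), heights.max())
```

The tallest bar is 0.5 here because two of the four samples fall into the same bin.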

One way is to get the probabilities on your own, and then plot with plt.bar:
In [91]: from collections import Counter
    ...: c=Counter(k)
    ...: print(c)
Counter({1: 2, 4: 1, 3: 1})
In [92]: plt.bar(list(c.keys()), list(c.values()))
    ...: plt.show()
result:
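The bar chart above shows raw counts. To get probabilities, each count can be divided by the number of samples. This is a small sketch of that step; the names values and probs are illustrative and not part of the original answer.

```python
from collections import Counter

k = (1, 4, 3, 1)
counts = Counter(k)

# Sort the distinct values so the bars appear in increasing order
values = sorted(counts)
# Relative frequency of each distinct value
probs = [counts[v] / len(k) for v in values]

print(values, probs)
```

Passing values and probs to plt.bar(values, probs) then gives bars whose heights sum to 1.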

A normed histogram is defined such that the sum of the products of width and height over all columns is equal to 1. That's why your maximum is not equal to one.
However, if you still want to force the maximum to be 1, you could use numpy and matplotlib.pyplot.bar in the following way:
import numpy as np
import matplotlib.pyplot as plt
sample = np.random.normal(0,10,100)
#generate bins boundaries and heights
bin_height,bin_boundary = np.histogram(sample,bins=10)
#define width of each column
width = bin_boundary[1]-bin_boundary[0]
#standardize each column by dividing with the maximum height
bin_height = bin_height/float(max(bin_height))
#plot, anchoring each bar at its left bin boundary
plt.bar(bin_boundary[:-1],bin_height,width = width,align='edge')
plt.show()

I found it very easy to use plotly express. Here is my code for your example:
import plotly.express as px
k= [1,4,3,1]
px.histogram(k,nbins=10,range_x=[0,10],histnorm='probability')
This gives the normalized histogram the way you want it. If you want to use percentages instead of probabilities, simply change the last line of code to
px.histogram(k,nbins=10,range_x=[0,10],histnorm='percent')
If you don't want to set range_x and nbins manually to make sure the area of the histogram is always one, you can use the following code:
x_min=int(min(k))-1
x_max=int(max(k))+1
x_bins = x_max-x_min
px.histogram(k,nbins=x_bins,range_x=[x_min,x_max],histnorm='probability')

Related

Representing an experiment with two dices using matplotlib - wrong representation [duplicate]

I'm generating some histograms with matplotlib and I'm having some trouble figuring out how to get the xticks of a histogram to align with the bars.
Here's a sample of the code I use to generate the histogram:
from matplotlib import pyplot as py
py.hist(histogram_data, 49, alpha=0.75)
py.title(column_name)
py.xticks(range(49))
py.show()
I know that all of the values in the histogram_data array are in [0,1,...,48]. Which, assuming I did the math right, means there are 49 unique values. I'd like to show a histogram of each of those values. Here's a picture of what's generated.
How can I set up the graph such that all of the xticks are aligned to the left, middle or right of each of the bars?
Short answer: Use plt.hist(data, bins=range(50)) instead to get left-aligned bins, plt.hist(data, bins=np.arange(50)-0.5) to get center-aligned bins, etc.
Also, if performance matters, because you want counts of unique integers, there are a couple of slightly more efficient methods (np.bincount) that I'll show at the end.
Problem Statement
As a stand-alone example of what you're seeing, consider the following:
import matplotlib.pyplot as plt
import numpy as np
# Generate a random array of integers between 0-9
# data.min() will be 0 and data.max() will be 9 (not 10)
data = np.random.randint(0, 10, 1000)
plt.hist(data, bins=10)
plt.xticks(range(10))
plt.show()
As you've noticed, the bins aren't aligned with integer intervals. This is basically because you asked for 10 bins between 0 and 9, which isn't quite the same as asking for bins for the 10 unique values.
The number of bins you want isn't exactly the same as the number of unique values. What you actually should do in this case is manually specify the bin edges.
To explain what's going on, let's skip matplotlib.pyplot.hist and just use the underlying numpy.histogram function.
For example, let's say you have the values [0, 1, 2, 3]. Your first instinct would be to do:
In [1]: import numpy as np
In [2]: np.histogram([0, 1, 2, 3], bins=4)
Out[2]: (array([1, 1, 1, 1]), array([ 0. , 0.75, 1.5 , 2.25, 3. ]))
The first array returned is the counts and the second is the bin edges (in other words, where bar edges would be in your plot).
Notice that we get the counts we'd expect, but because we asked for 4 bins between the min and max of the data, the bin edges aren't on integer values.
Next, you might try:
In [3]: np.histogram([0, 1, 2, 3], bins=3)
Out[3]: (array([1, 1, 2]), array([ 0., 1., 2., 3.]))
Note that the bin edges (the second array) are what you were expecting, but the counts aren't. That's because the last bin behaves differently than the others, as noted in the documentation for numpy.histogram:
Notes
-----
All but the last (righthand-most) bin is half-open. In other words, if
`bins` is::
[1, 2, 3, 4]
then the first bin is ``[1, 2)`` (including 1, but excluding 2) and the
second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
4.
Therefore, what you actually should do is specify exactly what bin edges you want, and either include one beyond your last data point or shift the bin edges to the 0.5 intervals. For example:
In [4]: np.histogram([0, 1, 2, 3], bins=range(5))
Out[4]: (array([1, 1, 1, 1]), array([0, 1, 2, 3, 4]))
Bin Alignment
Now let's apply this to the first example and see what it looks like:
import matplotlib.pyplot as plt
import numpy as np
# Generate a random array of integers between 0-9
# data.min() will be 0 and data.max() will be 9 (not 10)
data = np.random.randint(0, 10, 1000)
plt.hist(data, bins=range(11)) # <- The only difference
plt.xticks(range(10))
plt.show()
Okay, great! However, we now effectively have left-aligned bins. What if we wanted center-aligned bins to better reflect the fact that these are unique values?
The quick way is to just shift the bin edges:
import matplotlib.pyplot as plt
import numpy as np
# Generate a random array of integers between 0-9
# data.min() will be 0 and data.max() will be 9 (not 10)
data = np.random.randint(0, 10, 1000)
bins = np.arange(11) - 0.5
plt.hist(data, bins)
plt.xticks(range(10))
plt.xlim([-1, 10])
plt.show()
Similarly for right-aligned bins, just shift by -1.
Another approach
For the particular case of unique integer values, there's another, more efficient approach we can take.
If you're dealing with unique integer counts starting with 0, you're better off using numpy.bincount than using numpy.histogram.
For example:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randint(0, 10, 1000)
counts = np.bincount(data)
# Switching to the OO-interface. You can do all of this with "plt" as well.
fig, ax = plt.subplots()
ax.bar(range(10), counts, width=1, align='center')
ax.set(xticks=range(10), xlim=[-1, 10])
plt.show()
There are two big advantages to this approach. One is speed. numpy.histogram (and therefore plt.hist) basically runs the data through numpy.digitize and then numpy.bincount. Because you're dealing with unique integer values, there's no need to take the numpy.digitize step.
However, the bigger advantage is more control over display. If you'd prefer thinner rectangles, just use a smaller width:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randint(0, 10, 1000)
counts = np.bincount(data)
# Switching to the OO-interface. You can do all of this with "plt" as well.
fig, ax = plt.subplots()
ax.bar(range(10), counts, width=0.8, align='center')
ax.set(xticks=range(10), xlim=[-1, 10])
plt.show()
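The claim that np.bincount gives the same counts as a histogram with one bin per integer can be checked directly. This is a minimal sketch, assuming non-negative integer data in the range 0-9 and NumPy 1.17 or newer for default_rng; the fixed seed is only there to make the run repeatable.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.integers(0, 10, size=1000)  # integers 0..9

# One count per integer value, no bin edges involved
counts_bincount = np.bincount(data, minlength=10)

# Same counts from a histogram with center-aligned unit bins
counts_hist, edges = np.histogram(data, bins=np.arange(11) - 0.5)

print(counts_bincount)
print(counts_hist)
```

Both arrays should be identical, so either can be passed to ax.bar.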
What you need is the edges of each bin, which you can then use as the xticks.
Say you have some numbers in x to generate a histogram.
import matplotlib.pyplot as plt
import numpy as np
import random
n=1000
x=np.zeros(n)
for i in range(n):
    x[i]=random.uniform(0,100)
Now let's create the histogram.
n, bins, edges = plt.hist(x,bins=5,ec="red",alpha=0.7)
n is the array with the number of items in each bin.
bins is the array with the values of the bin edges.
edges is the list of patch objects.
Now that you have the locations of the bin edges from left to right, display them as the xticks.
plt.xticks(bins)
plt.show()
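If only the tick positions are needed, the edges can also be computed without drawing the histogram. The following is a minimal sketch, assuming NumPy 1.17 or newer (np.histogram_bin_edges was added in 1.15, default_rng in 1.17); the seed and variable names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 100, size=1000)

# Edges of 5 equal-width bins spanning the data range
edges = np.histogram_bin_edges(x, bins=5)

# np.histogram returns the same edges alongside the counts
counts, edges_from_hist = np.histogram(x, bins=5)

print(edges)
```

The resulting array has one more entry than the number of bins and can be passed to plt.xticks in the same way as the bins value returned by plt.hist.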
If you comment out bins.append(sorted(set(labels))[-1]):
bins = [i_bin - 0.5 for i_bin in sorted(set(labels))]
# bins.append(sorted(set(labels))[-1])
plt.hist(labels, bins)
plt.show()
If you leave it in:
bins = [i_bin - 0.5 for i_bin in sorted(set(labels))]
bins.append(sorted(set(labels))[-1])
plt.hist(labels, bins)
plt.show()
Using the OO interface to configure ticks has the advantage of centering the labels while preserving the xticks. Also, it works with any plotting function and doesn't depend on np.bincount() or ax.bar():
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
import numpy as np
data = np.random.randint(0, 10, 1000)
mybins = range(11)
fig, ax = plt.subplots()
ax.hist(data, bins=mybins, rwidth=0.8)
ax.set_xticks(mybins)
ax.xaxis.set_minor_locator(tkr.AutoMinorLocator(n=2))
ax.xaxis.set_minor_formatter(tkr.FixedFormatter(mybins))
ax.xaxis.set_major_formatter(tkr.NullFormatter())
for tick in ax.xaxis.get_minor_ticks():
    tick.tick1line.set_markersize(0)
I think the best way is to use the patches and bins returned from plt.hist. Below is a simple example.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randint(10, 60, 1000)
height, bins, patches = plt.hist(data, bins=15, ec='k')
ticks = [(patch.get_x() + (patch.get_x() + patch.get_width()))/2 for patch in patches] ## or ticklabels
ticklabels = (bins[1:] + bins[:-1]) / 2 ## or ticks
plt.xticks(ticks, np.round(ticklabels, 2), rotation=90)
plt.show()

How can I create a list of the values on the y-axis without having to plot a graph in Python?

I have a piece of code that plots a random walk with a specified number of bins on my y-axis. Is there a way in Python to replicate/recreate the values on my y-axis, without having to plot the graph? Below is the code I've been working on. The method I've tried is to divide the min-max range by the number of wanted bins and then create a list with these values. However, I find my method far from optimal, and its results are not close to what I get from the code below.
I am grateful for any help on this matter!
import matplotlib.pyplot as plt
import numpy as np
import random
dims = 1
step_n = 2000
step_set = [-1, 0, 1]
origin = np.zeros((1,dims))
random.seed(30)
step_shape = (step_n,dims)
steps = np.random.choice(a=step_set, size=step_shape)
path = np.concatenate([origin, steps]).cumsum(0)
# create subplot
fig, ax = plt.subplots(1,1, figsize=(20, 11))
img = ax.plot(path)
plt.locator_params(axis='y', nbins=20)
y_values = ax.get_yticks() # y_values is a numpy array with my y values
I am not sure I understood your problem correctly.
Matplotlib chooses the spacing between the ticks in a way that, I assume, mostly yields multiples of 5.
But a general approach could be to calculate a padding based on the number of bins you want and add/subtract it. For your given example, the following gives the same result as ax.get_yticks():
bins = 19
padding = np.ceil((np.max(path) - np.min(path)) / bins)
np.linspace(np.min(path) - padding, np.max(path) + padding, bins, dtype=np.int32)
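Another option, sketched here as an assumption rather than as part of the answer above, is to ask Matplotlib's tick locator for the values directly. Default linear axes use matplotlib.ticker.AutoLocator, and plt.locator_params(nbins=...) adjusts that locator, so calling tick_values on a standalone AutoLocator needs no figure. The limits passed in below are the raw data range; a real plot adds margins around the data, so the plotted ticks may differ at the ends.

```python
import numpy as np
from matplotlib.ticker import AutoLocator

rng = np.random.default_rng(30)
steps = rng.choice([-1, 0, 1], size=2000)
path = np.concatenate([[0], steps]).cumsum()  # 1-D random walk

# Same kind of locator a default linear axis uses, with nbins=20
locator = AutoLocator()
locator.set_params(nbins=20)

# Tick positions covering the data range, computed without plotting
y_values = locator.tick_values(float(path.min()), float(path.max()))

print(y_values)
```

The returned ticks are evenly spaced at a "nice" step and extend to at least the minimum and maximum of the walk.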

How to re-scale the counts in a matplotlib histogram

I have a matplotlib histogram that works fine.
hist_bin_width = 4
on_hist = plt.hist(my_data,bins=range(-100, 200,hist_bin_width),alpha=.3,color='#6e9bd1',label='on')
All I want to do is to rescale by a factor of, say, 2. I don't want to change the bin width, or to change the y axis labels. I want to take the counts in all the bins (e.g. bin 1 has 17 counts) and multiply by 2 so that bin 1 now has 34 counts in it.
Is this possible?
Thank you.
As it's just a simple rescaling of the y-axis, this must be possible. The complication arises because Matplotlib's hist computes and draws the histogram, making it difficult to intervene. However, as the documentation also notes, you can use the weights parameter to "draw a histogram of data that has already been binned". You can bin the data in a first step with Numpy's histogram function. Applying the scaling factor is then straightforward:
from matplotlib import pyplot
import numpy
numpy.random.seed(0)
data = numpy.random.normal(50, 20, 10000)
(counts, bins) = numpy.histogram(data, bins=range(101))
factor = 2
pyplot.hist(bins[:-1], bins, weights=factor*counts)
pyplot.show()
pyplot.hist's weights argument can be used to weight each data point by a factor, like so:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
data = np.random.normal(50, 20, 10000)
factor = 2
hist_bin_width = 40
plt.hist(data, bins=range(-100, 200, hist_bin_width),
         weights=factor*np.ones_like(data))
plt.show()
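That weighting every sample by the factor really multiplies each bin count can be verified without plotting. This is a minimal sketch using np.histogram as a stand-in for plt.hist, with the bin width of 4 from the question; the seed and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(42)
data = rng.normal(50, 20, 10000)

factor = 2
edges = np.arange(-100, 200, 4)  # same bins as range(-100, 200, 4)

# Plain counts per bin
counts, _ = np.histogram(data, bins=edges)

# Every sample counts `factor` times
scaled, _ = np.histogram(data, bins=edges,
                         weights=factor * np.ones_like(data))

print(counts.max(), scaled.max())
```

Each entry of scaled is exactly twice the corresponding entry of counts, while the bin edges stay unchanged.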

display a histogram with very non-uniform bin widths

Here is the histogram
To generate this plot, I did:
bins = np.array([0.03, 0.3, 2, 100])
plt.hist(m, bins = bins, weights=np.zeros_like(m) + 1. / m.size)
However, as you noticed, I want to plot the histogram of the relative frequency of each data point with only 3 bins that have different sizes:
bin1 = 0.03 -> 0.3
bin2 = 0.3 -> 2
bin3 = 2 -> 100
The histogram looks ugly since the size of the last bin is extremely large relative to the other bins. How can I fix the histogram? I want to change the width of the bins but I do not want to change the range of each bin.
As @cel pointed out, this is no longer a histogram, but you can do what you are asking using plt.bar and np.histogram. You then just need to set the xticklabels to a string describing the bin edges. For example:
import numpy as np
import matplotlib.pyplot as plt
bins = [0.03,0.3,2,100] # your bins
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99] # random data
hist, bin_edges = np.histogram(data,bins) # make the histogram
fig,ax = plt.subplots()
# Plot the histogram heights against integers on the x axis
ax.bar(range(len(hist)),hist,width=1)
# Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])
# Set the xticklabels to a string that tells us what the bin edges were
ax.set_xticklabels(['{} - {}'.format(bins[i],bins[i+1]) for i,j in enumerate(hist)])
plt.show()
EDIT
If you update to matplotlib v1.5.0, you will find that bar now takes a kwarg tick_label, which can make this plotting even easier (see here):
hist, bin_edges = np.histogram(data,bins)
ax.bar(range(len(hist)),hist,width=1,align='center',tick_label=
['{} - {}'.format(bins[i],bins[i+1]) for i,j in enumerate(hist)])
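The question asked for relative frequencies, while the bars above show raw counts. A small sketch of the extra step, using the same illustrative data and bins; the names rel_freq and labels are assumptions and not from the answer:

```python
import numpy as np

bins = [0.03, 0.3, 2, 100]
data = [0.04, 0.07, 0.1, 0.2, 0.2, 0.8, 1, 1.5,
        4, 5, 7, 8, 43, 45, 54, 56, 99]

hist, bin_edges = np.histogram(data, bins)

# Fraction of all samples that falls into each bin
rel_freq = hist / hist.sum()

# One label per bar, describing the true bin range
labels = ['{} - {}'.format(bins[i], bins[i + 1]) for i in range(len(hist))]

print(dict(zip(labels, rel_freq)))
```

Passing rel_freq instead of hist to ax.bar keeps the equal-width bars and makes the heights sum to 1.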
If the actual values of your bins are not important but you want a histogram of values spanning very different orders of magnitude, you can use logarithmic scaling along the x axis. This gives you bars with equal widths:
import numpy as np
import matplotlib.pyplot as plt
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99]
plt.hist(data,bins=10**np.linspace(-2,2,5))
plt.xscale('log')
plt.show()
When you have to use your bin values you can do
import numpy as np
import matplotlib.pyplot as plt
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99]
bins = [0.03,0.3,2,100]
plt.hist(data,bins=bins)
plt.xscale('log')
plt.show()
However, in this case the widths are not perfectly equal but still readable. If the widths must be equal and you have to use your bins I recommend @tom's solution.

Normalize a multiple data histogram

I have several arrays that I'm plotting a histogram of, like so:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(0,.5,1000)
y = np.random.normal(0,.5,100000)
plt.hist((x,y),normed=True)
Of course, however, this normalizes both of the arrays individually, so that they both have the same peak. I'm looking to normalize them to the total number of elements, so that the histogram of y will be visibly taller than that of x. Is there a handy way to do this in matplotlib or will I have to mess around in numpy? I haven't found anything about it.
Another way to put it is that if I were instead to make a cumulative plot of the two arrays, they shouldn't both top out at 1, but should add to 1.
Yes, you can compute the histogram with numpy and renormalise it.
x = np.random.normal(0,.5,1000)
y = np.random.normal(0,.5,100000)
# density=True replaces the older normed=True
xhist, xbins = np.histogram(x, density=True)
yhist, ybins = np.histogram(y, density=True)
And now, you apply your renormalisation. For example, if you want x to be normalised to 1 and y proportional:
yhist *= len(y) / len(x)
Now, to plot the histogram:
def plot_histogram(data, edge_bins, **kwargs):
    # Use the bin centres as x positions
    bins = (edge_bins[:-1] + edge_bins[1:]) / 2
    plt.step(bins, data, **kwargs)
plot_histogram(xhist, xbins, c='b')
plot_histogram(yhist, ybins, c='g')
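An alternative that matches the "should add to 1" wording in the question is to weight every sample by one over the total number of samples and to bin both arrays on common edges. This is a sketch under those assumptions, not the method of the answer above; the shared edges and the names n_total, xhist and yhist are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.5, 1000)
y = rng.normal(0, 0.5, 100000)

n_total = len(x) + len(y)

# Common bin edges so the two histograms are directly comparable
edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=10)

# Each sample contributes 1/n_total, whichever array it comes from
xhist, _ = np.histogram(x, bins=edges, weights=np.full(len(x), 1.0 / n_total))
yhist, _ = np.histogram(y, bins=edges, weights=np.full(len(y), 1.0 / n_total))

print(xhist.sum(), yhist.sum())
```

The heights of x sum to len(x)/n_total and those of y to len(y)/n_total, so together they sum to 1 and y is visibly taller.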
