Representing an experiment with two dices using matplotlib - wrong representation [duplicate] - python

I'm generating some histograms with matplotlib and I'm having some trouble figuring out how to get the xticks of a histogram to align with the bars.
Here's a sample of the code I use to generate the histogram:
from matplotlib import pyplot as py
py.hist(histogram_data, 49, alpha=0.75)
py.title(column_name)
py.xticks(range(49))
py.show()
I know that all of the values in the histogram_data array are in [0,1,...,48], which, assuming I did the math right, means there are 49 unique values. I'd like to show a histogram of each of those values. Here's a picture of what's generated.
How can I set up the graph such that all of the xticks are aligned to the left, middle or right of each of the bars?

Short answer: Use plt.hist(data, bins=range(50)) instead to get left-aligned bins, plt.hist(data, bins=np.arange(50)-0.5) to get center-aligned bins, etc.
Also, if performance matters, because you want counts of unique integers, there are a couple of slightly more efficient methods (np.bincount) that I'll show at the end.
Problem Statement
As a stand-alone example of what you're seeing, consider the following:
import matplotlib.pyplot as plt
import numpy as np
# Generate a random array of integers between 0-9
# data.min() will be 0 and data.max() will be 9 (not 10)
data = np.random.randint(0, 10, 1000)
plt.hist(data, bins=10)
plt.xticks(range(10))
plt.show()
As you've noticed, the bins aren't aligned with integer intervals. This is basically because you asked for 10 bins between 0 and 9, which isn't quite the same as asking for bins for the 10 unique values.
The number of bins you want isn't exactly the same as the number of unique values. What you actually should do in this case is manually specify the bin edges.
To explain what's going on, let's skip matplotlib.pyplot.hist and just use the underlying numpy.histogram function.
For example, let's say you have the values [0, 1, 2, 3]. Your first instinct would be to do:
In [1]: import numpy as np
In [2]: np.histogram([0, 1, 2, 3], bins=4)
Out[2]: (array([1, 1, 1, 1]), array([ 0. , 0.75, 1.5 , 2.25, 3. ]))
The first array returned is the counts and the second is the bin edges (in other words, where bar edges would be in your plot).
Notice that we get the counts we'd expect, but because we asked for 4 bins between the min and max of the data, the bin edges aren't on integer values.
Next, you might try:
In [3]: np.histogram([0, 1, 2, 3], bins=3)
Out[3]: (array([1, 1, 2]), array([ 0., 1., 2., 3.]))
Note that the bin edges (the second array) are what you were expecting, but the counts aren't. That's because the last bin behaves differently than the others, as noted in the documentation for numpy.histogram:
Notes
-----
All but the last (righthand-most) bin is half-open. In other words, if
`bins` is::
[1, 2, 3, 4]
then the first bin is ``[1, 2)`` (including 1, but excluding 2) and the
second ``[2, 3)``. The last bin, however, is ``[3, 4]``, which *includes*
4.
Therefore, what you actually should do is specify exactly what bin edges you want, and either include one beyond your last data point or shift the bin edges to the 0.5 intervals. For example:
In [4]: np.histogram([0, 1, 2, 3], bins=range(5))
Out[4]: (array([1, 1, 1, 1]), array([0, 1, 2, 3, 4]))
Bin Alignment
Now let's apply this to the first example and see what it looks like:
import matplotlib.pyplot as plt
import numpy as np
# Generate a random array of integers between 0-9
# data.min() will be 0 and data.max() will be 9 (not 10)
data = np.random.randint(0, 10, 1000)
plt.hist(data, bins=range(11)) # <- The only difference
plt.xticks(range(10))
plt.show()
Okay, great! However, we now effectively have left-aligned bins. What if we wanted center-aligned bins to better reflect the fact that these are unique values?
The quick way is to just shift the bin edges:
import matplotlib.pyplot as plt
import numpy as np
# Generate a random array of integers between 0-9
# data.min() will be 0 and data.max() will be 9 (not 10)
data = np.random.randint(0, 10, 1000)
bins = np.arange(11) - 0.5
plt.hist(data, bins)
plt.xticks(range(10))
plt.xlim([-1, 10])
plt.show()
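Right-aligned bins take a little more care. Because every bin includes its left edge, shifting the edges by a full -1 puts each integer back on a left edge (and merges the two largest values into the closed last bin). Shift by slightly less than 1 instead, e.g. bins = np.arange(11) - 0.999.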
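As a quick sanity check of these edge choices, the sketch below runs np.histogram on a small made-up array (the values and variable names are illustrative, not from the question). It suggests that left- and center-aligned edges give identical counts, and that for right alignment the shift has to stay just under 1, since every bin includes its left edge:

```python
import numpy as np

# Hypothetical integer data with values in 0..3
data = np.array([0, 0, 1, 2, 2, 2, 3])
n_values = 4

edge_choices = {
    "left": np.arange(n_values + 1),              # edges 0, 1, 2, 3, 4
    "center": np.arange(n_values + 1) - 0.5,      # edges -0.5, 0.5, ..., 3.5
    "right": np.arange(n_values + 1) - 0.999,     # each integer sits just inside a right edge
    "full_shift": np.arange(n_values + 1) - 1.0,  # edges -1, 0, 1, 2, 3
}

results = {}
for name, edges in edge_choices.items():
    counts, _ = np.histogram(data, bins=edges)
    results[name] = counts
    print(name, counts)
```

With the full shift of -1, the first bin comes out empty and the values 2 and 3 end up sharing the closed last bin, which is why the counts differ from the other three cases.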
Another approach
For the particular case of unique integer values, there's another, more efficient approach we can take.
If you're dealing with unique integer counts starting with 0, you're better off using numpy.bincount than using numpy.histogram.
For example:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randint(0, 10, 1000)
counts = np.bincount(data, minlength=10)  # minlength keeps 10 entries even if 9 never occurs
# Switching to the OO-interface. You can do all of this with "plt" as well.
fig, ax = plt.subplots()
ax.bar(range(10), counts, width=1, align='center')
ax.set(xticks=range(10), xlim=[-1, 10])
plt.show()
There are two big advantages to this approach. One is speed. numpy.histogram (and therefore plt.hist) basically runs the data through numpy.digitize and then numpy.bincount. Because you're dealing with unique integer values, there's no need to take the numpy.digitize step.
However, the bigger advantage is more control over display. If you'd prefer thinner rectangles, just use a smaller width:
import matplotlib.pyplot as plt
import numpy as np
data = np.random.randint(0, 10, 1000)
counts = np.bincount(data, minlength=10)  # minlength keeps 10 entries even if 9 never occurs
# Switching to the OO-interface. You can do all of this with "plt" as well.
fig, ax = plt.subplots()
ax.bar(range(10), counts, width=0.8, align='center')
ax.set(xticks=range(10), xlim=[-1, 10])
plt.show()
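To check that np.bincount really yields the same numbers as a histogram with integer bin edges, here is a minimal sketch. The seeded generator and the variable names are assumptions made for repeatability, not part of the answer above:

```python
import numpy as np

rng = np.random.default_rng(0)           # seeded only to make the sketch repeatable
data = rng.integers(0, 10, size=1000)    # integers 0..9

# minlength guarantees one slot per possible value, even if a value never occurs
counts_bincount = np.bincount(data, minlength=10)

# Same counts via histogram, using integer bin edges 0..10
counts_hist, _ = np.histogram(data, bins=np.arange(11))

print(np.array_equal(counts_bincount, counts_hist))
```

Note that np.bincount only accepts non-negative integers, so this shortcut does not carry over to float data or negative values.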

What you are looking for is the edges of each bin, which you can then use as the xticks.
Say you have some numbers in x to generate a histogram.
import matplotlib.pyplot as plt
import numpy as np
import random
n = 1000
x = np.zeros(n)
for i in range(n):
    x[i] = random.uniform(0, 100)
Now let's create the histogram.
n, bins, edges = plt.hist(x,bins=5,ec="red",alpha=0.7)
n is the array with the number of items in each bin
bins is the array of bin edge values
edges is the list of patch objects (the drawn bars)
Now since you have the location of the edge of bins starting from left to right, display it as the xticks.
plt.xticks(bins)
plt.show()
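To see what those edges look like without drawing anything, the sketch below calls np.histogram directly (which plt.hist relies on). The seeded generator is an assumption for repeatability:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 100, size=1000)

counts, edges = np.histogram(x, bins=5)

# With an integer bin count, the edges are evenly spaced from min to max
expected = np.linspace(x.min(), x.max(), 6)
print(np.allclose(edges, expected))
print(len(edges), len(counts))   # one more edge than there are bins
```

Because the edges run from the data minimum to the data maximum, they are generally not round numbers, so rounding the tick labels may be worthwhile.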

If you comment out the line bins.append(sorted(set(labels))[-1]):
bins = [i_bin - 0.5 for i_bin in set(labels)]
# bins.append(sorted(set(labels))[-1])
plt.hist(labels, bins)
plt.show()
If you leave it in:
bins = [i_bin - 0.5 for i_bin in set(labels)]
bins.append(sorted(set(labels))[-1])
plt.hist(labels, bins)
plt.show()
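A possible variant of this idea, sketched under the assumption that the labels are integers: sort the unique values first (a set has no guaranteed order) and close the last bin at the maximum plus 0.5 rather than at the maximum itself. The labels array below is made up for illustration:

```python
import numpy as np

labels = np.array([2, 3, 3, 5, 5, 5])   # hypothetical integer labels

uniq = np.unique(labels)                 # sorted unique values: [2, 3, 5]
# One edge half a unit below each unique value, plus a closing edge above the largest
bins = np.append(uniq - 0.5, uniq[-1] + 0.5)

counts, _ = np.histogram(labels, bins=bins)
print(bins)
print(counts)
```

If the labels have gaps (here 4 is missing), the bin before the gap is simply wider; the counts still match the number of occurrences of each unique value.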

Using the OO interface to configure ticks has the advantage of centering the labels while preserving the xticks. Also, it works with any plotting function and doesn't depend on np.bincount() or ax.bar():
import matplotlib.pyplot as plt
import matplotlib.ticker as tkr
import numpy as np
data = np.random.randint(0, 10, 1000)
mybins = range(11)
fig, ax = plt.subplots()
ax.hist(data, bins=mybins, rwidth=0.8)
ax.set_xticks(mybins)
# Minor ticks sit halfway between the major ticks, i.e. at the bar centers
ax.xaxis.set_minor_locator(tkr.AutoMinorLocator(n=2))
ax.xaxis.set_minor_formatter(tkr.FixedFormatter(mybins))
# Hide the major labels so only the centered minor labels show
ax.xaxis.set_major_formatter(tkr.NullFormatter())
for tick in ax.xaxis.get_minor_ticks():
    tick.tick1line.set_markersize(0)  # hide the minor tick marks themselves
plt.show()

I think the best way is to use the patches and bins returned from matplotlib.pyplot.hist. Below is a simple example.
import numpy as np
import matplotlib.pyplot as plt
data = np.random.randint(10, 60, 1000)
height, bins, patches = plt.hist(data, bins=15, ec='k')
ticks = [(patch.get_x() + (patch.get_x() + patch.get_width()))/2 for patch in patches] ## or ticklabels
ticklabels = (bins[1:] + bins[:-1]) / 2 ## or ticks
plt.xticks(ticks, np.round(ticklabels, 2), rotation=90)
plt.show()
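The two expressions in that snippet (one built from the patches, one from the bin edges) describe the same bin centers. A small numpy-only sketch of that equivalence, using np.histogram in place of plt.hist and a seeded generator as an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.integers(10, 60, size=1000)

counts, edges = np.histogram(data, bins=15)

widths = np.diff(edges)
centers_from_edges = (edges[1:] + edges[:-1]) / 2
centers_from_left = edges[:-1] + widths / 2   # what patch.get_x() + width/2 amounts to

print(np.allclose(centers_from_edges, centers_from_left))
```

Either form can therefore serve as both the tick positions and the tick labels.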

Related

How can I create a list of the values on the y-axis without having to plot a graph in Python?

I have a piece of code that plots a random walk with a specified number of bins on my y-axis. Is there a way in Python to replicate/recreate the values on my y-axis without having to plot the graph? Below is the code I've been working on. The method I've tried is to divide the min-max range by the number of wanted bins and then create a list with these values. However, I find my method far from optimal, and its results are not close to the ones I get from the code below.
I am grateful for any help on this matter!
import matplotlib.pyplot as plt
import numpy as np
import random
dims = 1
step_n = 2000
step_set = [-1, 0, 1]
origin = np.zeros((1,dims))
np.random.seed(30)  # seed NumPy's generator, which np.random.choice below draws from
step_shape = (step_n,dims)
steps = np.random.choice(a=step_set, size=step_shape)
path = np.concatenate([origin, steps]).cumsum(0)
# create subplot
fig, ax = plt.subplots(1,1, figsize=(20, 11))
img = ax.plot(path)
plt.locator_params(axis='y', nbins=20)
y_values = ax.get_yticks() # y_values is a numpy array with my y values
I am not sure I understood your problem correctly.
Matplotlib chooses the spacing between the ticks itself, and I assume the spacings are mostly multiples of 5.
A general approach could be to calculate a padding based on the number of bins you want and add/subtract it. For your given example, the following gives the same result as ax.get_yticks():
bins = 19
padding = np.ceil((np.max(path) - np.min(path)) / bins)
np.linspace(np.min(path) - padding, np.max(path) + padding, bins, dtype=np.int32)
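Another option that may be worth trying is to ask Matplotlib's tick locator for positions directly, without creating a figure. The sketch below assumes that an AutoLocator with nbins=20 mirrors what plt.locator_params(axis='y', nbins=20) configures on a default axis; the random walk is regenerated with a seeded generator purely for illustration:

```python
import numpy as np
from matplotlib.ticker import AutoLocator

rng = np.random.default_rng(30)
steps = rng.choice([-1, 0, 1], size=2000)
path = np.concatenate([[0], steps]).cumsum()

locator = AutoLocator()
locator.set_params(nbins=20)   # intended to mirror plt.locator_params(axis='y', nbins=20)

# Ask the locator for tick positions covering the data range; no figure is created
ticks = locator.tick_values(path.min(), path.max())
print(ticks)
```

The result can still differ slightly from ax.get_yticks(), because a real axis adds margins around the data before the locator is consulted.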

How to assign a specific value to a bin in histogram in python?

Dear Computer Scientist Family
I was wondering whether it is possible to assign any value I give to a certain bin in a histogram. If you run my code, it produces a histogram with 2 bins, each populated with a quantity of 1.
# -*- coding: utf-8 -*-
"""
Created on Sat May 9 20:23:51 2020
#author: DeAngelo
"""
import matplotlib.pyplot as plt
import numpy as np
import math
fig,ax = plt.subplots(1,1)
a = np.array([11,75])
ax.hist(a, bins = [0,25,50,75,100])
ax.set_title("histogram of result")
ax.set_xticks([0,25,50,75,100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
First, could you theoretically tell the computer to take the value that was assigned to the 75-100 bin and move it to the 0-25 bin? That would mean I now have 2 entries in the 0-25 bin, but my array would still be a=[11,75].
Another example: say I have an array 'b=np.array([3])' and I plot it on my histogram. I know that this would go in the 0-25 bin, but could I tell the computer to put it in the 75-100 bin?
If so, how?
Second, what about the average? I know you can use np.mean(a) to calculate the average, but say I wanted to put that value in the bin that corresponds to 75-100. Would that be possible?
I looked at this code How to assign a number to a value falling in a certain bin , but that was in Ancient Egyptian hieroglyphics and unfortunately my degree is in physics and not that.
If you can help me with this it would mean a lot to me.<3
A histogram is just represented as a bar chart, so you could manipulate the bar values. Here you can precalculate the histogram and plot it as a bar chart:
import matplotlib.pyplot as plt
import numpy as np
import math
a = np.array([11,75])
# calculate histogram values
vals, bins = np.histogram(a, bins = [0,25,50,75,100])
width = np.ediff1d(bins)
fig,ax = plt.subplots(1,1)
# plot histogram values as bar chart
ax.bar(bins[:-1] + width/2, vals, width)
ax.set_title("histogram of result")
ax.set_xticks([0,25,50,75,100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
This gives you your example. You can now however manipulate the bar values before plotting should you so wish:
# the bin values
vals
>>> array([1, 0, 0, 1])
# bin edges
bins
>>> array([ 0, 25, 50, 75, 100])
# do manipulation -> remove one count from 75-100 bin and put in 0-25 bin
vals[-1] -= 1
vals[0] += 1
# plot new graph
fig,ax = plt.subplots(1,1)
# plot histogram values as bar chart
ax.bar(bins[:-1] + width/2, vals, width)
ax.set_title("histogram of result")
ax.set_xticks([0,25,50,75,100])
ax.set_xlabel('marks')
ax.set_ylabel('no. of students')
plt.show()
I have to comment, what are your reasons for doing this? In your example you want to calculate the average and put it in the wrong bin. You can certainly do that as I have shown above, but I'm not sure what it means at that point?
Yes, this is possible. You can catch the return value of the histogram function by assigning it to a variable:
h = ax.hist(a, bins = [0, 25, 50, 75, 100])
h
(array([1., 0., 0., 1.]),
array([ 0, 25, 50, 75, 100]),
<a list of 4 Patch objects>)
As the documentation says, this "is a tuple (n, bins, patches)".
We are only interested in the counts and bins, so let's assign them to individual variables:
counts, bins, _ = h
Now you can manipulate the counts in any way you like, e.g. move one count from the fourth to the first bin:
counts[3] -= 1
counts[0] += 1
counts
array([2., 0., 0., 0.])
We can turn these data into a histogram plot as shown in the documentation under the weights parameter:
plt.hist(bins[:-1], bins, weights=counts);
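The weights trick can be checked without plotting. In this sketch (variable names are illustrative), each bin's left edge is fed back in as a single data point weighted by that bin's adjusted count, and np.histogram stands in for plt.hist:

```python
import numpy as np

a = np.array([11, 75])
edges = np.array([0, 25, 50, 75, 100])

counts, _ = np.histogram(a, bins=edges)   # one count in the first and last bin

# Move one count from the last bin to the first; the data array itself is untouched
moved = counts.copy()
moved[-1] -= 1
moved[0] += 1

# Feed each bin's left edge back in as a data point, weighted by its new count
replayed, _ = np.histogram(edges[:-1], bins=edges, weights=moved)
print(moved, replayed)
```

This works because every left edge falls inside its own bin, so the weights are reproduced one-to-one as bar heights.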

How to re-scale the counts in a matplotlib histogram

I have a matplotlib histogram that works fine.
hist_bin_width = 4
on_hist = plt.hist(my_data,bins=range(-100, 200,hist_bin_width),alpha=.3,color='#6e9bd1',label='on')
All I want to do is to rescale by a factor of, say, 2. I don't want to change the bin width, or to change the y axis labels. I want to take the counts in all the bins (e.g. bin 1 has 17 counts) and multiply by 2 so that bin 1 now has 34 counts in it.
Is this possible?
Thank you.
As it's just a simple rescaling of the y-axis, this must be possible. The complication arises because Matplotlib's hist computes and draws the histogram, making it difficult to intervene. However, as the documentation also notes, you can use the weights parameter to "draw a histogram of data that has already been binned". You can bin the data in a first step with Numpy's histogram function. Applying the scaling factor is then straightforward:
from matplotlib import pyplot
import numpy
numpy.random.seed(0)
data = numpy.random.normal(50, 20, 10000)
(counts, bins) = numpy.histogram(data, bins=range(101))
factor = 2
pyplot.hist(bins[:-1], bins, weights=factor*counts)
pyplot.show()
pyplot.hist's weights argument can be used to weight each data point by a factor, for example:
import matplotlib.pyplot as plt
import numpy as np; np.random.seed(42)
data = np.random.normal(50, 20, 10000)
factor = 2
hist_bin_width = 40
plt.hist(data, bins=range(-100, 200, hist_bin_width),
weights=factor*np.ones_like(data))
plt.show()
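Both approaches should produce the same bar heights. A numpy-only sketch of that check, assuming the bin width of 4 from the question and a seeded generator for repeatability:

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(50, 20, size=10_000)
edges = np.arange(-100, 200, 4)
factor = 2

# Approach 1: bin first, then scale the counts
counts, _ = np.histogram(data, bins=edges)

# Approach 2: give every data point a weight equal to the factor
scaled, _ = np.histogram(data, bins=edges, weights=factor * np.ones_like(data))

print(np.allclose(scaled, factor * counts))
```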

display a histogram with very non-uniform bin widths

Here is the histogram
To generate this plot, I did:
bins = np.array([0.03, 0.3, 2, 100])
plt.hist(m, bins = bins, weights=np.zeros_like(m) + 1. / m.size)
However, as you noticed, I want to plot the histogram of the relative frequency of each data point with only 3 bins that have different sizes:
bin1 = 0.03 -> 0.3
bin2 = 0.3 -> 2
bin3 = 2 -> 100
The histogram looks ugly since the size of the last bin is extremely large relative to the other bins. How can I fix the histogram? I want to change the width of the bins but I do not want to change the range of each bin.
As @cel pointed out, this is no longer a histogram, but you can do what you are asking using plt.bar and np.histogram. You then just need to set the xticklabels to a string describing the bin edges. For example:
import numpy as np
import matplotlib.pyplot as plt
bins = [0.03,0.3,2,100] # your bins
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99] # random data
hist, bin_edges = np.histogram(data,bins) # make the histogram
fig,ax = plt.subplots()
# Plot the histogram heights against integers on the x axis
# align='edge' makes bar i span [i, i+1], so i+0.5 below is its middle
ax.bar(range(len(hist)),hist,width=1,align='edge')
# Set the ticks to the middle of the bars
ax.set_xticks([0.5+i for i,j in enumerate(hist)])
# Set the xticklabels to a string that tells us what the bin edges were
ax.set_xticklabels(['{} - {}'.format(bins[i],bins[i+1]) for i,j in enumerate(hist)])
plt.show()
EDIT
If you update to matplotlib v1.5.0, you will find that bar now takes a kwarg tick_label, which can make this plotting even easier (see here):
hist, bin_edges = np.histogram(data,bins)
ax.bar(range(len(hist)),hist,width=1,align='center',tick_label=
['{} - {}'.format(bins[i],bins[i+1]) for i,j in enumerate(hist)])
If your actual values of the bins are not important but you want to have a histogram of values of completely different orders of magnitude, you can use a logarithmic scaling along the x axis. This here gives you bars with equal widths
import numpy as np
import matplotlib.pyplot as plt
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99]
plt.hist(data,bins=10**np.linspace(-2,2,5))
plt.xscale('log')
plt.show()
When you have to use your bin values you can do
import numpy as np
import matplotlib.pyplot as plt
data = [0.04,0.07,0.1,0.2,0.2,0.8,1,1.5,4,5,7,8,43,45,54,56,99]
bins = [0.03,0.3,2,100]
plt.hist(data,bins=bins)
plt.xscale('log')
plt.show()
In this case the widths are not perfectly equal, but the plot is still readable. If the widths must be equal and you have to use your own bins, I recommend @tom's solution.
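Since the question asks for relative frequencies rather than raw counts, the bar heights and the edge labels could be prepared as in the sketch below, then passed to ax.bar as in the bar-chart answer. The data list is the same made-up sample used in the answers:

```python
import numpy as np

bins = [0.03, 0.3, 2, 100]
data = [0.04, 0.07, 0.1, 0.2, 0.2, 0.8, 1, 1.5, 4, 5, 7, 8, 43, 45, 54, 56, 99]

counts, _ = np.histogram(data, bins=bins)
rel_freq = counts / counts.sum()          # fraction of points in each bin

# One label per bin, built from consecutive pairs of edges
labels = [f"{lo} - {hi}" for lo, hi in zip(bins[:-1], bins[1:])]
for label, frac in zip(labels, rel_freq):
    print(f"{label}: {frac:.3f}")
```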

Plot a histogram such that the total height equals 1

This is a follow-up question to this answer. I'm trying to plot a normed histogram, but instead of getting 1 as the maximum value on the y axis, I'm getting different numbers.
For array k=(1,4,3,1)
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k=(1,4,3,1)
    plt.hist(k, normed=1)
    plt.xticks( np.arange(10) ) # 10 ticks on x axis
    plt.show()
plotGraph()
I get this histogram, which doesn't look normed.
For a different array k=(3,3,3,3)
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k=(3,3,3,3)
    plt.hist(k, normed=1)
    plt.xticks( np.arange(10) ) # 10 ticks on x axis
    plt.show()
plotGraph()
I get this histogram with a maximum y-value of 10.
For different k I get a different maximum value of y, even though normed=1 or normed=True.
Why does the normalization (if it works) change based on the data, and how can I make the maximum value of y equal to 1?
UPDATE:
I am trying to implement Carsten König's answer from "plotting histograms whose bar heights sum to 1 in matplotlib" and am getting a very weird result:
import numpy as np
def plotGraph():
    import matplotlib.pyplot as plt
    k=(1,4,3,1)
    weights = np.ones_like(k)/len(k)
    plt.hist(k, weights=weights)
    plt.xticks( np.arange(10) ) # 10 ticks on x axis
    plt.show()
plotGraph()
Result:
What am I doing wrong?
When plotting a normalized histogram, the area under the curve should sum to 1, not the height.
In [44]:
import matplotlib.pyplot as plt
k=(3,3,3,3)
x, bins, p=plt.hist(k, density=True) # used to be normed=True in older versions
from numpy import *
plt.xticks( arange(10) ) # 10 ticks on x axis
plt.show()
In [45]:
print(bins)
[ 2.5 2.6 2.7 2.8 2.9 3. 3.1 3.2 3.3 3.4 3.5]
In this example the bin width is 0.1 and the single non-empty bar has height 10, so the area under the curve sums to one (0.1*10).
x stores the height of each bin. p stores the individual bin objects (actually, they are patches). So we just sum up x and modify the height of each bin object.
To make the heights sum to 1, add the following before plt.show():
for item in p:
    item.set_height(item.get_height()/sum(x))
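The difference between the two normalizations can be checked numerically. This numpy-only sketch (np.histogram standing in for plt.hist, variable names illustrative) compares density=True, where the area sums to 1, with per-point weights of 1/N, where the heights sum to 1:

```python
import numpy as np

k = np.array([1, 4, 3, 1])

# density=True: bar heights times bin widths sum to 1
density, edges = np.histogram(k, bins=10, density=True)
area = np.sum(density * np.diff(edges))

# weights of 1/N per point: bar heights themselves sum to 1
weights = np.ones(len(k)) / len(k)
heights, _ = np.histogram(k, bins=10, weights=weights)

print(area, heights.sum())

# For k = (3, 3, 3, 3) the bins are 0.1 wide, so the single bar reaches about 10
density_same, _ = np.histogram([3, 3, 3, 3], bins=10, density=True)
print(density_same.max())
```

Using np.ones(len(k)) rather than np.ones_like(k) keeps the weights as floats even when k holds integers.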
You could use the solution outlined here:
weights = np.ones_like(myarray)/float(len(myarray))
plt.hist(myarray, weights=weights)
One way is to get the probabilities on your own, and then plot with plt.bar:
In [91]: from collections import Counter
...: c=Counter(k)
...: print(c)
Counter({1: 2, 3: 1, 4: 1})
In [92]: plt.bar(list(c.keys()), list(c.values()))
...: plt.show()
result:
A normed histogram is defined such that the sum of the products of width and height of each column is equal to 1. That's why you are not getting your max equal to one.
However, if you still want to force it to be 1, you could use numpy and matplotlib.pyplot.bar in the following way:
import numpy as np
import matplotlib.pyplot as plt
sample = np.random.normal(0,10,100)
#generate bins boundaries and heights
bin_height,bin_boundary = np.histogram(sample,bins=10)
#define width of each column
width = bin_boundary[1]-bin_boundary[0]
#standardize each column by dividing with the maximum height
bin_height = bin_height/float(max(bin_height))
#plot
plt.bar(bin_boundary[:-1],bin_height,width = width)
plt.show()
I found it very easy to use plotly express. Here is my code for your example:
import plotly.express as px
k= [1,4,3,1]
px.histogram(k,nbins=10,range_x=[0,10],histnorm='probability')
This gives the normalized histogram the way that you want it. If you want to use percentage instead of probability, you can simply change the last line of code to
px.histogram(k,nbins=10,range_x=[0,10],histnorm='percent')
If you don't want to set range_x and nbins manually to make sure the area of the histogram is always one, you can use the following code:
x_min=int(min(k))-1
x_max=int(max(k))+1
x_bins = x_max-x_min
px.histogram(k,nbins=x_bins,range_x=[x_min,x_max],histnorm='probability')
