matplotlib reducing y axis by a factor to represent percent frequency - python

So I have three lists of fractions, and I used a histogram to show how often each fraction shows up. The problem is that there are 100000 of each, and I need to reduce the y values by that factor to get a frequency percentage. Here is my code now:
bins = numpy.linspace(0, 1, 50)
z = np.linspace(0,1,50)
g = (lambda z: 2 * np.exp((-2)*(z**2)*(1000000000)))
w = g(z)
plt.plot(z,w)
pyplot.hist(Vrand, bins, alpha=0.5)
pyplot.hist(Vfirst, bins, alpha=0.5)
pyplot.hist(Vmin, bins, alpha=0.2)
pyplot.show()
It is the last chunk of code (the three hist calls) where I need the y axis divided by 100000.
Update:
When I try to divide by 100000 using NumPy histograms, all the values become 0, except for the line plotted above them:
bins = numpy.linspace(0, 1, 50)
z = np.linspace(0,1,50)
g = (lambda z: 2 * np.exp((-2)*(z**2)*(100000)))
w = g(z)
plt.plot(z,w)
hist, bins = np.histogram(Vrand, bins)
hist /= 100000.0
widths = np.diff(bins)
pyplot.bar(bins[:-1], hist, widths)

The matplotlib histogram has a "normed" parameter (renamed "density" in newer Matplotlib versions, where "normed" no longer exists) that rescales the counts so that the area under the histogram is 1:
pyplot.hist(Vrand, bins, density=True)
Or use the weights parameter to scale the counts by a different coefficient.
You can also take the return value of NumPy's histogram and scale it however you want (tested in Python 3.x):
hist, bins = np.histogram(Vrand, bins)
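hist = hist / 100000.0  # hist holds integers; an in-place /= by a float fails on current NumPy
widths = np.diff(bins)
pyplot.bar(bins[:-1], hist, widths, align='edge')  # bars start at the left bin edges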
The first two solutions are, in my opinion, better, as we should not "reinvent the wheel" and implement by hand what the library already does.
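As a rough sketch of the weights option mentioned above: giving every sample a weight of 1/N makes the bar heights sum to 1, so the y axis reads as a fraction of all samples (multiply by 100 for a percentage). The data here is made up for illustration, and the Agg backend is selected only so the snippet runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

# Made-up stand-in for one of the lists of fractions (e.g. Vrand).
rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=100000)

bins = np.linspace(0, 1, 50)

# Each sample contributes 1/N, so the bar heights are fractions of the total.
weights = np.full(data.shape, 1.0 / data.size)
heights, edges, _ = plt.hist(data, bins, weights=weights, alpha=0.5)

plt.ylabel("fraction of samples")
```

Call plt.show() afterwards when running interactively. Unlike the density/normed option, this keeps the heights independent of the bin width.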

Firstly, I would recommend you think about your style: use either plt or pyplot, not both. You should also include your imports and some fake data in the example code to illustrate the problem.
So, the issue is that in the following example the counts are very large:
bins = np.linspace(0, 1, 50)
data = np.random.normal(0.5, 0.1, size=100000)
plt.hist(data, bins)
plt.show()
You tried to fix this by dividing the bin counts by an integer:
hist, bins = np.histogram(data, bins)
hist_divided = hist/10000
The issue here is that hist is an array of ints, and in Python 2 dividing an integer by an integer discards the fractional part. For example:
>>> 2/3
0
>>> 3/2
1
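This is what gives you a row of 0's if you pick too large a value to divide by. Instead you can divide by a float as suggested by @lejlot; notice you need to divide by 10000.0 and not 10000. (In Python 3 the / operator always performs true division, so the integer truncation shown above is specific to Python 2.)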
Or, following @lejlot's other suggestion, just use the normed argument (density in current Matplotlib) in the call to hist. This rescales the bin heights so that the area under the histogram is 1, which is very useful when comparing distributions.
I also notice you appear to be having this issue because you are plotting a line on the same axes as the histogram. If this line is outside of the [0,1] range, you will again encounter the same issue. Instead of rescaling the histogram, you can give the line its own y-axis by twinning the axes with twinx(), which shares the x-axis.
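A minimal sketch of the twin-axis idea, assuming made-up data and a hypothetical curve (the constant 10 in the exponent is chosen only so the curve is visible; it is not taken from the question). The histogram keeps its raw counts on the left y-axis, and the line gets its own scale on the right.

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; drop this line for interactive use
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(0.5, 0.1, size=100000)  # made-up data
bins = np.linspace(0, 1, 50)

fig, ax_hist = plt.subplots()
counts, edges, _ = ax_hist.hist(data, bins, alpha=0.5)
ax_hist.set_ylabel("count")

# Second y-axis that shares the x-axis with the histogram.
ax_line = ax_hist.twinx()
z = np.linspace(0, 1, 200)
w = 2 * np.exp(-2 * z**2 * 10)  # hypothetical curve for illustration
ax_line.plot(z, w, color="crimson")
ax_line.set_ylabel("curve value")
```

Each axis then autoscales independently, so neither the bars nor the line gets flattened by the other's range.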

Related

plotting with a logscale distribution and 0

I'm trying to plot a probability distribution (say probability of k events). It should be plotted as a logscale on the horizontal axis since the behavior at large values of k looks like k^{-alpha}. So it's a straight line for large k on a logscale plot.
But 0 happens.
I want to plot this in a way that is easy to interpret.
For an example, consider a probability defined so that p_0 = 0.5 and for k= 1, 2, 3, ... we set p_k = Ck^{-2} where if I've calculated correctly C=3/pi^2. This should sum to 1 and produce a nice straight line for k>0, but obviously, I can't stick in 0. Nevertheless it's important that the person looking at the image understand that 0 exists and has significant probability.
I'm using matplotlib (in python), but really I'm interested in how we could visualize this. The implementation can be sorted later.
To put 0 into the plot, you have to apply a symlog scale to the x axis and a log scale to the y axis. Here is some code to start from in case you are not familiar with matplotlib. For details, please check the documentation.
import numpy as np
import matplotlib.pyplot as plt
n = 100
x = np.arange(0, n)
y = 3/(np.pi*np.pi)/(x[1:])**2
y = np.concatenate([[0.5], y])
fig, ax = plt.subplots(1, 1, figsize=(7.2, 7.2))
ax.plot(x, y, 'x')
ax.set_xlim(-1, n)
ax.set_xscale('symlog')
ax.set_yscale('log')
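The normalisation constant quoted in the question, C = 3/pi^2, can be checked numerically. This is only a sanity-check sketch: it truncates the infinite sum at a cutoff K that I picked arbitrarily, so the total falls slightly short of 1.

```python
import numpy as np

C = 3 / np.pi**2   # constant from the question
K = 1_000_000      # arbitrary cutoff for the infinite sum (assumption)

k = np.arange(1, K + 1, dtype=float)
# p_0 = 0.5 plus the truncated tail sum of C * k**-2
partial = 0.5 + C * np.sum(k**-2)
print(partial)
```

The missing part is the tail beyond K, which is roughly C/K, so the printed value approaches 1 as K grows.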

How does matplotlib calculate the density for a histogram

Reading through the matplotlib plt.hist documentation, I see there is a density parameter that can be set to True. The documentation says:
density : bool, optional
If ``True``, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
the area (or integral) under the histogram will sum to 1.
This is achieved by dividing the count by the number of
observations times the bin width and not dividing by the total
number of observations. If *stacked* is also ``True``, the sum of
the histograms is normalized to 1.
I am confused by the line "This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations".
I tried replicating this with the sample data.
**Using matplotlib inbuilt calculations** .
ser = pd.Series(np.random.normal(size=1000))
ser.hist(density = 1, bins=100)
**Manual calculation of the density** :
arr_hist , edges = np.histogram( ser, bins =100)
samp = arr_hist / ser.shape[0] * np.diff(edges)
plt.bar(edges[0:-1] , samp )
plt.grid()
The two plots have completely different y-axis scales. Could someone point out what exactly is going wrong and how to replicate the density calculation manually?
That is an ambiguity in the language. The sentence
This is achieved by dividing the count by the number of observations times the bin width
needs to be read like
This is achieved by dividing (the count) by (the number of observations times the bin width)
i.e.
count / (number of observations * bin width)
Complete code:
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(size=1000)
fig, (ax1, ax2) = plt.subplots(2)
ax1.hist(arr, density = True, bins=100)
ax1.grid()
arr_hist , edges = np.histogram(arr, bins =100)
samp = arr_hist / (arr.shape[0] * np.diff(edges))
ax2.bar(edges[0:-1], samp, width=np.diff(edges), align="edge")  # bars start at the left bin edges
ax2.grid()
plt.show()
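The reading count / (number of observations * bin width) can also be checked numerically, without plotting. The sketch below compares it against np.histogram's own density=True option, which uses the same definition, and confirms that the resulting area is 1. The sample data is made up.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.normal(size=1000)  # made-up sample

counts, edges = np.histogram(arr, bins=100)
widths = np.diff(edges)

# Manual density: count / (number of observations * bin width)
manual = counts / (arr.size * widths)

# Library density for comparison
builtin, _ = np.histogram(arr, bins=100, density=True)

# Area under the histogram: sum of height * width
area = np.sum(manual * widths)
```

Here np.allclose(manual, builtin) holds and area equals 1 up to rounding, which is the behaviour the documentation describes.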

How to bin a 2D data along the x-axis with Python

I have two arrays of corresponding data (x and y) that I plot as above on a log-log plot. The data is currently too granular and I would like to bin them to get a smoother relationship. Could I get some guidance on how I can bin along the x-axis, in exponential bin sizes, so that it appears linear on the log-log scale?
For example, if the first bin is of range x = 10^0 to 10^1, I want to collect all y-values with corresponding x in that range and average them into one value for that bin. I don't think np.histogram or plt.hist quite does the trick, since they do binning by counting occurrences.
Edit: For context, if it helps, the above plot is an assortativity plot that plots the in vs out degree of a certain network.
You may use scipy.stats.binned_statistic to get the mean of the data in each bin. The bins would best be created via numpy.logspace. You may then plot those means, e.g. as horizontal lines spanning the bin width or as a scatter at the mean position.
import numpy as np; np.random.seed(42)
from scipy.stats import binned_statistic
import matplotlib.pyplot as plt
x = np.logspace(0,5,300)
y = np.logspace(0,5,300)+np.random.rand(300)*1.e3
fig, ax = plt.subplots()
ax.scatter(x,y, s=9)
s, edges, _ = binned_statistic(x,y, statistic='mean', bins=np.logspace(0,5,6))
ys = np.repeat(s,2)
xs = np.repeat(edges,2)[1:-1]
ax.hlines(s,edges[:-1],edges[1:], color="crimson", )
for e in edges:
    ax.axvline(e, color="grey", linestyle="--")
ax.scatter(edges[:-1]+np.diff(edges)/2, s, c="limegreen", zorder=3)
ax.set_xscale("log")
ax.set_yscale("log")
plt.show()
You can achieve this with pandas. The idea is to assign each X value to an interval using np.digitize. Since you are using a log scale, it makes sense to use np.logspace to choose intervals of exponentially changing lengths. Finally, you can group X values in each interval and compute mean Y values.
import pandas as pd
import numpy as np
x_max = 10
xs = np.exp(x_max * np.random.rand(1000))
ys = np.exp(np.random.rand(1000))
df = pd.DataFrame({
    'X': xs,
    'Y': ys,
})
df['Xbins'] = np.digitize(df.X, np.logspace(0, x_max, 30, base=np.exp(1)))
df['Ymean'] = df.groupby('Xbins').Y.transform('mean')
df.plot(kind='scatter', x='X', y='Ymean')
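The same digitize-and-average idea can be sketched with NumPy alone, which may help show what the groupby is doing. The decade bin edges and the sample data below are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.exp(10 * rng.random(1000))  # made-up x values spanning 1 .. e^10
y = np.exp(rng.random(1000))       # made-up y values in [1, e)

edges = np.logspace(0, 5, 6)       # hypothetical decade bins: 1, 10, ..., 1e5
n_bins = len(edges) - 1

# digitize returns 1 for the first bin, so shift to 0-based bin indices.
idx = np.digitize(x, edges) - 1
inside = (idx >= 0) & (idx < n_bins)

# Per-bin sum of y and per-bin number of points.
sums = np.bincount(idx[inside], weights=y[inside], minlength=n_bins)
counts = np.bincount(idx[inside], minlength=n_bins)

# Mean y per bin; bins without points stay NaN.
means = np.full(n_bins, np.nan)
np.divide(sums, counts, out=means, where=counts > 0)

# Geometric bin centres, which sit mid-bin on a log axis.
centers = np.sqrt(edges[:-1] * edges[1:])
```

Plotting centers against means on log-log axes then gives one point per bin.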

Normalize a multiple data histogram

I have several arrays that I'm plotting a histogram of, like so:
import numpy as np
import matplotlib.pyplot as plt
x = np.random.normal(0,.5,1000)
y = np.random.normal(0,.5,100000)
plt.hist((x,y),normed=True)
Of course, however, this normalizes both of the arrays individually, so that they both have the same peak. I'm looking to normalize them to the total number of elements, so that the histogram of y will be visibly taller than that of x. Is there a handy way to do this in matplotlib or will I have to mess around in numpy? I haven't found anything about it.
Another way to put it is that if I were instead to make a cumulative plot of the two arrays, they shouldn't both top out at 1, but should add to 1.
Yes, you can compute the histogram with numpy and renormalise it.
x = np.random.normal(0,.5,1000)
y = np.random.normal(0,.5,100000)
xhist, xbins = np.histogram(x, density=True)  # density replaces the older normed keyword
yhist, ybins = np.histogram(y, density=True)
And now you apply your rescaling. For example, if you want x to be normalised to 1 and y proportional to it:
yhist *= len(y) / len(x)
Now, to plot the histogram:
def plot_histogram(data, edge_bins, **kwargs):
    # Plot at the bin centres (midpoints of consecutive edges).
    bins = 0.5 * (edge_bins[:-1] + edge_bins[1:])
    plt.step(bins, data, **kwargs)
plot_histogram(xhist, xbins, c='b')
plot_histogram(yhist, ybins, c='g')
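Another way to get heights that add to 1 across both arrays is to weight every sample by 1/(total number of samples) and to use the same bin edges for both. This is a sketch under those assumptions; the choice of 20 shared bins is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(0, 0.5, 1000)
y = rng.normal(0, 0.5, 100000)

total = x.size + y.size

# Shared bin edges covering both arrays (20 bins is an arbitrary choice).
edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=20)

# Every sample counts as 1/total, so the two histograms together sum to 1.
wx = np.full(x.size, 1.0 / total)
wy = np.full(y.size, 1.0 / total)
xfrac, _ = np.histogram(x, bins=edges, weights=wx)
yfrac, _ = np.histogram(y, bins=edges, weights=wy)
```

plt.hist also accepts a weights argument, so passing the matching pair, as in plt.hist((x, y), bins=edges, weights=(wx, wy)), should draw the same heights directly.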

making the y-axis of a histogram probability, python

I have plotted a histogram in Python using matplotlib, and I need the y-axis to be the probability, but I cannot find how to do this. For example, I want it to look similar to this: http://www.mathamazement.com/images/Pre-Calculus/10_Sequences-Series-and-Summation-Notation/10_07_Probability/10-coin-toss-histogram.JPG
Here is my code; I will attach my plot as well if needed:
plt.figure(figsize=(10,10))
mu = np.mean(a) #mean of distribution
sigma = np.std(a) # standard deviation of distribution
n, bins,patches=plt.hist(a,bin, normed=True, facecolor='white')
y = mlab.normpdf(bins, mu, sigma)
plt.plot(bins,y,'r--')
print(np.sum(n*np.diff(bins)))  # checks that the integral over the bars is unity
plt.show()
Just divide all your sample counts by the total number of samples. This gives the probability rather than the count.
As @SteveBarnes points out, divide the sample counts by the total number of samples to get the probabilities for each bin. To get a plot like the one you linked to, your "bins" should just be the integers from 0 to 10. A simple way to compute the histogram for a sample from a discrete distribution is np.bincount.
Here's a snippet that creates a plot like the one you linked to:
import numpy as np
import matplotlib.pyplot as plt
n = 10
num_samples = 10000
# Generate a random sample.
a = np.random.binomial(n, 0.5, size=num_samples)
# Count the occurrences in the sample.
b = np.bincount(a, minlength=n+1)
# p is the array of probabilities.
p = b / float(b.sum())
plt.bar(np.arange(len(b)) - 0.5, p, width=1, align='edge', facecolor='white', edgecolor='black')
plt.xlim(-0.5, n + 0.5)
plt.xlabel("Number of heads (k)")
plt.ylabel("P(k)")
plt.show()
