Making the y-axis of a histogram probability in Python

I have plotted a histogram in Python using matplotlib, and I need the y-axis to be the probability, but I cannot find how to do this. For example, I want it to look similar to this: http://www.mathamazement.com/images/Pre-Calculus/10_Sequences-Series-and-Summation-Notation/10_07_Probability/10-coin-toss-histogram.JPG
Here is my code; I will attach my plot as well if needed.
plt.figure(figsize=(10, 10))
mu = np.mean(a)     # mean of the distribution
sigma = np.std(a)   # standard deviation of the distribution
n, bins, patches = plt.hist(a, num_bins, normed=True, facecolor='white')
y = mlab.normpdf(bins, mu, sigma)
plt.plot(bins, y, 'r--')
print(np.sum(n * np.diff(bins)))  # confirms the integral over the bars is unity
plt.show()

Just divide all your sample counts by the total number of samples. This gives the probability rather than the count.
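A minimal sketch of that idea using plt.hist's weights argument (assuming a is the 1-D sample array from the question; the bin count of 20 is arbitrary):
import numpy as np
import matplotlib.pyplot as plt

# Weighting each sample by 1/N makes the bar heights sum to 1 (probabilities, not counts).
plt.hist(a, bins=20, weights=np.ones(len(a)) / len(a))
plt.ylabel('Probability')
plt.show()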

As @SteveBarnes points out, divide the sample counts by the total number of samples to get the probabilities for each bin. To get a plot like the one you linked to, your "bins" should just be the integers from 0 to 10. A simple way to compute the histogram for a sample from a discrete distribution is np.bincount.
Here's a snippet that creates a plot like the one you linked to:
import numpy as np
import matplotlib.pyplot as plt
n = 10
num_samples = 10000
# Generate a random sample.
a = np.random.binomial(n, 0.5, size=num_samples)
# Count the occurrences in the sample.
b = np.bincount(a, minlength=n+1)
# p is the array of probabilities.
p = b / float(b.sum())
plt.bar(np.arange(len(b)) - 0.5, p, width=1, facecolor='white')
plt.xlim(-0.5, n + 0.5)
plt.xlabel("Number of heads (k)")
plt.ylabel("P(k)")
plt.show()


How does matplotlib calculate the density for a histogram

Reading through the matplotlib plt.hist documentation, there is a density parameter that can be set to True. The documentation says:
density : bool, optional
If ``True``, the first element of the return tuple will
be the counts normalized to form a probability density, i.e.,
the area (or integral) under the histogram will sum to 1.
This is achieved by dividing the count by the number of
observations times the bin width and not dividing by the total
number of observations. If *stacked* is also ``True``, the sum of
the histograms is normalized to 1.
The line that confuses me is: "This is achieved by dividing the count by the number of observations times the bin width and not dividing by the total number of observations."
I tried replicating this with the sample data.
**Using matplotlib's built-in calculation:**
ser = pd.Series(np.random.normal(size=1000))
ser.hist(density=True, bins=100)
**Manual calculation of the density:**
arr_hist, edges = np.histogram(ser, bins=100)
samp = arr_hist / ser.shape[0] * np.diff(edges)
plt.bar(edges[0:-1], samp)
plt.grid()
Both plots have completely different y-axis scales. Could someone point out what exactly is going wrong and how to replicate the density calculation manually?
That is an ambiguity in the language. The sentence
This is achieved by dividing the count by the number of observations times the bin width
needs to be read as
This is achieved by dividing (the count) by (the number of observations times the bin width)
i.e.
count / (number of observations * bin width)
Complete code:
import numpy as np
import matplotlib.pyplot as plt
arr = np.random.normal(size=1000)
fig, (ax1, ax2) = plt.subplots(2)
ax1.hist(arr, density=True, bins=100)
ax1.grid()
arr_hist, edges = np.histogram(arr, bins=100)
samp = arr_hist / (arr.shape[0] * np.diff(edges))
ax2.bar(edges[:-1], samp, width=np.diff(edges))
ax2.grid()
plt.show()
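As a quick sanity check (an addition, reusing samp and edges from the code above), the area under the manually computed density should come out to approximately 1:
# The density bars should integrate to ~1.
print(np.sum(samp * np.diff(edges)))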

Generate 1000 independent random variables with normal distribution

I have some problems with this question: generate 1000 independent random variables with normal distribution N(9, 10) and plot the result in a histogram.
For the histogram, use intervals of length 1, namely [-1, 0), [0, 1), [1, 2), etc.
import numpy as np
import matplotlib.pyplot as plt
def demo1():
    mu, sigma = 9, 10
    sampleNo = 1000
    s = np.random.normal(mu, sigma, sampleNo)
    plt.hist(s, bins=100, density=True)
    plt.show()

demo1()
I wonder how to choose the number of bins, and I don't know how to make the intervals have length 1.
You can feed a sequence to the bins argument. For example, if you pass bins=np.arange(101), the histogram will have 100 bins of length 1, going [0, 1), [1, 2), ... (note that n+1 edges give n bins).
As another example, you can write plt.hist(s, bins=np.arange(-50, 51), density=True), which plots the histogram over the range (-50, 50) with bins of length 1.
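Combining both hints for the exercise's [-1, 0), [0, 1), [1, 2), ... intervals, here is a minimal self-contained sketch that derives integer-aligned bin edges from the sample itself:
import numpy as np
import matplotlib.pyplot as plt

mu, sigma, sampleNo = 9, 10, 1000
s = np.random.normal(mu, sigma, sampleNo)

# Integer bin edges spanning the sample give intervals [-1, 0), [0, 1), [1, 2), ...
edges = np.arange(np.floor(s.min()), np.ceil(s.max()) + 1)
plt.hist(s, bins=edges, density=True)
plt.show()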

Plotting data points on where they fall in a distribution

Let's say I have a large data set on which I can perform some sort of analysis, such as looking at values in a probability distribution.
Now that I have this large data set, I want to compare known, actual data to it. Primarily: how many of the values in my data set share a value or property with the known data. For example:
This is a cumulative distribution. The continuous lines are from generated data from simulations and the decreasing intensities are just predicted percentages. The stars are then observational (known) data, plotted against generated data.
Another example I have made shows how the points might be visually projected onto a histogram:
I'm having difficulty marking where the known data points fall in the generated data set and plotting them cumulatively alongside the distribution of the generated data.
If I were to try to count the number of points that fall in the vicinity of the generated data, I would start out like this (it's not right):
def SameValue(SimData, DefData, uncert):
    numb = [(DefData - uncert) < i < (DefData + uncert) for i in SimData]
    return sum(numb)
But I am having trouble accounting for the points falling in the value ranges and then setting it all up so that I can plot it. Any idea how to gather this data and project it onto a cumulative distribution?
The question is pretty chaotic, with lots of irrelevant information while staying vague on the essential points. I will try to interpret it as best I can.
I think what you are after is the following: Given a finite sample from an unknown distribution, what is the probability to obtain a new sample at a fixed value?
I'm not sure if there is a general answer to it, but in any case that would be a question to be asked to statistics or mathematics people. My guess is that you would need to make some assumptions about the distribution itself.
For the practical case however, it might be sufficient to find out in which bin of the sampled distribution the new value would lie.
So assume we have a distribution x, which we divide into bins. We can compute the histogram h using numpy.histogram. The probability of finding a value in each bin is then given by h/h.sum().
Given a value v=0.77 whose probability according to the distribution we want to know, we can find the bin it belongs to by looking for the index ind in the bin array where this value would need to be inserted for the array to stay sorted. This can be done using numpy.searchsorted.
import numpy as np; np.random.seed(0)

x = np.random.rayleigh(size=1000)
bins = np.linspace(0, 4, 41)
h, bins_ = np.histogram(x, bins=bins)
prob = h / float(h.sum())

ind = np.searchsorted(bins, 0.77, side="right")
print(prob[ind - 1])  # roughly 0.058
So the probability of sampling a value in the bin around 0.77 is roughly 5.8%. (Note the ind - 1: searchsorted returns the index of the upper bin edge, while prob is indexed by the lower edge.)
A different option would be to interpolate the histogram between the bin centers to obtain a probability estimate at the value itself.
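A minimal sketch of that interpolation (reusing bins and prob from the snippet above; the value 0.77 is just the example from before):
# Estimate the probability at 0.77 by interpolating the per-bin
# probabilities at the bin centers.
centers = (bins[:-1] + bins[1:]) / 2.0
print(np.interp(0.77, centers, prob))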
In the code below we plot a distribution similar to the one from the picture in the question and use both methods, the first for the frequency histogram, the second for the cumulative distribution.
import numpy as np; np.random.seed(0)
import matplotlib.pyplot as plt

x = np.random.rayleigh(size=1000)
y = np.random.normal(size=1000)
bins = np.linspace(0, 4, 41)
h, bins_ = np.histogram(x, bins=bins)
hcum = np.cumsum(h) / float(np.cumsum(h).max())

points = [[.77, -.55], [1.13, 1.08], [2.15, -.3]]
markers = ['$\u2660$', '$\u2665$', '$\u263B$']
colors = ["k", "crimson", "gold"]
labels = list("ABC")

kws = dict(height_ratios=[1, 1, 2], hspace=0.0)
fig, (axh, axc, ax) = plt.subplots(nrows=3, figsize=(6, 6), gridspec_kw=kws, sharex=True)

cbins = np.zeros(len(bins) + 1)
cbins[1:-1] = bins[1:] - np.diff(bins[:2])[0] / 2.
cbins[-1] = bins[-1]
hcumc = np.linspace(0, 1, len(cbins))
hcumc[1:-1] = hcum
axc.plot(cbins, hcumc, marker=".", markersize=2, mfc="k", mec="k")
axh.bar(bins[:-1], h, width=np.diff(bins[:2])[0], alpha=0.7, ec="C0", align="edge")
ax.scatter(x, y, s=10, alpha=0.7)

for p, m, l, c in zip(points, markers, labels, colors):
    kw = dict(ls="", marker=m, color=c, label=l, markeredgewidth=0, ms=10)
    # Plot the point in the scatter distribution.
    ax.plot(p[0], p[1], **kw)
    # Plot the point in the bar histogram: find the bin it falls into,
    # then shift by half the bin width to place it in the middle of the bar.
    pix = np.searchsorted(bins, p[0], side="right")
    axh.plot(bins[pix - 1] + np.diff(bins[:2])[0] / 2., h[pix - 1] / 2., **kw)
    # Plot the point on the cumulative histogram: interpolate so it lies on the curve.
    yi = np.interp(p[0], cbins, hcumc)
    axc.plot(p[0], yi, **kw)

ax.legend()
plt.tight_layout()
plt.show()

How to random sample lognormal data in Python using the inverse CDF and specify target percentiles?

I'm trying to generate random samples from a lognormal distribution in Python; the application is simulating network traffic. I'd like to generate samples such that:
The modal sample result is 320 (~10^2.5)
80% of the samples lie within the range 100 to 1000 (10^2 to 10^3)
My strategy is to use the inverse CDF (or Smirnov transform I believe):
Use the PDF for a normal distribution centred around 2.5 to calculate the PDF for 10^x where x ~ N(2.5,sigma).
Calculate the CDF for the above distribution.
Generate random uniform data along the interval 0 to 1.
Use the inverse CDF to transform the random uniform data into the required range.
The problem is, when I calculate the 10th and 90th percentiles at the end, the numbers are completely wrong.
Here is my code:
%matplotlib inline
import matplotlib
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats
from scipy.stats import norm
# find value of mu and sigma so that 80% of data lies within range 2 to 3
mu=2.505
sigma = 1/2.505
norm.ppf(0.1, loc=mu,scale=sigma),norm.ppf(0.9, loc=mu,scale=sigma)
# output: (1.9934025, 3.01659743)
# Generate normal distribution PDF
x = np.arange(16,128000, 16) # linearly spaced here, with extra range so that CDF is correctly scaled
x_log = np.log10(x)
mu=2.505
sigma = 1/2.505
y = norm.pdf(x_log,loc=mu,scale=sigma)
fig, ax = plt.subplots()
ax.plot(x_log, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
x2 = (10**x_log) # x2 should be linearly spaced, so that cumsum works (later)
fig, ax = plt.subplots()
ax.plot(x2, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
ax.set_xlim(0,2000)
# Calculate CDF
y_CDF = np.cumsum(y) / np.cumsum(y).max()
fig, ax = plt.subplots()
ax.plot(x2, y_CDF, 'r-', lw=2, alpha=0.6, label='norm pdf')
ax.set_xlim(0,8000)
# Generate random uniform data (named "u" to avoid shadowing the built-in "input")
u = np.random.uniform(size=10000)
# Use the CDF as a lookup table
traffic = x2[np.abs(np.subtract.outer(y_CDF, u)).argmin(0)]
# Discard highs and lows
traffic = traffic[(traffic >= 32) & (traffic <= 8000)]
# Check percentiles
np.percentile(traffic,10),np.percentile(traffic,90)
Which produces the output:
(223.99999999999997, 2480.0000000000009)
... and not the (100, 1000) that I would like to see. Any advice appreciated!
First, I'm not sure about "Use the PDF for a normal distribution centred around 2.5". After all, log-normal is about the base-e logarithm (a.k.a. the natural log), which means 320 = 10^2.5 = e^5.77.
Second, I would approach the problem in a different way: you need m and s to sample from the log-normal distribution.
If you look at the Wikipedia article on the log-normal distribution, you can see that it is a two-parameter distribution. And you have exactly two conditions:
Mode = exp(m - s*s) = 320
80% of samples in [100, 1000] => CDF(1000, m, s) - CDF(100, m, s) = 0.8
where the CDF is expressed via the error function (a common function found in pretty much any library).
So: two non-linear equations for two parameters. Solve them, find m and s, and put them into any standard log-normal sampling routine.
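For reference, a minimal sketch of that CDF using only the standard library (m and s as defined above):
import math

def lognorm_cdf(x, m, s):
    # Log-normal CDF expressed via the error function.
    return 0.5 * (1.0 + math.erf((math.log(x) - m) / (s * math.sqrt(2.0))))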
Severin's approach is much leaner than my original attempt using the Smirnov transform. This is the code that worked for me (using fsolve to find s, although it's quite trivial to do it manually):
# Find lognormal distribution, with mode at 320 and 80% of probability mass between 100 and 1000
# Use fsolve to find the roots of the non-linear equation
%matplotlib inline
import matplotlib
import numpy as np
import matplotlib.pyplot as plt
from scipy.optimize import fsolve
from scipy.stats import lognorm
import math
target_modal_value = 320
# Define function to find roots of
def equation(s):
    # From Wikipedia: Mode = exp(m - s*s) = 320
    m = math.log(target_modal_value) + s**2
    # The probability mass between the CDF at 100 and at 1000 should equal 0.8.
    # Rearrange so the expression equals 0, to find the root (the value of s).
    return (lognorm.cdf(1000, s=s, scale=math.exp(m))
            - lognorm.cdf(100, s=s, scale=math.exp(m)) - 0.8)

# Solve the non-linear equation to find s
s_initial_guess = 1
s = fsolve(equation, s_initial_guess)

# From s, find m
m = math.log(target_modal_value) + s**2
print('m=' + str(m) + ', s=' + str(s))
# Plot
x = np.arange(0,2000,1)
y = lognorm.pdf(x,s=s, scale=math.exp(m))
fig, ax = plt.subplots()
ax.plot(x, y, 'r-', lw=5, alpha=0.6, label='norm pdf')
plt.plot((100,100), (0,1), 'k--')
plt.plot((320,320), (0,1), 'k-.')
plt.plot((1000,1000), (0,1), 'k--')
plt.ylim(0,0.0014)
plt.savefig('lognormal_100_320_1000.png')
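As a quick check (an addition, not part of the original answer), sampling from the fitted distribution should put roughly 80% of the mass in [100, 1000]:
# Draw samples from the fitted log-normal; fsolve returned s as a length-1 array.
samples = lognorm.rvs(s=s[0], scale=math.exp(m[0]), size=100000)
print(np.mean((samples >= 100) & (samples <= 1000)))  # ~0.8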

Cumulative distribution plots in Python

I am doing a project in Python where I have two arrays of data; let's call them pc and pnc. I am required to plot a cumulative distribution of both of these on the same graph. For pc it is supposed to be a "less than" plot, i.e. at (x, y), y points in pc must have a value less than x. For pnc it is to be a "more than" plot, i.e. at (x, y), y points in pnc must have a value more than x.
I have tried using the histogram function pyplot.hist. Is there a better and easier way to do what I want? Also, it has to be plotted on a logarithmic scale on the x-axis.
You were close. You should not use plt.hist but numpy.histogram: that gives you both the values and the bins, and then you can plot the cumulative with ease:
import numpy as np
import matplotlib.pyplot as plt
# some fake data
data = np.random.randn(1000)
# evaluate the histogram
values, base = np.histogram(data, bins=40)
#evaluate the cumulative
cumulative = np.cumsum(values)
# plot the cumulative function
plt.plot(base[:-1], cumulative, c='blue')
#plot the survival function
plt.plot(base[:-1], len(data)-cumulative, c='green')
plt.show()
Using histograms is really unnecessarily heavy and imprecise (the binning makes the data fuzzy): you can just sort all the x values: the index of each value is the number of values that are smaller than it. This shorter and simpler solution looks like this:
import numpy as np
import matplotlib.pyplot as plt
# Some fake data:
data = np.random.randn(1000)
sorted_data = np.sort(data) # Or data.sort(), if data can be modified
# Cumulative counts:
plt.step(sorted_data, np.arange(sorted_data.size)) # From 0 to the number of data points-1
plt.step(sorted_data[::-1], np.arange(sorted_data.size)) # From the number of data points-1 to 0
plt.show()
Furthermore, a more appropriate plot style is indeed plt.step() instead of plt.plot(), since the data is in discrete locations.
The result is more ragged than the output of EnricoGiampieri's answer, but this one is exact (instead of being an approximate, fuzzier, binned version of it).
PS: As SebastianRaschka noted, the very last point should ideally show the total count (instead of the total count-1). This can be achieved with:
plt.step(np.concatenate([sorted_data, sorted_data[[-1]]]),
         np.arange(sorted_data.size + 1))
plt.step(np.concatenate([sorted_data[::-1], sorted_data[[0]]]),
         np.arange(sorted_data.size + 1))
There are so many points in data that the effect is not visible without a zoom, but the very last point at the total count does matter when the data contains only a few points.
After a conclusive discussion with @EOL, I wanted to post my solution (upper left) using a random Gaussian sample as a summary:
import numpy as np
import matplotlib.pyplot as plt
from math import ceil, floor, sqrt
def pdf(x, mu=0, sigma=1):
    """
    Calculates the normal distribution's probability density
    function (PDF).
    """
    term1 = 1.0 / (sqrt(2 * np.pi) * sigma)
    term2 = np.exp(-0.5 * ((x - mu) / sigma)**2)
    return term1 * term2

# Drawing sample data points
##################################################
# Random Gaussian data (mean=0, stdev=5)
data1 = np.random.normal(loc=0, scale=5.0, size=30)
data2 = np.random.normal(loc=2, scale=7.0, size=30)
data1.sort()
data2.sort()

# Use each array's own extrema; "min(data1 + data2)" would add the
# arrays elementwise rather than concatenate them.
min_val = floor(min(data1.min(), data2.min()))
max_val = ceil(max(data1.max(), data2.max()))
##################################################
fig = plt.gcf()
fig.set_size_inches(12,11)
# Cumulative distributions, stepwise:
plt.subplot(2,2,1)
plt.step(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size + 1), label=r'$\mu=0, \sigma=5$')
plt.step(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size + 1), label=r'$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian distribution (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Cumulative distributions, smooth:
plt.subplot(2,2,2)
plt.plot(np.concatenate([data1, data1[[-1]]]), np.arange(data1.size + 1), label=r'$\mu=0, \sigma=5$')
plt.plot(np.concatenate([data2, data2[[-1]]]), np.arange(data2.size + 1), label=r'$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian (cumulative)')
plt.ylabel('Count')
plt.xlabel('X-value')
plt.legend(loc='upper left')
plt.xlim([min_val, max_val])
plt.ylim([0, data1.size+1])
plt.grid()
# Probability densities of the sample points function
plt.subplot(2,2,3)
pdf1 = pdf(data1, mu=0, sigma=5)
pdf2 = pdf(data2, mu=2, sigma=7)
plt.plot(data1, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(data2, pdf2, label=r'$\mu=2, \sigma=7$')
plt.title('30 samples from a random Gaussian')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
# Probability density function
plt.subplot(2,2,4)
x = np.arange(min_val, max_val, 0.05)
pdf1 = pdf(x, mu=0, sigma=5)
pdf2 = pdf(x, mu=2, sigma=7)
plt.plot(x, pdf1, label=r'$\mu=0, \sigma=5$')
plt.plot(x, pdf2, label=r'$\mu=2, \sigma=7$')
plt.title('PDFs of Gaussian distributions')
plt.legend(loc='upper left')
plt.xlabel('X-value')
plt.ylabel('probability density')
plt.xlim([min_val, max_val])
plt.grid()
plt.show()
In order to add my own contribution to the community, I share here my function for plotting histograms. This is how I understood the question: plotting the histogram and the cumulative histogram at the same time:
def hist(data, bins, title, labels, data_range=None):
    # "data_range" avoids shadowing the built-in "range";
    # density=True replaces the deprecated normed=True.
    fig = plt.figure(figsize=(15, 8))
    ax = plt.axes()
    plt.ylabel("Proportion")
    values, base, _ = plt.hist(data, bins=bins, density=True, alpha=0.5,
                               color="green", range=data_range, label="Histogram")
    ax_bis = ax.twinx()
    # Prepend a zero so the cumulative curve starts at 0 at the left bin edge;
    # "base" holds one more entry (the bin edges) than "values".
    values = np.insert(values, 0, 0)
    ax_bis.plot(base, np.cumsum(values) / np.cumsum(values)[-1],
                color='darkorange', marker='o', linestyle='-', markersize=1,
                label="Cumulative Histogram")
    plt.xlabel(labels)
    plt.ylabel("Proportion")
    plt.title(title)
    ax_bis.legend()
    ax.legend()
    plt.show()
If anyone wonders what it looks like, please take a look (with the seaborn style activated):
Also, concerning the double grid (the white lines): I always used to struggle to get a nice double grid. Here is an interesting way to circumvent the problem: How to put grid lines from the secondary axis behind the primary plot?
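The gist of the trick linked there, as a sketch (assuming ax and ax_bis as in the function above): draw the primary axes above the twin and hide its background patch so the twin's grid shows through.
# Raise the primary axes above the twin, but keep its background transparent.
ax.set_zorder(ax_bis.get_zorder() + 1)
ax.patch.set_visible(False)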
The simplest way to generate this graph is with seaborn:
import seaborn as sns
sns.ecdfplot(data)
See the seaborn documentation for ecdfplot for the details.
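To reproduce both curves from the question (a sketch, assuming pc and pnc are the two data arrays; complementary=True gives the "more than" curve):
import seaborn as sns
import matplotlib.pyplot as plt

sns.ecdfplot(pc)                        # "less than": fraction of pc below each x
sns.ecdfplot(pnc, complementary=True)   # "more than": fraction of pnc above each x
plt.xscale('log')                       # log-scaled x-axis, as the question requires
plt.show()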
