SciPy Cumulative Distribution Function Plotting - python

I am having trouble plotting a cumulative distribution function.
So far I have found this:
scipy.stats.beta.cdf(0.2,6,7)
But that only gives me a single point.
This is what I will use to plot:
pylab.plot()
pylab.show()
What I want it to look like is this image: File:Binomial distribution cdf.svg
with p = .2 and the bounds stopping once y = 1 or close to 1.

The first argument to cdf can be an array of values, rather than a single value. It will then return an array of values.
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 20, 100)
cdf = stats.binom.cdf
plt.plot(x, cdf(x, 50, 0.2))
plt.show()

I don't think the user above, ubuntu, has suggested the right function to use.
In fact, his answer is largely misleading and incorrect.
Note that binom.cdf() computes the CDF of a binomial distribution with parameters n and p, Binomial(n, p). That is to say, it returns values of that distribution's CDF evaluated at each value in x, rather than the empirical CDF of the data in the vector x.
To calculate the CDF of the distribution underlying a data vector x, just use the histogram() function:
import numpy as np
# density=True is the modern replacement for the removed normed=True argument
hist, bin_edges = np.histogram(np.random.randint(0, 10, 100), density=True)
# multiply densities by bin widths to get per-bin probabilities, then accumulate
cdf = np.cumsum(hist * np.diff(bin_edges))
Or just use the hist() plotting function from matplotlib, as sketched below.
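A minimal sketch of the matplotlib route: hist() can draw the empirical CDF directly with cumulative=True:
import numpy as np
import matplotlib.pyplot as plt

# cumulative=True combined with density=True draws a normalized cumulative
# histogram, i.e. an empirical CDF that ends at 1
plt.hist(np.random.randint(0, 10, 100), bins=10, density=True, cumulative=True)
plt.show()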

Related

Strange behavior when integrating a KDE with scipy.integrate.quad and the set bandwidth

I was looking for a way to obtain the mean value (expected value) of a distribution that I fit with a kernel density estimate from scipy.stats.gaussian_kde. I remember from my statistics class that the expected value is just the integral of pdf(x) * x from -infinity to infinity:
E[X] = ∫_{-∞}^{∞} x · pdf(x) dx
I used the scipy.integrate.quad function for this task in my code, but I ran into this apparently strange behavior (that might have something to do with the bandwidth parameter of the KDE).
Problem
import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

np.random.seed(42)

# Generating sample data: a mixture of two normals
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])

kde = gaussian_kde(test_array, bw_method=0.5)
pdf = lambda x: kde.evaluate([x])[0]  # scalar-valued pdf, as quad expects

X_range = np.arange(-16, 20, 0.1)
y = kde.evaluate(X_range)
_ = plt.plot(X_range, y)

# Integrate over pdf(x) * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]

# Calculate the cdf at the point of the mean
zero_int_low = quad(pdf, a=-np.inf, b=mean_integration_low_bw)[0]

print("The mean after integration: {}\n".format(round(mean_integration_low_bw, 4)))
print("F({}): {}".format(round(mean_integration_low_bw, 4), round(zero_int_low, 4)))

plt.axvline(x=mean_integration_low_bw, color="r")
plt.show()
If I execute this code I get strange results for the integrated mean and for the cumulative distribution function evaluated at the calculated mean:
First question:
In my opinion it should always hold that F(mean) = 0.5, or am I wrong here? (Does this only apply to symmetric distributions?)
Second question:
The stranger thing is that the value of the integrated mean does not change with the bandwidth parameter. In my opinion the mean should change too if the shape of the underlying distribution differs. If I set the bandwidth to 5, I get the following graph:
Why is the mean value still the same if the curve now has a different shape (due to the wider bandwidth)?
I hope these questions don't arise solely from my flawed understanding of statistics ;)
Your initial data is generated here:
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])
So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10, so you can predict the expected average: (500*4 - 10*100)/(500 + 100) = 1.6666... That is pretty close to the result given by your code, and also very consistent with the first plot. The bandwidth cannot move this value: a Gaussian KDE places a symmetric kernel at each data point, so its mean is always the sample mean regardless of bandwidth. And F(mean) = 0.5 only holds when mean and median coincide, e.g. for distributions symmetric about their mean, which your bimodal mixture is not.
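As a quick check (a sketch, not part of the original answer), you can integrate the KDE mean for several bandwidths and compare it to the plain sample mean; they agree regardless of bandwidth:
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

np.random.seed(42)
samples = np.concatenate([np.random.normal(-10, 0.8, 100),
                          np.random.normal(4, 2.0, 500)])
for bw in (0.5, 5.0):
    kde = gaussian_kde(samples, bw_method=bw)
    # E[X] under the KDE, integrated numerically
    mean = quad(lambda x: x * kde.evaluate([x])[0], -np.inf, np.inf)[0]
    print(bw, round(mean, 4), round(samples.mean(), 4))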

Calculate values of CDF in Python efficiently

I'd like to find the CDF values for points in a series. The points in the series can be thought of as a distribution between -10 and 10.
My first attempt was to rank the values of the series, and then use the ranks to get the CDF values. For instance:
rankedSeries = mySeries.rank()
CDF = rankedSeries/len(mySeries)
But is there a faster way with any built-in functions? I'll be doing this many times with large amounts of data, so speed is important.
Generate a histogram of the array by means of numpy.histogram; numpy.cumsum then calculates the CDF from the generated histogram. For large arrays, it's more efficient than sorting in terms of processing time:
import numpy as np
import matplotlib.pyplot as plt

data = (np.random.rand(100) * 20) - 10
bins = 20
hist, bin_edges = np.histogram(data, bins=bins)
cdf = np.cumsum(hist)
plt.plot(bin_edges[1:], cdf / cdf[-1])  # normalize so the CDF ends at 1
plt.show()
If you're interested in the empirical distribution function (EDF) instead of the CDF, for use in Kolmogorov-Smirnov, Anderson-Darling, or other goodness-of-fit tests, the following code may help:
import numpy as np
import matplotlib.pyplot as plt

# sum of three uniforms on [-10, 10], so the EDF has a roughly bell-shaped slope
data = (np.random.rand(100)*20 - 10) + (np.random.rand(100)*20 - 10) + (np.random.rand(100)*20 - 10)
data.sort()
plt.plot(data, np.arange(1, len(data) + 1) / len(data))  # EDF steps from 1/n up to 1
plt.show()
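If you need to evaluate the empirical CDF at arbitrary query points many times, a sketch using np.searchsorted on pre-sorted data (the ecdf helper name is my own) avoids re-ranking on every call:
import numpy as np

data = np.sort(np.random.rand(100000) * 20 - 10)

def ecdf(queries, sorted_data=data):
    # fraction of samples <= each query value, vectorized over the queries
    return np.searchsorted(sorted_data, queries, side='right') / len(sorted_data)

print(ecdf([-5.0, 0.0, 5.0]))  # roughly [0.25, 0.5, 0.75] for uniform data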

Generating non-random normally distributed values between two points

I've stumbled across this code in an answer to a question and I'd like to automate the process of getting the distribution to fit neatly between two bounds.
import numpy as np
from scipy import stats
bounds = [0, 100]
n = np.mean(bounds)
# your distribution:
distribution = stats.norm(loc=n, scale=20)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf(bounds)
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Let's say I have the values [720, 965], or any other bounds, that I would like to fit my distribution across. Is there a way to soft-code the adjustment of scale in stats.norm to fit this distribution across my bounds without any unreasonable gaps? Or are there any functions that have this type of functionality?
A scale of ~20 works well for the example code, but I have to adjust it to ~50 for the example of [720, 965].
I am not sure, but the truncated normal distribution should be exactly what you are looking for.
from scipy.stats import truncnorm

lower, upper, loc, scale = 0, 100, 50, 20  # desired bounds plus the underlying normal's centre/spread
# truncnorm takes its bounds in standard-deviation units relative to loc and scale
a, b = (lower - loc) / scale, (upper - loc) / scale
distr_ab = truncnorm(a, b, loc=loc, scale=scale)  # truncated normal on [lower, upper]
distr_ab.rvs(size=100)  # get 100 samples from the distribution
# distr_ab.cdf, distr_ab.ppf etc. are all accessible
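Alternatively, staying with the question's norm + ppf approach, one way to "soft-code" the scale is to place the bounds k standard deviations from the centre. This is a sketch under my own assumption of k ≈ 2.5, which roughly reproduces scale ≈ 20 for [0, 100] and ≈ 50 for [720, 965]:
import numpy as np
from scipy import stats

def bounded_normal(bounds, k=2.5, num=1000):
    # centre the normal between the bounds and set its spread so that the
    # bounds sit k standard deviations away from the centre (k is an assumption)
    loc = np.mean(bounds)
    scale = (bounds[1] - bounds[0]) / (2 * k)
    distribution = stats.norm(loc=loc, scale=scale)
    pp = np.linspace(*distribution.cdf(bounds), num=num)
    return distribution.ppf(pp)

x = bounded_normal([720, 965])  # values spanning the requested bounds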

calculate percentile of 2D array

I have size classes, and for each size class I have measured counts:
import numpy as np
from matplotlib import pyplot as plt
from scipy.stats import norm

size_class = np.linspace(0, 9, 10)
counts = norm.pdf(size_class, 5, 1)  # synthetic data
counts_cumulative_normalised = np.cumsum(counts) / counts.sum()  # summing up and normalising
plt.plot(size_class, counts_cumulative_normalised)
plt.show()
So if I wanted to calculate the percentile of a given size, I would have to interpolate at my desired size.
Is there a built-in function that takes these two vectors as arguments and gives me the desired percentiles?
If you don't know whether the data is normally distributed, and you want percentiles based on the empirical cumulative distribution function, you can use an interpolation approach.
plt.plot(size_class, counts_cumulative_normalised)

# what percentile does size 4 correspond to?
from scipy import interpolate
intp = interpolate.interp1d(size_class, counts_cumulative_normalised, kind='cubic')
intp(4)
# array(0.300529305241782)
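Going the other way, from percentile to size, you can interpolate with the roles of the two vectors swapped (a sketch along the same lines, reusing size_class and counts_cumulative_normalised from above):
# interpolate size as a function of the normalized cumulative counts
inv = interpolate.interp1d(counts_cumulative_normalised, size_class, kind='cubic')
inv(0.5)  # the size corresponding to the 50th percentile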
I know you are presenting just synthetic data, but do note that the way you are doing this underestimates the cumulative distribution function, since you only take a few sample points; see this comparison:
plt.plot(size_class, counts_cumulative_normalised)
plt.plot(size_class, norm.cdf(size_class, 5, 1))
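To see the effect, here is a sketch with a finer grid (reusing the imports from above), where the normalized cumulative sum tracks norm.cdf much more closely:
fine = np.linspace(0, 9, 1000)
pdf_fine = norm.pdf(fine, 5, 1)
plt.plot(fine, np.cumsum(pdf_fine) / pdf_fine.sum())  # close to the true CDF
plt.plot(fine, norm.cdf(fine, 5, 1))
plt.show()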

Truncating SciPy random distributions

Does anyone have suggestions for efficiently truncating SciPy random distributions? For example, if I generate random values like so:
import scipy.stats as stats
print(stats.logistic.rvs(loc=0, scale=1, size=1000))
How would I go about constraining the output values between 0 and 1 without changing the original parameters of the distribution and without changing the sample size, all while minimizing the amount of work the machine has to do?
Your question is more of a statistics question than a scipy question. In general, you would need to be able to normalize over the interval you are interested in and compute the CDF for this interval analytically to create an efficient sampling method. Edit: And it turns out that this is possible (rejection sampling is not needed):
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd

# plot the original distribution
xrng = np.arange(-10, 10, .1)
yrng = stats.logistic.pdf(xrng)
plt.plot(xrng, yrng)

# plot the truncated distribution
nrm = stats.logistic.cdf(1) - stats.logistic.cdf(0)  # probability mass on [0, 1]
xrng = np.arange(0, 1, .01)
yrng = stats.logistic.pdf(xrng) / nrm
plt.plot(xrng, yrng)

# sample using the inverse cdf: map uniforms into [cdf(0), cdf(1)], then invert
yr = rnd.rand(100000) * nrm + stats.logistic.cdf(0)
xr = stats.logistic.ppf(yr)
plt.hist(xr, density=True)

plt.show()
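The same inverse-CDF trick generalizes; here is a sketch of a helper (the name truncated_rvs is my own) that truncates any continuous scipy.stats distribution to [lo, hi]:
import numpy as np
import scipy.stats as stats

def truncated_rvs(dist, lo, hi, size):
    # map uniforms into the CDF range [F(lo), F(hi)], then invert with ppf
    u = np.random.rand(size) * (dist.cdf(hi) - dist.cdf(lo)) + dist.cdf(lo)
    return dist.ppf(u)

samples = truncated_rvs(stats.logistic(loc=0, scale=1), 0, 1, 1000)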
What are you trying to achieve? The logistic distribution by definition has infinite range, so if you truncate the results in any way, their distribution will change. If you just want random numbers in a range, there's random.random().
You could normalise your results to the maximum returned value:
>>> dist = stats.logistic.rvs(loc=0, scale=1, size=1000)
>>> norm_dist = dist / np.max(dist)
This will keep the 'shape' the same, and the values between 0 and 1. But if you're doing repeated draws from a distribution, be sure to normalise all the draws to the same value (the max across all draws).
However, you want to be pretty careful if you're doing this kind of thing, that it makes sense within the context of what you are trying to achieve (which I don't have enough info to comment on...)
