I've stumbled across this code in an answer to a question and I'd like to automate the process of getting the distribution to fit neatly between two bounds.
import numpy as np
from scipy import stats
bounds = [0, 100]
n = np.mean(bounds)
# your distribution:
distribution = stats.norm(loc=n, scale=20)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf(bounds)
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Let's say I have the values [720, 965], or any other bounds, that I would like to fit my distribution across. Is there a way to soft-code the adjustment of scale in stats.norm to fit this distribution across my bounds without any unreasonable gaps? Or are there any functions that have this type of functionality?
A scale of ~20 works well for the example code, but I have to adjust it to ~50 for the [720, 965] case.
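For reference, the kind of soft-coding I mean is something like the sketch below (the helper name and the divisor of 5 are just my guesses, chosen because they reproduce the ~20 and ~50 scales above):
import numpy as np
from scipy import stats

def norm_between(bounds, spread=5):
    # hypothetical helper: centre the normal on the bounds and derive the
    # scale from their width; spread=5 reproduces scale~20 for [0, 100]
    # and scale~50 for [720, 965]
    low, high = bounds
    return stats.norm(loc=np.mean(bounds), scale=(high - low) / spread)

distribution = norm_between([720, 965])  # loc=842.5, scale=49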
I am not certain, but a truncated normal distribution sounds like exactly what you are looking for.
from scipy.stats import truncnorm
distr_ab = truncnorm(a, b) # truncated normal distribution in the interval [a, b]
distr_ab.rvs(size=100) # get 100 samples from the distribution
# distr_ab.cdf, distr_ab.ppf etc... all accessible
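Note that truncnorm's a and b parameters are expressed in standard deviations relative to loc and scale, not in data units, so to truncate at actual bounds such as the [720, 965] from the question they have to be converted first. A minimal sketch (the loc/scale values here are just for illustration):
from scipy.stats import truncnorm

low, high = 720, 965        # the bounds from the question
loc, scale = 842.5, 50      # illustrative centre and spread
# convert the data bounds into standard-deviation units relative to loc/scale
a, b = (low - loc) / scale, (high - loc) / scale
distr_ab = truncnorm(a, b, loc=loc, scale=scale)
samples = distr_ab.rvs(size=100)   # every sample lies inside [720, 965]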
I was looking for a way to obtain the mean value (expected value) from a distribution that I fitted with a kernel density estimate from scipy.stats.gaussian_kde. I remember from my statistics class that the expected value is just the integral of pdf(x) * x from -infinity to infinity:
E[X] = integral of x * pdf(x) dx, from -infinity to infinity
I used the scipy.integrate.quad function to do this task in my code, but I ran into this apparently strange behavior (that might have something to do with the bandwidth parameter of the KDE).
Problem
import matplotlib.pyplot as plt
import numpy as np
import random
from scipy.stats import norm, gaussian_kde
from scipy.integrate import quad
from sklearn.neighbors import KernelDensity
np.random.seed(42)
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
kde = gaussian_kde(test_array,bw_method=0.5)
X_range = np.arange(-16,20,0.1)
y_list = []
for X in X_range:
    pdf = lambda x: kde.evaluate([[x]])
    y_list.append(pdf(X))
y = np.array(y_list)
_ = plt.plot(X_range,y)
# Integrate over pdf * x to obtain the mean
mean_integration_low_bw = quad(lambda x: x * pdf(x), a=-np.inf, b=np.inf)[0]
# Calculate the cdf at point of the mean
zero_int_low = quad(lambda x: pdf(x), a=-np.inf, b=mean_integration_low_bw)[0]
print("The mean after integration: {}\n".format(round(mean_integration_low_bw,4)))
print("F({}): {}".format(round(mean_integration_low_bw,4),round(zero_int_low,4)))
plt.axvline(x=mean_integration_low_bw,color ="r")
plt.show()
If I execute this code, I get strange results for the integrated mean and for the cumulative distribution function evaluated at the calculated mean:
First Question:
In my opinion it should always hold that F(mean) = 0.5, or am I wrong here? (Does this only apply to symmetric distributions?)
Second Question:
The stranger thing is that the value of the integrated mean does not change with the bandwidth parameter. In my opinion the mean should also change if the shape of the underlying distribution differs. If I set the bandwidth to 5, I get the following graph:
Why is the mean value still the same if the curve now has a different shape (due to the wider bandwidth)?
I hope these questions do not just arise from my flawed understanding of statistics ;)
Your initial data is generated here:
# Generating sample data
test_array = np.concatenate([np.random.normal(loc=-10, scale=.8, size=100),\
np.random.normal(loc=4,scale=2.0,size=500)])
So you have 500 samples from a distribution with mean 4 and 100 samples from a distribution with mean -10, so you can predict the expected average: (500*4 - 10*100)/(500+100) = 1.6666... That is pretty close to the result given by your code, and also very consistent with the first plot.
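A quick numerical check of this: the mean of a Gaussian KDE is just the sample mean, because every kernel is symmetric and centred on a data point, which is also why changing the bandwidth does not move the integrated mean. A small sketch reusing the test_array from the question (finite integration bounds are used here because they are more robust for quad):
import numpy as np
from scipy.stats import gaussian_kde
from scipy.integrate import quad

np.random.seed(42)
test_array = np.concatenate([np.random.normal(loc=-10, scale=0.8, size=100),
                             np.random.normal(loc=4, scale=2.0, size=500)])

for bw in (0.5, 5):
    kde = gaussian_kde(test_array, bw_method=bw)
    mean = quad(lambda x: x * kde.evaluate([x])[0], -150, 150)[0]
    # the integrated mean tracks the sample mean regardless of bandwidth
    print(bw, mean, np.mean(test_array))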
I referred to Fitting empirical distribution to theoretical ones with Scipy (Python)? and generated the best-fit distribution for my sample data. I wish to generate random numbers according to that best-fit distribution. See the image below.
However, in https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.f.html#numpy.random.f there are only 3 parameters, dfnum, dfden, size=None, so where should I insert loc and scale? Also, the dfn and dfd of the best-fit distribution are floats, while numpy.random wants integers.
If I use only dfnum and dfden, as in df_members['bd'] = df_members.bd.apply(lambda x: np.rint(np.random.f(dfnum=1441, dfden=19)) if x==-999 else x), the generated values are wrong.
You can use the F distribution from the scipy.stats module and ask for random values from it, given the parameters you already found, using the f.rvs method, which accepts the four parameters plus size (the number of draws you want).
from scipy.stats import f
import matplotlib.pyplot as plt
values = f.rvs(1441.41, 19.1, -0.24, 26.5, 100000)  # dfn, dfd, loc, scale, size
values is an array of length 100000 holding draws from the given distribution. You can visualize it as follows:
plt.hist(values, bins=25)
plt.show()
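If the goal is the same fill-in step as in the question, the rvs call can be dropped straight into the apply (a sketch, assuming the df_members dataframe and the -999 sentinel from the question):
from scipy.stats import f
import numpy as np

# df_members is the dataframe from the question; -999 marks the missing values
df_members['bd'] = df_members.bd.apply(
    lambda x: np.rint(f.rvs(1441.41, 19.1, -0.24, 26.5)) if x == -999 else x
)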
I have a question concerning fitting and getting random numbers.
The situation is as follows:
First, I have a histogram built from data points.
import numpy as np
"""create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu,sigma,n)
""" create normalized histrogram of the data """
bins = np.linspace(0,20,21)
H, bins = np.histogram(datapoints, bins, density=True)
I would like to interpret this histogram as a probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers, AND I would also like to use that function to fit another histogram.
Thanks for your help
You can use a cumulative density function to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth cumulative density function is not entirely trivial; you can use interpolation, for example scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), or any other form of tail appropriate to your problem, or simply truncate the distribution.
Example:
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution:
# x holds the CDF evaluated at the right edge of each bin (the bins are uniform here,
# so cumsum(counts)*widths equals cumsum(counts*widths)); y holds those edges
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len(b)):
    u = random.uniform(x[0], x[-1])
    b[i] = inverse_density_function(u)
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()
This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
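One way to handle the bin edges a bit better (a sketch under the same assumptions as above): include the left-most bin edge so the interpolated CDF starts at 0, and draw all the uniform samples at once:
import numpy
import scipy.interpolate

a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
# CDF evaluated at every bin edge, starting at 0 on the left-most edge
cdf_vals = numpy.concatenate(([0.0], numpy.cumsum(counts * numpy.diff(bins))))
inverse_cdf = scipy.interpolate.interp1d(cdf_vals, bins)
u = numpy.random.uniform(cdf_vals[0], cdf_vals[-1], size=10000)
b = inverse_cdf(u)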
P.S. You could also try to fit a specific known distribution described by a few values (which I think is what you had mentioned in the question) but the above non-parametric approach is more general-purpose.
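And for the parametric route mentioned in the P.S., a minimal sketch assuming the data are roughly Gaussian: fit the distribution's parameters and then draw from the fitted distribution:
import numpy
import scipy.stats

a = numpy.random.normal(size=10000)
loc, scale = scipy.stats.norm.fit(a)                 # maximum-likelihood estimates
b = scipy.stats.norm.rvs(loc=loc, scale=scale, size=10000)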
I am having trouble plotting a Cumulative Distribution Function.
So far I have found this:
scipy.stats.beta.cdf(0.2,6,7)
But that only gives me a point.
This will be what I use to plot:
pylab.plot()
pylab.show()
What I want it to look like is this:
[Wikipedia's File:Binomial distribution cdf.svg]
with p = .2 and the bounds stopping once y = 1 or close to 1.
The first argument to cdf can be an array of values, rather than a single value. It will then return an array of values.
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0,20,100)
cdf = stats.binom.cdf
plt.plot(x,cdf(x, 50, 0.2))
plt.show()
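The same idea applies to the beta example from the question; just pass an array of x values to beta.cdf (a short sketch):
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np

x = np.linspace(0, 1, 100)          # the beta distribution lives on [0, 1]
plt.plot(x, stats.beta.cdf(x, 6, 7))
plt.show()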
I don't think the user above, ubuntu, has suggested the right function to use. In fact, that answer is misleading and largely incorrect.
Note that binom.cdf() is a function that calculates the cdf of a binomial distribution specified by n and p, i.e. Binomial(n, p). That is to say, it returns values of the cdf of that random variable for each value in x, rather than the actual cdf function of the discrete distribution specified by the vector x.
To calculate the cdf of the empirical distribution defined by a vector x, just use the histogram() function:
import numpy as np
hist, bin_edges = np.histogram(np.random.randint(0, 10, 100), density=True)
cdf = np.cumsum(hist * np.diff(bin_edges))  # cdf at the right edge of each bin; the last value is 1
Or just use the hist() plotting function from matplotlib, which can draw the cumulative histogram directly with cumulative=True.
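For example (a small sketch of the matplotlib route):
import numpy as np
import matplotlib.pyplot as plt

data = np.random.randint(0, 10, 100)
# a normalized cumulative histogram is an empirical CDF of the data
plt.hist(data, bins=10, density=True, cumulative=True, histtype='step')
plt.show()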
Does anyone have suggestions for efficiently truncating SciPy random distributions? For example, if I generate random values like so:
import scipy.stats as stats
print(stats.logistic.rvs(loc=0, scale=1, size=1000))
How would I go about constraining the output values between 0 and 1 without changing the original parameters of the distribution and without changing the sample size, all while minimizing the amount of work the machine has to do?
Your question is more of a statistics question than a scipy question. In general, you would need to be able to normalize over the interval you are interested in and compute the CDF for this interval analytically to create an efficient sampling method. Edit: And it turns out that this is possible (rejection sampling is not needed):
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
#plot the original distribution
xrng=np.arange(-10,10,.1)
yrng=stats.logistic.pdf(xrng)
plt.plot(xrng,yrng)
#plot the truncated distribution
nrm=stats.logistic.cdf(1)-stats.logistic.cdf(0)
xrng=np.arange(0,1,.01)
yrng=stats.logistic.pdf(xrng)/nrm
plt.plot(xrng,yrng)
#sample using the inverse cdf
yr=rnd.rand(100000)*(nrm)+stats.logistic.cdf(0)
xr=stats.logistic.ppf(yr)
plt.hist(xr,density=True)
plt.show()
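The same inverse-CDF trick works for any scipy.stats distribution; wrapped in a small helper function (truncated_rvs is my own name, not part of scipy):
import numpy as np
import scipy.stats as stats

def truncated_rvs(dist, low, high, size):
    # sample from `dist` restricted to [low, high] via the inverse CDF
    u = np.random.uniform(dist.cdf(low), dist.cdf(high), size=size)
    return dist.ppf(u)

samples = truncated_rvs(stats.logistic(loc=0, scale=1), 0, 1, 1000)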
What are you trying to achieve? The logistic distribution by definition has infinite range. If you truncate the results in any way, their distribution will change. If you just want random numbers in a range, there's random.random().
You could normalise your results to the maximum returned value:
>>> dist = stats.logistic.rvs(loc=0, scale=1, size=1000)
>>> norm_dist = dist / np.max(dist)
This will keep the 'shape' the same, and the values between 0 and 1. But if you're doing repeated draws from a distribution, be sure to normalise all the draws to the same value (max from all draws).
However, you want to be pretty careful when doing this kind of thing that it makes sense within the context of what you are trying to achieve (which I don't have enough info to comment on...).