Getting draws from a fitted distribution - python

I referred to Fitting empirical distribution to theoretical ones with Scipy (Python)? and generated the best-fit distribution for my sample data. I wish to generate random numbers according to the best-fit distribution. See image below.
However, numpy.random.f (https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.f.html#numpy.random.f) takes only three parameters: dfnum, dfden and size=None. Where should I insert loc and scale? Also, the dfn and dfd in the best-fit distribution are floats, whereas numpy.random.f wants integers.
If I use only dfn and dfd, as in df_members['bd'] = df_members.bd.apply(lambda x: np.rint((np.random.f(dfnum=1441, dfden=19))) if x==-999 else x ), values like these are generated, which is wrong.

You can use the f distribution from the scipy.stats module and draw random values from it with the parameters you already found. The f.rvs method accepts the four parameters (dfn, dfd, loc and scale) plus the size (the number of draws you want).
from scipy.stats import f
import matplotlib.pyplot as plt
values = f.rvs(1441.41, 19.1, -0.24, 26.5, 100000)
values is an array of length 100000 containing draws from the given distribution. You can plot it as follows:
plt.hist(values, bins=25)
plt.show()
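To connect this back to the fill-in code from the question, here is a minimal sketch. It assumes the fitted parameters are in the order dfn, dfd, loc, scale, and it uses a small made-up array in place of the df_members['bd'] column; the names fitted, bd and filled are only for illustration.

```python
import numpy as np
from scipy.stats import f

# Parameters reported by the fit in the question: dfn, dfd, loc, scale.
dfn, dfd, loc, scale = 1441.41, 19.1, -0.24, 26.5
fitted = f(dfn, dfd, loc=loc, scale=scale)  # "frozen" distribution

# Hypothetical stand-in for the column, where -999 marks a missing value.
bd = np.array([25.0, -999.0, 31.0, -999.0, 40.0])
missing = bd == -999

# Draw one value per missing entry; random_state makes the draws repeatable.
draws = fitted.rvs(size=missing.sum(), random_state=0)

filled = bd.copy()
filled[missing] = np.rint(draws)  # round to whole numbers, as in the question
```

Drawing all missing values in one call avoids calling the sampler once per row, which is what the apply/lambda version does.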

Related

Generate random sample for exponentiated Weibull distribution using Python

For a distribution, running this scipy function detects the best fit as the exponentiated Weibull distribution, and the function outputs 4 parameter values. But how do I generate a sample of size n that follows this distribution?
I don't want to rewrite the function. Any Python package that does this would be helpful.
Usually you would use the ppf (percent point function, the inverse of the cdf) to generate values from a random seed.
For a simple, completely fake example, let's say we fit an exponentiated Weibull distribution to a uniform random variable (with values from 0 to 15). Then create a seed random variable (from 0 to 1, because these are the quantile values) and pass it to the ppf function.
import scipy.stats as st
import numpy as np
# fitting part
samples = np.random.rand(10000)*15
dist = st.exponweib.fit(samples)
# generation part
sample_seed = np.random.rand(10000)
random_exponweib_samples = st.exponweib.ppf(sample_seed, *dist)
# plotting
import matplotlib.pyplot as plt
plt.hist(samples, label="uniform", alpha=.5)
plt.hist(random_exponweib_samples, label="exponweib", alpha=.5)
plt.legend()
plt.show()
You'll have something like the following.
Please be careful and check the documentation of the ppf for the Weibull distribution. In st.exponweib.ppf(sample_seed, *dist) I just pass *dist, but it might be that the parameters should be passed differently; you can see this from the shape of the orange plot, which might not be correct.
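On the parameter order: as far as I know, fit returns the shape parameters followed by loc and scale, which is the same positional order that ppf and rvs accept. The sketch below checks this with a round trip through the cdf and also shows the shorter route via rvs. The parameter values are made up for illustration and are not the result of any real fit.

```python
import numpy as np
import scipy.stats as st

# Hypothetical parameters in the order fit() returns them: a, c, loc, scale.
params = (2.0, 1.5, 0.0, 3.0)

rng = np.random.default_rng(0)

# Route 1: let scipy draw the sample directly.
via_rvs = st.exponweib.rvs(*params, size=5000, random_state=rng)

# Route 2: uniform quantiles pushed through the ppf (inverse cdf).
u = rng.random(5000)
via_ppf = st.exponweib.ppf(u, *params)

# Sanity check: the cdf should map the ppf output back to the quantiles.
roundtrip = st.exponweib.cdf(via_ppf, *params)
```

Both routes should give samples from the same distribution; rvs just saves the manual seed step.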

Generating non-random normally distributed values between two points

I've stumbled across this code in an answer to a question and I'd like to automate the process of getting the distribution to fit neatly between two bounds.
import numpy as np
from scipy import stats
bounds = [0, 100]
n = np.mean(bounds)
# your distribution:
distribution = stats.norm(loc=n, scale=20)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf(bounds)
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Let's say I have the values [720, 965], or any other bounds, that I would like to fit my distribution across. Is there a way to soft-code the adjustment of scale in stats.norm to fit this distribution across my bounds without any unreasonable gaps? Or are there any functions that have this type of functionality?
A scale of ~20 works well for the example code, but I have to adjust it to ~50 for the example of [720, 965]
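I am not sure, but a truncated normal distribution should be exactly what you are looking for. Note that truncnorm takes its bounds a and b in units of standard deviations relative to loc, not as absolute values.
from scipy.stats import truncnorm
lower, upper = 720, 965 # bounds from the question
loc, scale = (lower + upper) / 2, 50 # centre and spread of the underlying normal
# convert the absolute bounds to standard deviations from loc
a, b = (lower - loc) / scale, (upper - loc) / scale
distr_ab = truncnorm(a, b, loc=loc, scale=scale) # truncated normal distribution on [lower, upper]
distr_ab.rvs(size=100) # get 100 samples from the distribution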
# distr_ab.cdf, distr_ab.ppf etc... all accessible
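For the "soft-coded scale" part of the question, one possible sketch is to derive the scale from the bounds so that each bound sits a fixed number of standard deviations from the centre. The function name normal_between and the knob k are my own inventions here; k=2.5 reproduces the scale of 20 for [0, 100] and gives 49 for [720, 965], close to the ~50 mentioned in the question.

```python
import numpy as np
from scipy import stats

def normal_between(bounds, k=2.5, num=1000):
    """Return num deterministic, normally spaced values spanning bounds.

    k is a hypothetical tuning knob: the number of standard deviations
    between the centre of the interval and each bound.
    """
    lo, hi = bounds
    loc = (lo + hi) / 2
    scale = (hi - lo) / (2 * k)
    dist = stats.norm(loc=loc, scale=scale)
    # Evenly spaced probabilities between cdf(lo) and cdf(hi), mapped back
    # through the inverse cdf, exactly as in the code from the question.
    pp = np.linspace(dist.cdf(lo), dist.cdf(hi), num=num)
    return dist.ppf(pp)

x_small = normal_between([0, 100])    # scale works out to 20
x_large = normal_between([720, 965])  # scale works out to 49
```

A larger k puts more of the bell inside the bounds (thinner tails at the edges); a smaller k makes the result look closer to uniform.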

numpy.random.normal different distribution: selecting values from distribution

I have a power-law distribution of energies and I want to pick n random energies based on the distribution. I tried doing this manually using random numbers but it is too inefficient for what I want to do. I'm wondering whether there is a method in numpy (or elsewhere) that works like numpy.random.normal, except that instead of using a normal distribution, the distribution may be specified. So in my mind an example might look like (similar to numpy.random.normal):
import numpy as np
# Energies from within which I want values drawn
eMin = 50.
eMax = 2500.
# Amount of energies to be drawn
n = 10000
photons = []
for i in range(n):
    # Method that I just made up which would work like random.normal,
    # i.e. return an energy on the distribution based on its probability,
    # but take a distribution other than a normal distribution
    photons.append(np.random.distro(eMin, eMax, lambda e: e**(-1.)))
print(photons)
Printing photons should give me a list of length 10000 populated by energies in this distribution. If I were to histogram this it would have much greater bin values at lower energies.
I am not sure if such a method exists but it seems like it should. I hope it is clear what I want to do.
EDIT:
I have seen numpy.random.power but my exponent is -1 so I don't think this will work.
Sampling from arbitrary PDFs well is actually quite hard. There are large and dense books just about how to efficiently and accurately sample from the standard families of distributions.
It looks like you could probably get by with a custom inversion method for the example that you gave.
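For the 1/e density in the question, the inversion can be done by hand. With p(e) proportional to 1/e on [eMin, eMax], the cdf is F(e) = ln(e/eMin) / ln(eMax/eMin), so solving F(e) = u gives e = eMin * (eMax/eMin)**u. A minimal sketch, assuming the exponent is exactly -1 as in the question (other exponents need a different inverse):

```python
import numpy as np

e_min, e_max = 50.0, 2500.0  # energy range from the question
n = 10000                    # number of energies to draw

rng = np.random.default_rng(0)
u = rng.random(n)  # uniform probabilities in [0, 1)

# Inverse cdf of the density proportional to 1/e on [e_min, e_max].
photons = e_min * (e_max / e_min) ** u
```

This is vectorised, so there is no Python loop over the n draws.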
If you want to sample from an arbitrary distribution you need the inverse of the cumulative distribution function (the cdf, not the pdf).
You then sample a probability uniformly from the range [0,1] and feed this into the inverse of the cdf to get the corresponding value.
It is often not possible to obtain the cdf from the pdf analytically.
However, if you're happy to approximate the distribution, you could do so by calculating f(x) at regular intervals over its domain, then doing a cumsum over this vector (weighted by the interval widths) to get an approximation of the cdf, and from this approximate the inverse.
Rough code snippet:
import matplotlib.pyplot as plt
import numpy as np
import scipy.interpolate
def f(x):
    """
    substitute this function with your arbitrary distribution
    must be positive over domain
    """
    return 1.0 / x
# you should vary inputVals to cover the domain of f (for better accuracy you can
# be clever about spacing of values as well). Here I space them logarithmically
# up to 1 then at regular intervals but you could definitely do better
inputVals = np.hstack([np.logspace(-3, 0, 300, endpoint=False), np.arange(1, 10000)])
# everything else should just work
funcVals = f(inputVals)
# approximate the cdf with a left Riemann sum: f(x) times the width of each x interval
cdf = np.zeros(len(funcVals))
cdf[1:] = np.cumsum(funcVals[:-1] * np.diff(inputVals))
cdf /= cdf[-1]
# you could also improve the approximation by choosing an appropriate interpolator
inverseCdf = scipy.interpolate.interp1d(cdf, inputVals)
# grab 100k samples from the distribution
samples = inverseCdf(np.random.uniform(0, 1, size=100000))
plt.hist(samples, bins=500)
plt.show()
Why don't you use eval and put the distribution in a string?
>>> import numpy
>>> cmd = "numpy.random.normal(500)"
>>> eval(cmd)
You can manipulate the string as you wish to set the distribution.
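If the goal is only to pick a distribution by name at run time, a lookup table of callables may do the same job without evaluating strings. This is just a sketch of that alternative; the samplers dictionary and its entries are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical lookup table: distribution name -> function drawing `size` values.
samplers = {
    "normal": lambda size: rng.normal(loc=500.0, scale=1.0, size=size),
    "exponential": lambda size: rng.exponential(scale=500.0, size=size),
}

name = "normal"  # could come from configuration or user input
draws = samplers[name](1000)
```

An unknown name raises a KeyError instead of running arbitrary code, which is usually the safer failure mode.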

probability density function from histogram in python to fit another histrogram

I have a question concerning fitting and getting random numbers.
Situation is as such:
Firstly I have a histogram from data points.
import numpy as np
"""create random data points """
mu = 10
sigma = 5
n = 1000
datapoints = np.random.normal(mu,sigma,n)
""" create normalized histogram of the data """
bins = np.linspace(0,20,21)
H, bins = np.histogram(datapoints,bins,density=True)
I would like to interpret this histogram as probability density function (with e.g. 2 free parameters) so that I can use it to produce random numbers AND also I would like to use that function to fit another histogram.
Thanks for your help
You can use a cumulative distribution function to generate random numbers from an arbitrary distribution, as described here.
Using a histogram to produce a smooth cumulative distribution function is not entirely trivial. You can use interpolation, for example scipy.interpolate.interp1d(), for values in between the centers of your bins, and that will work fine for a histogram with a reasonably large number of bins and items. However, you have to decide on the form of the tails of the probability function, i.e. for values less than the smallest bin or greater than the largest bin. You could give your distribution Gaussian tails (based on, for example, fitting a Gaussian to your histogram), or any other form of tail appropriate to your problem, or simply truncate the distribution.
Example:
import numpy
import scipy.interpolate
import random
import matplotlib.pyplot as pyplot
# create some normally distributed values and make a histogram
a = numpy.random.normal(size=10000)
counts, bins = numpy.histogram(a, bins=100, density=True)
cum_counts = numpy.cumsum(counts)
bin_widths = (bins[1:] - bins[:-1])
# generate more values with same distribution
x = cum_counts*bin_widths
y = bins[1:]
inverse_density_function = scipy.interpolate.interp1d(x, y)
b = numpy.zeros(10000)
for i in range(len( b )):
    u = random.uniform( x[0], x[-1] )
    b[i] = inverse_density_function( u )
# plot both
pyplot.hist(a, 100)
pyplot.hist(b, 100)
pyplot.show()
This doesn't handle tails, and it could handle bin edges better, but it would get you started on using a histogram to generate more values with the same distribution.
P.S. You could also try to fit a specific known distribution described by a few values (which I think is what you had mentioned in the question) but the above non-parametric approach is more general-purpose.
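Both routes can also be sketched with scipy.stats directly. As far as I know, scipy.stats.rv_histogram wraps the output of numpy.histogram in a distribution object with rvs, pdf, cdf and ppf, and norm.fit gives the two-parameter version mentioned in the P.S. The data and bin edges below are made up to mirror the question.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
datapoints = rng.normal(10, 5, 1000)  # stand-in for the data in the question

# Non-parametric route: treat the histogram itself as the distribution.
hist = np.histogram(datapoints, bins=np.linspace(-10, 30, 41))
hist_dist = stats.rv_histogram(hist)
new_values = hist_dist.rvs(size=5000, random_state=rng)

# Parametric route: fit a normal with two free parameters (mean and std).
mu_hat, sigma_hat = stats.norm.fit(datapoints)
fitted_density = stats.norm.pdf(np.linspace(-10, 30, 200), mu_hat, sigma_hat)
```

The histogram route truncates the distribution at the outermost bin edges, so the same caveat about tails applies.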

Truncating SciPy random distributions

Does anyone have suggestions for efficiently truncating the SciPy random distributions? For example, if I generate random values like so:
import scipy.stats as stats
print(stats.logistic.rvs(loc=0, scale=1, size=1000))
How would I go about constraining the output values between 0 and 1 without changing the original parameters of the distribution and without changing the sample size, all while minimizing the amount of work the machine has to do?
Your question is more of a statistics question than a scipy question. In general, you would need to be able to normalize over the interval you are interested in and compute the CDF for this interval analytically to create an efficient sampling method. Edit: And it turns out that this is possible (rejection sampling is not needed):
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
#plot the original distribution
xrng=np.arange(-10,10,.1)
yrng=stats.logistic.pdf(xrng)
plt.plot(xrng,yrng)
#plot the truncated distribution
nrm=stats.logistic.cdf(1)-stats.logistic.cdf(0)
xrng=np.arange(0,1,.01)
yrng=stats.logistic.pdf(xrng)/nrm
plt.plot(xrng,yrng)
#sample using the inverse cdf
yr=rnd.rand(100000)*(nrm)+stats.logistic.cdf(0)
xr=stats.logistic.ppf(yr)
plt.hist(xr,density=True)
plt.show()
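The same inverse-cdf trick can be wrapped in a small helper that should work for any frozen scipy.stats continuous distribution. The function name truncated_rvs and its signature are my own, not part of SciPy.

```python
import numpy as np
import scipy.stats as stats

def truncated_rvs(dist, low, high, size, rng=None):
    """Draw `size` values from `dist` restricted to [low, high].

    dist is a frozen scipy.stats distribution; rng may be None, an
    integer seed, or a numpy Generator.
    """
    rng = np.random.default_rng(rng)
    p_low, p_high = dist.cdf(low), dist.cdf(high)
    # Uniform probabilities limited to the slice of the cdf inside the bounds.
    u = rng.uniform(p_low, p_high, size=size)
    return dist.ppf(u)

samples = truncated_rvs(stats.logistic(loc=0, scale=1), 0.0, 1.0, 1000, rng=0)
```

Every draw is kept, so the sample size is exactly what was requested and the original loc and scale are untouched.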
What are you trying to achieve? The logistic distribution by definition has infinite range. If you truncate the results in any way, their distribution will change. If you just want random numbers in a range, there's random.random().
You could rescale your results using the minimum and maximum returned values:
>>> dist = stats.logistic.rvs(loc=0, scale=1, size=1000)
>>> norm_dist = (dist - np.min(dist)) / (np.max(dist) - np.min(dist))
This will keep the 'shape' the same, and the values between 0 and 1. But if you're doing repeated draws from a distribution, be sure to rescale all the draws with the same values (min and max from all draws).
However, you want to be pretty careful if you're doing this kind of thing that it makes sense within the context of what you are trying to achieve (which I don't have enough info to comment on...)
