Does anyone have suggestions for efficiently truncating the SciPy random distributions? For example, if I generate random values like so:
import scipy.stats as stats
print(stats.logistic.rvs(loc=0, scale=1, size=1000))
How would I go about constraining the output values between 0 and 1 without changing the original parameters of the distribution and without changing the sample size, all while minimizing the amount of work the machine has to do?
Your question is more of a statistics question than a scipy question. In general, you would need to be able to normalize over the interval you are interested in and compute the CDF for this interval analytically to create an efficient sampling method. Edit: And it turns out that this is possible (rejection sampling is not needed):
import scipy.stats as stats
import matplotlib.pyplot as plt
import numpy as np
import numpy.random as rnd
#plot the original distribution
xrng=np.arange(-10,10,.1)
yrng=stats.logistic.pdf(xrng)
plt.plot(xrng,yrng)
#plot the truncated distribution
nrm=stats.logistic.cdf(1)-stats.logistic.cdf(0)
xrng=np.arange(0,1,.01)
yrng=stats.logistic.pdf(xrng)/nrm
plt.plot(xrng,yrng)
#sample using the inverse cdf
yr=rnd.rand(100000)*(nrm)+stats.logistic.cdf(0)
xr=stats.logistic.ppf(yr)
plt.hist(xr,density=True)
plt.show()
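The inverse-CDF approach above can be wrapped in a small reusable helper. This is only a minimal sketch: the name sample_truncated and its parameters are hypothetical, and it assumes the distribution object exposes cdf and ppf, as SciPy's frozen continuous distributions do.

```python
import numpy as np
import scipy.stats as stats


def sample_truncated(dist, lower, upper, size, rng=None):
    """Draw `size` samples from `dist` restricted to [lower, upper].

    Hypothetical helper: maps uniform draws onto the slice of the CDF
    between cdf(lower) and cdf(upper), then inverts them with ppf.
    """
    rng = np.random.default_rng() if rng is None else rng
    cdf_lo, cdf_hi = dist.cdf(lower), dist.cdf(upper)
    u = rng.uniform(cdf_lo, cdf_hi, size=size)
    return dist.ppf(u)


samples = sample_truncated(stats.logistic(loc=0, scale=1), 0.0, 1.0,
                           size=1000, rng=np.random.default_rng(0))
print(samples.min(), samples.max())
```

Every uniform draw is used, so the sample size is exact and nothing is thrown away, unlike rejection sampling.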
What are you trying to achieve? The logistic distribution by definition has infinite range. If you truncate the results in any way, their distribution will change. If you just want random numbers in a range, there's random.random().
You could normalise your results to the maximum returned value:
>>> dist = stats.logistic.rvs(loc=0, scale=1, size=1000)
>>> norm_dist = dist / np.max(dist)
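This will keep the 'shape' the same and scale the largest value to 1. Note that negative draws stay negative, so on its own this does not confine the values to between 0 and 1. If you're doing repeated draws from a distribution, be sure to normalise all the draws to the same value (the max from all draws).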
However, you want to be pretty careful, if you're doing this kind of thing, that it makes sense within the context of what you are trying to achieve (which I don't have enough info to comment on...)
Related
I ran a SciPy function to detect the best-fit distribution for my data. It picked the Exponentiated Weibull distribution and output 4 parameter values. How can I generate a sample list of data of size n that honours this distribution?
I don't want to rewrite the function. Any Python package that does this would be helpful.
Usually you will use a ppf to generate from a random seed.
For a simple completely fake example, let's say we fit a uniform random variable (with values from 0 to 15) to a Weibull distribution. Then, create a seed random variable (from 0 to 1 because it's the value of the quantiles that you will get) and put it in the ppf function.
import scipy.stats as st
import numpy as np
# fitting part
samples = np.random.rand(10000)*15
dist = st.exponweib.fit(samples)
# generation part
sample_seed = np.random.rand(10000)
random_exponweib_samples = st.exponweib.ppf(sample_seed, *dist)
# plotting
import matplotlib.pyplot as plt
plt.hist(samples, label="uniform", alpha=.5)
plt.hist(random_exponweib_samples, label="exponweib", alpha=.5)
plt.legend()
plt.show()
You'll have something like the following.
Please be careful and check the documentation of the ppf for the Weibull distribution. In my call st.exponweib.ppf(sample_seed, *dist) I just use *dist, but it might be that the parameters should be passed differently; you can see this from the shape of the orange plot, which might not be correct.
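As a cross-check on the parameter order, a fitted SciPy distribution can also be frozen and sampled through rvs. A minimal sketch: for exponweib, fit returns the two shape parameters followed by loc and scale, which is the same order that ppf and the frozen constructor accept positionally. The shape values, sample sizes and the choice to pin loc at 0 are made up for illustration.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(0)

# Illustrative data: draws from an exponentiated Weibull with made-up shapes.
data = st.exponweib.rvs(2.0, 1.5, size=500, random_state=rng)

# fit returns (a, c, loc, scale); loc is pinned to 0 here (an assumption)
# to keep the fit stable.
params = st.exponweib.fit(data, floc=0)

# Freeze the distribution with the fitted parameters and sample from it.
fitted = st.exponweib(*params)
new_samples = fitted.rvs(size=1000, random_state=rng)

# Equivalent route: push uniform quantiles through the ppf.
ppf_samples = fitted.ppf(rng.random(1000))
```

Both routes draw from the same fitted distribution; rvs simply saves generating the uniform quantiles by hand.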
I've stumbled across this code in an answer to a question and I'd like to automate the process of getting the distribution to fit neatly between two bounds.
import numpy as np
from scipy import stats
bounds = [0, 100]
n = np.mean(bounds)
# your distribution:
distribution = stats.norm(loc=n, scale=20)
# percentile point, the range for the inverse cumulative distribution function:
bounds_for_range = distribution.cdf(bounds)
# Linspace for the inverse cdf:
pp = np.linspace(*bounds_for_range, num=1000)
x = distribution.ppf(pp)
# And just to check that it makes sense you can try:
from matplotlib import pyplot as plt
plt.hist(x)
plt.show()
Let's say I have the values [720, 965], or any other bounds, that I would like to fit my distribution across. Is there a way to soft-code the adjustment of scale in stats.norm to fit this distribution across my bounds without any unreasonable gaps? Or are there any functions that have this type of functionality?
A scale of ~20 works well for the example code, but I have to adjust it to ~50 for the example of [720, 965]
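I am not sure, but a truncated normal distribution should be exactly what you are looking for.
from scipy.stats import truncnorm
lower, upper = 720, 965  # desired bounds
loc, scale = (lower + upper) / 2, 50  # mean and standard deviation before truncation
# truncnorm takes its clip points in units of scale, measured from loc
a, b = (lower - loc) / scale, (upper - loc) / scale
distr_ab = truncnorm(a, b, loc=loc, scale=scale) # truncated normal distribution in the interval [lower, upper]
distr_ab.rvs(size=100) # get 100 samples from the distribution
# distr_ab.cdf, distr_ab.ppf etc... all accessible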
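To soft-code the scale for arbitrary bounds, one option is to tie it to the width of the interval. The sketch below is an assumption rather than an established rule: the helper bounded_normal and its width_per_sigma parameter are hypothetical, and the default of 5 is chosen only because it reproduces the scales mentioned in the question (20 for [0, 100], 49 for [720, 965]).

```python
import numpy as np
from scipy.stats import truncnorm


def bounded_normal(lower, upper, width_per_sigma=5.0):
    """Hypothetical helper: normal centred between the bounds, truncated to them.

    width_per_sigma is an assumed rule of thumb: scale = (upper - lower) / 5.
    """
    loc = (lower + upper) / 2
    scale = (upper - lower) / width_per_sigma
    # truncnorm expects its clip points in units of scale, measured from loc
    a, b = (lower - loc) / scale, (upper - loc) / scale
    return truncnorm(a, b, loc=loc, scale=scale)


dist = bounded_normal(720, 965)
x = dist.rvs(size=1000, random_state=np.random.default_rng(0))
print(x.min(), x.max())
```

A smaller width_per_sigma flattens the distribution across the interval; a larger one concentrates it around the midpoint.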
I referred to Fitting empirical distribution to theoretical ones with Scipy (Python)? and generated the best-fit distribution for my sample data. I wish to generate random numbers according to the best-fit distribution. See image below.
However, in https://docs.scipy.org/doc/numpy-1.13.0/reference/generated/numpy.random.f.html#numpy.random.f, there are only 3 parameters: dfnum, dfden, and size=None. Where should I insert loc and scale? By the way, the dfn and dfd in the best-fit distribution are floats, and numpy.random wants integers.
If I use only dfn and dfd in the code df_members['bd'] = df_members.bd.apply(lambda x: np.rint((np.random.f(dfnum=1441, dfden=19))) if x==-999 else x ), the generated values are wrong.
From the scipy.stats module you can use the f distribution and draw random values from it with the parameters you already found. The f.rvs method accepts the four parameters plus the size (the number of draws you want).
from scipy.stats import f
import matplotlib.pyplot as plt
values = f.rvs(1441.41, 19.1, -0.24, 26.5, 100000)
values is a 100000 length array with draws from the given distribution. You can see it as follows
plt.hist(values, bins=25)
plt.show()
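As for where loc and scale go with NumPy: SciPy's convention is that loc and scale shift and rescale the standard distribution, i.e. x = loc + scale * z, so the same transform can be applied by hand to NumPy draws. A minimal sketch using the parameter values from the answer above; the variable names are illustrative only.

```python
import numpy as np
from scipy.stats import f

dfn, dfd, loc, scale = 1441.41, 19.1, -0.24, 26.5
rng = np.random.default_rng(0)

# NumPy draws from the standard F distribution; shift and rescale by hand.
np_values = loc + scale * rng.f(dfn, dfd, size=100_000)

# SciPy applies the same loc/scale transform internally.
sp_values = f.rvs(dfn, dfd, loc=loc, scale=scale, size=100_000, random_state=rng)

# The two sample means should both sit close to the theoretical mean.
print(np_values.mean(), sp_values.mean(), f.mean(dfn, dfd, loc=loc, scale=scale))
```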
I'm pretty new to PyMC and I'm trying desperately to infer the parameters of an underlying Gaussian distribution that best fits a distribution of observed data that I have, not with a pre-built normal distribution, but with a more general method using histograms of the simulated data to build pdfs. But so far I can't get my code to converge, and I don't know why...
So here's a summary of what my code does.
I have a dataset of 5000 points distributed normally (mean=5,sigma=2). I want to retrieve these values (mean, sigma) with a bayesian inference (using MCMC).
I have a data simulator that generates for each iteration of the MCMC process a normal distribution of 5000 points with a random mean and sigma (uniform prior)
From the simulated distribution of points I compute a numpy histogram normed to 1 representing the pdf of the distribution (Nbins=int(sqrt(5000))). I then compute the mean and standard deviation of this distribution.
What I want is the set of parameters that will allow me to build a simulated distribution that best fits the observed data.
I use the most general definition of the log likelihood, that is:
ln L(θ|x)=∑ln(f(xi|θ)) (the likelihood function being defined as the probability distribution of the observed data given the parameters θ)
Then I interpolate linearly the histogram values for every bin center. Therefore I have a continuous pdf for the simulated distribution. So here f is the interpolated function I made from the histogram of the simulation.
I sum the log(f(xi)) contributions for every (real) data point and return the loglikelihood value at the end.
But some (real) data points are so far off the mean of the simulated distribution that f(xi)=0. For these points the code raises a math domain error (Reminder: log(0)=-inf). So I artificially set the pdf to a small epsilon for the points where it's usually set to 0.
But here's the thing. The loglikelihood is not computed for every iteration. And actually it is not computed at all, in the present architecture of my code. So that's why the MCMC process is not converging. But... I don't know why.
It turns out that building custom likelihood functions does not seem to be a very common approach in the PyMC community, where people usually prefer to use pre-built distributions. I'm having trouble finding help on these matters, so ideas and suggestions will be deeply appreciated :)
import numpy as np
import matplotlib.pyplot as plt
import math
import pymc as pm
from scipy.interpolate import InterpolatedUnivariateSpline
# Generate the data
np.random.seed(0)
N=5000
true_mean=5.
true_sigma = 2.
data = np.random.normal(true_mean,true_sigma,N)
#prior
m=pm.Uniform('m', lower=4, upper=6)
s=pm.Uniform('s', lower=1, upper=3)
@pm.deterministic
def data_simulator(mean_input=m,sig_input=s):
    out=np.empty(4,dtype=object)
    datasim = np.random.normal(mean_input,sig_input,N)
    hist, bin_edges = np.histogram(datasim, bins=int(math.sqrt(len(datasim))), density=True)
    bin_centers = (bin_edges[:-1] + bin_edges[1:])/2
    m_sim=np.mean(datasim)
    s_sim=np.std(datasim)
    out[0]=m_sim
    out[1]=s_sim
    out[2]=bin_centers
    out[3]=hist
    return out
@pm.stochastic(observed=True)
def logp(value=data,mean_output=data_simulator.value[0],sigma_output=data_simulator.value[1],bin_centers_sim=data_simulator.value[2],hist_sim=data_simulator.value[3]):
    interp_sim=InterpolatedUnivariateSpline(bin_centers_sim,hist_sim,k=1,ext=0) #returns the extrapolated values
    logp=np.sum(np.log(interp_sim(value)))
    print('logp=',logp)
    return logp
model = pm.Model({"mean": m,"sigma":s,"data_simulator":data_simulator,"loglikelihood":logp})
#Run the MCMC sampler
mcmc = pm.MCMC(model)
mcmc.sample(iter=10000, burn=5000)
#Plot the marginals
pm.Matplot.plot(mcmc)
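The likelihood step described in the question (histogram, linear interpolation between bin centres, epsilon floor, sum of logs) can be checked on its own, outside PyMC. The sketch below is not the asker's code: the helper histogram_loglik is hypothetical, it swaps InterpolatedUnivariateSpline for np.interp, and the floor eps=1e-12 is an assumed value.

```python
import numpy as np


def histogram_loglik(observed, simulated, eps=1e-12):
    """Log-likelihood of `observed` under a pdf estimated from `simulated`.

    Hypothetical helper; eps is an assumed floor that replaces zero density.
    """
    n_bins = int(np.sqrt(len(simulated)))
    hist, edges = np.histogram(simulated, bins=n_bins, density=True)
    centers = (edges[:-1] + edges[1:]) / 2
    # Linear interpolation between bin centres; zero outside the simulated range.
    pdf = np.interp(observed, centers, hist, left=0.0, right=0.0)
    # Floor the density so log never sees 0.
    return np.sum(np.log(np.maximum(pdf, eps)))


rng = np.random.default_rng(0)
data = rng.normal(5.0, 2.0, 5000)
good = histogram_loglik(data, rng.normal(5.0, 2.0, 5000))
bad = histogram_loglik(data, rng.normal(4.0, 1.0, 5000))
print(good, bad)
```

Calling the function with fresh simulated draws for each candidate (mean, sigma) shows whether the log-likelihood responds to the parameters at all, which is the behaviour the sampler depends on.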
What function can I use in Python if I want to sample a truncated integer power law?
That is, given two parameters a and m, generate a random integer x in the range [1,m) that follows a distribution proportional to 1/x^a.
I've been searching around numpy.random, but I haven't found this distribution.
AFAIK, neither NumPy nor SciPy defines this distribution for you. However, using SciPy it is easy to define your own discrete distribution function using scipy.stats.rv_discrete:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
def truncated_power_law(a, m):
    x = np.arange(1, m+1, dtype='float')
    pmf = 1/x**a
    pmf /= pmf.sum()
    return stats.rv_discrete(values=(range(1, m+1), pmf))
a, m = 2, 10
d = truncated_power_law(a=a, m=m)
N = 10**4
sample = d.rvs(size=N)
plt.hist(sample, bins=np.arange(m+1)+0.5)  # edges 0.5 .. m+0.5, so the value m gets a bin too
plt.show()
I don't use Python, so rather than risk syntax errors I'll try to describe the solution algorithmically. This is a brute-force discrete inversion. It should translate quite easily into Python. I'm assuming 0-based indexing for the array.
Setup:
Generate an array cdf of size m with cdf[0] = 1 as the first entry, cdf[i] = cdf[i-1] + 1/(i+1)**a for the remaining entries.
Scale all entries by dividing cdf[m-1] into each -- now they actually are CDF values.
Usage:
Generate your random values by generating a Uniform(0,1) and
searching through cdf[] until you find an entry greater than your
uniform. Return the index + 1 as your x-value.
Repeat for as many x-values as you want.
For instance, with a,m = 2,10, I calculate the probabilities directly as:
[0.6452579827864142, 0.16131449569660355, 0.07169533142071269, 0.04032862392415089, 0.02581031931145657, 0.017923832855178172, 0.013168530260947229, 0.010082155981037722, 0.007966147935634743, 0.006452579827864143]
and the CDF is:
[0.6452579827864142, 0.8065724784830177, 0.8782678099037304, 0.9185964338278814, 0.944406753139338, 0.9623305859945162, 0.9754991162554634, 0.985581272236501, 0.9935474201721358, 1.0]
When generating, if I got a Uniform outcome of 0.90 I would return x=4 because 0.918... is the first CDF entry larger than my uniform.
If you're worried about speed you could build an alias table, but with a geometric decay the probability of early termination of a linear search through the array is quite high. With the given example, for instance, you'll terminate on the first peek almost 2/3 of the time.
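The setup and usage steps above translate into Python roughly as follows. This is a sketch, not the answerer's own code: the function names are hypothetical, and it replaces the linear search with NumPy's binary search (np.searchsorted), which finds the same entry.

```python
import numpy as np


def build_cdf(a, m):
    """CDF table for P(x) proportional to 1/x**a on x = 1..m."""
    weights = 1.0 / np.arange(1, m + 1, dtype=float) ** a
    cdf = np.cumsum(weights)
    return cdf / cdf[-1]


def sample_power_law(cdf, size, rng=None):
    """Invert uniforms: index of the first CDF entry greater than u, plus 1."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(size)
    return np.searchsorted(cdf, u, side="right") + 1


cdf = build_cdf(a=2, m=10)
x = sample_power_law(cdf, size=10_000, rng=np.random.default_rng(0))
```

With side="right", searchsorted returns the position of the first entry strictly greater than the uniform, so a draw of 0.90 maps to x=4 as in the worked example.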
Use numpy.random.zipf and just reject any samples greater than or equal to m. Note that numpy.random.zipf requires a > 1.
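A minimal sketch of that rejection approach, assuming a > 1 (NumPy's zipf sampler is only defined there). The helper name truncated_zipf is hypothetical; it keeps drawing batches until enough accepted values have been collected, so the requested sample size is preserved.

```python
import numpy as np


def truncated_zipf(a, m, size, rng=None):
    """Hypothetical helper: draw `size` integers in [1, m) with P(x) ~ 1/x**a.

    Requires a > 1, because numpy's zipf sampler is only defined there.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = np.empty(0, dtype=np.int64)
    while out.size < size:
        draws = rng.zipf(a, size=size)
        # Keep only the draws below m; the rest are rejected.
        out = np.concatenate([out, draws[draws < m]])
    return out[:size]


x = truncated_zipf(a=2.0, m=10, size=10_000, rng=np.random.default_rng(0))
```

For exponents close to 1 and small m, a large share of draws is rejected, so the loop may need several batches.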