Custom PDF from scipy.stats.rv_continuous has an unwanted upper bound - python

I am attempting to generate random samples from a probability density function for QSOs of a certain luminosity, with the form:
1 / ((L/L_B^*)^alpha + (L/L_B^*)^beta)
where L_B^*, alpha, and beta are all constants. To do this, the following code is used:
import scipy.stats as st

logLbreak = 43.88
alpha = 3.4
beta = 1.6

class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        # "l_L" in this is always log L
        L = 10**(l_L/logLbreak)
        D = 1/(L**alpha + L**beta)
        return D

dist_Log_L = my_pdf(momtype = 0, a = 0, name='l_L_dist')
distro = dist_Log_L.rvs(size = 10000)
(10 is raised to a power here since everything is being done on a log scale)
The distribution is supposed to produce a histogram that approximates the expected curve, trailing off to infinity, but in reality the histogram it produces is cut off at an upper bound (10,000 samples). The upper bound is the same regardless of the number of samples used. Is there a reason it is being restricted in this way?

Your PDF is not properly normalized. The integral of a PDF over the domain must be 1. Your PDF integrates to approximately 3.4712:
In [72]: from scipy.integrate import quad
In [73]: quad(dist_Log_L._pdf, 0, 100)
Out[73]: (3.4712183965415373, 2.0134487716044682e-11)
In [74]: quad(dist_Log_L._pdf, 0, 800)
Out[74]: (3.4712184965748905, 2.013626296581202e-11)
In [75]: quad(dist_Log_L._pdf, 0, 1000)
Out[75]: (3.47121849657489, 8.412130378805368e-10)
This will break the class's implementation of inverse transform sampling. It will only generate samples from the domain up to the point where the integral of the PDF from 0 to x first reaches 1.0, which in your case is about 2.325:
In [81]: quad(dist_Log_L._pdf, 0, 2.325)
Out[81]: (1.0000875374350238, 1.1103202107010366e-14)
That is, in fact, what you see in your histogram.
As a quick fix to verify the issue, I modified the return statement of the _pdf() method to:
return D/3.47121849657489
and ran your script again. (In a real fix, that value will be a function of the other parameters.) Then the commands
In [85]: import matplotlib.pyplot as plt
In [86]: plt.hist(distro, bins=31)
generates this plot:
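For a more general fix, the normalization constant can be computed numerically from the parameters instead of being hard-coded. A minimal sketch (the unnorm_pdf helper and the integration limit of 1000 are my choices, not from the original post):
import scipy.stats as st
from scipy.integrate import quad

logLbreak = 43.88
alpha = 3.4
beta = 1.6

def unnorm_pdf(l_L):
    # The unnormalized luminosity function from the question.
    L = 10**(l_L/logLbreak)
    return 1/(L**alpha + L**beta)

# Integrate once over the support; the quad calls above show the tail
# contributes nothing past ~1000.
norm_const, _ = quad(unnorm_pdf, 0, 1000)

class my_pdf(st.rv_continuous):
    def _pdf(self, l_L):
        return unnorm_pdf(l_L)/norm_const

dist_Log_L = my_pdf(momtype=0, a=0, name='l_L_dist')
distro = dist_Log_L.rvs(size=10000)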

Related

How can I use rv_continuous to generate random samples from a gaussian kde?

I've been stuck on this for a while now. Here's my problem: I have a set of observed data. I want to turn this data into a PMF, use a Gaussian kernel density estimator to estimate a continuous PDF from the observed data, and then be able to sample from that PDF.
I am using SciPy, and have managed to get the kernel density estimation to work. I think that to sample from it I need to subclass rv_continuous and overwrite the _pdf method, returning the evaluation of my kernel density estimate. However, when I define the kernel density estimate in my rv_continuous subclass's __init__ method, I am unable to call .rvs() on the resulting class to sample from it. But when I define a function separately and call that function in the rv_continuous subclass's _pdf method, it works fine.
This sounds confusing, I know, but see what I'm talking about in the code below.
from scipy.stats import gaussian_kde, rv_continuous
import numpy.random as npr

# Create fake data just to test if this works
test_data = [0, 8, 12, 35, 40, 4, 1, 0, 0, 0]
output = []
for entry in test_data:
    number_obs = entry
    for i in range(number_obs):
        mao = npr.uniform()
        output.append(mao)

# First, what I would like to work
class mao_pdf(rv_continuous):
    """
    Class for creating a pdf, round-by-round, from which samples may be drawn.
    """
    def __init__(self, data):
        super(rv_continuous, self).__init__()
        self.kde = gaussian_kde(data, bw_method = 0.18)
    def _pdf(self, x):
        return self.kde.evaluate(x)[0]

pdf = mao_pdf(output)
print(pdf.rvs())  # This does not work

# Now, what paradoxically works (but is really the exact same thing,
# just in a convoluted way)
test_kde = gaussian_kde(output, bw_method = 0.18)
def f(x):
    return test_kde.evaluate(x)[0]

class test_pdf(rv_continuous):
    def _pdf(self, x):
        return f(x)

pdf = test_pdf(a = 0, b = 1)
print(pdf.rvs())  # This one works
So it seems like it might have something to do with the bounds (setting a = 0, b = 1), but for the life of me I cannot figure out why this is so critical or how to even implement this in my mao_pdf class. I tried just defining self.a = 0 and self.b = 1 in the __init__() method of my mao_pdf class, but that did not work.
This really should not be so complicated; I'm just trying to turn actual observed data into a sample-able continuous probability density function. Any help is greatly appreciated.
Actually, it's very simple to sample from a KDE distribution, without having to calculate the PDF of the KDE:
1. Choose a data point uniformly at random (with replacement).
2. Add a normally distributed random number to the data point, with a mean of 0 and a standard deviation equal to the bandwidth.
In fact, this distribution can be sampled even without SciPy, as long as you know the bandwidth:
import numpy

gen = numpy.random.Generator(numpy.random.PCG64())
test_data = [0, 8, 12, 35, 40, 4, 1, 0, 0, 0]
# Choose 10 data points from `test_data` at random.
c = gen.integers(0, len(test_data), size=10)
c = numpy.asarray([float(test_data[d]) for d in c])
# Add a Gaussian jitter, one independent draw per point.
# Use the bandwidth factor in your example.
c += gen.normal(0, 0.18, size=len(c))
Note that the code above requires NumPy 1.17 or later, which includes an improved system for random number generation. If you must still use NumPy 1.16 or earlier, the following code can be used:
import numpy
test_data = [0, 8, 12, 35, 40, 4, 1, 0, 0, 0]
# Choose 10 data points from `test_data` at random.
c = numpy.random.randint(0, len(test_data), size=10)
c = numpy.asarray([float(test_data[d]) for d in c])
c += numpy.random.normal(0, 0.18, size=len(c))  # Use the bandwidth factor in your example
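As a side note, SciPy's gaussian_kde itself provides a resample method that implements exactly this scheme (pick a point, add kernel noise), so for the original use case no rv_continuous subclass is strictly needed. A minimal sketch, reusing the question's output data and bandwidth:
from scipy.stats import gaussian_kde

test_kde = gaussian_kde(output, bw_method=0.18)
samples = test_kde.resample(size=10)  # returns shape (1, 10) for 1-D input
print(samples.ravel())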

Confidence Interval for Sample Mean in Python (Different from Manual)

I'm trying to create some material for an introductory statistics seminar. The code below computes a 95% confidence interval for estimating the mean, but the result is not the same as the one produced by the Python library. Is there something wrong with my math or code? Thanks.
EDIT:
Data was sampled from here
import pandas as pd
import numpy as np
x = np.random.normal(60000,15000,200)
income = pd.DataFrame()
income['Data Scientist'] = x
# Manual Implementation
sample_mean = income['Data Scientist'].mean()
sample_std = income['Data Scientist'].std()
standard_error = sample_std / (np.sqrt(income.shape[0]))
print('Mean',sample_mean)
print('Std',sample_std)
print('Standard Error',standard_error)
print('(',sample_mean-2*standard_error,',',sample_mean+2*standard_error,')')
# Python Library
import scipy.stats as st
se = st.sem(income['Data Scientist'])
a = st.t.interval(0.95, len(income['Data Scientist'])-1, loc=sample_mean, scale=se)
print(a)
print('Standard Error from this code block',se)
You've got two errors.
First, you are using 2 as the multiplier for the CI. The more accurate value is 1.96; "2" is just a convenient approximation, and it is making your manually generated CI too wide.
Second, you are comparing a normal distribution to the t-distribution. This probably isn't causing more than decimal dust in the difference, because you have 199 degrees of freedom for the t-distribution, which is essentially the normal.
Below is the z-score check for 1.96 and the computation of the CI, with an apples-to-apples comparison against the normal distribution:
In [45]: st.norm.cdf(1.96)
Out[45]: 0.9750021048517795
In [46]: print('(',sample_mean-1.96*standard_error,',',sample_mean+1.96*standard_error,')')
( 57558.007862202685 , 61510.37559873406 )
In [47]: st.norm.interval(0.95, loc=sample_mean, scale=se)
Out[47]: (57558.044175045005, 61510.33928589174)
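For completeness, the exact t multiplier can be pulled from st.t.ppf; with 199 degrees of freedom it is about 1.97, which is why the t and normal intervals nearly coincide. A quick sketch, reusing sample_mean and standard_error from the question:
# Exact t critical value for a 95% CI with 199 degrees of freedom (~1.972).
t_mult = st.t.ppf(0.975, df=199)
print('(', sample_mean - t_mult*standard_error, ',', sample_mean + t_mult*standard_error, ')')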

How to vectorize a function with array lookups

I'm trying to vectorize my fitness function for a Minimum Vector Cover genetic algorithm, but I'm at a loss about how to do it.
As it stands now:
vert_cover_fitness = [1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges for edge in edges]
The dna is a one-dimensional binary array of length n, where each index corresponds to a vertex and its value indicates whether that vertex has been chosen. edges is a two-dimensional positive integer array, where each value corresponds to a vertex (an index into dna). Both are ndarrays.
Simply put: if one of the vertices connected by an edge is "selected", then we get a score of one. If not, the function is penalized by -num_edges.
I have tried np.vectorize as an attempt to get away cheap with a lambda function:
fit_func = np.vectorize(lambda edge: 1 if self.dna[edge[0]] or self.dna[edge[1]] else -num_edges)
vert_cover_fitness = fit_func(edges)
This raises IndexError: invalid index to scalar variable, as the function is applied to each scalar value rather than to each row.
To fix this I tried np.apply_along_axis. This works but it's just a wrapper for a loop so I'm not getting any speedups.
If any Numpy wizards can see some obvious way to do this, I would much appreciate your help. I'm guessing a problem lies with the representation of the problem, and that changing either the dna or edges shapes could help. I'm just not skilled enough to see what I should do.
I came up with this bit of NumPy code; it runs 30x faster than your for loop on my randomly generated data.
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges * 2).reshape(-1, 2)
vert_cover_fitness1 = [1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]
# Start with the penalty everywhere, then overwrite covered edges with 1.
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1
print((vert_cover_fitness1 == vert_cover_fitness2).all())  # this shows it's correct
Here is the timeit code used to measure the speedup.
import timeit
setup = """
import numpy as np
num_vertices = 1000
num_edges = 500
dna = np.random.choice([0, 1], num_vertices)
edges = np.random.randint(0, num_vertices, num_edges*2).reshape(-1, 2)
"""
python_loop = "[1 if dna[edge[0]] or dna[edge[1]] else -num_edges for edge in edges]"
print(timeit.timeit(python_loop, setup, number=1000))
vectorised="""
vert_cover_fitness2 = np.full([num_edges], -num_edges)
mask = (dna[edges[:, 0]] | dna[edges[:, 1]]).astype(bool)
vert_cover_fitness2[mask] = 1.0
"""
print(timeit.timeit(vectorised, setup, number=1000))
# prints:
# 0.375906624016352
# 0.012783741112798452
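For what it's worth, the masked assignment can also be collapsed into a single np.where call (a sketch, assuming the same dna and edges arrays as in the setup above):
import numpy as np

# 1 where either endpoint of the edge is selected, -num_edges otherwise.
vert_cover_fitness3 = np.where(dna[edges[:, 0]] | dna[edges[:, 1]], 1, -num_edges)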

Errors about using scipy.stats.uniform

I am using Python 3.4 with Spyder 2.3.4.
I want to use the uniform distribution function from scipy.stats.uniform. I wrote two different pieces of code for importing this function, and only one works; I don't know why. I also have a question about the results from the working one. Since my three questions are all based on the same code, it is better to ask them all in one post.
The code that does not work (Code 1)
import scipy as sp
a = 0
b = 1
mean = (a+b)/2
variance = 1/12*(b-a)**2
std = sp.sqrt(variance)
rv = sp.stats.uniform.rvs(loc = mean, scale = std, size = 10)
a = rv(>=0.5)*1
print(a)
when I run the program, it says "AttributeError: 'module' object has no attribute 'stats'".
The code that works (Code 2)
import scipy as sp
from scipy.stats import uniform
# form a uniform distribution between 0 and 1.
a = 0
b = 1
mean = (a+b)/2
variance = 1/12*(b-a)**2
std = sp.sqrt(variance)
rv = uniform.rvs(loc = mean, scale = std, size = 10)
a = (rv>=0.5)*1
print(a)
I am sure there is no syntax error; it runs fine on my computer.
My questions:
Since "stats" is in "scipy", why does sp.stats.uniform in Code 1 not work? The difference between Code 1 and Code 2 is that in Code 1 I only import scipy as sp and then try to call the function as sp.stats.uniform, while in Code 2 I first run from scipy.stats import uniform and then call the function as uniform.
In Code 2, the following syntax also works:
rv = sp.stats.uniform.rvs(loc = mean, scale = std, size = 6)
It seems that once I run from scipy.stats import uniform (as at the beginning of Code 2), I can also use sp.stats.uniform. Why?
As you can see in Code 2, I want to generate 10 random points from a uniform distribution between 0 and 1. In theory, its mean is 0.5 and its standard deviation is sqrt(1/12). I expect the 10 generated points to fall roughly half below 0.5 and half above 0.5. But the resulting "a" contains only 1s, meaning "rv" is always larger than 0.5. Did I use the "uniform" function correctly?
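Regarding the third question: scipy.stats.uniform is parameterized by loc (the lower bound) and scale (the width of the interval), not by mean and standard deviation, so uniform.rvs(loc=mean, scale=std) draws from U(0.5, 0.5 + std), which is why every value came out above 0.5. A minimal sketch of the intended usage:
from scipy.stats import uniform

# U(a, b) is uniform.rvs(loc=a, scale=b-a); here U(0, 1).
rv = uniform.rvs(loc=0, scale=1, size=10)
print((rv >= 0.5)*1)  # roughly half ones and half zeros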

How to truncate a numpy/scipy exponential distribution in an efficient way?

I'm currently building a neuroscience experiment. Basically, a stimulus is presented for 3 seconds every x seconds (x = inter-trial interval). I would like x to be rather short (mean = 2.5) and unpredictable.
My idea is to draw random samples from an exponential distribution truncated at 1 (lower bound) and 10 (upper bound). I would like the resulting bounded exponential distribution to have an expected mean of 2.5. How could I do that in an efficient way?
There are two ways to do this:
The first is to generate an exponentially distributed random variable and then restrict the values to (1, 10).
In [14]:
import matplotlib.pyplot as plt
import scipy.stats as ss
Lambda = 2.5 #expected mean of exponential distribution is lambda in Scipy's parameterization
Size = 1000
trc_ex_rv = ss.expon.rvs(scale=Lambda, size=Size)
trc_ex_rv = trc_ex_rv[(trc_ex_rv>1)&(trc_ex_rv<10)]
In [15]:
plt.hist(trc_ex_rv)
plt.xlim(0, 12)
Out[15]:
(0, 12)
In [16]:
trc_ex_rv
Out[16]:
array([...]) #a lot of numbers
Of course, the problem is you are not going to get the exact number of random numbers (defined by Size here).
The other way to do it is to use Inverse transform sampling, and you will get the exact number of replicates as specified:
In [17]:
import numpy as np
def trunc_exp_rv(low, high, scale, size):
    rnd_cdf = np.random.uniform(ss.expon.cdf(x=low, scale=scale),
                                ss.expon.cdf(x=high, scale=scale),
                                size=size)
    return ss.expon.ppf(q=rnd_cdf, scale=scale)
In [18]:
plt.hist(trunc_exp_rv(1, 10, Lambda, Size))
plt.xlim(0, 12)
Out[18]:
(0, 12)
If you want the resulting bounded distribution to have an expected mean of a given value, say 2.5, you need to solve for the scale parameter that results in that expected mean.
import scipy.optimize as so
def solve_for_l(low, high, ept_mean):
    A = np.array([low, high])
    return 1/so.fmin(lambda L: ((np.diff(np.exp(-A*L)*(A*L+1)/L)/np.diff(np.exp(-A*L)))-ept_mean)**2,
                     x0=0.5,
                     full_output=False, disp=False)

def F(low, high, ept_mean, size):
    return trunc_exp_rv(low, high,
                        solve_for_l(low, high, ept_mean),
                        size)

rv_data = F(1, 10, 2.5, int(1e5))
plt.hist(rv_data, bins=50)
plt.xlim(0, 12)
print(rv_data.mean())
Result:
2.50386617882
In addition to @CT Zhu's great answer, it appears that scipy now has a truncated exponential distribution built in.
from scipy.stats import truncexpon
# b is the cutoff in standardized form; support is [loc, loc + b*scale]
b = (10 - 1)/2.5
r = truncexpon.rvs(b, loc=1, scale=2.5, size=1000)
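To also hit the requested truncated mean of 2.5 exactly, this can be combined with the solve_for_l helper from the answer above (a sketch reusing that function):
# Solve for the scale that gives the truncated distribution a mean of 2.5,
# then draw the exact number of samples from the built-in distribution.
scale = solve_for_l(1, 10, 2.5)[0]
r = truncexpon.rvs(b=(10 - 1)/scale, loc=1, scale=scale, size=1000)
print(r.mean())  # should land close to 2.5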
