Vectorized sampling of multiple binomial random variables - python

I would like to sample a few hundred binomially distributed random variables, each with a different n and p (using the argument names as defined in the numpy.random.binomial docs). I'll be doing this repeatedly, so I'd like to vectorize the code if possible. Here's an example:
import numpy as np
# Made up parameters
N_random_variables = 500
n_vals = np.random.randint(100, 201, N_random_variables)  # integers from 100 to 200 inclusive
p_vals = np.random.random_sample(N_random_variables)
# Can this portion be vectorized?
results = np.empty(N_random_variables)
for i in range(N_random_variables):
    results[i] = np.random.binomial(n_vals[i], p_vals[i])
In the special case that n and p are the same for each random variable, I can do:
import numpy as np
# Made up parameters
N_random_variables = 500
n_val = 150
p_val = 0.5
# Vectorized code
results = np.random.binomial(n_val, p_val, N_random_variables)
Can this be generalized to the case when n and p take different values for each random variable?

Yes. np.random.binomial accepts arrays for n and p and broadcasts them, so you can pass both arrays directly and drop the loop:
import numpy as np
# Made up parameters
N_random_variables = 500
n_vals = np.random.randint(100, 201, N_random_variables)  # integers from 100 to 200 inclusive
p_vals = np.random.random_sample(N_random_variables)
# Vectorized: one draw per (n, p) pair
results = np.random.binomial(n_vals, p_vals)
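Since the sampling is repeated many times, the repeats can be vectorized as well. The following is a minimal sketch using NumPy's newer Generator interface (np.random.default_rng); the seed and the number of repeats are assumptions chosen for illustration only.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seed chosen only for reproducibility

n_vals = rng.integers(100, 201, size=500)  # per-variable trial counts, 100..200 inclusive
p_vals = rng.random(size=500)              # per-variable success probabilities

# One draw per (n, p) pair: the parameter arrays are broadcast against each other
single = rng.binomial(n_vals, p_vals)

# 1000 repeated draws per pair: the trailing dimension of `size`
# must match the length of the parameter arrays
repeated = rng.binomial(n_vals, p_vals, size=(1000, 500))
```

Each column of `repeated` then holds the repeated draws for one (n, p) pair.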

Related

How to generate random numbers at the tails of an exponential distribution?

I want to generate random numbers like from np.random.exponential but clipped / truncated at values a,b. For example, if a=100, b=500 then I want the function to generate random numbers following e^(-x) in the range [100, 500].
An inefficient way would be:
rands = np.random.exponential(size=10**7)
rands = rands[(rands>a) & (rands<b)]
Is there an existing package that can do this for me? Ideally for various distributions, not just exponential.
Filtering the output of the exponential generator, as proposed in the question, has two problems.
First, we lose values: if we wanted 10**7 values, we might keep only 10**6 of them, or far fewer.
Second, np.random.exponential() with its default scale of 1 draws from [0, inf) with mean 1, so almost no draws land between 100 and 500 and the filter returns an essentially empty array. The values have to be generated inside the target range directly rather than filtered afterwards.
I wrote a workaround using exp(uniform). I tested your solution using smaller values of a and b (so that we don't get empty arrays). Timing both shows the workaround takes about half as long.
import time
import numpy as np
import matplotlib.pyplot as plt

def truncated_exp_OP(a, b, how_many):
    rands = np.random.exponential(size=how_many)
    rands = rands[(rands > a) & (rands < b)]
    return rands

def truncated_exp_NK(a, b, how_many):
    a = -np.log(a)
    b = -np.log(b)
    rands = np.exp(-(np.random.rand(how_many) * (b - a) + a))
    return rands

timeTakenOP = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_OP(0.001, 0.39, 10**7)
    endTime = time.time()
    timeTakenOP.append(endTime - startTime)
print("OP solution: ", np.mean(timeTakenOP))
plt.hist(r.flatten(), 300)
plt.show()

timeTakenNK = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_NK(100, 500, 10**7)
    endTime = time.time()
    timeTakenNK.append(endTime - startTime)
print("NK solution: ", np.mean(timeTakenNK))
plt.hist(r.flatten(), 300)
plt.show()
Average run time in seconds:
OP solution: 0.28491891622543336
NK solution: 0.1437338709831238
[Histograms of the generated random numbers for OP's approach and for this approach; images not included]
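One caveat about the exp(uniform) workaround: exponentiating a uniformly distributed exponent produces values that are log-uniform on [a, b], which is not the same shape as e^(-x) restricted to that range. If the e^(-x) shape matters, inverse-CDF sampling is a common alternative. Below is a minimal sketch; the function name truncated_exponential and its rate and rng parameters are assumptions introduced here, not part of the original answer.

```python
import numpy as np

def truncated_exponential(a, b, size, rate=1.0, rng=None):
    """Draw from an exponential(rate) distribution conditioned on a <= x <= b."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.random(size)  # uniform on [0, 1)
    # Inverse CDF of the truncated distribution:
    #   x = a - log(1 - u * (1 - exp(-rate * (b - a)))) / rate
    # expm1/log1p keep the computation accurate when the arguments are tiny.
    return a - np.log1p(u * np.expm1(-rate * (b - a))) / rate

samples = truncated_exponential(100, 500, 100_000, rng=np.random.default_rng(0))
```

Every uniform draw maps to exactly one value in [a, b], so no samples are discarded. For an existing package, scipy.stats.truncexpon covers the exponential case, and scipy.stats.truncnorm does the same for the normal distribution.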

Creating a vector of values based off a test using a for loop

This feels like it should be a simple problem, but I am new to Python. In R I would use a foreach loop, which gives me an option to combine the results.
I have tried a for loop that lets me print out all the values I need, but I want them collected into a vector of values that I can use later.
from scipy.stats import gamma
import scipy.stats as stats
import numpy as np
import random
data2 = np.random.gamma(1,2, size = 500)
gammT = np.log(data2 + 1)
mean = np.mean(gammT)
sd = np.std(gammT)
a = (mean/ sd)**2
b = (sd**2)/ mean
for i in range(1,100):
    gammT = random.sample(list(gammT), 500)
    gamm = np.random.gamma(a,b, size = len(gammT))
    s = stats.anderson_ksamp([gammT,gamm])
    s = s[2]
    print(s)
So I am able to print all the values I want, but I want them all gathered together in a vector of values. I have tried appending to lists but have not been able to collect them.
Initialize an empty list before the loop and append each value to it:
from scipy.stats import gamma
import scipy.stats as stats
import numpy as np
import random
data2 = np.random.gamma(1,2, size = 500)  # same made-up data as in the question
gammT = np.log(data2 + 1)
mean = np.mean(gammT)
sd = np.std(gammT)
a = (mean/ sd)**2
b = (sd**2)/ mean
# initialize empty list
result = []
for i in range(100):
    # range(100) instead of range(1,100): you only need range(100) for 100 elements
    gammT = random.sample(list(gammT), 500)
    gamm = np.random.gamma(a,b, size = len(gammT))
    s = stats.anderson_ksamp([gammT,gamm])
    s = s[2]
    # append calculation to list
    result.append(s)
    print(s)
print(result)
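If a NumPy vector is preferred over a plain list, the loop body can be wrapped in a small function and the results collected with a list comprehension. This is only a sketch: the helper name one_test, the seed, and the reduced number of repetitions are assumptions made for illustration.

```python
import numpy as np
import scipy.stats as stats

rng = np.random.default_rng(0)  # seed chosen only for reproducibility

# Made-up data, as in the question
data = np.log(rng.gamma(1, 2, size=500) + 1)
shape = (data.mean() / data.std()) ** 2
scale = data.std() ** 2 / data.mean()

def one_test(sample):
    """Compare the sample with a fresh gamma draw; return the third result field."""
    simulated = rng.gamma(shape, scale, size=len(sample))
    return stats.anderson_ksamp([sample, simulated])[2]

# Collect one value per repetition into a 1-D array
results = np.array([one_test(data) for _ in range(20)])
```

The array can then be used directly in later computations, for example `results.mean()` or `(results < 0.05).sum()`. SciPy may warn that the reported significance level is capped or floored; that does not affect how the values are collected.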

Use Python lmfit with a variable number of parameters in function

I am trying to deconvolve complex gas chromatogram signals into individual gaussian signals. Here is an example, where the dotted line represents the signal I am trying to deconvolve.
I was able to write the code to do this using scipy.optimize.curve_fit; however, once applied to real data the results were unreliable. I believe being able to set bounds to my parameters will improve my results, so I am attempting to use lmfit, which allows this. I am having a problem getting lmfit to work with a variable number of parameters. The signals I am working with may have an arbitrary number of underlying gaussian components, so the number of parameters I need will vary. I found some hints in the question "Creating a python lmfit Model with arbitrary number of parameters", but still can't figure it out.
Here is the code I am currently working with. The code will run, but the parameter estimates do not change when the model is fit. Does anyone know how I can get my model to work?
import numpy as np
from collections import OrderedDict
from scipy.stats import norm
from lmfit import Parameters, Model
def add_peaks(x_range, *pars):
    y = np.zeros(len(x_range))
    for i in np.arange(0, len(pars), 3):
        curve = norm.pdf(x_range, pars[i], pars[i+1]) * pars[i+2]
        y = y + curve
    return(y)
# generate some fake data
x_range = np.linspace(0, 100, 1000)
peaks = [50., 40., 60.]
a = norm.pdf(x_range, peaks[0], 5) * 2
b = norm.pdf(x_range, peaks[1], 1) * 0.1
c = norm.pdf(x_range, peaks[2], 1) * 0.1
fake = a + b + c
param_dict = OrderedDict()
for i in range(0, len(peaks)):
    param_dict['pk' + str(i)] = peaks[i]
    param_dict['wid' + str(i)] = 1.
    param_dict['mult' + str(i)] = 1.
# In case, you'd like to see the plot of fake data
#y = add_peaks(x_range, *param_dict.values())
#plt.plot(x_range, y)
#plt.show()
# Initialize the model and fit
pmodel = Model(add_peaks)
params = pmodel.make_params()
for i in param_dict.keys():
    params.add(i, value=param_dict[i])
result = pmodel.fit(fake, params=params, x_range=x_range)
print(result.fit_report())
I think you would be better off using lmfit's ability to build composite models.
That is, with a single peak defined with
from scipy.stats import norm
from lmfit import Model
def peak(x, amp, center, sigma):
    return amp * norm.pdf(x, center, sigma)
(see also lmfit.models.GaussianModel), you can build a model with many peaks:
npeaks = 3
model = Model(peak, prefix='p1_')
for i in range(1, npeaks):
    model = model + Model(peak, prefix='p%d_' % (i+1))
params = model.make_params()
Now model will be a sum of 3 Gaussian functions, and the params created for that model will have names like p1_amp, p1_center, p2_amp, ..., to which you can add sensible initial values, bounds, and/or constraints.
Given your example data, you could pass initial values to make_params like
params = model.make_params(p1_amp=2.0, p1_center=50., p1_sigma=2,
                           p2_amp=0.2, p2_center=40., p2_sigma=2,
                           p3_amp=0.2, p3_center=60., p3_sigma=2)
result = model.fit(fake, params, x=x_range)
I was able to find a solution here:
https://lmfit.github.io/lmfit-py/builtin_models.html#example-3-fitting-multiple-peaks-and-using-prefixes
Building on the code above, the following accomplishes what I was trying to do...
import matplotlib.pyplot as plt
from lmfit.models import GaussianModel
gauss1 = GaussianModel(prefix='g1_')
gauss2 = GaussianModel(prefix='g2_')
gauss3 = GaussianModel(prefix='g3_')
gauss4 = GaussianModel(prefix='g4_')
gauss5 = GaussianModel(prefix='g5_')
gauss = [gauss1, gauss2, gauss3, gauss4, gauss5]
prefixes = ['g1_', 'g2_', 'g3_', 'g4_', 'g5_']
# Sum only as many Gaussian components as there are peaks
mod = np.sum(gauss[0:len(peaks)])
pars = mod.make_params()
for i, prefix in zip(range(0, len(peaks)), prefixes[0:len(peaks)]):
    pars[prefix + 'center'].set(peaks[i])
init = mod.eval(pars, x=x_range)
out = mod.fit(fake, pars, x=x_range)
print(out.fit_report(min_correl=0.5))
out.plot_fit()
plt.show()

How to specify size for bernoulli distribution with pymc3?

In trying to make my way through Bayesian Methods for Hackers, which is in pymc, I came across this code:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, size=N)
I've tried to translate this to pymc3 with the following, but it just returns a numpy array, rather than a tensor (?):
first_coin_flips = pm.Bernoulli("first_flips", 0.5).random(size=50)
The reason the size matters is that it's used later on in a deterministic variable. Here's the entirety of the code that I have so far:
import pymc3 as pm
import matplotlib.pyplot as plt
import numpy as np
import mpld3
import theano.tensor as tt
model = pm.Model()
with model:
    N = 100
    p = pm.Uniform("cheating_freq", 0, 1)
    true_answers = pm.Bernoulli("truths", p)
    print(true_answers)
    first_coin_flips = pm.Bernoulli("first_flips", 0.5)
    second_coin_flips = pm.Bernoulli("second_flips", 0.5)
    # print(first_coin_flips.value)
    # Create model variables
    def calc_p(true_answers, first_coin_flips, second_coin_flips):
        observed = first_coin_flips * true_answers + (1-first_coin_flips) * second_coin_flips
        # NOTE: Where I think the size param matters, since we're dividing by it
        return observed.sum() / float(N)
    calced_p = pm.Deterministic("observed", calc_p(true_answers, first_coin_flips, second_coin_flips))
    step = pm.Metropolis(model.free_RVs)
    trace = pm.sample(1000, tune=500, step=step)
    pm.traceplot(trace)
    html = mpld3.fig_to_html(plt.gcf())
    with open("output.html", 'w') as f:
        f.write(html)
        f.close()
And the output:
The coin flips and uniform cheating_freq output look correct, but the observed doesn't look like anything to me, and I think it's because I'm not translating that size param correctly.
The pymc3 way to specify the size of a Bernoulli distribution is by using the shape parameter, like:
first_coin_flips = pm.Bernoulli("first_flips", 0.5, shape=N)
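To see why the length-N vectors matter for the deterministic variable, here is a NumPy-only simulation of the same formula. It is not PyMC3 code: the fixed cheating frequency p_cheat, the seed, and the larger N are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only for reproducibility
N = 100_000
p_cheat = 0.3  # hypothetical true cheating frequency

# One Bernoulli draw per respondent, i.e. vectors of length N
truths = rng.binomial(1, p_cheat, size=N)
first = rng.binomial(1, 0.5, size=N)
second = rng.binomial(1, 0.5, size=N)

# Same expression as calc_p in the question
observed = first * truths + (1 - first) * second
proportion = observed.sum() / N

# With length-N vectors the proportion approaches 0.5 * p + 0.25
expected = 0.5 * p_cheat + 0.25
print(proportion, expected)
```

With scalar variables, `observed.sum()` can only be 0 or 1, so dividing by N gives at most 1/N. That is consistent with the "observed" trace not looking meaningful until each Bernoulli variable has N entries.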

Calculating tvalue using numpy

As part of an exercise, I needed to check whether a given sample's true mean is 1.75 by computing the t-value with numpy and comparing it with the output from scipy.
Code:
import numpy as np
import scipy.stats as stats
import matplotlib.pyplot as plt
np.random.seed(seed=42) # make example reproducible
n = 100
x = np.random.normal(loc=1.78, scale=.1, size=n) # the sample is here
tval, pval = stats.ttest_1samp(x, 1.75)
var_x = x.var(ddof=1)
std_x = np.sqrt(var_x)
tval1 = (x.mean() - 1.75)/(std_x*np.sqrt(n))
print("Scipy: ",tval,"\nNumpy: ",tval1)
The output from scipy is 2.1598800019529265,
while the output from numpy is 0.021598800019529265.
I guess the logic I used is incorrect. Please advise.
You made a mistake in the denominator. It should be
tval1 = (x.mean() - 1.75)/(std_x / np.sqrt(n)) # (std_x divided by root n)
That is why there is a factor of 100 between your scipy and numpy outputs: with n = 100, multiplying by sqrt(n) = 10 instead of dividing by it changes the result by 10 * 10 = 100.
See the Wikipedia article on Student's t-test for the formula.
An example using another sample size:
np.random.seed(seed=42)
n = 369
x = np.random.normal(loc=1.78, scale=.1, size=n) # the sample is here
tval, pval = stats.ttest_1samp(x, 1.75)
var_x = x.var(ddof=1)
std_x = np.sqrt(var_x)
tval1 = (x.mean() - 1.75)/(std_x / np.sqrt(n))
print("Scipy: ",tval,"\nNumpy: ",tval1)
# Output:
# Scipy: 6.306500305262841
# Numpy: 6.306500305262841
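The calculation can be wrapped in a small function that also returns the two-sided p-value, which makes it easy to check both numbers against scipy. This is a minimal sketch; the function name one_sample_t is an assumption, and the seeded generator differs from np.random.seed, so the printed values will not match the ones above.

```python
import numpy as np
from scipy import stats

def one_sample_t(x, popmean):
    """One-sample t statistic and two-sided p-value."""
    x = np.asarray(x, dtype=float)
    n = x.size
    standard_error = x.std(ddof=1) / np.sqrt(n)  # sample std divided by sqrt(n)
    t = (x.mean() - popmean) / standard_error
    p = 2 * stats.t.sf(abs(t), df=n - 1)         # two-sided tail probability
    return t, p

rng = np.random.default_rng(42)
x = rng.normal(loc=1.78, scale=0.1, size=100)

t_manual, p_manual = one_sample_t(x, 1.75)
t_scipy, p_scipy = stats.ttest_1samp(x, 1.75)
print("Manual:", t_manual, p_manual)
print("Scipy: ", t_scipy, p_scipy)
```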
