Bootstrapping: Is there a faster way? - python

I'm trying to compute the bootstrap distribution of the total of an array, and I'm wondering if this can be made faster, please?
from numpy import array, ndarray
from numpy.random import choice

def bootstrap(observed_array: ndarray, number_of_bootstraps: int = 10000) -> ndarray:
    number_of_elements = len(observed_array)
    bootstrap_estimates = []
    for _ in range(number_of_bootstraps):
        # Resample the observations with replacement and record the total of the resample
        indices = choice(number_of_elements, size=number_of_elements, replace=True)
        bootstrap_sample = observed_array[indices]
        bootstrap_estimate = bootstrap_sample.sum()
        bootstrap_estimates.append(bootstrap_estimate)
    return array(bootstrap_estimates)
Thanks for any suggestions here.
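One way to speed this up, memory permitting, is to draw all of the bootstrap indices in a single call and sum along an axis, so the Python-level loop disappears. A minimal sketch (the name bootstrap_vectorized is mine, and it assumes the whole number_of_bootstraps x n index matrix fits in memory):
import numpy as np

def bootstrap_vectorized(observed_array: np.ndarray, number_of_bootstraps: int = 10000) -> np.ndarray:
    n = len(observed_array)
    rng = np.random.default_rng()
    # One row of resample indices per bootstrap replicate
    indices = rng.integers(0, n, size=(number_of_bootstraps, n))
    # Sum each resampled row to get one bootstrap estimate of the total per replicate
    return observed_array[indices].sum(axis=1)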

Related

Is there a better way to solve this MINLP in pyscipopt?

I'm trying to solve the following MINLP, basically attempting to maximize the likelihood of a certain portfolio reaching a "ceiling" performance. My first attempt at the code is below.
EDIT: Math says maximize, should say minimize
from pyscipopt import Model, quicksum
import numpy as np
import pandas as pd
from random import uniform, normalvariate

model = Model()
t = 20000
stocks_portfolio = {}
# Use a list (not a set) for the column names so their order is preserved
stocks_df = pd.DataFrame(np.zeros((150, 4)), columns=['ids', 'Mean', 'cost', 'stdev'])
noptions = len(stocks_df)
stocks_df['ids'] = [i for i in range(noptions)]
stocks_df['Mean'] = [uniform(500, 2500) for i in range(noptions)]
stocks_df['cost'] = [stocks_df.loc[i, 'Mean'] * uniform(50, 250) for i in range(noptions)]
stocks_df['stdev'] = [stocks_df.loc[i, 'Mean'] * uniform(0.2, 0.5) for i in range(noptions)]
cov_mat = np.array([[normalvariate(0, 0.3) for i in range(noptions)] for j in range(noptions)])

for i in range(len(stocks_df)):
    stocks_portfolio[i] = model.addVar(vtype='B')

model.addCons(quicksum(stocks_portfolio[i] for i in range(noptions)) == 15)
model.addCons(quicksum(stocks_df.loc[i, 'cost'] * stocks_portfolio[i] for i in range(noptions)) <= 600000)

stand_in = model.addVar(vtype='C')
model.addCons(
    stand_in >= (t - quicksum(stocks_df.loc[i, 'Mean'] * stocks_portfolio[i] for i in range(noptions)))
    / ((quicksum(stocks_portfolio[i] * stocks_df.loc[i, 'stdev']**2 for i in range(noptions))
        + quicksum(2 * stocks_portfolio[i] * stocks_portfolio[j] * cov_mat[i, j]
                   for i in range(noptions) for j in range(noptions)))**0.5)
)
model.setObjective(stand_in, 'minimize')
model.optimize()
model.getCondition()

portfolios = []
for i in range(noptions):
    if model.getVal(stocks_portfolio[i]) > 0.9:
        portfolios.append(i)
The performance here has been slow and unwieldy, and I was wondering if I'm thinking about the question all wrong.
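One reformulation sometimes suggested for constraints of this shape is to eliminate the division and the square root by squaring both sides, so the solver only has to handle a polynomial constraint. A sketch reusing the names above (this is only equivalent when stand_in and the shortfall t minus expected return are both non-negative, so treat it as an idea to test rather than a drop-in replacement):
variance = (quicksum(stocks_portfolio[i] * stocks_df.loc[i, 'stdev']**2 for i in range(noptions))
            + quicksum(2 * stocks_portfolio[i] * stocks_portfolio[j] * cov_mat[i, j]
                       for i in range(noptions) for j in range(noptions)))
shortfall = t - quicksum(stocks_df.loc[i, 'Mean'] * stocks_portfolio[i] for i in range(noptions))
# stand_in >= shortfall / sqrt(variance) becomes, after squaring both sides,
# stand_in**2 * variance >= shortfall**2 (valid only for non-negative stand_in and shortfall)
model.addCons(stand_in * stand_in * variance >= shortfall * shortfall)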

How to generate random numbers at the tails of an exponential distribution?

I want to generate random numbers like from np.random.exponential but clipped / truncated at values a,b. For example, if a=100, b=500 then I want the function to generate random numbers following e^(-x) in the range [100, 500].
An inefficient way would be:
a, b = 100, 500
rands = np.random.exponential(size=10**7)
rands = rands[(rands > a) & (rands < b)]  # element-wise &, not the Python "and"
Is there an existing package that can do this for me? Ideally for various distributions, not just exponential.
If we just clip the values after using the exponential generator, the approach proposed in the question has two problems.
First, we lose values: if we ask for 10**7 samples, we might end up with only about 10**6 of them after filtering.
Second, with the default scale of 1, np.random.exponential() almost never produces values as large as 100, so we can't simply use 100 and 500 as the lower and upper bounds; the bounds (or the generated numbers) have to be rescaled first, otherwise the filter returns an empty array.
The workaround below draws exp(uniform) directly in the requested range. I tested the question's approach with smaller values of a and b (so that we don't get empty arrays); timing both shows the workaround cuts the run time roughly in half.
import time
import numpy as np
import matplotlib.pyplot as plt

def truncated_exp_OP(a, b, how_many):
    rands = np.random.exponential(size=how_many)
    rands = rands[(rands > a) & (rands < b)]
    return rands

def truncated_exp_NK(a, b, how_many):
    a = -np.log(a)
    b = -np.log(b)
    rands = np.exp(-(np.random.rand(how_many) * (b - a) + a))
    return rands

timeTakenOP = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_OP(0.001, 0.39, 10**7)
    endTime = time.time()
    timeTakenOP.append(endTime - startTime)

print("OP solution: ", np.mean(timeTakenOP))
plt.hist(r.flatten(), 300)
plt.show()

timeTakenNK = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_NK(100, 500, 10**7)
    endTime = time.time()
    timeTakenNK.append(endTime - startTime)

print("NK solution: ", np.mean(timeTakenNK))
plt.hist(r.flatten(), 300)
plt.show()
Average run times:
OP solution: 0.28491891622543336
NK solution: 0.1437338709831238
The histogram plots of the random numbers are shown below:
OP's approach:
This approach:
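As for the "existing package" part of the question: scipy ships scipy.stats.truncexpon (and truncated versions of several other distributions, e.g. truncnorm), which samples from a truncated exponential directly. A minimal sketch, assuming a unit-rate exponential (scale=1) truncated to [a, b]:
import numpy as np
from scipy.stats import truncexpon

a, b, scale = 100, 500, 1.0
# In scipy's parametrization the support is [loc, loc + shape*scale],
# so truncation to [a, b] maps to loc=a and shape=(b - a) / scale
dist = truncexpon(b=(b - a) / scale, loc=a, scale=scale)
samples = dist.rvs(size=10**6)
print(samples.min(), samples.max())  # everything lies inside [100, 500]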

Creating a vector of values based off a test using a for loop

This feels like it should be a simple problem, but I am newer to Python; in R I would use a foreach loop, which gives me an option to combine the results.
I have tried a for loop that lets me print out all the values I need, but I want them collected into a vector of values that I can use later.
from scipy.stats import gamma
import scipy.stats as stats
import numpy as np
import random

data2 = np.random.gamma(1, 2, size=500)
gammT = np.log(data2 + 1)
mean = np.mean(gammT)
sd = np.std(gammT)
a = (mean / sd)**2
b = (sd**2) / mean

for i in range(1, 100):
    gammT = random.sample(list(gammT), 500)
    gamm = np.random.gamma(a, b, size=len(gammT))
    s = stats.anderson_ksamp([gammT, gamm])
    s = s[2]
    print(s)
So I am able to print all the values I want, but I want them gathered together in a vector of values. I have tried appending and making lists but have not been able to get them collected together.
from scipy.stats import gamma
import scipy.stats as stats
import numpy as np
import random

# data2 as defined in the question
data2 = np.random.gamma(1, 2, size=500)
gammT = np.log(data2 + 1)
mean = np.mean(gammT)
sd = np.std(gammT)
a = (mean / sd)**2
b = (sd**2) / mean

# initialize an empty list
result = []
for i in range(100):
    # removed (1, 100): you only need range(100) for 100 elements
    gammT = random.sample(list(gammT), 500)
    gamm = np.random.gamma(a, b, size=len(gammT))
    s = stats.anderson_ksamp([gammT, gamm])
    s = s[2]
    # append the calculation to the list
    result.append(s)
    print(s)

print(result)
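If the end goal is a NumPy vector rather than a Python list, the same collection step can also be written as a comprehension wrapped in np.array; a small sketch reusing the names defined above (statistically equivalent, since resampling all 500 values without replacement is just a shuffle):
result = np.array([
    stats.anderson_ksamp([random.sample(list(gammT), 500),
                          np.random.gamma(a, b, size=500)])[2]
    for _ in range(100)
])
print(result.shape)  # (100,)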

How to optimize for loops for generating a new random Poisson array in python?

I want to read a grayscale image, say something with shape (248, 480, 3), use each element as the lam value to draw a Poisson random value, and build a new data set with the same shape. I want to repeat this nscan times, add the results together, and plot the average so that it looks similar to the original image. This code works, but it is extremely slow; is there any way to make it faster?
import numpy as np
import matplotlib.pyplot as plt

my_image = plt.imread('myimage.png')

def genP(data):
    new_data = np.zeros(data.shape)
    for i in range(data.shape[0]):
        for j in range(data.shape[1]):
            for k in range(data.shape[2]):
                new_data[i, j, k] = np.random.poisson(lam=data[i, j, k])
    return new_data

def get_total(data, nscan=1):
    total = genP(data)
    for i in range(nscan):
        total += genP(data)
    total = total / nscan
    plt.imshow(total)
    plt.show()

get_total(my_image, 100)
numpy.random.poisson can entirely replace your genP() function, because lam can be an array rather than a scalar; this is basically guaranteed to be much faster. From the docs: "If size is None (default), a single value is returned if lam is a scalar. Otherwise, np.array(lam).size samples are drawn."
def get_total(data, nscan=1):
    total = np.random.poisson(lam=data)
    for i in range(nscan):
        total += np.random.poisson(lam=data)
    total = total / nscan
    plt.imshow(total)
    plt.show()
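If memory allows, the remaining loop can be removed as well by drawing all the scans in one call via the size argument and averaging over the first axis. A sketch (the name get_total_stacked is mine, and it assumes nscan copies of the image fit in memory):
def get_total_stacked(data, nscan=1):
    # lam broadcasts against the leading axis, giving nscan Poisson images at once
    stack = np.random.poisson(lam=data, size=(nscan,) + data.shape)
    total = stack.mean(axis=0)
    plt.imshow(total)
    plt.show()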

Vectorized sampling of multiple binomial random variables

I would like to sample a few hundred binomially distributed random variables, each with a different n and p (using the argument names as defined in the numpy.random.binomial docs). I'll be doing this repeatedly, so I'd like to vectorize the code if possible. Here's an example:
import numpy as np

# Made up parameters
N_random_variables = 500
n_vals = np.random.randint(100, 201, N_random_variables)  # random_integers is deprecated; randint's upper bound is exclusive
p_vals = np.random.random_sample(N_random_variables)

# Can this portion be vectorized?
results = np.empty(N_random_variables)
for i in range(N_random_variables):
    results[i] = np.random.binomial(n_vals[i], p_vals[i])
In the special case that n and p are the same for each random variable, I can do:
import numpy as np
# Made up parameters
N_random_variables = 500
n_val = 150
p_val = 0.5
# Vectorized code
results = np.random.binomial(n_val, p_val, N_random_variables)
Can this be generalized to the case when n and p take different values for each random variable?
Here you go,
import numpy as np

# Made up parameters
N_random_variables = 500
n_vals = np.random.randint(100, 201, N_random_variables)  # random_integers is deprecated; randint's upper bound is exclusive
p_vals = np.random.random_sample(N_random_variables)

# Can this portion be vectorized? Yes: binomial accepts array-valued n and p
results = np.random.binomial(n_vals, p_vals)
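On newer NumPy versions the same thing is usually written with the Generator API, which also broadcasts array-valued n and p; a small sketch:
rng = np.random.default_rng(seed=0)
n_vals = rng.integers(100, 201, size=500)
p_vals = rng.random(500)
# one binomial draw per (n, p) pair, vectorized
results = rng.binomial(n_vals, p_vals)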
