How to sample from a linspace without replacement in batches - python

I'd like to sample n random numbers from a linspace without replacement and do so in batches. Thus, each sample in the batch should not have repeated numbers, but numbers may repeat across the batch.
The following code shows how I do it by calling Generator.choice repeatedly.
import numpy as np

low, high = 0, 10
sample_shape = (3,)
n = 5
rng = np.random.default_rng()  # or previously instantiated RNG
space = np.linspace(start=low, stop=high, num=1000)
samples = np.stack(
    [
        rng.choice(space, size=n, replace=False)
        for _ in range(np.prod(sample_shape, dtype=int))
    ]
)
samples = samples.reshape(sample_shape + (n,))
print(f"samples.shape: {samples.shape}")
print(samples)
Current output:
samples.shape: (3, 5)
[[4.15415415 5.56556557 1.38138138 7.78778779 7.03703704]
[1.48148148 6.996997 0.91091091 3.28328328 2.93293293]
[7.82782783 9.65965966 9.94994995 5.84584585 5.26526527]]
However, this procedure turns out to be a big bottleneck in my code. Is there a more efficient way of performing this?
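A common way to vectorize this (a sketch, not from the original post): draw a single matrix of uniform noise, argsort each row to get an independent random permutation per row, and keep the first n column indices; each row then contains n distinct indices into space.
import numpy as np

low, high, n = 0, 10, 5
sample_shape = (3,)
rng = np.random.default_rng()
space = np.linspace(start=low, stop=high, num=1000)

# argsort of i.i.d. uniforms yields a uniformly random permutation per row,
# so the first n columns give n distinct indices, independently for each row
batch = int(np.prod(sample_shape))
idx = rng.random((batch, space.size)).argsort(axis=-1)[:, :n]
samples = space[idx].reshape(sample_shape + (n,))
print(samples.shape)  # (3, 5)
This replaces the Python-level loop with one argsort over a (batch, num) array, at the cost of O(num log num) work per row; for num=1000 it is typically much faster than repeated rng.choice calls.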

Related

How can I vectorize this Python for loop?

I am trying to count the number of events above various thresholds. I used a for loop over the thresholds, but there are too many events, so it takes too much time.
So I want to vectorize this code and reduce the compute time. Can I get some help?
array_ = np.zeros(bin_number, dtype=int)
for i in range(bin_number):
    mask_1 = array_ML[:, 0] > i
    masked_array = array_ML[mask_1]
    mask_2 = masked_array[:, 2] == 0
    masked_array = masked_array[mask_2]
    array_[i] = masked_array.shape[0]
There may be a dedicated function in NumPy that does this for you, but otherwise, the following simplifications are likely to speed up your code significantly:
import numpy as np

# Create example data
array_ML = np.random.randint(0, 1000, (10000, 200))
array_ML[:, 2] = np.where(array_ML[:, 2] > 500, 0, 1)
bin_number = 100
array_ = np.zeros(bin_number, dtype=int)

# Filter what we can before the loop
mask = array_ML[:, 2] == 0
temp = array_ML[mask, 0]

# Just count, by summing the condition
for i in range(bin_number):
    array_[i] = np.sum(temp > i)
With the above example data, my timings (using %%time in Jupyter notebook cells) drop from 439 ms for the original code to 3.86 ms for the code above.
Of course, the speed-up depends heavily on the shape of your input, the distribution of the data, and bin_number; my timings serve only as an indication.
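The remaining loop can be removed as well. A sketch (my addition, reusing temp, bin_number, and array_ from above): sort the filtered values once, then let np.searchsorted count, for all thresholds in one call, how many values lie above each.
# For each threshold i, the count of values > i equals the number of
# elements to the right of i's insertion point in the sorted array
sorted_temp = np.sort(temp)
thresholds = np.arange(bin_number)
array_ = sorted_temp.size - np.searchsorted(sorted_temp, thresholds, side="right")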

How to generate random numbers at the tails of an exponential distribution?

I want to generate random numbers like from np.random.exponential but clipped / truncated at values a,b. For example, if a=100, b=500 then I want the function to generate random numbers following e^(-x) in the range [100, 500].
An inefficient way would be:
rands = np.random.exponential(size=10**7)
rands = rands[(rands > a) & (rands < b)]
Is there an existing package that can do this for me? Ideally for various distributions, not just exponential.
If we simply clip values drawn from the exponential generator, there are two problems with the approach proposed in the question.
First, we lose values: if we ask for 10**7 draws, only a fraction of them survive the filtering.
Second, with the default scale of 1, np.random.exponential() almost never produces values as large as 100 (the probability of a single draw exceeding 100 is e^-100), so with a=100 and b=500 the filtered array will be empty in practice; the random numbers have to be mapped into the target interval instead.
I wrote a workaround that samples the truncated distribution directly via inverse-transform sampling. I tested your solution using smaller values of a and b (so that we don't get empty arrays); a timed comparison shows the workaround is around 50% faster.
import time
import numpy as np
import matplotlib.pyplot as plt
def truncated_exp_OP(a, b, how_many):
    rands = np.random.exponential(size=how_many)
    rands = rands[(rands > a) & (rands < b)]
    return rands

def truncated_exp_NK(a, b, how_many):
    # Inverse-transform sampling: map uniforms onto the slice of the
    # exponential CDF between a and b, then invert it analytically
    u = np.random.rand(how_many)
    return -np.log(np.exp(-b) + u * (np.exp(-a) - np.exp(-b)))
timeTakenOP = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_OP(0.001, 0.39, 10**7)
    endTime = time.time()
    timeTakenOP.append(endTime - startTime)

print("OP solution:", np.mean(timeTakenOP))
plt.hist(r, 300)
plt.show()

timeTakenNK = []
for i in range(20):
    startTime = time.time()
    r = truncated_exp_NK(100, 500, 10**7)
    endTime = time.time()
    timeTakenNK.append(endTime - startTime)

print("NK solution:", np.mean(timeTakenNK))
plt.hist(r, 300)
plt.show()
Average run times:
OP solution: 0.28491891622543336
NK solution: 0.1437338709831238
(The original post includes histogram plots of the random numbers produced by both approaches.)
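As for the "existing package" part of the question: scipy.stats ships truncated distributions directly, e.g. truncexpon (and truncnorm for the normal case). A minimal sketch, assuming unit scale; truncexpon's shape parameter is the truncation point measured from loc:
from scipy.stats import truncexpon

a, b = 100, 500
# The shape parameter is the support width in standard form,
# so the support becomes [loc, loc + (b - a)] = [100, 500]
samples = truncexpon(b - a, loc=a).rvs(size=10**6)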

Sequential Sampling

To sample from N(1, 2) with sample size 100 and calculate the mean of this sample, we can do this:
import numpy as np
s = np.random.normal(1, 2, 100)
mean = np.mean(s)
Now, if we want to produce 10000 samples and save the mean of each of them, we can do:
sample_means = []
for x in range(10000):
    sample = np.random.normal(1, 2, 100)
    sample_means.append(sample.mean())
How can I do this when I want to sample sequentially from N(1, 2) and estimate the distribution mean sequentially?
IIUC, you mean the cumulative (running) mean:
sample = np.random.normal(1, 2, (10000, 100))
sample_mean = []
for i, _ in enumerate(sample):
    sample_mean.append(sample[: i + 1, :].ravel().mean())
Then sample_mean contains the cumulative sample means:
sample_mean[:10]
[1.1185342714036368,
1.3270808654923423,
1.3266440422140355,
1.2542028664103761,
1.179358517854582,
1.1224645540064788,
1.1416887857272255,
1.1156887336750463,
1.0894328800573165,
1.0878896099712452]
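Note that the loop above re-reads all earlier rows at every step, so the total work grows quadratically with the number of samples. A vectorized sketch of the same cumulative mean (my addition, not from the original answer) keeps a running sum instead:
import numpy as np

sample = np.random.normal(1, 2, (10000, 100))
# Running total of all values seen so far, divided by the running count
running_sum = np.cumsum(sample.sum(axis=1))
running_count = sample.shape[1] * np.arange(1, sample.shape[0] + 1)
sample_mean = running_sum / running_count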
Maybe a list comprehension?
sample_means = [np.random.normal(1, 2, 100).mean() for i in range(10000)]
Tip: use lower case to name variables in Python.

numpy random array values between -1 and 1

What is the best way to create a NumPy array of a given size, with values randomly and uniformly spread between -1 and 1?
I tried 2*np.random.rand(size)-1
I'm not sure about that. Try:
s = np.random.uniform(-1, 1, size)
reference: https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.uniform.html
If what you actually want is evenly spaced (not random) values, you can use numpy.arange:
import numpy as np

print(np.arange(start=-1.0, stop=1.0, step=0.2))
The step parameter controls the spacing and thereby the number of elements. Note that this produces a deterministic grid, not random samples.
In your solution, np.random.rand(size) returns random floats in the half-open interval [0.0, 1.0),
so 2 * np.random.rand(size) - 1 returns numbers in the half-open interval [-1, 1), i.e. including -1 but excluding 1.
If that is what you want, it is fine.
But if you wish to generate numbers in the open interval (-1, 1), i.e. excluding both -1 and 1, may I suggest the following:
from numpy.random import default_rng

rg = default_rng(2)
size = (5, 5)
rand_arr = rg.random(size)             # magnitudes, uniform in [0, 1)
rand_signs = rg.choice([-1, 1], size)  # a random sign for each entry
rand_arr = rand_arr * rand_signs       # values in (-1, 1)
print(rand_arr)
I have used the new Generator API recommended by NumPy; see https://numpy.org/devdocs/reference/random/index.html#quick-start
This also works (note the size argument, so you get an array rather than a single float):
a = np.random.uniform(-1, 1, size)
print(a)

Vectorized sampling of multiple binomial random variables

I would like to sample a few hundred binomially distributed random variables, each with a different n and p (using the argument names as defined in the numpy.random.binomial docs). I'll be doing this repeatedly, so I'd like to vectorize the code if possible. Here's an example:
import numpy as np

# Made up parameters
N_random_variables = 500
n_vals = np.random.randint(100, 201, N_random_variables)  # inclusive low, exclusive high
p_vals = np.random.random_sample(N_random_variables)

# Can this portion be vectorized?
results = np.empty(N_random_variables)
for i in range(N_random_variables):
    results[i] = np.random.binomial(n_vals[i], p_vals[i])
In the special case that n and p are the same for each random variable, I can do:
import numpy as np
# Made up parameters
N_random_variables = 500
n_val = 150
p_val = 0.5
# Vectorized code
results = np.random.binomial(n_val, p_val, N_random_variables)
Can this be generalized to the case when n and p take different values for each random variable?
Here you go:
import numpy as np

# Made up parameters
N_random_variables = 500
n_vals = np.random.randint(100, 201, N_random_variables)
p_vals = np.random.random_sample(N_random_variables)

# Yes, it can be vectorized: binomial accepts array-valued n and p
results = np.random.binomial(n_vals, p_vals)
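This works because np.random.binomial broadcasts array-valued n and p against each other, drawing one variate per pair. The same holds for the newer Generator API; a small usage sketch (my addition):
import numpy as np

rng = np.random.default_rng()
n_vals = rng.integers(100, 201, size=500)  # trial counts in [100, 200]
p_vals = rng.random(500)                   # success probabilities in [0, 1)
results = rng.binomial(n_vals, p_vals)     # one draw per (n, p) pair
print(results.shape)  # (500,)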
