I would like to find the fastest way to generate ~10^9 Poisson random numbers in Python/NumPy. For instance, say I have a Poisson mean parameter (calculated elsewhere) of shape (1000, 2000), and I need 500 independent samples of it. This is a bottleneck in my code, taking several minutes to complete. I have tried three methods, but am looking for something faster:
import numpy as np
# example parameters
nsamples = 500
nmeas = 2000
ninputs = 1000
lambdax = np.ones([ninputs, nmeas]) * 20
# numpy, one big array
sample0 = np.random.poisson(lam=lambdax, size=(nsamples, ninputs, nmeas))
# numpy, current version where other code happens in the loop
sample1 = np.zeros([nsamples, ninputs, nmeas])
for i in range(nsamples):
    sample1[i, :, :] = np.random.poisson(lam=lambdax)
# scipy
from scipy.stats import poisson
sample2 = poisson.rvs(lambdax, size=(nsamples, ninputs, nmeas))
Results:
sample0: 1 m 16 s
sample1: 1 m 20 s
sample2: 1 m 50 s
Not shown here: I am also parallelizing over the independent samples via multiprocessing, but the calculations are still quite expensive for parameters this large. Is there a better way?
I have been in your shoes and here are my suggestions:
For large mean values, the Poisson distribution is well approximated by a normal distribution. Check out this post (and probably more if you search).
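A minimal sketch of that approximation (my addition, not from the linked post): for a large mean such as lam = 20, draw Gaussian samples with matching mean and variance, then round and clip at zero. I use a smaller number of samples here to keep memory modest; scale up as needed.
import numpy as np
lam = np.ones([1000, 2000]) * 20
# Poisson(lam) is approximately Normal(mean=lam, std=sqrt(lam)) for large lam
approx = np.random.normal(loc=lam, scale=np.sqrt(lam), size=(5, 1000, 2000))
# round to integers and clip negatives to stay on the Poisson support
approx = np.clip(np.rint(approx), 0, None).astype(np.int64)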
~1 min of runtime seems reasonable for generating that many random numbers, and I don't think you can beat the sample0 method by much through coding alone. What helps next depends on what you want to do with the random numbers:
If your issue is rerunning the program multiple times, try saving sample0 to a file and reloading it on subsequent runs.
If not, I suggest generating a smaller number of random values and reusing them. Many of the values in sample0 will be repeated anyway, depending on your mean value, so you could create a much smaller pool and draw from it at random: a value that would appear in sample0 over 100 times can simply be drawn from the pool that many times instead (see the sketch below).
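A sketch of that reuse idea (my addition), assuming the mean is roughly constant across the array as in the example; the pool size of 10**6 is an arbitrary choice. Resampling indices is much cheaper than drawing fresh Poisson variates.
import numpy as np
lam = 20
pool = np.random.poisson(lam, size=10**6)  # drawn once
# reuse pool values by sampling indices uniformly at random
idx = np.random.randint(0, pool.size, size=(5, 1000, 2000))
sample = pool[idx]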
If you provide more information on what you intend to do with the random numbers, we might be able to help more. Otherwise, coding-wise, I am not sure you can do much further.
Related
I need to sample multiple (M) times from N different normal distributions. This repeated sampling will happen in turn several thousand times. I want to do this in the most efficient way, because I would like to not die of old age before this process ends. The code would look something like this:
import numpy as np
# bunch of stuff that is unrelated to the problem
number_of_repeated_processes = 5000
number_of_samples_per_process = 20
# the normal distributions I'm sampling from are described by 2 vectors:
#
# myMEANS <- a numpy array of length 10 containing the means of the distributions
# myVAR <- a numpy array of length 10 containing the variances of the distributions
for i in range(number_of_repeated_processes):
    # myRESULT is a list of arrays containing the results of the sampling;
    # note that np.random.normal's scale is the standard deviation, hence the sqrt
    myRESULT = [np.random.normal(loc=myMEANS[j], scale=np.sqrt(myVAR[j]),
                                 size=number_of_samples_per_process)
                for j in range(10)]
    # here do something with myRESULT
# end for loop
The question is: is there a better way to obtain the myRESULT matrix?
np.random.normal accepts arrays for loc and scale directly (note that scale is the standard deviation, so pass np.sqrt(myVAR) if myVAR holds variances), and you can choose a size that covers all the sampling in one run, without loops:
myRESULT = np.random.normal(loc=myMEANS, scale=np.sqrt(myVAR), size=(number_of_samples_per_process, number_of_repeated_processes, myMEANS.size))
This returns a number_of_samples_per_process by number_of_repeated_processes array for each mean-variance pair in your myMEANS-myVAR arrays. For example, to access the samples for myMEANS[i]-myVAR[i], use myRESULT[..., i]. This should boost your performance considerably.
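A quick sanity check of the vectorized call (my own sketch; the toy values for myMEANS and myVAR are assumptions):
import numpy as np
myMEANS = np.arange(10, dtype=float)
myVAR = np.ones(10) * 4.0
samples = np.random.normal(loc=myMEANS, scale=np.sqrt(myVAR),
                           size=(20, 5000, myMEANS.size))
print(samples.shape)            # (20, 5000, 10)
print(samples[..., 3].mean())   # close to myMEANS[3] == 3.0
print(samples[..., 3].std())    # close to sqrt(myVAR[3]) == 2.0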
I have browsed SO extensively and I have found many questions about generating all possible permutations, but none regarding generating a specific number of permutations.
I developed, thanks to many SO questions, a decent permutation test routine. However, I have to repeat it many times, and it is taking too long.
My code:
def exact_mc_perm_test(ys, nmc, boolean_selection):
    # ys: all time series values
    # boolean_selection: mask selecting the subsample
    # nmc: number of Monte Carlo shuffles
    # observed difference in means between the subsample and all values
    mean_ys = np.mean(ys)
    diff = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
    k = 0
    for j in np.arange(nmc):
        # in-place shuffling
        np.random.shuffle(ys)
        # difference between the fixed overall mean and the shuffled subsample mean
        diff_shuffled = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
        k += diff < diff_shuffled
    return k / nmc
I took this SO answer and modified it for my specific test.
I have to run it over a 3D array stored in an xarray Dataset. The dataset has (lon, lat, time) coordinates, and I need to run the test for each (lon, lat) position (along the time dimension).
I run it using itertools.chain:
for ii in chain.from_iterable(zip(*dataset.variable())):
    ind_x = dataset.lon == ii.lon
    ind_y = dataset.lat == ii.lat
    # the subsample is selected inside the test via boolean_selection
    dataset.perm_test[ind_y, ind_x] = exact_mc_perm_test(ii.values, 1000., selected_position)
Ideally I want to run a permutation test with 20000 permutations. The two loops (over (lon, lat) positions and over 20000 shuffles) add up.
I am looking to speed up the permutation test code.
Therefore I thought about generating a 2D array of shape (len(ys), 20000), essentially 20000 shuffled copies of the ys array, and then accessing them all at once to calculate the 20000 differences (diff in the code). (Or find a trade-off between memory usage and looping, so maybe do 5 loops of 4000 shuffles at a time.)
I could not figure out or find a way to do this.
The permutations command from itertools generates all the possible permutations, which in my case are too many to handle.
I have looked at the random library but couldn't find anything that fits my needs. Any suggestions?
Take a look at compress() and permutations() from itertools:
from itertools import compress, permutations
for perm in compress(permutations(iterable, r=length), boolean_selection):
    print(perm)
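That said, the vectorized approach described in the question is straightforward in plain numpy. Here is a sketch (my addition, not part of the answer above; the function name is my own): build an (nmc, len(ys)) array of independently shuffled copies via the argsort-of-random-keys trick, then compute all nmc statistics at once.
import numpy as np

def perm_test_vectorized(ys, nmc, boolean_selection):
    ys = np.asarray(ys)
    mean_ys = np.mean(ys)
    diff = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
    # each row of idx is an independent uniform random permutation
    idx = np.argsort(np.random.rand(nmc, ys.size), axis=1)
    shuffled = ys[idx]                                    # shape (nmc, len(ys))
    diffs = np.abs(shuffled[:, boolean_selection].mean(axis=1) - mean_ys)
    return np.mean(diff < diffs)
For 20000 shuffles this allocates an array of shape (20000, len(ys)), so chunking into, say, 5 batches of 4000 keeps memory bounded, as the question suggests.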
I am interested in learning whether any code or package has been published that can help me with the following problem:
An event takes place 30 times.
Each event can return 6 different values (0,1,2,3,4,5), each with their own unique probability.
I would like to estimate the probability that the total of the values, after all events have been simulated, is above some X (e.g. 24).
The issue I have is that I can't, for a given event where the value is 3, simply multiply the probability of value 3 by 3 and add it to the previously obtained values. Instead I need to simulate every single variation that is possible.
Is there any relatively simple solution to solve this issue?
First of all, what you're describing isn't scenario analysis. That said, Python can be used to estimate complex probabilities where an analytical solution might be hard or impossible to find.
Assuming an event takes place 30 times, with outcomes [0,1,2,3,4,5], and each outcome has a probability of occurring given by the list (for example) p = [.1,.2,.2,.3,.1,.1], you can approximate the probability that the sum of all 30 events is greater than X with:
import numpy as np
X = 80
p = [.1, .2, .2, .3, .1, .1]
np.mean([sum(np.random.choice(a=[0, 1, 2, 3, 4, 5], size=30, p=p)) > X
         for i in range(10000)])
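A fully vectorized variant (my own sketch, not from the answer above) draws all 10000 x 30 outcomes in one call and sums along the event axis, which is typically much faster than the list comprehension:
import numpy as np
X = 80
p = [.1, .2, .2, .3, .1, .1]
draws = np.random.choice([0, 1, 2, 3, 4, 5], size=(10000, 30), p=p)
print(np.mean(draws.sum(axis=1) > X))  # estimated P(sum > X)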
I have one variable temp, say temp = 100. I want to generate 8 data points, displayed as shown in the figure. The points should look roughly like a normal distribution, but with enough random noise added that they do not form a perfect bell curve. The final data (the area under the curve) should sum to temp. Could someone advise how to do this easily and neatly in Python?
I have tried the distribution functions in numpy/matplotlib. However, I can't figure out how to get 8 points like in the figure (x = 0, 1, 2, 3, 4, ...), nor how to make them sum to 100.
By imposing the sum temp=100 you introduce a dependency between at least two data points, making it impossible to create a set of independently sampled random data points.
This answer on mathworks provides more detailed information.
An easier example:
Imagine one coin flip. The randomness in the system is exactly one binary outcome, or 1 bit.
Imagine two coin flips. The randomness in the system is exactly two binary outcomes, or 2 bits.
Now imagine imposing a sum constraint on the two coin flips, say you want the sum of the coin flips to equal exactly 1. Since the outcome of the second coin flip is then determined by the outcome of the first, the randomness in the system shrinks.
You have therefore reduced the total randomness of the system from 2 bits to 1 bit.
Sampling 8 truly (pseudo)-random points from a normal distribution with a sum-constraint is therefore not possible.
Your best bet would be to sample 7 random points from a distribution with appropriate mean and then add a point to the dataset to absorb the difference:
>>> import numpy as np
>>> temp = 100.0
>>> datapoints = 8
>>> dev = 1
>>> data = np.random.normal(temp/datapoints, dev, datapoints-1)
>>> print(data)
[ 11.70369328 10.77010243 11.20507387 12.40637644 12.81099137
12.55329521 10.95809056]
>>> data = np.append(data,temp-sum(data))
>>> data
array([ 11.70369328, 10.77010243, 11.20507387, 12.40637644,
12.81099137, 12.55329521, 10.95809056, 17.59237685])
>>> sum(data)
100.0
Comparing lists of lists has been posted about before, but the Python environment that I am working in cannot fully integrate all the methods and classes in numpy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists and ideally 1million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using numpy or pandas?
I am able to use some of the methods and classes in numpy but not all. For example, numpy.allclose and numpy.all do not work properly and that is because of the environment that I am working in.
global rel_tol, avg_lists
rel_tol = .1
avg_lists = []
cntr = 0  # was previously uninitialized
# compare the lists in the big list and output ~8-10 lists that
# approximate all the other lists in the big list
for j in range(len(big_list)):
    for k in range(len(big_list)):
        array1 = np.array(big_list[j])
        array2 = np.array(big_list[k])
        if j != k:
            diff = np.subtract(array1, array2)
            abs_diff = np.absolute(diff)
            # cannot use numpy.allclose
            # if the deviation for the largest value in the array is < 10%
            if np.amax(abs_diff) <= rel_tol and big_list[k] not in avg_lists:
                cntr += 1
                avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
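In case scikit-learn is unavailable, here is a minimal from-scratch K-means sketch (my addition), using only basic numpy indexing and arithmetic, and avoiding np.allclose per the question's constraints:
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.RandomState(seed)
    X = np.asarray(X, dtype=float)
    # initialize centers with k distinct random points
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest center
        # (for millions of points, do this in chunks to limit memory)
        d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        # move each center to the mean of its assigned points
        new_centers = np.array([X[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        # stop when the centers no longer move
        if np.abs(new_centers - centers).max() < 1e-9:
            break
        centers = new_centers
    return centers, labels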
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can, e.g., find the closest point in each cluster to its center and treat it as the representative, as sketched below. Without more detail of the problem you're trying to solve it's hard to say for sure, but a clustering approach like this will likely be the most efficient way to solve the problem stated in your question.
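A sketch of that last step (my addition), continuing from the model fitted above: for each cluster, pick the actual input list closest to its center.
import numpy as np

X = np.array(big_list)                    # shape (n_points, n_dims)
labels = model.predict(X)
representatives = []
for c in range(model.n_clusters):
    members = np.where(labels == c)[0]
    # distances from this cluster's members to its center
    d = np.linalg.norm(X[members] - model.cluster_centers_[c], axis=1)
    representatives.append(big_list[members[np.argmin(d)]])
# representatives now holds one original list per cluster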