I have browsed SO extensively and I have found many questions about generating all possible permutations, but none regarding generating a specific number of permutations.
I developed, thanks to many SO questions, a decent permutation test routine. However I have to repeat it many times, and it is taking a too long time.
my code:
def exact_mc_perm_test(ys, nmc,boolean_selection):
# xs sample from a time series
# ys all time series
# print nmc
# sample difference in mean
mean_ys = np.mean(ys)
diff = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
k=0
for j in np.arange(nmc):
# in place shuffling
np.random.shuffle(ys)
# difference now between fixed all time series and shuffled subsamplevalues
diff_shuffled = np.abs(np.mean(ys[boolean_selection]) - mean_ys)
k += diff < diff_shuffled
return k / nmc
I took this SO answer and modify it for my specific test.
I have to run it over a 3D array stored in an xarray. the dataset has (lon,lat,time) coordinates, I need to run it for each (lon,lat) position (along the time dimension)
I run it using chain.iteratools:
for ii in chain.from_iterable(zip(*dataset.variable())):
iis = ii[selected_position].values
ind_x =dataset.lon==ii.lon
ind_y =dataset.lat==ii.lat
dataset.perm_test[ind_y, ind_x] = exact_mc_perm_test1(iis, ii.values, 1000.,selected_position)
Ideally I want to run a permutation test with 20000 permutations. The two loops (within (lon,lat) and for 20000 shuffles) adds up.
I am looking to speed up the permutation test code.
Therefore I though about trying to generate a 2D array of shape (len(ys),20000) with essentially 20000 shuffled ys array, and then access them at ones and calculate the 20000 differences (diff in the code). (Or find a trade off between memory usage and the looping, so maybe do 5 loops for 4000 shuffles at the time).
I could not figure out or find a way to do this.
The permutations command from itertools generates all the possible permutations which in my case are too many to handle.
I have looked at the random library but couldn't find something that fits my need. Any suggestion?
Take a look at compress() and permutations() from the itertools:
for perm in compress(permutation(iterable, r=length), boolean_selection):
print(perm)
Related
I would like to find the fastest way to generate ~10^9 poisson random numbers in python/numpy—for instance, say I have a mean Poisson parameter (calculated elsewhere) of shape (1000, 2000), and I need 500 independent samples. This is a bottleneck in my code, taking several minutes to complete. I have tried three methods, but am looking for something faster:
import numpy as np
# example parameters
nsamples = 500
nmeas = 2000
ninputs = 1000
lambdax = np.ones([ninputs, nmeas]) * 20
# numpy, one big array
sample0 = np.random.poisson(lam=lambdax, size=(nsamples, ninputs, nmeas))
# numpy, current version where other code happens in the loop
sample1 = np.zeros([nsamples, ninputs, nmeas])
for i in range(nsamples):
sample1[i, :, :] = np.random.poisson(lam=lambdax)
# scipy
from scipy.stats import poisson
sample2 = poisson.rvs(lambdax, size=(nsamples, ninputs, nmeas))
Results:
sample0: 1 m 16 s
sample1: 1 m 20 s
sample2: 1 m 50 s
Not shown here, I am also parallelizing the independent samples via multiprocessing, but the calculations are still pretty expensive for such large parameters. Is there a better way?
I have been in your shoes and here are my suggestions:
For large mean values, poisson works similar to uniform. check out this post (and probably more if you search) .
~1m runtime seems reasonable to generate such a large number of random numbers. I don't think you can top sample0 method by much via just coding. Now depending on what you want to do with random numbers,
if your issue is rerunning program multiple times, try saving sample0 into a file and reloading it in the next runs.
if not, I suggest creating lower number of randoms and reuse them. A lot of those random numbers in sample0 will be repeated in your sample, depending on your mean value. You might want to create smaller sample size and randomly choose from them. for example I would chose a random number from sample0 and reuse it for e.g. 100 times (since that number would appear in sample0 over 100 times anyways).
If you provide more information on what you intend to do with random numbers, we might be able to help more. Otherwise, coding-wise I am not sure if you can do much further.
here is my problem:
I would like to define an array of persons and change the entries of this array in a for loop. Since I also would like to see the asymptotics of the resulting distribution, I want to repeat this simulation quiet a lot, thus I'm using a matrix to store the several array in each row. I know how to do this with two for loops:
import random
import numpy as np
nobs = 100
rep = 10**2
steps = 10**2
dmoney = 1
state = np.matrix([[10] * nobs] * rep)
for i in range(steps):
for j in range(rep)
sample = random.sample(range(state.shape[1]),2)
state[j,sample[0]] = state[j,sample[0]] + dmoney
state[j,sample[1]] = state[j,sample[1]] - dmoney
I thought I use the multiprocessing library but I don't know how to do it, because in my simple mind, the workers manipulate the same global matrix in parallel, which I read is not a good idea.
So, how can I do this, to speed up calculations?
Thanks in advance.
OK, so this might not be much use, I haven't profiled it to see if there's a speed-up, but list comprehensions will be a little faster than normal loops anyway.
...
y_ix = np.arange(rep) # create once as same for each loop
for i in range(steps):
# presumably the two locations in the population to swap need refreshing each loop
x_ix = np.array([np.random.choice(nobs, 2) for j in range(rep)])
state[y_ix, x_ix[:,0]] += dmoney
state[y_ix, x_ix[:,1]] -= dmoney
PS what numpy splits over multiple processors depends on what libraries have been included when compiled (BLAS etc). You will be able to find info on line about this.
EDIT I can confirm, after comparing the original with the numpy indexed version above, that the original method is faster!
Comparing list of lists has been posted about before but the python environment that I am working in cannot fully integrate all the methods and classes in numpy. I cannot import pandas either.
I am trying to compare lists within a big list and come up with roughly 8-10 lists that approximate all the other lists in the big list.
The approach I have works fine if I have <50 lists in the big list. However, I am trying to compare at least 20k lists and ideally 1million+. I am currently looking into itertools. What might be the fastest, most efficient approach for large data sets without using numpy or pandas?
I am able to use some of the methods and classes in numpy but not all. For example, numpy.allclose and numpy.all do not work properly and that is because of the environment that I am working in.
global rel_tol, avg_lists
rel_tol=.1
avg_lists=[]
#compare the lists in the big list and output ~8-10 lists that approximate the all the lists in the big list
for j in range(len(big_list)):
for k in range(len(big_list)):
array1=np.array(big_list[j])
array2=np.array(big_list[k])
if j!=k:
#if j is not k:
diff=np.subtract(array1, array2)
abs_diff=np.absolute(diff)
#cannot use numpy.allclose
#if the deviation for the largest value in the array is < 10%
if np.amax(abs_diff)<= rel_tol and big_list[k] not in avg_lists:
cntr+=1
avg_lists.append(big_list[k])
Fundamentally, it looks like what you're aiming at is a clustering operation (i.e. representing a set of N points via K < N cluster centers). I would suggest a K-Means clustering approach, where you increase K until the size of your clusters is below your desired threshold.
I'm not sure what you mean by "cannot fully integrate all the methods and classes in numpy", but if scikit-learn is available you could use its K-means estimator. If that's not possible, a simple version of the K-means algorithm is relatively easy to code from scratch, and you might use that.
Here's a k-means approach using scikit-learn:
# 100 lists of length 10 = 100 points in 10 dimensions
from random import random
big_list = [[random() for i in range(10)] for j in range(100)]
# compute eight representative points
from sklearn.cluster import KMeans
model = KMeans(n_clusters=8)
model.fit(big_list)
centers = model.cluster_centers_
print(centers.shape) # (8, 10)
# this is the sum of square distances of your points to the cluster centers
# you can adjust n_clusters until this is small enough for your purposes.
sum_sq_dists = model.inertia_
From here you can e.g. find the closest point in each cluster to its center and treat this as the average. Without more detail of the problem you're trying to solve, it's hard to say for sure. But a clustering approach like this will be the most efficient way to solve a problem like the one you stated in your question.
I have a large-ish array artist_topic_probs (112,312 item rows by ~100 feature columns), and I want to calculate the pairwise cosine similarity between a (large sample) of random pairs of rows from this array. Here's the relevant bits of my current code
# the number of random pairs to check (10 million here)
random_sample_size=10000000
# I want to make sure they're unique, and that I'm never comparing a row to itself
# so I generate my set of comparisons like so:
np.random.seed(99)
comps = set()
while len(comps)<random_sample_size:
a = np.random.randint(0,112312)
b= np.random.randint(0,112312)
if a!=b:
comp = tuple(sorted([a,b]))
comps.add(comp)
# convert to list at the end to ensure sort order
# not positive if this is needed...I've seen conflicting opinions
comps = list(sorted(comps))
This generates a list of tuples, where each are the two rows between which I'll calculate similarity. Then I just use a simple loop to calculate all the similarities:
c_dists = []
from scipy.spatial.distance import cosine
for a,b in comps:
c_dists.append(cosine(artist_topic_probs[a],artist_topic_probs[b]))
(of course, cosine here gives distance, not a similarity, but we can easily get that with sim = 1.0 - dist. I used similarity in the title because it's the more common term)
This works fine, but isn't too fast, and I need to repeat the procedure many times. I have 32 cores to work with, so parallelization seems like a good bet, but I'm not sure the best way to go about it. My idea was something like:
pool = mp.Pool(processes=32)
c_dists = [pool.apply(cosine, args=(artist_topic_probs[a],artist_topic_probs[b]))
for a,b in comps]
But testing this approach out on my laptop with some test data hasn't been working (it just hangs, or at least is taking so much longer than the simple loop that I got sick of waiting and killed it). My concern is the indexing of the matrix being some sort of bottleneck, but I'm not sure. Any ideas on how to effectively parallelize this (or otherwise speed up the process)?
First of all, you might want to use itertools.combinations and random.sample to get unique pairs in the future, but it won't work in this case due to memory issues. Then, multiprocessing is not multithreading, i.e. spawning a new process involves huge system overhead. There is little sense in spawning a process for each individual task. A task must be well worth the overhead to rationalise starting a new process, hence you'd better split all work into separate jobs (into as many pieces as the number of cores you want to use). Then, don't forget that multiprocessing implementation serialises the entire namespace and loads it into memory N times, where N is the number of processes. This can lead to intensive swapping if you don't have enough RAM to store N copies of your huge array. So you might want to reduce the number of cores.
Updated to restore initial order as you requested.
I made a test data-set of identical vectors, hence cosine must return a vector of zeros.
from __future__ import division, print_function
import math
import multiprocessing as mp
from scipy.spatial.distance import cosine
from operator import itemgetter
import itertools
def worker(enumerated_comps):
return [(ind, cosine(artist_topic_probs[a], artist_topic_probs[b])) for ind, (a, b) in enumerated_comps]
def slice_iterable(iterable, chunk):
"""
Slices an iterable into chunks of size n
:param chunk: the number of items per slice
:type chunk: int
:type iterable: collections.Iterable
:rtype: collections.Generator
"""
_it = iter(iterable)
return itertools.takewhile(
bool, (tuple(itertools.islice(_it, chunk)) for _ in itertools.count(0))
)
# Test data
artist_topic_probs = [range(10) for _ in xrange(10)]
comps = tuple(enumerate([(1, 2), (1, 3), (1, 4), (1, 5)]))
n_cores = 2
chunksize = int(math.ceil(len(comps)/n_cores))
jobs = tuple(slice_iterable(comps, chunksize))
pool = mp.Pool(processes=n_cores)
work_res = pool.map_async(worker, jobs)
c_dists = map(itemgetter(1), sorted(itertools.chain(*work_res.get())))
print(c_dists)
Output:
[2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16, 2.2204460492503131e-16]
These values are fairly close to zero.
P.S.
From the multiprocessing.Pool.apply docs
Equivalent of the apply() built-in function. It blocks until the
result is ready, so apply_async() is better suited for performing
work in parallel. Additionally, func is only executed in one of the
workers of the pool.
scipy.spatial.distance.cosine, as you can see following the link, introduces a significant overhead in your computations because for each invocation it computes the norm of the two vectors that you're analyzing at each invocation, for the size of your sample
this amounts to 20 millions norms computed, if you memorize the norms of your ~100 thousand vectors in advance you can save approximately 60% of your computation time because you have a dot product, u*v, and two norm calculations, and each of these three operations is roughly equivalent in terms of operations count.
Further, you're using explicit loops, if you could put your logic inside a vectorized numpy operator you could trim another large slice of your computational time.
Eventually, you talk about cosine similarity... consider that scipy.spatial.distance.cosine computes the cosine distance instead, the relationship is easy, cs = cd - 1 but I haven't seen this in your posted code.
I am looking for the most efficient way to randomly draw nelements in a list given a list of probabilities stating the probability of each element to be picked.
aList = [3,4,2,1,4,3,5,7,6,4]
MyProba = [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]
It means that at each draw, the first element (which is 3) has a probability of 0.1 to be drawn. Of course,
sum(MyProba) == 1 # always returns True
len(aList) == len(MyProba) # always returns True
Up to now I did the following:
def random_pick(some_list, proba):
x = random.uniform(0, 1)
cumulative_proba = 0.0
for item, item_proba in zip(some_list, proba):
cumulative_proba += item_proba
if x < cumulative_proba:
break
return item
nb_draws = 10
list_of_drawn_elements = []
for one_draw in range(nb_draws):
list_of_drawn_elements.append(random_pick(aList, MyProba))
It works but it is terribly slow for long lists and big values of nb_draws. How can I improve the speed of this process?
Note: In the special case I am facing, nb_draws always equals the length of aList.
The general idea (as outlined by others' answers as well) is that your method is inefficient because the preprocessing (the calculation of the cumulative distribution) is done every time you draw a sample, although it would be enough to do it once before the sampling and then use the preprocessed data to do the sampling.
The preprocessing and sampling could be done efficiently with Walker's alias method. I have implemented it a while ago; take a look at the source code. (Sorry for the external link, but I think it's too long to post it here). My version requires NumPy; if you don't want to use NumPy, there is a NumPy-free alternative as well (on which my version is based).
Edit: the explanation of Walker's alias method is to be found in the first link I provided. In a nutshell, imagine that you somehow managed to construct a rectangular "darts board" that is subdivided into parts such that each part corresponds to one of your original items, and the area of each part is proportional to the desired probability of selecting the corresponding element. You can then start throwing darts at random at the darts board (by generating two random numbers that specify the horizontal and vertical coordinate of where the dart ended up) and check which areas the darts hit. The items corresponding to the areas will be the items you have selected. Walker's alias method is simply a linear-time preprocessing that constructs the dart board. Drawing each element can then be done in constant time. In the end, drawing m elements out of n will have a cost of O(n) for preprocessing and O(m) for generating the samples, yielding a total complexity of O(n + m).
here's my lazy method... build a list with expected number of values for the desired distribution, and use random.choice() to pick a value from the list.
>>> import random
>>>
>>> value_probs = dict(zip([3,4,2,1,4,3,5,7,6,4], [0.1,0.1,0.2,0,0.1,0,0.2,0,0.2,0.1]))
>>> expected_dist = sum([[i] * int(prob * 100) for i, prob in value_probs.iteritems()], [])
>>> random.choice(expected_dist)
You might try to precalculate the cumulative probability range for each element and make a tree from these intervals. Then you will get a logarithmic complexity for looking up the element corresponding to the generated probability, instead of linear one that you have now.
You're calculating cumulative_proba each time when you call random_pick. I suggest to calculate it outside the method, and use a better data structure to store it, like a binary search tree, which will reduce the time complexity from O(n) to O(lgn).