Given an nxn array A of real positive numbers, I'm trying to find the minimum of the maximum of the element-wise minimum of all combinations of three rows of the 2-d array. Using for-loops, that comes out to something like this:
import numpy as np
n = 100
np.random.seed(2)
A = np.random.rand(n,n)
global_best = np.inf
for i in range(n-2):
    for j in range(i+1, n-1):
        for k in range(j+1, n):
            # find the maximum of the element-wise minimum of the three vectors
            local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
            # if local_best is lower than global_best, update global_best
            if (local_best < global_best):
                global_best = local_best
                save_rows = [i, j, k]
print global_best, save_rows
In the case for n = 100, the output should be this:
Out[]: 0.492652949593 [6, 41, 58]
I have a feeling though that I could do this much faster using Numpy vectorization, and would certainly appreciate any help on doing this. Thanks.
This solution is 5x faster for n=100:
import itertools
coms = np.fromiter(itertools.combinations(np.arange(n), 3), 'i,i,i').view(('i', 3))
best = A[coms].min(1).max(1)
at = best.argmin()
global_best = best[at]
save_rows = coms[at]
The first line is a bit convoluted but turns the result of itertools.combinations into a NumPy array which contains all possible [i,j,k] index combinations.
From there, it's a simple matter of indexing into A using all the possible index combinations, then reducing along the appropriate axes.
This solution consumes a lot more memory as it builds the concrete array of all possible combinations A[coms]. It saves time for smallish n, say under 250, but for large n the memory traffic will be very high and it may be slower than the original code.
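As a rough illustration of both points, here is a minimal sketch that checks the vectorized reduction against plain loops on a small array and estimates the size of the temporary `A[coms]` for n = 100. The small `n`, the seed, and the helper names (`vec_best`, `loop_best`, `temp_bytes`) are assumptions made for this example; the index array is built with a plain `np.array(list(...))` rather than the `np.fromiter` trick.

```python
import itertools
import math

import numpy as np

n = 8  # small size so the brute-force check is instant
rng = np.random.default_rng(0)
A = rng.random((n, n))

# All (i, j, k) row triples as an (N, 3) integer array.
coms = np.array(list(itertools.combinations(range(n), 3)))

# A[coms] has shape (N, 3, n): reduce over the triple axis, then over columns.
best = A[coms].min(axis=1).max(axis=1)
at = best.argmin()
vec_best, vec_rows = best[at], tuple(coms[at])

# Reference result from a plain Python loop over the same triples.
loop_best, loop_rows = min(
    (np.minimum(np.minimum(A[i], A[j]), A[k]).max(), (i, j, k))
    for i, j, k in itertools.combinations(range(n), 3)
)

# Size in bytes of the temporary A[coms] for n = 100 with float64 entries.
temp_bytes = math.comb(100, 3) * 3 * 100 * 8
print(vec_best, vec_rows, temp_bytes)
```

The temporary grows like n^4 (n^3 triples times n columns), which is why the approach stops paying off as n increases.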
Working in chunks combines the speed of vectorized computation with protection against memory errors. Below is an example of converting the nested loops to chunk-wise vectorization.
Starting from the same variables as the question, a chunk length is defined, in order to vectorize calculations inside the chunk and loop only over chunks instead of over combinations.
import itertools
chunk = 2000 # define chunk length; if too small, the code won't take advantage
             # of vectorization; if too large, excessive memory usage will
             # slow down execution, or a MemoryError will be raised
combinations = itertools.combinations(range(n),3) # generate iterator containing
                                                  # all possible combinations of 3 rows
N = n*(n-1)*(n-2)//6 # number of combinations (length of combinations cannot be
                     # retrieved because it is an iterator)
# generate a list containing how many elements of combinations will be retrieved
# per iteration
n_chunks, remainder = divmod(N,chunk)
counts_list = [chunk for _ in range(n_chunks)]
if remainder:
    counts_list.append(remainder)
# Iterate one chunk at a time, using vectorized code to treat the chunk
for counts in counts_list:
    # retrieve combinations in current chunk
    current_comb = np.fromiter(combinations,dtype='i,i,i',count=counts)\
                     .view(('i',3))
    # maximum of element-wise minimum in current chunk
    chunk_best = np.minimum(np.minimum(A[current_comb[:,0],:],A[current_comb[:,1],:]),
                            A[current_comb[:,2],:]).max(axis=1)
    ravel_save_row = chunk_best.argmin() # minimum of maximums in current chunk
    # check if current chunk contains global minimum
    if chunk_best[ravel_save_row] < global_best:
        global_best = chunk_best[ravel_save_row]
        save_rows = current_comb[ravel_save_row]
print(global_best,save_rows)
I ran some performance comparisons with the nested loops, obtaining the following results (chunk_length = 1000):
n=100
Nested loops: 1.13 s ± 16.6 ms
Work by chunks: 108 ms ± 565 µs
n=150
Nested loops: 4.16 s ± 39.3 ms
Work by chunks: 523 ms ± 4.75 ms
n=500
Nested loops: 3min 18s ± 3.21 s
Work by chunks: 1min 12s ± 1.6 s
Note
After profiling the code, I found that np.min took the longest because it calls np.minimum.reduce. I replaced it with direct calls to np.minimum, which improved performance a bit.
Don't try to vectorize loops that are not simple to vectorize. Instead use a jit compiler like Numba or use Cython. Vectorized solutions are good if the resulting code is more readable, but in terms of performance a compiled solution is usually faster or in a worst case scenario as fast as a vectorized solution (except BLAS routines).
Single-threaded example
import numba as nb
import numpy as np
#Min and max library calls may be costly for only 3 values
@nb.njit()
def max_min_3(A,B,C):
    max_of_min=-np.inf
    for i in range(A.shape[0]):
        loc_min=A[i]
        if (B[i]<loc_min):
            loc_min=B[i]
        if (C[i]<loc_min):
            loc_min=C[i]
        if (max_of_min<loc_min):
            max_of_min=loc_min
    return max_of_min
@nb.njit()
def your_func(A):
    n=A.shape[0]
    save_rows=np.zeros(3,dtype=np.uint64)
    global_best=np.inf
    for i in range(n):
        for j in range(i+1, n):
            for k in range(j+1, n):
                # find the maximum of the element-wise minimum of the three vectors
                local_best = max_min_3(A[i,:], A[j,:], A[k,:])
                # if local_best is lower than global_best, update global_best
                if (local_best < global_best):
                    global_best = local_best
                    save_rows[0] = i
                    save_rows[1] = j
                    save_rows[2] = k
    return global_best, save_rows
Performance of single-threaded version
n=100
your_version: 1.56s
compiled_version: 0.0168s (92x speedup)
n=150
your_version: 5.41s
compiled_version: 0.08122s (66x speedup)
n=500
your_version: 283s
compiled_version: 8.86s (31x speedup)
The first call has a constant overhead of about 0.3-1s. For performance measurement of the calculation time itself, call it once and then measure performance.
With a few code changes this task can also be parallelized.
Multi-threaded example
@nb.njit(parallel=True)
def your_func(A):
    n=A.shape[0]
    all_global_best=np.inf
    rows=np.empty((3),dtype=np.uint64)
    save_rows=np.empty((n,3),dtype=np.uint64)
    global_best_Temp=np.empty((n),dtype=A.dtype)
    global_best_Temp[:]=np.inf
    for i in range(n):
        for j in nb.prange(i+1, n):
            row_1=0
            row_2=0
            row_3=0
            global_best=np.inf
            for k in range(j+1, n):
                # find the maximum of the element-wise minimum of the three vectors
                local_best = max_min_3(A[i,:], A[j,:], A[k,:])
                # if local_best is lower than global_best, update global_best
                if (local_best < global_best):
                    global_best = local_best
                    row_1 = i
                    row_2 = j
                    row_3 = k
            save_rows[j,0]=row_1
            save_rows[j,1]=row_2
            save_rows[j,2]=row_3
            global_best_Temp[j]=global_best
        ind=np.argmin(global_best_Temp)
        if (global_best_Temp[ind]<all_global_best):
            rows[0] = save_rows[ind,0]
            rows[1] = save_rows[ind,1]
            rows[2] = save_rows[ind,2]
            all_global_best=global_best_Temp[ind]
    return all_global_best, rows
Performance of multi-threaded version
n=100
your_version: 1.56s
compiled_version: 0.0078s (200x speedup)
n=150
your_version: 5.41s
compiled_version: 0.0282s (191x speedup)
n=500
your_version: 283s
compiled_version: 2.95s (96x speedup)
Edit
In a newer Numba Version (installed through the Anaconda Python Distribution) I have to manually install tbb to get a working parallelization.
You can use combinations from itertools, which is part of the Python standard library; it lets you remove all those nested loops.
from itertools import combinations
import numpy as np
n = 100
np.random.seed(2)
A = np.random.rand(n,n)
global_best = 1000000000000000.0
for i, j, k in combinations(range(n), 3):
    local_best = np.amax(np.array([A[i,:], A[j,:], A[k,:]]).min(0))
    if local_best < global_best:
        global_best = local_best
        save_rows = [i, j, k]
print global_best, save_rows
Related
The answer for three matrices was given in this question, but I'm not sure how to apply this logic to an arbitrary amount of pairwise connected matrices:
f(i, j, k, l, ...) = min(A(i, j), B(i,k), C(i,l), D(j,k), E(j,l), F(k,l), ...)
Where A,B,... are matrices and i,j,... are indices that range up to the respective dimensions of the matrices. If we consider n indices, there are n(n-1)/2 pairs and thus matrices. I would like to find (i,j,k,...) such that f(i,j,k,l,...) is maximized. I am currently doing that as follows:
import numpy as np
import itertools
# i j k l ...
dimensions = [50,50,50,50]
n_dims = len(dimensions)
pairs = list(itertools.combinations(range(n_dims), 2))
# Construct the matrices A(i,j), B(i,k), ...
matrices = []
for pair in pairs:
    matrices.append(np.random.rand(dimensions[pair[0]], dimensions[pair[1]]))
# All the different i,j,k,l... combinations
combinations = itertools.product(*list(map(np.arange,dimensions)))
combinations = np.asarray(list(combinations))
# Find the maximum minimum
vals = []
for i in range(len(pairs)):
    pair = pairs[i]
    matrix = matrices[i]
    vals.append(matrix[combinations[:,pair[0]], combinations[:,pair[1]]])
f = np.min(vals,axis=0)
best_indices = combinations[np.argmax(f)]
print(best_indices, np.max(f))
[5 17 17 18] 0.932985854758534
This is faster than iterating over all (i, j, k, l, ...), but a lot of time is spent constructing the combinations and vals matrices. Is there an alternative way to do this where (1) the speed of numpy's matrix computation can be preserved and (2) I don't have to construct the memory-intensive vals matrices?
Here is a generalisation of the 3D solution. I assume there are other (better?) ways of organising the recursion but this works well enough. It does a 6D example (product of dims 9x10^6) in <10 ms
Sample run, note that occasionally the indices returned by the two methods do not match. This is because they are not always unique, sometimes different index combinations yield the same maximum of minima. Also note that in the very end we do a single run of a huge 6D 9x10^12 example. Brute force is no longer viable on that, the smart method takes about 10 seconds.
trial 1
results identical True
results compatible True
brute force 276.8830654968042 ms
branch cut 9.971900499658659 ms
trial 2
results identical True
results compatible True
brute force 273.444719001418 ms
branch cut 9.236706099909497 ms
trial 3
results identical True
results compatible True
brute force 274.2998780013295 ms
branch cut 7.31226220013923 ms
trial 4
results identical True
results compatible True
brute force 273.0268925006385 ms
branch cut 6.956217200058745 ms
HUGE (100, 150, 200, 100, 150, 200) 9000000000000
branch cut 10246.754082996631 ms
Code:
import numpy as np
import itertools as it
import functools as ft
def bf(dims,pairs):
    dims,pairs = np.array(dims),np.array(pairs,object)
    n,m = len(dims),len(pairs)
    IDX = np.empty((m,n),object)
    Y,X = np.triu_indices(n,1)
    IDX[np.arange(m),Y] = slice(None)
    IDX[np.arange(m),X] = slice(None)
    idx = np.unravel_index(
        ft.reduce(np.minimum,(p[(*i,)] for p,i in zip(pairs,IDX))).argmax(),dims)
    return ft.reduce(np.minimum,(
        p[I] for p,I in zip(pairs,it.combinations(idx,2)))),idx
def cut(dims,pairs,offs=None):
    n = len(dims)
    if n<3:
        if n==2:
            A = pairs[0] if offs is None else np.minimum(
                pairs[0],np.minimum.outer(offs[0],offs[1]))
            idx = np.unravel_index(A.argmax(),dims)
            return A[idx],idx
        else:
            idx = offs[0].argmax()
            return offs[0][idx],(idx,)
    gmx = min(map(np.min,pairs))
    gidx = n * (0,)
    A = pairs[0] if offs is None else np.minimum(
        pairs[0],np.minimum.outer(offs[0],offs[1]))
    Y,X = np.unravel_index(A.argsort(axis=None)[::-1],dims[:2])
    for y,x in zip(Y,X):
        if A[y,x] <= gmx:
            return gmx,gidx
        coffs = [np.minimum(p1[y],p2[x])
                 for p1,p2 in zip(pairs[1:n-1],pairs[n-1:])]
        if not offs is None:
            coffs = [*map(np.minimum,coffs,offs[2:])]
        cmx,cidx = cut(dims[2:],pairs[2*n-3:],coffs)
        if cmx >= A[y,x]:
            return A[y,x],(y,x,*cidx)
        if gmx < cmx:
            gmx = min(A[y,x],cmx)
            gidx = y,x,*cidx
    return gmx,gidx
from timeit import timeit
IDX = 10,15,20,10,15,20
for rep in range(4):
    print("trial",rep+1)
    pairs = [np.random.rand(i,j) for i,j in it.combinations(IDX,2)]
    print("results identical",cut(IDX,pairs)==bf(IDX,pairs))
    print("results compatible",cut(IDX,pairs)[1]==bf(IDX,pairs)[1])
    print("brute force",timeit(lambda:bf(IDX,pairs),number=2)*500,"ms")
    print("branch cut",timeit(lambda:cut(IDX,pairs),number=10)*100,"ms")
IDX = 100,150,200,100,150,200
pairs = [np.random.rand(i,j) for i,j in it.combinations(IDX,2)]
print("HUGE",IDX,np.prod(IDX))
print("branch cut",timeit(lambda:cut(IDX,pairs),number=1)*1000,"ms")
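For orientation, the three-index case that the code above generalises can be evaluated by brute force with plain broadcasting. The following is only a sketch of that baseline, not the branch-cut method; the dimensions `ni`, `nj`, `nk` and the seed are hypothetical values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
ni, nj, nk = 4, 5, 6          # hypothetical small dimensions
A = rng.random((ni, nj))      # A(i, j)
B = rng.random((ni, nk))      # B(i, k)
C = rng.random((nj, nk))      # C(j, k)

# Broadcast each matrix to the full (ni, nj, nk) grid and take the minimum,
# so that f[i, j, k] = min(A[i, j], B[i, k], C[j, k]).
f = np.minimum(np.minimum(A[:, :, None], B[:, None, :]), C[None, :, :])

# Index triple that maximizes the minimum, and the value reached there.
best = np.unravel_index(f.argmax(), f.shape)
best_value = f[best]
print(best, best_value)
```

This materializes the whole grid, so its memory use is the product of all dimensions; that is the cost the branch-cut recursion avoids.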
I'm concerned with the speed of the following function:
def cch(tau):
return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)
Where "cartprod" is a variable for a list that looks like this:
cartprod = np.array([[0.0123,0.0123],[0.0123,0.0459],...])
The length of this list is about 25 million. Basically, I'm trying to find a significantly faster way to return a list of differences for every pair list in that np.ndarray. Is there an algorithmic way or function that's faster than np.diff? Or, is np.diff the end all be all? I'm also open to anything else.
EDIT: Thank you all for your solutions!
I think you're hitting a wall by repeatedly returning multiple np.arrays of length ~25 million rather than np.diff being slow. I wrote an equivalent function that iterates over the array and tallies the results as it goes along. The function needs to be jitted with numba to be fast. I hope that is acceptable.
import numpy as np
from numba import jit
arr = np.random.rand(25000000, 2)
def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)
%timeit cch(0.01, arr)
@jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1]- cartprod[i, 0])), 0.001)
    return count
%timeit cch_jit(0.01, arr)
produces
294 ms ± 2.82 ms
42.7 ms ± 483 µs
which is roughly 7 times faster.
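If Numba is not an option, the same idea of avoiding full-length temporaries can be approximated in plain NumPy by processing the array in blocks. This is only a sketch under that assumption: the function name `cch_blocked` and the `block` parameter are made up for the example, and it has not been benchmarked against the versions above.

```python
import numpy as np

def cch_blocked(tau, cartprod, block=1_000_000):
    """Count rows where |-(x1 - x0) - tau| < 0.001, one block at a time."""
    count = 0
    for start in range(0, cartprod.shape[0], block):
        part = cartprod[start:start + block]
        # Temporaries created here are at most `block` elements long.
        diff = part[:, 0] - part[:, 1]
        count += np.count_nonzero(np.abs(diff - tau) < 0.001)
    return count

# Small check against the original one-line formulation.
arr = np.random.default_rng(0).random((200_000, 2))
reference = int(np.sum(abs(-1 * np.diff(arr) - 0.01) < 0.001))
blocked = cch_blocked(0.01, arr, block=50_000)
print(reference, blocked)
```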
We can leverage multiple cores with the numexpr module for large data, and gain memory efficiency (and hence performance) with some help from array slicing -
import numexpr as ne
def cch_numexpr(a, tau):
    # tau must be passed in the dict, since it is a local of this function
    d = {'a0':a[:,0],'a1':a[:,1],'tau':tau}
    return np.count_nonzero(ne.evaluate('abs(a0-a1-tau)<0.001',d))
Sample run and timings on 25M sized data -
In [83]: cartprod = np.random.rand(25000000,2)
In [84]: cch(cartprod, tau=0.5) == cch_numexpr(cartprod, tau=0.5)
Out[84]: True
In [85]: %timeit cch(cartprod, tau=0.5)
10 loops, best of 3: 150 ms per loop
In [86]: %timeit cch_numexpr(cartprod, tau=0.5)
10 loops, best of 3: 25.5 ms per loop
Around 6x speedup.
This was with 8 threads. Thus, with more number of threads available for compute, it should improve further. Related post on how to control multi-core functionality.
Just out of curiosity I compared the solutions of @Divakar (numexpr) and @alexdor (numba.jit). The numba.jit implementation seems to be about twice as fast as numexpr.evaluate. The results are shown for 100 runs each:
np.sum: 111.07543396949768
numexpr: 12.282189846038818
JIT: 6.2505223751068115
'np.sum' returns same result as 'numexpr'
'np.sum' returns same result as 'jit'
'numexpr' returns same result as 'jit'
Script to reproduce the results:
import numpy as np
import time
import numba
import numexpr
arr = np.random.rand(25000000, 2)
runs = 100
def cch(tau, cartprod):
    return np.sum(abs(-1*np.diff(cartprod)-tau)<0.001)
def cch_ne(tau, cartprod):
    d = {'a0':cartprod[:,0],'a1':cartprod[:,1], 'tau': tau}
    count = np.count_nonzero(numexpr.evaluate('abs(a0-a1-tau)<0.001',d))
    return count
@numba.jit(nopython=True)
def cch_jit(tau, cartprod):
    count = 0
    tau = -tau
    for i in range(cartprod.shape[0]):
        count += np.less(np.abs(tau - (cartprod[i, 1]- cartprod[i, 0])), 0.001)
    return count
start = time.time()
for x in range(runs):
    x1 = cch(0.01, arr)
print('np.sum:\t\t', time.time() - start)
start = time.time()
for x in range(runs):
    x2 = cch_ne(0.01, arr)
print('numexpr:\t', time.time() - start)
x3 = cch_jit(0.01, arr)  # warm-up call so compilation is not timed
start = time.time()
for x in range(runs):
    x3 = cch_jit(0.01, arr)
print('JIT:\t\t', time.time() - start)
if x1 == x2: print('\'np.sum\' returns same result as \'numexpr\'')
if x1 == x3: print('\'np.sum\' returns same result as \'jit\'')
if x2 == x3: print('\'numexpr\' returns same result as \'jit\'')
I have a working code like this, but it is rather slow.
def halfconvolution(g,w,dz):
    convo=np.zeros_like(g)
    for i in range(0,len(g)):
        sum=0
        for j in range(0,i):
            sum+=g[j]*w[(i-j)]*dz
        convo[i] = -sum
    return convo
I am trying to turn it into a list comprehension, but I am struggling.
I tried:
convo=[-g*w[i-j] for i in g for j in w]
I am not sure if this improves the performance, but it is a list comprehension as you asked
convo = [-sum(g[j] * w[i - j] * dz for j in range(0, i)) for i in range(0, len(g))]
A faster implementation using NumPy:
# make the matrices square
g = np.repeat(g, g.shape[0]).reshape(g.shape[0], g.shape[0], order='F')
w = np.repeat(w, w.shape[0]).reshape(w.shape[0], w.shape[0], order='F')
# take the lower half of g
g = np.tril(g, k=-1)
# shift each column by its index number
# see: https://stackoverflow.com/questions/20360675/roll-rows-of-a-matrix-independently
rows_w, column_indices_w = np.ogrid[:w.shape[0], :w.shape[1]]
shift = np.arange(w.shape[0])
shift[shift < 0] += w.shape[1]
w = w[rows_w, column_indices_w - shift[:,np.newaxis]].T
convo = -np.sum(g * w, axis=1) * dz  # negated to match convo[i] = -sum in the question
For it to work it needs both w and g to be of the same size, but otherwise I'm sure a workaround can be found.
I hope this is a more acceptable speedup for you? Always try to rewrite your logic/problem into vector/matrix multiplications.
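As a further illustration of that advice, the same sum can also be expressed with `np.convolve`. This is a sketch under two assumptions that are not in the original post: `w` is at least as long as `g`, and the function names below are made up for the comparison. Zeroing `w[0]` removes the `j == i` term, which the inner loop `range(0, i)` never visits.

```python
import numpy as np

def halfconvolution_loop(g, w, dz):
    """Reference: the double loop from the question."""
    convo = np.zeros_like(g)
    for i in range(len(g)):
        total = 0.0
        for j in range(i):
            total += g[j] * w[i - j] * dz
        convo[i] = -total
    return convo

def halfconvolution_convolve(g, w, dz):
    """Same sum via np.convolve; zeroing w[0] drops the j == i term."""
    w0 = w.copy()
    w0[0] = 0.0
    # Only the first len(g) entries of the full convolution are needed.
    return -dz * np.convolve(g, w0)[:len(g)]

rng = np.random.default_rng(0)
g = rng.random(50)
w = rng.random(50)
loop_result = halfconvolution_loop(g, w, 0.15)
conv_result = halfconvolution_convolve(g, w, 0.15)
print(np.allclose(loop_result, conv_result))
```

Unlike the square-matrix version, this does not build an n-by-n intermediate.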
The inner loop can be replaced by the sum function (don't override it with a variable of the same name)
Then you append the outer loop to the end of that
[-sum(g[j]*w[i-j]*dz for j in range(i)) for i in range(len(g))]
Don't use list comprehensions for performance reasons. Use Numba, Cython, or vectorized NumPy operations instead.
Numba
import numba as nb
import numpy as np
import time
@nb.njit(fastmath=True)
def halfconvolution(g,w,dz):
    convo=np.empty(g.shape[0],dtype=g.dtype)
    for i in range(g.shape[0]):
        sum=0.
        for j in range(0,i):
            sum+=g[j]*w[(i-j)]*dz
        convo[i] = -sum
    return convo
g=np.random.rand(1000)
w=np.random.rand(1000)
dz=0.15
t1=time.time()
for i in range(1000):
    #res=halfconvolution(g,w,dz)
    res=[-sum(g[j]*w[i-j]*dz for j in range(i)) for i in range(len(g))]
print(time.time()-t1)
print("Done")
Performance
List Comprehension: 0.27s per iteration
Numba Version: 0.6ms per iteration
So there is a factor of about 500 between these two versions. If you want to call this function on multiple arrays at once, you can also parallelize this problem easily, and you should get at least another "number of cores" speedup.
Do you know if there is a way to get Python's random.sample to work with a generator object? I am trying to get a random sample from a very large text corpus. The problem is that random.sample() raises the following error.
TypeError: object of type 'generator' has no len()
I was thinking that maybe there is some way of doing this with something from itertools but couldn't find anything with a bit of searching.
A somewhat made up example:
import random
def list_item(ls):
    for item in ls:
        yield item
random.sample( list_item(range(100)), 20 )
UPDATE
As per Martijn Pieters's request I did some timing of the three currently proposed methods. The results are as follows.
Sampling 1000 from 10000
Using iterSample 0.0163 s
Using sample_from_iterable 0.0098 s
Using iter_sample_fast 0.0148 s
Sampling 10000 from 100000
Using iterSample 0.1786 s
Using sample_from_iterable 0.1320 s
Using iter_sample_fast 0.1576 s
Sampling 100000 from 1000000
Using iterSample 3.2740 s
Using sample_from_iterable 1.9860 s
Using iter_sample_fast 1.4586 s
Sampling 200000 from 1000000
Using iterSample 7.6115 s
Using sample_from_iterable 3.0663 s
Using iter_sample_fast 1.4101 s
Sampling 500000 from 1000000
Using iterSample 39.2595 s
Using sample_from_iterable 4.9994 s
Using iter_sample_fast 1.2178 s
Sampling 2000000 from 5000000
Using iterSample 798.8016 s
Using sample_from_iterable 28.6618 s
Using iter_sample_fast 6.6482 s
So it turns out that list.insert has a serious drawback when it comes to large sample sizes. The code I used to time the methods:
from heapq import nlargest
import random
import timeit
def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v) # add first samplesize items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results
def sample_from_iterable(iterable, samplesize):
    return (x for _, x in nlargest(samplesize, ((random.random(), x) for x in iterable)))
def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    for _ in xrange(samplesize):
        results.append(iterator.next())
    random.shuffle(results) # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results
if __name__ == '__main__':
    pop_sizes = [int(10e+3),int(10e+4),int(10e+5),int(10e+5),int(10e+5),int(10e+5)*5]
    k_sizes = [int(10e+2),int(10e+3),int(10e+4),int(10e+4)*2,int(10e+4)*5,int(10e+5)*2]
    for pop_size, k_size in zip(pop_sizes, k_sizes):
        pop = xrange(pop_size)
        k = k_size
        t1 = timeit.Timer(stmt='iterSample(pop, %i)'%(k_size), setup='from __main__ import iterSample,pop')
        t2 = timeit.Timer(stmt='sample_from_iterable(pop, %i)'%(k_size), setup='from __main__ import sample_from_iterable,pop')
        t3 = timeit.Timer(stmt='iter_sample_fast(pop, %i)'%(k_size), setup='from __main__ import iter_sample_fast,pop')
        print 'Sampling', k, 'from', pop_size
        print 'Using iterSample', '%1.4f s'%(t1.timeit(number=100) / 100.0)
        print 'Using sample_from_iterable', '%1.4f s'%(t2.timeit(number=100) / 100.0)
        print 'Using iter_sample_fast', '%1.4f s'%(t3.timeit(number=100) / 100.0)
        print ''
I also ran a test to check that all the methods indeed do take an unbiased sample of the generator. So for all methods I sampled 1000 elements from 10000 100000 times and computed the average frequency of occurrence of each item in the population which turns out to be ~.1 as one would expect for all three methods.
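The kind of uniformity check described here might look roughly like the following Python 3 sketch. It is not the script used for the numbers above: the sampler `reservoir_sample` is a stand-in written for this example, and the population size, sample size, trial count, and seed are scaled-down assumptions so that it runs quickly.

```python
import random
from collections import Counter

def reservoir_sample(iterable, k, rng):
    """Keep k items from a stream, replacing kept items at a decreasing rate."""
    reservoir = []
    for i, item in enumerate(iterable):
        if i < k:
            reservoir.append(item)
        else:
            r = rng.randint(0, i)  # inclusive on both ends
            if r < k:
                reservoir[r] = item
    return reservoir

rng = random.Random(12345)
population, k, trials = 100, 10, 2000
counts = Counter()
for _ in range(trials):
    counts.update(reservoir_sample(range(population), k, rng))

# Each item should be picked in about k / population = 0.1 of the trials.
frequencies = [counts[x] / trials for x in range(population)]
print(min(frequencies), max(frequencies))
```

With an unbiased sampler, every frequency should sit close to `k / population`, with the spread shrinking as the number of trials grows.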
While the answer of Martijn Pieters is correct, it does slow down when samplesize becomes large, because using list.insert in a loop may have quadratic complexity.
Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:
def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in xrange(samplesize):
            results.append(iterator.next())
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results) # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v # at a decreasing rate, replace random items
    return results
The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):
iterSample: 5.05s
iter_sample_fast: 2.64s
You can't.
You have two options: read the whole generator into a list, then sample from that list, or use a method that reads the generator one by one and picks the sample from that:
import random
def iterSample(iterable, samplesize):
    results = []
    for i, v in enumerate(iterable):
        r = random.randint(0, i)
        if r < samplesize:
            if i < samplesize:
                results.insert(r, v) # add first samplesize items in random order
            else:
                results[r] = v # at a decreasing rate, replace random items
    if len(results) < samplesize:
        raise ValueError("Sample larger than population.")
    return results
This method adjusts the chance that the next item is part of the sample based on the number of items in the iterable so far. It doesn't need to hold more than samplesize items in memory.
The solution isn't mine; it was provided as part of another answer here on SO.
Just for the heck of it, here's a one-liner that samples k elements without replacement from the n items generated in O(n lg k) time:
import random
from heapq import nlargest
def sample_from_iterable(it, k):
    return (x for _, x in nlargest(k, ((random.random(), x) for x in it)))
I am trying to get a random sample from a very large text corpus.
Your excellent synthesis answer currently shows victory for iter_sample_fast(gen, pop). However, I tried Katriel's recommendation of random.sample(list(gen), pop) — and it's blazingly fast by comparison!
def iter_sample_easy(iterable, samplesize):
    return random.sample(list(iterable), samplesize)
Sampling 1000 from 10000
Using iter_sample_fast 0.0192 s
Using iter_sample_easy 0.0009 s
Sampling 10000 from 100000
Using iter_sample_fast 0.1807 s
Using iter_sample_easy 0.0103 s
Sampling 100000 from 1000000
Using iter_sample_fast 1.8192 s
Using iter_sample_easy 0.2268 s
Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_easy 0.3297 s
Sampling 500000 from 1000000
Using iter_sample_easy 0.5628 s
Sampling 2000000 from 5000000
Using iter_sample_easy 2.7147 s
Now, as your corpus gets very large, materializing the whole iterable into a list will use prohibitively large amounts of memory. But we can still exploit Python's blazing-fast-ness if we can chunk up the problem: basically, we pick a CHUNKSIZE that is "reasonably small," do random.sample on chunks of that size, and then use random.sample again to merge them together. We just have to get the boundary conditions right.
I see how to do it if the length of list(iterable) is an exact multiple of CHUNKSIZE and not bigger than samplesize*CHUNKSIZE:
def iter_sample_dist_naive(iterable, samplesize):
    CHUNKSIZE = 10000
    samples = []
    it = iter(iterable)
    try:
        while True:
            first = next(it)
            chunk = itertools.chain([first], itertools.islice(it, CHUNKSIZE-1))
            samples += iter_sample_easy(chunk, samplesize)
    except StopIteration:
        return random.sample(samples, samplesize)
However, the code above produces a non-uniform sampling when len(list(iterable)) % CHUNKSIZE != 0, and it runs out of memory as len(list(iterable)) * samplesize / CHUNKSIZE becomes "very large." Fixing these bugs is above my pay grade, I'm afraid, but a solution is described in this blog post and sounds quite reasonable to me. (Search terms: "distributed random sampling," "distributed reservoir sampling.")
Sampling 1000 from 10000
Using iter_sample_fast 0.0182 s
Using iter_sample_dist_naive 0.0017 s
Using iter_sample_easy 0.0009 s
Sampling 10000 from 100000
Using iter_sample_fast 0.1830 s
Using iter_sample_dist_naive 0.0402 s
Using iter_sample_easy 0.0103 s
Sampling 100000 from 1000000
Using iter_sample_fast 1.7965 s
Using iter_sample_dist_naive 0.6726 s
Using iter_sample_easy 0.2268 s
Sampling 200000 from 1000000
Using iter_sample_fast 1.7467 s
Using iter_sample_dist_naive 0.8209 s
Using iter_sample_easy 0.3297 s
Where we really win is when samplesize is very small relative to len(list(iterable)).
Sampling 20 from 10000
Using iterSample 0.0202 s
Using sample_from_iterable 0.0047 s
Using iter_sample_fast 0.0196 s
Using iter_sample_easy 0.0001 s
Using iter_sample_dist_naive 0.0004 s
Sampling 20 from 100000
Using iterSample 0.2004 s
Using sample_from_iterable 0.0522 s
Using iter_sample_fast 0.1903 s
Using iter_sample_easy 0.0016 s
Using iter_sample_dist_naive 0.0029 s
Sampling 20 from 1000000
Using iterSample 1.9343 s
Using sample_from_iterable 0.4907 s
Using iter_sample_fast 1.9533 s
Using iter_sample_easy 0.0211 s
Using iter_sample_dist_naive 0.0319 s
Sampling 20 from 10000000
Using iterSample 18.6686 s
Using sample_from_iterable 4.8120 s
Using iter_sample_fast 19.3525 s
Using iter_sample_easy 0.3162 s
Using iter_sample_dist_naive 0.3210 s
Sampling 20 from 100000000
Using iter_sample_easy 2.8248 s
Using iter_sample_dist_naive 3.3817 s
If the population size n is known, here is some memory efficient code that loops over a generator, extracting only the target samples:
from random import sample
from itertools import count, compress
targets = set(sample(range(n), k=10))
for selection in compress(pop, map(targets.__contains__, count())):
print(selection)
This outputs the selections in the order they are produced by the population generator.
The technique is to use the standard library random.sample() to randomly select the target indices for the selections. The second line determines whether a given index is among the targets and, if so, gives the corresponding value from the generator.
For example, given targets of {6, 2, 4}:
0 1 2 3 4 5 6 7 8 9 10 ... output of count()
F F T F T F T F F F F ... is the count in targets?
A B C D E F G H I J K ... output of the population generator
- - C - E - G - - - - ... selections emitted by compress
This technique is suitable for looping over a corpus too large to fit in memory (otherwise, you could just use sample() directly on the population).
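A self-contained run of the same idea might look like the sketch below. The `population` generator, its size `n = 26`, the sample size, and the seed are stand-ins invented for the example; in practice the generator would be the large corpus and `n` would have to be known beforehand.

```python
import string
from itertools import compress, count
from random import Random

def population():
    """Hypothetical generator standing in for a large corpus."""
    yield from string.ascii_uppercase

n = 26                      # population size, assumed known in advance
rng = Random(7)
targets = set(rng.sample(range(n), k=5))

# count() numbers the items; compress keeps those whose index is a target.
selections = list(compress(population(), map(targets.__contains__, count())))
print(sorted(targets), selections)
```

Only the set of target indices is held in memory; the population itself is consumed one item at a time.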
If the number of items in the iterator is known (by elsewhere counting the items), another approach is:
def iter_sample(iterable, iterlen, samplesize):
    if iterlen < samplesize:
        raise ValueError("Sample larger than population.")
    indexes = set()
    while len(indexes) < samplesize:
        # randrange excludes iterlen, so every chosen index exists in the iterable
        indexes.add(random.randrange(iterlen))
    indexesiter = iter(sorted(indexes))
    current = next(indexesiter)
    ret = []
    for i, item in enumerate(iterable):
        if i == current:
            ret.append(item)
            try:
                current = next(indexesiter)
            except StopIteration:
                break
    random.shuffle(ret)
    return ret
I find this quicker, especially when samplesize is small in relation to iterlen. When the whole, or nearly the whole, population is asked for, however, there are issues.
iter_sample (iterlen=10000, samplesize=100) time: (1, 'ms')
iter_sample_fast (iterlen=10000, samplesize=100) time: (15, 'ms')
iter_sample (iterlen=1000000, samplesize=100) time: (65, 'ms')
iter_sample_fast (iterlen=1000000, samplesize=100) time: (1477, 'ms')
iter_sample (iterlen=1000000, samplesize=1000) time: (64, 'ms')
iter_sample_fast (iterlen=1000000, samplesize=1000) time: (1459, 'ms')
iter_sample (iterlen=1000000, samplesize=10000) time: (86, 'ms')
iter_sample_fast (iterlen=1000000, samplesize=10000) time: (1480, 'ms')
iter_sample (iterlen=1000000, samplesize=100000) time: (388, 'ms')
iter_sample_fast (iterlen=1000000, samplesize=100000) time: (1521, 'ms')
iter_sample (iterlen=1000000, samplesize=1000000) time: (25359, 'ms')
iter_sample_fast (iterlen=1000000, samplesize=1000000) time: (2178, 'ms')
Fastest method until proven otherwise when you have an idea about how long the generator is (and will be asymptotically uniformly distributed):
import numpy
def gen_sample(generator_list, sample_size, iterlen):
    num = 0
    inds = numpy.random.random(iterlen) <= (sample_size * 1.0 / iterlen)
    results = []
    iterator = iter(generator_list)
    gotten = 0
    while gotten < sample_size:
        try:
            b = iterator.next()
            if inds[num]:
                results.append(b)
                gotten += 1
            num += 1
        except:
            num = 0
            iterator = iter(generator_list)
            inds = numpy.random.random(iterlen) <= ((sample_size - gotten) * 1.0 / iterlen)
    return results
It is both the fastest on the small iterable as well as the huge iterable (and probably all in between then)
# Huge
res = gen_sample(xrange(5000000), 200000, 5000000)
timing: 1.22s
# Small
z = gen_sample(xrange(10000), 1000, 10000)
timing: 0.000441
Here's a radically different variation that uses a set as a bucket of items. It starts by priming the bucket with pool items, then yields samples from the bucket, replacing them with items from the iterator; finally, it drains what is left of the bucket.
HashWrapper serves to hide unhashable types from set.
import random
from typing import Iterator
class HashWrapper(tuple):
    """Wrap unhashable type."""
    def __hash__(self):
        return id(self)
def randomize_iterator(data: Iterator, pool=100) -> Iterator:
    """
    Randomize an iterator.
    """
    bucket = set()
    iterator = iter(data)
    # Prime the bucket
    for _ in range(pool):
        try:
            bucket.add(HashWrapper(next(iterator)))
        except StopIteration:
            # We've drained the iterator
            break
    # Start picking from the bucket and replacing new items from the iterator
    for item in iterator:
        # random.sample needs a sequence (sets are rejected since Python 3.11)
        sample, = random.sample(tuple(bucket), 1)
        yield sample
        bucket.remove(sample)
        bucket.add(HashWrapper(item))
    # Drain the bucket
    yield from random.sample(tuple(bucket), len(bucket))
I have a time series with about 150 million points. I need to zoom in on 3 million points. That is, I need to extract the 100 time points surrounding each of those 3 million areas of interest in this 150 million point time series.
Attempt:
def get_waveforms(data,spiketimes,lookback=100,lookahead=100):
    answer = zeros((len(spiketimes),(lookback+lookahead)))
    duration = len(data)
    for i in xrange(len(spiketimes)):
        if (spiketimes[i] - lookback) > 0 and (spiketimes[i] + lookahead) < duration:
            answer[i,:] = data[(spiketimes[i]-lookback):(spiketimes[i]+lookahead)]
    return answer
This eats up all available memory on my Mac. It explodes if I try to pass an array where len(array) > 100000. Is there a more memory efficient or (hopefully) more elegant approach to pull out parts of one array based on another?
Related
This answer is related. However, I'm not exactly sure how to apply it and avoid a loop. Would I, effectively, be indexing the time series vector over and over with the columns of a boolean matrix?
You are allocating an array of 200 * len(spiketimes) floats, so for your 100,000 item spiketimes should only be about 160 MB, which doesn't seem like much. On the other hand, if you go to 1,000,000 spiketimes, a 1.6 GB single array may be a stretch for some systems. If you have the memory, you can vectorize the extraction with something like this:
def get_waveforms(data, spiketimes, lookback=100, lookahead=100) :
    offsets = np.arange(-lookback, lookahead)
    indices = spiketimes + offsets[:, None]
    ret = np.take(data, indices, mode='clip')
    ret[:, spiketimes < lookback] = 0
    ret[:, spiketimes + lookahead >= len(data)] = 0
    return ret
The handling of the spiketimes too close to the edges of data mimics that in your function with loops.
The wise thing to do when you have so much data is to take views into it. That is harder to vectorize (or at least I haven't figured how to), but since you aren't copying any of the data, the python loop will not be much slower:
def get_waveforms_views(data, spiketimes, lookback=100, lookahead=100) :
    ret = []
    for j in spiketimes :
        if j < lookback or j + lookahead >= len(data) :
            ret.append(None)
        else :
            ret.append(data[j - lookback:j + lookahead])
    return ret
With the following test data:
data_points, num_spikes = 1000000, 10000
data = np.random.rand(data_points)
spiketimes = np.random.randint(data_points, size=(num_spikes))
I get these timings:
In [2]: %timeit get_waveforms(data, spiketimes)
1 loops, best of 3: 320 ms per loop
In [3]: %timeit get_waveforms_views(data, spiketimes)
1 loops, best of 3: 313 ms per loop
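To see that the slice-based version really avoids copying, one can check whether each returned window shares memory with the original array. The following is a small sketch with made-up data and spike times; `np.shares_memory` is used only to demonstrate the point.

```python
import numpy as np

data = np.arange(1000, dtype=float)
spiketimes = np.array([150, 400, 990])   # the last one is too close to the end
lookback = lookahead = 100

views = []
for j in spiketimes:
    if j < lookback or j + lookahead >= len(data):
        views.append(None)
    else:
        views.append(data[j - lookback:j + lookahead])

# Basic slices are views: they share the buffer of `data`, so no copy is made.
shares = [v is not None and np.shares_memory(v, data) for v in views]
print(shares)
```

Each window costs only a small array header, so memory use stays close to that of `data` itself no matter how many spike times there are.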