Versions of this question have already been asked but I have not found a satisfactory answer.
Problem: given a large numpy vector, find indices of the vector elements which are duplicated (a variation of that could be comparison with tolerance).
So the algorithm is ~O(N^2) and memory bound (at least from the current algorithm's point of view). I wonder why, whatever I tried, Python is 100x or more slower than equivalent C code.
import numpy as np
N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1
dupl = []
print("init done")
counter = 0
for i in range(N):
    for j in range(i+1, N):
        if vect[i] == vect[j]:
            dupl.append(j)
            counter += 1
print("counter =", counter)
print(dupl)
# For simplicity, this code ignores repeated indices
# which can be trimmed later. Ref output is
# counter = 3
# [2500, 5000, 5000]
I tried using numpy iterators, but they are even worse (about 4-5x slower):
http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
Using N=10,000 I'm getting 0.1 sec in C, 12 sec in Python (code above), 40 sec in Python using np.nditer, 50 sec in Python using np.ndindex. I pushed it to N=160,000 and the timing scales as N^2 as expected.
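For illustration, here is roughly what an np.ndindex-based attempt looks like (a sketch, not necessarily the exact iterator code timed above). Only the index generation changes; every comparison is still executed by the interpreter, so the O(N^2) loop stays slow:
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1

dupl = []
counter = 0
# np.ndindex yields all (i, j) pairs; the filter j > i restricts to the upper triangle
for i, j in np.ndindex(N, N):
    if j > i and vect[i] == vect[j]:
        dupl.append(j)
        counter += 1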
Since the answers have stopped coming and none was totally satisfactory, for the record I post my own solution.
It is my understanding that it's the assignment which makes Python slow in this case, not the nested loops as I thought initially. Using a library or compiled code eliminates the need for assignments and performance improves dramatically.
from __future__ import print_function
import numpy as np
from numba import jit
N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)
print("init done")
# uncomment to enable compiled function
#@jit
def duplicates(i, counter, dupl, vect):
    eps = 0.01
    ns = len(vect)
    for j in range(i+1, ns):
        # replace if to use approx comparison
        #if abs(vect[i] - vect[j]) < eps:
        if vect[i] == vect[j]:
            dupl[counter] = j
            counter += 1
    return counter
counter = 0
for i in xrange(N):
    counter = duplicates(i, counter, dupl, vect)
print("counter =", counter)
print(dupl[0:counter])
Tests
# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 10.135 s
# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.480 s
The performance of the compiled version (with @jit uncommented) is close to the C code performance, ~0.1-0.2 sec. Perhaps eliminating the last loop could improve the performance even further. The difference in performance is even stronger when using the approximate comparison with eps, while there is very little difference for the compiled version.
# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 109.218 s
# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.506 s
This is a ~200x difference. In the real code, I had to put both loops in the function, as well as use a function template with variable types, so it was a bit more complex, but not by much.
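For illustration, a minimal sketch of what moving both loops into the compiled function might look like (a reconstruction, not the actual production code; the function name and the nopython flag are choices made here):
from numba import jit
import numpy as np

@jit(nopython=True)
def find_duplicates(vect, dupl):
    # Both loops run inside compiled code, so no Python-level
    # assignments remain on the hot path.
    counter = 0
    for i in range(len(vect)):
        for j in range(i + 1, len(vect)):
            if vect[i] == vect[j]:
                dupl[counter] = j
                counter += 1
    return counter

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N // 2] = 1
vect[N // 4] = 1
dupl = np.zeros(N, dtype=np.int32)
counter = find_duplicates(vect, dupl)
print("counter =", counter)   # counter = 3, as in the reference output
print(dupl[:counter])         # [2500 5000 5000]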
Python itself is a highly dynamic, slow language. The idea in NumPy is to use vectorization and avoid explicit loops. In this case, you can use np.equal.outer. You can start with
a = np.equal.outer(vect, vect)
Now, for example, to find the sum:
>>> np.sum(a)
10006
To find the indices i of equal elements, you can do
np.fill_diagonal(a, 0)
>>> np.nonzero(np.any(a, axis=0))[0]
array([ 1, 2500, 5000])
Timing
def find_vec():
    a = np.equal.outer(vect, vect)
    s = np.sum(a)
    np.fill_diagonal(a, 0)
    return np.sum(a), np.nonzero(np.any(a, axis=0))[0]
>>> %timeit find_vec()
1 loops, best of 3: 214 ms per loop
def find_loop():
    dupl = []
    counter = 0
    for i in range(N):
        for j in range(i+1, N):
            if vect[i] == vect[j]:
                dupl.append(j)
                counter += 1
    return dupl
>>> %timeit find_loop()
1 loops, best of 3: 8.51 s per loop
This solution using the numpy_indexed package has O(n log n) complexity and is fully vectorized, so it is likely not terribly different from C performance.
import numpy_indexed as npi
dpl = np.flatnonzero(npi.multiplicity(vect) > 1)
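If you prefer not to add a dependency, a roughly equivalent O(n log n) sketch using plain NumPy (my own sketch, not part of numpy_indexed) is:
import numpy as np

# Count the occurrences of each value, then map the counts back onto every element.
unique_vals, inverse, counts = np.unique(vect, return_inverse=True, return_counts=True)
dpl = np.flatnonzero(counts[inverse] > 1)
For the example vector both versions flag indices 1, 2500 and 5000, i.e. every position whose value occurs more than once.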
The obvious question is why you want to do this in this way. NumPy arrays are intended to be opaque data structures – by this I mean NumPy arrays are intended to be created inside the NumPy system and then operations sent in to the NumPy subsystem to deliver a result. i.e. NumPy should be a black box into which you throw requests and out come results.
So given the code above, I am not at all surprised that NumPy performance is worse than dreadful.
The following should be effectively what you want, I believe, but done the NumPy way:
import numpy as np
N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1
print([np.where(a == vect)[0] for a in vect][1])
# Delivers [1, 2500, 5000]
Approach #1
You can simulate that iterator-dependency criterion for a vectorized solution using a triangular matrix. This is based on this post that dealt with multiplication involving iterator dependency. To perform the elementwise equality of each element in vect against all its elements, we can use NumPy broadcasting. Finally, we can use np.count_nonzero to get the count, as it's supposed to be very efficient for summing boolean arrays.
So, we would have a solution like so -
mask = np.triu(vect[:,None] == vect,1)
counter = np.count_nonzero(mask)
dupl = np.where(mask)[1]
If you only care about the count counter, we could have two more approaches as listed next.
Approach #2
We can avoid the triangular matrix: get the entire count, subtract the contribution from the diagonal elements, and then halve the remaining count to consider just one of the lower or upper triangular regions, as the contributions from either would be identical.
So, we would have a modified solution like so -
counter = (np.count_nonzero(vect[:,None] == vect) - vect.size)//2
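As a quick check against the numbers earlier in the thread: the full comparison gives np.count_nonzero(vect[:,None] == vect) == 10006 (the same 10006 reported above for np.sum(np.equal.outer(vect, vect))), so counter = (10006 - 10000) // 2 = 3, matching the reference output.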
Approach #3
Here's an entirely different approach that uses the fact that the count of each unique element contributes to the final total through a cumulative sum.
So, with that idea in mind, we would have a third approach like so -
count = np.bincount(vect.astype(int)) # bincount needs integer input; OR np.unique(vect,return_counts=True)[1]
idx = count[count>1]
id_arr = np.ones(idx.sum(),dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1]+1
counter = np.sum(id_arr.cumsum())
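For comparison (a sketch of mine, not part of the original answer): a value occurring c times contributes 0 + 1 + ... + (c-1) = c*(c-1)/2 pairs to the cumsum above, so the same counter can be computed directly from the counts:
counts = np.unique(vect, return_counts=True)[1]
counter = int((counts * (counts - 1) // 2).sum())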
As an alternative to Ami Tavory's answer, you can use a Counter from the collections package to detect duplicates. On my computer it seems to be even faster. See the function below which can also find different duplicates.
import collections
import numpy as np
def find_duplicates_original(x):
    d = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                d.append(j)
    return d

def find_duplicates_outer(x):
    a = np.equal.outer(x, x)
    np.fill_diagonal(a, 0)
    return np.flatnonzero(np.any(a, axis=0))

def find_duplicates_counter(x):
    counter = collections.Counter(x)
    values = (v for v, c in counter.items() if c > 1)
    return {v: np.flatnonzero(x == v) for v in values}
n = 10000
x = np.arange(float(n))
x[n // 2] = 1
x[n // 4] = 1
>>> find_duplicates_counter(x)
{1.0: array([ 1, 2500, 5000], dtype=int64)}
>>> %timeit find_duplicates_original(x)
1 loop, best of 3: 12 s per loop
>>> %timeit find_duplicates_outer(x)
10 loops, best of 3: 84.3 ms per loop
>>> %timeit find_duplicates_counter(x)
1000 loops, best of 3: 1.63 ms per loop
This runs in 8 ms compared to 18 s for your code and doesn't use any strange libraries. It's similar to the approach by @vs0, but I like defaultdict more. It should be approximately O(N).
from collections import defaultdict
dupl = []
counter = 0
indexes = defaultdict(list)
for i, e in enumerate(vect):
    indexes[e].append(i)
    if len(indexes[e]) > 1:
        dupl.append(i)
        counter += 1
I wonder why, whatever I tried, Python is 100x or more slower than equivalent C code.
Because Python programs are usually 100x slower than C programs.
You can either implement critical code paths in C and provide Python-C bindings, or change the algorithm. You can write an O(N) version by using a dict that reverses the array from value to index.
import numpy as np
N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1
dupl = {}
print("init done")
counter = 0
for i in range(N):
    e = dupl.get(vect[i], None)
    if e is None:
        dupl[vect[i]] = [i]
    else:
        e.append(i)
        counter += 1
print("counter =", counter)
print([(k, v) for k, v in dupl.items() if len(v) > 1])
Edit:
If you need to test against an eps with abs(vect[i] - vect[j]) < eps, you can normalize the values by eps:
abs(vect[i] - vect[j]) < eps ->
abs(vect[i] - vect[j]) / eps < (eps / eps) ->
abs(vect[i]/eps - vect[j]/eps) < 1
int(abs(vect[i]/eps - vect[j]/eps)) = 0
Like this:
import numpy as np
N = 10000
vect = np.arange(float(N))
vect[N // 2] = 1
vect[N // 4] = 1
dupl = {}
print("init done")
counter = 0
eps = 0.01
for i in range(N):
    k = int(vect[i] / eps)
    e = dupl.get(k, None)
    if e is None:
        dupl[k] = [i]
    else:
        e.append(i)
        counter += 1
print("counter =", counter)
print([(k, v) for k, v in dupl.items() if len(v) > 1])
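One caveat (an addition here, not part of the original answer): bucketing by int(vect[i] / eps) can place two values that are within eps of each other into adjacent buckets (e.g. 0.999 and 1.001 with eps = 0.01). If that matters, a follow-up pass comparing each bucket with its right neighbour recovers those pairs; a rough sketch using the dupl, vect and eps defined above:
close_pairs = set()
for k, idxs in dupl.items():
    # values within eps of each other end up in bucket k or k+1
    candidates = idxs + dupl.get(k + 1, [])
    for a in range(len(candidates)):
        for b in range(a + 1, len(candidates)):
            i, j = candidates[a], candidates[b]
            if abs(vect[i] - vect[j]) < eps:
                close_pairs.add((min(i, j), max(i, j)))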
Related
In the numpy library, one can pass a list into the numpy.searchsorted function, whereby it searches through a different list one element at a time and returns an array of the same size containing the indices needed to preserve order. However, it seems to waste performance if both lists are sorted. For example:
m=[1,3,5,7,9]
n=[2,4,6,8,10]
numpy.searchsorted(m,n)
would return [1, 2, 3, 4, 5], which is the correct answer, but it looks like this has complexity O(n log(m)), whereas if one were to simply loop through m and keep some kind of pointer into n, the complexity should be more like O(n+m). Is there some kind of function in NumPy which does this?
AFAIK, it is not possible to do that in linear time with NumPy alone, without making additional assumptions on the inputs (e.g. the integers are small and bounded). An alternative solution is to use Numba to do the merge manually:
import numpy as np
import numba as nb

# Note: Numba requires a function signature with well-defined array types
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted(a, b):
    i, j = 0, 0
    result = np.empty(b.size, np.int64)
    while i < a.size and j < b.size:
        if a[i] < b[j]:
            i += 1
        else:
            result[j] = i
            j += 1
    for k in range(j, b.size):
        result[k] = i
    return result
a, b = np.cumsum(np.random.randint(0, 100, (2, 1000000)).astype(np.int64), axis=1)
result = search_both_sorted(a, b)
A faster implementation consists in using a branchless approach to remove the overhead of branch mispredictions (especially on random/unpredictable inputs) when a and b are about the same size. Additionally, the O(n log m) algorithm can be faster when b is small, so using np.searchsorted in that case is very efficient, as pointed out by @MichaelSzczesny. Note that the Numba implementation of np.searchsorted can be a bit slower than the NumPy one, so it is better to pick the NumPy implementation in that case. Here is the optimized version:
@nb.njit('int64[:](int64[::1], int64[::1])')
def search_both_sorted_opt_numba(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if sb < sa * 0.15:
        # Use a version with branches because `a[i] < b[j]`
        # should be true most of the time.
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            if a[i] < b[j]:
                i += 1
            else:
                result[j] = i
                j += 1
        for k in range(j, b.size):
            result[k] = i
    else:
        # Use a branchless approach to avoid mispredictions
        i, j = 0, 0
        result = np.empty(b.size, np.int64)
        while i < a.size and j < b.size:
            tmp = a[i] < b[j]
            result[j] = i
            i += tmp
            j += ~tmp
        for k in range(j, b.size):
            result[k] = i
    return result

def search_both_sorted_opt(a, b):
    sa, sb = a.size, b.size
    # Choose the best algorithm
    if 2 * sb * np.log2(sa) < sa + sb:
        return np.searchsorted(a, b)
    else:
        return search_both_sorted_opt_numba(a, b)
searchsorted: 19.1 ms
snp_search: 11.8 ms
search_both_sorted: 6.5 ms
search_both_sorted_branchless: 4.3 ms
The optimized branchless Numba implementation is about 4.4 times faster than searchsorted which is pretty good considering that the code of searchsorted is already highly optimized. It can be even faster when a and b are huge because of cache locality.
You could use sortednp; unfortunately it does not give too much flexibility. In the code snippet below I used its merge with index tracking, but it produces three arrays, so four times more memory than necessary is used. Still, it is faster than searchsorted.
import numpy as np
import sortednp as snp
a = np.cumsum(np.random.rand(1000000))
b = np.cumsum(np.random.rand(1000000))
def snp_search(a, b):
    m, (ib, ia) = snp.merge(b, a, indices=True)
    return ib - np.arange(len(ib))
assert(np.all(snp_search(a,b) == np.searchsorted(a,b)))
np.searchsorted(a, b); #58 ms
snp_search(a,b); # 22ms
np.searchsorted takes this into account already as can be seen from the source code:
/*
 * Updating only one of the indices based on the previous key
 * gives the search a big boost when keys are sorted, but slightly
 * slows down things for purely random ones.
 */
if (cmp(last_key_val, key_val)) {
    max_idx = arr_len;
}
else {
    min_idx = 0;
    max_idx = (max_idx < arr_len) ? (max_idx + 1) : arr_len;
}
Here min_idx, max_idx are used to perform binary search on the array. If last_key_val < key_val then only max_idx is reset to the array length, but min_idx remains at its current value, i.e. binary search starts at the same lower boundary as for the previous key.
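To illustrate the idea in Python (a sketch of the concept, assuming the keys in b are sorted as in this question; it is not a translation of the actual C code):
import numpy as np

def searchsorted_sorted_keys(a, b):
    # With b sorted, the insertion point of b[j] can never lie to the left of
    # the one found for b[j-1], so keep the lower boundary between searches.
    result = np.empty(len(b), dtype=np.intp)
    lo = 0
    for j, key in enumerate(b):
        lo = np.searchsorted(a[lo:], key) + lo
        result[j] = lo
    return result

m = np.array([1, 3, 5, 7, 9])
n = np.array([2, 4, 6, 8, 10])
print(searchsorted_sorted_keys(m, n))   # [1 2 3 4 5], same as np.searchsorted(m, n)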
import numpy as np
import matplotlib.pyplot as plt
from numpy import random
import time
from collections import Counter
def simulation(N):
    I = 10
    success = 0
    M = 100
    for i in range(I):
        s = allocate(N, M)
        M -= s
        success += s
    return success

def allocate(N, M):
    count = Counter(random.randint(N, size=M))
    success = sum(j for v, j in count.items() if j == 1)
    return success

if __name__ == "__main__":
    start = time.perf_counter()
    SAMPLE_SIZE = 100000
    N = np.linspace(5, 45, 41).astype(int)
    Ps = []
    for n in N:
        ps = []
        for _ in range(SAMPLE_SIZE):
            ps.append(simulation(n)/100)
        result = np.average(np.array(ps))
        Ps.append(result)
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    plt.scatter(N, Ps)
    plt.show()
Here is my situation. The ultimate goal is to set SAMPLE_SIZE to 10^7. However, when I set it to 10^5, it already requires about 1000 sec to run. Is there any way to make it more efficient and faster? Thanks for giving me suggestions.
First of all, the implementation of allocate is not very efficient: you can use vectorized NumPy functions to do that:
def allocate(N, M):
    success = np.count_nonzero(np.bincount(random.randint(N, size=M)) == 1)
    return success
The thing is, most of the time comes from the overhead of NumPy functions performing checks and creating temporary arrays. You can use Numba to fix this problem:
import numpy as np
import numba as nb

@nb.njit('int_(int_, int_)')
def allocate(N, M):
    tmp = np.zeros(N, np.int_)
    for i in range(M):
        rnd = np.random.randint(0, N)
        tmp[rnd] += 1
    count = 0
    for i in range(N):
        count += tmp[i] == 1
    return count
Then, you can speed up the code a bit further by applying the Numba decorator @nb.njit('int_(int_)') to the simulation function, so as to avoid the overhead of calling Numba functions from the CPython interpreter.
Finally, you can speed up the main loop by running it in parallel with Numba (and also avoid the use of slow lists). You can also recycle the tmp array so as not to cause too many allocations (which are expensive and do not scale with the number of cores). Here is the resulting final code:
import numpy as np
import matplotlib.pyplot as plt
import time
import numba as nb
# Recycle the `tmp` buffer so as not to do many allocations
@nb.njit('int_(int_, int_, int_[::1])')
def allocate(N, M, tmp):
    tmp.fill(0)
    for i in range(M):
        rnd = np.random.randint(0, N)
        tmp[rnd] += 1
    count = 0
    for i in range(N):
        count += tmp[i] == 1
    return count

@nb.njit('int_(int_)')
def simulation(N):
    I = 10
    success = 0
    M = 100
    tmp = np.zeros(N, np.int_)  # Preallocated buffer
    for i in range(I):
        s = allocate(N, M, tmp)
        M -= s
        success += s
    return success

@nb.njit('float64(int_, int_)', parallel=True)
def compute_ps_avg(n, sample_size):
    ps = np.zeros(sample_size, dtype=np.float64)
    for i in nb.prange(sample_size):
        ps[i] = simulation(n) / 100.0
    # Note that np.average is not yet supported by Numba
    return np.mean(ps)

if __name__ == "__main__":
    start = time.perf_counter()
    SAMPLE_SIZE = 100_000
    N = np.linspace(5, 45, 41).astype(int)
    Ps = [compute_ps_avg(n, SAMPLE_SIZE) for n in N]
    elapsed = (time.perf_counter() - start)
    print("Time used:", elapsed)
    plt.scatter(N, Ps)
    plt.show()
Here are performance results on my 10-core machine:
Initial code: 670.6 s
Optimized Numba code: 3.9 s
The resulting code is 172 times faster.
More than 80% of the time is spent in the generation of random numbers. Thus, if you want the code to be faster, one solution is to speed up the generation of random numbers using a SIMD-optimized random number generator. Unfortunately, AFAIK, it is not possible to (efficiently) achieve this in pure Python. You certainly need to use a native language like C or C++ to do that.
I might be missing the point but it seems like you can replace your simulation with a closed form calculation.
I believe the problem you are solving is find the expected number of boxes with exactly 1 ball in them given a random distribution of M balls in N boxes.
Follow the answer here https://math.stackexchange.com/a/66094 to get the closed form expression (André Nicolas solved it for 10 balls and 5 boxes but you should be able to extrapolate)
(As a side note I will often also write code to confirm that my probability calculations are correct, if this is what you are doing sorry about stating the obvious :P )
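For a single call to allocate (M balls thrown independently and uniformly into N boxes), the standard closed form for the expected number of boxes holding exactly one ball is M*(1 - 1/N)**(M - 1). A small sketch (an independent check, with helper names chosen for illustration) comparing it against a Monte Carlo estimate; the multi-round simulation with removal still needs the extrapolation mentioned above:
import numpy as np

def expected_singletons(N, M):
    # P(a given box holds exactly one ball) = M * (1/N) * (1 - 1/N)**(M - 1);
    # summing over the N boxes gives the expression below.
    return M * (1 - 1 / N) ** (M - 1)

def simulated_singletons(N, M, samples=10_000):
    rng = np.random.default_rng(0)
    throws = rng.integers(0, N, size=(samples, M))
    counts = np.apply_along_axis(np.bincount, 1, throws, minlength=N)
    return (counts == 1).sum(axis=1).mean()

print(expected_singletons(20, 100))   # closed form
print(simulated_singletons(20, 100))  # Monte Carlo estimate, should agree closely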
Assume I would like to do sampling in parallel based on a condition.
For example, given the matrix A, I want to sample p pairs of indices (i, j) such that A[i][j] != 5.
import numpy as np
import random
A = np.random.randint(10, size=(5000, 5000)) # assume this is fixed
p = 400 # sample 400 index
res = set()
cnt = 0
while cnt < p:
    r, c = random.randint(0, A.shape[0]-1), random.randint(0, A.shape[0]-1)
    if A[r, c] != 5 and (r, c) not in res:
        res.add((r, c))
        cnt += 1
Above is my attempt. However, the matrix A and the number of samples p can be very large. Can we do it in parallel, e.g. with joblib or multiprocessing? Or is there any fast way to obtain the rows and cols?
You can use Numba to speed up this code. Numba can generate fast (parallel) functions at runtime using a just-in-time compiler (JIT). Using a smaller datatype like np.int8 saves some memory space and results in a faster execution time: smaller arrays can be read/written faster from/into RAM, and they are more likely to fit in the CPU cache, speeding up random access. While you can parallelize the random picking, this is quite hard, and the creation of threads can be more expensive than the actual computation for the chosen parameters. Still, Numba can improve the speed by a large margin just by (mostly) removing the overhead of the Python interpreter.
Here is the resulting code:
# Initial conditions
import numba as nb
import numpy as np
import random
@nb.njit('int8[:,:](int_, int_)', parallel=True)
def genArray(n, m):
    res = np.empty((n, m), dtype=np.int8)
    # Parallel loop
    for i in nb.prange(n):
        for j in range(m):
            res[i, j] = np.random.randint(10)
    return res
p = 400
A = genArray(5000, 5000)
# Actual computing code
@nb.njit('(int8[:,::1], int_)')
def genPosSet(A, p):
    maxi = A.shape[0]-1
    res = set()
    cnt = 0
    while cnt < p:
        r, c = random.randint(0, maxi), random.randint(0, maxi)
        if A[r, c] != 5 and (r, c) not in res:
            res.add((r, c))
            cnt += 1
    return res
res = genPosSet(A, p)
This implementation of genPosSet takes 64 us on my machine while the initial function takes 1350 us. The new implementation is thus 21 times faster.
Note that the time to create/delete threads (1 thread/core) and share the work between them usually takes from 10 us to 1000 us.
Note that if p is not much smaller than A.size * prob, where prob is the probability of finding a value different from 5, then the current algorithm is not very efficient. In that case, it is better to first filter the locations whose value differs from 5 and then pick randomly among them. If p is not much smaller than A.size, then the best solution is to shuffle all the possible locations that can be picked and finally extract the first p values of the resulting list.
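A rough sketch of that filtering idea (my own, names chosen for illustration): collect the flat positions whose value differs from 5, sample p of them without replacement, and convert back to (row, col) pairs.
import numpy as np

def sample_positions(A, p, rng=None):
    # Worth it when p is a sizeable fraction of the valid positions;
    # the cost is one index array covering the whole matrix.
    rng = np.random.default_rng() if rng is None else rng
    valid = np.flatnonzero(A != 5)                      # flat indices of allowed cells
    chosen = rng.choice(valid, size=p, replace=False)   # p distinct positions
    rows, cols = np.unravel_index(chosen, A.shape)
    return set(zip(rows.tolist(), cols.tolist()))

res = sample_positions(A, p)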
I wanted to generate 1 or -1 in Python as a step to randomizing between non-negative and non-positive numbers or to randomly changing the sign of an existing integer. What would be the best way to generate 1 or -1 in Python? Assuming an even distribution, I know I could use:
import random
#method1
my_number = random.choice((-1, 1))
#method2
my_number = (-1)**random.randrange(2)
#method3
# if I understand correctly random.random() should never return exactly 1
# so I use "<", not "<="
if random.random() < 0.5:
    my_number = 1
else:
    my_number = -1
#method4
my_number = random.randint(0,1)*2-1
Using the timeit module, I got the following results:
#method1
s = "my_number = random.choice((-1, 1))"
timeit.timeit(stmt = s, setup = "import random")
>2.814896769857569
#method2
s = "my_number = (-1)**random.randrange(2)"
timeit.timeit(stmt = s, setup = "import random")
>3.521280517518562
#method3
s = """
if random.random() < 0.5: my_number = 1
else: my_number = -1"""
timeit.timeit(stmt = s, setup = "import random")
>0.25321546903273884
#method4
s = "random.randint(0,1)*2-1"
timeit.timeit(stmt = s, setup = "import random")
>4.526625442240402
So unexpectedly method 3 is the fastest. My bet was on method 1 to be the fastest, as it is also the shortest. Also, both method 1 (since Python 3.6, I think?) and method 3 give the possibility to introduce uneven distributions. Although method 1 is the shortest (its main advantage), for now I would choose method 3:
def positive_or_negative():
    if random.random() < 0.5:
        return 1
    else:
        return -1
Testing:
s = """
import random
def positive_or_negative():
if random.random() < 0.5:
return 1
else:
return -1
"""
timeit.timeit(stmt = "my_number = positive_or_negative()", setup = s)
>0.3916183138621818
Any better (faster or shorter) method to randomly generate -1 or 1 in Python? Any reason why would you choose method 1 over method 3 or vice versa?
A one liner variation of #3:
return 1 if random.random() < 0.5 else -1
It's fast(er) than the 'math' variants, because it doesn't involve additional arithmetic.
Here's another one-liner that my timings show to be faster than the if/else comparison to 0.5:
[-1,1][random.randrange(2)]
Not sure what your application is exactly, but I needed something similar for a large vectorized array.
Here's a good way to get a sign array:
(2*np.random.randint(0,2,size=(your_size))-1)
The result is an array, for example:
array([-1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, -1])
and you can use the reshape command to get the above to the size of your matrix:
(2*np.random.randint(0,2,size=(m*n))-1).reshape(m,n)
Then you can multiply a matrix by the above and get all of the members with random signs.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = A*(2*np.random.randint(0,2,size=(2*3))-1).reshape(2,3)
Then you get something like :
B = array([[ 1, 2, -3],[ 4, 5, -6]])
Pretty quick, if your data is vectorized.
Maths made simple:
Generate a random number: 0 or 1
Multiply it by 2: 0 or 2
Subtract 1: -1 or 1
Adapt that to any programming language. No need for test functions.
print(random.randint(0,1)*2-1)
It also works without randint:
print(int(random.random()*2)*2-1)
The fastest way to generate random numbers if you're going to be doing lots of them is by using numpy:
In [1]: import numpy as np
In [2]: import random
In [3]: %timeit [random.choice([-1,1]) for i in range(100000)]
10 loops, best of 3: 88.9 ms per loop
In [4]: %timeit [(-1)**random.randrange(2) for i in range(100000)]
10 loops, best of 3: 110 ms per loop
In [5]: %timeit [1 if random.random() < 0.5 else -1 for i in range(100000)]
100 loops, best of 3: 18.4 ms per loop
In [6]: %timeit [random.randint(0,1)*2-1 for i in range(100000)]
1 loop, best of 3: 180 ms per loop
In [7]: %timeit np.random.choice([-1,1],size=100000)
1000 loops, best of 3: 1.52 ms per loop
If you need single bits (one per call), you already did your benchmark and other answers provide additional info.
If you need many bits or can pre-calculate bit-arrays for later consumption, numpy's methods might shine.
Here is one more demo approach using numpy (which, surprisingly, does not have a method dedicated to exactly this job):
import numpy as np
import random
def sample_bits(N):
    assert N % 8 == 0  # demo only
    n_bytes = N // 8
    rbytes = np.random.randint(0, 256, dtype=np.uint8, size=n_bytes)  # upper bound 256 so every byte value is possible
    return np.unpackbits(rbytes)

def alt(N):
    return np.random.choice([-1, 1], size=N)

def alt2(N):
    return [1 if random.random() < 0.5 else -1 for i in range(N)]

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("sample_bits(1024)", setup="from __main__ import sample_bits", number=10000))
    print(timeit.timeit("alt(1024)", setup="from __main__ import alt", number=10000))
    print(timeit.timeit("alt2(1024)", setup="from __main__ import alt2", number=10000))
Output:
0.06640421246836543
0.352129537507486
1.5522800431775592
The general idea is:
use numpy to generate many uint8's in one step
(there might be something better using internal functions without the randint-API)
unpack uint8's to 8 bits
uniformity follows from randint's uniformity guarantees
Again, this is only a demo:
for one specific case
not caring about different result-types of these functions
not caring about -1 vs. 0 (might be important in your use-case)
(not even optimal compared to much more low-level approaches; MT used internally can be used as a bit-source, which does not need fp-math, like many other PRNGs!)
My code is as follows:
from array import array
from random import randint

vals = array("i", [-1, 1])

def my_rnd():
    return vals[randint(0, 7) % 2]
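A quick sanity check of the trick (an addition here): randint(0, 7) draws uniformly from the eight values 0..7, four even and four odd, so the % 2 index is unbiased:
from collections import Counter
print(Counter(my_rnd() for _ in range(100000)))   # roughly 50/50 between -1 and 1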
I'm experimenting with NumPy to see how and where it is faster than using generic list comprehensions in Python. Here's a standard coding question I'm using for this experiment.
Find the sum of all the multiples of 3 or 5 below 1000000.
I have written three functions to compute this number.
def fA(M):
    sum = 0
    for x in range(M):
        if x % 3 == 0 or x % 5 == 0:
            sum += x
    return sum

def fB(M):
    multiples_3 = range(0, M, 3)
    multiples_5 = range(0, M, 5)
    multiples_15 = range(0, M, 15)
    return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)

def fC(M):
    arr = np.arange(M)
    return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
I first did a quick sanity check to see that the three functions produced the same answer.
I then used timeit to compare the runtimes for the three functions.
%timeit -n 100 fA(1000000)
100 loops, best of 3: 182 ms per loop
%timeit -n 100 fB(1000000)
100 loops, best of 3: 14.4 ms per loop
%timeit -n 100 fC(1000000)
100 loops, best of 3: 44 ms per loop
It's no surprise that fA is the slowest. But why is fB so much better than fC? Is there a better way to compute this answer using NumPy?
I don't think size is an issue here. In fact, if I change the 1e6 to 1e9, fC becomes even slower when compared to fB.
fB is so much faster than fC because fC is not the NumPy equivalent of fB. fC is the NumPy equivalent of fA. This is the NumPy equivalent of fB:
def fD(M):
    multiples_3 = np.arange(0, M, 3)
    multiples_5 = np.arange(0, M, 5)
    multiples_15 = np.arange(0, M, 15)
    return multiples_3.sum() + multiples_5.sum() - multiples_15.sum()
It runs way faster:
In [4]: timeit fB(1000000)
100 loops, best of 3: 9.96 ms per loop
In [5]: timeit fD(1000000)
1000 loops, best of 3: 637 µs per loop
In fB you are constructing the ranges with exactly the multiples you want. Their sizes shrink from 3 to 5 to 15, so each takes less time to construct than the one before. After they are constructed, you only need to take the sums and do some arithmetic.
In fC you are constructing a 1,000,000-element array; the size isn't really the issue as much as the two modulo comparisons, which must look at every single element in the array. This takes the lion's share of the execution time (about 90%) for fC.
You're only really using numpy there to generate an array. You'd see a much bigger difference if you were performing operations on arrays as opposed to performing them on lists or tuples. With regards to that particular problem, take a look at the function fD in the code below, which just calculates how many multiples there should be in each range and then calculates their sum, rather than generating the array. Actually, if you run the snippet below, you'll see how the times change as a function of M. Also, fC breaks down for M >= 100000; I couldn't tell you why.
import numpy as np
from time import time
def fA(M):
    sum = 0
    for x in range(M):
        if x % 3 == 0 or x % 5 == 0:
            sum += x
    return sum

def fB(M):
    multiples_3 = range(0, M, 3)
    multiples_5 = range(0, M, 5)
    multiples_15 = range(0, M, 15)
    return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)

def fC(M):
    arr = np.arange(M)
    return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])

def fD(M):
    return sum_mult(M, 3) + sum_mult(M, 5) - sum_mult(M, 15)

def sum_mult(M, n):
    instances = (M-1)/n
    check = len(range(n, M, n))
    return (n*instances*(instances+1))/2

for x in range(5, 20):
    print "*"*20
    M = 2**x
    print M
    answers = []
    T = []
    for f in (fA, fB, fC, fD):
        ts = time()
        answers.append(f(M))
        for i in range(20):
            f(M)
        T.append(time()-ts)
    if not all([x == answers[0] for x in answers]):
        print "Warning! Answers do not match!", answers
    print T