I'm experimenting with NumPy to see how and where it is faster than using generic list comprehensions in Python. Here's a standard coding question I'm using for this experiment.
Find the sum of all the multiples of 3 or 5 below 1000000.
I have written three functions to compute this number.
def fA(M):
    sum = 0
    for x in range(M):
        if x % 3 == 0 or x % 5 == 0:
            sum += x
    return sum
def fB(M):
    multiples_3 = range(0, M, 3)
    multiples_5 = range(0, M, 5)
    multiples_15 = range(0, M, 15)
    return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)
def fC(M):
    arr = np.arange(M)
    return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])
I first did a quick sanity check to see that the three functions produced the same answer.
I then used timeit to compare the runtimes for the three functions.
%timeit -n 100 fA(1000000)
100 loops, best of 3: 182 ms per loop
%timeit -n 100 fB(1000000)
100 loops, best of 3: 14.4 ms per loop
%timeit -n 100 fC(1000000)
100 loops, best of 3: 44 ms per loop
It's no surprise that fA is the slowest. But why is fB so much better than fC? Is there a better way to compute this answer using NumPy?
I don't think size is an issue here. In fact, if I change the 1e6 to 1e9, fC becomes even slower when compared to fB.
fB is so much faster than fC because fC is not the NumPy equivalent of fB. fC is the NumPy equivalent of fA. This is the NumPy equivalent of fB:
def fD(M):
    multiples_3 = np.arange(0, M, 3)
    multiples_5 = np.arange(0, M, 5)
    multiples_15 = np.arange(0, M, 15)
    return multiples_3.sum() + multiples_5.sum() - multiples_15.sum()
It runs way faster:
In [4]: timeit fB(1000000)
100 loops, best of 3: 9.96 ms per loop
In [5]: timeit fD(1000000)
1000 loops, best of 3: 637 µs per loop
In fB you construct ranges containing exactly the multiples you want. The ranges shrink as the step grows from 3 to 5 to 15, so each takes less time to construct than the one before; once they are constructed, you only need to take their sums and do a little arithmetic.
In fC you are constructing a 1,000,000-element array. Its size isn't really the issue so much as the two modulo comparisons, which must touch every single element of the array; they take the lion's share (about 90%) of fC's execution time.
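If you want to check where fC spends its time yourself, a minimal sketch (assuming an IPython session with the %timeit magic, as in the question; exact numbers will vary by machine) is to time its pieces separately:

arr = np.arange(1000000)
%timeit arr % 3 == 0                                              # one full pass over the array
%timeit np.logical_or(arr % 3 == 0, arr % 5 == 0)                 # both modulo passes plus the or
%timeit np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])    # the whole of fC's body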
You're only really using NumPy there to generate an array. You'd see a much bigger difference if you were performing operations on arrays rather than on lists or tuples. With regard to that particular problem, take a look at the function fD in the code below, which just calculates how many multiples there should be in each range and then computes their sum directly, rather than generating an array. If you run the snippet below, you'll see how the timings change as a function of M. Also, fC breaks down for M >= 100000; I couldn't tell you why.
import numpy as np
from time import time

def fA(M):
    sum = 0
    for x in range(M):
        if x % 3 == 0 or x % 5 == 0:
            sum += x
    return sum

def fB(M):
    multiples_3 = range(0, M, 3)
    multiples_5 = range(0, M, 5)
    multiples_15 = range(0, M, 15)
    return sum(multiples_3) + sum(multiples_5) - sum(multiples_15)

def fC(M):
    arr = np.arange(M)
    return np.sum(arr[np.logical_or(arr % 3 == 0, arr % 5 == 0)])

def fD(M):
    return sum_mult(M, 3) + sum_mult(M, 5) - sum_mult(M, 15)

def sum_mult(M, n):
    # closed-form sum of the multiples of n below M (arithmetic series)
    instances = (M - 1) / n
    check = len(range(n, M, n))  # sanity check for the count, unused
    return (n * instances * (instances + 1)) / 2

for x in range(5, 20):
    print "*" * 20
    M = 2 ** x
    print M
    answers = []
    T = []
    for f in (fA, fB, fC, fD):
        ts = time()
        answers.append(f(M))
        for i in range(20):
            f(M)
        T.append(time() - ts)
    if not all([x == answers[0] for x in answers]):
        print "Warning! Answers do not match!", answers
    print T
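For reference, fD in the snippet above relies on the arithmetic-series formula: there are k = (M-1)//n multiples of n below M, and their sum is n*k*(k+1)/2. A quick sanity check (my own sketch, not part of the original benchmark):

M, n = 1000000, 3
k = (M - 1) // n
assert n * k * (k + 1) // 2 == sum(range(n, M, n))  # closed form matches the explicit range sum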
I am trying to write functions which emulate math.sin and math.tan but, instead of using the math library, performing the calculation using a series expansion.
The formulae are from Mathematics SE, How would you calculate the Tangent without a calculator?:
sin(x) = x − x^3/3! + x^5/5! −...
tan(x) = sin(x) / √(1 − sin(x)^2)
This is my attempt, but I could not figure out how to perform the sign flipping + / - / + / ... part of the series expansion for sin:
from math import factorial

res = 0
for i in [1, 3, 5, 7, 9, 11, 13, 15, 17, 19]:
    res += 1**i/factorial(i)

print(res)  # 1.1752011936438016
The result is not correct because I have not applied a + / - switch. I could add an if / else clause but this seems messy. Is there a better way?
Note: This question is an embellished version of a now deleted question that was posted yesterday by @Lana.
You can avoid recalculating x**n and the factorial at each step by calculating the next term of the sum using the previous one:
def sin2(x, n=20):
    curr = x
    res = curr
    for i in range(2, n, 2):
        curr *= - x**2/(i*(i+1))
        res += curr
    return res
Compared to jpp's version, it's about twice as fast:
from math import factorial

def sin(x, n=20):
    return sum(x**j/factorial(j)*(1 if i%2==0 else -1)
               for i, j in enumerate(range(1, n, 2)))
%timeit sin(0.7)
# 100000 loops, best of 3: 8.52 µs per loop
%timeit sin2(0.7)
# 100000 loops, best of 3: 4.54 µs per loop
And it can get a bit faster if we calculate - x**2 once and for all:
def sin3(x, n=20):
    curr = x
    res = 0
    minus_x_squared = - x**2
    for i in range(2, n, 2):
        res += curr
        curr *= minus_x_squared/(i*(i+1))
    return res
%timeit sin2(0.7)
# 100000 loops, best of 3: 4.6 µs per loop
%timeit sin3(0.7)
# 100000 loops, best of 3: 3.54 µs per loop
You are close. Below is one way using sum with enumerate for your series expansion.
enumerate works by taking each value of an iterable and attaching an index, i.e. 0 for the first item, 1 for the second item, etc. Then we only need to test whether the index is even or odd and use a ternary statement.
In addition, you can use range instead of listing the odd numbers required in your expansion.
from math import factorial

def sin(x, n=20):
    return sum(x**j/factorial(j)*(1 if i%2==0 else -1)
               for i, j in enumerate(range(1, n, 2)))
def tan(x):
    return sin(x) / (1-(sin(x))**2)**0.5

print(tan(1.2))  # 2.572151622126318
You can avoid the need for a ternary statement and enumerate altogether:
def sin(x, n=20):
    return sum((-1)**i * x**(2*i+1) / factorial(2*i+1) for i in range(n))
If you write out the first few terms by hand, the equivalence will become clear.
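For example, writing out the terms for i = 0, 1, 2:

(-1)**0 * x**1 / 1!  =  x
(-1)**1 * x**3 / 3!  = -x^3/3!
(-1)**2 * x**5 / 5!  =  x^5/5!

which reproduces sin(x) = x − x^3/3! + x^5/5! − ... term by term.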
Notes:
The sign of the tan function is only correct for 1st and 4th quadrants. This is consistent with the formulae you have provided. You can perform a trivial transformation to the input to account for this.
You can improve accuracy by increasing parameter n.
You can also calculate factorial without a library, but I'll leave that as an exercise.
I am performing a large number of these calculations:
A == A[np.newaxis].T
where A is a dense numpy array which frequently has common values.
For benchmarking purposes we can use:
n = 30000
A = np.random.randint(0, 1000, n)
A == A[np.newaxis].T
When I perform this calculation, I run into memory issues. I believe this is because the output isn't in a more efficient bitarray or np.packbits format. A secondary concern is that we are performing twice as many comparisons as necessary, since the resulting Boolean array is symmetric.
The questions I have are:
Is it possible to produce the Boolean numpy array output in a more memory efficient fashion without sacrificing speed? The options I know about are bitarray and np.packbits, but I only know how to apply these after the large Boolean array is created.
Can we utilise the symmetry of our calculation to halve the number of comparisons processed, again without sacrificing speed?
I will need to be able to perform & and | operations on Boolean arrays output. I have tried bitarray, which is super-fast for these bitwise operations. But it is slow to pack np.ndarray -> bitarray and then unpack bitarray -> np.ndarray.
[Edited to provide clarification.]
Here's one with numba to give us a NumPy boolean array as output -
from numba import njit

@njit
def numba_app1(idx, n, s, out):
    # idx holds the boundaries of groups of equal values in sorted order,
    # s holds the argsort indices of A
    for i, j in zip(idx[:-1], idx[1:]):
        s0 = s[i:j]
        c = 0
        for p1 in s0[c:]:
            for p2 in s0[c+1:]:
                out[p1, p2] = 1
                out[p2, p1] = 1
            c += 1
    return out
def app1(A):
    s = A.argsort()
    b = A[s]
    n = len(A)
    idx = np.flatnonzero(np.r_[True, b[1:] != b[:-1], True])
    out = np.zeros((n, n), dtype=bool)
    numba_app1(idx, n, s, out)
    out.ravel()[::out.shape[1]+1] = 1  # set the diagonal
    return out
Timings -
In [287]: np.random.seed(0)
...: n = 30000
...: A = np.random.randint(0, 1000, n)
# Original soln
In [288]: %timeit A == A[np.newaxis].T
1 loop, best of 3: 317 ms per loop
# @Daniel F's soln-1 that skips assigning lower diagonal in output
In [289]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 450 ms per loop
# @Daniel F's soln-2 (complete one)
In [291]: %timeit sparse_outer_eq(A)
1 loop, best of 3: 634 ms per loop
# Solution from this post
In [292]: %timeit app1(A)
10 loops, best of 3: 66.9 ms per loop
This isn't even a numpy answer, but it should keep your data requirements down by using a bit of homebrewed sparse notation.
from numba import jit

@jit  # because this is gonna be loopy
def sparse_outer_eq(A):
    n = A.size
    c = []
    for i in range(n):
        for j in range(i + 1, n):
            if A[i] == A[j]:
                c.append((i, j))
    return c
Now c is a list of coordinate tuples (i, j), with i < j, corresponding to the coordinates in your boolean array that are "True". You can easily do "and" and "or" operations on these setwise:
list(set(c1) & set(c2))
list(set(c1) | set(c2))
Later, when you want to apply this mask to an array, you can back out the coordinates and use them for fancy indexing instead:
i_, j_ = list(np.array(c).T)
i = np.r_[i_, j_, np.arange(n)]
j = np.r_[j_, i_, np.arange(n)]
You can then np.lexsort i and j if you care about order.
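As a rough usage sketch (data here is a hypothetical n x n array, introduced only for illustration):

data = np.arange(n * n, dtype=float).reshape(n, n)  # hypothetical array to be masked
selected = data[i, j]                               # fancy indexing picks out the "True" positions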
Alternatively, you can define sparse_outer_eq as:
@jit
def sparse_outer_eq(A):
    n = A.size
    c = []
    for i in range(n):
        for j in range(n):
            if A[i] == A[j]:
                c.append((i, j))
    return c
Which keeps >2x the data, but then the coordinates come out simply:
i, j = list(np.array(c).T)
If you've done any set operations, this will still need to be lexsorted if you want a rational order.
If your coordinates are each n-bit integers, this should be more space-efficient than boolean format as long as your sparsity is less than 1/n -> 3% or so for 32-bit.
As for time, thanks to numba it's even faster than broadcasting:
n = 3000
A = np.random.randint(0, 1000, n)
%timeit sparse_outer_eq(A)
100 loops, best of 3: 4.86 ms per loop
%timeit A == A[:, None]
100 loops, best of 3: 11.8 ms per loop
and comparisons:
a = A == A[:, None]
b = B == B[:, None]
a_ = sparse_outer_eq(A)
b_ = sparse_outer_eq(B)
%timeit a & b
100 loops, best of 3: 5.9 ms per loop
%timeit list(set(a_) & set(b_))
1000 loops, best of 3: 641 µs per loop
%timeit a | b
100 loops, best of 3: 5.52 ms per loop
%timeit list(set(a_) | set(b_))
1000 loops, best of 3: 955 µs per loop
EDIT: if you want to do &~ (as per your comment) use the second sparse_outer_eq method (so you don't have to keep track of the diagonal) and just do:
list(set(a_) - set(b_))
Here is the more or less canonical argsort solution:
import numpy as np

def f_argsort(A):
    idx = np.argsort(A)
    As = A[idx]
    ne_ = np.r_[True, As[:-1] != As[1:], True]
    bnds = np.flatnonzero(ne_)
    valid = np.diff(bnds) != 1
    return [idx[bnds[i]:bnds[i+1]] for i in np.flatnonzero(valid)]
n = 30000
A = np.random.randint(0, 1000, n)
groups = f_argsort(A)
for grp in groups:
    print(len(grp), set(A[grp]), end=' ')
print()
I'm adding a solution to my question because it satisfies these 3 properties:
Low, fixed, memory requirement
Fast bitwise operations (&, |, ~, etc)
Low storage, 1-bit per Boolean via packing integers
The downside is it is stored in np.packbits format. It is substantially slower than other methods (especially argsort), but if speed is not an issue the algorithm should work well. If anyone figures a way to optimise further, this would be very helpful.
Update: A more efficient version of the below algorithm can be found here: Improving performance on comparison algorithm np.packbits(A==A[:, None], axis=1).
import numpy as np
from numba import jit

@jit(nopython=True)
def bool2int(x):
    # pack an 8-element boolean section into a single uint8 (most significant bit first)
    y = 0
    for i, j in enumerate(x):
        if j: y += int(j) << (7 - i)
    return y

@jit(nopython=True)
def compare_elementwise(arr, result, section):
    n = len(arr)
    for row in range(n):
        for col in range(n):
            section[col % 8] = arr[row] == arr[col]
            if ((col + 1) % 8 == 0) or (col == (n - 1)):
                result[row, col // 8] = bool2int(section)
                section[:] = 0
    return result

A = np.random.randint(0, 10, 100)
n = len(A)

result_arr = np.zeros((n, n // 8 if n % 8 == 0 else n // 8 + 1)).astype(np.uint8)
selection_arr = np.zeros(8).astype(np.uint8)

packed = compare_elementwise(A, result_arr, selection_arr)
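If you later need a plain boolean row back out of the packed result, np.unpackbits reverses the packing; a minimal sketch using the n and packed defined above:

row = 0
row_bools = np.unpackbits(packed[row])[:n].astype(bool)  # unpack one row and drop the padding bits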
I wanted to generate 1 or -1 in Python as a step toward randomizing between non-negative and non-positive numbers, or toward randomly changing the sign of an existing integer. What would be the best way to generate 1 or -1 in Python? Assuming an even distribution, I know I could use:
import random
#method1
my_number = random.choice((-1, 1))
#method2
my_number = (-1)**random.randrange(2)
#method3
# if I understand correctly random.random() should never return exactly 1
# so I use "<", not "<="
if random.random() < 0.5:
    my_number = 1
else:
    my_number = -1
#method4
my_number = random.randint(0,1)*2-1
Using timeit module I got the following results:
#method1
s = "my_number = random.choice((-1, 1))"
timeit.timeit(stmt = s, setup = "import random")
>2.814896769857569
#method2
s = "my_number = (-1)**random.randrange(2)"
timeit.timeit(stmt = s, setup = "import random")
>3.521280517518562
#method3
s = """
if random.random() < 0.5: my_number = 1
else: my_number = -1"""
timeit.timeit(stmt = s, setup = "import random")
>0.25321546903273884
#method4
s = "random.randint(0,1)*2-1"
timeit.timeit(stmt = s, setup = "import random")
>4.526625442240402
So, unexpectedly, method 3 is the fastest. My bet was on method 1 being the fastest, as it is also the shortest. Also, both method 1 (since Python 3.6, I think?) and method 3 make it possible to introduce uneven distributions. Although method 1 is the shortest (its main advantage), for now I would choose method 3:
def positive_or_negative():
    if random.random() < 0.5:
        return 1
    else:
        return -1
Testing:
s = """
import random
def positive_or_negative():
if random.random() < 0.5:
return 1
else:
return -1
"""
timeit.timeit(stmt = "my_number = positive_or_negative()", setup = s)
>0.3916183138621818
Any better (faster or shorter) method to randomly generate -1 or 1 in Python? Any reason why would you choose method 1 over method 3 or vice versa?
A one-liner variation of #3:
return 1 if random.random() < 0.5 else -1
It's fast(er) than the 'math' variants, because it doesn't involve additional arithmetic.
Here's another one-liner that my timings show to be faster than the if/else comparison to 0.5:
[-1,1][random.randrange(2)]
Not sure what your application is exactly, but I needed something similar for a large vectorized array.
Here's a good way to get a sign array:
(2*np.random.randint(0,2,size=(your_size))-1)
The result is an array, for example:
array([-1, -1, -1, 1, 1, 1, -1, -1, 1, 1, 1, -1, -1, 1, -1])
and you can use the reshape command to get the above to the size of your matrix:
(2*np.random.randint(0,2,size=(m*n))-1).reshape(m,n)
Then you can multiply a matrix by the above and get all of the members with random signs.
A = np.array([[1, 2, 3],
              [4, 5, 6]])
B = A*(2*np.random.randint(0,2,size=(2*3))-1).reshape(2,3)
Then you get something like:
B = array([[ 1, 2, -3],[ 4, 5, -6]])
Pretty quick, if your data is vectorized.
Maths made simple:
Generate a random number: 0 or 1
Multiply it by 2: 0 or 2
Subtract 1: -1 or 1
Adapt that to any programming language. No need for test functions.
print(random.randint(0,1)*2-1)
It also works without randint:
print(int(random.random()*2)*2-1)
The fastest way to generate random numbers if you're going to be doing lots of them is by using numpy:
In [1]: import numpy as np
In [2]: import random
In [3]: %timeit [random.choice([-1,1]) for i in range(100000)]
10 loops, best of 3: 88.9 ms per loop
In [4]: %timeit [(-1)**random.randrange(2) for i in range(100000)]
10 loops, best of 3: 110 ms per loop
In [5]: %timeit [1 if random.random() < 0.5 else -1 for i in range(100000)]
100 loops, best of 3: 18.4 ms per loop
In [6]: %timeit [random.randint(0,1)*2-1 for i in range(100000)]
1 loop, best of 3: 180 ms per loop
In [7]: %timeit np.random.choice([-1,1],size=100000)
1000 loops, best of 3: 1.52 ms per loop
If you need single bits (one per call), you already did your benchmark and other answers provide additional info.
If you need many bits or can pre-calculate bit-arrays for later consumption, numpy's methods might shine.
Here is another demo approach using numpy (which surprisingly does not have a method dedicated to exactly this job):
import numpy as np
import random

def sample_bits(N):
    assert N % 8 == 0  # demo only
    n_bytes = N // 8
    # note: randint's upper bound is exclusive, so 256 covers the full uint8 range
    rbytes = np.random.randint(0, 256, dtype=np.uint8, size=n_bytes)
    return np.unpackbits(rbytes)

def alt(N):
    return np.random.choice([-1, 1], size=N)

def alt2(N):
    return [1 if random.random() < 0.5 else -1 for i in range(N)]

if __name__ == '__main__':
    import timeit
    print(timeit.timeit("sample_bits(1024)", setup="from __main__ import sample_bits", number=10000))
    print(timeit.timeit("alt(1024)", setup="from __main__ import alt", number=10000))
    print(timeit.timeit("alt2(1024)", setup="from __main__ import alt2", number=10000))
Output:
0.06640421246836543
0.352129537507486
1.5522800431775592
The general idea is:
use numpy to generate many uint8's in one step
(there might be something better using internal functions without the randint-API)
unpack uint8's to 8 bits
uniformity follows from randint's uniformity guarantees
Again, this is only a demo:
for one specific case
not caring about different result-types of these functions
not caring about -1 vs. 0 (might be important in your use-case; see the sketch below)
(not even optimal compared to much more low-level approaches; MT used internally can be used as a bit-source, which does not need fp-math, like many other PRNGs!)
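If the -1 vs. 0 point matters in your use-case, the unpacked bits can still be mapped to ±1 with one extra vectorized step (a sketch on top of sample_bits above, not included in the timings):

bits = sample_bits(1024)               # array of 0/1 values (dtype uint8)
signs = 2 * bits.astype(np.int8) - 1   # maps 0 -> -1 and 1 -> +1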
My code is as follows:
from array import array
from random import randint

vals = array("i", [-1, 1])

def my_rnd():
    return vals[randint(0, 7) % 2]
Versions of this question have already been asked but I have not found a satisfactory answer.
Problem: given a large numpy vector, find indices of the vector elements which are duplicated (a variation of that could be comparison with tolerance).
So the problem is ~O(N^2) and memory bound (at least from the current algorithm's point of view). I wonder why, whatever I tried, Python is 100x or more slower than equivalent C code.
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1

dupl = []
print("init done")
counter = 0
for i in range(N):
    for j in range(i+1, N):
        if vect[i] == vect[j]:
            dupl.append(j)
            counter += 1

print("counter =", counter)
print(dupl)

# For simplicity, this code ignores repeated indices
# which can be trimmed later. Ref output is
# counter = 3
# [2500, 5000, 5000]
I tried using numpy iterators, but they are even worse (~4-5x slower):
http://docs.scipy.org/doc/numpy/reference/arrays.nditer.html
Using N=10,000 I'm getting 0.1 sec in C, 12 sec in Python (code above), 40 sec in Python using np.nditer, 50 sec in Python using np.ndindex. I pushed it to N=160,000 and the timing scales as N^2 as expected.
Since the answers have stopped coming and none was totally satisfactory, for the record I post my own solution.
It is my understanding that it's the assignment which makes Python slow in this case, not the nested loops as I thought initially. Using a library or compiled code eliminates the need for assignments and performance improves dramatically.
from __future__ import print_function
import numpy as np
from numba import jit

N = 10000
vect = np.arange(N, dtype=np.float32)
vect[N/2] = 1
vect[N/4] = 1
dupl = np.zeros(N, dtype=np.int32)
print("init done")

# uncomment to enable compiled function
#@jit
def duplicates(i, counter, dupl, vect):
    eps = 0.01
    ns = len(vect)
    for j in range(i+1, ns):
        # replace if to use approx comparison
        #if abs(vect[i] - vect[j]) < eps:
        if vect[i] == vect[j]:
            dupl[counter] = j
            counter += 1
    return counter

counter = 0
for i in xrange(N):
    counter = duplicates(i, counter, dupl, vect)

print("counter =", counter)
print(dupl[0:counter])
Tests
# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 10.135 s
# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.480 s
The performance of the compiled version (with @jit uncommented) is close to the C code performance, ~0.1-0.2 sec. Perhaps eliminating the last loop could improve performance even further. The performance difference is even larger when using the approximate comparison with eps, while there is very little difference for the compiled version.
# no jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 109.218 s
# with jit
$ time python array-test-numba.py
init done
counter = 3
[2500 5000 5000]
elapsed 0.506 s
This is a ~200x difference. In the real code I had to put both loops in the function, as well as use a function template with variable types, so it was a bit more complex, but not by much.
Python itself is a highly dynamic, slow language. The idea in numpy is to use vectorization and avoid explicit loops. In this case, you can use np.equal.outer. You can start with
a = np.equal.outer(vect, vect)
Now, for example, to find the sum:
>>> np.sum(a)
10006
To find the indices of the elements that have a duplicate elsewhere, you can do
np.fill_diagonal(a, 0)
>>> np.nonzero(np.any(a, axis=0))[0]
array([ 1, 2500, 5000])
Timing
def find_vec():
    a = np.equal.outer(vect, vect)
    s = np.sum(a)
    np.fill_diagonal(a, 0)
    return np.sum(a), np.nonzero(np.any(a, axis=0))[0]
>>> %timeit find_vec()
1 loops, best of 3: 214 ms per loop
def find_loop():
    dupl = []
    counter = 0
    for i in range(N):
        for j in range(i+1, N):
            if vect[i] == vect[j]:
                dupl.append(j)
                counter += 1
    return dupl
>>> %timeit find_loop()
1 loops, best of 3: 8.51 s per loop
This solution using the numpy_indexed package has complexity n Log n, and is fully vectorized; so not terribly different from C performance, in all likelihood.
import numpy_indexed as npi
dpl = np.flatnonzero(npi.multiplicity(vect) > 1)
The obvious question is why you want to do this in this way. NumPy arrays are intended to be opaque data structures – by this I mean NumPy arrays are intended to be created inside the NumPy system and then operations sent in to the NumPy subsystem to deliver a result. i.e. NumPy should be a black box into which you throw requests and out come results.
So given the code above I am not at all surprised that NumPy performance is worse than dreadful.
The following should be effectively what you want, I believe, but done the NumPy way:
import numpy as np
N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1
print([np.where(a == vect)[0] for a in vect][1])
# Delivers [1, 2500, 5000]
Approach #1
You can simulate that iterator dependency criterion for a vectorized solution using a triangular matrix. This is based on this post that dealt with multiplication involving iterator dependency. For performing the elementwise equality of each element in vect against all its elements, we can use NumPy broadcasting. Finally, we can use np.count_nonzero to get the count, as it's supposed to be very efficient at summing boolean arrays.
So, we would have a solution like so -
mask = np.triu(vect[:,None] == vect,1)
counter = np.count_nonzero(mask)
dupl = np.where(mask)[1]
If you only care about the count counter, we could have two more approaches as listed next.
Approach #2
We can avoid the triangular matrix entirely: get the full count, subtract the contribution from the diagonal elements, and then halve the remaining count to keep just one of the lower or upper triangular regions, since their contributions are identical.
So, we would have a modified solution like so -
counter = (np.count_nonzero(vect[:,None] == vect) - vect.size)//2
Approach #3
Here's an entirely different approach, which uses the fact that the count of each unique element contributes to the final total through a cumulative sum.
So, with that idea in mind, we would have a third approach like so -
count = np.bincount(vect) # OR np.unique(vect,return_counts=True)[1]
idx = count[count>1]
id_arr = np.ones(idx.sum(),dtype=int)
id_arr[0] = 0
id_arr[idx[:-1].cumsum()] = -idx[:-1]+1
counter = np.sum(id_arr.cumsum())
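To see why the cumsum trick works, here is a small hypothetical walk-through (my own illustration): a value appearing k times contributes 0 + 1 + ... + (k-1) = k*(k-1)/2 pairs, which is exactly what the original counter counts.

idx = np.array([3, 2])                      # suppose two values are duplicated, 3 and 2 times
id_arr = np.ones(idx.sum(), dtype=int)      # [1, 1, 1, 1, 1]
id_arr[0] = 0                               # [0, 1, 1, 1, 1]
id_arr[idx[:-1].cumsum()] = -idx[:-1] + 1   # [0, 1, 1, -2, 1]
print(id_arr.cumsum())                      # [0 1 2 0 1] -> a 0..k-1 ramp per group
print(id_arr.cumsum().sum())                # 4 == 3 + 1 == C(3,2) + C(2,2) pairs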
As an alternative to Ami Tavory's answer, you can use a Counter from the collections package to detect duplicates. On my computer it seems to be even faster. See the function below, which also tells the distinct duplicated values apart.
import collections
import numpy as np

def find_duplicates_original(x):
    d = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            if x[i] == x[j]:
                d.append(j)
    return d

def find_duplicates_outer(x):
    a = np.equal.outer(x, x)
    np.fill_diagonal(a, 0)
    return np.flatnonzero(np.any(a, axis=0))

def find_duplicates_counter(x):
    counter = collections.Counter(x)
    values = (v for v, c in counter.items() if c > 1)
    return {v: np.flatnonzero(x == v) for v in values}
n = 10000
x = np.arange(float(n))
x[n // 2] = 1
x[n // 4] = 1
>>> find_duplicates_counter(x)
{1.0: array([ 1, 2500, 5000], dtype=int64)}
>>> %timeit find_duplicates_original(x)
1 loop, best of 3: 12 s per loop
>>> %timeit find_duplicates_outer(x)
10 loops, best of 3: 84.3 ms per loop
>>> %timeit find_duplicates_counter(x)
1000 loops, best of 3: 1.63 ms per loop
This runs in 8 ms compared to 18 s for your code and doesn't use any strange libraries. It's similar to the approach by @vs0, but I like defaultdict more. It should be approximately O(N).
from collections import defaultdict

dupl = []
counter = 0
indexes = defaultdict(list)
for i, e in enumerate(vect):
    indexes[e].append(i)
    if len(indexes[e]) > 1:
        dupl.append(i)
        counter += 1
I wonder why whatever I tried Python is 100x or more slower than an equivalent C code.
Because Python programs are usually 100x slower than C programs.
You can either implement critical code paths in C and provide Python-C bindings, or change the algorithm. You can write an O(N) version by using a dict that reverses the array from value to index.
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1

dupl = {}
print("init done")
counter = 0
for i in range(N):
    e = dupl.get(vect[i], None)
    if e is None:
        dupl[vect[i]] = [i]
    else:
        e.append(i)
        counter += 1

print("counter =", counter)
print([(k, v) for k, v in dupl.items() if len(v) > 1])
Edit:
If you need to test against an eps with abs(vect[i] - vect[j]) < eps, you can normalize the values by eps:
abs(vect[i] - vect[j]) < eps ->
abs(vect[i] - vect[j]) / eps < (eps / eps) ->
abs(vect[i]/eps - vect[j]/eps) < 1
int(abs(vect[i]/eps - vect[j]/eps)) = 0
Like this:
import numpy as np

N = 10000
vect = np.arange(float(N))
vect[N/2] = 1
vect[N/4] = 1

dupl = {}
print("init done")
counter = 0
eps = 0.01
for i in range(N):
    k = int(vect[i] / eps)
    e = dupl.get(k, None)
    if e is None:
        dupl[k] = [i]
    else:
        e.append(i)
        counter += 1

print("counter =", counter)
print([(k, v) for k, v in dupl.items() if len(v) > 1])
If I have numpy arrays A and B, then I can compute the trace of their matrix product with:
tr = numpy.linalg.trace(A.dot(B))
However, the matrix multiplication A.dot(B) unnecessarily computes all of the off-diagonal entries in the matrix product, when only the diagonal elements are used in the trace. Instead, I could do something like:
tr = 0.0
for i in range(n):
    tr += A[i, :].dot(B[:, i])
but this performs the loop in Python code and isn't as obvious as numpy.linalg.trace.
Is there a better way to compute the trace of a matrix product of numpy arrays? What is the fastest or most idiomatic way to do this?
You can improve on @Bill's solution by reducing intermediate storage to the diagonal elements only:
from numpy.core.umath_tests import inner1d
m, n = 1000, 500
a = np.random.rand(m, n)
b = np.random.rand(n, m)
# They all should give the same result
print np.trace(a.dot(b))
print np.sum(a*b.T)
print np.sum(inner1d(a, b.T))
%timeit np.trace(a.dot(b))
10 loops, best of 3: 34.7 ms per loop
%timeit np.sum(a*b.T)
100 loops, best of 3: 4.85 ms per loop
%timeit np.sum(inner1d(a, b.T))
1000 loops, best of 3: 1.83 ms per loop
Another option is to use np.einsum and have no explicit intermediate storage at all:
# Will print the same as the others:
print np.einsum('ij,ji->', a, b)
On my system it runs slightly slower than using inner1d, but it may not hold for all systems, see this question:
%timeit np.einsum('ij,ji->', a, b)
100 loops, best of 3: 1.91 ms per loop
From Wikipedia, you can calculate the trace using the Hadamard product (element-wise multiplication):
# Tr(A.B)
tr = (A*B.T).sum()
I think this takes less computation than doing numpy.trace(A.dot(B)).
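The identity behind this is just the definitions of the matrix product and the trace written out in index form:

Tr(A.B) = sum_i (A.B)[i,i] = sum_i sum_j A[i,j] * B[j,i] = sum over all (i,j) of A[i,j] * B.T[i,j]

i.e. the element-wise product of A and B.T summed over every entry, which is exactly what (A*B.T).sum() computes.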
Edit:
Ran some timers. This way is much faster than using numpy.trace.
In [37]: timeit("np.trace(A.dot(B))", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[38]: 8.6434469223022461
In [39]: timeit("(A*B.T).sum()", setup="""import numpy as np;
A, B = np.random.rand(1000,1000), np.random.rand(1000,1000)""", number=100)
Out[40]: 0.5516049861907959
Note that one slight variant is to take the dot product of the vectorized matrices. In Python, vectorization is done using .flatten('F'). On my computer it's slightly slower than taking the sum of the Hadamard product, so it's a worse solution than wflynny's, but I think it's interesting because it can be more intuitive in some situations. For example, I personally find that for the matrix normal distribution, the vectorized solution is easier to understand.
Speed comparison, on my system:
import numpy as np
import time

N = 1000

np.random.seed(123)

A = np.random.randn(N, N)
B = np.random.randn(N, N)

start = time.time()
for i in range(10):
    C = np.trace(A.dot(B))
print(time.time() - start, C)

start = time.time()
for i in range(10):
    C = A.flatten('F').dot(B.T.flatten('F'))
print(time.time() - start, C)

start = time.time()
for i in range(10):
    C = (A.T * B).sum()
print(time.time() - start, C)

start = time.time()
for i in range(10):
    C = (A * B.T).sum()
print(time.time() - start, C)
Result:
6.246593236923218 -629.370798672
0.06539678573608398 -629.370798672
0.057890892028808594 -629.370798672
0.05709719657897949 -629.370798672