I profiled my program, and more than 80% of the time is spent in this one-line function! How can I optimize it? I am running under PyPy, so I'd rather not use NumPy, but since my program spends almost all of its time in this function, giving up PyPy for NumPy might be worth it. However, I would prefer to use CFFI, since that is more compatible with PyPy.
# x, y are lists of 1s and 0s. c_out is a positive int. bit is 1 or 0.
def findCarryIn(x, y, c_out, bit):
    return (2 * c_out +
            bit -
            sum(map(lambda x_bit, y_bit: x_bit & y_bit, x, reversed(y))))  # note: this is basically a dot product
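For completeness, since CFFI is mentioned: one way to attack this under PyPy would be to push the bitwise dot product into a small C helper via cffi's out-of-line API. This is only a rough sketch under stated assumptions; the module name _anddot and the function and_dot are invented for illustration, and it assumes x and y are plain Python lists of 0/1 ints.

# build_anddot.py -- sketch only; the names are illustrative, not from the question
from cffi import FFI

ffibuilder = FFI()
ffibuilder.cdef("long and_dot(const long *x, const long *y, long n);")
ffibuilder.set_source("_anddot", r"""
    long and_dot(const long *x, const long *y, long n) {
        long s = 0;
        for (long i = 0; i < n; i++)
            s += x[i] & y[i];   /* AND-and-sum: the dot product of 0/1 vectors */
        return s;
    }
""")

if __name__ == "__main__":
    ffibuilder.compile(verbose=True)

After running the build script once, the hot function could call into C roughly like this:

# usage sketch, assuming the compiled _anddot module from above
from _anddot import ffi, lib

def findCarryIn(x, y, c_out, bit):
    xs = ffi.new("long[]", x)
    ys = ffi.new("long[]", list(reversed(y)))
    return 2 * c_out + bit - lib.and_dot(xs, ys, len(x))

Note that allocating the C arrays on every call has its own overhead, so whether this actually beats the pure-Python loop under PyPy would need measuring.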
Without using NumPy, and after testing with timeit, the fastest method for the summing you are doing seems to be a simple for loop that accumulates the elements. Example:
def findCarryIn(x, y, c_out, bit):
    s = 0
    for i, j in zip(x, reversed(y)):
        s += i & j
    return (2 * c_out + bit - s)
This did not improve performance by much, though (maybe 20% or so).
The results of the timing tests (with different methods; func4 uses the approach described above):
def func1(x, y):
    return sum(map(lambda x_bit, y_bit: x_bit & y_bit, x, reversed(y)))

def func2(x, y):
    return sum([i & j for i, j in zip(x, reversed(y))])

def func3(x, y):
    return sum(x[i] & y[-1-i] for i in range(min(len(x), len(y))))

def func4(x, y):
    s = 0
    for i, j in zip(x, reversed(y)):
        s += i & j
    return s
In [125]: %timeit func1(x,y)
100000 loops, best of 3: 3.02 µs per loop
In [126]: %timeit func2(x,y)
The slowest run took 6.42 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 2.9 µs per loop
In [127]: %timeit func3(x,y)
100000 loops, best of 3: 4.31 µs per loop
In [128]: %timeit func4(x,y)
100000 loops, best of 3: 2.2 µs per loop
This can definitely be sped up a lot using NumPy. You could define your function like this:
import numpy as np

def find_carry_numpy(x, y, c_out, bit):
    return 2 * c_out + bit - np.sum(x & y[::-1])
Create some random data:
In [36]: n = 100; c = 15; bit = 1
In [37]: x_arr = np.random.rand(n) > 0.5
In [38]: y_arr = np.random.rand(n) > 0.5
In [39]: x_list = list(x_arr)
In [40]: y_list = list(y_arr)
Check that results are the same:
In [42]: find_carry_numpy(x_arr, y_arr, c, bit)
Out[42]: 10
In [43]: findCarryIn(x_list, y_list, c, bit)
Out[43]: 10
Quick speed test:
In [44]: timeit find_carry_numpy(x_arr, y_arr, c, bit)
10000 loops, best of 3: 19.6 µs per loop
In [45]: timeit findCarryIn(x_list, y_list, c, bit)
1000 loops, best of 3: 409 µs per loop
So you gain a factor of 20 in speed! That is a pretty typical speedup when converting Python code to Numpy.
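As the comment in the original code notes, this masked sum is just a dot product of 0/1 vectors, so (a small sketch, assuming integer 0/1 arrays rather than booleans) np.dot should give the same result:

import numpy as np

x_arr = np.random.randint(0, 2, 100)
y_arr = np.random.randint(0, 2, 100)

# for 0/1 integers, x & y == x * y, so the AND-sum is a dot product
assert np.dot(x_arr, y_arr[::-1]) == np.sum(x_arr & y_arr[::-1])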
Binomial coefficient for given values of n and k (nCk): I am using numpy to multiply the results of a for loop, but the numpy method is returning the memory location of a generator rather than the result. Please suggest a better solution in terms of time complexity if possible, or any other improvements.
import time
import numpy

def binomialc(n, k):
    return 1 if k == 0 or k == n else numpy.prod((n + 1 - i) / i for i in range(1, k + 1))

starttime = time.perf_counter()
print(binomialc(600, 298))
print(time.perf_counter() - starttime)
You may want to use: scipy.special.binom()
or, since Python 3.8: math.comb()
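For illustration, a quick usage sketch (assuming SciPy is installed and Python 3.8+ for math.comb):

import math
from scipy.special import binom   # floating-point approximation

print(binom(600, 298))      # fast, but returns a float
print(math.comb(600, 298))  # exact arbitrary-precision integer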
EDIT
I am not quite sure why you would not want to use SciPy when you are OK with NumPy, as SciPy is a well-established library from essentially the same people who develop NumPy.
Anyway, here a couple of other methods:
using math.factorial:
import math

def binom(n, k):
    return math.factorial(n) // math.factorial(k) // math.factorial(n - k)
using prod() and math.factorial() (theoretically more efficient, but not in practice):
def prod(items, start=1):
    for item in items:
        start *= item
    return start

def binom_simplified(n, k):
    if k > n - k:
        return prod(range(k + 1, n + 1)) // math.factorial(n - k)
    else:
        return prod(range(n - k + 1, n + 1)) // math.factorial(k)
using numpy.prod():
import numpy as np

def binom_np(n, k):
    return 1 if k == 0 or k == n else np.prod([(n + 1 - i) / i for i in range(1, k + 1)])
Speed-wise, scipy.special.binom() is by far the fastest, but if you also need the exact value for very large numbers, you may prefer binom() (somewhat surprisingly, even over math.comb()).
%timeit scipy.special.binom(600, 298)
# 1000000 loops, best of 3: 1.56 µs per loop
print(scipy.special.binom(600, 298))
# 1.3332140543730587e+179
%timeit math.comb(600, 298)
# 10000 loops, best of 3: 75.6 µs per loop
print(math.comb(600, 298))
# 133321405437268991724586879878020905773601074858558174180536459530557427686938822154484588609548964189291743543415057988154692680263088796451884071926401665548516571367537285901600
%timeit binom(600, 298)
# 10000 loops, best of 3: 36.5 µs per loop
print(binom(600, 298))
# 133321405437268991724586879878020905773601074858558174180536459530557427686938822154484588609548964189291743543415057988154692680263088796451884071926401665548516571367537285901600
%timeit binom_np(600, 298)
# 10000 loops, best of 3: 45.8 µs per loop
print(binom_np(600, 298))
# 1.3332140543726893e+179
%timeit binom_simplified(600, 298)
# 10000 loops, best of 3: 41.9 µs per loop
print(binom_simplified(600, 298))
# 133321405437268991724586879878020905773601074858558174180536459530557427686938822154484588609548964189291743543415057988154692680263088796451884071926401665548516571367537285901600
Consider two ndarrays of length n, arr1 and arr2. I'm computing the following sum of products, and doing it num_runs times to benchmark:
import numpy as np
import time

num_runs = 1000
n = 100
arr1 = np.random.rand(n)
arr2 = np.random.rand(n)

start_comp = time.clock()
for r in xrange(num_runs):
    sum_prods = np.sum( [arr1[i]*arr2[j] for i in xrange(n)
                         for j in xrange(i+1, n)] )
print "total time for comprehension = ", time.clock() - start_comp

start_loop = time.clock()
for r in xrange(num_runs):
    sum_prod = 0.0
    for i in xrange(n):
        for j in xrange(i+1, n):
            sum_prod += arr1[i]*arr2[j]
print "total time for loop = ", time.clock() - start_loop
The output is
total time for comprehension = 3.23097066953
total time for loop = 3.9045544426
so using list comprehension appears faster.
Is there a much more efficient implementation, using Numpy routines perhaps, to calculate such a sum of products?
Rearrange the operation into an O(n) runtime algorithm instead of O(n^2): the double sum equals sum_i arr1[i] * (sum of arr2[j] for j > i), so you can precompute the suffix sums of arr2 and take advantage of NumPy for the products and sums:
# arr1_weights[i] is the sum of all terms arr1[i] gets multiplied by in the
# original version
arr1_weights = arr2[::-1].cumsum()[::-1] - arr2
sum_prods = arr1.dot(arr1_weights)
Timing shows this to be about 200 times faster than the list comprehension for n == 100.
In [21]: %%timeit
....: np.sum([arr1[i] * arr2[j] for i in range(n) for j in range(i+1, n)])
....:
100 loops, best of 3: 5.13 ms per loop
In [22]: %%timeit
....: arr1_weights = arr2[::-1].cumsum()[::-1] - arr2
....: sum_prods = arr1.dot(arr1_weights)
....:
10000 loops, best of 3: 22.8 µs per loop
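A quick sanity check of this rearrangement against the brute-force double sum (a minimal sketch, independent of the benchmark above):

import numpy as np

n = 100
arr1 = np.random.rand(n)
arr2 = np.random.rand(n)

brute = sum(arr1[i] * arr2[j] for i in range(n) for j in range(i + 1, n))

arr1_weights = arr2[::-1].cumsum()[::-1] - arr2   # suffix sums of arr2, excluding arr2[i] itself
fast = arr1.dot(arr1_weights)

assert np.isclose(brute, fast)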
A vectorized way: np.sum(np.triu(np.multiply.outer(arr1,arr2),1)),
for a 30x improvement:
In [9]: %timeit np.sum(np.triu(np.multiply.outer(arr1,arr2),1))
1000 loops, best of 3: 272 µs per loop
In [10]: %timeit np.sum( [arr1[i]*arr2[j] for i in range(n)
   ....:                  for j in range(i+1, n)] )
100 loops, best of 3: 7.9 ms per loop
In [11]: np.allclose(np.sum(np.triu(np.multiply.outer(arr1,arr2),1)),
   ....:             np.sum( [arr1[i]*arr2[j] for i in range(n)
   ....:                      for j in range(i+1, n)] ))
Out[11]: True
Another fast approach is to use numba:
from numba import jit

@jit
def t(arr1, arr2):
    # n is read from the enclosing scope (the n defined in the question)
    s = 0
    for i in range(n):
        for j in range(i+1, n):
            s += arr1[i]*arr2[j]
    return s
for another ~10x factor:
In [12]: %timeit t(arr1,arr2)
10000 loops, best of 3: 21.1 µs per loop
And using @user2357112's minimal rearrangement,
@jit
def t2357112(arr1, arr2):
    s = 0
    c = 0
    for i in range(n-2, -1, -1):
        c += arr2[i+1]
        s += arr1[i]*c
    return s
which, by doing only the necessary operations, gives:
In [13]: %timeit t2357112(arr1,arr2)
100000 loops, best of 3: 2.33 µs per loop
You can use the following broadcasting trick:
a = np.sum(np.triu(arr1[:,None]*arr2[None,:],1))
b = np.sum( [arr1[i]*arr2[j] for i in xrange(n) for j in xrange(i+1, n)] )
print a == b # True
Basically, I'm paying the price of computing the product of every pair of elements in arr1 and arr2 in order to take advantage of NumPy broadcasting/vectorization, which runs much faster in low-level code.
And timings:
%timeit np.sum(np.triu(arr1[:,None]*arr2[None,:],1))
10000 loops, best of 3: 55.9 µs per loop
%timeit np.sum( [arr1[i]*arr2[j] for i in xrange(n) for j in xrange(i+1, n)] )
1000 loops, best of 3: 1.45 ms per loop
I am trying to get a fast vectorized version of the following loop:
for i in xrange(N1):
    A[y[i]] -= B[i,:]
Here A.shape = (N2, N3), y.shape = (N1,) with y taking values in [0, N2), and B.shape = (N1, N3). You can think of the entries of y as indices into rows of A. Here N1 is large, N2 is pretty small, and N3 is smallish.
I thought simply doing
A[y] -= B
would work, but the issue is that there are repeated entries in y and this does not do the right thing (i.e., if y=[1,1] then A[1] is only updated once, not twice). Also, this does not seem to be any faster than the unvectorized for loop.
Is there a better way of doing this?
EDIT: YXD linked to this answer in the comments, which at first seems to fit the bill. It would seem you can do exactly what I want with
np.subtract.at(A, y, B)
and it does work; however, when I try to run it, it is significantly slower than the unvectorized version. So the question remains: is there a more performant way of doing this?
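To make the repeated-index behaviour concrete, a tiny demonstration with invented values:

import numpy as np

A = np.zeros(3)
y = np.array([1, 1])          # repeated index
B = np.array([5.0, 7.0])

A[y] -= B                     # buffered fancy indexing: only the last update to A[1] sticks
print(A)                      # [ 0. -7.  0.]  (not -12)

A = np.zeros(3)
np.subtract.at(A, y, B)       # unbuffered: both updates are applied
print(A)                      # [  0. -12.   0.]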
EDIT2: An example, to make things concrete:
n1,n2,n3 = 10000, 10, 500
A = np.random.rand(n2,n3)
y = np.random.randint(n2, size=n1)
B = np.random.rand(n1,n3)
The for loop, when run using %timeit in ipython gives on my machine:
10 loops, best of 3: 19.4 ms per loop
The subtract.at version produces the same value for A in the end, but is much slower:
1 loops, best of 3: 444 ms per loop
The code for the original for-loop based approach would look something like this -
def for_loop(A):
    N1 = B.shape[0]
    for i in xrange(N1):
        A[y[i]] -= B[i,:]
    return A
Case #1
If n2 >> n3, I would suggest this vectorized approach -
def bincount_vectorized(A):
    n3 = A.shape[1]
    nrows = y.max()+1
    # Give each (row-of-A, column) pair a unique bin id, then let np.bincount
    # sum the corresponding entries of B into those bins.
    id = y[:,None] + nrows*np.arange(n3)
    A[:nrows] -= np.bincount(id.ravel(), B.ravel()).reshape(n3, nrows).T
    return A
Runtime tests -
In [203]: n1,n2,n3 = 10000, 500, 10
...: A = np.random.rand(n2,n3)
...: y = np.random.randint(n2, size=n1)
...: B = np.random.rand(n1,n3)
...:
...: # Make copies
...: Acopy1 = A.copy()
...: Acopy2 = A.copy()
...:
In [204]: %timeit for_loop(Acopy1)
10 loops, best of 3: 19 ms per loop
In [205]: %timeit bincount_vectorized(Acopy2)
1000 loops, best of 3: 779 µs per loop
Case #2
If n2 << n3, a modified for-loop approach with fewer loop iterations could be used -
def for_loop_v2(A):
    n2 = A.shape[0]
    for i in range(n2):
        A[i] -= np.einsum('ij->j', B[y==i])  # or equivalently (B[y==i]).sum(0)
    return A
Runtime tests -
In [206]: n1,n2,n3 = 10000, 10, 500
...: A = np.random.rand(n2,n3)
...: y = np.random.randint(n2, size=n1)
...: B = np.random.rand(n1,n3)
...:
...: # Make copies
...: Acopy1 = A.copy()
...: Acopy2 = A.copy()
...:
In [207]: %timeit for_loop(Acopy1)
10 loops, best of 3: 24.2 ms per loop
In [208]: %timeit for_loop_v2(Acopy2)
10 loops, best of 3: 20.3 ms per loop
Using sympy I have two lists:
terms = [1, x, x*(x-1)]
coefficients = [-1,8.1,7]
I need to get the output:
-1 + 8.1*x + 7*x*(x-1)
I tried:
print (sum(a,x__i) for a, x__i in izip(terminos,coeficientes))
But I actually got a generator object instead. I also tried something based on this working code:
def lag_l(xx, j):
    x = Symbol("x")
    parts = ((x - x_i) / (xx[j] - x_i) for i, x_i in enumerate(xx) if i != j)
    return prod(parts)

def lag_L(xx, yy):
    return sum(y*lag_l(xx, j) for j, y in enumerate(yy))
How can I complete this?
In [159]: import sympy as sy
In [160]: from sympy.abc import x
In [161]: terms = [1, x, x*(x-1)]
In [162]: coefficients = [-1,8.1,7]
In [163]: sum(t*c for t, c in zip(terms, coefficients))
Out[163]: 7*x*(x - 1) + 8.1*x - 1
Interestingly, sum(term*coef for term, coef in zip(terms, coefficients)) is a bit faster than sum(coef * term for coef, term in zip(coefficients, terms)):
In [182]: %timeit sum(term * coef for term, coef in zip(terms, coefficients))
10000 loops, best of 3: 34.1 µs per loop
In [183]: %timeit sum(coef * term for coef, term in zip(coefficients, terms))
10000 loops, best of 3: 38.7 µs per loop
The reason is that coef * term first calls coef.__mul__(term), which returns NotImplemented because ints do not know how to multiply with sympy objects; Python then falls back to term.__rmul__(coef). That extra function call makes coef * term slower than term * coef. (term * coef calls term.__mul__(coef) directly.)
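A minimal, self-contained illustration of that __mul__/__rmul__ fallback (the Toy class is invented here; it just stands in for a sympy expression):

class Toy:
    def __mul__(self, other):
        print("Toy.__mul__")
        return "product"
    def __rmul__(self, other):
        print("Toy.__rmul__")
        return "product"

t = Toy()
t * 3   # Toy.__mul__ is called directly
3 * t   # int.__mul__(3, t) returns NotImplemented, so Python falls back to Toy.__rmul__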
Here are some more microbenchmarks:
In [178]: %timeit sum(IT.imap(op.mul, coefficients, terms))
10000 loops, best of 3: 38 µs per loop
In [186]: %timeit sum(IT.imap(op.mul, terms, coefficients))
10000 loops, best of 3: 32.8 µs per loop
In [179]: %timeit sum(map(op.mul, coefficients, terms))
10000 loops, best of 3: 38.5 µs per loop
In [188]: %timeit sum(map(op.mul, terms, coefficients))
10000 loops, best of 3: 33.3 µs per loop
Notice that the order of the terms and coefficients matters, but otherwise there is little time difference between these variants. For larger input, they also perform about the same:
In [203]: terms = [1, x, x*(x-1)] * 100000
In [204]: coefficients = [-1,8.1,7] * 100000
In [205]: %timeit sum(IT.imap(op.mul, terms, coefficients))
1 loops, best of 3: 3.63 s per loop
In [206]: %timeit sum(term * coef for term, coef in zip(terms, coefficients))
1 loops, best of 3: 3.63 s per loop
In [207]: %timeit sum(map(op.mul, terms, coefficients))
1 loops, best of 3: 3.48 s per loop
Also be aware that if you do not know (through profiling) that this operation is a critical bottleneck in your code, worrying about these slight differences is a waste of your time: the time it takes to pre-optimize this stuff is far greater than the time you save while the code is running. As they say, premature optimization is the root of all evil. I'm probably already guilty of that.
In Python2,
sum(IT.imap(op.mul, coefficients, terms))
uses the least memory.
In Python3, zip and map return iterators, so
sum(t*c for t, c in zip(terms, coefficients))
sum(map(op.mul, coefficients, terms))
would also be memory-efficient.
Using a simple generator expression:
sum(coef * term for coef, term in zip(coefficients, terms))
Alternatively, instead of using zip you could use something like a zip_with helper:
def zip_with(operation, *iterables):
    for elements in zip(*iterables):
        yield operation(*elements)
And use it as:
import operator as op
sum(zip_with(op.mul, coefficients, terms))
As unutbu mentioned, Python already provides such a function: itertools.imap in Python 2 and the built-in map in Python 3, so you can avoid re-writing it and use either:
sum(itertools.imap(op.mul, coefficients, terms))
or
sum(map(op.mul, coefficients, terms))  # Python 3
Note that Python 2's map works slightly differently when you pass more than one sequence and the lengths differ.
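A small illustration of that difference (Python 3 shown; in Python 2, map() pads the shorter sequence with None instead of stopping):

import operator as op

coefficients = [-1, 8.1, 7, 99]   # deliberately one element longer
terms = [1, 2, 3]

# Python 3's map stops at the shortest input
print(list(map(op.mul, coefficients, terms)))   # [-1, 16.2, 21]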
If I have a list of numbers such as l = [3,5,3,6,47,89], we can calculate the minimum, maximum and average using the following Python code:
minimum = min(l)
maximum = max(l)
avg = sum(l) / len(l)
Since each of these iterates over the entire list, this is slow for large lists and a fair amount of code. Is there a Python module which can calculate all of these values together?
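For reference, a single-pass pure-Python version is easy to write, though it is not necessarily faster than the C-implemented builtins (a sketch only):

def min_max_avg(seq):
    it = iter(seq)
    first = next(it)
    lo = hi = total = first
    count = 1
    for v in it:
        if v < lo:
            lo = v
        elif v > hi:
            hi = v
        total += v
        count += 1
    return lo, hi, total / count

print(min_max_avg([3, 5, 3, 6, 47, 89]))  # (3, 89, 25.5)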
Cython function:
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def minmaxAvg(list x):
    cdef int i
    cdef int _min, _max, total
    _min = x[0]
    _max = x[0]
    total = 0
    for i in x:
        if i < _min: _min = i
        elif i > _max: _max = i
        total += i
    return _min, _max, total/len(x)
pure python function to compare against:
def builtinfuncs(x):
    a = min(x)
    b = max(x)
    avg = sum(x) / len(x)
    return a, b, avg
In [16]: x = [random.randint(0,1000) for _ in range(10000)]
In [17]: %timeit minmaxAvg(x)
10000 loops, best of 3: 34 µs per loop
In [18]: %timeit builtinfuncs(x)
1000 loops, best of 3: 460 µs per loop
Disclaimer:
- Speed result from cython will be dependent on computer hardware.
- Not as flexible and foolproof as using builtins. You would have to change the function to handle anything but integers for example.
- Before going down this path, you should ask yourself if this operation really is a big bottleneck in your application. It's probably not.
If you have pandas installed, you can do something like this:
import numpy as np
import pandas
s = pandas.Series(np.random.normal(size=37))
stats = s.describe()
stats will be another Series that behaves like a dictionary:
print(stats)
count 37.000000
mean 0.072138
std 0.932000
min -1.267888
25% -0.688728
50% -0.048624
75% 0.784244
max 2.501713
dtype: float64
stats['max']
2.501713
...etc. However, I don't recommend this unless you're striving simply for concise code. Here's why:
%%timeit
stats = s.describe()
# 100 loops, best of 3: 1.44 ms per loop
%%timeit
mymin = min(s)
mymax = max(s)
myavg = sum(s)/len(s)
# 10000 loops, best of 3: 89.5 µs per loop
I just can't imagine that you'll be able to squeeze any more performance out of the built-in functions with your own implementations (barring some cython voodoo, maybe).