I have a mathematical function of the form $f(x)=\sum_{j=0}^N x^j \sin(jx)$ that I would like to compute efficiently in Python. N is of order ~100. This function f is evaluated thousands of times, for all entries x of a huge matrix, and therefore I would like to improve the performance (the profiler indicates that the calculation of f takes up most of the time). In order to avoid the loop in the definition of f I wrote:
def f(x):
    J = np.arange(0, N + 1)
    return np.sum(x**J * np.sin(J * x))
The issue is that if I want to evaluate this function for all entries of a matrix, I would need to use numpy.vectorize first, but as far as I know that is not necessarily faster than a for loop.
Is there an efficient way to perform a calculation of this type?
Welcome to Stack Overflow! ^^
Well, calculating something ** 100 is serious work. But notice that, when you declare your array J, you are forcing your function to calculate x, x^2, x^3, x^4, ... (and so on) independently.
Let us take for example this function (which is what you are using):
def powervector(x, n):
    return x ** np.arange(0, n)
And now this other function, which does not even use NumPy:
def power(x, n):
    result = [1., x]
    aux = x
    for i in range(2, n):
        aux *= x
        result.append(aux)
    return result
Now, let us verify that they both calculate the same thing:
In []: sum(powervector(1.1, 10))
Out[]: 15.937424601000005
In []: sum(power(1.1, 10))
Out[]: 15.937424601000009
Cool, now let us compare the performance of both (in iPython):
In [36]: %timeit sum(powervector(1.1, 10))
The slowest run took 20.42 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 3.52 µs per loop
In [37]: %timeit sum(power(1.1, 10))
The slowest run took 5.28 times longer than the fastest. This could mean that an intermediate result is being cached
1000000 loops, best of 3: 1.13 µs per loop
It is faster because you are not calculating all the powers of x from scratch: you know that x**N == x**(N-1) * x and you take advantage of it.
You could use this to see if your performance improves. Of course you can change power() to use NumPy vectors as output. You can also have a look at Numba, which is easy to try and may improve performance a bit as well.
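For instance, here is a minimal sketch of a NumPy-vector version of power() built on np.cumprod; the name power_np is my own:

import numpy as np

def power_np(x, n):
    # [1, x, x, ..., x] -> cumulative product -> [1, x, x**2, ..., x**(n-1)]
    factors = np.full(n, x)
    factors[0] = 1.0
    return np.cumprod(factors)

This returns the same n powers as power(), but keeps the incremental multiplication inside NumPy.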
As you see, this is only a hint on how to improve some part of your problem. I bet there are a couple of other ways to further improve your code! :-)
Edit
It seems that Numba might not be a bad idea... Simply adding the @numba.jit decorator:
import numba

@numba.jit
def powernumba(x, n):
    result = [1., x]
    aux = x
    for i in range(2, n):
        aux *= x
        result.append(aux)
    return result
Then:
In [52]: %timeit sum(power(1.1, 100))
100000 loops, best of 3: 7.67 µs per loop
In [51]: %timeit sum(powernumba(1.1, 100))
The slowest run took 5.64 times longer than the fastest. This could mean that an intermediate result is being cached
100000 loops, best of 3: 2.64 µs per loop
It seems Numba can do some magic there. ;-)
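Going one step further, you could jit the whole sum rather than just the powers. This is a sketch of my own (f_numba is not from the answer above) and assumes Numba's nopython mode, which supports np.sin on scalars:

import numba
import numpy as np

@numba.jit(nopython=True)
def f_numba(x, n):
    # evaluate f(x) = sum_{j=0}^{n} x**j * sin(j*x) with incremental powers
    total = 0.0
    xj = 1.0  # x**0
    for j in range(n + 1):
        total += xj * np.sin(j * x)
        xj *= x
    return total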
For a scalar x:
>>> import numpy as np
>>> x = 0.5
>>> jj = np.arange(10)
>>> x**jj
array([ 1. , 0.5 , 0.25 , 0.125 , 0.0625 ,
0.03125 , 0.015625 , 0.0078125 , 0.00390625, 0.00195312])
>>> np.sin(jj*x)
array([ 0. , 0.47942554, 0.84147098, 0.99749499, 0.90929743,
0.59847214, 0.14112001, -0.35078323, -0.7568025 , -0.97753012])
>>> (x**jj * np.sin(jj*x)).sum()
0.64489974041068521
Notice the use of the sum method of numpy arrays (equivalently, use np.sum not built-in sum).
If your x is itself an array (say of shape (3,)), use broadcasting:
>>> a = x[:, None]**jj
>>> a.shape
(3, 10)
>>> x[0]**jj == a[0]
array([ True, True, True, True, True, True, True, True, True, True], dtype=bool)
Then sum over the second axis:
>>> res = a * np.sin(jj * x[:, None])
>>> res.shape
(3, 10)
>>> res.sum(axis=1)
array([ 0.01230993, 0.0613201 , 0.17154859])
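Putting this together for the original question, here is a sketch (f_all is my own name) that evaluates f over an entire matrix at once. Note that the intermediate array has shape X.shape + (N+1,), so for very large matrices memory may become the limiting factor:

import numpy as np

N = 100
J = np.arange(N + 1)

def f_all(X):
    # X[..., None] has shape X.shape + (1,); J has shape (N+1,).
    # Broadcasting gives X.shape + (N+1,); sum over the last axis.
    Xj = X[..., None]
    return (Xj**J * np.sin(J * Xj)).sum(axis=-1)

X = np.random.rand(200, 300)
vals = f_all(X)  # same shape as X, no Python-level loop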
Related
I have a function that will always converge to a fixpoint, e.g. f(x) = (x-a)/2 + a. I have a function that will find this fixpoint through repeated invocation of the function:
def find_fix_point(f, x):
    while f(x) > 0.1:
        x = f(x)
    return x
This works fine; now I want to do the same with a vectorized version:
def find_fix_point(f, x):
    while (f(x) > 0.1).any():
        x = f(x)
    return x
However, this is quite inefficient if most of the instances need only about 10 iterations and one needs 1000. What is a fast method to remove `x` entries that have already been found?
The code can use numpy or scipy.
One way to solve this would be to use recursion:
def find_fix_point_recursive(f, x):
    ind = x > 0.1
    if ind.any():
        x[ind] = find_fix_point_recursive(f, f(x[ind]))
    return x
With this implementation, we only call f on the points which need to be updated.
Note that by using recursion we avoid having to do the check x > 0.1 all the time, with each call working on smaller and smaller arrays.
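For concreteness, a small usage sketch with the contraction from the question; the values of a and x here are my own choice:

import numpy as np

a = 0.05
def f(x):
    # example map from the question; its fixed point is x == a
    return (x - a) / 2 + a

x = np.linspace(0.2, 100.0, 5)
print(find_fix_point_recursive(f, x.copy()))  # every entry ends up <= 0.1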
%timeit x = np.zeros(10000); x[0] = 10000; find_fix_point(f, x)
1000 loops, best of 3: 1.04 ms per loop
%timeit x = np.zeros(10000); x[0] = 10000; find_fix_point_recursive(f, x)
10000 loops, best of 3: 141 µs per loop
First, for generality, I change the criterion to match the fixed-point definition: we stop when |x - f(x)| <= epsilon.
You can mix boolean indexing and integer indexing to keep only the active points at each step. Here is one way to do that:
def find_fix_point(f, x, epsilon):
    ind = np.mgrid[:len(x)]  # initial indices
    while ind.size > 0:
        xind = x[ind]        # integer indexing
        yind = f(xind)
        x[ind] = yind        # x is updated in place
        ind = ind[abs(yind - xind) > epsilon]  # boolean indexing
An example with a lot of fixed points:
from matplotlib.pyplot import plot, show

x0 = np.linspace(0, 1, 1000)
x = x0.copy()

def f(x):
    return x * np.sin(1 / x)

find_fix_point(f, x, 1e-5)
plot(x0, x, '.'); show()
The general method is to use boolean indexing to compute only the entries that have not yet reached equilibrium.
I adapted the algorithm given by Jonas Adler to avoid exceeding the maximum recursion depth:
def find_fix_point_vector(f, x):
    x = x.copy()
    x_fix = np.empty(x.shape)
    unfixed = np.full(x.shape, True, dtype=bool)
    while unfixed.any():
        x_new = f(x)                  # iteration step
        x_fix[unfixed] = x_new        # copy the values
        cond = np.abs(x_new - x) > 1
        unfixed[unfixed] = cond       # find out which ones are fixed
        x = x_new[cond]               # keep the x values still to be computed
    return x_fix
Edit:
Here I review the 3 solutions proposed. I will call the different functions after their proposers: find_fix_Jonas, find_fix_Jurg, find_fix_BM. I changed the fixpoint condition in all functions according to B.M. (see the updated function of Jonas at the end).
Speed:
%timeit find_fix_BM(f, np.linspace(0,100000,10000),1)
100 loops, best of 3: 2.31 ms per loop
%timeit find_fix_Jonas(f, np.linspace(0,100000,10000))
1000 loops, best of 3: 1.52 ms per loop
%timeit find_fix_Jurg(f, np.linspace(0,100000,10000))
1000 loops, best of 3: 1.28 ms per loop
In terms of readability I think the version of Jonas is the easiest one to understand, so it should be chosen when speed does not matter very much.
Jonas's version, however, might raise a RuntimeError when the number of iterations until the fixpoint is reached is large (>1000). The other two solutions do not have this drawback.
The version of B.M., however, might be easier to understand than the version proposed by me.
Version of Jonas used:
def find_fix_Jonas(f, x):
    fx = f(x)
    ind = np.abs(fx - x) > 1
    if ind.any():
        fx[ind] = find_fix_Jonas(f, fx[ind])
    return fx
...remove `x` entries that have already been found?
Create a new array using boolean indexing with your condition.
>>> a = np.array([3,1,6,3,9])
>>> a != 3
array([False, True, True, False, True], dtype=bool)
>>> b = a[a != 3]
>>> b
array([1, 6, 9])
I need to check if a given float is close, within a given tolerance, to any float in an array of floats.
import numpy as np
# My float
a = 0.27
# The tolerance
t = 0.01
# Array of floats
arr_f = np.arange(0.05, 0.75, 0.008)
Is there a simple way to do this? Something like if a in arr_f: but allowing for some tolerance in the difference?
Edit
By "allow tolerance" I mean it in the following sense:
for i in arr_f:
    if abs(a - i) <= t:
        print 'float a is in arr_f within tolerance t'
        break
How about using np.isclose?
>>> np.isclose(arr_f, a, atol=0.01).any()
True
np.isclose compares two objects element-wise to see if the values are within a given tolerance (here specified by the keyword argument atol which is the absolute difference between two elements). The function returns a boolean array.
If you're only interested in a True/False result, then this should work:
In [1]: (abs(arr_f - a) < t).any()
Out[1]: True
Explanation: abs(arr_f - a) < t returns a boolean array on which any() is invoked in order to find out whether any of its values is True.
EDIT - Comparing this approach and the one suggested in the other answer reveals that this one is slightly faster:
In [37]: arr_f = np.arange(0.05, 0.75, 0.008)
In [38]: timeit (abs(arr_f - a) < t).any()
100000 loops, best of 3: 11.5 µs per loop
In [39]: timeit np.isclose(arr_f, a, atol=t).any()
10000 loops, best of 3: 44.7 µs per loop
In [40]: arr_f = np.arange(0.05, 1000000, 0.008)
In [41]: timeit (abs(arr_f - a) < t).any()
1 loops, best of 3: 646 ms per loop
In [42]: timeit np.isclose(arr_f, a, atol=t).any()
1 loops, best of 3: 802 ms per loop
An alternative solution that also returns the relevant indices is as follows:
In [5]: np.where(abs(arr_f - a) < t)[0]
Out[5]: array([27, 28])
This means that the values residing in indices 27 and 28 of arr_f are within the desired range, and indeed:
In [9]: arr_f[27]
Out[9]: 0.26600000000000001
In [10]: arr_f[28]
Out[10]: 0.27400000000000002
Using this approach can also generate a True/False result:
In [11]: np.where(abs(arr_f - a) < t)[0].any()
Out[11]: True
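If you also want the single nearest element itself, a small addition of my own using argmin:

idx = np.abs(arr_f - a).argmin()  # index of the element nearest to a
if abs(arr_f[idx] - a) < t:
    print 'nearest element %g is within tolerance' % arr_f[idx]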
If sliceArray is sorted, you can locate the first element greater than aimFloat directly:

[temp] = np.where((sliceArray - aimFloat) > 0)

temp[0] is then the index of the answer. Note that this relies on sliceArray being sorted.
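Since sliceArray is sorted, the same index can also be found by binary search with np.searchsorted; this is a sketch of my own, reusing the arrays from the question:

import numpy as np

sliceArray = np.arange(0.05, 0.75, 0.008)  # must be sorted
aimFloat = 0.27

temp = np.searchsorted(sliceArray, aimFloat, side='right')
# temp is the first index with sliceArray[temp] > aimFloat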
In Python, I have an array X with N rows (the number of examples) and n columns (the number of features).
If I want to calculate the second order moment matrix C
C[i,j] = E(x_i x_j)
then I have two possibilities:
First, do the loop:
for i in range(N):
    for j in range(n):
        for k in range(n):
            C[j,k] = C[j,k] + X[i,j]*X[i,k]/N
Second, and simpler, use the numpy matrix product:

import numpy as np
C = np.transpose(X).dot(X)/N
In practice, the second version is dramatically faster.
If now I want to calculate the third order moment matrix T
T[i,j,k] = E(x_i x_j x_k)
then the loop alternative is easy:
for i in range(N):
    for j in range(n):
        for k in range(n):
            for m in range(n):
                T[j,k,m] = T[j,k,m] + X[i,j]*X[i,k]*X[i,m]/N
Is there a fast way using numpy libraries to calculate this last tensor, like for the second order moment?
You can use NumPy's einsum notation to solve both your second and third order cases.
Second order:
np.einsum('ij,ik->jk',X,X)/N
Third order:
np.einsum('ij,ik,il->jkl',X,X,X)/N
As can be seen, it is easy and intuitive to extend this to higher-order cases.
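A quick self-contained check (the array sizes are my own choice):

import numpy as np

N, n = 1000, 5
X = np.random.rand(N, n)

C = np.einsum('ij,ik->jk', X, X) / N         # shape (n, n)
T = np.einsum('ij,ik,il->jkl', X, X, X) / N  # shape (n, n, n)

# sanity check against the matrix-product version of the second moment
assert np.allclose(C, X.T.dot(X) / N)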
I know it is not perfect in terms of speed, but why not use np.power(x, 3).sum() / N? It is slower than the dot product, but faster than looping.
In [1]: import numpy as np
In [2]: x = np.random.rand(10000)
In [3]: x.dot(x.T)
Out[3]: 3373.6189738897856
In [4]: %timeit(x.dot(x.T))
The slowest run took 48.74 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 4.39 µs per loop
In [5]: %timeit(np.power(x, 2).sum())
The slowest run took 4.14 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 140 µs per loop
In [6]: np.power(x, 2).sum()
Out[6]: 3373.6189738897865
Btw, that's how I calculate the moments...
I have two 1-dimensional numpy vectors va and vb which are being used to populate a matrix by passing all pair combinations to a function.
na = len(va)
nb = len(vb)
D = np.zeros((na, nb))
for i in range(na):
    for j in range(nb):
        D[i, j] = foo(va[i], vb[j])
As it stands, this piece of code takes a very long time to run because va and vb are relatively large (4626 and 737 elements). However, I am hoping this can be improved, given that a similar procedure performs very well using the cdist method from scipy:
D = cdist(va, vb, metric)
I am obviously aware that scipy has the benefit of running this piece of code in C rather than in Python, but I'm hoping there is some numpy function I'm unaware of that can execute this quickly.
One of the least known numpy functions, among what the docs call functional programming routines, is np.frompyfunc. This creates a numpy ufunc from a Python function: not some other object that closely simulates a numpy ufunc, but a proper ufunc with all its bells and whistles. While the behavior is in many aspects very similar to np.vectorize, it has some distinct advantages that the following code hopefully highlights:
In [2]: def f(a, b):
   ...:     return a + b
   ...:
In [3]: f_vec = np.vectorize(f)
In [4]: f_ufunc = np.frompyfunc(f, 2, 1) # 2 inputs, 1 output
In [5]: a = np.random.rand(1000)
In [6]: b = np.random.rand(2000)
In [7]: %timeit np.add.outer(a, b) # a baseline for comparison
100 loops, best of 3: 9.89 ms per loop
In [8]: %timeit f_vec(a[:, None], b) # 50x slower than np.add
1 loops, best of 3: 488 ms per loop
In [9]: %timeit f_ufunc(a[:, None], b) # ~20% faster than np.vectorize...
1 loops, best of 3: 425 ms per loop
In [10]: %timeit f_ufunc.outer(a, b) # ...and you get to use ufunc methods
1 loops, best of 3: 427 ms per loop
So while it is still clearly inferior to a properly vectorized implementation, it is a little faster (the looping is in C, but you still have the Python function call overhead).
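One caveat worth adding (my note, not from the answer above): ufuncs built with np.frompyfunc return object arrays, so cast back when you need a numeric dtype:

out = f_ufunc.outer(a, b)
print out.dtype          # object -- frompyfunc always returns object arrays
out = out.astype(float)  # cast back for further numeric work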
cdist is fast because it is written in highly-optimized C code (as you already pointed out), and it only supports a small predefined set of metrics.
Since you want to apply the operation generically, to any given foo function, you have no choice but to call that function na × nb times. That part is not likely to be further optimizable.
What's left to optimize are the loops and the indexing. Some suggestions to try out:
Use xrange instead of range (in Python 2.x; in Python 3, range is already lazy)
Use enumerate instead of range plus explicit indexing
Use a Python speed "magic", such as cython or numba, to speed up the looping process (a Numba sketch follows below)
If you can make further assumptions about foo, it might be possible to speed it up further.
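As promised above, a hedged Numba sketch of the generic double loop; the squared difference stands in for foo, which must itself be compilable by Numba:

import numba
import numpy as np

@numba.jit(nopython=True)
def pairwise(va, vb):
    # the double loop is compiled to machine code
    D = np.empty((va.size, vb.size))
    for i in range(va.size):
        for j in range(vb.size):
            D[i, j] = (va[i] - vb[j]) ** 2  # stand-in for foo
    return D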
Like @shx2 said, it all depends on what foo is. If you can express it in terms of numpy ufuncs, then use the outer method:
In [11]: N = 400
In [12]: B = np.empty((N, N))
In [13]: x = np.random.random(N)
In [14]: y = np.random.random(N)
In [15]: %%timeit
   ....: for i in range(N):
   ....:     for j in range(N):
   ....:         B[i, j] = x[i] - y[j]
   ....:
10 loops, best of 3: 87.2 ms per loop
In [16]: %timeit A = np.subtract.outer(x, y) # <--- np.subtract is a ufunc
1000 loops, best of 3: 294 µs per loop
Otherwise you can push the looping down to cython level. Continuing a trivial example above:
In [45]: %%cython
   ....: cimport cython
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: def foo(double[::1] x, double[::1] y, double[:, ::1] out):
   ....:     cdef int i, j
   ....:     for i in xrange(x.shape[0]):
   ....:         for j in xrange(y.shape[0]):
   ....:             out[i, j] = x[i] - y[j]
   ....:
In [46]: foo(x, y, B)
In [47]: np.allclose(B, np.subtract.outer(x, y))
Out[47]: True
In [48]: %timeit foo(x, y, B)
10000 loops, best of 3: 149 µs per loop
The cython example is deliberately made overly simplistic: in reality you might want to add some shape/stride checks, allocate the memory within your function etc.
Numpy allows matrices of different sizes to be added/multiplied/divided provided certain broadcasting rules are followed. Also, creation of temporary arrays is a major speed impediment to numpy.
The following timeit results surprise me... what is going on?
In [41]: def f_no_dot(mat, arr):
   ....:     return mat/arr
In [42]: def f_dot(mat, arr):
   ....:     denominator = scipy.dot(arr, scipy.ones((1,2)))
   ....:     return mat/denominator
In [43]: mat = scipy.rand(360000,2)
In [44]: arr = scipy.rand(360000,1)
In [45]: timeit temp = f_no_dot(mat,arr)
10 loops, best of 3: 44.7 ms per loop
In [46]: timeit temp = f_dot(mat,arr)
100 loops, best of 3: 10.1 ms per loop
I thought that f_dot would be slower since it had to create the temporary array denominator, and I assumed that this step was skipped by f_no_dot. I should note that these times scale linearly (with array size, up to length 1 billion) for f_no_dot, and slightly worse than linear for f_dot.
I thought that f_dot would be slower since it had to create the temporary array denominator, and I assumed that this step was skipped by f_no_dot.
For what it's worth, creating the temporary array is skipped, which is why f_no_dot is slower (but uses less memory).
Element-wise operations on arrays of the same size are faster, because numpy doesn't have to worry about the striding (dimensions, size, etc) of the arrays.
Operations that use broadcasting will generally be a bit slower than operations that don't have to.
If you have the memory to spare, creating a temporary copy can give you a speedup, but will use more memory.
For example, comparing these three functions:
import numpy as np
import timeit

def f_no_dot(x, y):
    return x / y

def f_dot(x, y):
    denom = np.dot(y, np.ones((1,2)))
    return x / denom

def f_in_place(x, y):
    x /= y
    return x

num = 3600000
x = np.ones((num, 2))
y = np.ones((num, 1))

for func in ['f_dot', 'f_no_dot', 'f_in_place']:
    t = timeit.timeit('%s(x,y)' % func, number=100,
                      setup='from __main__ import x,y,f_dot,f_no_dot,f_in_place')
    print func, 'time...'
    print t / 100.0
This yields similar timings to your results:
f_dot time...
0.184361531734
f_no_dot time...
0.619203259945
f_in_place time...
0.585789341927
However, if we compare the memory usage, things become a bit clearer...
The combined size of our x and y arrays is about 55 + 27.5 MB, or 82 MB (for 64-bit floats). There's an additional ~11 MB of overhead from import numpy, etc.
Returning x / y as a new array (i.e. not doing x /= y) will require another 55 MB array.
100 runs of f_dot:
We're creating a temporary array here, so we'd expect to see 11 + 82 + 55 + 55 MB or ~203 MB of memory usage. And, that's what we see...
100 runs of f_no_dot:
If no temporary array is created, we'd expect a peak memory usage of 11 + 82 + 55 MB, or 148 MB...
...which is exactly what we see.
So, x / y is not creating an additional num x 2 temporary array to do the division.
Thus, the division takes quite a bit longer than it would if it were operating on two arrays of the same size.
100 runs of f_in_place:
If we can modify x in-place, we can save even more memory, if that's the main concern.
Basically, numpy tries to conserve memory at the expense of speed, in some cases.
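If speed matters more than memory, you can make the trade explicit with np.broadcast_to (available in NumPy >= 1.10; this sketch is my own, not from the answer above):

import numpy as np

mat = np.random.rand(360000, 2)
arr = np.random.rand(360000, 1)

# materialize the broadcast operand: more memory, but the division
# then runs on two same-shape contiguous arrays
denom = np.ascontiguousarray(np.broadcast_to(arr, mat.shape))
out = mat / denom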
What you are seeing is most likely iteration overhead over the small (2,) dimension. Numpy (versions < 1.6) dealt efficiently only with operations involving contiguous arrays (of the same shape). The effect goes away as the size of the last dimension increases.
To see the effect of contiguity:
In [1]: import numpy
In [2]: numpy.__version__
Out[2]: '1.5.1'
In [3]: arr_cont1 = numpy.random.rand(360000, 2)
In [4]: arr_cont2 = numpy.random.rand(360000, 2)
In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]
In [6]: arr_bcast = numpy.random.rand(360000, 1)
In [7]: %timeit arr_cont1 / arr_cont2
100 loops, best of 3: 5.75 ms per loop
In [8]: %timeit arr_noncont / arr_cont2
10 loops, best of 3: 54.4 ms per loop
In [9]: %timeit arr_bcast / arr_cont2
10 loops, best of 3: 55.2 ms per loop
The situation is however much improved in Numpy >= 1.6.0:
In [1]: import numpy
In [2]: numpy.__version__
Out[2]: '1.6.1'
In [3]: arr_cont1 = numpy.random.rand(360000, 2)
In [4]: arr_cont2 = numpy.random.rand(360000, 2)
In [5]: arr_noncont = numpy.random.rand(360000, 4)[:,::2]
In [6]: arr_bcast = numpy.random.rand(360000, 1)
In [7]: %timeit arr_cont1 / arr_cont2
100 loops, best of 3: 5.37 ms per loop
In [8]: %timeit arr_noncont / arr_cont2
100 loops, best of 3: 6.12 ms per loop
In [9]: %timeit arr_bcast / arr_cont2
100 loops, best of 3: 7.81 ms per loop
(All the timings above are probably only accurate to 1 ms.)
Note also that temporaries are not that expensive:
In [82]: %timeit arr_cont1.copy()
1000 loops, best of 3: 778 µs per loop
EDIT: Note also that arr_noncont above is sort of contiguous, with a stride of 2*itemsize, so the inner loop can be unraveled and Numpy can make it about as fast as a contiguous array. With broadcasting (or with a really noncontiguous array such as numpy.random.rand(360000*2, 2)[::2,:]), the inner loop cannot be unraveled, and these cases are slightly slower. Improving that would still be possible if Numpy emitted tailored machine code on the fly for each loop, but it doesn't do that (at least yet :)