Vectorized sum-reduction with outer product - NumPy


I'm relatively new to NumPy and often read that you should avoid writing loops. In many cases I understand how to deal with that, but at the moment I have the following code:
p = np.arange(15).reshape(5,3)
w = np.random.rand(5)
A = sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))  # built-in sum; np.sum on a generator is deprecated
Does anybody know if there is a way to avoid the inner for loop?
Thanks in advance!

Approach #1 : With np.einsum -
np.einsum('ij,ik,i->jk',p,p,w)
Approach #2 : With broadcasting + np.tensordot -
np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
Approach #3 : With np.einsum + np.dot -
np.einsum('ij,i->ji',p,w).dot(p)
Runtime test
Set #1 :
In [653]: p = np.random.rand(50,30)
In [654]: w = np.random.rand(50)
In [655]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10000 loops, best of 3: 101 µs per loop
In [656]: %timeit np.tensordot(p[...,None]*p[:,None], w, axes=((0),(0)))
10000 loops, best of 3: 124 µs per loop
In [657]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
100000 loops, best of 3: 9.07 µs per loop
Set #2 :
In [658]: p = np.random.rand(500,300)
In [659]: w = np.random.rand(500)
In [660]: %timeit np.einsum('ij,ik,i->jk',p,p,w)
10 loops, best of 3: 139 ms per loop
In [661]: %timeit np.einsum('ij,i->ji',p,w).dot(p)
1000 loops, best of 3: 1.01 ms per loop
The third approach blew everything else away!
Why is Approach #3 10x-130x faster than Approach #1?
np.einsum is implemented in C. In the first approach, with the three subscripts i, j, k in its string notation, we get three nested loops (in C, of course), and none of that work is handed to BLAS.
With the third approach, the einsum call only involves the two subscripts i, j, hence two nested loops (in C again), and the remaining sum-reduction is done by np.dot, which leverages BLAS-based matrix multiplication. These two factors are responsible for the large speedup.
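As a quick sanity check, the sketch below confirms that the three approaches give the same result as the original loop. The array sizes and the variable names A_loop, A1, A2 and A3 are chosen here for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random((50, 30))
w = rng.random(50)

# Reference: the explicit loop from the question (built-in sum over 2D arrays)
A_loop = sum(w[i] * np.outer(p[i], p[i]) for i in range(len(p)))

# Approach #1: a single einsum call over all three operands
A1 = np.einsum('ij,ik,i->jk', p, p, w)

# Approach #2: broadcasted outer products, then contract the first axis with w
A2 = np.tensordot(p[..., None] * p[:, None], w, axes=((0,), (0,)))

# Approach #3: scale the rows of p by w (transposed), then one matrix product
A3 = np.einsum('ij,i->ji', p, w).dot(p)

print(np.allclose(A_loop, A1), np.allclose(A_loop, A2), np.allclose(A_loop, A3))
```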

Related

Faster implementation for ReLu derivative in python?

I have implemented ReLu derivative as:
def relu_derivative(x):
    return (x>0)*np.ones(x.shape)
I also tried:
def relu_derivative(x):
    x[x>=0]=1
    x[x<0]=0
    return x
The size of X is (3072, 10000).
But it takes a long time to compute. Is there a more optimized solution?

Approach #1 : Using numexpr
When working with large data, we can use the numexpr module, which supports multi-core processing if the intended operations can be expressed as arithmetic ones. Here, one way would be -
(X>=0)+0
Thus, to solve our case, it would be -
import numexpr as ne
ne.evaluate('(X>=0)+0')
Approach #2 : Using NumPy views
Another trick would be to use views by viewing the mask of comparisons as an int array, like so -
(X>=0).view('i1')
On performance, it should be identical to creating X>=0.
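A minimal sketch of why the view is cheap: a boolean array stores one byte per element, so reinterpreting it as int8 reuses the same buffer instead of copying. The small array X below is made up for illustration.

```python
import numpy as np

X = np.array([[-1.5, 0.0, 2.0],
              [3.0, -0.1, 0.5]])

mask = X >= 0            # boolean array, one byte per element
d = mask.view('i1')      # same buffer reinterpreted as int8, no copy

print(d)
print(np.shares_memory(mask, d))
```

Note that the result has dtype int8; if the rest of the computation works with floats, a later multiplication with a float array will upcast it.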
Timings
Comparing all posted solutions on a random array -
In [14]: np.random.seed(0)
    ...: X = np.random.randn(3072,10000)
In [15]: # OP's soln-1
    ...: def relu_derivative_v1(x):
    ...:     return (x>0)*np.ones(x.shape)
    ...:
    ...: # OP's soln-2
    ...: def relu_derivative_v2(x):
    ...:     x[x>=0]=1
    ...:     x[x<0]=0
    ...:     return x
In [16]: %timeit ne.evaluate('(X>=0)+0')
10 loops, best of 3: 27.8 ms per loop
In [17]: %timeit (X>=0).view('i1')
100 loops, best of 3: 19.3 ms per loop
In [18]: %timeit relu_derivative_v1(X)
1 loop, best of 3: 269 ms per loop
In [19]: %timeit relu_derivative_v2(X)
1 loop, best of 3: 89.5 ms per loop
The numexpr-based one used 8 threads. Thus, with more threads available for compute, it should improve further. Related post on how to control multi-core functionality.
Approach #3 : Approach #1 + #2 -
Mix both of those for the fastest one on large arrays -
In [27]: np.random.seed(0)
    ...: X = np.random.randn(3072,10000)
In [28]: %timeit ne.evaluate('X>=0').view('i1')
100 loops, best of 3: 14.7 ms per loop

Numpy array multiplication of LDL^T factorization of symmetric matrix

Suppose I have an "LDL^T" decomposition of a symmetric, positive-semidefinite matrix A (numpy array), and I would like to multiply all factors together to obtain A.
What is the most efficient way to achieve this?
Currently, I am doing (D is available as "vector"):
np.dot(np.dot(L, np.diag(D)), L.T),
which is quite obviously a bad solution.

Approach #1
We could use elementwise multiplication and then matrix-multiplication. This basically replaces np.dot(L, np.diag(D)) with a direct element-wise multiplication for hopefully some speedup. So, with it, the implementation would become -
(L*D).dot(L.T)
Approach #2
Another approach could be to use np.einsum to do all of that in one go, like so -
np.einsum('ij,j,kj->ik',L,D,L)
Runtime test
In [303]: L = np.random.randint(0,9,(1000,1000))
In [304]: D = np.random.randint(0,9,(1000))
In [305]: %timeit np.dot(np.dot(L, np.diag(D)), L.T)
1 loops, best of 3: 3.87 s per loop
In [306]: %timeit (L*D).dot(L.T)
1 loops, best of 3: 1.39 s per loop
In [307]: %timeit np.einsum('ij,j,kj->ik',L,D,L)
1 loops, best of 3: 1.71 s per loop
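The equivalence of the three expressions can be checked on a small example. This is only a sketch; the float inputs and the size n are assumptions made here (the timings above used integer arrays).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
L = np.tril(rng.random((n, n)))   # lower-triangular factor
D = rng.random(n)                 # diagonal of D stored as a vector

A_ref = np.dot(np.dot(L, np.diag(D)), L.T)   # original expression
A1 = (L * D).dot(L.T)                        # D scales the columns of L via broadcasting
A2 = np.einsum('ij,j,kj->ik', L, D, L)       # everything in one einsum call

print(np.allclose(A_ref, A1), np.allclose(A_ref, A2))
```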

Why are for loops quicker than numpy for 2D array multiplication

Consider the following two functions, which essentially multiply every number in a small sequence with every number in a larger sequence to build up a 2D array, and then double all the values in the array. noloop() uses direct multiplication of 2D numpy arrays and returns the result, whereas loop() uses a for loop to iterate over arr1 and gradually build up an output array.
import numpy as np
arr1 = np.random.rand(100, 1)
arr2 = np.random.rand(1, 100000)
def noloop():
    return (arr1*arr2)*2
def loop():
    out = np.empty((arr1.size, arr2.size))
    for i in range(arr1.size):
        tmp = (arr1[i]*arr2)*2
        out[i] = tmp.reshape(tmp.size)
    return out
I expected noloop to be much faster even for a small number of iterations, but for the array sizes above, loop is actually faster:
>>> %timeit noloop()
10 loops, best of 3: 64.7 ms per loop
>>> %timeit loop()
10 loops, best of 3: 41.6 ms per loop
And interestingly, if I remove *2 in both functions, noloop is faster, but only slightly:
>>> %timeit noloop()
10 loops, best of 3: 29.4 ms per loop
>>> %timeit loop()
10 loops, best of 3: 34.4 ms per loop
Is there a good explanation for these results, and is there a notably faster way to perform the same task?

I wasn't able to reproduce your results, but I did find that I could get substantial speed up (factor of 2) using numpy.multiply. By using the out argument you can take advantage of the fact that the memory is already allocated and eliminate the copying of tmp to out.
def out_loop():
    out = np.empty((arr1.size, arr2.size))
    for i in range(arr1.size):
        np.multiply(arr1[i], arr2, out=out[i].reshape((1, arr2.size)))
        out[i] *= 2
    return out
Results on my machine:
In [32]: %timeit out_loop()
100 loops, best of 3: 17.7 ms per loop
In [33]: %timeit loop()
10 loops, best of 3: 28.3 ms per loop
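The same out idea also works without the Python loop. The sketch below is not from the answer above; it writes the broadcasted product straight into a preallocated array and doubles it in place, so no temporary of the full output size is created. The function name out_noloop is hypothetical, and whether it is faster will depend on the machine.

```python
import numpy as np

arr1 = np.random.rand(100, 1)
arr2 = np.random.rand(1, 100000)

def out_noloop():
    out = np.empty((arr1.size, arr2.size))
    np.multiply(arr1, arr2, out=out)  # broadcast product written directly into out
    out *= 2                          # in-place doubling, no second full-size array
    return out

result = out_noloop()
print(result.shape)
```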

Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

Edit to add: I don't think the numba benchmarks are fair, notes below
I'm trying to benchmark different approaches to numerically processing data for the following use case:
Fairly big dataset (100,000+ records)
100+ lines of fairly simple code (z = x + y)
Don't need to sort or index
In other words, the full generality of series and dataframes is not needed, although they are included here b/c they are still convenient ways to encapsulate the data and there is often pre- or post-processing that does require the generality of pandas over numpy arrays.
Question: Based on this use case, are the following benchmarks appropriate and if not, how can I improve them?
import numpy as np
import pandas as pd
from pandas import Series, DataFrame
from numba import jit
nobs = 10000
nlines = 100
def proc_df():
    df = DataFrame({ 'x': np.random.randn(nobs),
                     'y': np.random.randn(nobs) })
    for i in range(nlines):
        df['z'] = df.x + df.y
    return df.z
def proc_ser():
    x = Series(np.random.randn(nobs))
    y = Series(np.random.randn(nobs))
    for i in range(nlines):
        z = x + y
    return z
def proc_arr():
    x = np.random.randn(nobs)
    y = np.random.randn(nobs)
    for i in range(nlines):
        z = x + y
    return z
@jit
def proc_numba():
    xx = np.random.randn(nobs)
    yy = np.random.randn(nobs)
    zz = np.zeros(nobs)
    for j in range(nobs):
        x, y = xx[j], yy[j]
        for i in range(nlines):
            z = x + y
        zz[j] = z
    return zz
Results (Win 7, 3 year old Xeon workstation (quad-core). Standard and recent anaconda distribution or very close.)
In [1251]: %timeit proc_df()
10 loops, best of 3: 46.6 ms per loop
In [1252]: %timeit proc_ser()
100 loops, best of 3: 15.8 ms per loop
In [1253]: %timeit proc_arr()
100 loops, best of 3: 2.02 ms per loop
In [1254]: %timeit proc_numba()
1000 loops, best of 3: 1.04 ms per loop # may not be valid result (see note below)
Edit to add (response to jeff) alternate results from passing df/series/array into functions rather than creating them inside of functions (i.e. move the code lines containing 'randn' from inside function to outside function):
10 loops, best of 3: 45.1 ms per loop
100 loops, best of 3: 15.1 ms per loop
1000 loops, best of 3: 1.07 ms per loop
100000 loops, best of 3: 17.9 µs per loop # may not be valid result (see note below)
Note on numba results: I think the numba compiler must be optimizing the for loop and reducing it to a single iteration. I don't know that for certain, but it's the only explanation I can come up with, as it couldn't be 50x faster than numpy, right? Followup question here: Why is numba faster than numpy here?

Well, you are not really timing the same things here (or rather, you are timing different aspects).
E.g.
In [6]: x = Series(np.random.randn(nobs))
In [7]: y = Series(np.random.randn(nobs))
In [8]: %timeit x + y
10000 loops, best of 3: 131 µs per loop
In [9]: %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop
So [8] times the actual operation, while [9] includes the overhead of the series creation (and the random number generation) PLUS the actual operation.
Another example is proc_ser vs proc_df. The proc_df includes the overhead of assigning a particular column in the DataFrame (which is actually different for an initial creation and a subsequent reassignment).
So create the structure (you can time that too, but that is a separate issue). Perform the exact same operation and time them.
Further, you say that you don't need alignment. Pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check if they are already aligned), while in numba you need to 'manually' align them.
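To illustrate the point about timing only the operation, here is a minimal sketch using the standard timeit module and plain NumPy arrays. The helper names create_and_add and add_only are hypothetical, and the absolute numbers will differ between machines.

```python
import timeit
import numpy as np

nobs = 10000

def create_and_add():
    # Measures random number generation and allocation plus the addition
    x = np.random.randn(nobs)
    y = np.random.randn(nobs)
    return x + y

# Create the inputs once, outside the timed code
x = np.random.randn(nobs)
y = np.random.randn(nobs)

def add_only():
    # Measures only the addition
    return x + y

t_all = timeit.timeit(create_and_add, number=200)
t_add = timeit.timeit(add_only, number=200)
print(f"create + add: {t_all / 200 * 1e6:.1f} µs per call")
print(f"add only:     {t_add / 200 * 1e6:.1f} µs per call")
```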

Following up on @Jeff's answer: the code can be further optimized.
nobs = 10000
x = pd.Series(np.random.randn(nobs))
y = pd.Series(np.random.randn(nobs))
%timeit proc_ser()
%timeit x + y
%timeit x.values + y.values
100 loops, best of 3: 11.8 ms per loop
10000 loops, best of 3: 107 µs per loop
100000 loops, best of 3: 12.3 µs per loop

Fast check for NaN in NumPy

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)

Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
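Both checks rely on NaN propagating through the reduction, so the result is a single scalar and no boolean array of shape x.shape is built. A small sketch of that behaviour follows. The last lines show a caveat that is not discussed above: with sum, infinities of opposite sign also produce NaN, so the sum check can report a NaN for an array that contains none.

```python
import numpy as np

x = np.random.rand(100000)
clean_sum, clean_min = np.isnan(np.sum(x)), np.isnan(np.min(x))
print(clean_sum, clean_min)     # no NaN present

x[50000] = np.nan
nan_sum, nan_min = np.isnan(np.sum(x)), np.isnan(np.min(x))
print(nan_sum, nan_min)         # NaN propagates to the scalar result

# Caveat: inf + (-inf) is NaN, although the array itself has no NaN
y = np.array([np.inf, -np.inf, 1.0])
with np.errstate(invalid='ignore'):
    inf_sum_flag = np.isnan(np.sum(y))
print(inf_sum_flag, np.isnan(y).any())
```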

I think np.isnan(np.min(X)) should do what you want.

There are two general approaches here:
Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.
import numpy as np
import perfplot
def min(a):
    return np.isnan(np.min(a))
def sum(a):
    return np.isnan(np.sum(a))
def dot(a):
    return np.isnan(np.dot(a, a))
def any(a):
    return np.any(np.isnan(a))
def einsum(a):
    return np.isnan(np.einsum("i->", a))
b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()

Even though there is an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the most efficient way might depend heavily on the operating system. In any case, the dot(.)-based one seems to be the most stable.

If you're comfortable with numba, it allows you to create a fast short-circuit function (one that stops as soon as a NaN is found):
import numba as nb
import math
@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min; I think that's because np.min is heavily optimized for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results may be achieved with Cython or a C extension. These are a bit more complicated (or easily available as bottleneck.anynan) but ultimately do the same as my anynan function.
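If neither numba nor bottleneck is an option, a middle ground is to scan the array in blocks with plain NumPy. This is a different technique from the ones above, sketched here under the assumption that a temporary boolean array of one block is acceptable. The function name has_nan_chunked and the default chunk size are made up for illustration.

```python
import numpy as np

def has_nan_chunked(a, chunk=1 << 16):
    """Return True if `a` contains a NaN, checking `chunk` elements at a time."""
    flat = a.reshape(-1)  # a view for contiguous input, a copy otherwise
    for start in range(0, flat.size, chunk):
        # Only a boolean array of at most `chunk` elements is created here
        if np.isnan(flat[start:start + chunk]).any():
            return True   # stop early, later blocks are never touched
    return False

array = np.random.random(2000000)
print(has_nan_chunked(array))
array[100] = np.nan
print(has_nan_chunked(array))
```

A larger chunk means fewer Python-level iterations but a bigger temporary and a later early exit, so the value is worth tuning for the array sizes you expect.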

Use .any():
if numpy.isnan(myarray).any()
numpy.isfinite may be better than isnan for checking:
if not np.isfinite(prop).all()

Related to this is the question of how to find the first occurrence of NaN. This is the fastest way I know of to handle that:
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
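A short usage sketch of the expression above, with a made-up sample list. It relies on NaN being the only float value that is not equal to itself, and it returns None when no NaN is found.

```python
iterable = [1.0, 2.5, float('nan'), 4.0, float('nan')]

# n != n is True only for NaN, so next() stops at the first NaN
index = next((i for (i, n) in enumerate(iterable) if n != n), None)
print(index)  # 2

no_nan = next((i for (i, n) in enumerate([1.0, 2.0]) if n != n), None)
print(no_nan)  # None
```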

Adding to @nico-schlömer's and @mseifert's answers, I measured the performance of a numba function has_nan with early stopping, compared to some of the functions that parse the full array.
On my machine, for an array without nans, the break-even happens for ~10^4 elements.
import perfplot
import numpy as np
import numba
import math
def min(a):
    return np.isnan(np.min(a))
def dot(a):
    return np.isnan(np.dot(a, a))
def einsum(a):
    return np.isnan(np.einsum("i->", a))
@numba.njit
def has_nan(a):
    for i in range(a.size):  # check every element, including the last one
        if math.isnan(a[i]):
            return True
    return False
def array_with_missing_values(n, p):
    """ Return array of size n, p : nans ( % of array length )
    Ex : n=1e6, p=1 : 1e4 nan assigned at random positions """
    a = np.random.rand(n)
    p = np.random.randint(0, len(a), int(p*len(a)/100))
    a[p] = np.nan
    return a
#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
What happens if the array has nans? I investigated the impact of the nan coverage of the array.
For arrays of length 1,000,000, has_nan becomes a better option if there are ~10^-3 % nans (so ~10 nans) in the array.
#%%
N = 1000000 # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01,
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)
If in your application most arrays have nans and you're looking for the ones without, then has_nan is the best approach.
Otherwise, dot seems to be the best option.
