How does numpy broadcasting perform faster?

In the following question,
https://stackoverflow.com/a/40056135/5714445
Numpy's broadcasting provides a solution that's almost 6x faster than using np.setdiff1d() paired with np.view(). How does it manage to do this?
Using A[~((A[:,None,:] == B).all(-1)).any(1)] speeds it up even more. Interesting, but that raises yet another question: how does this perform even better?

I'll try to answer the second part of the question.
So, we are comparing:
A[np.all(np.any((A-B[:, None]), axis=2), axis=0)] (I)
and
A[~((A[:,None,:] == B).all(-1)).any(1)]
To compare it against the first one on equal terms, we can rewrite the second approach like this -
A[(((~(A[:,None,:] == B)).any(2))).all(1)] (II)
The major difference when considering performance is that with the first approach we are getting non-matches with subtraction and then checking for non-zeros with .any(). Thus, .any() is made to operate on an array of non-boolean dtype. In the second approach, we instead feed it a boolean array, obtained with A[:,None,:] == B.
Let's do a small runtime test to see how .any() performs on int dtype vs boolean array -
In [141]: A = np.random.randint(0,9,(1000,1000)) # An int array
In [142]: %timeit A.any(0)
1000 loops, best of 3: 1.43 ms per loop
In [143]: A = np.random.randint(0,9,(1000,1000))>5 # A boolean array
In [144]: %timeit A.any(0)
10000 loops, best of 3: 164 µs per loop
So, with close to a 9x speedup on this part, we see a huge advantage to using .any() with boolean arrays. This, I think, is the biggest reason the second approach is faster.
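To make the comparison concrete, here is a small self-contained check (with random stand-in data) that (I) and (II) really do select the same rows:
import numpy as np

A = np.random.randint(0, 9, (50, 3))
B = np.random.randint(0, 9, (10, 3))

out1 = A[np.all(np.any(A - B[:, None], axis=2), axis=0)]  # (I): subtract, then any() on an int array
out2 = A[(~(A[:, None, :] == B)).any(2).all(1)]           # (II): compare, then any() on a boolean array
assert np.array_equal(out1, out2)  # both keep the rows of A that do not appear in B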

Related

Vectorisation of coordinate distances in Numpy

I'm trying to understand Numpy by applying vectorisation, and to find the fastest way to compute all pairwise distances between a set of coordinates.
def get_distances3(coordinates):
    return np.linalg.norm(
        coordinates[:, None, :] - coordinates[None, :, :],
        axis=-1)
coordinates = np.random.rand(1000, 3)
%timeit get_distances3(coordinates)
The function above took 10 loops, best of 3: 35.4 ms per loop. The numpy library also has an np.vectorize option:
def get_distances4(coordinates):
    return np.vectorize(coordinates[:, None, :] - coordinates[None, :, :], axis=-1)
%timeit get_distances4(coordinates)
I tried np.vectorize as shown above, yet ended up with the following error.
TypeError: __init__() got an unexpected keyword argument 'axis'
How can I achieve vectorization in get_distances4? How should I edit the last code block to avoid the error? I have never used np.vectorize, so I might be missing something.
You're not calling np.vectorize() correctly. I suggest referring to the documentation.
Vectorize takes as its argument a function that is written to operate on scalar values, and converts it into a function that can be vectorized over values in arrays according to the Numpy broadcasting rules. It's basically like a fancy map() for Numpy arrays.
I.e., as you know, Numpy already has built-in vectorized versions of many common functions, but if you had some custom function like my_special_function(x) and you wanted to be able to call it on Numpy arrays, you could use my_special_function_ufunc = np.vectorize(my_special_function).
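For instance, a minimal sketch of that pattern (the body of my_special_function here is an arbitrary stand-in):
import numpy as np

def my_special_function(x):
    # some scalar-only computation
    return x * x + 1

my_special_function_ufunc = np.vectorize(my_special_function)
my_special_function_ufunc(np.arange(5))  # array([ 1,  2,  5, 10, 17])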
In your above example you might "vectorize" your distance function like:
>>> norm = np.linalg.norm
>>> get_distance4 = np.vectorize(lambda a, b: norm(a - b))
>>> get_distance4(coordinates[:, None, :], coordinates[None, :, :])
However, you will find that this is incredibly slow:
>>> %timeit get_distance4(coordinates[:, None, :], coordinates[None, :, :])
1 loop, best of 3: 10.8 s per loop
This is because your first example get_distances3 is already using Numpy's built-in fast implementations of these operations, whereas the np.vectorize version requires calling the Python function I defined once per element of the broadcast (1000, 1000, 3) result, i.e. some 3 million times.
In fact according to the docs:
The vectorize function is provided primarily for convenience, not for performance. The implementation is essentially a for loop.
If you want a potentially faster function for computing distances between vectors, you could use scipy.spatial.distance.pdist:
>>> %timeit get_distances3(coordinates)
10 loops, best of 3: 24.2 ms per loop
>>> %timeit distance.pdist(coordinates)
1000 loops, best of 3: 1.77 ms per loop
It's worth noting that this has a different return format. Rather than a 1000x1000 array, it uses a condensed format that excludes the i = j and i > j entries. If you wish, you can then use scipy.spatial.distance.squareform to convert back to the square matrix format.
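For example, a quick sketch of converting the condensed output back to a square matrix (same coordinates array as above):
from scipy.spatial import distance
import numpy as np

coordinates = np.random.rand(1000, 3)
condensed = distance.pdist(coordinates)   # shape (499500,): one entry per pair with i < j
square = distance.squareform(condensed)   # shape (1000, 1000), zeros on the diagonal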

Best way to count Greater Than in numpy 2d array

results is a 2d numpy array with 300,000 rows.
count = 0
for i in range(np.size(results, 0)):
    if results[i][0] >= 0.7:
        count += 1
This python code takes me 0.7 seconds, but when I run the equivalent in C++ it takes less than 0.07 seconds.
So how can I make this python code as fast as possible?
When doing numerical computation for speed, especially in Python, you want to avoid for loops wherever possible. Numpy is optimized for "vectorized" computation, so you want to pass off the work you'd typically do in for loops to special numpy indexing and functions like np.where.
I did a quick test on a 300,000 x 600 array of random values from 0 to 1 and found the following.
Your code, non-vectorized with one for loop:
226 ms per run
%%timeit
count = 0
for i in range(np.size(results, 0)):
    if results[i][0] >= 0.7:
        count += 1
emilaz Solution:
8.36 ms per run
%%timeit
first_col = results[:,0]
x = len(first_col[first_col>.7])
Ethan's Solution:
7.84 ms per run
%%timeit
np.bincount(results[:,0]>=.7)[1]
Best I came up with
6.92 ms per run
%%timeit
len(np.where(results[:,0] > 0.7)[0])
All 4 methods yielded the same answer, which for my data was 90,134. Hope this helps!
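Another idiomatic option, not timed above (so treat its relative speed as an assumption), is np.count_nonzero, which is designed for exactly this kind of boolean counting:
np.count_nonzero(results[:, 0] >= 0.7)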
Try
first_col = results[:,0]
res = len(first_col[first_col > .7])
Depending on the shape of your matrix, this can be 2-10 times faster than your approach.
You could give the following a try:
np.bincount(results[:,0]>=.7)[1]
Not sure it's faster, but it should produce the correct answer. (Note that the [1] index will raise an IndexError if no values satisfy the condition, since bincount then returns a length-1 array.)

Combine Einsum Expressions

I would like to evaluate
E = np.einsum('ij,jk,kl->ijkl',A,A,A)
F = np.einsum('ijki->ijk',E)
where A is a matrix (no more than 1000 by 1000 in size). Computing E is slow. I would like to speed this up by computing only the "diagonal" elements, which I store in F. Is it possible to combine these two expressions? Are there any better ways to speed up this computation?
I'm not sure if there is an automatic way, but you can always do the maths yourself and give einsum the final expression. Since E[i,j,k,l] = A[i,j]*A[j,k]*A[k,l], the diagonal is F[i,j,k] = E[i,j,k,i] = A[i,j]*A[j,k]*A[k,i], which corresponds directly to:
F = np.einsum('ij,jk,ki->ijk', A, A, A)
In [86]: A=np.random.randint(0,100,(100,100))
In [88]: E1=np.einsum('ijki->ijk',np.einsum('ij,jk,kl->ijkl',A,A,A))
In [89]: E2=np.einsum('ij,jk,ki->ijk',A,A,A)
In [90]: np.allclose(E1,E2)
Out[90]: True
A good time improvement - about 100x, corresponding to the saved dimension (l):
In [91]: timeit np.einsum('ijki->ijk',np.einsum('ij,jk,kl->ijkl',A,A,A))
1 loops, best of 3: 1.1 s per loop
In [92]: timeit np.einsum('ij,jk,ki->ijk',A,A,A)
100 loops, best of 3: 10.9 ms per loop
einsum performs a combined iteration over all the indices, albeit in Cython code, so reducing the number of indices can yield significant time savings. It looks like folding that i...i diagonal selection into the initial calculation works well.
With only 2 GB of memory, the (1000,1000) case is too large: the E1 version fails with 'iterator too large' and the E2 version with a 'memory error'.

Why does numpy's fromiter function require specifying the dtype when other array creation routines don't?

In order to improve memory efficiency, I've been working on converting some of my code from lists to generators/iterators where I can. I've found a lot of cases where I am just converting a list I've made to an np.array with the code pattern np.array(some_list).
Notably, some_list is often a list comprehension that is iterating over a generator.
I was looking into np.fromiter to see if I could use the generator more directly (rather than having to first cast it into a list and then convert that into a numpy array), but I noticed that the np.fromiter function, unlike every other array creation routine that uses existing data, requires specifying the dtype.
In most of my particular cases I can make that work (I'm mostly dealing with log-likelihoods, so float64 will be fine), but it left me wondering why this is only necessary for the fromiter array creator and not for the other array creators.
First attempts at a guess:
Memory preallocation?
What I understand is that if you know the dtype and the count, it allows preallocating memory for the resulting np.array, and that if you don't specify the optional count argument it will "resize the output array on demand". But if you do not specify the count, it would seem that you should be able to infer the dtype on the fly, in the same way that you can in a normal np.array call.
Datatype recasting?
I could see this being useful for recasting data into new dtypes, but that would hold for other array creation routines as well, and would seem to merit placement as an optional but not required argument.
A couple ways of restating the question
So why is it that you need to specify the dtype to use np.fromiter? Or, put another way, what are the gains from specifying the dtype if the array is going to be resized on demand anyway?
A more subtle version of the same question that is more directly related to my problem:
I know many of the efficiency gains of np.ndarrays are lost when you're constantly resizing them, so what is gained from using np.fromiter(generator,dtype=d) over np.fromiter([gen_elem for gen_elem in generator],dtype=d) over np.array([gen_elem for gen_elem in generator],dtype=d)?
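For reference, here is a small sketch of the three call patterns I'm comparing (the generator and dtype are placeholder stand-ins):
import numpy as np

def gen():
    return (x * 0.5 for x in range(10))  # stand-in generator of floats

d = np.float64
a = np.fromiter(gen(), dtype=d)                # consume the generator directly
b = np.fromiter([e for e in gen()], dtype=d)   # materialize a list, then fromiter
c = np.array([e for e in gen()], dtype=d)      # materialize a list, then np.array
assert np.array_equal(a, b) and np.array_equal(b, c)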
If this code was written a decade ago, and there hasn't been pressure to change it, then the old reasons still apply. Most people are happy using np.array. np.fromiter is mainly used by people who are trying to squeeze out some speed from iterative methods of generating values.
My impression is that np.array, the main alternative, reads/processes the whole input before deciding on the dtype (and other properties):
I can force a float return just by changing one element:
In [395]: np.array([0,1,2,3,4,5])
Out[395]: array([0, 1, 2, 3, 4, 5])
In [396]: np.array([0,1,2,3,4,5,6.])
Out[396]: array([ 0., 1., 2., 3., 4., 5., 6.])
I don't use fromiter much, but my sense is that by requiring dtype, it can start converting the inputs to that type right from the start. That could end up producing a faster iteration, though that needs time tests.
I know that the np.array generality comes at a certain time cost. Often for small lists it is faster to use a list comprehension than to convert it to an array - even though array operations are fast.
Some time tests:
In [404]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=int)
100000 loops, best of 3: 3.35 µs per loop
In [405]: timeit np.fromiter([0,1,2,3,4,5,6.],dtype=float)
100000 loops, best of 3: 3.88 µs per loop
In [406]: timeit np.array([0,1,2,3,4,5,6.])
100000 loops, best of 3: 4.51 µs per loop
In [407]: timeit np.array([0,1,2,3,4,5,6])
100000 loops, best of 3: 3.93 µs per loop
The differences are small, but suggest my reasoning is correct. Requiring dtype helps keep fromiter faster. count does not make a difference in this small size.
Curiously, specifying a dtype for np.array slows it down. It's as though it appends an astype call:
In [416]: timeit np.array([0,1,2,3,4,5,6],dtype=float)
100000 loops, best of 3: 6.52 µs per loop
In [417]: timeit np.array([0,1,2,3,4,5,6]).astype(float)
100000 loops, best of 3: 6.21 µs per loop
The differences between np.array and np.fromiter are more dramatic when I use range(1000) (Python3 generator version)
In [430]: timeit np.array(range(1000))
1000 loops, best of 3: 704 µs per loop
Actually, turning the range into a list is faster:
In [431]: timeit np.array(list(range(1000)))
1000 loops, best of 3: 196 µs per loop
but fromiter is still faster:
In [432]: timeit np.fromiter(range(1000),dtype=int)
10000 loops, best of 3: 87.6 µs per loop
It is faster to apply the int-to-float conversion to the whole array than to each element during the generation/iteration:
In [434]: timeit np.fromiter(range(1000),dtype=int).astype(float)
10000 loops, best of 3: 106 µs per loop
In [435]: timeit np.fromiter(range(1000),dtype=float)
1000 loops, best of 3: 189 µs per loop
Note that the astype resizing operation is not that expensive, only some 20 µs.
============================
array_fromiter(PyObject *NPY_UNUSED(ignored), PyObject *args, PyObject *keywds) is defined in:
https://github.com/numpy/numpy/blob/eeba2cbfa4c56447e36aad6d97e323ecfbdade56/numpy/core/src/multiarray/multiarraymodule.c
It processes the keywds and calls
PyArray_FromIter(PyObject *obj, PyArray_Descr *dtype, npy_intp count)
in
https://github.com/numpy/numpy/blob/97c35365beda55c6dead8c50df785eb857f843f0/numpy/core/src/multiarray/ctors.c
This makes an initial array ret using the defined dtype:
ret = (PyArrayObject *)PyArray_NewFromDescr(&PyArray_Type, dtype, 1,
                                            &elcount, NULL, NULL, 0, NULL);
The data attribute of this array is grown with 50% overallocation => 0, 4, 8, 14, 23, 36, 56, 86 ..., and shrunk to fit at the end.
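That sequence follows a simple rule; here is a sketch in Python (the exact constants are inferred from the numbers above, so treat them as an assumption about this numpy version):
def next_alloc(i):
    # 50% overallocation plus a small constant (4 while tiny, 2 afterwards)
    return i + (i >> 1) + (4 if i < 4 else 2)

sizes = [0]
for _ in range(7):
    sizes.append(next_alloc(sizes[-1]))
print(sizes)  # [0, 4, 8, 14, 23, 36, 56, 86]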
The dtype of this array, PyArray_DESCR(ret), apparently has a function that can take value (provided by the iterator next), convert it, and set it in the data.
PyArray_DESCR(ret)->f->setitem(value, item, ret)
In other words, all the dtype conversion is done by the defined dtype. The code would be a lot more complicated if it decided 'on the fly' how to convert the value (and all previously allocated ones). Most of the code in this function deals with allocating the data buffer.
I'll hold off on looking up np.array. I'm sure it is much more complex.

Why is numpy.dot much faster than numpy.einsum?

I have numpy compiled with OpenBLAS and I am wondering why einsum is much slower than dot (I understand it in the 3-index case, but I don't understand why it is also less performant in the two-index case). Here is an example:
import numpy as np
A = np.random.random([1000,1000])
B = np.random.random([1000,1000])
%timeit np.dot(A,B)
Out: 10 loops, best of 3: 26.3 ms per loop
%timeit np.einsum("ij,jk",A,B)
Out: 5 loops, best of 3: 477 ms per loop
Is there a way to let einsum use OpenBlas and parallelization like numpy.dot?
Why does np.einsum not just call np.dot if it notices a dot product?
einsum parses the index string, then constructs an nditer object and uses that to perform a sum-of-products iteration. It has special cases where the indices just perform axis swaps or sums ('ii->i'), and it may also have special cases for 2 and 3 variables (as opposed to more). But it does not make any attempt to invoke external libraries.
I worked out a pure python work-alike, but with more focus on the parsing than on the calculation special cases.
tensordot reshapes and swaps axes, so that it can then call dot for the actual calculations.
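To see the equivalence concretely, here is a small check of the three routes for the two-index case (timings will vary by machine; note that newer numpy versions also added an optimize flag to einsum that can dispatch to BLAS, but that postdates this answer):
import numpy as np

A = np.random.random([1000, 1000])
B = np.random.random([1000, 1000])

C1 = np.einsum("ij,jk", A, B)          # einsum's own sum-of-products iteration
C2 = np.dot(A, B)                      # BLAS (here OpenBLAS) matrix multiply
C3 = np.tensordot(A, B, axes=(1, 0))   # reshape/swap axes, then dot
assert np.allclose(C1, C2) and np.allclose(C2, C3)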
