Efficiently build an integer from bits of a boolean Numpy array - python

I am looking for a more efficient way to do the equivalent of
myarray * (2**arange(len(myarray)))
Essentially I am after something like numpy.packbits that packs the bits into a single integer for any reasonably sized myarray, yielding an appropriately sized integer. I can implement this using numpy.packbits, but I was wondering whether there is already a builtin that does this.
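For reference, a minimal sketch of the packbits route mentioned above (assuming NumPy >= 1.17 for the bitorder argument); it yields an arbitrary-size Python integer in which bit i carries weight 2**i:

import numpy as np

def bits_to_int(bits):
    # Pack 8 bits per byte, least significant bit first, then let Python's
    # arbitrary-precision int assemble the bytes in little-endian order.
    packed = np.packbits(bits.astype(np.uint8), bitorder="little")
    return int.from_bytes(packed.tobytes(), "little")

bits = np.random.randint(0, 2, 100)
assert bits_to_int(bits) == sum(int(b) << i for i, b in enumerate(bits))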

Three versions:
from numpy import *
from numba import jit
myarray=random.randint(0,2,64).astype(uint64)
def convert1(arr) : return (arr*(2**arange(arr.size,dtype=uint64))).sum()
pow2=2**arange(64,dtype=uint64)
def convert2(arr) : return (arr*pow2[:arr.size]).sum()
@jit("uint64(uint64[:])")
def convert3(arr):
    y = 0
    for i in range(arr.size):
        y = y + pow2[i] * arr[i]
    return y
with the following timings:
In [44]: %timeit convert1(myarray)
10000 loops, best of 3: 62.7 µs per loop
In [45]: %timeit convert2(myarray)
10000 loops, best of 3: 11.6 µs per loop
In [46]: %timeit convert3(myarray)
1000000 loops, best of 3: 1.55 µs per loop
Precomputing and Numba allow big improvements.
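A further variant (my own sketch, not part of the original answer) folds the multiply and sum of convert2 into a single dot product against the precomputed powers:

def convert4(arr):
    # One dot-product call instead of an elementwise multiply followed by a sum.
    return arr.dot(pow2[:arr.size])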

Related

Eager compilation with Numba slower than automatic type inference?

I am currently experimenting with Numba and I just tried one of their example notebooks about jit compilation. I noticed that eager compilation with explicit types was slower than compilation without types. The function given in the example was:
import numpy as np
from numba import jit, autojit
def looped_ver(k, a):
    x = np.empty_like(a)
    x[0] = 0.0
    for i in range(1, x.size):
        sm = 0.0
        for j in range(0, i):
            sm += k[i-j, j] * a[i-j] * a[j]
        x[i] = sm
    return x
typed_ver = jit('f8[:](f8[:,:],f8[:])')(looped_ver)
auto_ver = autojit(looped_ver)
I measured the execution time with
for n in [200, 500, 1000]:
    k = np.random.rand(n, n)
    a = np.random.rand(n)
    %timeit typed_ver(k, a)
    %timeit auto_ver(k, a)
The results are:
10000 loops, best of 3: 42.8 µs per loop
10000 loops, best of 3: 39.2 µs per loop
10000 loops, best of 3: 168 µs per loop
10000 loops, best of 3: 152 µs per loop
1000 loops, best of 3: 1.82 ms per loop
1000 loops, best of 3: 1.66 ms per loop
It seems strange that the typed version is slower than the untyped one. I inspected the compiled code and there was no difference, except that the inferred array type for the typed version was in A order and for the untyped one in C order.
I am using Numba 0.23.1, Numpy 1.10.1, Python 3.5.1 on Windows10.
How can I setup the typed version that it is as fast as the untyped one? Or is that not possible?
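One thing worth trying, given the A-order vs C-order observation (a sketch; I have not verified it on Numba 0.23), is to declare the arrays as C-contiguous in the explicit signature, so that eager compilation sees the same layout that type inference deduced:

from numba import jit

# '::1' marks the innermost dimension as contiguous, i.e. C layout.
typed_c_ver = jit('f8[:](f8[:, ::1], f8[::1])')(looped_ver)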

Why are for loops quicker than numpy for 2D array multiplication

Consider the following two functions, which essentially multiply every number in a small sequence with every number in a larger sequence to build up a 2D array, and then doubles all the values in the array. noloop() uses direct multiplication of 2D numpy arrays and returns the result, whereas loop() uses a for loop to iterate over arr1 and gradually build up an output array.
import numpy as np
arr1 = np.random.rand(100, 1)
arr2 = np.random.rand(1, 100000)
def noloop():
    return (arr1*arr2)*2

def loop():
    out = np.empty((arr1.size, arr2.size))
    for i in range(arr1.size):
        tmp = (arr1[i]*arr2)*2
        out[i] = tmp.reshape(tmp.size)
    return out
I expected noloop to be much faster even for a small number of iterations, but for the array sizes above, loop is actually faster:
>>> %timeit noloop()
10 loops, best of 3: 64.7 ms per loop
>>> %timeit loop()
10 loops, best of 3: 41.6 ms per loop
And interestingly, if I remove *2 in both functions, noloop is faster, but only slightly:
>>> %timeit noloop()
10 loops, best of 3: 29.4 ms per loop
>>> %timeit loop()
10 loops, best of 3: 34.4 ms per loop
Is there a good explanation for these results, and is there a notably faster way to perform the same task?
I wasn't able to reproduce your results, but I did find that I could get a substantial speedup (a factor of 2) using numpy.multiply. By using the out argument, you can take advantage of the fact that the memory is already allocated and eliminate the copying of tmp to out.
def out_loop():
    out = np.empty((arr1.size, arr2.size))
    for i in range(arr1.size):
        np.multiply(arr1[i], arr2, out=out[i].reshape((1, arr2.size)))
        out[i] *= 2
    return out
Results on my machine:
In [32]: %timeit out_loop()
100 loops, best of 3: 17.7 ms per loop
In [33]: %timeit loop()
10 loops, best of 3: 28.3 ms per loop
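The same idea can be applied without the Python-level loop by letting np.multiply broadcast straight into a preallocated output and doubling in place (a sketch under the array shapes above):

def out_noloop():
    out = np.empty((arr1.size, arr2.size))
    np.multiply(arr1, arr2, out=out)  # broadcasting writes directly into out
    out *= 2                          # in-place doubling, no extra temporary
    return out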

Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

Edit to add: I don't think the numba benchmarks are fair, notes below
I'm trying to benchmark different approaches to numerically processing data for the following use case:
Fairly big dataset (100,000+ records)
100+ lines of fairly simple code (z = x + y)
Don't need to sort or index
In other words, the full generality of series and dataframes is not needed, although they are included here because they are still convenient ways to encapsulate the data, and there is often pre- or post-processing that does require the generality of pandas over numpy arrays.
Question: Based on this use case, are the following benchmarks appropriate and if not, how can I improve them?
# importing pandas, numpy, Series, DataFrame in standard way
from numba import jit
nobs = 10000
nlines = 100
def proc_df():
    df = DataFrame({'x': np.random.randn(nobs),
                    'y': np.random.randn(nobs)})
    for i in range(nlines):
        df['z'] = df.x + df.y
    return df.z

def proc_ser():
    x = Series(np.random.randn(nobs))
    y = Series(np.random.randn(nobs))
    for i in range(nlines):
        z = x + y
    return z

def proc_arr():
    x = np.random.randn(nobs)
    y = np.random.randn(nobs)
    for i in range(nlines):
        z = x + y
    return z

@jit
def proc_numba():
    xx = np.random.randn(nobs)
    yy = np.random.randn(nobs)
    zz = np.zeros(nobs)
    for j in range(nobs):
        x, y = xx[j], yy[j]
        for i in range(nlines):
            z = x + y
        zz[j] = z
    return zz
Results (Windows 7, 3-year-old quad-core Xeon workstation, standard and recent Anaconda distribution or very close):
In [1251]: %timeit proc_df()
10 loops, best of 3: 46.6 ms per loop
In [1252]: %timeit proc_ser()
100 loops, best of 3: 15.8 ms per loop
In [1253]: %timeit proc_arr()
100 loops, best of 3: 2.02 ms per loop
In [1254]: %timeit proc_numba()
1000 loops, best of 3: 1.04 ms per loop # may not be valid result (see note below)
Edit to add (response to Jeff): alternate results from passing the df/series/array into the functions rather than creating them inside the functions (i.e. moving the code lines containing 'randn' from inside the functions to outside them):
10 loops, best of 3: 45.1 ms per loop
100 loops, best of 3: 15.1 ms per loop
1000 loops, best of 3: 1.07 ms per loop
100000 loops, best of 3: 17.9 µs per loop # may not be valid result (see note below)
Note on the numba results: I think the numba compiler must be optimizing the for loop and reducing it to a single iteration. I don't know that for sure, but it's the only explanation I can come up with, as it couldn't really be 50x faster than numpy, right? Follow-up question here: Why is numba faster than numpy here?
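One way to make the Numba benchmark resistant to that optimization (a sketch I'm adding here, not from the original post; it reuses nobs, nlines and the imports above) is to accumulate inside the loop so the work cannot be hoisted out:

@jit(nopython=True)
def proc_numba_fair(xx, yy):
    zz = np.zeros(nobs)
    for i in range(nlines):
        for j in range(nobs):
            # Accumulating into zz keeps every iteration of the outer loop live.
            zz[j] += xx[j] + yy[j]
    return zz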
Well, you are not really timing the same things here (or rather, you are timing different aspects).
E.g.
In [6]: x = Series(np.random.randn(nobs))
In [7]: y = Series(np.random.randn(nobs))
In [8]: %timeit x + y
10000 loops, best of 3: 131 µs per loop
In [9]: %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop
So [8] times the actual operation, while [9] includes the overhead of the Series creation (and the random number generation) plus the actual operation.
Another example is proc_ser vs proc_df. proc_df includes the overhead of assigning a particular column in the DataFrame (which is actually different for an initial creation and a subsequent reassignment).
So create the structures first (you can time that too, but that is a separate issue), then perform the exact same operation and time it.
Further, you say that you don't need alignment. Pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check if they are already aligned), while in numba you need to align them 'manually'.
Following up on @Jeff's answer, the code can be further optimized.
nobs = 10000
x = pd.Series(np.random.randn(nobs))
y = pd.Series(np.random.randn(nobs))
%timeit proc_ser()
%timeit x + y
%timeit x.values + y.values
100 loops, best of 3: 11.8 ms per loop
10000 loops, best of 3: 107 µs per loop
100000 loops, best of 3: 12.3 µs per loop

Python, faster to use class property, or variable in a function?

In Python, theoretically, which method should be faster out of test1 and test2 (assuming the same value of x)? I have tried using %timeit but see very little difference.
import numpy as np
class Tester():
    def __init__(self):
        self.x = np.arange(100000)

    def test1(self):
        return np.sum(self.x * self.x)

    def test2(self, x):
        return np.sum(x*x)
In any implementation of Python, the time will be overwhelmingly dominated by the multiplication of two vectors with 100,000 elements each. Everything else is noise compared to that. Make the vector much smaller if you're really interested in measuring other overheads.
In CPython, test2() will most likely be a little faster. It has an "extra" argument, but arguments are unpacked "at C speed" so that doesn't matter much. Arguments are accessed the same way as local variables, via the LOAD_FAST opcode, which is a simple array[index] access.
In test1(), each instance of self.x causes the string "x" to be looked up in the dictionary self.__dict__. That's slower than an indexed array access. But compared to the time taken by the long-winded multiplication, it's basically nothing.
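You can see the difference directly in the bytecode (illustrative; the exact output varies by Python version):

import dis

# test2 reads x with LOAD_FAST (an indexed access to a local/argument),
# while test1 uses LOAD_ATTR for self.x (an attribute lookup).
dis.dis(Tester.test1)
dis.dis(Tester.test2)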
I know this sort of misses the point of the question, but since you tagged the question with numpy and are looking at speed differences for a large array, I thought I would mention that a faster solution would be something else entirely.
What you're doing is a dot product, so use numpy.dot, which does the multiplying and summing all together in an external library (BLAS). (For convenience I'll use the syntax of test1, despite @Tim's answer, because no extra argument needs to be passed.)
def test3(self):
    return np.dot(self.x, self.x)

or possibly even faster (and certainly more general):

def test4(self):
    return np.einsum('i,i->', self.x, self.x)
Here are some tests:
In [363]: paste
class Tester():
    def __init__(self, n):
        self.x = np.arange(n)

    def test1(self):
        return np.sum(self.x * self.x)

    def test2(self, x):
        return np.sum(x*x)

    def test3(self):
        return np.dot(self.x, self.x)

    def test4(self):
        return np.einsum('i,i->', self.x, self.x)
## -- End pasted text --
In [364]: t = Tester(10000)
In [365]: np.allclose(t.test1(), [t.test2(t.x), t.test3(), t.test4()])
Out[365]: True
In [366]: timeit t.test1()
10000 loops, best of 3: 37.4 µs per loop
In [367]: timeit t.test2(t.x)
10000 loops, best of 3: 37.4 µs per loop
In [368]: timeit t.test3()
100000 loops, best of 3: 15.2 µs per loop
In [369]: timeit t.test4()
100000 loops, best of 3: 16.5 µs per loop
In [370]: t = Tester(10)
In [371]: timeit t.test1()
100000 loops, best of 3: 16.6 µs per loop
In [372]: timeit t.test2(t.x)
100000 loops, best of 3: 16.5 µs per loop
In [373]: timeit t.test3()
100000 loops, best of 3: 3.14 µs per loop
In [374]: timeit t.test4()
100000 loops, best of 3: 6.26 µs per loop
And speaking of small, almost syntactic speed differences, consider using the sum method rather than the standalone function:
def test1b(self):
    return (self.x*self.x).sum()
gives:
In [385]: t = Tester(10000)
In [386]: timeit t.test1()
10000 loops, best of 3: 40.6 µs per loop
In [387]: timeit t.test1b()
10000 loops, best of 3: 37.3 µs per loop
In [388]: t = Tester(3)
In [389]: timeit t.test1()
100000 loops, best of 3: 16.6 µs per loop
In [390]: timeit t.test1b()
100000 loops, best of 3: 14.2 µs per loop

Fast check for NaN in NumPy

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
I think np.isnan(np.min(X)) should do what you want.
There are two general approaches here:
Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the differences in speed between different machines.
import numpy as np
import perfplot
def min(a):
    return np.isnan(np.min(a))

def sum(a):
    return np.isnan(np.sum(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def any(a):
    return np.any(np.isnan(a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Even though there exists an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the really efficient way might depend heavily on the operating system. In any case, the dot-based approach seems to be the most stable one.
If you're comfortable with numba, it allows you to create a fast short-circuiting function (one that stops as soon as a NaN is found):
import numba as nb
import math
@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min; I think that's because np.min uses multiprocessing for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results may be achieved with Cython or a C extension; these are a bit more complicated (or easily available as bottleneck.anynan), but ultimately they do the same as my anynan function.
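For completeness, the bottleneck variant mentioned above is a one-liner (assuming the third-party bottleneck package is installed):

import bottleneck as bn

bn.anynan(array)  # short-circuiting NaN check implemented in C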
Use .any():
if numpy.isnan(myarray).any()
numpy.isfinite may be better than isnan for checking (it also flags infinities):
if not np.isfinite(prop).all()
Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
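For a NumPy array X, a vectorized alternative (my addition; it does build the full boolean mask) would be:

nan_indices = np.flatnonzero(np.isnan(X))
first_nan = nan_indices[0] if nan_indices.size else None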
Adding to @nico-schlömer's and @mseifert's answers, I compared the performance of a numba-compiled has_nan test with early stopping against some of the functions that scan the full array.
On my machine, for an array without nans, the break-even happens for ~10^4 elements.
import perfplot
import numpy as np
import numba
import math
def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    # Early exit as soon as the first NaN is found.
    for i in range(a.size):
        if math.isnan(a[i]):
            return True
    return False

def array_with_missing_values(n, p):
    """Return an array of size n with p % of its entries set to NaN
    at random positions. Ex: n=1e6, p=1 -> 1e4 NaNs."""
    a = np.random.rand(n)
    idx = np.random.randint(0, len(a), int(p*len(a)/100))
    a[idx] = np.nan
    return a
#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
What happens if the array has NaNs? I investigated the impact of the NaN coverage of the array.
For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % NaNs (so ~10 NaNs) in the array.
#%%
N = 1000000 # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01,
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)
If in your application most arrays have NaN and you're looking for the ones without, then has_nan is the best approach.
Otherwise, dot seems to be the best option.
