Collection comparison is reflexive, yet does not short-circuit. Why? - python

In Python, the built-in collections compare elements with the explicit assumption that they are reflexive:
In enforcing reflexivity of elements, the comparison of collections assumes that for a collection element x, x == x is always true. Based on that assumption, element identity is compared first, and element comparison is performed only for distinct elements.
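For instance, here is a quick illustration (my own, not from the question) of this identity-first rule, using NaN, the canonical non-reflexive value:
nan = float("nan")
print(nan == nan)       # False: NaN is not equal to itself
print(nan in [nan])     # True: containment checks identity first
print([nan] == [nan])   # True: element identity short-circuits the element ==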
Logically, this means that for any list L, L == L must be True. Given this, why doesn't the implementation check for identity to short-circuit the evaluation?
In [1]: x = list(range(10000000))
In [2]: y = list(range(int(len(x)) // 10))
In [3]: z = [1]
# evaluation time looks like O(N)
In [4]: %timeit x == x
10 loops, best of 3: 21.8 ms per loop
In [5]: %timeit y == y
100 loops, best of 3: 2.2 ms per loop
In [6]: %timeit z == z
10000000 loops, best of 3: 36.4 ns per loop
Clearly, child classes could choose to make an identity check, and clearly an identity check would add a very small overhead to every such comparison.
Was a historical decision explicitly made not to make such a check in the built in sequences to avoid this expense?

While I'm not privy to the developers' thinking, my guess is that they might have felt comparing L == L does not happen often enough to warrant a special check, and moreover, the user can always write (L is L) or (L == L) to build a short-circuiting check if that seems advantageous.
In [128]: %timeit (x is x) or (x == x)
10000000 loops, best of 3: 36.1 ns per loop
In [129]: %timeit (y is y) or (y == y)
10000000 loops, best of 3: 34.8 ns per loop

Related

clever any()-like function to check if at least n elements are True?

Say I have an iterable (in my case a list):
l = [True, False, False, True]
I know that the easiest and fastest way to check if at least one of those elements is True is simply to use any(l), which will return True.
But what if I want to check that at least two elements are True?
My goal is to process it in the fastest way possible.
My code right now looks like this (for two elements):
def check_filter(l):
    if len([i for i in filter(None, l)]) > 1:
        return True
    return False
This is about 10 times slower than any(), and does not seem very pythonic to me.
You could simply use an iterator over the sequence and check that any on the iterator returns True n times:
def check(it, num):
    it = iter(it)
    return all(any(it) for _ in range(num))
>>> check([1, 1, 0], 2)
True
>>> check([1, 1, 0], 3)
False
The key point here is that an iterator remembers where it left off, so each any call starts where the last one ended. Wrapping it in all makes sure it exits early as soon as one any returns False.
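A tiny demonstration (mine, not from the answer) that any consumes, and a later call resumes, the shared iterator:
it = iter([0, 1, 0, 1])
print(any(it))    # True: stops right after consuming the first truthy item
print(list(it))   # [0, 1]: the iterator resumes where any() left off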
Performance-wise this should be faster than most other approaches, although at the cost of some readability.
If you want it even faster, a solution based on map and itertools.repeat can be slightly quicker still:
from itertools import repeat
def check_map(it, num):
    return all(map(any, repeat(iter(it), num)))
Benchmarked against the other answers:
# Second "True" element is in the last place
lst = [1] + [0]*1000 + [1]
%timeit check_map(lst, 2) # 10000 loops, best of 3: 20.3 µs per loop
%timeit check(lst, 2) # 10000 loops, best of 3: 23.5 µs per loop
%timeit many(lst, 2) # 10000 loops, best of 3: 153 µs per loop
%timeit sum(lst) >= 2 # 100000 loops, best of 3: 19.6 µs per loop
# Second "True" element is the second item in the iterable
lst = [1, 1] + [0]*1000
%timeit check_map(lst, 2) # 100000 loops, best of 3: 3.05 µs per loop
%timeit check(lst, 2) # 100000 loops, best of 3: 6.39 µs per loop
%timeit many(lst, 2) # 100000 loops, best of 3: 5.02 µs per loop
%timeit sum(lst) >= 2 # 10000 loops, best of 3: 19.5 µs per loop
L = [True, False, False, True]
This does only the needed iterations:
def many(iterable, n):
    if n < 1:
        return True
    counter = 0
    for x in iterable:
        if x:
            counter += 1
            if counter == n:
                return True
    return False
Now:
>>> many(L, 2)
True
Use sum:
sum(l) >= 2
# True
Presumably any goes through the iterable until it finds an element that is True, and then stops.
Your solution scans all of the elements to see if there are at least 2. Instead, it should stop scanning as soon as it finds a second True element.
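One more short-circuiting variant (my own sketch, not from the answers) uses itertools.islice to ask whether an n-th truthy element exists at all:
from itertools import islice

def at_least(iterable, n):
    # filter(None, ...) yields only truthy items; islice skips the first
    # n - 1 of them, and next() probes for an n-th, stopping early.
    return next(islice(filter(None, iterable), n - 1, None), None) is not None

print(at_least([True, False, False, True], 2))  # True
print(at_least([True, False, False, True], 3))  # False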

what is the most efficient way to find the position of the first np.nan value?

consider the array a
a = np.array([3, 3, np.nan, 3, 3, np.nan])
I could do
np.isnan(a).argmax()
But this requires finding all np.nan just to find the first.
Is there a more efficient way?
I've been trying to figure out if I can pass a parameter to np.argpartition such that np.nan gets sorted first as opposed to last.
EDIT regarding [dup].
There are several reasons this question is different.
That question and answers addressed equality of values. This is in regards to isnan.
Those answers all suffer from the same issue my answer faces. Note, I provided a perfectly valid answer but highlighted its inefficiency. I'm looking to fix the inefficiency.
EDIT regarding second [dup].
It still addresses equality, and the question/answers there are old and very possibly outdated.
It might also be worth looking into numba.jit; without it, the vectorized version will likely beat a straightforward pure-Python search in most scenarios, but after compiling the code, the ordinary search takes the lead, at least in my testing:
In [63]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [70]: %paste
import numba
def naive(a):
    for i in range(len(a)):
        if np.isnan(a[i]):
            return i
def short(a):
    return np.isnan(a).argmax()
@numba.jit
def naive_jit(a):
    for i in range(len(a)):
        if np.isnan(a[i]):
            return i
@numba.jit
def short_jit(a):
    return np.isnan(a).argmax()
## -- End pasted text --
In [71]: %timeit naive(a)
100 loops, best of 3: 7.22 ms per loop
In [72]: %timeit short(a)
The slowest run took 4.59 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 37.7 µs per loop
In [73]: %timeit naive_jit(a)
The slowest run took 6821.16 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 6.79 µs per loop
In [74]: %timeit short_jit(a)
The slowest run took 395.51 times longer than the fastest. This could mean that an intermediate result is being cached.
10000 loops, best of 3: 144 µs per loop
Edit: As pointed out by @hpaulj in their answer, numpy actually ships with an optimized short-circuited search whose performance is comparable with the JITted search above:
In [26]: %paste
def plain(a):
    return a.argmax()
@numba.jit
def plain_jit(a):
    return a.argmax()
## -- End pasted text --
In [35]: %timeit naive(a)
100 loops, best of 3: 7.13 ms per loop
In [36]: %timeit plain(a)
The slowest run took 4.37 times longer than the fastest. This could mean that an intermediate result is being cached.
100000 loops, best of 3: 7.04 µs per loop
In [37]: %timeit naive_jit(a)
100000 loops, best of 3: 6.91 µs per loop
In [38]: %timeit plain_jit(a)
10000 loops, best of 3: 125 µs per loop
I'll nominate
a.argmax()
With @fuglede's test array:
In [1]: a = np.array([np.nan if i % 10000 == 9999 else 3 for i in range(100000)])
In [2]: np.isnan(a).argmax()
Out[2]: 9999
In [3]: np.argmax(a)
Out[3]: 9999
In [4]: a.argmax()
Out[4]: 9999
In [5]: timeit a.argmax()
The slowest run took 29.94 ....
10000 loops, best of 3: 20.3 µs per loop
In [6]: timeit np.isnan(a).argmax()
The slowest run took 7.82 ...
1000 loops, best of 3: 462 µs per loop
I don't have numba installed, so I can't compare that. But my speedup relative to short is greater than @fuglede's 6x.
I'm testing in Py3, which accepts comparisons with np.nan, while Py2 raises a runtime warning. But the code search suggests this isn't dependent on that comparison.
/numpy/core/src/multiarray/calculation.c PyArray_ArgMax plays with axes (moving the one of interest to the end), and delegates the action to arg_func = PyArray_DESCR(ap)->f->argmax, a function that depends on the dtype.
In numpy/core/src/multiarray/arraytypes.c.src it looks like BOOL_argmax short circuits, returning as soon as it encounters a True.
for (; i < n; i++) {
    if (ip[i]) {
        *max_ind = i;
        return 0;
    }
}
And @fname@_argmax also short-circuits on a maximal nan. np.nan is 'maximal' in argmin as well.
#if @isfloat@
        if (@isnan@(mp)) {
            /* nan encountered; it's maximal */
            return 0;
        }
#endif
Comments from experienced C coders are welcomed, but it appears to me that, at least for np.nan, a plain argmax will be as fast as we can get.
Playing with the 9999 in generating a shows that the a.argmax time depends on that value, consistent with short-circuiting.
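A quick way to check that claim (my own sketch, not from the answer) with the timeit module:
import timeit
import numpy as np

for pos in (100, 50000, 99900):
    a = np.full(100000, 3.0)
    a[pos] = np.nan  # place the first (and only) NaN at index pos
    t = timeit.timeit(a.argmax, number=1000)
    print(pos, t)  # time should grow roughly with pos if argmax short-circuits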
Here is a pythonic approach using itertools.takewhile():
from itertools import takewhile
sum(1 for _ in takewhile(np.isfinite, a))
Benchmark with the generator-expression-within-next approach (see note 1 below):
In [118]: a = np.repeat(a, 10000)
In [120]: %timeit next(i for i, j in enumerate(a) if np.isnan(j))
100 loops, best of 3: 12.4 ms per loop
In [121]: %timeit sum(1 for _ in takewhile(np.isfinite, a))
100 loops, best of 3: 11.5 ms per loop
But still (by far) slower than the numpy approach:
In [119]: %timeit np.isnan(a).argmax()
100000 loops, best of 3: 16.8 µs per loop
1. The problem with this approach is the use of the enumerate function, which first returns an enumerate object (an iterator-like object) from the numpy array; calling the generator function and the iterator's next method takes time.
When looking for the first match in scenarios like this, we can iterate through, look for the first match, and exit on that first match rather than processing the entire array. So, we would have an approach using Python's next function, like so -
next((i for i, val in enumerate(a) if np.isnan(val)))
Sample runs -
In [192]: a = np.array([3, 3, np.nan, 3, 3, np.nan])
In [193]: next((i for i, val in enumerate(a) if np.isnan(val)))
Out[193]: 2
In [194]: a[2] = 10
In [195]: next((i for i, val in enumerate(a) if np.isnan(val)))
Out[195]: 5
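Another middle ground (my own sketch, not from the answers): scan the array in fixed-size chunks, so each chunk gets the fast vectorized np.isnan while the loop still exits early once a NaN is found, and only a bounded boolean mask is ever allocated:
import numpy as np

def first_nan(a, chunk=4096):
    for start in range(0, a.size, chunk):
        mask = np.isnan(a[start:start + chunk])
        if mask.any():
            return start + int(mask.argmax())
    return None  # no NaN present

a = np.array([3, 3, np.nan, 3, 3, np.nan])
print(first_nan(a))  # 2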

Fastest way to numerically process 2d-array: dataframe vs series vs array vs numba

Edit to add: I don't think the numba benchmarks are fair; notes below.
I'm trying to benchmark different approaches to numerically processing data for the following use case:
Fairly big dataset (100,000+ records)
100+ lines of fairly simple code (z = x + y)
Don't need to sort or index
In other words, the full generality of series and dataframes is not needed, although they are included here because they are still convenient ways to encapsulate the data, and there is often pre- or post-processing that does require the generality of pandas over numpy arrays.
Question: Based on this use case, are the following benchmarks appropriate and if not, how can I improve them?
# importing pandas, numpy, Series, DataFrame in standard way
from numba import jit
nobs = 10000
nlines = 100
def proc_df():
    df = DataFrame({'x': np.random.randn(nobs),
                    'y': np.random.randn(nobs)})
    for i in range(nlines):
        df['z'] = df.x + df.y
    return df.z

def proc_ser():
    x = Series(np.random.randn(nobs))
    y = Series(np.random.randn(nobs))
    for i in range(nlines):
        z = x + y
    return z

def proc_arr():
    x = np.random.randn(nobs)
    y = np.random.randn(nobs)
    for i in range(nlines):
        z = x + y
    return z

@jit
def proc_numba():
    xx = np.random.randn(nobs)
    yy = np.random.randn(nobs)
    zz = np.zeros(nobs)
    for j in range(nobs):
        x, y = xx[j], yy[j]
        for i in range(nlines):
            z = x + y
        zz[j] = z
    return zz
Results (Win 7, 3-year-old quad-core Xeon workstation; standard and recent Anaconda distribution or very close):
In [1251]: %timeit proc_df()
10 loops, best of 3: 46.6 ms per loop
In [1252]: %timeit proc_ser()
100 loops, best of 3: 15.8 ms per loop
In [1253]: %timeit proc_arr()
100 loops, best of 3: 2.02 ms per loop
In [1254]: %timeit proc_numba()
1000 loops, best of 3: 1.04 ms per loop # may not be valid result (see note below)
Edit to add (response to Jeff): alternate results from passing the df/series/array into the functions rather than creating them inside the functions (i.e., moving the code lines containing 'randn' from inside the functions to outside):
10 loops, best of 3: 45.1 ms per loop
100 loops, best of 3: 15.1 ms per loop
1000 loops, best of 3: 1.07 ms per loop
100000 loops, best of 3: 17.9 µs per loop # may not be valid result (see note below)
Note on numba results: I think the numba compiler must be optimizing the for loop, reducing it to a single iteration. I don't know that, but it's the only explanation I can come up with, as it couldn't be 50x faster than numpy, right? (A sketch addressing this appears at the end of this section.) Followup question here: Why is numba faster than numpy here?
Well, you are not really timing the same things here (or rather, you are timing different aspects).
E.g.
In [6]: x = Series(np.random.randn(nobs))
In [7]: y = Series(np.random.randn(nobs))
In [8]: %timeit x + y
10000 loops, best of 3: 131 µs per loop
In [9]: %timeit Series(np.random.randn(nobs)) + Series(np.random.randn(nobs))
1000 loops, best of 3: 1.33 ms per loop
So [8] times the actual operation, while [9] includes the overhead for the series creation (and the random number generation) PLUS the actual operation.
Another example is proc_ser vs proc_df. proc_df includes the overhead of assignment of a particular column in the DataFrame (which is actually different for an initial creation and a subsequent reassignment).
So create the structures first (you can time that too, but that is a separate issue), then perform the exact same operation and time them.
Further, you say that you don't need alignment. Pandas gives you this by default (and there is no really easy way to turn it off, though it's just a simple check if they are already aligned), while in numba you need to 'manually' align them.
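A quick illustration (mine, not from the answer) of the alignment pandas performs by default:
import pandas as pd
x = pd.Series([1.0, 2.0, 3.0], index=[0, 1, 2])
y = pd.Series([1.0, 2.0, 3.0], index=[2, 1, 0])
print((x + y).tolist())     # [4.0, 4.0, 4.0]: added by matching index labels
print(x.values + y.values)  # [2. 4. 6.]: raw positional add, no alignment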
Following up on @Jeff's answer, the code can be further optimized:
nobs = 10000
x = pd.Series(np.random.randn(nobs))
y = pd.Series(np.random.randn(nobs))
%timeit proc_ser()
%timeit x + y
%timeit x.values + y.values
100 loops, best of 3: 11.8 ms per loop
10000 loops, best of 3: 107 µs per loop
100000 loops, best of 3: 12.3 µs per loop
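On the OP's note about the numba timing: here is a sketch (mine, with a hypothetical name) that makes each of the nlines iterations observable by accumulating into z, so the JIT cannot legally collapse the inner loop (floating-point addition is not associative, and numba's default strict FP mode preserves every iteration):
import numpy as np
from numba import jit

nlines = 100

@jit(nopython=True)
def proc_numba_acc(xx, yy):  # hypothetical variant of proc_numba
    zz = np.zeros(xx.size)
    for j in range(xx.size):
        z = 0.0
        for i in range(nlines):
            z += xx[j] + yy[j]  # result now depends on all nlines iterations
        zz[j] = z
    return zz
If this version's time per element scales with nlines while the original's does not, that supports the loop-folding explanation.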

Efficient iteration over slice in Python

How efficient are iterations over slice operations in Python? And if a copy is inevitable with slices, is there an alternative?
I know that a slice operation over a list is O(k), where k is the size of the slice.
x[5 : 5+k] # O(k) copy operation
However, when iterating over a part of a list, I find that the cleanest (and most Pythonic?) way to do this (without having to resort to indices) is to do:
for elem in x[5 : 5+k]:
print elem
However my intuition is that this still results in an expensive copy of the sublist, rather than simply iterating over the existing list.
Use:
for elem in x[5 : 5+k]:
It's Pythonic! Don't change this until you've profiled your code and determined that this is a bottleneck -- though I doubt you will ever find this to be the main source of a bottleneck.
In terms of speed it will probably be your best choice:
In [30]: x = range(100)
In [31]: k = 90
In [32]: %timeit x[5:5+k]
1000000 loops, best of 3: 357 ns per loop
In [35]: %timeit list(IT.islice(x, 5, 5+k))
100000 loops, best of 3: 2.42 us per loop
In [36]: %timeit [x[i] for i in xrange(5, 5+k)]
100000 loops, best of 3: 5.71 us per loop
In terms of memory, it is not as bad as you might think. x[5: 5+k] is a shallow copy of part of x. So even if the objects in x are large, x[5: 5+k] is creating a new list with k elements which reference the same objects as in x. So you only need extra memory to create a list with k references to pre-existing objects. That probably is not going to be the source of any memory problems.
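A quick illustration (mine, not from the answer) that the copy is shallow, i.e. the new list references the very same element objects:
x = [object() for _ in range(10)]
s = x[5:8]
print(s[0] is x[5], s[1] is x[6], s[2] is x[7])  # True True True: same objects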
You can use itertools.islice to get a sliced iterator from the list:
Example:
>>> from itertools import islice
>>> lis = range(20)
>>> for x in islice(lis, 10, None, 1):
... print x
...
10
11
12
13
14
15
16
17
18
19
Update:
As noted by @user2357112, the performance of islice depends on the start point of the slice and the size of the iterable; a normal slice is going to be fast in almost all cases and should be preferred. Here are some more timing comparisons:
For huge lists, islice is slightly faster than or equal to the normal slice when the slice's start point is less than half the size of the list; for bigger indexes, the normal slice is the clear winner.
>>> def func(lis, n):
...     it = iter(lis)
...     for x in islice(it, n, None, 1): pass
...
>>> def func1(lis, n):
...     #it = iter(lis)
...     for x in islice(lis, n, None, 1): pass
...
>>> def func2(lis, n):
...     for x in lis[n:]: pass
...
>>> lis = range(10**6)
>>> n = 100
>>> %timeit func(lis, n)
10 loops, best of 3: 62.1 ms per loop
>>> %timeit func1(lis, n)
1 loops, best of 3: 60.8 ms per loop
>>> %timeit func2(lis, n)
1 loops, best of 3: 82.8 ms per loop
>>> n = 1000
>>> %timeit func(lis, n)
10 loops, best of 3: 64.4 ms per loop
>>> %timeit func1(lis, n)
1 loops, best of 3: 60.3 ms per loop
>>> %timeit func2(lis, n)
1 loops, best of 3: 85.8 ms per loop
>>> n = 10**4
>>> %timeit func(lis, n)
10 loops, best of 3: 61.4 ms per loop
>>> %timeit func1(lis, n)
10 loops, best of 3: 61 ms per loop
>>> %timeit func2(lis, n)
1 loops, best of 3: 80.8 ms per loop
>>> n = (10**6)/2
>>> %timeit func(lis, n)
10 loops, best of 3: 39.2 ms per loop
>>> %timeit func1(lis, n)
10 loops, best of 3: 39.6 ms per loop
>>> %timeit func2(lis, n)
10 loops, best of 3: 41.5 ms per loop
>>> n = (10**6)-1000
>>> %timeit func(lis, n)
100 loops, best of 3: 18.9 ms per loop
>>> %timeit func1(lis, n)
100 loops, best of 3: 18.8 ms per loop
>>> %timeit func2(lis, n)
10000 loops, best of 3: 50.9 us per loop #clear winner for large index
For small lists, the normal slice is faster than islice in almost all cases.
>>> lis = range(1000)
>>> n = 100
>>> %timeit func(lis, n)
10000 loops, best of 3: 60.7 us per loop
>>> %timeit func1(lis, n)
10000 loops, best of 3: 59.6 us per loop
>>> %timeit func2(lis, n)
10000 loops, best of 3: 59.9 us per loop
>>> n = 500
>>> %timeit func(lis, n)
10000 loops, best of 3: 38.4 us per loop
>>> %timeit func1(lis, n)
10000 loops, best of 3: 33.9 us per loop
>>> %timeit func2(lis, n)
10000 loops, best of 3: 26.6 us per loop
>>> n = 900
>>> %timeit func(lis, n)
10000 loops, best of 3: 20.1 us per loop
>>> %timeit func1(lis, n)
10000 loops, best of 3: 17.2 us per loop
>>> %timeit func2(lis, n)
10000 loops, best of 3: 11.3 us per loop
Conclusion:
Go for normal slices.
Just traverse the desired indexes, there's no need to create a new slice for this:
for i in xrange(5, 5+k):
    print x[i]
Granted: it looks unpythonic, but it's more efficient than creating a new slice in the sense that no extra memory is wasted. An alternative would be to use an iterator, as demonstrated in @AshwiniChaudhary's answer.
You're already doing an O(n) iteration over the slice. In most cases, this will be much more of a concern than the actual creation of the slice, which happens entirely in optimized C. Looping over a slice once you've made it takes over twice as long as making the slice, even if you don't do anything with it:
>>> timeit.timeit('l[50:100]', 'import collections; l=range(150)')
0.46978958638010226
>>> timeit.timeit('for x in slice: pass',
'import collections; l=range(150); slice=l[50:100]')
1.2332711270150867
You might try iterating over the indices with xrange, but accounting for the time needed to retrieve the list element, it's slower than slicing. Even if you skip that part, it still doesn't beat slicing:
>>> timeit.timeit('for i in xrange(50, 100): x = l[i]', 'l = range(150)')
4.3081963062022055
>>> timeit.timeit('for i in xrange(50, 100): pass', 'l = range(150)')
1.675838213385532
Do not use itertools.islice for this! It will loop through your list from the start rather than skipping to the values you want with __getitem__. Here's some timing data showing how its performance depends on where the slice starts:
>>> timeit.timeit('next(itertools.islice(l, 9, None))', 'import itertools; l = range(1000000)')
0.5628290558478852
>>> timeit.timeit('next(itertools.islice(l, 999, None))', 'import itertools; l = range(1000000)')
6.885294697594759
Here's islice losing to regular slicing:
>>> timeit.timeit('for i in itertools.islice(l, 900, None): pass', 'import itertools; l = range(1000)')
8.979957560911316
>>> timeit.timeit('for i in l[900:]: pass', 'import itertools; l = range(1000)')
3.0318417204211983
This is on Python 2.7.5, in case any later versions add list-specific optimizations.
I think a better way is to use a C-like iteration if k is very large, since a large k (like 10000000000000) could make you wait about 10 hours for an answer with a pythonic for loop.
Here is what I'm suggesting you do:
i = 5      # the initial value
f = 5 + k  # the final index
while i < f:
    print(x[i])
    i += 1
I assume this one could do it in just 5 hours (whereas the pythonic equivalent for loop takes about 10 hours) for that large k, because it only needs to go from 5 to 10000000000005 once!
Every use of range() or xrange(), or even the slicing itself (as you mentioned above), makes the program do 20000000000000 iterations, which could lead to a longer execution time, I think. (As I understand it, using a generator returns an iterable object that needs the generator to run completely first, so it takes twice the time to do the job: once for the generator itself and once more for the for loop.)
Edited:
In Python 3, the generator does not need to run first to make the iterable object for the for loop.
The accepted answer does not offer an efficient solution. It runs in order of n, where n is the slice's starting point. If n is large, it is going to be a problem.
The followup conclusion ("Go for normal slices") is not ideal either, because it uses extra space in order of k for copying.
Python provides a very elegant and efficient solution to the slicing problem, called a generator expression, that works as optimally as you can get: O(1) space and O(k) running time:
(l[i] for i in range(n,n+k))
It is similar to a list comprehension, except that it works lazily and you can combine it with other iterator tools such as the itertools module or filters. Note the round parentheses enclosing the expression.
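For example (a small usage sketch of my own), iterating a window of k elements starting at n without materializing a copy:
l = list(range(100))
n, k = 5, 10
for elem in (l[i] for i in range(n, n + k)):
    print(elem)  # visits l[5] through l[14]; no k-element list is built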
For iterating through a subarray (for creating a subarray, see unutbu's answer), slicing is slightly faster than indexing, even in the worst case (l[1:]).
10 items
========
Slicing: 2.570001e-06 s
Indexing: 3.269997e-06 s
100 items
=========
Slicing: 6.820001e-06 s
Indexing: 1.220000e-05 s
1000 items
==========
Slicing: 7.647000e-05 s
Indexing: 1.482100e-04 s
10000 items
===========
Slicing: 2.876200e-04 s
Indexing: 5.270000e-04 s
100000 items
============
Slicing: 3.763300e-03 s
Indexing: 7.731050e-03 s
1000000 items
=============
Slicing: 2.963523e-02 s
Indexing: 4.921381e-02 s
Benchmark code:
def f_slice(l):
    for v in l[1:]:
        _x = v

def f_index(l):
    for i in range(1, len(l)):
        _x = l[i]

from time import perf_counter

def func_time(func, l):
    start = perf_counter()
    func(l)
    return perf_counter() - start

def bench(num_item):
    l = list(range(num_item))
    times = 10
    t_index = t_slice = 0
    for _ in range(times):
        t_slice += func_time(f_slice, l)
        t_index += func_time(f_index, l)
    print(f"Slicing: {t_slice/times:e} s")
    print(f"Indexing: {t_index/times:e} s")

for i in range(1, 7):
    s = f"{10**i} items"
    print(s)
    print('=' * len(s))
    bench(10**i)
    print()

Fast check for NaN in NumPy

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
I think np.isnan(np.min(X)) should do what you want.
There are two general approaches here:
Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones that are executed in BLAS, like dot) can make those quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions. This explains the difference in speed between different machines.
import numpy as np
import perfplot
def min(a):
    return np.isnan(np.min(a))

def sum(a):
    return np.isnan(np.sum(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def any(a):
    return np.any(np.isnan(a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Even though there exists an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the really efficient way might be heavily dependent on the operating system. Anyway, the dot(.)-based approach seems to be the most stable one.
If you're comfortable with numba, it allows you to create a fast short-circuiting function (one that stops as soon as a NaN is found):
import numba as nb
import math
@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min; I think that's because np.min uses multiprocessing for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results may be achieved with Cython or a C extension; these are a bit more complicated (or easily available as bottleneck.anynan) but ultimately do the same as my anynan function.
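For reference, a usage sketch of the bottleneck option mentioned above (assumes the optional bottleneck package is installed):
import bottleneck as bn
import numpy as np

a = np.random.random(1000)
print(bn.anynan(a))  # False
a[500] = np.nan
print(bn.anynan(a))  # True; implemented in C with the same short-circuit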
use .any()
if numpy.isnan(myarray).any()
numpy.isfinite may be better than isnan for checking:
if not np.isfinite(prop).all()
Related to this is the question of how to find the first occurrence of NaN. This is the fastest way I know of to handle that:
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
Adding to @nico-schlömer's and @mseifert's answers, I computed the performance of a numba-jitted has_nan with early stops, compared to some of the functions that will parse the full array.
On my machine, for an array without nans, the break-even happens for ~10^4 elements.
import perfplot
import numpy as np
import numba
import math
def min(a):
    return np.isnan(np.min(a))

def dot(a):
    return np.isnan(np.dot(a, a))

def einsum(a):
    return np.isnan(np.einsum("i->", a))

@numba.njit
def has_nan(a):
    for i in range(a.size):  # note: the original range(a.size - 1) skipped the last element
        if math.isnan(a[i]):
            return True
    return False

def array_with_missing_values(n, p):
    """Return array of size n with p% nans.
    Ex: n=1e6, p=1 : 1e4 nans assigned at random positions."""
    a = np.random.rand(n)
    p = np.random.randint(0, len(a), int(p * len(a) / 100))
    a[p] = np.nan
    return a

#%%
perfplot.show(
    setup=lambda n: array_with_missing_values(n, 0),
    kernels=[min, dot, has_nan],
    n_range=[2 ** k for k in range(20)],
    logx=True,
    logy=True,
    xlabel="len(a)",
)
What happens if the array has nans? I investigated the impact of the nan coverage of the array.
For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % nans (so ~10 nans) in the array.
#%%
N = 1000000  # 100000
perfplot.show(
    setup=lambda p: array_with_missing_values(N, p),
    kernels=[min, dot, has_nan],
    n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01,
    logy=True,
    xlabel=f"% of nan in array (N = {N})",
)
If in your application most arrays have nan and you're looking for ones without, then has_nan is the best approach.
Otherwise, dot seems to be the best option.
