Numpy does not see equal floating-point numbers [duplicate] - python

I need to check if a given float is close, within a given tolerance, to any float in an array of floats.
import numpy as np
# My float
a = 0.27
# The tolerance
t = 0.01
# Array of floats
arr_f = np.arange(0.05, 0.75, 0.008)
Is there a simple way to do this? Something like if a in arr_f: but allowing for some tolerance in the difference?
Addendum:
By "allow tolerance" I mean it in the following sense:
for i in arr_f:
    if abs(a - i) <= t:
        print('float a is in arr_f within tolerance t')
        break

How about using np.isclose?
>>> np.isclose(arr_f, a, atol=0.01).any()
True
np.isclose compares two arrays element-wise and returns a boolean array marking the elements that agree within a tolerance. The keyword argument atol sets the absolute tolerance. Note that the default relative tolerance rtol=1e-05 is added on top of it, so the actual test here is abs(arr_f - a) <= atol + rtol * abs(a).
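Because np.isclose includes a relative term by default, a strictly absolute check needs rtol=0. The following is a minimal sketch of that idea; the helper name any_within is hypothetical and not part of NumPy.

```python
import numpy as np

def any_within(arr, value, tol):
    """Return True if any element of arr lies within tol of value (absolute)."""
    arr = np.asarray(arr, dtype=float)
    # rtol=0.0 disables the relative term, leaving abs(arr - value) <= tol.
    return bool(np.isclose(arr, value, rtol=0.0, atol=tol).any())

arr_f = np.arange(0.05, 0.75, 0.008)
found_near = any_within(arr_f, 0.27, 0.01)   # 0.266 and 0.274 are close enough
found_far = any_within(arr_f, 5.0, 0.01)     # nothing in arr_f is near 5.0
```

With the default rtol, values that differ by a tiny relative amount would still count as close even when atol is 0, which is why the sketch sets rtol explicitly.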

If you're only interested in a True/False result, then this should work:
In [1]: (abs(arr_f - a) < t).any()
Out[1]: True
Explanation: abs(arr_f - a) < t returns a boolean array on which any() is invoked in order to find out whether any of its values is True.
EDIT - Comparing this approach with np.isclose from the other answer shows that this one is faster: about 4x on the small array, with a narrower gap on the large one:
In [37]: arr_f = np.arange(0.05, 0.75, 0.008)
In [38]: timeit (abs(arr_f - a) < t).any()
100000 loops, best of 3: 11.5 µs per loop
In [39]: timeit np.isclose(arr_f, a, atol=t).any()
10000 loops, best of 3: 44.7 µs per loop
In [40]: arr_f = np.arange(0.05, 1000000, 0.008)
In [41]: timeit (abs(arr_f - a) < t).any()
1 loops, best of 3: 646 ms per loop
In [42]: timeit np.isclose(arr_f, a, atol=t).any()
1 loops, best of 3: 802 ms per loop
An alternative solution that also returns the relevant indices is as follows:
In [5]: np.where(abs(arr_f - a) < t)[0]
Out[5]: array([27, 28])
This means that the values residing in indices 27 and 28 of arr_f are within the desired range, and indeed:
In [9]: arr_f[27]
Out[9]: 0.26600000000000001
In [10]: arr_f[28]
Out[10]: 0.27400000000000002
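This approach can also produce a True/False result. Test the size of the index array rather than calling .any() on it, because .any() returns False when the only match is at index 0:
In [11]: np.where(abs(arr_f - a) < t)[0].size > 0
Out[11]: True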
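One caveat when turning an array of indices into a boolean: .any() on the indices asks whether any index is nonzero, not whether any match exists. A small sketch with made-up values, where the only match sits at index 0:

```python
import numpy as np

arr = np.array([0.27, 0.5, 0.9])   # illustrative values, match is at index 0
a, t = 0.27, 0.01

idx = np.where(np.abs(arr - a) < t)[0]   # array of matching indices

any_on_indices = bool(idx.any())     # tests whether an index is nonzero
size_check = idx.size > 0            # tests whether a match exists
mask_check = bool((np.abs(arr - a) < t).any())  # tests the mask directly
```

Checking idx.size, or calling .any() on the boolean mask itself, avoids the problem.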

If sliceArray is sorted, the first index whose value is greater than aimFloat can be found like this:
[temp] = np.where(sliceArray > aimFloat)
temp[0] is the answer.
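When the array is known to be sorted, a binary search can replace the full scan. The sketch below is one possible way to do that with np.searchsorted; the function name close_in_sorted is hypothetical, and the approach assumes ascending order.

```python
import numpy as np

def close_in_sorted(sorted_arr, value, tol):
    """Check whether a sorted (ascending) array has an element within tol of value."""
    sorted_arr = np.asarray(sorted_arr, dtype=float)
    # Insertion point: everything left of i is < value, everything from i on is >= value.
    i = int(np.searchsorted(sorted_arr, value))
    # Only the neighbours on either side of the insertion point can be closest.
    lo = max(i - 1, 0)
    hi = min(i + 1, sorted_arr.size)
    return bool((np.abs(sorted_arr[lo:hi] - value) <= tol).any())

arr_f = np.arange(0.05, 0.75, 0.008)
inside = close_in_sorted(arr_f, 0.27, 0.01)
outside = close_in_sorted(arr_f, 1.0, 0.01)
```

This inspects at most two elements after an O(log n) search, which matters mainly when the same sorted array is queried many times.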

Related

Numpy searchsorted along many dimensions? [duplicate]

Assume that I have two arrays A and B, where both A and B are m x n. My goal is now, for each row of A and B, to find where I should insert the elements of row i of A in the corresponding row of B. That is, I wish to apply np.digitize or np.searchsorted to each row of A and B.
My naive solution is to simply iterate over the rows. However, this is far too slow for my application. My question is therefore: is there a vectorized implementation of either algorithm that I haven't managed to find?
We can add an offset to each row that grows with the row index, using the same offset for both arrays. Running np.searchsorted on the flattened, offset arrays then restricts each row of b to the sorted positions that belong to the corresponding row of a. To make this work for negative numbers too, the offset step must be based on the range of the values (maximum minus minimum) rather than on the maximum alone.
So, we would have a vectorized implementation like so -
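def searchsorted2d(a,b):
    m,n = a.shape
    # Offset step: larger than the combined value range of a and b, so that
    # shifted rows never overlap, even when a and b cover different ranges
    max_num = np.maximum(a.max(), b.max()) - np.minimum(a.min(), b.min()) + 1
    r = max_num*np.arange(m)[:,None]
    p = np.searchsorted( (a+r).ravel(), (b+r).ravel() ).reshape(m,-1)
    # Convert positions in the flattened array back to per-row positions
    return p - n*(np.arange(m)[:,None])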
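To see the offset trick at work on small integer inputs, here is a self-contained sketch. The function name searchsorted_rows is hypothetical, and the offset step is taken from the combined range of both arrays, which is an assumption made here so that rows stay separated even when a and b cover different ranges.

```python
import numpy as np

def searchsorted_rows(a, b):
    """Row-wise np.searchsorted for integer arrays whose rows of a are sorted."""
    m, n = a.shape
    # Step larger than the combined value range, so shifted rows cannot overlap.
    step = max(a.max(), b.max()) - min(a.min(), b.min()) + 1
    r = step * np.arange(m)[:, None]
    flat = np.searchsorted((a + r).ravel(), (b + r).ravel())
    # Subtract the start position of each row in the flattened array.
    return flat.reshape(m, -1) - n * np.arange(m)[:, None]

a = np.array([[1, 3, 5],
              [2, 4, 6]])
b = np.array([[0, 4],
              [7, 2]])

vectorized = searchsorted_rows(a, b)
looped = np.array([np.searchsorted(a[i], b[i]) for i in range(len(a))])
```

Here the shifted a becomes [1, 3, 5, 10, 12, 14], so a value from row 1 of b can only land among the last three positions.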
Runtime test -
In [173]: def searchsorted2d_loopy(a,b):
...:     out = np.zeros(a.shape,dtype=int)
...:     for i in range(len(a)):
...:         out[i] = np.searchsorted(a[i],b[i])
...:     return out
...:
In [174]: # Setup input arrays
...: a = np.random.randint(11,99,(10000,20))
...: b = np.random.randint(11,99,(10000,20))
...: a = np.sort(a,1)
...: b = np.sort(b,1)
...:
In [175]: np.allclose(searchsorted2d(a,b),searchsorted2d_loopy(a,b))
Out[175]: True
In [176]: %timeit searchsorted2d_loopy(a,b)
10 loops, best of 3: 28.6 ms per loop
In [177]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 13.7 ms per loop
The solution provided by @Divakar is ideal for integer data, but beware of precision issues for floating-point values, especially if they span multiple orders of magnitude (e.g. [[1.0, 2.0, 3.0, 1.0e+20],...]). In some cases r may be so large that computing a+r and b+r wipes out the original values you're trying to run searchsorted on, and you end up comparing r to r.
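The loss of precision is easy to reproduce. In this small sketch (the values are illustrative), an offset of 1e20 swallows the original entries entirely, because float64 cannot represent 1e20 + 1:

```python
import numpy as np

row = np.array([1.0, 2.0, 3.0])
offset = 1.0e20          # an offset of the size a 1.0e+20 entry would produce

shifted = row + offset   # each sum rounds back to exactly 1.0e20

all_collapsed = bool((shifted == offset).all())
distinct_values = np.unique(shifted).size
```

After the shift, all three entries are identical, so searchsorted has nothing left to distinguish them by.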
To make the approach more robust for floating-point data, you could embed the row information into the arrays as part of the values (as a structured dtype), and run searchsorted on these structured dtypes instead.
def searchsorted_2d (a, v, side='left', sorter=None):
    import numpy as np
    # Make sure a and v are numpy arrays.
    a = np.asarray(a)
    v = np.asarray(v)
    # Augment a with row id
    ai = np.empty(a.shape,dtype=[('row',int),('value',a.dtype)])
    ai['row'] = np.arange(a.shape[0]).reshape(-1,1)
    ai['value'] = a
    # Augment v with row id
    vi = np.empty(v.shape,dtype=[('row',int),('value',v.dtype)])
    vi['row'] = np.arange(v.shape[0]).reshape(-1,1)
    vi['value'] = v
    # Perform searchsorted on augmented array.
    # The row information is embedded in the values, so only the equivalent rows
    # between a and v are considered.
    result = np.searchsorted(ai.flatten(),vi.flatten(), side=side, sorter=sorter)
    # Restore the original shape, decode the searchsorted indices so they apply to the original data.
    result = result.reshape(vi.shape) - vi['row']*a.shape[1]
    return result
Edit: The timing on this approach is abysmal!
In [21]: %timeit searchsorted_2d(a,b)
10 loops, best of 3: 92.5 ms per loop
You would be better off just using map over the array:
In [22]: %timeit np.array(list(map(np.searchsorted,a,b)))
100 loops, best of 3: 13.8 ms per loop
For integer data, @Divakar's approach is still the fastest:
In [23]: %timeit searchsorted2d(a,b)
100 loops, best of 3: 7.26 ms per loop

Average of numpy array ignoring specified value

I have a number of 1-dimensional numpy ndarrays containing the path length between a given node and all other nodes in a network for which I would like to calculate the average. The matter is complicated though by the fact that if no path exists between two nodes the algorithm returns a value of 2147483647 for that given connection. If I leave this value untreated it would obviously grossly inflate my average as a typical path length would be somewhere between 1 and 3 in my network.
One option of dealing with this would be to loop through all elements of all arrays and replace 2147483647 with NaN and then use numpy.nanmean to find the average though that is probably not the most efficient method of going about it. Is there a way of calculating the average with numpy just ignoring all values of 2147483647?
I should add that, I could have up to several million arrays with several million values to average over so any performance gain in how the average is found will make a real difference.
Why not use the usual NumPy boolean filtering for this?
m = my_array[my_array != 2147483647].mean()
By the way, if you really want speed, the algorithm as described seems naive and could probably be improved a lot.
Oh and I guess that you are calculating the mean because you have rigorously checked that the underlying distribution is normal so that it means something, aren't you?
np.nanmean(np.where(my_array == 2147483647, np.nan, my_array))
Timings
a = np.random.randn(100000)
a[::10] = 2147483647
%timeit np.nanmean(np.where(a == 2147483647, np.nan, a))
1000 loops, best of 3: 639 µs per loop
%timeit a[a != 2147483647].mean()
1000 loops, best of 3: 259 µs per loop
import pandas as pd
%timeit pd.Series(a).ne(2147483647).mean()  # caution: this averages the boolean mask, not the retained values
1000 loops, best of 3: 493 µs per loop
One way would be to compute the sum of all elements in one go and then remove the contribution of the invalid ones. To get the average, divide the result by the number of valid elements. So, we would have an implementation like so -
def mean_ignore_num(arr,num):
    # Get count of invalid ones
    invc = np.count_nonzero(arr==num)
    # Get the average value for all numbers and remove contribution from num
    return (arr.sum() - invc*num)/float(arr.size-invc)
Verify results -
In [191]: arr = np.full(10,2147483647).astype(np.int32)
...: arr[1] = 5
...: arr[4] = 4
...:
In [192]: arr.max()
Out[192]: 2147483647
In [193]: arr.sum() # Accumulated in a wider integer type here, so no int32 overflow
Out[193]: 17179869185
In [194]: arr[arr != 2147483647].mean()
Out[194]: 4.5
In [195]: mean_ignore_num(arr,2147483647)
Out[195]: 4.5
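The sum-based approach relies on the total not overflowing. The sketch below makes that explicit by requesting a 64-bit accumulator for integer input and also guards against the case where every entry is the sentinel. The name mean_ignore_value is hypothetical, and the NaN return for "no valid entries" is an assumption about the desired behaviour.

```python
import numpy as np

def mean_ignore_value(arr, sentinel):
    """Mean of arr, skipping entries equal to sentinel (sum-and-subtract variant)."""
    arr = np.asarray(arr)
    invalid = int(np.count_nonzero(arr == sentinel))
    valid = arr.size - invalid
    if valid == 0:
        return float("nan")          # assumption: nothing to average
    if np.issubdtype(arr.dtype, np.integer):
        # Ask for an int64 accumulator instead of relying on the platform default.
        total = int(arr.sum(dtype=np.int64))
    else:
        total = float(arr.sum())
    return (total - invalid * sentinel) / valid

arr = np.full(10, 2147483647, dtype=np.int32)
arr[1] = 5
arr[4] = 4
result = mean_ignore_value(arr, 2147483647)
```

Converting the total to a Python int keeps the subtraction exact regardless of how many sentinel entries there are.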
Runtime test -
In [38]: arr = np.random.randint(0,9,(10000))
In [39]: arr[arr != 7].mean()
Out[39]: 3.6704609489462414
In [40]: mean_ignore_num(arr,7)
Out[40]: 3.6704609489462414
In [41]: %timeit arr[arr != 7].mean()
10000 loops, best of 3: 102 µs per loop
In [42]: %timeit mean_ignore_num(arr,7)
10000 loops, best of 3: 36.6 µs per loop

Collection comparison is reflexive, yet does not short circuit. Why?

In Python, the built-in collections compare elements with the explicit assumption that they are reflexive:
In enforcing reflexivity of elements, the comparison of collections assumes that for a collection element x, x == x is always true. Based on that assumption, element identity is compared first, and element comparison is performed only for distinct elements.
Logically, this means that for any list L, L == L must be True. Given this, why doesn't the implementation check for identity to short circuit the evaluation?
In [1]: x = list(range(10000000))
In [2]: y = list(range(int(len(x)) // 10))
In [3]: z = [1]
# evaluation time looks like O(N)
In [4]: %timeit x == x
10 loops, best of 3: 21.8 ms per loop
In [5]: %timeit y == y
100 loops, best of 3: 2.2 ms per loop
In [6]: %timeit z == z
10000000 loops, best of 3: 36.4 ns per loop
Clearly, child classes could choose to make an identity check, and clearly an identity check would add a very small overhead to every such comparison.
Was a historical decision explicitly made not to make such a check in the built-in sequences to avoid this expense?
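The identity-first comparison of elements described above can be observed with NaN, which is not equal to itself. A short illustration:

```python
nan = float("nan")

# NaN compares unequal to itself...
self_equal = (nan == nan)

# ...but a list comparison checks element identity first, so two lists
# holding the very same NaN object compare equal.
same_object = ([nan] == [nan])

# With two distinct NaN objects the identity check fails and == is used.
different_objects = ([nan] == [float("nan")])
```

So the identity shortcut is applied per element, while the list objects themselves are still walked element by element.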
While I'm not privy to the developers' thinking, my guess is that they felt comparing L == L does not happen often enough to warrant a special check. Moreover, users can always write (L is L) or (L == L) to build a short-circuiting check themselves if they deem that advantageous.
In [128]: %timeit (x is x) or (x == x)
10000000 loops, best of 3: 36.1 ns per loop
In [129]: %timeit (y is y) or (y == y)
10000000 loops, best of 3: 34.8 ns per loop
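The same short circuit can be packaged in a subclass, as the question suggests. This is only a sketch; the class name ShortCircuitList is hypothetical, and only == is overridden (the inherited != still walks the elements).

```python
class ShortCircuitList(list):
    """A list whose == returns immediately when compared with itself."""

    def __eq__(self, other):
        if self is other:            # identity check costs O(1)
            return True
        return super().__eq__(other) # fall back to element-wise comparison

    __hash__ = None                  # stay unhashable, like list

x = ShortCircuitList(range(5))
same = (x == x)
equal_content = (x == [0, 1, 2, 3, 4])
unequal_content = (x == [0])
```

The extra identity test adds a small constant cost to every comparison, which is the trade-off the question asks about.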

Fast check for NaN in NumPy

I'm looking for the fastest way to check for the occurrence of NaN (np.nan) in a NumPy array X. np.isnan(X) is out of the question, since it builds a boolean array of shape X.shape, which is potentially gigantic.
I tried np.nan in X, but that seems not to work because np.nan != np.nan. Is there a fast and memory-efficient way to do this at all?
(To those who would ask "how gigantic": I can't tell. This is input validation for library code.)
Ray's solution is good. However, on my machine it is about 2.5x faster to use numpy.sum in place of numpy.min:
In [13]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 244 us per loop
In [14]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 97.3 us per loop
Unlike min, sum doesn't require branching, which on modern hardware tends to be pretty expensive. This is probably the reason why sum is faster.
Edit: The above test was performed with a single NaN right in the middle of the array.
It is interesting to note that min is slower in the presence of NaNs than in their absence. It also seems to get slower as NaNs get closer to the start of the array. On the other hand, sum's throughput seems constant regardless of whether there are NaNs and where they're located:
In [40]: x = np.random.rand(100000)
In [41]: %timeit np.isnan(np.min(x))
10000 loops, best of 3: 153 us per loop
In [42]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
In [43]: x[50000] = np.nan
In [44]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 239 us per loop
In [45]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.8 us per loop
In [46]: x[0] = np.nan
In [47]: %timeit np.isnan(np.min(x))
1000 loops, best of 3: 326 us per loop
In [48]: %timeit np.isnan(np.sum(x))
10000 loops, best of 3: 95.9 us per loop
I think np.isnan(np.min(X)) should do what you want.
There are two general approaches here:
Check each array item for nan and take any.
Apply some cumulative operation that preserves nans (like sum) and check its result.
While the first approach is certainly the cleanest, the heavy optimization of some of the cumulative operations (particularly the ones executed in BLAS, like dot) can make them quite fast. Note that dot, like some other BLAS operations, is multithreaded under certain conditions, which explains the difference in speed between machines.
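The two approaches can be put side by side in a few lines. The function names below are hypothetical. One behaviour worth knowing about the reduction-based check: a sum over an array containing both +inf and -inf is itself NaN, so the check reports a NaN that is not actually present in the data.

```python
import numpy as np

def has_nan_elementwise(x):
    """Approach 1: test every item, then reduce with any()."""
    return bool(np.isnan(x).any())

def has_nan_by_sum(x):
    """Approach 2: reduce first, then test the single result."""
    with np.errstate(invalid="ignore"):   # silence the inf - inf warning
        return bool(np.isnan(np.sum(x)))

clean = np.array([1.0, 2.0, 3.0])
dirty = np.array([1.0, np.nan, 3.0])
mixed_inf = np.array([np.inf, -np.inf])   # no NaN stored, but the sum is NaN
```

For input validation where infinities are possible, that difference may decide which approach is acceptable.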
import numpy as np
import perfplot
def min(a):
    return np.isnan(np.min(a))
def sum(a):
    return np.isnan(np.sum(a))
def dot(a):
    return np.isnan(np.dot(a, a))
def any(a):
    return np.any(np.isnan(a))
def einsum(a):
    return np.isnan(np.einsum("i->", a))
b = perfplot.bench(
    setup=np.random.rand,
    kernels=[min, sum, dot, any, einsum],
    n_range=[2 ** k for k in range(25)],
    xlabel="len(a)",
)
b.save("out.png")
b.show()
Even though there is an accepted answer, I'd like to demonstrate the following (with Python 2.7.2 and Numpy 1.6.0 on Vista):
In []: x= rand(1e5)
In []: %timeit isnan(x.min())
10000 loops, best of 3: 200 us per loop
In []: %timeit isnan(x.sum())
10000 loops, best of 3: 169 us per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 134 us per loop
In []: x[5e4]= NaN
In []: %timeit isnan(x.min())
100 loops, best of 3: 4.47 ms per loop
In []: %timeit isnan(x.sum())
100 loops, best of 3: 6.44 ms per loop
In []: %timeit isnan(dot(x, x))
10000 loops, best of 3: 138 us per loop
Thus, the most efficient way may depend heavily on the operating system. In any case, the dot-based check seems to be the most stable one.
If you're comfortable with numba, it lets you create a fast short-circuiting function (one that stops as soon as a NaN is found):
import numba as nb
import math
@nb.njit
def anynan(array):
    array = array.ravel()
    for i in range(array.size):
        if math.isnan(array[i]):
            return True
    return False
If there is no NaN, the function might actually be slower than np.min. I think that's because np.min uses multiprocessing for large arrays:
import numpy as np
array = np.random.random(2000000)
%timeit anynan(array) # 100 loops, best of 3: 2.21 ms per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.45 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.64 ms per loop
But if there is a NaN in the array, especially if its position is at a low index, then it's much faster:
array = np.random.random(2000000)
array[100] = np.nan
%timeit anynan(array) # 1000000 loops, best of 3: 1.93 µs per loop
%timeit np.isnan(array.sum()) # 100 loops, best of 3: 4.57 ms per loop
%timeit np.isnan(array.min()) # 1000 loops, best of 3: 1.65 ms per loop
Similar results may be achieved with Cython or a C extension. These are a bit more complicated (or easily available as bottleneck.anynan) but ultimately do the same as my anynan function.
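Without numba or a compiled extension, an early exit can be approximated in plain NumPy by scanning the array in blocks. This is only a sketch of the idea: the function name anynan_chunked and the default chunk_size are assumptions, and no timing claims are made for it.

```python
import numpy as np

def anynan_chunked(array, chunk_size=65536):
    """Return True as soon as a block containing NaN is found."""
    # reshape(-1) gives a flat view for contiguous input (a copy otherwise).
    flat = np.asarray(array).reshape(-1)
    for start in range(0, flat.size, chunk_size):
        # The temporary boolean array never exceeds chunk_size elements.
        if np.isnan(flat[start:start + chunk_size]).any():
            return True
    return False

clean = np.linspace(0.0, 1.0, 200000)
early = clean.copy()
early[100] = np.nan          # found in the first block
late = clean.copy()
late[-1] = np.nan            # found only in the last block
```

The memory used for the intermediate mask is bounded by chunk_size rather than by the size of the input.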
Use .any():
if numpy.isnan(myarray).any():
    ...
numpy.isfinite may be better than isnan for checking, since it also catches infinities:
if not np.isfinite(prop).all():
    ...
Related to this is the question of how to find the first occurrence of NaN. This is the fastest way to handle that that I know of:
index = next((i for (i,n) in enumerate(iterable) if n!=n), None)
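The generator relies on the fact that NaN is the only value for which n != n holds. For NumPy arrays, a vectorized variant is also possible, sketched below with a hypothetical helper name first_nan_index. Unlike the generator, it examines the whole array rather than stopping at the first hit.

```python
import numpy as np

def first_nan_index(values):
    """Index of the first NaN in a flattened array, or None if there is none."""
    mask = np.isnan(np.asarray(values, dtype=float).ravel())
    if mask.size == 0:
        return None
    idx = int(mask.argmax())         # position of the first True, or 0 if none
    return idx if mask[idx] else None

data = [1.0, 2.0, float("nan"), 4.0, float("nan")]

vectorized_result = first_nan_index(data)
generator_result = next((i for i, n in enumerate(data) if n != n), None)
```

Both return the same index; which one is preferable depends on where the first NaN tends to occur and on the size of the input.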
Adding to @nico-schlömer's and @mseifert's answers, I measured the performance of a numba function, has_nan, that stops early, compared to some of the functions that parse the full array.
On my machine, for an array without nans, the break-even happens for ~10^4 elements.
import perfplot
import numpy as np
import numba
import math
def min(a):
    return np.isnan(np.min(a))
def dot(a):
    return np.isnan(np.dot(a, a))
def einsum(a):
    return np.isnan(np.einsum("i->", a))
@numba.njit
def has_nan(a):
    # Iterate over every element, including the last one
    for i in range(a.size):
        if math.isnan(a[i]):
            return True
    return False
def array_with_missing_values(n, p):
    """Return an array of size n in which up to p percent of the entries are NaN.
    Example: n=1e6, p=1 assigns 1e4 NaNs at random positions."""
    a = np.random.rand(n)
    # Random positions may repeat, so the NaN count can be slightly lower
    idx = np.random.randint(0, len(a), int(p*len(a)/100))
    a[idx] = np.nan
    return a
#%%
perfplot.show(
setup=lambda n: array_with_missing_values(n, 0),
kernels=[min, dot, has_nan],
n_range=[2 ** k for k in range(20)],
logx=True,
logy=True,
xlabel="len(a)",
)
What happens if the array has NaNs? I investigated the impact of the NaN coverage of the array.
For arrays of length 1,000,000, has_nan becomes the better option if there are ~10^-3 % NaNs (so ~10 NaNs) in the array.
#%%
N = 1000000 # 100000
perfplot.show(
setup=lambda p: array_with_missing_values(N, p),
kernels=[min, dot, has_nan],
n_range=np.array([2 ** k for k in range(20)]) / 2**20 * 0.01,
logy=True,
xlabel=f"% of nan in array (N = {N})",
)
If in your application most arrays contain NaN and you're looking for the ones without, then has_nan is the best approach.
Otherwise, dot seems to be the best option.
