I have two numpy arrays NS, EW to sum up. Each of them has missing values at different positions, like
NS = array([[ 1., 2., nan],
[ 4., 5., nan],
[ 6., nan, nan]])
EW = array([[ 1., 2., nan],
[ 4., nan, nan],
[ 6., nan, 9.]])
How can I perform the summation in the NumPy way, so that nan is treated as zero when only one array has nan at a location, and nan is kept when both arrays have nan at the same location?
The result I expect to see is
SUM = array([[ 2., 4., nan],
[ 8., 5., nan],
[ 12., nan, 9.]])
When I try
SUM=np.add(NS,EW)
it gives me
SUM=array([[ 2., 4., nan],
[ 8., nan, nan],
[ 12., nan, nan]])
When I try
SUM = np.nansum(np.dstack((NS,EW)),2)
it gives me
SUM=array([[ 2., 4., 0.],
[ 8., 5., 0.],
[ 12., 0., 9.]])
Of course, I can reach my goal with an element-wise loop,
SUM = np.empty_like(NS)
for i in range(np.size(NS,0)):
    for j in range(np.size(NS,1)):
        if np.isnan(NS[i,j]) and np.isnan(EW[i,j]):
            SUM[i,j] = np.nan
        elif np.isnan(NS[i,j]):
            SUM[i,j] = EW[i,j]
        elif np.isnan(EW[i,j]):
            SUM[i,j] = NS[i,j]
        else:
            SUM[i,j] = NS[i,j]+EW[i,j]
but it is very slow. So I'm looking for a more idiomatic NumPy solution to this problem.
Thanks for help in advance!
Approach #1 : One approach with np.where -
def sum_nan_arrays(a,b):
    ma = np.isnan(a)
    mb = np.isnan(b)
    return np.where(ma&mb, np.nan, np.where(ma,0,a) + np.where(mb,0,b))
Sample run -
In [43]: NS
Out[43]:
array([[ 1., 2., nan],
[ 4., 5., nan],
[ 6., nan, nan]])
In [44]: EW
Out[44]:
array([[ 1., 2., nan],
[ 4., nan, nan],
[ 6., nan, 9.]])
In [45]: sum_nan_arrays(NS, EW)
Out[45]:
array([[ 2., 4., nan],
[ 8., 5., nan],
[ 12., nan, 9.]])
Approach #2 : A mix of boolean indexing, which is faster when NaNs are sparse (see the timings below) -
def sum_nan_arrays_v2(a,b):
    ma = np.isnan(a)
    mb = np.isnan(b)
    m_keep_a = ~ma & mb
    m_keep_b = ma & ~mb
    out = a + b
    out[m_keep_a] = a[m_keep_a]
    out[m_keep_b] = b[m_keep_b]
    return out
Runtime test -
In [140]: # Setup input arrays with 4/9 ratio of NaNs (same as in the question)
...: a = np.random.rand(3000,3000)
...: b = np.random.rand(3000,3000)
...: a.ravel()[np.random.choice(range(a.size), size=4000000, replace=0)] = np.nan
...: b.ravel()[np.random.choice(range(b.size), size=4000000, replace=0)] = np.nan
...:
In [141]: np.nanmax(np.abs(sum_nan_arrays(a, b) - sum_nan_arrays_v2(a, b))) # Verify
Out[141]: 0.0
In [142]: %timeit sum_nan_arrays(a, b)
10 loops, best of 3: 141 ms per loop
In [143]: %timeit sum_nan_arrays_v2(a, b)
10 loops, best of 3: 177 ms per loop
In [144]: # Setup input arrays with fewer NaNs
...: a = np.random.rand(3000,3000)
...: b = np.random.rand(3000,3000)
...: a.ravel()[np.random.choice(range(a.size), size=4000, replace=0)] = np.nan
...: b.ravel()[np.random.choice(range(b.size), size=4000, replace=0)] = np.nan
...:
In [145]: np.nanmax(np.abs(sum_nan_arrays(a, b) - sum_nan_arrays_v2(a, b))) # Verify
Out[145]: 0.0
In [146]: %timeit sum_nan_arrays(a, b)
10 loops, best of 3: 69.6 ms per loop
In [147]: %timeit sum_nan_arrays_v2(a, b)
10 loops, best of 3: 38 ms per loop
Actually, your nansum approach almost worked; you just need to put the NaNs back in:
def add_ignore_nans(a, b):
    stacked = np.array([a, b])
    res = np.nansum(stacked, axis=0)
    res[np.all(np.isnan(stacked), axis=0)] = np.nan
    return res
>>> add_ignore_nans(a, b)
array([[ 2., 4., nan],
[ 8., 5., nan],
[ 12., nan, 9.]])
This will be slower than @Divakar's answer, but I wanted to mention that you were pretty close already! :-)
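The same "sum, then restore the NaNs" idea can also be sketched with a masked array, which is a different technique from the nansum version. This is only a minimal sketch: the function name add_ignore_nans_ma is made up for illustration, and it relies on a masked sum along an axis staying masked only where every input was masked.

```python
import numpy as np

NS = np.array([[1., 2., np.nan],
               [4., 5., np.nan],
               [6., np.nan, np.nan]])
EW = np.array([[1., 2., np.nan],
               [4., np.nan, np.nan],
               [6., np.nan, 9.]])

def add_ignore_nans_ma(a, b):
    # Hypothetical helper name. Mask the NaNs, then sum along the stacking
    # axis; a result cell remains masked only if both inputs were masked there.
    stacked = np.ma.masked_invalid(np.stack([a, b]))
    # Turn the remaining masked cells back into NaN.
    return stacked.sum(axis=0).filled(np.nan)

SUM = add_ignore_nans_ma(NS, EW)
print(SUM)
```

No speed claim is made for this variant; it is mainly a compact way to express the rule.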
I think we can get a bit more concise, in the same vein as Divakar's second approach. With a = NS and b = EW:
na = numpy.isnan(a)
nb = numpy.isnan(b)
a[na] = 0
b[nb] = 0
a += b
na &= nb
a[na] = numpy.nan
The operations are done in-place where possible to save memory, assuming this is feasible in your scenario. The final result is in a.
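If overwriting the inputs is not acceptable, the same steps can be wrapped in a function that works on copies. This is a sketch under that assumption; the name add_nan_as_zero is hypothetical and the arrays are the ones from the question.

```python
import numpy as np

def add_nan_as_zero(a, b):
    """Sum a and b, treating NaN as 0 unless both inputs are NaN there."""
    # np.array(...) copies by default, so the caller's arrays stay untouched.
    a = np.array(a, dtype=float)
    b = np.array(b, dtype=float)
    na = np.isnan(a)
    nb = np.isnan(b)
    a[na] = 0
    b[nb] = 0
    a += b
    # Restore NaN only where both inputs were NaN.
    a[na & nb] = np.nan
    return a

NS = np.array([[1., 2., np.nan], [4., 5., np.nan], [6., np.nan, np.nan]])
EW = np.array([[1., 2., np.nan], [4., np.nan, np.nan], [6., np.nan, 9.]])
SUM = add_nan_as_zero(NS, EW)
print(SUM)
```

The trade-off is two extra array copies in exchange for leaving NS and EW unchanged.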
I would like to apply a sort operation, row by row, keeping only values above a given threshold.
For this, I see I can use a masked array to apply the threshold.
However, argsort still considers the masked values (those below the threshold) and replaces them with a fill_value.
I simply don't want any result for a value that has been masked out (replaced with a NaN).
a = np.array([[0.522235,0.128270,0.708973],
[0.994557,0.844426,0.366608],
[0.986669,0.143659,0.395891],
[0.291339,0.421843,0.278869],
[0.250303,0.861475,0.904534],
[0.973436,0.360466,0.751913]])
threshold = 0.5
m_a = np.ma.masked_less_equal(a, threshold)
argsorted = m_a.argsort(-1)
This gives me:
array([[0, 2, 1],
[1, 0, 2],
[0, 1, 2],
[0, 1, 2],
[1, 2, 0],
[2, 0, 1]])
But I would like to get:
array([[0, NaN, 1],
[1, 0, NaN],
[0, NaN, NaN],
[NaN, NaN, NaN],
[NaN, 0, 1],
[ 1, NaN, 0]])
Any idea to get to this result?
Thanks for your help!
Bests,
We can add one more argsort for an easier way to get to our desired output -
sidx = argsorted.argsort(1)
mask = sidx >= (a.shape[1]-m_a.mask.sum(1,keepdims=True))
out = np.where(mask,np.nan,sidx)
We can also start from scratch to avoid masked-arrays -
def thresholded_argsort(a, threshold):
    m = a<threshold
    ac = a.copy()
    ac[m] = ac.max()+1
    sidx = ac.argsort(1).argsort(1)
    mask = sidx>=(ac.shape[1]-m.sum(1,keepdims=True))
    return np.where(mask,np.nan,sidx)
Sample run -
In [46]: a
Out[46]:
array([[0.522235, 0.12827 , 0.708973],
[0.994557, 0.844426, 0.366608],
[0.986669, 0.143659, 0.395891],
[0.291339, 0.421843, 0.278869],
[0.250303, 0.861475, 0.904534],
[0.973436, 0.360466, 0.751913]])
In [47]: thresholded_argsort(a, threshold=0.5)
Out[47]:
array([[ 0., nan, 1.],
[ 1., 0., nan],
[ 0., nan, nan],
[nan, nan, nan],
[nan, 0., 1.],
[ 1., nan, 0.]])
Note : For performance, the additional argsort can be replaced by an array assignment (the argsort_unique idea). For 2D arrays along the second axis, it would be -
def argsort_unique2D(idx):
    m,n = idx.shape
    idx_out = np.empty((m,n),dtype=int)
    np.put_along_axis(idx_out, idx, np.arange(n), axis=1)
    return idx_out
So, in the solutions posted earlier, argsorted.argsort(1) could be replaced by argsort_unique2D(argsorted), and ac.argsort(1).argsort(1) by argsort_unique2D(ac.argsort(1)).
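To make the substitution concrete, here is a sketch of the from-scratch version with the second argsort swapped out. The name thresholded_argsort_v2 is made up for illustration; the data and threshold are the ones from the question.

```python
import numpy as np

def argsort_unique2D(idx):
    # Invert each row's permutation by scattering 0..n-1 into place.
    m, n = idx.shape
    idx_out = np.empty((m, n), dtype=int)
    np.put_along_axis(idx_out, idx, np.arange(n), axis=1)
    return idx_out

def thresholded_argsort_v2(a, threshold):
    # Hypothetical variant of thresholded_argsort using a single argsort.
    m = a < threshold
    ac = a.copy()
    ac[m] = ac.max() + 1                   # push below-threshold values to the end
    ranks = argsort_unique2D(ac.argsort(1))
    mask = ranks >= (ac.shape[1] - m.sum(1, keepdims=True))
    return np.where(mask, np.nan, ranks)

a = np.array([[0.522235, 0.128270, 0.708973],
              [0.994557, 0.844426, 0.366608],
              [0.986669, 0.143659, 0.395891],
              [0.291339, 0.421843, 0.278869],
              [0.250303, 0.861475, 0.904534],
              [0.973436, 0.360466, 0.751913]])
out = thresholded_argsort_v2(a, threshold=0.5)
print(out)
```

Because ac.argsort(1) is a permutation of each row's column indices, inverting it gives the same ranks as sorting it a second time.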
If I understand correctly, you don't want to consider NaN for the sorting. In that case, I am not sure about the logic behind your expected result. You can try the following code. I believe this is what you are looking for:-
import numpy as np
a = np.array([[0.522235,0.128270,0.708973],
[0.994557,0.844426,0.366608],
[0.986669,0.143659,0.395891],
[0.291339,0.421843,0.278869],
[0.250303,0.861475,0.904534],
[0.973436,0.360466,0.751913]])
threshold = 0.5
m_a = np.ma.masked_less_equal(a, threshold).filled(np.nan)
result = np.where(
np.isnan(m_a),
np.nan, m_a.argsort(-1)
)
result
It should give you the following result :-
array([[ 0., nan, 1.],
[ 1., 0., nan],
[ 0., nan, nan],
[nan, nan, nan],
[nan, 2., 0.],
[ 2., nan, 1.]])
Hope this helps!!
a = np.array([[0.522235,0.128270,0.708973],
[0.994557,0.844426,0.366608],
[0.986669,0.143659,0.395891],
[0.291339,0.421843,0.278869],
[0.250303,0.861475,0.904534],
[0.973436,0.360466,0.751913]])
threshold = .5
def tri(ligne):
    s = sorted(ligne, key=lambda x: x < threshold and float('inf') or x)
    nv_liste = [s.index(v) for v in ligne]
    for i in range(len(ligne)):
        if ligne[i] < threshold:
            nv_liste[i] = np.nan
    return nv_liste
np.apply_along_axis(tri, 1, a)
gives you:
array([[ 0., nan, 1.],
[ 1., 0., nan],
[ 0., nan, nan],
[nan, nan, nan],
[nan, 0., 1.],
[ 1., nan, 0.]])
I have a large numpy 1d array which contains nans. I need to know all the slices that do not contain any nans:
import numpy as np
A=np.array([1.0,2.0,3.0,np.nan,4.0,3.0,np.nan,np.nan,np.nan,2.0,2.0,2.0])
The expected result for the example would be:
Slices=[slice(0,3),slice(4,6),slice(9,12)]
One approach to get such a list of slices, keeping the work done inside the list comprehension to a minimum -
def start_stop_nonNaN_slices(A):
    mask = ~np.isnan(A)
    mask_ext = np.r_[False, mask, False]
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1]).reshape(-1,2)
    return [slice(i[0],i[1]) for i in idx]
Sample runs -
In [32]: A
Out[32]:
array([ 1., 2., 3., nan, 4., 3., nan, nan, nan, 2., 2.,
2.])
In [33]: start_stop_nonNaN_slices(A)
Out[33]: [slice(0, 3, None), slice(4, 6, None), slice(9, 12, None)]
In [35]: A
Out[35]:
array([ nan, 1., 2., 3., nan, 4., 3., nan, nan, nan, 2.,
2., 2.])
In [36]: start_stop_nonNaN_slices(A)
Out[36]: [slice(1, 4, None), slice(5, 7, None), slice(10, 13, None)]
Output in different formats
I. If you need those start, stop indices as a list of (start, stop) tuples -
def start_stop_nonNaN_slices_v2(A):
    mask = ~np.isnan(A)
    mask_ext = np.r_[False, mask, False]
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    return list(zip(idx[::2], idx[1::2]))  # list() so Python 3 returns a list, not an iterator
Sample run -
In [51]: A
Out[51]:
array([ nan, 1., 2., 3., nan, 4., 3., nan, nan, nan, 2.,
2., 2., nan, nan])
In [52]: start_stop_nonNaN_slices_v2(A)
Out[52]: [(1, 4), (5, 7), (10, 13)]
II. If you are okay with the start and stop indices as two output arrays, this should be pretty efficient, as it avoids any list comprehension or zipping -
def start_stop_nonNaN_slices_v3(A):
    mask = ~np.isnan(A)
    mask_ext = np.r_[False, mask, False]
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    return idx[::2], idx[1::2]
Sample run -
In [74]: A
Out[74]:
array([ nan, 1., 2., 3., nan, 4., 3., nan, nan, nan, 2.,
2., 2., nan, nan])
In [75]: starts, stops = start_stop_nonNaN_slices_v3(A)
In [76]: starts
Out[76]: array([ 1, 5, 10])
In [77]: stops
Out[77]: array([ 4, 7, 13])
Note on performance : For performance, we could use np.concatenate to replace np.r_ :
mask_ext = np.concatenate(( [False], mask, [False] ))
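Put together, a variant using np.concatenate could look like the sketch below. The function name nonnan_slices is hypothetical, and the last line only illustrates one way the returned slices might be used.

```python
import numpy as np

def nonnan_slices(A):
    # Hypothetical name; same logic as above with np.concatenate for padding.
    mask = ~np.isnan(A)
    mask_ext = np.concatenate(([False], mask, [False]))
    # Positions where the padded mask flips mark run starts and run stops.
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1]).reshape(-1, 2)
    return [slice(int(start), int(stop)) for start, stop in idx]

A = np.array([1.0, 2.0, 3.0, np.nan, 4.0, 3.0,
              np.nan, np.nan, np.nan, 2.0, 2.0, 2.0])
slices = nonnan_slices(A)
segments = [A[s] for s in slices]   # the NaN-free pieces of A
print(slices)
```

Padding with False on both sides guarantees that every run has both a start and a stop, so the flip positions always come in pairs.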
Here is a possibility:
import numpy as np
def valid_slices(array):
    m = ~np.isnan(array)
    idx = np.arange(len(array))[m]
    if len(idx) == 0:  # every value is NaN, so there is nothing to slice
        return []
    idx_diff = np.diff(idx)
    idx_change = np.where(idx_diff > 1)[0]
    idx_start = np.concatenate([[0], idx_change + 1], axis=0)
    idx_end = np.concatenate([idx_change, [len(idx) - 1]], axis=0)
    return [slice(idx[start], idx[end] + 1) for start, end in zip(idx_start, idx_end)]
A = np.array([1.0,2.0,3.0,np.nan,4.0,3.0,np.nan,np.nan,np.nan,2.0,2.0,2.0])
print(valid_slices(A))
# [slice(0, 3, None), slice(4, 6, None), slice(9, 12, None)]
aa = np.array([2.0, np.nan])
aa[aa>1.0] = np.nan
Running the code above gives me the following warning. I understand the reason for it, but how can I avoid it?
RuntimeWarning: invalid value encountered in greater
Store the indices of the valid (non-NaN) elements. Index the array with them and run the comparison on that subset only, which gives a mask over the valid elements. Indexing the stored indices with that mask maps back to positions in the original array, and those positions can then be set to NaN.
Thus, an implementation/solution would be -
idx = np.flatnonzero(~np.isnan(aa))
aa[idx[aa[idx] > 1.0]] = np.nan
Sample run -
In [106]: aa # Input array with NaNs
Out[106]: array([ 0., 3., nan, 0., 9., 6., 6., nan, 18., 6.])
In [107]: idx = np.flatnonzero(~np.isnan(aa)) # Store valid indices
In [108]: idx
Out[108]: array([0, 1, 3, 4, 5, 6, 8, 9])
In [109]: aa[idx[aa[idx] > 1.0]] = np.nan # Do the assignment
In [110]: aa # Verify
Out[110]: array([ 0., nan, nan, 0., nan, nan, nan, nan, nan, nan])
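A different technique, shown here only as a sketch, is to leave the NaNs in the comparison and silence the warning locally with the np.errstate context manager. It suppresses the warning rather than avoiding the NaN comparison. The sample values below are made up.

```python
import numpy as np

aa = np.array([2.0, np.nan, 0.5, 7.0])

# Ignore "invalid value" floating-point warnings only inside this block.
with np.errstate(invalid="ignore"):
    mask = aa > 1.0      # NaN > 1.0 evaluates to False

aa[mask] = np.nan
print(aa)
```

Since comparing NaN with anything yields False, the existing NaN entries are simply left as they are.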
I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is I know in what weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values will align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if it is possible to combine them, e.g. using some method that takes an offset argument in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (let's call it c) looks like this:
print(c)
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What's more, since the time series are enormous, I must stream them through my program, and therefore cannot know all offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
Update:
This seems to work
def offset_stack(a, b, offset=0):
    if offset < 0:
        a = np.insert(a, [0] * abs(offset), np.nan)
        b = np.append(b, [np.nan] * abs(offset))
    if offset > 0:
        a = np.append(a, [np.nan] * abs(offset))
        b = np.insert(b, [0] * abs(offset), np.nan)
    return np.concatenate(([a],[b]), axis=0)
You can do in numpy:
def f(a, b, n):
    v = np.full(abs(n), np.nan)  # padding block of |n| NaNs
    if np.sign(n)==-1:
        return np.vstack((np.append(a,v), np.append(v,b)))
    elif np.sign(n)==1:
        return np.vstack((np.append(v,a), np.append(b,v)))
    else:
        return np.vstack((a,b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])
There is a real simple way to accomplish this.
You basically want to pad and then stack your arrays and for both there are numpy functions:
numpy.lib.pad() aka offset
a = np.array([[1,2,3,4,5,6]], dtype=np.float64) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=np.float64)
from numpy.lib import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0., 1., 2., 3., 4., 5., nan]]
The ((0,0),(1,0)) means no padding along the first axis (top/bottom), one element of padding on the left and none on the right. So you have to tweak these values if you want more or less shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining the two is straightforward and easy to extend:
from numpy.lib import pad
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
    if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
        raise ValueError('Offsets and dimensions of the arrays do not match.')
    offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
    offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
    a_padded = pad(a, offset1, mode='constant', constant_values=np.nan)
    b_padded = pad(b, offset2, mode='constant', constant_values=np.nan)
    return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitrary dimensions and can stack along an arbitrary axis. It doesn't behave like the original, though: offsets takes one entry per dimension, so padding the second dimension means passing offsets=(0, 1), while an offset given in the first position pads the first dimension. As long as you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=np.float64)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=np.float64)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])
pad and concatenate (and the various stack and inserts) create a target array of the right size, and fill values from the input arrays. So we can do the same, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
Create the target array and fill it with the pad value. np.nan is a float, so the target must be float even though a is int:
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])
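One possible generalization to several series is sketched below. It assumes each series is 1-D and that its starting column (for example its first week number) is known when it is placed; the names stack_with_offsets and starts are hypothetical.

```python
import numpy as np

def stack_with_offsets(series, starts):
    """Place each 1-D series in its own row, beginning at its start column."""
    # Shift the starts so the earliest series begins at column 0.
    first = min(starts)
    starts = [s - first for s in starts]
    width = max(s + len(x) for s, x in zip(starts, series))
    # Pre-fill the float target with the pad value.
    out = np.full((len(series), width), np.nan)
    for row, (s, x) in enumerate(zip(starts, series)):
        out[row, s:s + len(x)] = x
    return out

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([0, 1, 2, 3, 4, 5])
c = stack_with_offsets([a, b], starts=[1, 0])
print(c)
```

Each series is copied exactly once into the pre-filled target, which is the same fill-then-copy idea as above applied row by row.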