I would like to sort each row independently, keeping only the values above a given threshold.
I see that I can use a masked array to apply the threshold.
However, argsort still considers the masked values (those below the threshold) and replaces them with a fill_value.
What I want instead is no result at all (NaN) at the positions where the value was masked.
a = np.array([[0.522235,0.128270,0.708973],
[0.994557,0.844426,0.366608],
[0.986669,0.143659,0.395891],
[0.291339,0.421843,0.278869],
[0.250303,0.861475,0.904534],
[0.973436,0.360466,0.751913]])
threshold = 0.5
m_a = np.ma.masked_less_equal(a, threshold)
argsorted = m_a.argsort(-1)
This gives me:
array([[0, 2, 1],
[1, 0, 2],
[0, 1, 2],
[0, 1, 2],
[1, 2, 0],
[2, 0, 1]])
But I would like to get:
array([[0, NaN, 1],
[1, 0, NaN],
[0, NaN, NaN],
[NaN, NaN, NaN],
[NaN, 0, 1],
[ 1, NaN, 0]])
Any idea to get to this result?
Thanks for your help!
Bests,
We can add one more argsort for an easier way to get to our desired output -
sidx = argsorted.argsort(1)
mask = sidx >= (a.shape[1]-m_a.mask.sum(1,keepdims=True))
out = np.where(mask,np.nan,sidx)
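For reference, here is a rough end-to-end sketch of the three lines above, run on the question's sample data. It assumes the masked argsort keeps its default of sorting masked entries last, and it swaps m_a.mask for np.ma.getmaskarray(m_a), which always returns a full boolean array even when nothing is masked. The names ranks and n_masked are mine, not from the answer.

```python
import numpy as np

a = np.array([[0.522235, 0.128270, 0.708973],
              [0.994557, 0.844426, 0.366608],
              [0.986669, 0.143659, 0.395891],
              [0.291339, 0.421843, 0.278869],
              [0.250303, 0.861475, 0.904534],
              [0.973436, 0.360466, 0.751913]])
threshold = 0.5

m_a = np.ma.masked_less_equal(a, threshold)
# Masked entries are placed at the end of each row by the masked argsort.
argsorted = m_a.argsort(axis=-1)
# The argsort of an argsort gives each position's rank within its row.
ranks = argsorted.argsort(axis=1)
# Count masked entries per row; getmaskarray never returns a scalar mask.
n_masked = np.ma.getmaskarray(m_a).sum(axis=1, keepdims=True)
# Ranks at or beyond the number of kept values belong to masked entries.
out = np.where(ranks >= a.shape[1] - n_masked, np.nan, ranks)
print(out)
```

The result is a float array, since NaN cannot be stored in an integer array.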
We can also start from scratch to avoid masked-arrays -
def thresholded_argsort(a, threshold):
    m = a<threshold
    ac = a.copy()
    ac[m] = ac.max()+1
    sidx = ac.argsort(1).argsort(1)
    mask = sidx>=(ac.shape[1]-m.sum(1,keepdims=True))
    return np.where(mask,np.nan,sidx)
Sample run -
In [46]: a
Out[46]:
array([[0.522235, 0.12827 , 0.708973],
[0.994557, 0.844426, 0.366608],
[0.986669, 0.143659, 0.395891],
[0.291339, 0.421843, 0.278869],
[0.250303, 0.861475, 0.904534],
[0.973436, 0.360466, 0.751913]])
In [47]: thresholded_argsort(a, threshold=0.5)
Out[47]:
array([[ 0., nan, 1.],
[ 1., 0., nan],
[ 0., nan, nan],
[nan, nan, nan],
[nan, 0., 1.],
[ 1., nan, 0.]])
Note : We can avoid the additional argsort with array-assignment for performance using argsort_unique. So, for 2D arrays along second axis, it would be -
def argsort_unique2D(idx):
    m,n = idx.shape
    idx_out = np.empty((m,n),dtype=int)
    np.put_along_axis(idx_out, idx, np.arange(n), axis=1)
    return idx_out
So, argsorted.argsort(1) could be replaced by argsort_unique2D(argsorted), while ac.argsort(1).argsort(1) with argsort_unique2D(ac.argsort(1)) in the earlier posted solutions.
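A quick way to convince yourself of that replacement is to compare both on random data. This is only a sanity-check sketch: the random input and the variable names are made up, and it relies on each row of idx being a permutation of 0..n-1, which is what argsort returns.

```python
import numpy as np

def argsort_unique2D(idx):
    # Invert each row's permutation by scattering 0..n-1 to the positions in idx.
    m, n = idx.shape
    idx_out = np.empty((m, n), dtype=int)
    np.put_along_axis(idx_out, idx, np.arange(n), axis=1)
    return idx_out

rng = np.random.default_rng(0)
x = rng.random((5, 7))           # arbitrary test data
idx = x.argsort(1)               # each row is a permutation of 0..6

same = np.array_equal(argsort_unique2D(idx), idx.argsort(1))
print(same)
```

The scatter does a single pass per row instead of a second sort, which is where the performance gain would come from.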
If I understand correctly, you don't want to consider NaN for the sorting. In that case, I am not sure about the logic behind your expected result. You can try the following code, which I believe is what you are looking for:
import numpy as np
a = np.array([[0.522235,0.128270,0.708973],
[0.994557,0.844426,0.366608],
[0.986669,0.143659,0.395891],
[0.291339,0.421843,0.278869],
[0.250303,0.861475,0.904534],
[0.973436,0.360466,0.751913]])
threshold = 0.5
m_a = np.ma.masked_less_equal(a, threshold).filled(np.nan)
result = np.where(
    np.isnan(m_a),
    np.nan, m_a.argsort(-1)
)
result
It should give you the following result :-
array([[ 0., nan, 1.],
[ 1., 0., nan],
[ 0., nan, nan],
[nan, nan, nan],
[nan, 2., 0.],
[ 2., nan, 1.]])
Hope this helps!!
a = np.array([[0.522235,0.128270,0.708973],
[0.994557,0.844426,0.366608],
[0.986669,0.143659,0.395891],
[0.291339,0.421843,0.278869],
[0.250303,0.861475,0.904534],
[0.973436,0.360466,0.751913]])
threshold = .5
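def tri(ligne):
    # Values below the threshold are treated as +inf so they sort last.
    s = sorted(ligne, key=lambda x: float('inf') if x < threshold else x)
    # Rank of each original value within the sorted row.
    nv_liste = [s.index(v) for v in ligne]
    for i in range(len(ligne)):
        if ligne[i] < threshold:
            nv_liste[i] = np.nan
    return nv_liste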
np.apply_along_axis(tri, 1, a)
gives you:
array([[ 0., nan, 1.],
[ 1., 0., nan],
[ 0., nan, nan],
[nan, nan, nan],
[nan, 0., 1.],
[ 1., nan, 0.]])
I have an array:
a = array([[1,2,3,1], [2,5,3,1], [0,0,0,0], [5,3,2,5]])
I want to iterate through the array, based off the last item in each row (0 or 1). If the last item in each row is 0, I want to change the 3 other items (only) in the row to np.nan.
For example:
a = array([[1,2,3,1], [2,5,3,1], [nan, nan, nan, 0], [5,3,2,5]])
I can do this using a for loop. i.e.:
for frames in range(len(a)):
    if a[frames][3] == 0:
        a[frames][0:3] = np.nan
Is there a more efficient way to do this using a list comprehension? So far this is all I've come up with, but I feel that it could be far more efficient:
a = np.array([[np.nan, np.nan, np.nan] if frames[3] == 0 else frames[0:3] for frames in a])
The problem is that this creates an array of arrays and also crops the last column.
Thanks in advance!
IIUC, you don't need a list comprehension. Use boolean indexing on the rows (the array must be float, since NaN is a float):
>>> a = np.array([[1,2,3,1], [2,5,3,1], [0,0,0,0], [5,3,2,5]], dtype=float)
>>> a[a[:,-1] == 0, 0:3] = np.nan
>>> a
array([[ 1., 2., 3., 1.],
[ 2., 5., 3., 1.],
[ nan, nan, nan, 0.],
[ 5., 3., 2., 5.]])
If you have a dict of these arrays, just index each of them
for a in data.values():
    a[a[:,-1] == 0, 0:3] = np.nan
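To make the dict case concrete, here is a small sketch. The dict data, its keys and its contents are invented for illustration; the only assumptions are that every array is 2D, has float dtype, and carries the flag in its last column.

```python
import numpy as np

# Hypothetical container: keys and values are made up for this example.
data = {
    "first": np.array([[1, 2, 3, 1], [0, 0, 0, 0]], dtype=float),
    "second": np.array([[4, 4, 4, 0], [5, 3, 2, 5]], dtype=float),
}

for arr in data.values():
    flagged = arr[:, -1] == 0     # boolean mask over rows whose last item is 0
    arr[flagged, :-1] = np.nan    # overwrite every column except the last

print(data["first"])
print(data["second"])
```

The assignment happens in place, so the arrays stored in the dict are modified directly and nothing needs to be reassigned.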
For example, if I have the 2D array as follows.
[[1,2,3,NAN],
[4,5,NAN,NAN],
[6,NAN,NAN,NAN]
]
The desired result is
[[1,2,3],
[4,5],
[6]
]
How should I transform?
I find that using
x = x[~numpy.isnan(x)] only produces [1,2,3,4,5,6], which has been flattened into a one-dimensional array.
Thanks!
Just apply that isnan on a row by row basis
In [135]: [row[~np.isnan(row)] for row in arr]
Out[135]: [array([1., 2., 3.]), array([4., 5.]), array([6.])]
Boolean masking as in x[~numpy.isnan(x)] produces a flattened result because, in general, the result will be ragged like this, and can't be formed into a 2d array.
The source array must be float dtype - because np.nan is a float:
In [138]: arr = np.array([[1,2,3,np.nan],[4,5,np.nan,np.nan],[6,np.nan,np.nan,np.nan]])
In [139]: arr
Out[139]:
array([[ 1., 2., 3., nan],
[ 4., 5., nan, nan],
[ 6., nan, nan, nan]])
If object dtype, the numbers can be integer, but np.isnan(arr) won't work.
If the original is a list, rather than an array:
In [146]: alist = [[1,2,3,np.nan],[4,5,np.nan,np.nan],[6,np.nan,np.nan,np.nan]]
In [147]: alist
Out[147]: [[1, 2, 3, nan], [4, 5, nan, nan], [6, nan, nan, nan]]
In [148]: [[i for i in row if ~np.isnan(i)] for row in alist]
Out[148]: [[1, 2, 3], [4, 5], [6]]
The flat array could be turned into a list of arrays with split:
In [152]: np.split(arr[~np.isnan(arr)],(3,5))
Out[152]: [array([1., 2., 3.]), array([4., 5.]), array([6.])]
where the (3,5) split parameter could be determined by counting the non-NaN values in each row, but that's more work and doesn't promise to be faster than the row iteration.
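If you do want to compute the split points rather than hard-code them, something along these lines should work. It is a sketch under the same assumption as above (a float 2D array); the names valid, counts and split_points are mine.

```python
import numpy as np

arr = np.array([[1, 2, 3, np.nan],
                [4, 5, np.nan, np.nan],
                [6, np.nan, np.nan, np.nan]])

valid = ~np.isnan(arr)
counts = valid.sum(axis=1)               # non-NaN values per row
split_points = np.cumsum(counts)[:-1]    # where each row ends in the flat array
rows = np.split(arr[valid], split_points)
print(split_points, rows)
```

Boolean indexing flattens in row-major order, so the cumulative counts line up with the row boundaries.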
I have a large numpy 1d array which contains nans. I need to know all the slices that do not contain any nans:
import numpy as np
A=np.array([1.0,2.0,3.0,np.nan,4.0,3.0,np.nan,np.nan,np.nan,2.0,2.0,2.0])
The expected result for the example would be:
Slices=[slice(0,3),slice(4,6),slice(9,12)]
One approach to get such a list of slices with the idea of performing minimum work in a list comprehension -
def start_stop_nonNaN_slices(A):
    mask = ~np.isnan(A)
    mask_ext = np.r_[False, mask, False]
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1]).reshape(-1,2)
    return [slice(i[0],i[1]) for i in idx]
Sample runs -
In [32]: A
Out[32]:
array([ 1., 2., 3., nan, 4., 3., nan, nan, nan, 2., 2.,
2.])
In [33]: start_stop_nonNaN_slices(A)
Out[33]: [slice(0, 3, None), slice(4, 6, None), slice(9, 12, None)]
In [35]: A
Out[35]:
array([ nan, 1., 2., 3., nan, 4., 3., nan, nan, nan, 2.,
2., 2.])
In [36]: start_stop_nonNaN_slices(A)
Out[36]: [slice(1, 4, None), slice(5, 7, None), slice(10, 13, None)]
Output in different formats
I. If you need those start, stop indices as pairs of tuples -
def start_stop_nonNaN_slices_v2(A):
    mask = ~np.isnan(A)
    mask_ext = np.r_[False, mask, False]
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    # On Python 3, zip returns an iterator, so wrap it in list() to get the pairs.
    return list(zip(idx[::2], idx[1::2]))
Sample run -
In [51]: A
Out[51]:
array([ nan, 1., 2., 3., nan, 4., 3., nan, nan, nan, 2.,
2., 2., nan, nan])
In [52]: start_stop_nonNaN_slices_v2(A)
Out[52]: [(1, 4), (5, 7), (10, 13)]
II. If you are okay with the start and stop indices as two output arrays, this should be pretty efficient, as it avoids any list comprehension or zipping -
def start_stop_nonNaN_slices_v3(A):
    mask = ~np.isnan(A)
    mask_ext = np.r_[False, mask, False]
    idx = np.flatnonzero(mask_ext[1:] != mask_ext[:-1])
    return idx[::2], idx[1::2]
Sample run -
In [74]: A
Out[74]:
array([ nan, 1., 2., 3., nan, 4., 3., nan, nan, nan, 2.,
2., 2., nan, nan])
In [75]: starts, stops = start_stop_nonNaN_slices_v3(A)
In [76]: starts
Out[76]: array([ 1, 5, 10])
In [77]: stops
Out[77]: array([ 4, 7, 13])
Note on performance : For performance, we could use np.concatenate to replace np.r_ :
mask_ext = np.concatenate(( [False], mask, [False] ))
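Putting the np.concatenate variant together with the slice output, a minimal sketch could look like this. The function name nonnan_runs and the segments list are my own naming, and the conversion to plain int is only there to keep the slices free of NumPy scalar types.

```python
import numpy as np

def nonnan_runs(A):
    # Pad the validity mask with False on both sides so that every run of
    # valid values has a detectable start edge and stop edge.
    mask = ~np.isnan(A)
    padded = np.concatenate(([False], mask, [False]))
    edges = np.flatnonzero(padded[1:] != padded[:-1])
    # Edges alternate start, stop, start, stop, ...
    return [slice(int(start), int(stop))
            for start, stop in zip(edges[::2], edges[1::2])]

A = np.array([1.0, 2.0, 3.0, np.nan, 4.0, 3.0,
              np.nan, np.nan, np.nan, 2.0, 2.0, 2.0])
runs = nonnan_runs(A)
segments = [A[s] for s in runs]   # views on the NaN-free stretches
print(runs)
```

Each slice can be applied directly to A, so every NaN-free stretch can be processed without copying the data.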
Here is a possibility:
import numpy as np
def valid_slices(array):
    m = ~np.isnan(array)
    idx = np.arange(len(array))[m]
    idx_diff = np.diff(idx)
    idx_change = np.where(idx_diff > 1)[0]
    idx_start = np.concatenate([[0], idx_change + 1], axis=0)
    idx_end = np.concatenate([idx_change, [len(idx) - 1]], axis=0)
    return [slice(idx[start], idx[end] + 1) for start, end in zip(idx_start, idx_end)]
A = np.array([1.0,2.0,3.0,np.nan,4.0,3.0,np.nan,np.nan,np.nan,2.0,2.0,2.0])
print(valid_slices(A))
>>> [slice(0, 3, None), slice(4, 6, None), slice(9, 12, None)]
I have a number of time series, each containing measurements across weeks of the year, but not all of them start and end on the same weeks. I know the offsets, that is I know in what weeks each one starts and ends. Now I would like to combine them into a matrix respecting the inherent offsets, such that all values will align with the correct week numbers.
If the horizontal direction contains the series and vertical direction represents the weeks, given two series a and b, where values correspond to week numbers:
a = np.array([[1,2,3,4,5,6]])
b = np.array([[0,1,2,3,4,5]])
I want to know if it is possible to combine them, e.g. using some method that takes an offset argument in a fashion like combine((a, b), axis=0, offset=-1), such that the resulting array (let's call it c) looks like this:
print(c)
[[NaN 1 2 3 4 5 6 ]
[0 1 2 3 4 5 NaN]]
What is more, since the time series are enormous, I must stream them through my program, and therefore cannot know all offsets at the same time. I thought of using Pandas because it has nice indexing, but I felt there had to be a simpler way, since the essence of what I'm trying to do is super simple.
Update:
This seems to work
def offset_stack(a, b, offset=0):
    if offset < 0:
        a = np.insert(a, [0] * abs(offset), np.nan)
        b = np.append(b, [np.nan] * abs(offset))
    if offset > 0:
        a = np.append(a, [np.nan] * abs(offset))
        b = np.insert(b, [0] * abs(offset), np.nan)
    return np.concatenate(([a],[b]), axis=0)
You can do in numpy:
def f(a, b, n):
    v = np.full(abs(n), np.nan)
    if n < 0:
        return np.vstack((np.append(a,v), np.append(v,b)))
    elif n > 0:
        return np.vstack((np.append(v,a), np.append(b,v)))
    else:
        return np.vstack((a,b))
#In [148]: a = np.array([23, 13, 4, 12, 4, 4])
#In [149]: b = np.array([4, 12, 3, 41, 45, 6])
#In [150]: f(a,b,-2)
#Out[150]:
#array([[ 23., 13., 4., 12., 4., 4., nan, nan],
# [ nan, nan, 4., 12., 3., 41., 45., 6.]])
#In [151]: f(a,b,2)
#Out[151]:
#array([[ nan, nan, 23., 13., 4., 12., 4., 4.],
# [ 4., 12., 3., 41., 45., 6., nan, nan]])
#In [152]: f(a,b,0)
#Out[152]:
#array([[23, 13, 4, 12, 4, 4],
# [ 4, 12, 3, 41, 45, 6]])
There is a real simple way to accomplish this.
You basically want to pad and then stack your arrays and for both there are numpy functions:
numpy.pad() aka offset
import numpy as np
a = np.array([[1,2,3,4,5,6]], dtype=float) # float because NaN is a float value!
b = np.array([[0,1,2,3,4,5]], dtype=float)
from numpy import pad
print(pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan))
# [[ nan 1. 2. 3. 4. 5. 6.]]
print(pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan))
# [[ 0., 1., 2., 3., 4., 5., nan]]
The ((0,0),(1,0)) means no padding along the first axis (top/bottom), one padded element on the left, and none on the right. So you have to tweak these values if you want a larger or smaller shift.
numpy.vstack() aka stack along axis=0
import numpy as np
a_padded = pad(a, ((0,0),(1,0)), mode='constant', constant_values=np.nan)
b_padded = pad(b, ((0,0),(0,1)), mode='constant', constant_values=np.nan)
np.vstack([a_padded, b_padded])
# array([[ nan, 1., 2., 3., 4., 5., 6.],
# [ 0., 1., 2., 3., 4., 5., nan]])
Your function:
Combining these two would be very easy and is easy to extend:
import numpy as np
def offset_stack(a, b, axis=0, offsets=(0, 1)):
    # One offset per dimension; both arrays must have the same number of dimensions.
    if (len(offsets) != a.ndim) or (a.ndim != b.ndim):
        raise ValueError('Offsets and dimensions of the arrays do not match.')
    # a is padded before (after, for negative offsets); b gets the opposite side.
    offset1 = [(0, -offset) if offset < 0 else (offset, 0) for offset in offsets]
    offset2 = [(-offset, 0) if offset < 0 else (0, offset) for offset in offsets]
    a_padded = np.pad(a, offset1, mode='constant', constant_values=np.nan)
    b_padded = np.pad(b, offset2, mode='constant', constant_values=np.nan)
    return np.concatenate([a_padded, b_padded], axis=axis)
offset_stack(a, b)
This function works for generalized offsets in arbitrary dimensions and can stack along an arbitrary axis. It doesn't behave the same way as the original: offsets takes one entry per dimension, so for your arrays the shift belongs in the second entry, while the first entry pads the first dimension. But if you keep track of the dimensions of your arrays it should work fine.
For example:
offset_stack(a, b, offsets=(1,2))
array([[ nan, nan, nan, nan, nan, nan, nan, nan],
[ nan, nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan, nan],
[ nan, nan, nan, nan, nan, nan, nan, nan]])
or for 3d arrays:
a = np.array([1,2,3], dtype=float)[None, :, None] # makes it 3d
b = np.array([0,1,2], dtype=float)[None, :, None] # makes it 3d
offset_stack(a, b, offsets=(0,1,0), axis=2)
array([[[ nan, 0.],
[ 1., 1.],
[ 2., 2.],
[ 3., nan]]])
pad and concatenate (and the various stack and inserts) create a target array of the right size, and fill values from the input arrays. So we can do the same, and potentially do it faster.
Just for example using your 2 arrays and the 1 step offset:
In [283]: a = np.array([[1,2,3,4,5,6]])
In [284]: b = np.array([[0,1,2,3,4,5]])
Create the target array, and fill it with the pad value. np.nan is a float (even though a is int), so the target must be float:
In [285]: m=a.shape[0]+b.shape[0]
In [286]: n=a.shape[1]+1
In [287]: c=np.zeros((m,n),float)
In [288]: c.fill(np.nan)
Now just copy values into the right places on the target. More arrays and offsets will require some generalization here.
In [289]: c[:a.shape[0],1:]=a
In [290]: c[-b.shape[0]:,:-1]=b
In [291]: c
Out[291]:
array([[ nan, 1., 2., 3., 4., 5., 6.],
[ 0., 1., 2., 3., 4., 5., nan]])
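One possible generalization to several series is sketched below. This is not a drop-in answer to the streaming part of the question: it assumes 1D series, assumes all start offsets are known when the target is allocated, and uses a hypothetical helper name stack_with_offsets. Here each offset is the week on which that series starts, which is a different convention from the single relative offset in the question.

```python
import numpy as np

def stack_with_offsets(series, offsets):
    """Hypothetical helper: place 1D series in rows, aligned by start offset."""
    start = min(offsets)
    stop = max(off + len(s) for s, off in zip(series, offsets))
    # Allocate the target once, pre-filled with the pad value.
    out = np.full((len(series), stop - start), np.nan)
    for row, (s, off) in enumerate(zip(series, offsets)):
        first = off - start
        out[row, first:first + len(s)] = s
    return out

a = np.array([1, 2, 3, 4, 5, 6])
b = np.array([0, 1, 2, 3, 4, 5])
c = stack_with_offsets([a, b], offsets=[1, 0])
print(c)
```

Because the target is allocated once and filled by slice assignment, no intermediate padded copies are created.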