I am currently struggling with a really simple problem, but cannot seem to solve it. You can reproduce the issue with the following file and code:
test.csv
2020081217,28.6
2020081218,24.7
2020081219,-999.0
2020081220,-999.0
2020081221,-999.0
code
data = np.genfromtxt("C:/Users/col/Downloads/test.csv", delimiter=',', missing_values=["-999", "-999.0", -999, -999.0])
print(data)
output
[[ 2.02008122e+09 2.86000000e+01]
[ 2.02008122e+09 2.47000000e+01]
[ 2.02008122e+09 -9.99000000e+02]
[ 2.02008122e+09 -9.99000000e+02]
[ 2.02008122e+09 -9.99000000e+02]]
Why do none of the variants of missing_values catch the -999 values in the file and replace them with NaNs or something similar? I feel like this should be simple (and is probably already answered somewhere on this site), but I cannot figure it out... Thanks for any help.
There are two kinds of missing values. One is where the value is represented only by the delimiter. The default fill is nan, but we can define a separate fill:
In [93]: txt1="""2020081217,28.6
...: 2020081218,24.7
...: 2020081219,
...: 2020081220,
...: 2020081221,"""
In [94]: np.genfromtxt(txt1.splitlines(),delimiter=',',encoding=None)
Out[94]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, nan],
[2.02008122e+09, nan],
[2.02008122e+09, nan]])
In [95]: np.genfromtxt(txt1.splitlines(), delimiter=',', encoding=None, filling_values=999)
Out[95]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, 9.99000000e+02],
[2.02008122e+09, 9.99000000e+02],
[2.02008122e+09, 9.99000000e+02]])
Your case has a specific string:
In [96]: txt="""2020081217,28.6
...: 2020081218,24.7
...: 2020081219,-999.0
...: 2020081220,-999.0
...: 2020081221,-999.0"""
The other answer suggests using usemask, returning a masked_array:
In [100]: np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None, missing_values=-999.0, usemask=True)
Out[100]:
masked_array(
data=[[2020081217.0, 28.6],
[2020081218.0, 24.7],
[2020081219.0, --],
[2020081220.0, --],
[2020081221.0, --]],
mask=[[False, False],
[False, False],
[False, True],
[False, True],
[False, True]],
fill_value=1e+20)
Looking at the code, I deduce that it's doing a string match, rather than a numeric one. It can also take one value per column (I don't think it does a per-row test):
In [106]: np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None,
missing_values=['2020081217','-999.0'], usemask=True, dtype=None)
Out[106]:
masked_array(data=[(--, 28.6), (2020081218, 24.7), (2020081219, --),
(2020081220, --), (2020081221, --)],
mask=[( True, False), (False, False), (False, True),
(False, True), (False, True)],
fill_value=(999999, 1.e+20),
dtype=[('f0', '<i8'), ('f1', '<f8')])
Here I gave it dtype=None, so it returned a structured array.
missing_values can also be a dict, but I haven't figured out exactly what it expects.
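My guess (untested) from skimming the genfromtxt code is that the dict maps a column index or name to that column's missing-value string(s), something like:
# Untested guess: key is the column index (or name), value is that
# column's missing-value string (or list of strings).
np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None,
              missing_values={1: '-999.0'}, usemask=True)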
I haven't figured out how to make it replace the missing values with something (such as from the filling_values).
You can do the replacement after loading:
In [110]: data = np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None)
In [111]: data
Out[111]:
array([[ 2.02008122e+09, 2.86000000e+01],
[ 2.02008122e+09, 2.47000000e+01],
[ 2.02008122e+09, -9.99000000e+02],
[ 2.02008122e+09, -9.99000000e+02],
[ 2.02008122e+09, -9.99000000e+02]])
In [114]: data[data==-999] = np.nan
In [115]: data
Out[115]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, nan],
[2.02008122e+09, nan],
[2.02008122e+09, nan]])
It looks like genfromtxt constructs converters from the missing and filling values, but I haven't followed the details. Here's a way of supplying our own converter:
In [138]: converters={1:lambda x: np.nan if x=='-999.0' else float(x)}
In [139]: data = np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None,
converters=converters)
In [140]: data
Out[140]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, nan],
[2.02008122e+09, nan],
[2.02008122e+09, nan]])
You need to add usemask=True.
data = np.genfromtxt("test.csv", delimiter=',', usemask=True, missing_values=-999.0)
Fill in the masked values with NaNs.
data = data.filled(np.nan)
Check for NaNs.
np.isnan(data)
Output.
array([[False, False],
[False, False],
[False, True],
[False, True],
[False, True]])
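Or in one step, chaining the load and the fill:
data = np.genfromtxt("test.csv", delimiter=',', usemask=True,
                     missing_values=-999.0).filled(np.nan)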
Suppose I have two arrays, a=np.array([0,0,1,1,1,2]) and b=np.array([1,2,4,2,6,5]). Elements of a give the row indices where the values of b should be assigned, and if multiple elements map to the same row, the values are assigned in order.
So the result is a 2D array c:
import numpy as np

a = np.array([0, 0, 1, 1, 1, 2])
b = np.array([1, 2, 4, 2, 6, 5])

c = np.zeros((3, 4))
counts = {k: 0 for k in range(3)}
for i in range(a.shape[0]):
    c[a[i], counts[a[i]]] = b[i]
    counts[a[i]] += 1
print(c)
Is there a way to use some fancy indexing method in numpy to get such results faster (without a for loop) in case these arrays are big?
I had to run your code to actually see what it produced. There are limits to what I can 'run' in my head.
In [230]: c
Out[230]:
array([[1., 2., 0., 0.],
[4., 2., 6., 0.],
[5., 0., 0., 0.]])
In [231]: counts
Out[231]: {0: 2, 1: 3, 2: 1}
Omitting this information may be delaying possible answers. 'Vectorization' requires thinking in whole-array terms, which is easiest if I can visualize the result and look for a pattern.
This looks like a padding problem.
In [260]: u, c = np.unique(a, return_counts=True)
In [261]: u
Out[261]: array([0, 1, 2])
In [262]: c
Out[262]: array([2, 3, 1]) # cf with counts
See: Load data with rows of different sizes into Numpy array
Working from previous padding questions, I can construct a mask:
In [263]: mask = np.arange(4)<c[:,None]
In [264]: mask
Out[264]:
array([[ True, True, False, False],
[ True, True, True, False],
[ True, False, False, False]])
and use that to assign the b values to c:
In [265]: c = np.zeros((3,4),int)
In [266]: c[mask] = b
In [267]: c
Out[267]:
array([[1, 2, 0, 0],
[4, 2, 6, 0],
[5, 0, 0, 0]])
Since a is already sorted, we might get the counts faster than with unique. Also, unique will have problems if a doesn't have any values for some row(s).
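For example, np.bincount exploits the sorted non-negative indices, and minlength covers rows that never appear in a. A sketch under those assumptions:
import numpy as np

a = np.array([0, 0, 1, 1, 1, 2])
b = np.array([1, 2, 4, 2, 6, 5])

nrows, width = 3, 4
counts = np.bincount(a, minlength=nrows)  # array([2, 3, 1]); absent rows get 0

# same mask-and-assign as above
mask = np.arange(width) < counts[:, None]
c = np.zeros((nrows, width), int)
c[mask] = b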
I'm calculating an aggregate value over smaller blocks in a 2D numpy array. I'd like to exclude the value 0 from the aggregation in an efficient manner (rather than with for loops and if statements).
I'm using skimage.measure.block_reduce and numpy.ma.masked_equal, but it looks like block_reduce ignores the mask.
import numpy as np
import skimage
a = np.array([[2,4,0,12,5,7],[6,0,8,4,3,9]])
zeros_included = skimage.measure.block_reduce(a,(2,2),np.mean)
includes 0s and (correctly) produces
zeros_included
array([[3., 6., 6.]])
I was hoping
masked = np.ma.masked_equal(a,0)
zeros_excluded = skimage.measure.block_reduce(masked,(2,2),np.mean)
would do the trick, but still produces
zeros_excluded
array([[3., 6., 6.]])
The desired result would be:
array([[4., 8., 6.]])
I'm looking for a Pythonic way to achieve the correct result; use of skimage is optional. Of course my actual arrays and blocks are much bigger than in this example, hence the need for efficiency.
Thanks for your interest.
You could use np.nanmean, but you'll have to modify the original array or create a new one:
import numpy as np
import skimage
a = np.array([[2,4,0,12,5,7],[6,0,8,4,3,9]])
b = a.astype("float")
b[b==0] = np.nan
zeros_excluded = skimage.measure.block_reduce(b,(2,2), np.nanmean)
zeros_excluded
# array([[4., 8., 6.]])
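One caveat: a block that is entirely zeros becomes all-nan, so np.nanmean returns nan for it (and emits a "Mean of empty slice" RuntimeWarning):
c = np.array([[0., 0., 1., 2.],
              [0., 0., 3., 4.]])
c[c == 0] = np.nan
skimage.measure.block_reduce(c, (2, 2), np.nanmean)
# array([[nan, 2.5]])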
The core code of block_reduce is
blocked = view_as_blocks(image, block_size)
return func(blocked, axis=tuple(range(image.ndim, blocked.ndim)))
view_as_blocks uses as_strided to create a different view of the array:
In [532]: skimage.util.view_as_blocks(a,(2,2))
Out[532]:
array([[[[ 2, 4],
[ 6, 0]],
[[ 0, 12],
[ 8, 4]],
[[ 5, 7],
[ 3, 9]]]])
When applied to the masked array it produces the same thing. In effect it works with masked.data, or np.asarray(masked). Some actions preserve subclasses; this one does not.
In [533]: skimage.util.view_as_blocks(masked,(2,2))
Out[533]:
array([[[[ 2, 4],
[ 6, 0]],
...
That's why np.mean applied over the (2, 3) axes does not respond to the masking.
np.mean applied to a masked array delegates the action to the array's own method, so it is sensitive to the masking:
In [544]: np.mean(masked[:,:2])
Out[544]: 4.0
In [545]: masked[:,:2].mean()
Out[545]: 4.0
In [547]: [masked[:,i:i+2].mean() for i in range(0,6,2)]
Out[547]: [4.0, 8.0, 6.0]
np.nanmean works with view_as_blocks because it doesn't depend on the array being a special subclass.
I can define a function that applies masking to the block view:
def foo(arr, axis):
    return np.ma.masked_equal(arr, 0).mean(axis)
In [552]: skimage.measure.block_reduce(a,(2,2),foo)
Out[552]:
masked_array(data=[[4.0, 8.0, 6.0]],
mask=[[False, False, False]],
fill_value=1e+20)
====
Since your blocks aren't overlapping, I create the blocks with reshaping and swapping axes.
In [554]: masked.reshape(2,3,2).transpose(1,0,2)
Out[554]:
masked_array(
data=[[[2, 4],
[6, --]],
[[--, 12],
[8, 4]],
[[5, 7],
[3, 9]]],
mask=[[[False, False],
[False, True]],
[[ True, False],
[False, False]],
[[False, False],
[False, False]]],
fill_value=0)
and then apply mean to the last 2 axes:
In [555]: masked.reshape(2,3,2).transpose(1,0,2).mean((1,2))
Out[555]:
masked_array(data=[4.0, 8.0, 6.0],
mask=[False, False, False],
fill_value=1e+20)
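The same reshape-and-swap can be wrapped up for any non-overlapping block size. A sketch (block_view is just a name I made up here), assuming the block size divides the array shape evenly:
def block_view(arr, p, q):
    # rearrange an (m*p, n*q) array into (m*n, p, q) non-overlapping blocks
    m, n = arr.shape[0] // p, arr.shape[1] // q
    return arr.reshape(m, p, n, q).transpose(0, 2, 1, 3).reshape(m * n, p, q)

block_view(masked, 2, 2).mean((1, 2))
# masked_array(data=[4.0, 8.0, 6.0], ...)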
My goal is to fill a 2D array with values from a 1D array that exactly matches the pattern of values in the 2D array. For example:
array_a =
([[nan,nan,0],
[0,nan,0],
[nan,0,0],
[0,0,nan]])
array_b =
([0.324,0.254,0.204,
0.469,0.381,0.292,
0.550])
And I want to get this:
array_c =
([[nan,nan,0.324],
[0.254,nan,0.204],
[nan,0.469,0.381],
[0.292,0.550,nan]])
The number of values that need to be filled in array_a will exactly match the number of values in array_b. The main issue is that I want to keep the nan values in the appropriate positions throughout the array, and I'm not sure how best to do that.
Boolean indexing does the job nicely.
First, locate the nan values:
In [229]: mask = np.isnan(array_a)
In [230]: mask
Out[230]:
array([[ True, True, False],
[False, True, False],
[ True, False, False],
[False, False, True]])
The boolean mask applied to the array produces a 1d array:
In [231]: array_a[~mask]
Out[231]: array([0., 0., 0., 0., 0., 0., 0.])
Use that same mask in an assignment context:
In [232]: array_a[~mask]=array_b
In [233]: array_a[~mask]
Out[233]: array([0.324, 0.254, 0.204, 0.469, 0.381, 0.292, 0.55 ])
In [234]: array_a
Out[234]:
array([[ nan, nan, 0.324],
[0.254, nan, 0.204],
[ nan, 0.469, 0.381],
[0.292, 0.55 , nan]])
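Note that this fills array_a in place. To leave array_a untouched and produce a separate array_c, copy first:
array_c = array_a.copy()
array_c[~np.isnan(array_c)] = array_b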
You can also use np.place (nan compares unequal to 0, so array_a == 0 selects exactly the cells to fill):
np.place(array_a, array_a == 0, array_b)
array_a
array([[ nan, nan, 0.324],
[0.254, nan, 0.204],
[ nan, 0.469, 0.381],
[0.292, 0.55 , nan]])
This should do the trick, although there might be a pre-written solution or a list comprehension to do the same.
import numpy as np

b_index = 0
array_c = np.zeros(np.array(array_a).shape)
for row_index, row in enumerate(array_a):
    for col_index, col in enumerate(row):
        if not np.isnan(col):
            array_c[row_index, col_index] = array_b[b_index]
            b_index += 1
        else:
            array_c[row_index, col_index] = np.nan
>>> print(array_c)
[[ nan nan 0.324]
[0.254 nan 0.204]
[ nan 0.469 0.381]
[0.292 0.55 nan]]
I have a 3 dimensional numpy array with shape (x,y,R). For each (x,y) pair, I have a 1D numpy array of R values. I want to set that entire 1D array to nan if any of its R values is nan or zero. I tried something like
# 3d np array is called: data
mask1 = (data==0).any(axis=2)
mask2 = (data==np.nan).any(axis=2)
data[np.logical_or(mask1, mask2)] = np.nan
But this doesn't seem to work. I think the problem is the way I am trying to subset the numpy array with the lower dimensional boolean array, but I'm not quite sure how to solve this.
Some example data:
y = np.random.random(size=(2,2,3))
y[0,0,2] = np.nan
y[0,1,0] = np.nan
y[0,0,1] = np.nan
y[1,1,2] = 0.
so that:
y[0,0,:]
array([0.092718, nan, nan])
y[0,1,:]
array([ nan, 0.00243745, nan])
y[1,0,:]
array([0.5282173 , 0.7548559 , 0.08869139])
y[1,1,:]
array([0.19612415, 0.16969036, 0.0])
and the desired result:
y[0,0,:]
array([nan, nan, nan])
y[0,1,:]
array([nan, nan, nan])
y[1,0,:]
array([0.5282173 , 0.7548559 , 0.08869139])
y[1,1,:]
array([nan, nan, nan])
Update: this seems to work, but perhaps there are more elegant approaches:
mask1 = (y==0).any(axis=2)
y[np.logical_or(np.sum(np.isnan(y), axis=2) > 0, mask1)] = np.nan
y
array([[[ nan, nan, nan],
[ nan, nan, nan]],
[[0.5282173 , 0.7548559 , 0.08869139],
[ nan, nan, nan]]])
nan has the peculiar property of comparing not equal to anything, including nan itself:
>>> y = np.random.random(size=(2,2,3))
>>> y[0,0,2] = np.nan
>>> y[0,1,0] = np.nan
>>> y[0,0,1] = np.nan
>>> y[0,1,2] = np.nan
>>>
>>> y
array([[[0.03161193, nan, nan],
[ nan, 0.55789282, nan]],
[[0.78047397, 0.06949872, 0.65225197],
[0.84801579, 0.11298244, 0.07627531]]])
>>>
>>> y == np.nan
array([[[False, False, False],
[False, False, False]],
[[False, False, False],
[False, False, False]]])
To check for nan you have to use np.isnan:
>>> np.isnan(y)
array([[[False, True, True],
[ True, False, True]],
[[False, False, False],
[False, False, False]]])
With this little modification your code will actually work:
>>> mask1 = (y==0).any(axis=2)
>>> mask2 = np.isnan(y).any(axis=2)
>>> y[np.logical_or(mask1, mask2)] = np.nan
>>>
>>> y
array([[[ nan, nan, nan],
[ nan, nan, nan]],
[[0.78047397, 0.06949872, 0.65225197],
[0.84801579, 0.11298244, 0.07627531]]])
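For reference, the same fix condensed into a single expression:
y[(np.isnan(y) | (y == 0)).any(axis=2)] = np.nan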
As an addendum to @PaulPanzer's answer, I have attempted to get the same result with the minimum number of temp arrays. This answer is here for fun, and does not provide any benefits to outweigh the clarity and legibility of PaulPanzer's answer.
Instead of ndarray.any, you can check for zeros directly with ndarray.all and negate the resulting 2D array in place (rather than a 3D one), avoiding a temp array. For the nan check, you can use the property that any number added (or subtracted, multiplied, divided, etc.) to nan results in nan: ufunc.reduce along the last axis produces the 2D matrix directly, which saves another 3D boolean array. You can't use the fact that np.isnan is a ufunc here, because it is a unary function and does not support the reduce operation.
# Check for zeros
mask = y.all(axis=2) # Straight to 2D, no temp arrays
mask = np.logical_not(mask, out=mask) # In place negation, no temp arrays
# Check for nans
nans = np.add.reduce(y, axis=2) # 2D temp array, not 3D
mask |= np.isnan(nans) # Another temp array, also 2D
I chose np.add because it is not likely to run into problems that cause false nans to appear (unlike, say, np.divide). Any overflows will become +/-inf, which will not trigger the isnan check.
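To actually apply the combined mask, finish with the same assignment as in the other answers:
y[mask] = np.nan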