numpy genfromtxt not applying missing_values - python

I am currently struggling with a really simple problem, but cannot seem to solve it. You can reproduce the issue with the following file and code:
test.csv
2020081217,28.6
2020081218,24.7
2020081219,-999.0
2020081220,-999.0
2020081221,-999.0
code
data = np.genfromtxt("C:/Users/col/Downloads/test.csv", delimiter=',', missing_values=["-999", "-999.0", -999, -999.0])
print(data)
output
[[ 2.02008122e+09 2.86000000e+01]
[ 2.02008122e+09 2.47000000e+01]
[ 2.02008122e+09 -9.99000000e+02]
[ 2.02008122e+09 -9.99000000e+02]
[ 2.02008122e+09 -9.99000000e+02]]
Why do none of these variants of missing_values catch the -999 values in the file and replace them with NaNs or something similar? I feel like this should be simple (and probably already answered somewhere on this website), but I cannot figure it out... Thanks for any help.

There are two types of missing values. One is where the value is represented only by the delimiter. The default fill is nan, but we can define a separate fill:
In [93]: txt1="""2020081217,28.6
...: 2020081218,24.7
...: 2020081219,
...: 2020081220,
...: 2020081221,"""
In [94]: np.genfromtxt(txt1.splitlines(),delimiter=',',encoding=None)
Out[94]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, nan],
[2.02008122e+09, nan],
[2.02008122e+09, nan]])
In [95]: np.genfromtxt(txt1.splitlines(), delimiter=',', encoding=None,
...: filling_values=999)
Out[95]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, 9.99000000e+02],
[2.02008122e+09, 9.99000000e+02],
[2.02008122e+09, 9.99000000e+02]])
Your case has a specific string:
In [96]: txt="""2020081217,28.6
...: 2020081218,24.7
...: 2020081219,-999.0
...: 2020081220,-999.0
...: 2020081221,-999.0"""
The other answer suggests using usemask, returning a masked_array:
In [100]: np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None, missing_values=-999.0, usemask=True)
Out[100]:
masked_array(
data=[[2020081217.0, 28.6],
[2020081218.0, 24.7],
[2020081219.0, --],
[2020081220.0, --],
[2020081221.0, --]],
mask=[[False, False],
[False, False],
[False, True],
[False, True],
[False, True]],
fill_value=1e+20)
Looking at the code, I deduce that it's doing a string match, rather than a numeric one. It can also take one value per column (I don't think it does a per-row test):
In [106]: np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None,
missing_values=['2020081217','-999.0'], usemask=True, dtype=None)
Out[106]:
masked_array(data=[(--, 28.6), (2020081218, 24.7), (2020081219, --),
(2020081220, --), (2020081221, --)],
mask=[( True, False), (False, False), (False, True),
(False, True), (False, True)],
fill_value=(999999, 1.e+20),
dtype=[('f0', '<i8'), ('f1', '<f8')])
Here I gave it dtype=None, so it returned a structured array.
missing_values can also be a dict, but I haven't figured out exactly what it expects.
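For what it's worth, the numpy I/O guide describes the dict form as keyed by column index (or column name), with the string(s) to treat as missing as values, and the special key None as a default for all columns. A sketch along those lines (my reading of the docs, untested here):

# Untested sketch: treat '-999.0' as missing in column 1 only
np.genfromtxt(txt.splitlines(), delimiter=',', encoding=None,
              missing_values={1: '-999.0'}, usemask=True)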
I haven't figured out how to make it replace the missing values with something (such as from the filling_values).
You can do the replacement after loading:
In [110]: data = np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None)
In [111]: data
Out[111]:
array([[ 2.02008122e+09, 2.86000000e+01],
[ 2.02008122e+09, 2.47000000e+01],
[ 2.02008122e+09, -9.99000000e+02],
[ 2.02008122e+09, -9.99000000e+02],
[ 2.02008122e+09, -9.99000000e+02]])
In [114]: data[data==-999] = np.nan
In [115]: data
Out[115]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, nan],
[2.02008122e+09, nan],
[2.02008122e+09, nan]])
It looks like genfromtxt constructs converters from the missing and filling values, but I haven't followed the details. Here's a way of supplying our own converter:
In [138]: converters={1:lambda x: np.nan if x=='-999.0' else float(x)}
In [139]: data = np.genfromtxt(txt.splitlines(),delimiter=',',encoding=None,
converters=converters)
In [140]: data
Out[140]:
array([[2.02008122e+09, 2.86000000e+01],
[2.02008122e+09, 2.47000000e+01],
[2.02008122e+09, nan],
[2.02008122e+09, nan],
[2.02008122e+09, nan]])

You need to add usemask=True.
data = np.genfromtxt("test.csv", delimiter=',', usemask=True, missing_values=-999.0)
Fill in with NaNs:
data = data.filled(np.nan)
Check for NaNs:
np.isnan(data)
Output:
array([[False, False],
[False, False],
[False, True],
[False, True],
[False, True]])
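Putting the two steps together (mask on load, then fill), the same calls can simply be chained:

data = np.genfromtxt("test.csv", delimiter=',', usemask=True,
                     missing_values=-999.0).filled(np.nan)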

Related

How to delete decimal values from an array in a pythonic way

I am trying to delete an element from an array. When trying to delete integer values (using numpy.delete) it works, but it doesn't work for decimal values.
For integer deletion
X = [1. 2. 2.5 5.7 3. 6. ]
to_delete_key = [3, 7.3]
Y = np.delete(X, to_delete_key, None)
Output is [1. 2. 2.5 5.7 6. ]
The value 3 got deleted.
Whereas in the case of decimal deletion
For decimal deletion
X = [6. 7.3 9.1]
to_delete_key = [3, 7.3]
Y = np.delete(X, to_delete_key, None)
Output is [6. 7.3 9.1]
The value 7.3 didn't get deleted.
I know how to do it the normal way, but is there any efficient, pythonic way to do it?
In [249]: X = np.array([1., 2., 2.5, 5.7, 3., 6. ])
...: to_delete_key = [3, 7.3]
In [252]: np.delete(X, to_delete_key)
Traceback (most recent call last):
File "<ipython-input-252-f9031065a548>", line 1, in <module>
np.delete(X, to_delete_key)
File "<__array_function__ internals>", line 5, in delete
File "/usr/local/lib/python3.8/dist-packages/numpy/lib/function_base.py", line 4406, in delete
keep[obj,] = False
IndexError: arrays used as indices must be of integer (or boolean) type
Using an integer:
In [253]: np.delete(X, 3)
Out[253]: array([1. , 2. , 2.5, 3. , 6. ])
It was the 5.7 that was deleted, X[3].
np.delete does not delete by value! From the docs:
obj : slice, int or array of ints
Indicate indices of sub-arrays to remove along the specified axis.
We can look for value matches:
In [267]: vals = [3, 2.5]
In [268]: X[:,None]==vals
Out[268]:
array([[False, False],
[False, False],
[False, True],
[False, False],
[ True, False],
[False, False]])
But equality matching on floats can be unreliable; isclose tests with a tolerance:
In [269]: np.isclose(X[:,None],vals)
Out[269]:
array([[False, False],
[False, False],
[False, True],
[False, False],
[ True, False],
[False, False]])
Then find the rows where there's a match:
In [270]: _.any(axis=1)
Out[270]: array([False, False, True, False, True, False])
In [271]: X[_]
Out[271]: array([2.5, 3. ])
In [272]: X[~__]
Out[272]: array([1. , 2. , 5.7, 6. ])
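Putting the isclose test together into a small helper (my own sketch, not a numpy function):

import numpy as np

def delete_by_value(X, vals, rtol=1e-05, atol=1e-08):
    # Keep the elements of X that are not close to any value in vals
    mask = np.isclose(X[:, None], vals, rtol=rtol, atol=atol).any(axis=1)
    return X[~mask]

X = np.array([1., 2., 2.5, 5.7, 3., 6.])
print(delete_by_value(X, [3, 2.5]))   # [1.  2.  5.7 6. ]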
Lists have a remove by value:
In [284]: alist=X.tolist()
In [285]: alist.remove(3.0)
In [286]: alist.remove(2.5)
In [287]: alist
Out[287]: [1.0, 2.0, 5.7, 6.0]
You are dealing with floating-point numbers, which cannot be compared exactly. Google "What every programmer should know about floating-point numbers".
1/3 + 1/3 + 1/3 might not be equal to 1 due to rounding errors.
So the explanation is that your value of 7.3 is not found: somewhere along the way, 7.3 may have been converted to a 32-bit float (or similar) that is not exactly equal to what is in the array.
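A quick demonstration of both points (standard floating-point behavior, nothing specific to this question):

import numpy as np

# The classic example: the binary doubles for 0.1 and 0.2 don't sum
# to exactly the double for 0.3, so equality fails.
print(0.1 + 0.2 == 0.3)          # False
# And a 7.3 stored as a 32-bit float differs from the 64-bit 7.3:
print(np.float32(7.3) == 7.3)    # False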
As mentioned by @elPastor, you are misusing Numpy.

replace masked with nan in numpy masked_array

>>> masks = [[1,1],[0,0]]
>>> [np.ma.masked_array(data=np.array([1.0,2.0]), mask=m, fill_value=np.nan).mean() for m in masks]
[masked, 1.5]
I'd like to replace the masked result with nan. Is there a way to do that directly with numpy's masked_array?
In [232]: M = np.ma.masked_array(data=np.array([1.0,2.0]),mask=[True, False])
The filled method replaces the masked values with the fill value:
In [233]: M.filled()
Out[233]: array([1.e+20, 2.e+00])
In [234]: M.filled(np.nan) # or with a value of your choice.
Out[234]: array([nan, 2.])
Or, as you did, specify the fill value when defining the array:
In [235]: M = np.ma.masked_array(data=np.array([1.0,2.0]),mask=[True, False],
...: fill_value=np.nan)
In [236]: M
Out[236]:
masked_array(data=[--, 2.0],
mask=[ True, False],
fill_value=nan)
In [237]: M.filled()
Out[237]: array([nan, 2.])
The masked array's mean method skips over the masked values:
In [238]: M.mean()
Out[238]: 2.0
In [239]: M.filled().mean()
Out[239]: nan
In [241]: np.nanmean(M.filled()) # the `nanmean` function also skips the nan
Out[241]: 2.0
In [242]: M.data.mean() # mean of the underlying data
Out[242]: 1.5
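To answer the original question directly, one option (a sketch using only the filled method shown above) is to collect the per-mask means into a masked array and then fill it:

import numpy as np

masks = [[1, 1], [0, 0]]
data = np.array([1.0, 2.0])
# A fully masked mean returns the `masked` constant, which np.ma.array
# turns back into a masked element; filled() then replaces it with nan.
means = np.ma.array([np.ma.masked_array(data, mask=m).mean() for m in masks])
print(means.filled(np.nan))   # [nan 1.5]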
I think you can use np.ones to broadcast the data to the shape of the masks:
masks=np.array([[1,1],[0,0]])
np.ma.masked_array(data=np.array([1.0,2.0])*np.ones(masks.shape), mask=masks, fill_value=np.nan).mean(axis=1)
Out[145]:
masked_array(data=[--, 1.5],
mask=[ True, False],
fill_value=1e+20)

Replacing fill values in 2D array with values from a 1D array

My goal is to fill a 2D array with values from a 1D array that exactly matches the pattern of values in the 2D array. For example:
array_a =
([[nan,nan,0],
[0,nan,0],
[nan,0,0],
[0,0,nan]])
array_b =
([0.324,0.254,0.204,
0.469,0.381,0.292,
0.550])
And I want to get this:
array_c =
([[nan,nan,0.324],
[0.254,nan,0.204],
[nan,0.469,0.381],
[0.292,0.550,nan]])
The number of values that need to be filled in array_a will exactly match the number of values in array_b. The main issue is that I want to keep the nan values in the appropriate positions throughout the array, and I'm not sure how best to do that.
Boolean indexing does the job nicely. First, locate the nans:
In [229]: mask = np.isnan(array_a)
In [230]: mask
Out[230]:
array([[ True, True, False],
[False, True, False],
[ True, False, False],
[False, False, True]])
A boolean mask applied to the array produces a 1d array:
In [231]: array_a[~mask]
Out[231]: array([0., 0., 0., 0., 0., 0., 0.])
Use that same mask on the left-hand side of an assignment:
In [232]: array_a[~mask]=array_b
In [233]: array_a[~mask]
Out[233]: array([0.324, 0.254, 0.204, 0.469, 0.381, 0.292, 0.55 ])
In [234]: array_a
Out[234]:
array([[ nan, nan, 0.324],
[0.254, nan, 0.204],
[ nan, 0.469, 0.381],
[0.292, 0.55 , nan]])
You can also do:
np.place(array_a, array_a == 0, array_b)
array_a
array([[ nan, nan, 0.324],
[0.254, nan, 0.204],
[ nan, 0.469, 0.381],
[0.292, 0.55 , nan]])
This should do the trick, although there might be a pre-written solution or a list comprehension to do the same.
import numpy as np

b_index = 0
array_c = np.zeros(np.array(array_a).shape)
for row_index, row in enumerate(array_a):
    for col_index, col in enumerate(row):
        if not np.isnan(col):
            array_c[row_index, col_index] = array_b[b_index]
            b_index += 1
        else:
            array_c[row_index, col_index] = np.nan
>>> print(array_c)
[[ nan nan 0.324]
[0.254 nan 0.204]
[ nan 0.469 0.381]
[0.292 0.55 nan]]

setting nan to entire axis if any nan exists in axis in numpy

I have a 3-dimensional numpy array with shape (x,y,R). For each (x,y) pair, I have a 1D numpy array of R values. I want to set that entire 1D array to nan if any of its R values are nan or zero. I tried something like:
# 3d np array is called: data
mask1 = (data==0).any(axis=2)
mask2 = (data==np.nan).any(axis=2)
data[np.logical_or(mask1, mask2)] = np.nan
But this doesn't seem to work. I think the problem is the way I am trying to subset the numpy array with the lower-dimensional boolean array, but I'm not quite sure how to solve this.
Some example data:
y = np.random.random(size=(2,2,3))
y[0,0,2] = np.nan
y[0,1,0] = np.nan
y[0,0,1] = np.nan
y[1,1,2] = 0.
so that:
y[0,0,:]
array([0.092718, nan, nan])
y[0,1,:]
array([ nan, 0.00243745, nan])
y[1,0,:]
array([0.5282173 , 0.7548559 , 0.08869139])
y[1,1,:]
array([0.19612415, 0.16969036, 0.0])
and the desired result:
y[0,0,:]
array([nan, nan, nan])
y[0,1,:]
array([nan, nan, nan])
y[1,0,:]
array([0.5282173 , 0.7548559 , 0.08869139])
y[1,1,:]
array([nan, nan, nan])
Update: this seems to work, but perhaps there are more elegant approaches:
mask1 = (y==0).any(axis=2)
y[np.logical_or(np.sum(np.isnan(y), axis=2) > 0, mask1)] = np.nan
y
array([[[ nan, nan, nan],
[ nan, nan, nan]],
[[0.5282173 , 0.7548559 , 0.08869139],
[ nan, nan, nan]]])
nan has the peculiar property of comparing not equal to anything, including nan itself:
>>> y = np.random.random(size=(2,2,3))
>>> y[0,0,2] = np.nan
>>> y[0,1,0] = np.nan
>>> y[0,0,1] = np.nan
>>> y[0,1,2] = np.nan
>>>
>>> y
array([[[0.03161193, nan, nan],
[ nan, 0.55789282, nan]],
[[0.78047397, 0.06949872, 0.65225197],
[0.84801579, 0.11298244, 0.07627531]]])
>>>
>>> y == np.nan
array([[[False, False, False],
[False, False, False]],
[[False, False, False],
[False, False, False]]])
To check for nan you have to use np.isnan
>>> np.isnan(y)
array([[[False, True, True],
[ True, False, True]],
[[False, False, False],
[False, False, False]]])
With this little modification your code will actually work:
>>> mask1 = (y==0).any(axis=2)
>>> mask2 = np.isnan(y).any(axis=2)
>>> y[np.logical_or(mask1, mask2)] = np.nan
>>>
>>> y
array([[[ nan, nan, nan],
[ nan, nan, nan]],
[[0.78047397, 0.06949872, 0.65225197],
[0.84801579, 0.11298244, 0.07627531]]])
As an addendum to @PaulPanzer's answer, I have attempted to get the same result with the minimum number of temp arrays. This answer is here for fun, and does not provide any benefits to outweigh the clarity and legibility of PaulPanzer's answer.
Instead of ndarray.any, you can check for zeros directly with ndarray.all and negate the resulting 2D array in place instead of the 3D one, avoiding a temp array. For the nans, you can use the property that any number added to (or subtracted from, multiplied by, etc.) nan results in nan: instead of ndarray.any, use ufunc.reduce to build the 2D matrix, which saves another 3D boolean array. You can't apply reduce to np.isnan directly, because it is a unary ufunc and reduce only works on binary ones.
# Check for zeros
mask = y.all(axis=2) # Straight to 2D, no temp arrays
mask = np.logical_not(mask, out=mask) # In place negation, no temp arrays
# Check for nans
nans = np.add.reduce(y, axis=2) # 2D temp array, not 3D
mask |= np.isnan(nans) # Another temp array, also 2D
I chose to use np.add because it is not likely to run into problems that cause false nans to appear (unlike say np.divide). Any overflows will become +/-inf, which will not trigger the isnan check.
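For completeness, the combined mask is then applied exactly as in the earlier answer:

# Broadcast nan into every (x,y) row whose mask entry is True
y[mask] = np.nan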

Equations containing outer products of vectors

Suppose A = (x)(x)^T, where x is a column vector.
We know from the diagonal elements in A, the value of x entries. But signs of them remains unknown. For example:
import numpy as np
A = np.array([[ 1.562, -0.833, -0.833, -0.031, -0.031,  0.167],
              [-0.833,  0.795,  0.167, -0.149,  0.167, -0.146],
              [-0.833,  0.167,  0.795,  0.167, -0.149, -0.146],
              [-0.031, -0.149,  0.167,  1.68 , -0.833, -0.833],
              [-0.031,  0.167, -0.149, -0.833,  1.68 , -0.833],
              [ 0.167, -0.146, -0.146, -0.833, -0.833,  1.792]])
np.sqrt(A.diagonal())
>>> array([ 1.24979998, 0.89162773, 0.89162773, 1.29614814, 1.29614814,
1.33865604])
But we still don't know the signs. With a mask we get the signs of the products:
A > 0
>>> array([[ True, False, False, False, False, True],
[False, True, True, False, True, False],
[False, True, True, True, False, False],
[False, False, True, True, False, False],
[False, True, False, False, True, False],
[ True, False, False, False, False, True]], dtype=bool)
How can I find the signs of the elements of x?
Note that (-x)(-x)^T = (x)(x)^T, so you can't distinguish x from -x. Given that, you can determine the sign pattern (i.e. you can determine whether two elements have the same or opposite signs). In fact, since each row of A is a scalar multiple of x, each row gives you the sign pattern (unless the row is all 0, which is possible if an element of x is 0). The same holds for the columns.
Note that your example A cannot be a product of the form (x)(x)^T: it has full rank, while the maximum possible rank of (x)(x)^T is 1.
For example,
In [14]: x = np.array([1.0, -2.0, -3.0, 4.0])
In [15]: np.outer(x, x)
Out[15]:
array([[ 1., -2., -3., 4.],
[ -2., 4., 6., -8.],
[ -3., 6., 9., -12.],
[ 4., -8., -12., 16.]])
Note the sign pattern in the product. Each row (and each column) is either (+, -, -, +) or (-, +, +, -).
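Putting that observation to work, here is a sketch (my own, under the assumption that A really is rank-1) that recovers x up to an overall sign: take the magnitudes from the diagonal and the sign pattern from the row with the largest diagonal entry, which is guaranteed nonzero unless x is the zero vector.

import numpy as np

x = np.array([1.0, -2.0, -3.0, 4.0])
A = np.outer(x, x)

mags = np.sqrt(A.diagonal())      # |x_i|
k = np.argmax(mags)               # row k has x_k != 0, so A[k] = x_k * x
x_rec = np.sign(A[k]) * mags      # equals x up to an overall sign
assert np.allclose(np.outer(x_rec, x_rec), A)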
In general, you can't.
For example, imagine the matrix A = [1].
How should anyone know whether x is [1] or [-1]?
