int array with missing values in numpy - python

Numpy int arrays can't store missing values.
>>> import numpy as np
>>> np.arange(10)
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
>>> myArray = np.arange(10)
>>> myArray.dtype
dtype('int32')
>>> myArray[0] = None
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
TypeError: int() argument must be a string, a bytes-like object or a number, not 'NoneType'
>>> myArray.astype(dtype='float')
array([ 0.,  1.,  2.,  3.,  4.,  5.,  6.,  7.,  8.,  9.])
>>> myFloatArray = myArray.astype(dtype='float')
>>> myFloatArray[0] = None
>>> myFloatArray
array([ nan,   1.,   2.,   3.,   4.,   5.,   6.,   7.,   8.,   9.])
Pandas warns about this in its docs (Caveats and Gotchas: Support for int NA), and Wes McKinney reiterates the point in a Stack Overflow answer.
I need to be able to store missing values in an int array. I'm INSERTing rows into my database which I've set up to accept only ints of varying sizes.
My current work around is to store the array as an object, which can hold both ints and None-types as elements.
>>> myArray.astype(dtype='object')
array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
>>> myObjectArray = myArray.astype(dtype='object')
>>> myObjectArray[0] = None
>>> myObjectArray
array([None, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=object)
This seems to be memory intensive and slow for large data-sets. I was wondering if anyone has a better solution while the numpy development is underway.
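One alternative worth sketching here is NumPy's masked arrays, which flag entries as missing while keeping the integer dtype. This is real `np.ma` API, but whether it is actually lighter than the object workaround for a given INSERT path is an assumption, not something the question tested:

```python
import numpy as np

# Masked arrays keep the integer dtype and track missingness in a
# separate boolean mask instead of widening to float or object.
arr = np.ma.masked_array(np.arange(10))
arr[0] = np.ma.masked        # mark position 0 as missing

print(arr)                   # [-- 1 2 3 4 5 6 7 8 9]
print(arr.dtype.kind)        # 'i'  (still an integer array)
print(arr.filled(-1))        # pick a sentinel when exporting, e.g. to SQL
```

`arr.filled(x)` converts back to a plain ndarray with a sentinel of your choice, which may or may not suit a database column that must hold NULL.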

I found a very quick way to convert all the missing values in my dataframe into None types: the .where method
mydata = mydata.where( pd.notnull( mydata ), None )
It is much less memory intensive than what I was doing before.
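A small self-contained sketch of that trick. The astype(object) cast is my addition: without it, some pandas versions coerce the None straight back to NaN in float columns.

```python
import numpy as np
import pandas as pd

mydata = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                       "b": [np.nan, 5.0, 6.0]})

# pd.notnull gives a boolean mask; .where keeps values where the mask is
# True and substitutes None elsewhere. Casting to object first lets the
# columns actually hold None instead of re-coercing it to NaN.
cleaned = mydata.astype(object).where(pd.notnull(mydata), None)

print(cleaned["a"].tolist())   # [1.0, None, 3.0]
```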

Related

find infinity values and replace with maximum per vector in a numpy array

Suppose I have the following array with shape (3, 5):
array = np.array([[1, 2, 3, inf, 5],
                  [10, 9, 8, 7, 6],
                  [4, inf, 2, 6, inf]])
Now I want to find the infinity values per vector and replace them with the maximum of that vector, with a lower limit of 1.
So the output for this example should be:
array_solved = np.array([[1, 2, 3, 5, 5],
                         [10, 9, 8, 7, 6],
                         [4, 6, 2, 6, 6]])
I could do this by looping over every vector of the array and apply:
idx_inf = np.isinf(array_vector)
max_value = np.max(np.append(array_vector[~idx_inf], 1.0))
array_vector[idx_inf] = max_value
But I guess there is a faster way.
Anyone an idea?
One way is to first convert the infs to NaNs with an np.isinf mask, and then replace those NaNs with the row maxima via np.nanmax:
array[np.isinf(array)] = np.nan
array[np.isnan(array)] = np.nanmax(array, axis=1)
to get
>>> array
array([[ 1.,  2.,  3.,  5.,  5.],
       [10.,  9.,  8.,  7.,  6.],
       [ 4., 10.,  2.,  6.,  6.]])
import numpy as np
array = np.array([[1, 2, 3, np.inf, 5],
                  [10, 9, 8, 7, 6],
                  [4, np.inf, 2, 6, np.inf]])
n, m = array.shape
array[np.isinf(array)] = -np.inf
mx_array = np.repeat(np.max(array, axis=1), m).reshape(n, m)
ind = np.where(np.isinf(array))
array[ind] = mx_array[ind]
Output array:
array([[ 1.,  2.,  3.,  5.,  5.],
       [10.,  9.,  8.,  7.,  6.],
       [ 4.,  6.,  2.,  6.,  6.]])
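For completeness, a vectorized sketch that also enforces the lower limit of 1 from the question (a clause the answers above leave out): hide the infs from the max with NaN, take the per-row nanmax, and broadcast it back onto the inf positions.

```python
import numpy as np

arr = np.array([[1, 2, 3, np.inf, 5],
                [10, 9, 8, 7, 6],
                [4, np.inf, 2, 6, np.inf]])

mask = np.isinf(arr)
finite = np.where(mask, np.nan, arr)      # hide infs from the max
row_max = np.nanmax(finite, axis=1)       # per-row max of finite values
row_max = np.maximum(row_max, 1.0)        # enforce the lower limit of 1
arr[mask] = np.broadcast_to(row_max[:, None], arr.shape)[mask]

print(arr)
```

Because the row maxima are broadcast per row before indexing with the mask, each inf gets its own row's maximum regardless of how many infs each row contains.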

Unique entries in columns of a 2D numpy array

I have an array of integers:
import numpy as np
demo = np.array([[1, 2, 3],
                 [1, 5, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [4, 2, 3],
                 [4, 2, 12],
                 [10, 11, 13]])
And I want an array of unique values in the columns, padded with something if necessary (e.g. nan):
[[1, 4, 7, 10, nan],
 [2, 5, 8, 11, nan],
 [3, 6, 9, 12, 13]]
It does work when I iterate over the transposed array and use a boolean_indexing solution from a previous question. But I was hoping there would be a built-in method:
solution = []
for row in np.unique(demo.T, axis=1):
    solution.append(np.unique(row))

def boolean_indexing(v, fillval=np.nan):
    lens = np.array([len(item) for item in v])
    mask = lens[:, None] > np.arange(lens.max())
    out = np.full(mask.shape, fillval)
    out[mask] = np.concatenate(v)
    return out

print(boolean_indexing(solution))
AFAIK, there is no built-in solution for that. That said, your solution seems a bit complex to me. You could create an initialized array and fill it with a simple loop (since you already use loops anyway).
solution = [np.unique(row) for row in np.unique(demo.T, axis=1)]
result = np.full((len(solution), max(map(len, solution))), np.nan)
for i, arr in enumerate(solution):
    result[i][:len(arr)] = arr
If you want to avoid the loop you could do:
demo = demo.astype(np.float32) # nan only works on floats
sort = np.sort(demo, axis=0)
diff = np.diff(sort, axis=0)
np.place(sort[1:], diff == 0, np.nan)
sort.sort(axis=0)
edge = np.argmax(sort, axis=0).max()
result = sort[:edge]
print(result.T)
Output:
array([[ 1.,  4.,  7., 10., nan],
       [ 2.,  5.,  8., 11., nan],
       [ 3.,  6.,  9., 12., 13.]], dtype=float32)
Not sure if this is any faster than the solution given by Jérôme.
EDIT
A slightly better solution
demo = demo.astype(np.float32)
sort = np.sort(demo, axis=0)
mask = np.full(sort.shape, False, dtype=bool)
np.equal(sort[1:], sort[:-1], out=mask[1:])
np.place(sort, mask, np.nan)
edge = (~mask).sum(0).max()
result = np.sort(sort, axis=0)[:edge]
print(result.T)
Output:
array([[ 1.,  4.,  7., 10., nan],
       [ 2.,  5.,  8., 11., nan],
       [ 3.,  6.,  9., 12., 13.]], dtype=float32)
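If pandas is available, here is another short sketch: take Series.unique per column, and let pandas' index alignment do the NaN padding automatically. (unique preserves order of appearance rather than sorting, which happens to match the desired output for this data.)

```python
import numpy as np
import pandas as pd

demo = np.array([[1, 2, 3],
                 [1, 5, 3],
                 [4, 5, 6],
                 [7, 8, 9],
                 [4, 2, 3],
                 [4, 2, 12],
                 [10, 11, 13]])

# Each column's uniques become a Series; when the Series have different
# lengths, pandas pads the shorter ones with NaN during alignment.
result = pd.DataFrame(demo).apply(lambda col: pd.Series(col.unique())).to_numpy().T

print(result)
```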

np.where equivalent for 1-D arrays

I am trying to fill nan values in an array with values from another array. Since the arrays I am working on are 1-D, np.where is not working. However, following the tip in the documentation I tried the following:
import numpy as np
sample = [1, 2, np.nan, 4, 5, 6, np.nan]
replace = [3, 7]
new_sample = [new_value if condition else old_value for (new_value, condition, old_value) in zip(replace, np.isnan(sample), sample)]
However, instead of the output I expected, [1, 2, 3, 4, 5, 6, 7], I get:
[Out]: [1, 2]
What am I doing wrong?
np.where works
In [561]: sample = np.array([1, 2, np.nan, 4, 5, 6, np.nan])
Use isnan to identify the nan values (don't use ==)
In [562]: np.isnan(sample)
Out[562]: array([False, False, True, False, False, False, True])
In [564]: np.where(np.isnan(sample))
Out[564]: (array([2, 6], dtype=int32),)
Either one, the boolean or the where tuple can index the nan values:
In [565]: sample[Out[564]]
Out[565]: array([nan, nan])
In [566]: sample[Out[562]]
Out[566]: array([nan, nan])
and be used to replace:
In [567]: sample[Out[562]]=[1,2]
In [568]: sample
Out[568]: array([1., 2., 1., 4., 5., 6., 2.])
The three-parameter form also works, but returns a copy.
In [571]: np.where(np.isnan(sample),999,sample)
Out[571]: array([ 1., 2., 999., 4., 5., 6., 999.])
You can use numpy.argwhere. But @hpaulj shows that numpy.where works just as well.
import numpy as np
sample = np.array([1, 2, np.nan, 4, 5, 6, np.nan])
replace = np.array([3, 7])
sample[np.argwhere(np.isnan(sample)).ravel()] = replace
# array([ 1., 2., 3., 4., 5., 6., 7.])
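As for why the question's comprehension returns [1, 2]: zip stops at its shortest argument, and replace has only two elements. A minimal fix, assuming the fill values are listed in NaN order, is to draw them lazily from an iterator instead of zipping:

```python
import numpy as np

sample = [1, 2, np.nan, 4, 5, 6, np.nan]
replace = iter([3, 7])

# next(replace) is only called when the value is NaN, so the short
# replacement list no longer truncates the loop.
new_sample = [next(replace) if np.isnan(v) else v for v in sample]

print(new_sample)   # [1, 2, 3, 4, 5, 6, 7]
```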

How to substitute NaNs in a numpy array with elements in another list

I'm facing an issue with a basic substitution. I have two arrays: one contains numbers and NaNs, and the other contains the numbers that are supposed to replace the NaNs, ordered as I wish. As an example:
x1 = [NaN, 2, 3, 4, 5, NaN, 7, 8, NaN, 10] and
fill = [1, 6, 9], and I want to obtain by index-wise replacement an array like:
x1_final = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
I have written this idiotic piece of code, which ends up substituting every NaN with the last element of the fill array:
for j in range(0, len(x1)):
    if np.isnan(x1[j]).any():
        for i in range(0, len(fill)):
            x1[j] = fill[i]
How do I manage to achieve my result?
Does this work for you?
train = np.array([2, 4, 4, 8, 32, np.NaN, 12, np.NaN])
fill = [1,3]
train[np.isnan(train)] = fill
print(train)
Output:
[ 2.  4.  4.  8. 32.  1. 12.  3.]
The following should work even if the size of fill doesn't match the number of nans
>>> x1 = np.random.randint(0, 4, (10,))
>>> x1 = x1/x1 + x1
>>>
>>> x1
array([ 4., nan, nan, 4., nan, 3., nan, 2., 3., 4.])
>>>
>>> fill = np.arange(3)
>>>
>>> loc, = np.where(np.isnan(x1))
>>>
>>> x1[loc[:len(fill)]] = fill[:len(loc)]
>>>
>>> x1
array([ 4., 0., 1., 4., 2., 3., nan, 2., 3., 4.])
The answer from #chrisz is the correct one, because you have the power of numpy, so use it :-)
But if you still want to do it the way that you started, you can fix the code like this:
import numpy as np

x1 = [np.nan, 2, 3, 4, 5, np.nan, 7, 8, np.nan, 10]
fill = [1, 6, 9]
i = 0
for j in range(0, len(x1)):
    if np.isnan(x1[j]).any():
        x1[j] = fill[i]
        i += 1
print(x1)
You were almost there; you just needed to advance the fill index correctly (maybe adding a check for an out-of-bounds index).
But, as I said, definitely go the numpy way, it's faster and cleaner.

What does .shape[] do in "for i in range(Y.shape[0])"?

I'm trying to break down a program line by line. Y is a matrix of data but I can't find any concrete data on what .shape[0] does exactly.
for i in range(Y.shape[0]):
    if Y[i] == -1:
This program uses numpy, scipy, matplotlib.pyplot, and cvxopt.
The shape attribute for numpy arrays returns the dimensions of the array. If Y has n rows and m columns, then Y.shape is (n,m). So Y.shape[0] is n.
In [46]: Y = np.arange(12).reshape(3,4)
In [47]: Y
Out[47]:
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])
In [48]: Y.shape
Out[48]: (3, 4)
In [49]: Y.shape[0]
Out[49]: 3
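Tying this back to the question's loop, here is a sketch with a hypothetical 1-D label vector Y (for a 1-D array, Y.shape[0] is simply its length):

```python
import numpy as np

Y = np.array([1, -1, 1, -1])      # hypothetical label vector

negatives = []
for i in range(Y.shape[0]):       # Y.shape[0] == len(Y) == 4
    if Y[i] == -1:
        negatives.append(i)

print(negatives)   # [1, 3]
```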
shape is a tuple that gives the dimensions of the array.
>>> import numpy as np
>>> c = np.arange(20).reshape(5, 4)
>>> c
array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19]])
>>> c.shape[0]
5
gives the number of rows, and
>>> c.shape[1]
4
gives the number of columns.
shape is a tuple that gives the array's size along each dimension. Since you index it with 0 (Y.shape[0]), you are working along the first dimension of your array.
From
Link
An array has a shape given by the number of elements along each axis:
>>> a = floor(10*random.random((3,4)))
>>> a
array([[ 7.,  5.,  9.,  3.],
       [ 7.,  2.,  7.,  8.],
       [ 6.,  8.,  3.,  2.]])
>>> a.shape
(3, 4)
and http://www.scipy.org/Numpy_Example_List#shape has some more examples.
In Python, suppose you have loaded the data into some variable train:
train = pandas.read_csv('file_name')
>>> train
   [[ 1.,  2.,  3.],
    [ 5.,  1.,  2.]]
To check the dimensions of the data stored in train:
>>> train.shape
(2, 3)
>>> train.shape[0]  # number of rows
2
>>> train.shape[1]  # number of columns
3
In pandas, shape gives the number of rows and columns:
train = pd.read_csv('file_name')  # load the data
train.shape[0]  # number of rows
train.shape[1]  # number of columns
shape is a tuple with two entries, rows and columns: shape[0] gives you the number of rows and shape[1] gives you the number of columns.
