Compare two numpy arrays and insert nans - python

I have two numpy arrays with the dimensions (120, 360). The first consists of integers (including zeros); the second consists of floats. I want to replace the values of the second array with nans everywhere there is a non-zero integer in the first array. Is there an easy and efficient way to do this?
Also, I'd like to replace the non-zero integers in the first array with nans and change the zeros to ones. Thanks in advance.

You can achieve this easily with boolean indexing into the array:
arr2[arr1 != 0] = numpy.nan
However, integer arrays don't support NaNs, so you'd have to convert your first array to a float array, i.e.
arr1 = arr1.astype(float)
arr1[arr1 != 0.0] = numpy.nan
arr1[arr1 == 0.0] = 1.0

Set up the arrays:
>>> import numpy as np
>>> x = np.array([[1,0],[0,4]], dtype=int)
>>> y = np.array([[1.1, 2.2],[3.3, 4.4]], dtype=float)
You can easily set the second array to nan where you want, like this:
>>> y[x != 0] = np.nan
>>> y
array([[ nan,  2.2],
       [ 3.3,  nan]])
Then convert the first array to floats (since NaN is not an integer) and set the values you want:
>>> x = x.astype(float)
>>> x[x != 0] = np.nan
>>> x[x == 0] = 1
>>> x
array([[ nan,   1.],
       [  1.,  nan]])

As a comment on the previous answers, I don't think comparing floats with == is such a good idea, and some operations are wasted. What about creating a temporary boolean array mask = (X != 0) while X is still an integer array, and using it as an index?
>>> mask = (X != 0)
>>> X = X.astype(float)
>>> X[mask] = np.nan
>>> X[~mask] = 1

I don't know your purpose in replacing values with NaNs, but you may want to consider using numpy's masked arrays instead (similar to Pierre's answer, but numpy has built-in mask support!):
import numpy.ma
# mask out values in arr2 wherever there is a non-zero integer in arr1
arr2 = numpy.ma.masked_array(arr2, mask=arr1)
# mask out the non-zero integers in arr1, and set all remaining values (the zeros) to 1
arr1 = numpy.ma.masked_array(arr1, mask=(arr1 != 0))
arr1[~arr1.mask] = 1
No integer-to-float conversion is needed, and this allows you to use a lot of numpy's functionality without running into problems. E.g., calculating the mean of an array containing NaNs is certainly a bad idea; with a masked array, this is no problem.
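To make that last point concrete, here's a minimal sketch (reusing the small arr1/arr2 arrays from the earlier answer) contrasting a plain NaN-filled array with a masked array:
import numpy as np
import numpy.ma as ma

arr1 = np.array([[1, 0], [0, 4]])          # integer array: non-zero means "mask this"
arr2 = np.array([[1.1, 2.2], [3.3, 4.4]])  # float data array

masked = ma.masked_array(arr2, mask=(arr1 != 0))

# a plain NaN array poisons reductions...
nan_version = np.where(arr1 != 0, np.nan, arr2)
print(nan_version.mean())  # nan

# ...while the masked array simply skips the masked entries
print(masked.mean())       # 2.75, the mean of 2.2 and 3.3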

Related

Python: Plot an array of strings with repeated entries vs float without for loop

Hi, I am trying to plot a numpy array of strings on the y axis, for example
arr = np.array(['a','a','bas','dgg','a']) #The actual strings are about 11 characters long
vs a float array of equal length. The string array I am working with is very large, ~100 million entries. One of the solutions I had in mind was to convert the string array to unique integer ids, for example,
vocab = np.unique(arr)
vocab = list(vocab)
arrId = np.zeros(len(arr))
for i in range(len(arr)):
    arrId[i] = vocab.index(arr[i])
and then matplotlib.pyplot.plot(arrId). But I cannot afford to run a for loop to convert the array of strings to an array of unique integer ids. In an initial search I could not find a way to map strings to unique ids without using a loop. Maybe I am missing something, but is there a smart way to do this in Python?
EDIT -
Thanks. The solutions provided use vocab, ind = np.unique(arr, return_inverse=True), where ind is the returned integer id array. But it seems like np.unique is O(N*log(N)) according to this (numpy.unique with order preserved), whereas pandas.unique is of order O(N). But I am not sure how to get ind from pandas.unique. Plotting the data, I guess, can be done in O(N). So I was wondering, is there a way to do this in O(N), perhaps by hashing of some sort?
numpy.unique used with the return_inverse argument gives you the inverse index, i.e. for each element the position of its unique value.
arr = np.array(['a','a','bas','dgg','a'])
unique, rev = np.unique(arr, return_inverse=True)
#unique: ['a' 'bas' 'dgg']
#rev: [0 0 1 2 0]
such that unique[rev] returns the original array ['a' 'a' 'bas' 'dgg' 'a'].
This can be easily used to plot the data.
import numpy as np
import matplotlib.pyplot as plt
arr = np.array(['a','a','bas','dgg','a'])
x = np.array([1,2,3,4,5])
unique, rev = np.unique(arr, return_inverse=True)
print(unique)
print(rev)
print(unique[rev])
fig,ax=plt.subplots()
ax.scatter(x, rev)
ax.set_yticks(range(len(unique)))
ax.set_yticklabels(unique)
plt.show()
you can factorize your strings:
In [75]: arr = np.array(['a','a','bas','dgg','a'])
In [76]: cats, idx = np.unique(arr, return_inverse=True)
In [77]: plt.plot(idx)
Out[77]: [<matplotlib.lines.Line2D at 0xf82da58>]
In [78]: cats
Out[78]:
array(['a', 'bas', 'dgg'],
dtype='<U3')
In [79]: idx
Out[79]: array([0, 0, 1, 2, 0], dtype=int64)
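Regarding the O(N) concern in the question's edit: pandas.factorize does the same job with a hash table rather than a sort, and returns the integer codes directly. A minimal sketch, assuming pandas is installed:
import numpy as np
import pandas as pd

arr = np.array(['a', 'a', 'bas', 'dgg', 'a'])

# factorize hashes the strings, so it runs in O(N); codes use
# first-seen order rather than sorted order
idx, cats = pd.factorize(arr)
print(idx)   # [0 0 1 2 0]
print(cats)  # ['a' 'bas' 'dgg']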
You can use the numpy unique function to return an array of the unique values:
print(np.unique(arr))
['a' 'bas' 'dgg']
collections.Counter also returns the values together with their counts:
import collections
print(collections.Counter(arr))
Counter({'a': 3, 'bas': 1, 'dgg': 1})
Does this help at all?

increment values in a numpy array multiple times [duplicate]

Simple Version:
if I do this:
import numpy as np
a = np.zeros(2)
a[[1, 1]] += np.array([1, 1])
I get [0, 1] as the output, but I would like [0, 2]. Is that possible somehow, using implicit numpy looping instead of looping over it myself?
What-I-actually-need-to-do version:
I have a structured array that contains an index, a value, and some boolean value. I would like to sum those values at those indices, based on the boolean. Clearly that can be done with a simple loop, but it seems like it should be possible with clever numpy indexing (as above).
For example, I have an array with 5 elements that I want to populate from the array with values, indices, and conditions:
import numpy as np
size = 5
nvalues = 10
np.random.seed(1)
a = np.zeros(nvalues, dtype=[('val', float), ('ix', int), ('cond', bool)])
a = np.rec.array(a)
a.val = np.random.rand(nvalues)
a.cond = (np.random.rand(nvalues) > 0.3)
a.ix = np.random.randint(size, size=nvalues)
# obvious solution
obvssum = np.zeros(size)
for i in a:
    if i.cond:
        obvssum[i.ix] += i.val
# is something this possible?
doesntwork = np.zeros(size)
doesntwork[a[a.cond].ix] += a[a.cond].val
print(doesntwork)
print(obvssum)
Output:
[ 0. 0. 0.61927097 0.02592623 0.29965467]
[ 0. 0. 1.05459336 0.02592623 1.27063303]
I think what's happening here is that if a[a.cond].ix were guaranteed to be unique, my method would work just fine, as noted in the simple example.
This is what the at method of NumPy ufuncs is for:
output = numpy.zeros(size)
numpy.add.at(output, a[a.cond].ix, a[a.cond].val)
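Applied to the simple version at the top of the question, a minimal sketch; the np.bincount alternative is my addition, not part of the original answer:
import numpy as np

# np.add.at performs unbuffered in-place addition,
# so repeated indices accumulate instead of being overwritten
a = np.zeros(2)
np.add.at(a, [1, 1], [1, 1])
print(a)  # [0. 2.]

# when the target is 1-D, np.bincount with weights does the same job
ix = np.array([1, 1])
val = np.array([1.0, 1.0])
print(np.bincount(ix, weights=val, minlength=2))  # [0. 2.]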

How do I stack vectors of different lengths in NumPy?

How do I stack column-wise n vectors of shape (x,) where x could be any number?
For example,
from numpy import *
a = ones((3,))
b = ones((2,))
c = vstack((a,b)) # <-- gives an error
c = vstack((a[:,newaxis],b[:,newaxis])) #<-- also gives an error
hstack works fine but concatenates along the wrong dimension.
Short answer: you can't. NumPy does not support jagged arrays natively.
Long answer:
>>> a = ones((3,))
>>> b = ones((2,))
>>> c = array([a, b], dtype=object)
>>> c
array([array([1., 1., 1.]), array([1., 1.])], dtype=object)
gives an array that may or may not behave as you expect. E.g. it doesn't support basic methods like sum or reshape, and you should treat it much as you'd treat the ordinary Python list [a, b] (iterate over it to perform operations instead of using vectorized idioms). (Recent NumPy versions require the explicit dtype=object for such ragged input.)
Several possible workarounds exist; the easiest is to coerce a and b to a common length, perhaps using masked arrays or NaN to signal that some indices are invalid in some rows. E.g. here's b as a masked array:
>>> ma.array(np.resize(b, a.shape[0]), mask=[False, False, True])
masked_array(data = [1.0 1.0 --],
             mask = [False False  True],
       fill_value = 1e+20)
This can be stacked with a as follows:
>>> ma.vstack([a, ma.array(np.resize(b, a.shape[0]), mask=[False, False, True])])
masked_array(data =
[[1.0 1.0 1.0]
[1.0 1.0 --]],
mask =
[[False False False]
[False False True]],
fill_value = 1e+20)
(For some purposes, scipy.sparse may also be interesting.)
In general, there is an ambiguity in putting together arrays of different lengths, because the alignment of the data might matter. Pandas has various advanced solutions to deal with that, e.g. merging series into DataFrames.
If you just want to populate columns starting from the first element, what I usually do is build a matrix and populate the columns. Of course you need to fill the empty spaces in the matrix with a null value (in this case np.nan):
a = ones((3,))
b = ones((2,))
arraylist = [a, b]
# allocate a nan-filled array sized for the longest vector
outarr = np.ones((np.max([len(ps) for ps in arraylist]), len(arraylist))) * np.nan
for i, c in enumerate(arraylist):  # populate columns
    outarr[:len(c), i] = c
In [108]: outarr
Out[108]:
array([[  1.,   1.],
       [  1.,   1.],
       [  1.,  nan]])
There is a newer library for efficiently handling this type of array: https://github.com/scikit-hep/awkward-array
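A minimal sketch of what that looks like, assuming the awkward package is installed (the exact API may vary between versions):
import awkward as ak

# a ragged ("jagged") array: rows of different lengths, no padding needed
c = ak.Array([[1.0, 1.0, 1.0], [1.0, 1.0]])
print(ak.sum(c, axis=1))  # [3, 2] -- per-row sums despite unequal lengths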
I know this is a really old post and that there may be a better way of doing this, BUT why not just use append for such an operation:
import numpy as np
a = np.ones((3,))
b = np.ones((2,))
c = np.append(a, b)
print(c)
output:
[1. 1. 1. 1. 1.]
(Note that this concatenates everything into one flat 1-D array rather than stacking.)
If you definitely want to use NumPy, you can match the shapes with np.nan and then "unpack" the nan-filled array later. Here is an example with functions.
import numpy as np
a = np.array([[3,3,3]]).astype(float)
b = np.array([[2,2]]).astype(float)
# Extend each vector with NaN to reach a common shape
def Pack_Matrices_with_NaN(List_of_matrices, Matrix_size):
    Matrix_with_nan = np.arange(Matrix_size)  # placeholder first row, dropped below
    for array in List_of_matrices:
        start_position = len(array[0])
        # pad each row with NaN columns until it reaches Matrix_size
        for x in range(start_position, Matrix_size):
            array = np.insert(array, (x), np.nan, axis=1)
        Matrix_with_nan = np.vstack([Matrix_with_nan, array])
    Matrix_with_nan = Matrix_with_nan[1:]  # drop the placeholder row
    return Matrix_with_nan
arrays = [a,b]
packed_matrices = Pack_Matrices_with_NaN(arrays, 5)
print(packed_matrices)
Output:
[[ 3. 3. 3. nan nan]
[ 2. 2. nan nan nan]]
However, the easiest way would be to append the arrays to a list:
import numpy as np
a = np.array([3,3,3])
b = np.array([2,2])
c = []
c.append(a)
c.append(b)
print(c)
Output:
[array([3, 3, 3]), array([2, 2])]
I used the following code to combine lists of different length in a numpy array and to keep the length information in a second array:
import numpy as np
# create an example list (number can be increased):
my_list=[np.ones(i) for i in np.arange(1000)]
# measure and store length and find max:
dlc=np.array([len(i) for i in my_list]) #list contains the data length code
max_length=max(dlc)
# now we allocate an array pre-filled with nan, so unused slots are well defined
result = np.full((len(my_list), max_length), np.nan)
# populate:
for i in np.arange(len(dlc)):
    result[i][np.arange(dlc[i])] = my_list[i]
# check how the 10th element looks like
print(result[10],dlc[10])
I'm sure the loops could still be improved, but it already runs quite quickly because the memory is allocated up front.

How can I check whether a numpy array is empty or not?

How can I check whether a numpy array is empty or not?
I used the following code, but this fails if the array contains a zero.
if not self.Definition.all():
Is this the solution?
if self.Definition == array([]):
You can always take a look at the .size attribute. It is defined as an integer, and is zero (0) when there are no elements in the array:
import numpy as np
a = np.array([])
if a.size == 0:
    # Do something when `a` is empty
    pass
https://numpy.org/devdocs/user/quickstart.html (2020.04.08)
NumPy’s main object is the homogeneous multidimensional array. It is a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In NumPy dimensions are called axes.
(...) NumPy’s array class is called ndarray. (...) The more important attributes of an ndarray object are:
ndarray.ndim
the number of axes (dimensions) of the array.
ndarray.shape
the dimensions of the array. This is a tuple of integers indicating the size of the array in each dimension. For a matrix with n rows and m columns, shape will be (n,m). The length of the shape tuple is therefore the number of axes, ndim.
ndarray.size
the total number of elements of the array. This is equal to the product of the elements of shape.
One caveat, though.
Note that np.array(None).size returns 1!
This is because a.size is equivalent to np.prod(a.shape),
np.array(None).shape is (), and an empty product is 1.
>>> import numpy as np
>>> np.array(None).size
1
>>> np.array(None).shape
()
>>> np.prod(())
1.0
Therefore, I use the following to test if a numpy array has elements:
>>> def elements(array):
...     return array.ndim and array.size
>>> elements(np.array(None))
0
>>> elements(np.array([]))
0
>>> elements(np.zeros((2,3,4)))
24
Why would we want to check if an array is empty? Arrays don't grow or shrink in the same way that lists do. Starting with an 'empty' array and growing it with np.append is a frequent novice error.
Using a list in if alist: hinges on its boolean value:
In [102]: bool([])
Out[102]: False
In [103]: bool([1])
Out[103]: True
But trying to do the same with an array produces (in version 1.18):
In [104]: bool(np.array([]))
/usr/local/bin/ipython3:1: DeprecationWarning: The truth value
of an empty array is ambiguous. Returning False, but in
future this will result in an error. Use `array.size > 0` to
check that an array is not empty.
#!/usr/bin/python3
Out[104]: False
In [105]: bool(np.array([1]))
Out[105]: True
and bool(np.array([1,2])) produces the infamous ambiguity error.
edit
The accepted answer suggests size:
In [11]: x = np.array([])
In [12]: x.size
Out[12]: 0
But I (and most others) check the shape more than the size:
In [13]: x.shape
Out[13]: (0,)
Another thing in its favor is that it 'maps' on to an empty list:
In [14]: x.tolist()
Out[14]: []
But there are other arrays with size 0 that aren't 'empty' in that last sense:
In [15]: x = np.array([[]])
In [16]: x.size
Out[16]: 0
In [17]: x.shape
Out[17]: (1, 0)
In [18]: x.tolist()
Out[18]: [[]]
In [19]: bool(x.tolist())
Out[19]: True
np.array([[],[]]) is also size 0, but shape (2,0) and len 2.
While the concept of an empty list is well defined, an empty array is not well defined. One empty list is equal to another. The same can't be said for a size 0 array.
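A quick sketch of that last point (np.array_equal compares shapes as well as contents):
import numpy as np

print([] == [])                                           # True: empty lists are equal
print(np.array_equal(np.zeros((0,)), np.zeros((2, 0))))   # False: both size 0, different shapes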
The answer really depends on
what you mean by 'empty'?
what you are really testing for?

How do I remove all zero elements from a NumPy array?

I have a rank-1 numpy.array of which I want to make a boxplot. However, I want to exclude all values equal to zero in the array. Currently, I solve this by looping over the array and copying each value to a new array if it is not equal to zero. However, as the array consists of 86 000 000 values and I have to do this multiple times, this takes a lot of patience.
Is there a more intelligent way to do this?
For a NumPy array a, you can use
a[a != 0]
to extract the values not equal to zero.
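For the boxplot use case from the question, a minimal sketch:
import numpy as np
import matplotlib.pyplot as plt

a = np.array([0.0, 1.2, 0.0, 3.4, 0.0, 5.6])
plt.boxplot(a[a != 0])  # boxplot of the non-zero values only
plt.show()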
This is a case where you want to use masked arrays; it keeps the shape of your array and it is automatically recognized by all numpy and matplotlib functions.
X = np.random.randn(1000, 5)
X[np.abs(X) < .1] = 0  # some zeros
X = np.ma.masked_equal(X, 0)
plt.boxplot(X)  # masked values are not plotted
# other functionality of masked arrays
X.compressed()  # get a normal array with masked values removed
X.mask          # get a boolean array of the mask
X.mean()        # automatically discards masked values
I decided to compare the runtime of the different approaches mentioned here. I've used my library simple_benchmark for this.
The boolean indexing with array[array != 0] seems to be the fastest (and shortest) solution.
For smaller arrays the MaskedArray approach is very slow compared to the other approaches, even though for larger arrays it is as fast as the boolean indexing approach; for moderately sized arrays there is not much difference between them.
Here is the code I've used:
from simple_benchmark import BenchmarkBuilder
import numpy as np

bench = BenchmarkBuilder()

@bench.add_function()
def boolean_indexing(arr):
    return arr[arr != 0]

@bench.add_function()
def integer_indexing_nonzero(arr):
    return arr[np.nonzero(arr)]

@bench.add_function()
def integer_indexing_where(arr):
    return arr[np.where(arr != 0)]

@bench.add_function()
def masked_array(arr):
    return np.ma.masked_equal(arr, 0)

@bench.add_arguments('array size')
def argument_provider():
    for exp in range(3, 25):
        size = 2**exp
        arr = np.random.random(size)
        arr[arr < 0.1] = 0  # add some zeros
        yield size, arr

r = bench.run()
r.plot()
You can index with a Boolean array. For a NumPy array A:
res = A[A != 0]
You can use Boolean array indexing as above, bool type conversion, np.nonzero, or np.where. Here's some performance benchmarking:
# Python 3.7, NumPy 1.14.3
np.random.seed(0)
A = np.random.randint(0, 5, 10**8)
%timeit A[A != 0] # 768 ms
%timeit A[A.astype(bool)] # 781 ms
%timeit A[np.nonzero(A)] # 1.49 s
%timeit A[np.where(A)] # 1.58 s
I would like to suggest you simply utilize NaN for cases like this, where you'd like to ignore some values but still want to keep the procedure statistically as meaningful as possible. So:
In []: X = randn(1000, 5)
In []: X[abs(X) < .1] = NaN
In []: isnan(X).sum(0)
Out[]: array([82, 84, 71, 81, 73])
In []: boxplot(X)
A simple line of code can get you the indices of all non-zero values:
np.argwhere(array)
example:
import numpy as np
array = [0, 1, 0, 3, 4, 5, 0]
array2 = np.argwhere(array)
print(array2)
[[1]
 [3]
 [4]
 [5]]
Note that np.argwhere returns the indices of the non-zero elements, not the elements themselves; here they just happen to coincide with the values.
[i for i in arr if i != 0.0] if the numbers are floats,
or [i for i in arr if i != 0] if the numbers are ints.
