NumPy - Faster Operations on Masked Array?

I have a numpy array:
import numpy as np
arr = np.random.rand(100)
If I want to find its maximum value, I run np.amax, which runs 155,357 times a second on my machine.
However, for some reason, I have to mask some of its values. Let's, for example, mask just one cell:
import numpy.ma as ma
arr = ma.masked_array(arr, mask=[0]*99 + [1])
Now, finding the max is much slower, running 26,574 times a second.
This is only 17% of the speed of the same operation on a non-masked array.
Other operations behave similarly, for example subtract, add, and multiply: although on a masked array they operate on ALL OF THE VALUES, they run at only 4% of the speed of a non-masked array (15,343 vs. 497,663 runs per second).
I'm looking for a faster way to operate on masked arrays like this, whether it's using numpy or not.
(I need to run this on real data, which is arrays with multiple dimensions, and millions of cells)

MaskedArray is a subclass of the base numpy ndarray. It does not have compiled code of its own. Look at the numpy/ma/ directory for details, or the main file:
/usr/local/lib/python3.6/dist-packages/numpy/ma/core.py
A masked array has two key attributes, data and mask: one is the data array you used to create it, the other a boolean array of the same size.
So all operations have to take those two arrays into account. Not only does it calculate new data, it also has to calculate a new mask.
It can take several approaches (depending on the operation):
use the data as is
use compressed data - a new array with the masked values removed
use filled data, where the masked values are replaced by the fillvalue or some innocuous value (e.g. 0 when doing addition, 1 when doing multiplication).
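For a quick illustration of the last two approaches (my own toy example, not from the original answer), compressed and filled give differently shaped results on a small masked array:
import numpy as np
import numpy.ma as ma

m = ma.masked_array(np.arange(5.), mask=[0, 0, 1, 0, 1])

m.compressed()   # array([0., 1., 3.])          - masked values removed, shape shrinks
m.filled(0.)     # array([0., 1., 0., 3., 0.])  - masked values replaced, shape kept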
The number of masked values, 0 or all, makes little, if any, difference in speed.
So the speed differences that you see are not surprising. There's a lot of extra calculation going on. The ma.core.py file says this package was first developed in pre-numpy days, and incorporated into numpy around 2005. While there have been changes to keep it up to date, I don't think it has been significantly reworked.
Here's the code for the np.ma.max method:
def max(self, axis=None, out=None, fill_value=None, keepdims=np._NoValue):
    kwargs = {} if keepdims is np._NoValue else {'keepdims': keepdims}
    _mask = self._mask
    newmask = _check_mask_axis(_mask, axis, **kwargs)
    if fill_value is None:
        fill_value = maximum_fill_value(self)
    # No explicit output
    if out is None:
        result = self.filled(fill_value).max(
            axis=axis, out=out, **kwargs).view(type(self))
        if result.ndim:
            # Set the mask
            result.__setmask__(newmask)
            # Get rid of Infs
            if newmask.ndim:
                np.copyto(result, result.fill_value, where=newmask)
        elif newmask:
            result = masked
        return result
    # Explicit output
    ....
The key steps are
fill_value = maximum_fill_value(self) # depends on dtype
self.filled(fill_value).max(
    axis=axis, out=out, **kwargs).view(type(self))
You can experiment with filled to see what happens with your array.
In [40]: arr = np.arange(10.)
In [41]: arr
Out[41]: array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
In [42]: Marr = np.ma.masked_array(arr, mask=[0]*9 + [1])
In [43]: Marr
Out[43]:
masked_array(data=[0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, --],
mask=[False, False, False, False, False, False, False, False,
False, True],
fill_value=1e+20)
In [44]: np.ma.maximum_fill_value(Marr)
Out[44]: -inf
In [45]: Marr.filled()
Out[45]:
array([0.e+00, 1.e+00, 2.e+00, 3.e+00, 4.e+00, 5.e+00, 6.e+00, 7.e+00,
8.e+00, 1.e+20])
In [46]: Marr.filled(_44)
Out[46]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., -inf])
In [47]: arr.max()
Out[47]: 9.0
In [48]: Marr.max()
Out[48]: 8.0
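Since the key steps are just a fill plus an ordinary ndarray reduction, a minimal sketch of doing them by hand on plain ndarrays (skipping the MaskedArray bookkeeping, assuming you only need the reduced value and not a new masked result) looks like this:
import numpy as np
import numpy.ma as ma

arr = np.random.rand(100)
mask = np.zeros(arr.shape, dtype=bool)
mask[-1] = True            # mask just the last cell, as in the question

# Fill the masked slots with -inf (the float fill for a max) and use the fast ndarray max.
fast_max = np.where(mask, -np.inf, arr).max()

# Same value as the MaskedArray route, without the per-call MaskedArray overhead.
assert fast_max == ma.masked_array(arr, mask=mask).max()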

Related

numpy - Multidimensional boolean mask

I'm quite new to Python and numpy and I just cannot get this to work without manual iteration.
I have an n-dimensional data array with floating point values and an equally shaped boolean "mask" array. From that I need to get a new array in the same shape as both of them, with all values from the data array where the mask array at the same position is True. Everything else should be 0:
# given
data = np.array([[1., 2.], [3., 4.]])
mask = np.array([[True, False], [False, True]])
# target
[[1., 0.], [0., 4.]]
Seems like numpy.where() might offer this but I could not get it to work.
Bonus: Don't create a new array, but replace the data values in place where the mask is False, to avoid new memory allocation.
Thanks!
This should work:
data[~mask] = 0
A Numpy boolean array can be used as an index (https://docs.scipy.org/doc/numpy-1.15.0/user/basics.indexing.html#boolean-or-mask-index-arrays). The operation is applied only to the positions where the index is True, so here you first need to invert your mask, turning False into True, because it is the positions with a False mask value that you want to operate on.
Also, you can just multiply them, because True and False are treated as 1 and 0 respectively when a boolean array is used in mathematical operations. So,
#element-wise multiplication
data*mask
or
np.multiply(data, mask)
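Since the question mentions np.where, here is a short sketch of that route as well (it builds a new array rather than writing in place):
import numpy as np

data = np.array([[1., 2.], [3., 4.]])
mask = np.array([[True, False], [False, True]])

# Keep data where mask is True, put 0. everywhere else.
result = np.where(mask, data, 0.)
# array([[1., 0.],
#        [0., 4.]])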

Elegant solution to appending vector to matrix in Numpy?

I've seen others post on this, but it's not clear to me if there's a better solution. I've got a 2D NumPy array, and I'd like to append a column to it. For example:
import numpy as np
A = np.array([[2., 3.],[-1., -2.]])
e = np.ones(2)
print(A)
print(e)
B = np.hstack((A,e.reshape((2,1))))
print(B)
does exactly what I want. But is there a way to avoid this clunky use of reshape?
If you want to avoid using reshape then you have to append a column with the right dimensions:
e = np.ones((2, 1))
B = np.hstack((A,e))
Note the modification to the call to ones. The reason you have to use reshape at the moment is that numpy does not regard an array of shape (2,) as the same as an array of shape (2, 1). The second is a 2D array where one of the dimensions has size 1.
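A quick shape check (my own illustration, not part of the answers) makes the distinction visible:
import numpy as np

e = np.ones(2)
e.shape                    # (2,)   - 1-D, no column axis to stack on
e.reshape((2, 1)).shape    # (2, 1)
e[:, None].shape           # (2, 1) - same thing via indexing with None / np.newaxis
np.ones((2, 1)).shape      # (2, 1) - or just create it with the right shape up front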
My nomination for a direct solution is
np.concatenate((A, e[:, None]), axis=1)
The [:,None] turns e into a (2,1) which can be joined to the (2,2) to produce a (2,3). Reshape does the same, but isn't as syntactically pretty.
Solutions using hstack, vstack, and c_ do the same thing but hide one or more details.
In this case I think column_stack hides the most details.
np.column_stack((A, e))
Under the covers this does:
np.concatenate((A, np.array(e, copy=False, ndmin=2).T), axis=1)
That np.array(... ndmin=2).T is yet another way of doing the reshape.
There are many solutions. I like np.c_, which treats 1d inputs as columns (hence c), resulting in a concise, clutter-free, easy-to-read expression:
np.c_[A, e]
# array([[ 2., 3., 1.],
# [-1., -2., 1.]])
As Tim B says, to use hstack you need a (2,1) array. Alternatively (keeping your e as a one-dimensional array), vstack onto the transpose and take the transpose of the result:
In [11]: np.vstack((A.T, e)).T
Out[11]:
array([[ 2., 3., 1.],
[-1., -2., 1.]])

fill off diagonal of numpy array fails

I'm trying to fill the offset diagonals of a matrix:
loss_matrix = np.zeros((125,125))
np.diagonal(loss_matrix, 3).fill(4)
ValueError: assignment destination is read-only
Two questions:
1) Without iterating over indexes, how can I set the offset diagonals of a numpy array?
2) Why is the result of np.diagonal read only? The documentation for numpy.diagonal reads: "In NumPy 1.10, it will return a read/write view and writing to the returned array will alter your original array."
np.__version__
'1.10.1'
Judging by the discussion on the NumPy issue tracker, it looks like the feature is stuck in limbo and they never got around to fixing the documentation to say it was delayed.
If you need writability, you can force it. This will only work on NumPy 1.9 and up, since np.diagonal makes a copy on lower versions:
diag = np.diagonal(loss_matrix, 3)
# It's not writable. MAKE it writable.
diag.setflags(write=True)
diag.fill(4)
In an older version, diagflat constructs an array from a diagonal.
In [180]: M=np.diagflat(np.ones(125-3)*4,3)
In [181]: M.shape
Out[181]: (125, 125)
In [182]: M.diagonal(3)
Out[182]:
array([ 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4., 4.,... 4.])
In [183]: np.__version__
Out[183]: '1.8.2'
Effectively it does this (working from its Python code)
res = np.zeros((125, 125))
i = np.arange(122)
fi = i+3+i*125
res.flat[fi] = 4
That is, it finds the flattened-array equivalents of the diagonal's indices.
I can also get fi with:
In [205]: i=np.arange(0,122)
In [206]: np.ravel_multi_index((i,i+3),(125,125))
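On any NumPy version, a simple sketch that sidesteps the read-only diagonal view entirely is to index the off-diagonal positions directly:
import numpy as np

loss_matrix = np.zeros((125, 125))

# Row/column indices of the k=3 off-diagonal: (0, 3), (1, 4), ..., (121, 124).
i = np.arange(125 - 3)
loss_matrix[i, i + 3] = 4

# The (read-only) diagonal view now shows the written values.
assert (loss_matrix.diagonal(3) == 4).all()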

Normalise 2D Numpy Array: Zero Mean Unit Variance

I have a 2D Numpy array, in which I want to normalise each column to zero mean and unit variance. Since I'm primarily used to C++, the way I'm doing it is to loop over the elements in a column and do the necessary operations, then repeat this for all columns. I wanted to know about a pythonic way to do this.
Let class_input_data be my 2D array. I can get the column mean as:
column_mean = numpy.sum(class_input_data, axis = 0)/class_input_data.shape[0]
I then subtract the mean from all columns by:
class_input_data = class_input_data - column_mean
By now, the data should be zero mean. However, the value of:
numpy.sum(class_input_data, axis = 0)
isn't equal to 0, implying that I have done something wrong in my normalisation. By "isn't equal to 0", I don't mean very small numbers that can be attributed to floating point inaccuracies.
Something like:
import numpy as np
eg_array = 5 + (np.random.randn(10, 10) * 2)
normed = (eg_array - eg_array.mean(axis=0)) / eg_array.std(axis=0)
normed.mean(axis=0)
Out[14]:
array([ 1.16573418e-16, -7.77156117e-17, -1.77635684e-16,
9.43689571e-17, -2.22044605e-17, -6.09234885e-16,
-2.22044605e-16, -4.44089210e-17, -7.10542736e-16,
4.21884749e-16])
normed.std(axis=0)
Out[15]: array([ 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
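The same one-liner, written with explicit intermediates to show the broadcasting that replaces the per-column loop (my own restatement of the answer above):
import numpy as np

class_input_data = 5 + (np.random.randn(10, 10) * 2)

# One statistic per column, shape (10,).
column_mean = class_input_data.mean(axis=0)
column_std = class_input_data.std(axis=0)

# Broadcasting lines the (10,) statistics up against every row of the
# (10, 10) array, so each column is shifted and scaled by its own values.
normed = (class_input_data - column_mean) / column_std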

Why doesn't scipy's interpolate average over colocated values?

If I were to run the following code:
>>> from scipy.interpolate import interpolate
>>> import numpy as np
>>> data = np.arange(10)
>>> times = np.r_[np.arange(5),np.arange(5)]
>>> new_times = np.arange(5)
>>> f = interpolate.interp1d(times,data)
>>> interp_data = f(new_times)
I would naively (and hopefully) expect the following:
>>> interp_data
array([2.5, 3.5, 4.5, 5.5, 6.5])
based on the assumption that colocated values would be averaged and weighted accordingly in the interpolation. But, in fact, the result is:
>>> interp_data
array([ 0., 6., 7., 8., 9.])
What is causing this behaviour, and how could it be rectified?
From the interp1d documentation:
assume_sorted : bool, optional
    If False, values of x can be in any order and they are sorted first. If True, x has to be an array of monotonically increasing values.
I can only get the result you got by explicitly forcing assume_sorted to be True:
>>> f = interpolate.interp1d(times,data, assume_sorted=True)
>>> interp_data = f(new_times)
>>> interp_data
array([ 0., 6., 7., 8., 9.])
It appears from your code that assume_sorted defaulted to True, which is giving the answer you don't expect.
If you explicitly set it to False, according to the documentation, interp1d sorts it automatically, and then does the interpolation, giving
>>> f = interpolate.interp1d(times,data)
>>> interp_data = f(new_times)
>>> interp_data
array([ nan, 1., 2., 3., 4.])
which is consistent with the documentation.
I'm not sure exactly what you want, but it seems interp1d may not be the best way to achieve it. An interpolation function, f, should relate a single input to a single output, i.e.
from scipy.interpolate import interpolate
import numpy as np
data = np.arange(2.,8.)
times = np.arange(data.shape[0])
new_times = np.arange(0.5,5.,1.)
f = interpolate.interp1d(times,data)
interp_data = f(new_times)
Alternatively, maybe an answer like:
Get sums of pairs of elements in a numpy array
may be what you wanted?
No, interp1d would not weight, average or do anything else to the data for you.
It expects the data to be sorted. If your scipy is recent enough (0.14 or above), it has an assume_sorted keyword which you can set to False, and then it'll just sort it for you. The precise behavior for unsorted data is undefined.
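If the goal really is to average the colocated samples before interpolating, interp1d won't do it for you, but a short sketch with np.unique and np.bincount (my own suggestion, not from the answers above) produces the naively expected result:
import numpy as np
from scipy.interpolate import interp1d

data = np.arange(10.)
times = np.r_[np.arange(5), np.arange(5)]

# Group the samples by time and average each group.
unique_times, inverse = np.unique(times, return_inverse=True)
averaged = np.bincount(inverse, weights=data) / np.bincount(inverse)
# averaged -> array([2.5, 3.5, 4.5, 5.5, 6.5])

f = interp1d(unique_times, averaged)
f(np.arange(5))            # array([2.5, 3.5, 4.5, 5.5, 6.5])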
