Is NumPy's masked array memory efficient?

I was wondering: are numpy's masked arrays able to store a compact representation of the available values? In other words, if I have a numpy array in which no values are set, will it be stored in memory with negligible size?
This is not just a casual question: I need this kind of memory optimization for an application I am developing.

No, a masked array is not more compact.
In [344]: m = np.ma.masked_array([1,2,3,4],[1,0,0,1])
In [345]: m
Out[345]:
masked_array(data = [-- 2 3 --],
mask = [ True False False True],
fill_value = 999999)
In [346]: m.data
Out[346]: array([1, 2, 3, 4])
In [347]: m.mask
Out[347]: array([ True, False, False, True], dtype=bool)
It contains both the original (full) array, and a mask. The mask may be a scalar, or it may be a boolean array with the same shape as the data.
scipy.sparse stores just the nonzero values of an array, though the space savings depend on the storage format and the sparsity. So you might simulate your masking with sparsity, or take ideas from that representation.
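For a sense of scale, here is a minimal sketch (not from the original answer) comparing a dense array against a CSR sparse matrix that holds a single nonzero value:
from scipy import sparse
import numpy as np

dense = np.zeros((1000, 1000))
dense[0, 0] = 1.0
s = sparse.csr_matrix(dense)
dense.nbytes                                          # 8000000 bytes
s.data.nbytes + s.indices.nbytes + s.indptr.nbytes    # a few kilobytes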
What do you plan to do with these arrays? Just access items, or do calculations?
Masked arrays are most useful for data that is mostly good, with a modest number of 'bad' values. For example, real-life data series with occasional glitches, or monthly data padded to 31 days. Masking lets you keep the data in a rectangular arrangement and still calculate things like the mean and sum without using the masked values.
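For example, with the masked array m from above, the reductions simply skip the masked entries:
m.mean()   # 2.5 -- the masked 1 and 4 are ignored
m.sum()    # 5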

Related

numpy - Multidimensional boolean mask

I'm quite new to Python and numpy, and I just cannot get this to work without manual iteration.
I have an n-dimensional data array with floating-point values and an equally shaped boolean "mask" array. From those I need to get a new array in the same shape as the other two, with all values from the data array where the mask array at the same position is True. Everything else should be 0:
# given
data = np.array([[1., 2.], [3., 4.]])
mask = np.array([[True, False], [False, True]])
# target
[[1., 0.], [0., 4.]]
Seems like numpy.where() might offer this but I could not get it to work.
Bonus: Don't create new array but replace data values in-position where mask is False to prevent new memory allocation.
Thanks!
This should work
data[~mask] = 0
A numpy boolean array can be used as an index (https://docs.scipy.org/doc/numpy-1.15.0/user/basics.indexing.html#boolean-or-mask-index-arrays). The assignment is applied only at positions where the index is True, so you first need to invert the mask: you want to zero out the positions where mask is False.
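Applied to the example data, this modifies the array in place, which also satisfies the "bonus" request to avoid a new allocation:
data[~mask] = 0   # in-place: zeros the entries where mask is False
data
# array([[1., 0.],
#        [0., 4.]])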
Also, you can just multiply them, because True and False are treated as 1 and 0 respectively when a boolean array is used in mathematical operations. So,
#element-wise multiplication
data*mask
or
np.multiply(data, mask)
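Since the question mentions numpy.where, note that it does work directly here as well (a short sketch, not from the original answers):
np.where(mask, data, 0)   # keep data where mask is True, 0 elsewhere
# array([[1., 0.],
#        [0., 4.]])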

numpy: Why is there a difference between (x,1) and (x, ) dimensionality

I am wondering why in numpy there are one-dimensional arrays of dimension (length, 1) and also one-dimensional arrays of dimension (length,) without a second value.
I am running into this quite frequently, e.g. when using np.concatenate() which then requires a reshape step beforehand (or I could directly use hstack/vstack).
I can't think of a reason why this behavior is desirable. Can someone explain?
Edit:
It was suggested in one of the comments that my question is a possible duplicate. I am more interested in the underlying working logic of Numpy, not in the fact that there is a distinction between 1d and 2d arrays, which I think is the point of the mentioned thread.
The data of an ndarray is stored as a 1d buffer - just a block of memory. The multidimensional nature of the array is produced by the shape and strides attributes, and the code that uses them.
The numpy developers chose to allow for an arbitrary number of dimensions, so the shape and strides are represented as tuples of any length, including 0 and 1.
In contrast, MATLAB was built around FORTRAN programs that were developed for matrix operations. In the early days everything in MATLAB was a 2d matrix. Around 2000 (v3.5) it was generalized to allow more than 2d, but never less. The numpy np.matrix still follows that old 2d MATLAB constraint.
If you come from the MATLAB world you are used to these 2 dimensions, and to the distinction between a row vector and a column vector. But in math and physics (outside MATLAB's influence), a vector is a 1d array. Python lists are inherently 1d, as are C arrays. To get 2d you have to have lists of lists, or arrays of pointers to arrays, with x[1][2]-style indexing.
Look at the shape and strides of this array and its variants (the strides below reflect 4-byte integers; on a platform with 8-byte default ints you would see 8s instead):
In [48]: x=np.arange(10)
In [49]: x.shape
Out[49]: (10,)
In [50]: x.strides
Out[50]: (4,)
In [51]: x1=x.reshape(10,1)
In [52]: x1.shape
Out[52]: (10, 1)
In [53]: x1.strides
Out[53]: (4, 4)
In [54]: x2=np.concatenate((x1,x1),axis=1)
In [55]: x2.shape
Out[55]: (10, 2)
In [56]: x2.strides
Out[56]: (8, 4)
MATLAB adds new dimensions at the end. It orders its values like an order='F' array, and can readily change a (n,1) matrix to a (n,1,1,1). numpy defaults to order='C', and readily expands an array's dimensions at the start. Understanding this is essential when taking advantage of broadcasting.
Thus x1 + x broadcasts as (10,1) + (10,) => (10,1) + (1,10) => (10,10).
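Continuing with the x and x1 defined above, you can verify this (illustrative):
(x1 + x).shape   # (10, 10) -- the (10,) operand broadcasts as (1, 10)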
Because of broadcasting a (n,) array is more like a (1,n) one than a (n,1) one. A 1d array is more like a row matrix than a column one.
In [64]: np.matrix(x)
Out[64]: matrix([[0, 1, 2, 3, 4, 5, 6, 7, 8, 9]])
In [65]: _.shape
Out[65]: (1, 10)
The point with concatenate is that it requires matching dimensions. It does not use broadcasting to adjust dimensions. There are a bunch of stack functions that ease this constraint, but they do so by adjusting the dimensions before using concatenate. Look at their code (readable Python).
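For instance, with the 1d x from above (an illustrative aside, not part of the original answer):
np.concatenate((x, x)).shape   # (20,) -- 1d inputs stay 1d
np.vstack((x, x)).shape        # (2, 10) -- vstack promotes each input to 2d first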
So a proficient numpy user needs to be comfortable with that generalized shape tuple, including the empty () (0d array), (n,) 1d, and up. For more advanced stuff understanding strides helps as well (look for example at the strides and shape of a transpose).
Much of it is a matter of syntax. The expression (x) isn't a tuple at all (the parentheses are redundant); (x,), however, is.
The difference between (x,) and (x,1) goes even further. You can take a look at the examples in previous questions like this one. Quoting the example from it, this is a 1D numpy array:
>>> np.array([1, 2, 3]).shape
(3,)
But this one is 2D:
>>> np.array([[1, 2, 3]]).shape
(1, 3)
Reshape does not make a copy unless it needs to, so it should be safe to use.
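A quick illustration of that view behavior (a sketch, not from the original answer):
a = np.arange(6)
b = a.reshape(2, 3)   # a view: no data is copied
b[0, 0] = 99
a[0]                  # 99 -- b shares a's memory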

Numpy Masking with Array

I'm not certain of the best way of asking this question, so I apologize ahead of time.
I'm trying to find a peak on each row of an NxM numpy array of audio signals. Each row in the array is treated individually, and I'd like to get all values a certain number of standard deviations above the noise floor for each N in the array, in frequency space. In this experiment I know that I do not have a signal above 400Hz, so I'm using that as my noise floor. I'm running into issues when trying to mask. Here is my code snippet:
from scipy import signal
import numpy as np
Pxx_den = signal.periodogram(input, fs=sampleRate, nfft=sampleRate, axis=1)
p = np.array(Pxx_den)[1].astype(np.float)
noiseFloor = np.mean(p[:,400:],axis=1)
stdFloor = np.std(p[:,400:],axis=1)
p = np.ma.masked_less(p,noiseFloor+stdFloor*2)
This example will generate an error of:
ValueError: operands could not be broadcast together with shapes (91,5001) (91,)
I've deduced that this is because ma.masked_less works with a single value and does not take in an array. I would like the output to be an NxM array of values greater than the condition. Is there a Numpy way of doing what I'd like or an efficient alternative?
I've also looked at some peak detection routines such as peakUtils and scipy.signal.find_peaks_cwt() but they seem to only act on 1D arrays.
Thanks in advance
Before getting too far into using masked arrays, make sure that the code that follows handles them. It has to be aware of how masked arrays work, or defer to masked-array methods.
As to the specific problem, I think this recreates it:
In [612]: x=np.arange(10).reshape(2,5)
In [613]: np.ma.masked_less(x,np.array([3,6]))
...
ValueError: operands could not be broadcast together with shapes (2,5) (2,)
I have a 2d array, and I am trying to apply the < mask with a different value for each row.
Instead I can generate the mask as a 2d array matching x:
In [627]: mask= x<np.array([3,6])[:,None]
In [628]: np.ma.masked_where(mask,x)
Out[628]:
masked_array(data =
[[-- -- -- 3 4]
[-- 6 7 8 9]],
mask =
[[ True True True False False]
[ True False False False False]],
fill_value = 999999)
I can also select the unmasked values, though in a way that loses the 2d structure.
In [631]: x[~mask]
Out[631]: array([3, 4, 6, 7, 8, 9])
In [632]: np.ma.masked_where(mask,x).compressed()
Out[632]: array([3, 4, 6, 7, 8, 9])
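Applied back to the question's snippet, the same broadcasting trick would look something like this (an untested sketch; noiseFloor and stdFloor are per-row, so a trailing axis is added to make them broadcast against p):
threshold = (noiseFloor + 2 * stdFloor)[:, None]   # shape (91, 1)
p = np.ma.masked_less(p, threshold)                # broadcasts against (91, 5001)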

NetCDF4 [[--]] value for lat, long interpolation

Some of my requests to a netCDF4 object return a [[--]] value for invalid locations. The real numeric value for other locations is [[someNumerical]].
How can I catch this? It doesn't seem to be documented in the interp documentation (http://matplotlib.org/basemap/api/basemap_api.html).
The reason why I am getting it is that my lat/long values are out of bounds for reasonable interpolation, but I simply do not understand how to catch this return value.
Here's my call to it:
value = interp(theData, longitudes, latitudes, np.asarray([[convertLongitude(longitude)]]), np.asarray([[convertLatitude(latitude)]]), checkbounds=True, masked=True, order=1)
Well, a workaround is of course to do
if str(value) == '[[--]]':
    doSomething()
Your question is unclear as to where you think the problem is - in the values fetched via NetCDF4, or in the values returned by interp.
However when looking at the documentation for interp I find:
masked
If True, points outside the range of xin and yin are masked (in a masked array). If masked is set to a number, then points outside the range of xin and yin will be set to that number. Default False.
http://matplotlib.org/basemap/api/basemap_api.html#mpl_toolkits.basemap.interp
The [[--]] value makes sense in the context of a masked array.
In a masked array, masked values (usually non-valid ones) are displayed with a --:
In [380]: x=np.ma.masked_greater(np.arange(4), 2)
In [381]: x
Out[381]:
masked_array(data = [0 1 2 --],
mask = [False False False True],
fill_value = 999999)
You need to read up on masked arrays if you want to use the masked=True parameter.
You can do things like replace the masked elements with a fill value:
In [387]: x.filled()
Out[387]: array([ 0, 1, 2, 999999])
In [388]: x.filled(-1)
Out[388]: array([ 0, 1, 2, -1])
or remove them:
In [389]: x.compressed()
Out[389]: array([0, 1, 2])
The fact that you are seeing [[--]] suggests that value might be a 2d array. If so, compressed might not be useful, since it flattens the result.
But a key point is that the array does not actually contain -- values; that is just what is displayed as a filler for masked elements.
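A more robust check than comparing against the string '[[--]]' is to test the mask directly; a minimal sketch, assuming value comes back as a masked array (as it does with masked=True):
val = np.ma.masked_array([[0.0]], mask=[[True]])   # stand-in for an out-of-bounds result
np.ma.is_masked(val)   # True if any element is masked
val.mask               # array([[ True]])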

Basic NumPy data comparison

I have an array of N-dimensional values arranged in a 2D array. Something like:
import numpy as np
data = np.array([[[1,2],[3,4]],[[5,6],[1,2]]])
I also have a single value x that I want to compare against each data point, and I want to get a 2D array of boolean values showing whether my data is equal to x.
x = np.array([1,2])
If I do:
data == x
I get
# array([[[ True, True],
# [False, False]],
#
# [[False, False],
# [ True, True]]], dtype=bool)
I could easily combine these to get the result I want. However, I don't want to iterate over each of these slices, especially when data.shape[2] is larger. What I am looking for is a direct way of getting:
array([[ True, False],
[False, True]])
Any ideas for this seemingly easy task?
Well, (data == x).all(axis=-1) gives you what you want. It still constructs a 3-d array of results and iterates over it, but at least that iteration isn't at the Python level, so it should be reasonably fast.
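For reference, checking the one-liner against the example from the question:
import numpy as np

data = np.array([[[1, 2], [3, 4]], [[5, 6], [1, 2]]])
x = np.array([1, 2])
(data == x).all(axis=-1)
# array([[ True, False],
#        [False,  True]])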
