How to do equivalent of block_reduce on a masked array? - python

I'm calculating an aggregate value over smaller blocks in a 2D numpy array. I'd like to exclude the value 0 from the aggregation, and to do it efficiently (rather than with for loops and if statements).
I'm using skimage.measure.block_reduce and numpy.ma.masked_equal, but it looks like block_reduce ignores the mask.
import numpy as np
import skimage
a = np.array([[2,4,0,12,5,7],[6,0,8,4,3,9]])
zeros_included = skimage.measure.block_reduce(a,(2,2),np.mean)
This includes the 0s and (correctly) produces
zeros_included
array([[3., 6., 6.]])
I was hoping
masked = np.ma.masked_equal(a,0)
zeros_excluded = skimage.measure.block_reduce(masked,(2,2),np.mean)
would do the trick, but still produces
zeros_excluded
array([[3., 6., 6.]])
The desired result would be:
array([[4., 8., 6.]])
I'm looking for a Pythonic way to achieve the correct result; use of skimage is optional. Of course my actual arrays and blocks are much bigger than in this example, hence the need for efficiency.
Thanks for your interest.

You could use np.nanmean, but you'll have to modify the original array or create a new one:
import numpy as np
import skimage
a = np.array([[2,4,0,12,5,7],[6,0,8,4,3,9]])
b = a.astype("float")
b[b==0] = np.nan
zeros_excluded = skimage.measure.block_reduce(b,(2,2), np.nanmean)
zeros_excluded
# array([[4., 8., 6.]])
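Note that a block consisting entirely of zeros becomes all-nan, and np.nanmean then returns nan for it (with a RuntimeWarning about the mean of an empty slice).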

The core code of block_reduce is
blocked = view_as_blocks(image, block_size)
return func(blocked, axis=tuple(range(image.ndim, blocked.ndim)))
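For this example that amounts to (a sketch):
blocked = skimage.util.view_as_blocks(a, (2, 2))  # shape (1, 3, 2, 2)
blocked.mean(axis=(2, 3))
# array([[3., 6., 6.]])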
view_as_blocks uses as_strided to create a different view of the array:
In [532]: skimage.util.view_as_blocks(a,(2,2))
Out[532]:
array([[[[ 2,  4],
         [ 6,  0]],

        [[ 0, 12],
         [ 8,  4]],

        [[ 5,  7],
         [ 3,  9]]]])
When applied to the masked array it produces the same thing. In effect it works with masked.data, or np.asarray(masked). Some actions preserve subclasses; this one does not.
In [533]: skimage.util.view_as_blocks(masked,(2,2))
Out[533]:
array([[[[ 2,  4],
         [ 6,  0]],
        ...
That's why np.mean applied over axes (2,3) does not respond to the masking.
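A quick type check makes the subclass loss explicit:
In [534]: type(skimage.util.view_as_blocks(masked,(2,2)))
Out[534]: numpy.ndarray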
np.mean applied to a masked array delegates the action to the array's own method, and so is sensitive to the masking:
In [544]: np.mean(masked[:,:2])
Out[544]: 4.0
In [545]: masked[:,:2].mean()
Out[545]: 4.0
In [547]: [masked[:,i:i+2].mean() for i in range(0,6,2)]
Out[547]: [4.0, 8.0, 6.0]
np.nanmean works with view_as_blocks because it doesn't depend on the array being a special subclass.
I can define a function that applies masking to the block view:
def foo(arr, axis):
    return np.ma.masked_equal(arr, 0).mean(axis)
In [552]: skimage.measure.block_reduce(a,(2,2),foo)
Out[552]:
masked_array(data=[[4.0, 8.0, 6.0]],
mask=[[False, False, False]],
fill_value=1e+20)
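Note that block_reduce pads the array with cval (0 by default) when its shape isn't an exact multiple of the block size, so with foo any padding is masked out along with the genuine zeros.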
====
Since your blocks aren't overlapping, I can also create the blocks by reshaping and swapping axes.
In [554]: masked.reshape(2,3,2).transpose(1,0,2)
Out[554]:
masked_array(
  data=[[[2, 4],
         [6, --]],

        [[--, 12],
         [8, 4]],

        [[5, 7],
         [3, 9]]],
  mask=[[[False, False],
         [False,  True]],

        [[ True, False],
         [False, False]],

        [[False, False],
         [False, False]]],
  fill_value=0)
and then apply mean to the last 2 axes:
In [555]: masked.reshape(2,3,2).transpose(1,0,2).mean((1,2))
Out[555]:
masked_array(data=[4.0, 8.0, 6.0],
mask=[False, False, False],
fill_value=1e+20)
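Wrapped up as a function (a sketch; it assumes the shape is an exact multiple of the block size, and the name masked_block_mean is mine):
def masked_block_mean(m, bh, bw):
    # carve the array into (bh, bw) blocks, then take a mask-aware mean per block
    H, W = m.shape
    blocks = m.reshape(H // bh, bh, W // bw, bw).transpose(0, 2, 1, 3)
    return blocks.mean(axis=(2, 3))

masked_block_mean(masked, 2, 2)
# masked_array(data=[[4.0, 8.0, 6.0]],
#              mask=[[False, False, False]],
#        fill_value=1e+20)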

Related

Assign values to numpy array using row indices

Suppose I have two arrays, a=np.array([0,0,1,1,1,2]), b=np.array([1,2,4,2,6,5]). The elements of a give the row indices where the values of b should be assigned, and if there are multiple elements for the same row, the values should be assigned in order.
So the result is a 2D array c:
c = np.zeros((3, 4))
counts = {k: 0 for k in range(3)}
for i in range(a.shape[0]):
    c[a[i], counts[a[i]]] = b[i]
    counts[a[i]] += 1
print(c)
Is there a way to use some fancy indexing method in numpy to get such results faster (without a for loop), in case these arrays are big?
I had to run your code to actually see what it produced. There are limits to what I can 'run' in my head.
In [230]: c
Out[230]:
array([[1., 2., 0., 0.],
[4., 2., 6., 0.],
[5., 0., 0., 0.]])
In [231]: counts
Out[231]: {0: 2, 1: 3, 2: 1}
Omitting this information may be delaying possible answers. 'vectorization' requires thinking in whole-array terms, which is easiest if I can visualize the result, and look for a pattern.
This looks like a padding problem.
In [260]: u, c = np.unique(a, return_counts=True)
In [261]: u
Out[261]: array([0, 1, 2])
In [262]: c
Out[262]: array([2, 3, 1]) # cf with counts
Working from previous padding questions ('Load data with rows of different sizes into Numpy array'), I can construct a mask:
In [263]: mask = np.arange(4)<c[:,None]
In [264]: mask
Out[264]:
array([[ True, True, False, False],
[ True, True, True, False],
[ True, False, False, False]])
and use that to assign the b values to c:
In [265]: c = np.zeros((3,4),int)
In [266]: c[mask] = b
In [267]: c
Out[267]:
array([[1, 2, 0, 0],
[4, 2, 6, 0],
[5, 0, 0, 0]])
Since a is already sorted, we might get the counts faster than with unique. Note also that unique will have problems if a doesn't have any values for some row(s).
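np.bincount addresses both points (a sketch, my variable names): it counts per row directly and returns 0 for rows that never appear in a:
a = np.array([0, 0, 1, 1, 1, 2])
b = np.array([1, 2, 4, 2, 6, 5])
nrows, ncols = 3, 4
counts = np.bincount(a, minlength=nrows)    # array([2, 3, 1]); 0 for missing rows
mask = np.arange(ncols) < counts[:, None]   # first counts[i] columns of each row
c = np.zeros((nrows, ncols), dtype=b.dtype)
c[mask] = b                                 # in order, because a is sorted
# c is array([[1, 2, 0, 0],
#             [4, 2, 6, 0],
#             [5, 0, 0, 0]])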

replace masked with nan in numpy masked_array

>>> masks = [[1,1],[0,0]]
>>> [np.ma.masked_array(data=np.array([1.0,2.0]), mask=m, fill_value=np.nan).mean() for m in masks]
[masked, 1.5]
I'd like to replace the masked result with nan. Is there a way to do that directly with numpy's masked_array?
In [232]: M = np.ma.masked_array(data=np.array([1.0,2.0]),mask=[True, False])
The filled method replaces the masked values with the fill value:
In [233]: M.filled()
Out[233]: array([1.e+20, 2.e+00])
In [234]: M.filled(np.nan) # or with a value of your choice.
Out[234]: array([nan, 2.])
Or as you do, specify the fill value when defining the array:
In [235]: M = np.ma.masked_array(data=np.array([1.0,2.0]),mask=[True, False],
...: fill_value=np.nan)
In [236]: M
Out[236]:
masked_array(data=[--, 2.0],
mask=[ True, False],
fill_value=nan)
In [237]: M.filled()
Out[237]: array([nan, 2.])
The masked array's mean method skips over the masked values:
In [238]: M.mean()
Out[238]: 2.0
In [239]: M.filled().mean()
Out[239]: nan
In [241]: np.nanmean(M.filled()) # so does the `nanmean` function
Out[241]: 2.0
In [242]: M.data.mean() # mean of the underlying data
Out[242]: 1.5
I think you can use np.ones to broadcast the data to the mask's shape:
masks=np.array([[1,1],[0,0]])
np.ma.masked_array(data=np.array([1.0,2.0])*np.ones(masks.shape),
                   mask=masks, fill_value=np.nan).mean(axis=1)
Out[145]:
masked_array(data=[--, 1.5],
mask=[ True, False],
fill_value=1e+20)
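Combining the two answers (a sketch): take row means on a 2D masked array, then let filled turn the fully-masked row into nan:
masks = np.array([[1, 1], [0, 0]])
data = np.array([1.0, 2.0]) * np.ones(masks.shape)
np.ma.masked_array(data, mask=masks).mean(axis=1).filled(np.nan)
# array([nan, 1.5])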

np.where equivalent for 1-D arrays

I am trying to fill nan values in an array with values from another array. Since the arrays I am working on are 1-D, np.where is not working. However, following the tip in the documentation I tried the following:
import numpy as np
sample = [1, 2, np.nan, 4, 5, 6, np.nan]
replace = [3, 7]
new_sample = [new_value if condition else old_value for (new_value, condition, old_value) in zip(replace, np.isnan(sample), sample)]
However, instead of the output I expected, [1, 2, 3, 4, 5, 6, 7], I get:
[Out]: [1, 2]
What am I doing wrong?
np.where works. (Your comprehension returns only two items because zip stops at the shortest input, replace.)
In [561]: sample = np.array([1, 2, np.nan, 4, 5, 6, np.nan])
Use isnan to identify the nan values (don't use ==)
In [562]: np.isnan(sample)
Out[562]: array([False, False, True, False, False, False, True])
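The reason == fails: nan compares unequal to everything, including itself.
In [563]: sample == np.nan
Out[563]: array([False, False, False, False, False, False, False])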
In [564]: np.where(np.isnan(sample))
Out[564]: (array([2, 6], dtype=int32),)
Either one, the boolean array or the where tuple, can index the nan values:
In [565]: sample[Out[564]]
Out[565]: array([nan, nan])
In [566]: sample[Out[562]]
Out[566]: array([nan, nan])
and be used to replace:
In [567]: sample[Out[562]]=[1,2]
In [568]: sample
Out[568]: array([1., 2., 1., 4., 5., 6., 2.])
The three-parameter form also works, but returns a copy.
In [571]: np.where(np.isnan(sample),999,sample)
Out[571]: array([ 1., 2., 999., 4., 5., 6., 999.])
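Putting it together for the question's data, boolean-mask assignment replaces the nans in order:
In [572]: sample = np.array([1, 2, np.nan, 4, 5, 6, np.nan])
In [573]: sample[np.isnan(sample)] = [3, 7]
In [574]: sample
Out[574]: array([1., 2., 3., 4., 5., 6., 7.])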
You can use numpy.argwhere. But @hpaulj shows that numpy.where works just as well.
import numpy as np
sample = np.array([1, 2, np.nan, 4, 5, 6, np.nan])
replace = np.array([3, 7])
sample[np.argwhere(np.isnan(sample)).ravel()] = replace
# array([ 1., 2., 3., 4., 5., 6., 7.])

Append selected values from a multi dimensional array to a new array

Hello :) I am a Python beginner and I started working with numpy lately. Basically I've got an nd-array, data.shape = (55000, 784), filled with float32 values. Based on a condition I made, I want to append specific rows and their columns to a new array; it's important that the formatting stays the same, e.g. I want data[5][0-784] appended to an empty array. I heard about something called fancy indexing but still couldn't figure out how to use it; an example would help me out big time. I would appreciate every help from you guys! - Greets
I'd recommend skimming through the documentation for Indexing. But here is an example to demonstrate.
import numpy as np
data = np.array([[0, 1, 2], [3, 4, 5]])
print(data.shape)
(2, 3)
print(data)
[[0 1 2]
[3 4 5]]
selection = data[1, 1:3]
print(selection)
[4 5]
Fancy indexing is an advanced indexing feature which allows indexing using integer arrays. Here is an example.
fancy_selection = data[[0, 1], [0, 2]]
print(fancy_selection)
[0 5]
Since you also asked about appending, have a look at Append a NumPy array to a NumPy array. Here is an example anyway.
data_two = np.array([[6, 7, 8]])
appended_array = np.concatenate((data, data_two))
print(appended_array)
[[0 1 2]
[3 4 5]
[6 7 8]]
As @hpaulj recommends in his comment, appending to arrays is possible but inefficient and should be avoided. Let's turn to your example but make the numbers a bit smaller.
a = np.add(*np.ogrid[1:5, 0.1:0.39:0.1])  # broadcasted sum of a column range and a row range
a
# array([[ 1.1, 1.2, 1.3],
# [ 2.1, 2.2, 2.3],
# [ 3.1, 3.2, 3.3],
# [ 4.1, 4.2, 4.3]])
a.shape
# (4, 3)
Selecting an element:
a[1,2]
# 2.3
Selecting an entire row:
a[2, :] # or a[2] or a[2, ...]
# array([ 3.1, 3.2, 3.3])
or column:
a[:, 1] # or a[..., 1]
# array([ 1.2, 2.2, 3.2, 4.2])
Fancy indexing: observe that the first index is not a slice but a list or array:
a[[3,0,0,1], :] # or a[[3,0,0,1]]
# array([[ 4.1, 4.2, 4.3],
# [ 1.1, 1.2, 1.3],
# [ 1.1, 1.2, 1.3],
# [ 2.1, 2.2, 2.3]])
Fancy indexing can be used on multiple axes to select arbitrary elements and assemble them into a new shape. For example, you could make a 2x2x2 array like so:
a[ [[[0,1], [1,2]], [[3,3], [3,2]]], [[[2,1], [1,1]], [[2,1], [0,0]]] ]
# array([[[ 1.3, 2.2],
# [ 2.2, 3.2]],
#
# [[ 4.3, 4.2],
# [ 4.1, 3.1]]])
There is also logical (boolean) indexing:
mask = np.isclose(a%1.1, 1.0)
mask
# array([[False, False, False],
# [ True, False, False],
# [False, True, False],
# [False, False, True]], dtype=bool)
a[mask]
# array([ 2.1, 3.2, 4.3])
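Closer to what the question asks (a sketch; the condition here is a made-up stand-in): a boolean mask with one entry per row selects whole rows at once, keeping the 2D layout, no appending needed:
rng = np.random.default_rng(0)
data = rng.random((6, 3)).astype(np.float32)  # stand-in for the (55000, 784) array
row_mask = data.mean(axis=1) > 0.5            # hypothetical per-row condition
selected = data[row_mask]                     # still 2D: (n_selected, 3)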
To combine arrays, collect them in a list and use concatenate
np.concatenate([a[1:, :2], a[:0:-1, [2,0]]], axis=1)
# array([[ 2.1, 2.2, 4.3, 4.1],
# [ 3.1, 3.2, 3.3, 3.1],
# [ 4.1, 4.2, 2.3, 2.1]])
Hope that helps get you started.

ndarray to structured_array and float to int

The problem I encounter is that using ndarray.view(np.dtype) to get a structured array from a classic ndarray seems to miscompute the float-to-int conversion. An example says it better:
In [12]: B
Out[12]:
array([[ 1.00000000e+00, 1.00000000e+00, 0.00000000e+00,
         0.00000000e+00, 4.43600000e+01, 0.00000000e+00],
       [ 1.00000000e+00, 2.00000000e+00, 7.10000000e+00,
         1.10000000e+00, 4.43600000e+01, 1.32110000e+02],
       [ 1.00000000e+00, 3.00000000e+00, 9.70000000e+00,
         2.10000000e+00, 4.43600000e+01, 2.04660000e+02],
       ...,
       [ 1.28900000e+03, 1.28700000e+03, 0.00000000e+00,
         9.99999000e+05, 4.75600000e+01, 3.55374000e+03],
       [ 1.28900000e+03, 1.28800000e+03, 1.29000000e+01,
         5.40000000e+00, 4.19200000e+01, 2.08400000e+02],
       [ 1.28900000e+03, 1.28900000e+03, 0.00000000e+00,
         0.00000000e+00, 4.19200000e+01, 0.00000000e+00]])
In [14]: B.view(A.dtype)
Out[14]:
array([(4607182418800017408, 4607182418800017408, 0.0, 0.0, 44.36, 0.0),
       (4607182418800017408, 4611686018427387904, 7.1, 1.1, 44.36, 132.11),
       (4607182418800017408, 4613937818241073152, 9.7, 2.1, 44.36, 204.66),
       ...,
       (4653383897399164928, 4653375101306142720, 0.0, 999999.0, 47.56, 3553.74),
       (4653383897399164928, 4653379499352653824, 12.9, 5.4, 41.92, 208.4),
       (4653383897399164928, 4653383897399164928, 0.0, 0.0, 41.92, 0.0)],
      dtype=[('i', '<i8'), ('j', '<i8'), ('tnvtc', '<f8'), ('tvtc', '<f8'), ('tf', '<f8'), ('tvps', '<f8')])
The 'i' and 'j' columns are true integers. Here are two further checks I have done; the problem seems to come from the ndarray.view(np.int):
In [21]: B[:,:2]
Out[21]:
array([[ 1.00000000e+00, 1.00000000e+00],
       [ 1.00000000e+00, 2.00000000e+00],
       [ 1.00000000e+00, 3.00000000e+00],
       ...,
       [ 1.28900000e+03, 1.28700000e+03],
       [ 1.28900000e+03, 1.28800000e+03],
       [ 1.28900000e+03, 1.28900000e+03]])
In [22]: B[:,:2].view(np.int)
Out[22]:
array([[4607182418800017408, 4607182418800017408],
       [4607182418800017408, 4611686018427387904],
       [4607182418800017408, 4613937818241073152],
       ...,
       [4653383897399164928, 4653375101306142720],
       [4653383897399164928, 4653379499352653824],
       [4653383897399164928, 4653383897399164928]])
In [23]: B[:,:2].astype(np.int)
Out[23]:
array([[   1,    1],
       [   1,    2],
       [   1,    3],
       ...,
       [1289, 1287],
       [1289, 1288],
       [1289, 1289]])
What am I doing wrong? Can't I change the type because of how numpy allocates memory? Is there another way to do this (fromarrays was complaining about a shape mismatch)?
This is the difference between doing somearray.view(new_dtype) and calling astype.
What you're seeing is exactly the expected behavior, and it's very deliberate, but it's surprising the first time you come across it.
A view with a different dtype interprets the underlying memory buffer of the array as the given dtype. No copies are made. It's very powerful, but you have to understand what you're doing.
A key thing to remember is that calling view never alters the underlying memory buffer, just the way that it's viewed by numpy (e.g. dtype, shape, strides). Therefore, view deliberately avoids altering the data to the new type and instead just interprets the "old bits" as the new dtype.
For example:
In [1]: import numpy as np
In [2]: x = np.arange(10)
In [3]: x
Out[3]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [4]: x.dtype
Out[4]: dtype('int64')
In [5]: x.view(np.int32)
Out[5]: array([0, 0, 1, 0, 2, 0, 3, 0, 4, 0, 5, 0, 6, 0, 7, 0, 8, 0, 9, 0],
dtype=int32)
In [6]: x.view(np.float64)
Out[6]:
array([ 0.00000000e+000, 4.94065646e-324, 9.88131292e-324,
        1.48219694e-323, 1.97626258e-323, 2.47032823e-323,
        2.96439388e-323, 3.45845952e-323, 3.95252517e-323,
        4.44659081e-323])
If you want to make a copy of the array with a new dtype, use astype instead:
In [7]: x
Out[7]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
In [8]: x.astype(np.int32)
Out[8]: array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=int32)
In [9]: x.astype(float)
Out[9]: array([ 0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])
However, using astype with structured arrays will probably surprise you. Structured arrays treat each element of the input as a C-like struct. Therefore, if you call astype, you'll run into several surprises.
Basically, you want the columns to have a different dtype. In that case, don't put them in the same array. Numpy arrays are expected to be homogeneous. Structured arrays are handy in certain cases, but they're probably not what you want if you're looking for something to handle separate columns of data. Just use each column as its own array.
Better yet, if you're working with tabular data, you'll probably find it's easier to use pandas than to use numpy arrays directly. pandas is oriented towards tabular data (where columns are expected to have different types), while numpy is oriented towards homogeneous arrays.
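A sketch of that last suggestion with the question's B (the names ij and rest are mine): convert the two id columns with astype, which really converts the values, and keep the float columns as their own array:
ij = B[:, :2].astype(np.int64)  # true conversion: 1.0 -> 1, not a bit reinterpretation
rest = B[:, 2:]                 # tnvtc, tvtc, tf, tvps stay float64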
Actually, fromarrays works, but it doesn't explain this weird behavior.
Here is the solution I've found:
np.core.records.fromarrays(B.T, dtype=A.dtype)
The only solution which worked for me in a similar situation:
np.array([tuple(row) for row in B], dtype=A.dtype)
