Summing numpy array blockwise to form a smaller array [duplicate] - python

This question already has answers here: How to evaluate the sum of values within array blocks (3 answers).
Closed 3 years ago.
We have an N x N matrix consisting of n x n blocks, so there are (N/n) x (N/n) blocks. We further divide it into large blocks so that each large block contains m x m smaller blocks. We then need to sum (block-wise) the smaller blocks inside each larger block. For example, here each A is n x n and m = 2.
[figure: the N x N matrix drawn as a grid of n x n blocks A, grouped into m x m = 2 x 2 large blocks]
What is the simplest and possibly fast way of doing that with numpy array?

One fast way of doing this is to reshape your (N, N) array into (m, n, m, n) and then sum along the axes of size m:
import numpy as np
m = 3
n = 2
N = m * n
arr = np.arange(N * N).reshape((N, N))
print(arr)
# [[ 0 1 2 3 4 5]
# [ 6 7 8 9 10 11]
# [12 13 14 15 16 17]
# [18 19 20 21 22 23]
# [24 25 26 27 28 29]
# [30 31 32 33 34 35]]
reshaped = arr.reshape((m, n, m, n))
summed = np.sum(reshaped, axis=(0, 2))
print(summed)
# [[126 135]
# [180 189]]
# ...checking a couple of blocks
# the "first m" (index 0) identifies blocks along rows
# the "second m" (index 2) identifies blocks along columns
print(reshaped[0, :, 0, :])
# [[0 1]
# [6 7]]
print(reshaped[1, :, 2, :])
# [[16 17]
# [22 23]]
# ...manually checking that the (0, 0) element of `summed` is correct
sum([0, 2, 4, 12, 14, 16, 24, 26, 28])
# 126
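If the matrix spans several large blocks, i.e. N = p * m * n with p large blocks per side, the same trick extends to a 6D reshape; a minimal sketch with arbitrarily chosen sizes:
p, m, n = 2, 2, 2
N = p * m * n
arr = np.arange(N * N).reshape((N, N))
# axes: (large row, block row, row in block, large col, block col, col in block)
reshaped = arr.reshape((p, m, n, p, m, n))
summed = reshaped.sum(axis=(1, 4))
print(summed.shape)
# (2, 2, 2, 2) -> one (n, n) block-wise sum per pair of large-block indices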

Related

Iterate over last axis of a numpy array

Let's say we have a (20, 5) array. We can iterate over each row very pythonically:
import numpy as np
xs = np.arange(100).reshape(20, 5)
for x in xs:
    print(x)
If we want to iterate over another axis (here, over columns, but I'm looking for a solution for every possible axis of an ndarray), it's less direct. We can use the method from Iterating over arbitrary dimension of numpy.array:
for i in range(xs.shape[-1]):
    x = xs[..., i]
    print(x)
Is there a more direct way to iterate over another axis, like (pseudo-code):
for x in xs.iterator(axis=-1):
    print(x)
?
I think that as_strided from the stride tricks module should do the work here.
It creates a view into the array, not a copy (as stated by the docs).
Here is a simple demonstration of as_strided's capabilities:
from numpy.lib.stride_tricks import as_strided
import numpy as np
xs = np.arange(3 * 3 * 4).reshape(3, 3, 4)
for x in xs:
    print(x)
output:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
[[24 25 26 27]
[28 29 30 31]
[32 33 34 35]]
A function to iterate over a specific axis of an array:
def iterate_over_axis(arr, axis=0):
    # Build a view whose first axis is the requested one by
    # permuting the strides and shape accordingly.
    strides = arr.strides
    strides_ = [strides[axis], *strides[0:axis], *strides[(axis+1):]]
    shape = arr.shape
    shape_ = [shape[axis], *shape[0:axis], *shape[(axis+1):]]
    return as_strided(arr, strides=strides_, shape=shape_)
for x in iterate_over_axis(xs, axis=1):
    print(x)
output:
[[ 0 1 2 3]
[12 13 14 15]
[24 25 26 27]]
[[ 4 5 6 7]
[16 17 18 19]
[28 29 30 31]]
[[ 8 9 10 11]
[20 21 22 23]
[32 33 34 35]]
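As a sanity check (my addition), the strided view should agree with numpy's moveaxis, which also returns a view with the chosen axis moved to the front:
for a, b in zip(iterate_over_axis(xs, axis=1), np.moveaxis(xs, 1, 0)):
    assert np.array_equal(a, b)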

Merge 3D numpy array into pandas Dataframe + 1D vector

I have a dataset which is a numpy array with shape (1536 x 16 x 48). A quick explanation of these dimensions that might be helpful:
The dataset consists of data collected by EEG sensors at 256Hz rate (1 second = 256 measures/values);
1536 values represent 6 seconds of EEG data (256 * 6 = 1536);
16 is the number of electrodes used to collect data;
48 is the number of samples.
In summary: I have 48 samples of 6 seconds (1536 values) of EEG data, collected by 16 electrodes.
I need to create a pandas dataframe with all this data, and therefore turn this 3D array into 2D. The depth dimension (48) can be removed if I stack all samples one above another. So the new dataset will be shaped (1536 * 48) x 16.
In addition, since this is a classification problem, I have a vector with 48 values that represents the class of each EEG sample. The new dataset should also have this as a "class" column, and then the real shape would be: (1536 * 48) x 16 + 1 (class).
I could easily do that by looping through the depth dimension of the 3D array and concatenating everything into a new 2D one. But this looks bad, since I will be dealing with many datasets like this one and performance is an issue. I would like to know if there's a more clever way of doing it.
I tried to provide as much information as I could for this question, but since it is not a trivial task, feel free to ask for further details if needed.
Thanks in advance.
Setup
>>> import numpy as np
>>> import pandas as pd
>>> a = np.zeros((4,3,3),dtype=int) + [0,1,2]
>>> a *= 10
>>> a += np.array([1,2,3,4])[:,None,None]
>>> a
array([[[ 1, 11, 21],
[ 1, 11, 21],
[ 1, 11, 21]],
[[ 2, 12, 22],
[ 2, 12, 22],
[ 2, 12, 22]],
[[ 3, 13, 23],
[ 3, 13, 23],
[ 3, 13, 23]],
[[ 4, 14, 24],
[ 4, 14, 24],
[ 4, 14, 24]]])
Split evenly along the last dimension; stack those elements, reshape, feed to DataFrame. Using the lengths of the array's dimensions simplifies the process.
>>> d0,d1,d2 = a.shape
>>> pd.DataFrame(np.stack(np.dsplit(a,d2)).reshape(d0*d2,d1))
0 1 2
0 1 1 1
1 2 2 2
2 3 3 3
3 4 4 4
4 11 11 11
5 12 12 12
6 13 13 13
7 14 14 14
8 21 21 21
9 22 22 22
10 23 23 23
11 24 24 24
Using your shape.
>>> b = np.random.random((1536, 16, 48))
>>> d0,d1,d2 = b.shape
>>> df = pd.DataFrame(np.stack(np.dsplit(b,d2)).reshape(d0*d2,d1))
>>> df.shape
(73728, 16)
After making the DataFrame from the 3d array, add the classification column to it with df['class'] = data (see Column selection, addition, deletion in the pandas docs). Note that the class vector holds one value per sample while the frame has 1536 rows per sample, so each value has to be repeated to match the row count.
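A minimal sketch of that step, using the d0 and d2 from above and a hypothetical label vector labels (one entry per sample):
>>> labels = np.random.randint(2, size=d2)  # hypothetical class labels, one per sample
>>> df['class'] = np.repeat(labels, d0)  # each label covers its sample's 1536 contiguous rows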
For the numpy part
x = np.random.random((1536, 16, 48)) # ndarray with a similar shape
x = x.transpose(2, 0, 1) # sample axis first: (48, 1536, 16); a plain swapaxes(1, 2) would interleave time steps across samples
x = x.reshape((-1, 16), order='C') # order is important: C order keeps each sample's 1536 rows contiguous
c = np.zeros((x.shape[0], 1)) # class column, shape=(73728, 1)
x = np.hstack((x, c)) # final dataset
x.shape
Output
(73728, 17)
or in one line
x = np.hstack((x.transpose(2, 0, 1).reshape((-1, 16), order='C'), c))
Finally,
x = pd.DataFrame(x)
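Optionally, name the columns; these electrode_<i> names are just placeholders, not anything from the original data:
x.columns = [f'electrode_{i}' for i in range(16)] + ['class']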

Padding sequence with numpy and combining a feature array with the number of sequence array

I have a number of sequences stored in a 2D array: [[first_seq, first_seq], [first_seq, first_seq], [sec_seq, sec_seq]], ...
Each vector sequence varies in length: some are 55 rows long, others are 68 rows long.
The sequence 2D array (features) is shaped (427, 227), and I have another 1D array (num_seq) of shape (5,) which contains how long each sequence is: [55, 68, 200, 42, 62] (e.g. the first seq is 55 rows long, the second seq is 68 rows long, etc.). len(1D-array) = number of sequences.
Now I need each sequence to be equally long, namely 200 rows each. Since I have 5 sequences in this example, the resulting array should be structured_seq = np.zeros((5, 200, 227)).
If a sequence is shorter than 200, all remaining values of that sequence should be zero.
Therefore, I tried to fill structured_seq by doing something like:
for counter, sent in enumerate(num_seq):
    for j, feat in enumerate(features):
        if num_sent[counter] < 200:
            structured_seq[counter, feat, ]
but I'm stuck.
So, to be precise: the first sequence is the first 55 rows of the 2D array (features), and all remaining 145 rows should be filled with zeros. And so on.
This is one way you can do that with np.insert:
import numpy as np
# Sizes of sequences
sizes = np.array([5, 2, 4, 6])
# Number of sequences
n = len(sizes)
# Number of elements in the second dimension
m = 3
# Sequence data
data = np.arange(sizes.sum() * m).reshape(-1, m)
# Size to which the sequences need to be padded
min_size = 6
# Number of zeros to add per sequence
num_pads = min_size - sizes
# Zeros
pad = np.zeros((num_pads.sum(), m), data.dtype)
# Position of the new zeros
pad_pos = np.repeat(np.cumsum(sizes), num_pads)
# Insert zeros
out = np.insert(data, pad_pos, pad, axis=0)
# Reshape
out = out.reshape(n, min_size, m)
print(out)
Output:
[[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]
[ 0 0 0]]
[[15 16 17]
[18 19 20]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]]
[[21 22 23]
[24 25 26]
[27 28 29]
[30 31 32]
[ 0 0 0]
[ 0 0 0]]
[[33 34 35]
[36 37 38]
[39 40 41]
[42 43 44]
[45 46 47]
[48 49 50]]]
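Mapped onto the question's own variables, a sketch with random stand-in data (note that [55, 68, 200, 42, 62] indeed sums to 427):
num_seq = np.array([55, 68, 200, 42, 62])
features = np.random.random((num_seq.sum(), 227))  # stand-in for the real (427, 227) data
pad_pos = np.repeat(np.cumsum(num_seq), 200 - num_seq)  # where the zero rows go
structured_seq = np.insert(features, pad_pos, 0, axis=0).reshape(len(num_seq), 200, 227)
print(structured_seq.shape)
# (5, 200, 227)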

Removing rows from a multi dimensional numpy array

I have a rather big 3-dimensional numpy array (2000, 2500, 32) that I need to manipulate. Some rows are bad, so I need to delete several rows.
In order to detect which row is "bad", I use the following function:
def badDetect(x):
    for i in xrange(10, 19):
        ptp = np.ptp(x[i*100:(i+1)*100])
        if ptp < 0.01:
            return True
    return False
which marks as bad any sequence of 2000 values that contains a run of 100 values with a peak-to-peak value less than 0.01.
When this is the case, I want to remove that sequence of 2000 values (which can be selected from numpy with a[:, x, y]).
numpy.delete seems to accept indexes, but only for 2-dimensional arrays.
You will definitely have to reshape your input array, because cutting out "rows" from a 3D cube leaves a structure that cannot be properly addressed.
As we don't have your data, I'll use a different example first to explain how this possible solution works:
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>>
>>> threshold = 18
>>> a = np.arange(5*3*2).reshape(5,3,2) # your dataset of 2000x2500x32
>>> # Taint the data:
... a[0,0,0] = 5
>>> a[a==22]=20
>>> print(a)
[[[ 5 1]
[ 2 3]
[ 4 5]]
[[ 6 7]
[ 8 9]
[10 11]]
[[12 13]
[14 15]
[16 17]]
[[18 19]
[20 21]
[20 23]]
[[24 25]
[26 27]
[28 29]]]
>>> a2 = a.reshape(-1, np.prod(a.shape[1:]))
>>> print(a2) # Will prove to be much easier to work with!
[[ 5 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 20 23]
[24 25 26 27 28 29]]
As you can see from the representation above, it already becomes much clearer over which windows you want to compute the peak-to-peak value. And you'll need this form if you're going to remove "rows" (now transformed into columns) from this data structure, something you couldn't do in 3 dimensions!
>>> isize = a.itemsize # More generic, in case you have another dtype
>>> slice_size = 4 # How big each continuous slice is over which the Peak2Peak value is calculated
>>> slices = as_strided(a2,
... shape=(a2.shape[0] + 1 - slice_size, slice_size, a2.shape[1]),
... strides=(isize*a2.shape[1], isize*a2.shape[1], isize))
>>> print(slices)
[[[ 5 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 20 23]]
[[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 20 23]
[24 25 26 27 28 29]]]
So I took, as an example, a window size of 4 elements: If the peak to peak value within any of these 4 element slices (per dataset, so per column) is less than a certain threshold, I want to exclude it. That can be done like this:
>>> mask = np.all(slices.ptp(axis=1) >= threshold, axis=0) # These are the ones that are of interest
>>> print(a2[:,mask])
[[ 1 2 3 5]
[ 7 8 9 11]
[13 14 15 17]
[19 20 21 23]
[25 26 27 29]]
You can now clearly see that the tainted data has been removed. But remember, you could not have simply removed that data from a 3D array (but you could've masked it then).
Obviously, you'll have to set the threshold to .01 in your use-case, and the slice_size to 100.
Beware: while the as_strided form is extremely memory-efficient, computing the peak-to-peak values of this array and storing that result does require a good amount of memory in your case: 1901 x (2500 x 32) values in the full scenario, i.e. when you do not ignore the first 1000 slices. In your case, where you're only interested in the slices from 1000:1900, you would have to add that to the code like so:
mask = np.all(slices[1000:1900,:,:].ptp(axis=1) >= threshold, axis=0)
And that would reduce the memory required to store this mask to "only" 900x(2500x32) values (of whatever data type you were using).
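As an aside, following the masking remark above: the same boolean mask can also be applied to the original 3D array by reshaping it back to the trailing dimensions:
mask3d = mask.reshape(a.shape[1:])  # (3, 2) boolean: one flag per (x, y) sequence
print(a[:, mask3d])  # shape (5, 4), the same columns as a2[:, mask]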

Numpy fancy indexing in multiple dimensions

Let's say I have a numpy array A of size n x m x k and another array B of size n x m that holds indices from 0 to k-1 (i.e. valid indices into the last axis of A).
For each position (i, j) I want to pick the element of A's last axis given by the index at this place in B,
giving me an array of size n x m.
Edit: that is apparently not what I want!
[[ I can achieve this using take like this:
A.take(B)
]] end edit
Can this be achieved using fancy indexing?
I would have thought A[B] would give the same result, but that results
in an array of size n x m x m x k (which I don't really understand).
The reason I don't want to use take is that I want to be able to assign this portion something, like
A[B] = 1
The only working solution that I have so far is
A.reshape(-1, k)[np.arange(n * m), B.ravel()].reshape(n, m)
but surely there has to be an easier way?
Suppose
import numpy as np
np.random.seed(0)
n,m,k = 2,3,5
A = np.arange(n*m*k,0,-1).reshape((n,m,k))
print(A)
# [[[30 29 28 27 26]
# [25 24 23 22 21]
# [20 19 18 17 16]]
# [[15 14 13 12 11]
# [10 9 8 7 6]
# [ 5 4 3 2 1]]]
B = np.random.randint(k, size=(n,m))
print(B)
# [[4 0 3]
# [3 3 1]]
To create this array,
print(A.reshape(-1, k)[np.arange(n * m), B.ravel()])
# [26 25 17 12 7 4]
as an n x m array using fancy indexing:
i,j = np.ogrid[0:n, 0:m]
print(A[i, j, B])
# [[26 25 17]
# [12 7 4]]
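The same index triple also supports assignment, which was the stated reason for avoiding take:
A[i, j, B] = 1
print(A[0, 0])
# [30 29 28 27  1]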
