Slicing an array with numpy?

import numpy as np
r = np.arange(36)
r.resize((6, 6))
print(r)
# prints:
# [[ 0 1 2 3 4 5]
# [ 6 7 8 9 10 11]
# [12 13 14 15 16 17]
# [18 19 20 21 22 23]
# [24 25 26 27 28 29]
# [30 31 32 33 34 35]]
print(r[:,::7])
# prints:
# [[ 0]
# [ 6]
# [12]
# [18]
# [24]
# [30]]
print(r[:,0])
# prints:
# [ 0 6 12 18 24 30]
The r[:,::7] gives me a column and the r[:,0] gives me a row, yet they both contain the same numbers. I would be glad if someone could explain to me why.

Because the step argument is greater than the length of the corresponding dimension, you just get the first element along that axis. However, the two results are not identical (even if they contain the same numbers), because the scalar index in [:, 0] removes the corresponding dimension, so you get a 1D array, whereas [:, ::7] keeps the number of dimensions intact and only alters the length of the step-sliced dimension.
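The difference is easiest to see in the shapes; a minimal sketch using the array from the question:
import numpy as np
r = np.arange(36).reshape(6, 6)
print(r[:, ::7].shape)  # (6, 1) -- slicing keeps the axis
print(r[:, 0].shape)    # (6,)   -- scalar indexing drops the axis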

Iterate over last axis of a numpy array

Let's say we have a (20, 5) array. We can iterate over each row very pythonically:
import numpy as np
xs = np.arange(100).reshape(20, 5)
for x in xs:
    print(x)
If we want to iterate over another axis (in this example the columns, but I'm looking for a solution for every possible axis of an ndarray), it's less direct; we can use the method from Iterating over arbitrary dimension of numpy.array:
for i in range(xs.shape[-1]):
    x = xs[..., i]
    print(x)
Is there a more direct way to iterate over another axis, like (pseudo-code):
for x in xs.iterator(axis=-1):
    print(x)
?
I think that as_strided from the stride tricks module should do the work here.
It creates a view into the array and not a copy (as stated by the docs).
Here is a simple demonstration of as_strided's capabilities:
from numpy.lib.stride_tricks import as_strided
import numpy as np
xs = np.arange(3 * 3 * 4).reshape(3, 3, 4)
for x in xs:
    print(x)
output:
[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]
[[24 25 26 27]
[28 29 30 31]
[32 33 34 35]]
A function to iterate over a specific axis of an array:
def iterate_over_axis(arr, axis=0):
    strides = arr.strides
    strides_ = [strides[axis], *strides[0:axis], *strides[(axis+1):]]
    shape = arr.shape
    shape_ = [shape[axis], *shape[0:axis], *shape[(axis+1):]]
    return as_strided(arr, strides=strides_, shape=shape_)

for x in iterate_over_axis(xs, axis=1):
    print(x)
output:
[[ 0 1 2 3]
[12 13 14 15]
[24 25 26 27]]
[[ 4 5 6 7]
[16 17 18 19]
[28 29 30 31]]
[[ 8 9 10 11]
[20 21 22 23]
[32 33 34 35]]
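For what it's worth, the same iteration can be had without manual stride arithmetic: np.moveaxis also returns a view, with the chosen axis moved to the front, so you can iterate over it directly. A minimal sketch (same xs as above):
import numpy as np
xs = np.arange(3 * 3 * 4).reshape(3, 3, 4)
# np.moveaxis returns a view with axis 1 moved to the front
for x in np.moveaxis(xs, 1, 0):
    print(x)
This prints the same three 3x4 slices as iterate_over_axis(xs, axis=1).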

Padding sequence with numpy and combining a feature array with the number of sequence array

I have a number of sequences stored in a 2D-array [[first_seq,first_seq],[first_seq,first_seq],[sec_seq,sec_seq]],...
Each vector-sequence varies in length: some are 55 rows long, others are 68 rows long.
The sequence 2D-array (features) is shaped (427, 227) and I have another 1D-array (num_seq) of shape (5,) which contains the length of each sequence, [55, 68, 200, 42, 62] (e.g. the first sequence is 55 rows long, the second sequence is 68 rows long, etc.). len(num_seq) = number of sequences.
Now I need each sequence to be equally long, namely 200 rows each. Since I have 5 sequences in this example, the resulting array should be structured_seq = np.zeros((5, 200, 227)).
If a sequence is shorter than 200, all remaining values of that sequence should be zero.
Therefore, I tried to fill structured_seq doing something like:
for counter, sent in enumerate(num_seq):
    for j, feat in enumerate(features):
        if num_seq[counter] < 200:
            structured_seq[counter, feat, ]
but I'm stuck..
So, to be precise: the first sequence is the first 55 rows of the 2D-array (features), and all remaining 145 rows should be filled with zeros. And so on..
This is one way you can do that with np.insert:
import numpy as np
# Sizes of sequences
sizes = np.array([5, 2, 4, 6])
# Number of sequences
n = len(sizes)
# Number of elements in the second dimension
m = 3
# Sequence data
data = np.arange(sizes.sum() * m).reshape(-1, m)
# Size to which the sequences need to be padded
min_size = 6
# Number of zeros to add per sequence
num_pads = min_size - sizes
# Zeros
pad = np.zeros((num_pads.sum(), m), data.dtype)
# Position of the new zeros
pad_pos = np.repeat(np.cumsum(sizes), num_pads)
# Insert zeros
out = np.insert(data, pad_pos, pad, axis=0)
# Reshape
out = out.reshape(n, min_size, m)
print(out)
Output:
[[[ 0 1 2]
[ 3 4 5]
[ 6 7 8]
[ 9 10 11]
[12 13 14]
[ 0 0 0]]
[[15 16 17]
[18 19 20]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]
[ 0 0 0]]
[[21 22 23]
[24 25 26]
[27 28 29]
[30 31 32]
[ 0 0 0]
[ 0 0 0]]
[[33 34 35]
[36 37 38]
[39 40 41]
[42 43 44]
[45 46 47]
[48 49 50]]]
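Just as a sketch of an alternative (not part of the answer above): the same padding can be done with a boolean mask and broadcasting, which avoids computing insert positions. It relies on boolean indexing over the first two axes visiting slots in row-major order, which matches the order of the concatenated sequences in data:
import numpy as np
sizes = np.array([5, 2, 4, 6])
n, m, min_size = len(sizes), 3, 6
data = np.arange(sizes.sum() * m).reshape(-1, m)
out = np.zeros((n, min_size, m), data.dtype)
# mask[i, j] is True for the first sizes[i] slots of sequence i
mask = np.arange(min_size) < sizes[:, None]
out[mask] = data
print(out)  # same output as above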

pandas multiply each dataset row by multiple vectors

df = {1,2,3
4,5,6
7,8,9
10,11,12
}
weights={[1,3,3],[2,2,2],[3,1,1]}
I want to multiply my df by every line of the weights matrix (so I'll have three different df, one for each vector of weights), and then combine them by keeping, for each line, the largest one. Ex:
df0=df * weights[0]={1,6,9
4,15,18,
7,24,27
10,33,36
}
df1=df*weights[1]={2,4,6,
8,10,12,
14,16,18,
20,22,24
}
df2=df*weights[2]={3,2,3,
12,5,6,
21,8,9,
30,11,12
}
and
final_df_lines=max{df0,df1,df2}={1,6,9 - max line from df0,
4,15,18, - max line from df0,
7,24,27 - max line from df0,
10,33,36 - max line from df0,
}
In this example all maxima were from df0, but they could come from any of the three df. The "max line" is decided by adding up the numbers in the same line (the row sum).
I need to do this vectorized (without any loops or if...). How do I do this? Is it possible at least? I really need help :( I have been searching the internet for 2 days... I have not worked in python for too long...
You can try concatenating all the weight-multiplied columns into one dataframe, with a suffix on each column representing the weight it was multiplied by. Then, by grouping on that weight suffix, you can find the weighting with the maximum row sums, and multiply the dataframe by that weight vector:
import pandas as pd
df = pd.DataFrame([[1,2,3],[4,5,6],[7,8,9],[10,11,12]])
weights = [[1,3,3],[2,2,2],[3,1,1]]
df2 = pd.concat([(df*i).add_suffix('__'+str(i)) for i in weights],axis=1).T
0 1 2 3
0__[1, 3, 3] 1 4 7 10
1__[1, 3, 3] 6 15 24 33
2__[1, 3, 3] 9 18 27 36
0__[2, 2, 2] 2 8 14 20
1__[2, 2, 2] 4 10 16 22
2__[2, 2, 2] 6 12 18 24
0__[3, 1, 1] 3 12 21 30
1__[3, 1, 1] 2 5 8 11
2__[3, 1, 1] 3 6 9 12
# group by the weight suffix; for each original row, find the weight with the max sum
a = df2.groupby(df2.index.str.split('__').str[1]).apply(lambda x: x.sum()).idxmax()
# parse the winning weight strings back into lists of ints
df['idxmax'] = a.str.slice(1,-1).str.split(',').apply(lambda x: list(map(int,x)))
0 [1, 3, 3]
1 [1, 3, 3]
2 [1, 3, 3]
3 [1, 3, 3]
dtype: object
df.apply(lambda x: x.loc[df.columns.difference(['idxmax'])] * x['idxmax'], axis=1)
0 1 2
0 1 6 9
1 4 15 18
2 7 24 27
3 10 33 36
EDIT: As the question has been updated, I have updated my answer too:
You have to align the matrices first to be able to do an element-wise operation without using any loop:
import numpy as np
a = [
[1,2,3],
[4,5,6],
[7,8,9],
[10,11,12]
]
weights = [
[1,3,3],
[2,2,2],
[3,1,1]
]
w_s = np.array( (4 * [weights[0]], 4 * [weights[1]], 4 * [weights[2]]) )  # shape (3, 4, 3)
a_s = np.array(3 * [a])  # shape (3, 4, 3)
# Element-wise product; result[k] is a weighted by weights[k]
result = w_s * a_s
print(result)
Output:
[[[ 1 6 9]
[ 4 15 18]
[ 7 24 27]
[10 33 36]]
[[ 2 4 6]
[ 8 10 12]
[14 16 18]
[20 22 24]]
[[ 3 2 3]
[12 5 6]
[21 8 9]
[30 11 12]]]
This solution uses numpy, but you can do it with pandas as well, if you prefer, of course.
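For completeness, here is a minimal sketch (not from either answer) of the whole pipeline in plain NumPy: broadcasting replaces the tiling, and the winning row is picked by row sum, as the question asks:
import numpy as np
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
weights = np.array([[1, 3, 3], [2, 2, 2], [3, 1, 1]])
# products[k] == a * weights[k]; shape (3, 4, 3), no tiling needed
products = weights[:, None, :] * a
# Row sums per weighting, shape (3, 4), then the best weighting per row
best = products.sum(axis=2).argmax(axis=0)
# Assemble the final rows: [[1, 6, 9], [4, 15, 18], [7, 24, 27], [10, 33, 36]]
final = products[best, np.arange(a.shape[0])]
print(final)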

Shifting the location of tensor3 elements based on an offset vector

I have a Theano tensor3 (i.e., a 3-dimensional array) x:
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
as well as a Theano vector (i.e., a 1-dimensional array) y, which we will refer to as an "offset" vector, since it specifies the desired offset:
[2, 1]
I want to shift the location of the elements of x based on vector y, so that the output is as follows (the shift is performed on the second dimension):
[[[ a b c d]
[ e f g h]
[ 0 1 2 3]]
[[ i j k l]
[12 13 14 15]
[16 17 18 19]]]
where the a, b, …, l could be any number.
For example, a valid output could be:
[[[ 0 0 0 0]
[ 0 0 0 0]
[ 0 1 2 3]]
[[ 0 0 0 0]
[12 13 14 15]
[16 17 18 19]]]
Another valid output could be:
[[[ 4 5 6 7]
[ 8 9 10 11]
[ 0 1 2 3]]
[[20 21 22 23]
[12 13 14 15]
[16 17 18 19]]]
I am aware of the function theano.tensor.roll(x, shift, axis=None); however, shift can only take a scalar as input, i.e., it shifts all elements by the same offset.
E.g., the code:
import theano.tensor
from theano import shared
import numpy as np
x = shared(np.arange(24).reshape((2,3,4)))
print('theano.tensor.roll(x, 2, axis=1).eval(): \n{0}'.
      format(theano.tensor.roll(x, 2, axis=1).eval()))
outputs:
theano.tensor.roll(x, 2, axis=1).eval():
[[[ 4 5 6 7]
[ 8 9 10 11]
[ 0 1 2 3]]
[[16 17 18 19]
[20 21 22 23]
[12 13 14 15]]]
which is not what I want.
How can I shift the location of tensor3 elements based on an offset vector? (note that in the code provided in this example, the tensor3 is a shared variable for convenience, but in my actual code it will be a symbolic variable)
I couldn't find any dedicated function for that purpose, so I simply ended up using theano.scan:
import theano
import theano.tensor
from theano import shared
import numpy as np
y = shared(np.array([2,1]))
x = shared(np.arange(24).reshape((2,3,4)))
print('x.eval():\n{0}\n'.format(x.eval()))
def shift_and_reverse_row(matrix, y):
    '''
    Shift and reverse the matrix in the direction of the first dimension (i.e., rows)
    matrix: matrix
    y: scalar
    '''
    new_matrix = theano.tensor.zeros_like(matrix)
    new_matrix = theano.tensor.set_subtensor(new_matrix[:y,:], matrix[y-1::-1,:])
    return new_matrix

new_x, updates = theano.scan(shift_and_reverse_row, outputs_info=None,
                             sequences=[x, y[::-1]])
new_x = new_x[:, ::-1, :]
print('new_x.eval(): \n{0}'.format(new_x.eval()))
output:
x.eval():
[[[ 0 1 2 3]
[ 4 5 6 7]
[ 8 9 10 11]]
[[12 13 14 15]
[16 17 18 19]
[20 21 22 23]]]
new_x.eval():
[[[ 0 0 0 0]
[ 0 0 0 0]
[ 0 1 2 3]]
[[ 0 0 0 0]
[12 13 14 15]
[16 17 18 19]]]
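As a sanity check, here is a plain-NumPy sketch of the zero-fill variant (not part of the original answer); it reproduces the expected output above, and the Python loop over the two batch elements is acceptable since this is only a reference:
import numpy as np
x = np.arange(24).reshape(2, 3, 4)
y = np.array([2, 1])
out = np.zeros_like(x)
rows = np.arange(x.shape[1])
for i, off in enumerate(y):
    # Move row j to row j + off; rows shifted past the end are dropped
    keep = rows + off < x.shape[1]
    out[i, rows[keep] + off] = x[i, rows[keep]]
print(out)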

Removing rows from a multi dimensional numpy array

I have a rather big 3-dimensional numpy array (2000, 2500, 32) that I need to manipulate. Some rows are bad, so I need to delete them.
In order to detect which row is "bad" I am using the following function:
def badDetect(x):
    for i in xrange(10, 19):
        ptp = np.ptp(x[i*100:(i+1)*100])
        if ptp < 0.01:
            return True
    return False
which marks as bad any sequence of 2000 values that contains a run of 100 values whose peak-to-peak value is less than 0.01.
When this is the case I want to remove that sequence of 2000 values (which can be selected from numpy with a[:, x, y]).
numpy.delete seems to accept indexes, but only for 2-dimensional arrays.
You will definitely have to reshape your input array, because cutting out "rows" from a 3D cube leaves a structure that cannot be properly addressed.
As we don't have your data, I'll use a different example first to explain how this possible solution works:
>>> import numpy as np
>>> from numpy.lib.stride_tricks import as_strided
>>>
>>> threshold = 18
>>> a = np.arange(5*3*2).reshape(5,3,2) # your dataset of 2000x2500x32
>>> # Taint the data:
... a[0,0,0] = 5
>>> a[a==22]=20
>>> print(a)
[[[ 5 1]
[ 2 3]
[ 4 5]]
[[ 6 7]
[ 8 9]
[10 11]]
[[12 13]
[14 15]
[16 17]]
[[18 19]
[20 21]
[20 23]]
[[24 25]
[26 27]
[28 29]]]
>>> a2 = a.reshape(-1, np.prod(a.shape[1:]))
>>> print(a2) # Will prove to be much easier to work with!
[[ 5 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 20 23]
[24 25 26 27 28 29]]
As you can see from the representation above, it already becomes much clearer over which windows you want to compute the peak-to-peak value. And you'll need this form if you're going to remove "rows" (now transformed into columns) from this data structure, something you couldn't do in 3 dimensions!
>>> isize = a.itemsize # More generic, in case you have another dtype
>>> slice_size = 4 # How big each continuous slice is over which the Peak2Peak value is calculated
>>> slices = as_strided(a2,
... shape=(a2.shape[0] + 1 - slice_size, slice_size, a2.shape[1]),
... strides=(isize*a2.shape[1], isize*a2.shape[1], isize))
>>> print(slices)
[[[ 5 1 2 3 4 5]
[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 20 23]]
[[ 6 7 8 9 10 11]
[12 13 14 15 16 17]
[18 19 20 21 20 23]
[24 25 26 27 28 29]]]
So I took, as an example, a window size of 4 elements: If the peak to peak value within any of these 4 element slices (per dataset, so per column) is less than a certain threshold, I want to exclude it. That can be done like this:
>>> mask = np.all(slices.ptp(axis=1) >= threshold, axis=0) # These are the ones that are of interest
>>> print(a2[:,mask])
[[ 1 2 3 5]
[ 7 8 9 11]
[13 14 15 17]
[19 20 21 23]
[25 26 27 29]]
You can now clearly see that the tainted data has been removed. But remember, you could not have simply removed that data from a 3D array (but you could've masked it then).
Obviously, you'll have to set the threshold to .01 in your use-case, and the slice_size to 100.
Beware: while the as_strided form is extremely memory-efficient, computing the peak-to-peak values of this array and storing that result does require a good amount of memory in your case: 1901x(2500x32) values in the worst case, i.e., when you do not ignore the first 1000 slices. In your case, where you're only interested in the slices from 1000:1900, you would have to add that to the code like so:
mask = np.all(slices[1000:1900,:,:].ptp(axis=1) >= threshold, axis=0)
And that would reduce the memory required to store this mask to "only" 900x(2500x32) values (of whatever data type you were using).
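On newer NumPy (>= 1.20), numpy.lib.stride_tricks.sliding_window_view builds the same windowed view without manual stride arithmetic; here is a sketch using the toy data from above:
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view
threshold, slice_size = 18, 4
a = np.arange(5 * 3 * 2).reshape(5, 3, 2)
a[0, 0, 0] = 5  # taint the data, as above
a[a == 22] = 20
a2 = a.reshape(-1, np.prod(a.shape[1:]))
# Shape (2, 6, 4): one length-4 window per starting row, per column
windows = sliding_window_view(a2, slice_size, axis=0)
mask = np.all(np.ptp(windows, axis=-1) >= threshold, axis=0)
print(a2[:, mask])  # same surviving columns as with as_strided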
