extracting vectors from an array using logical indexing - python

I have the following numpy arrays:
a truth table of (nx1), and a matrix of (nxk) where n is 5 and k is 2 in this example.
btable = np.array([[True],[False],[False],[True],[True]])
bb=np.array([[1.842,4.607],[5.659,4.799],[6.352,3.290],[2.904,4.612],[3.231,4.939]])
I would like to extract the vectors in bb according the indexing values in btable.
I tried choicebb=bb[btable==True] which gets me the result
[ 1.84207953 2.90401653 3.23197916]
choicebb=bb[btable] gets me the same results as well.
What I want instead is
[[1.842,4.607]
[2.904,4.612]
[3.231,4.939]]
I also tried
choicebb=bb[btable==True,:]
but then I would get
---> 13 choicebb=bb[btable==True,:]
14 print(choicebb)
IndexError: too many indices for array
This can be easily done in matlab with choicebb=bb(btable,:);

Get the 1D version of the mask with np.ravel() or slice out the first column with [:,0] and use it for logical indexing into the data array, like so -
bb[btable.ravel()]
bb[btable[:,0]]
Note that bb[btable.ravel()] is essentially - bb[btable.ravel(),:]. In NumPy, we could skip mentioning the trailing axes if all elements are to be selected, that's why it simplified to bb[btable.ravel()].
Explanataion : To index into a single axis and such that it select all elements along the rest of the axes, we need to feed in a 1D array (boolean or integer array) along that axis and use : along the leftover axes. In our case, we are indexing into the first axis to select rows, so we need to feed in a boolean array along that axis and : along the rest of axes.
When we are feeding the 2D version of the mask, it indexes along those corresponding multiple axes. So, when we feed in (N,1) shaped boolean array, we are selecting correct rows, but also only selecting the first column elements, which is not the intended output.

Related

What is the Numpy slicing notation in this code?

# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
Can someone explain the second line of code with reference to specific documentation? I know its slicing but the I couldn't find any reference for the notation ":-1" anywhere. Please give the specific documentation portion.
Thank you
It results in slicing, most probably using numpy and it is being done on a data of shape (610, 14)
Per the docs:
Indexing on ndarrays
ndarrays can be indexed using the standard Python x[obj] syntax, where x is the array and obj the selection. There are different kinds of indexing available depending on obj: basic indexing, advanced indexing and field access.
1D array
Slicing a 1-dimensional array is much like slicing a list
import numpy as np
np.random.seed(0)
array_1d = np.random.random((5,))
print(len(array_1d.shape))
1
NOTE: The len of the array shape tells you the number of dimensions.
We can use standard python list slicing on the 1D array.
# get the last element
print(array_1d[-1])
0.4236547993389047
# get everything up to but excluding the last element
print(array_1d[:-1])
[0.5488135 0.71518937 0.60276338 0.54488318]
2D array
array_2d = np.random.random((5, 1))
print(len(array_2d.shape))
2
Think of a 2-dimensional array like a data frame. It has rows (the 0th axis) and columns (the 1st axis). numpy grants us the ability to slice these axes independently by separating them with a comma (,).
# the 0th row and all columns
# the 0th row and all columns
print(array_2d[0, :])
[0.79172504]
# the 1st row and everything after + all columns
print(array_2d[1:, :])
[[0.52889492]
[0.56804456]
[0.92559664]
[0.07103606]]
# the 1st through second to last row + the last column
print(array_2d[1:-1, -1])
[0.52889492 0.56804456 0.92559664]
Your Example
# split into inputs and outputs
X, y = data[:, :-1], data[:, -1]
print(X.shape, y.shape)
Note that data.shape is >= 2 (otherwise you'd get an IndexError).
This means data[:, :-1] is keeping all "rows" and slicing up to, but not including, the last "column". Likewise, data[:, -1] is keeping all "rows" and selecting only the last "column".
It's important to know that when you slice an ndarray using a colon (:), you will get an array with the same dimensions.
print(len(array_2d[1:, :-1].shape)) # 2
But if you "select" a specific index (i.e. don't use a colon), you may reduce the dimensions.
print(len(array_2d[1, :-1].shape)) # 1, because I selected a single index value on the 0th axis
print(len(array_2d[1, -1].shape)) # 0, because I selected a single index value on both the 0th and 1st axes
You can, however, select a list of indices on either axis (assuming they exist).
print(len(array_2d[[1], [-1]].shape)) # 1
print(len(array_2d[[1, 3], :].shape)) # 2
This slicing notation is explained here https://docs.python.org/3/tutorial/introduction.html#strings
-1 means last element, -2 - second from last, etc. For example, if there are 8 elements in a list, -1 is equivalent to 7 (not 8 because indexing starts from 0)
Keep in mind that "normal" python slicing for nested lists looks like [1:3][5:7], while numpy arrays also have a slightly different syntax ([8:10, 12:14]) that lets you slice multidimensional arrays. However, -1 always means the same thing. Here is the numpy documentation for slicing https://numpy.org/doc/stable/user/basics.indexing.html

Is there a way to write a python function that will create 'N' arrays? (see body)

I have an numpy array that is shape 20, 3. (So 20 3 by 1 arrays. Correct me if I'm wrong, I am still pretty new to python)
I need to separate it into 3 arrays of shape 20,1 where the first array is 20 elements that are the 0th element of each 3 by 1 array. Second array is also 20 elements that are the 1st element of each 3 by 1 array, etc.
I am not sure if I need to write a function for this. Here is what I have tried:
Essentially I'm trying to create an array of 3 20 by 1 arrays that I can later index to get the separate 20 by 1 arrays.
a = np.load() #loads file
num=20 #the num is if I need to change array size
num_2=3
for j in range(0,num):
for l in range(0,num_2):
array_elements = np.zeros(3)
array_elements[l] = a[j:][l]
This gives the following error:
'''
ValueError: setting an array element with a sequence
'''
I have also tried making it a dictionary and making the dictionary values lists that are appended, but it only gives the first or last value of the 20 that I need.
Your array has shape (20, 3), this means it's a 2-dimensional array with 20 rows and 3 columns in each row.
You can access data in this array by indexing using numbers or ':' to indicate ranges. You want to split this in to 3 arrays of shape (20, 1), so one array per column. To do this you can pick the column with numbers and use ':' to mean 'all of the rows'. So, to access the three different columns: a[:, 0], a[:, 1] and a[:, 2].
You can then assign these to separate variables if you wish e.g. arr = a[:, 0] but this is just a reference to the original data in array a. This means any changes in arr will also be made to the corresponding data in a.
If you want to create a new array so this doesn't happen, you can easily use the .copy() function. Now if you set arr = a[:, 0].copy(), arr is completely separate to a and changes made to one will not affect the other.
Essentially you want to group your arrays by their index. There are plenty of ways of doing this. Since numpy does not have a group by method, you have to horizontally split the arrays into a new array and reshape it.
old_length = 3
new_length = 20
a = np.array(np.hsplit(a, old_length)).reshape(old_length, new_length)
Edit: It appears you can achieve the same effect by rotating the array -90 degrees. You can do this by using rot90 and setting k=-1 or k=3 telling numpy to rotate by 90 k times.
a = np.rot90(a, k=-1)

Converting 2D numpy array to 3D array without looping

I have a 2D array of shape (t*40,6) which I want to convert into a 3D array of shape (t,40,5) for the LSTM's input data layer. The description on how the conversion is desired in shown in the figure below. Here, F1..5 are the 5 input features, T1...40 are the time steps for LSTM and C1...t are the various training examples. Basically, for each unique "Ct", I want a "T X F" 2D array, and concatenate all along the 3rd dimension. I do not mind losing the value of "Ct" as long as each Ct is in a different dimension.
I have the following code to do this by looping over each unique Ct, and appending the "T X F" 2D arrays in 3rd dimension.
# load 2d data
data = pd.read_csv('LSTMTrainingData.csv')
trainX = []
# loop over each unique ct and append the 2D subset in the 3rd dimension
for index, ct in enumerate(data.ct.unique()):
trainX.append(data[data['ct'] == ct].iloc[:, 1:])
However, there are over 1,800,000 such Ct's so this makes it quite slow to loop over each unique Ct. Looking for suggestions on doing this operation faster.
EDIT:
data_3d = array.reshape(t,40,6)
trainX = data_3d[:,:,1:]
This is the solution for the original question posted.
Updating the question with an additional problem: the T1...40 time steps can have the highest number of steps = 40, but it could be less than 40 as well. The rest of the values can be 'np.nan' out of the 40 slots available.
Since all Ct have not the same length , you have no other choice than rebuild a new block.
But use of data[data['ct'] == ct] can be O(n²) so it's a bad way to do it.
Here a solution using Panel . cumcount renumber each Ct line :
t=5
CFt=randint(0,t,(40*t,6)).astype(float) # 2D data
df= pd.DataFrame(CFt)
df2=df.set_index([df[0],df.groupby(0).cumcount()]).sort_index()
df3=df2.to_panel()
This automatically fills missing data with Nan. But It warns :
DeprecationWarning:
Panel is deprecated and will be removed in a future version.
The recommended way to represent these types of 3-dimensional data are with a MultiIndex on a DataFrame, via the Panel.to_frame() method
So perhaps working with df2 is the recommended way to manage your data.

sampling 2-D numpy array using another array

I have a pretty big numpy matrix (2-D array with more than 1000 * 1000 cells), and another 2-D array of indexes in the following form: [[x1,y1],[x2,y2],...,[xn,yn]], which is also quite large (n > 1000). I want to extract all the cells in the matrix that their (x,y) coordinates appear in the array as efficient as possible, i.e. without loops. If the array was an array of tuples I could just to
cells = matrix[array]
and get what I want, but the array is not in that format, and I couldn't find an efficient way to convert it to the desired form...
You can make your array into a tuple of arrays like this:
tuple(array.T)
This matches the output style of np.where(), which can be indexed.
cells=matrix[tuple(array.T)]
you can also do standard numpy array slicing and get Divakar's answer in the comments:
cells=matrix[array[:,0],array[:,1]]

Numpy minimum with functional arg

Is there a way of computing a minimum index value of an array after application of a function (i.e. the equivalent of matlab find)?
In other words consider something like:
a = [1,-3,-10,3]
np.find_max(a,lambda x:abs(x))
Should return 2.
I could write a loop for this obviously but I assume it would be faster to use an inbuilt numpy function if one existed.
Use argmax, according to the documentation:
numpy.argmax(a, axis=None, out=None)
Returns the indices of
the maximum values along an axis.
Parameters: a : array_like Input array. axis : int, optional By
default, the index is into the flattened array, otherwise along the
specified axis. out : array, optional If provided, the result will be
inserted into this array. It should be of the appropriate shape and
dtype. Returns: index_array : ndarray of ints Array of indices into
the array. It has the same shape as a.shape with the dimension along
axis removed. See also ndarray.argmax, argmin
amax The maximum value along a given axis. unravel_index Convert a
flat index into an index tuple. Notes
In case of multiple occurrences of the maximum values, the indices
corresponding to the first occurrence are returned.
import numpy as np
a = [1, -3, -10, 3]
print(np.argmax(np.abs(a)))
Output:
2

Categories