Multidimension array indexing and column-accessing - python

I have a 3 dimensional array like
[[[ 1 4 4 ..., 952 0 0]
[ 2 4 4 ..., 33 0 0]
[ 3 4 4 ..., 1945 0 0]
...,
[4079 1 1 ..., 0 0 0]
[4080 2 2 ..., 0 0 0]
[4081 1 1 ..., 0 0 0]]
[[ 1 4 4 ..., 952 0 0]
[ 2 4 4 ..., 33 0 0]
[ 3 4 4 ..., 1945 0 0]
...,
[4079 1 1 ..., 0 0 0]
[4080 2 2 ..., 0 0 0]
[4081 1 1 ..., 0 0 0]]
.....
[[ 1 4 4 ..., 952 0 0]
[ 2 4 4 ..., 33 0 0]
[ 3 4 4 ..., 1945 0 0]
...,
[4079 1 1 ..., 0 0 0]
[4080 2 2 ..., 0 0 0]
[4081 1 1 ..., 0 0 0]]]
This array has total 5 data blocks. Each data blocks have 4081 lines and 9 columns.
My question here is about accessing to column, in data-block-wise.
I hope to index data-blocks, lines, and columns, and access to the columns, and do some works with if loops. I know how to access to columns in 2D array, like:
column_1 = [row[0] for row in inputfile]
but how can I access to columns per each data block?
I tried like ( inputfile = 3d array above )
for i in range(len(inputfile)):
AAA[i] = [row[0] for row in inputfile]
print AAA[2]
But it says 'name 'AAA' is not defined. How can I access to the column, for each data blocks? Should I need to make [None] arrays? Are there any other way without using empty arrays?
Also, how can I access to the specific elements of the accessed columns? Like AAA[i][j] = i-th datablock, and j-th line of first column. Shall I use one more for loop for line-wise accessing?
ps) I tried to analyze this 3d array in a way like
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile[i])): ### number of lines per a datablock = 4081
AAA = inputfile[i][j] ### Store first column for each datablocks to AAA
print AAA[0] ### Working as I intended to access 1st column.
print AAA[0][1] ### Not working, invalid index to scalar variable. I can't access to the each elemnt.
But this way, I cannot access to the each elements of 1st column, AAA[0]. How can I access to the each elements in here?
I thought maybe 2 indexes were not enough, so I used 3 for-loops as:
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile[i])): ### number of lines per a datablock = 4081
for k in range(len(inputfile[i][j])): ### number of columns per line = 9
AAA = inputfile[i][j][0]
print AAA[0]
Still, I cannot access to the each elements of 1st column, it says 'invalid index to scalar variable'. Also, AAA contains nine of each elements, just like
>>> print AAA
1
1
1
1
1
1
1
1
1
2
2
...
4080
4080
4080
4081
4081
4081
4081
4081
4081
4081
4081
4081
Like this, each elements repeats 9 times, which is not what I want.
I hope to use indices during my analysis, will use index as element during analysis. I want to access to the columns, and access to the each elements with all indices, in this 3d array. How can I do this?

A good practice in to leverage zip:
For example:
>>> a = [1,2,3]
>>> b = [4,5,6]
>>> for i in a:
... for j in b:
... print i, b
...
1 [4, 5, 6]
1 [4, 5, 6]
1 [4, 5, 6]
2 [4, 5, 6]
2 [4, 5, 6]
2 [4, 5, 6]
3 [4, 5, 6]
3 [4, 5, 6]
3 [4, 5, 6]
>>> for i,j in zip(a,b):
... print i,j
...
1 4
2 5
3 6

Unless you're using something like NumPy, Python doesn't have multi-dimensional arrays as such. Instead, the structure you've shown is a list of lists of lists of integers. (Your choice of inputfile as the variable name is confusing here; such a variable would usually contain a file handle, iterating over which would yield one string per line, but I digress...)
Unfortunately, I'm having trouble understanding exactly what you're trying to accomplish, but at one point, you seem to want a single list consisting of the first column of each row. That's as simple as:
column = [row[0] for block in inputfile for row in block]
Granted, this isn't really a column in the mathematical sense, but it might possibly perhaps be what you want.
Now, as to why your other attempts failed:
for i in range(len(inputfile)):
AAA[i] = [row[0] for row in inputfile]
print AAA[2]
As the error message states, AAA is not defined. Python doesn't let you assign to an index of an undefined variable, because it doesn't know whether that variable is supposed to be a list, a dict, or something more exotic. For lists in particular, it also doesn't let you assign to an index that doesn't yet exist; instead, the append or extend methods are used for that:
AAA = []
for i, block in enumerate(inputfile):
for j, row in enumerate(block):
AAA.append(row[0])
print AAA[2]
(However, that isn't quite as efficient as the list comprehension above.)
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile)): ### number of lines per a datablock = 4081
AAA = inputfile[i][j] ### Store first column for each datablocks to AAA
print AAA[0] ### Working as I intended to access 1st column.
print AAA[0][1] ### Not working, invalid index to scalar variable. I can't access to the each elemnt.
There's an obvious problem with the range in the second line, and an inefficiency in looking up inputfile[i] multiple times, but the real problem is in the last line. At this point, AAA refers to one of the rows of one of the blocks; for example, on the first time through, given your dataset above,
AAA == [ 1 4 4 ..., 952 0 0]
It's a single list, with no references to the data structure as a whole. AAA[0] works to access the number in the first column, 1, because that's how lists operate. The second column of that row will be in AAA[1], and so on. But AAA[0][1] throws an error, because it's equivalent to (AAA[0])[1], which in this case is equal to (1)[1], but numbers can't be indexed. (What's the second element of the number 1?)
for i in range(len(inputfile)): ### number of datablock = 5
for j in range(len(inputfile[i])): ### number of lines per a datablock = 4081
for k in range(len(inputfile[i][j])): ### number of columns per line = 9
AAA = inputfile[i][j][0]
print AAA[0]
This time, your for loops, though still inefficient, are at least correct, if you want to iterate over every number in the whole data structure. At the bottom, you'll find that inputfile[i][j][k] is integer k in row j in block i of the data structure. However, you're throwing out k entirely, and printing the first element of the row, once for each item in the row. (The fact that it's repeated exactly as many times as you have columns should have been a clue.) And once again, you can't index any further once you get to the integers; there is no inputfile[i][j][0][0].
Granted, once you get to an element, you can look at nearby elements by changing the indexes. For example, a three-dimensional cellular automaton might want to look at each of its neighbors. With proper corrections for the edges of the data, and checks to ensure that each block and row are the right length (Python won't do that for you), that might look something like:
for i, block in enumerate(inputfile):
for j, row in enumerate(block):
for k, num in enumerate(row):
neighbors = sum(
inputfile[i][j][k-1],
inputfile[i][j][k+1],
inputfile[i][j-1][k],
inputfile[i][j+1][k],
inputfile[i-1][j][k],
inputfile[i+1][j][k],
)
alive = 3 <= neigbors <= 4

Related

Creating submatrix in python

Given a matrix S and a binary matrix W, I want to create a submatrix of S corresponding to the non zero coordinates of W.
For example:
S = [[1,1],[1,2],[1,3],[1,4],[1,5]]
W = [[1,0,0],[1,1,0],[1,1,1],[0,1,1],[0,0,1]]
I want to get matrices
S_1 = [[1,1],[1,2],[1,3]]
S_2 = [[1,2],[1,3],[1,4]]
S_3 = [[1,3],[1,4],[1,5]]
I couldn't figure out a slick way to do this in python. The best I could do for each S_i is
S_1 = S[0,:]
for i in range(np.shape(W)[0]):
if W[i, 0] == 1:
S_1 = np.vstack((S_1, S[i, :]))
but if i want to change the dimensions of the problem and have, say, 100 S_i's, writing a for loop for each one seems a bit ugly. (Side note: S_1 should be initialized to some empty 2d array but I couldn't get that to work, so initialized it to S[0,:] as a placeholder).
EDIT: To clarify what I mean:
I have a matrix S
1 1
1 2
1 3
1 4
1 5
and I have a binary matrix
1 0 0
1 1 0
1 1 1
0 1 1
0 0 1
Given the first column of the binary matrix W
1
1
1
0
0
The 1's are in the first, second, and third positions. So I want to create a corresponding submatrix of S with just the first, second and third positions of every column, so S_1 (corresponding to the 1st column of W) is
1 1
1 2
1 3
Similarly, if we look at the third column of W
0
0
1
1
1
The 1's are in the last three coordinates and so I want a submatrix of S with just the last three coordinates of every column, called S_3
1 3
1 4
1 5
So given any ith column of the binary matrix, I'm looking to generate a submatrix S_i where the columns of S_i contain the columns of S, but only the entries corresponding to the positions of the 1's in the ith column of the binary matrix.
It probably is more useful to work with the transpose of W rather than W itself, both for human-readability and to facilitate writing the code. This means that the entries that affect each S_i are grouped together in one of the inner parentheses of W, i.e. in a row of W rather than a column as you have it now.
Then, S_i = np.array[S[j,:] for j in np.shape(S)[0] if W_T[i,j] == 1], where W_T is the transpose of W. If you need/want to stick with W as is, you need to reverse the indices i and j.
As for the outer loop, you could try to nest this in another similar comprehension without an if statement--however this might be awkward since you aren't actually building one output matrix (the S_i can easily be different dimensions, unless you're somehow guaranteed to have the same number of 1s in every column of W). This in fact raises the question of what you want--a list of these arrays S_i? Otherwise if they are separate variables as you have it written, there's no good way to refer to them in a generalizable way as they don't have indices.
Numpy can do this directly.
import numpy as np
S = np.array([[1,1],[1,2],[1,3],[1,4],[1,5]])
W = np.array([[1,0,0],[1,1,0],[1,1,1],[0,1,1],[0,0,1]])
for row in range(W.shape[1]):
print(S[W[:,row]==1])
Output:
[[1 1]
[1 2]
[1 3]]
[[1 2]
[1 3]
[1 4]]
[[1 3]
[1 4]
[1 5]]

Sorting a random array using permutation

I tried to sort an array by permuting it with itself
(the array contain all the numbers in range between 0 to its length-1)
so to test it I used random.shuffle but it had some unexpected results
a = np.array(range(10))
random.shuffle(a)
a = a[a]
a = a[a]
print(a)
# not a sorted array
# [9 5 2 3 1 7 6 8 0 4]
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
so for some reason the permutation when using the second example of an unsorted array returns the sorted array as expected but the shuffled array doesn't work the same way.
Does anyone know why? Or if there is an easier way to sort using permutation or something similar it would be great.
TL;DR
There is no reason to expect a = a[a] to sort the array. In most cases it won't. In case of a coincidence it might.
What is the operation c = b[a]? or Applying a permutation
When you use an array a obtained by shuffling range(n) as a mask for an array b of same size n, you are applying a permutation, in the mathematical sense, to the elements of b. For instance:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
print(b[a])
# ['Charlie' 'Alice' 'Bob']
In this example, array a represents the permutation (2 0 1), which is a cycle of length 3. Since the length of the cycle is 3, if you apply it three times, you will end up where you started:
a = [2,0,1]
b = np.array(['Alice','Bob','Charlie'])
c = b
for i in range(3):
c = c[a]
print(c)
# ['Charlie' 'Alice' 'Bob']
# ['Bob' 'Charlie' 'Alice']
# ['Alice' 'Bob' 'Charlie']
Note that I used strings for the elements of b ton avoid confusing them with indices. Of course, I could have used numbers from range(n):
a = [2,0,1]
b = np.array([0,1,2])
c = b
for i in range(3):
c = c[a]
print(c)
# [2 0 1]
# [1 2 0]
# [0 1 2]
You might see an interesting, but unsurprising fact: The first line is equal to a; in other words, the first result of applying a to b is equal to a itself. This is because b was initialised to [0 1 2], which represent the identity permutation id; thus, the permutations that we find by repeatedly applying a to b are:
id == a^0
a
a^2
a^3 == id
Can we always go back where we started? or The rank of a permutation
It is a well-known result of algebra that if you apply the same permutation again and again, you will eventually end up on the identity permutation. In algebraic notations: for every permutation a, there exists an integer k such that a^k == id.
Can we guess the value of k?
The minimum value of k is called the rank of a permutation.
If a is a cycle, then the minimum possible k is the length of the cycle. In our previous example, a was a cycle of length 3, so it took three applications of a before we found the identity permutation again.
How about a cycle of length 2? A cycle of length 2 is just "swapping two elements". For instance, swapping elements 0 and 1:
a = [1,0,2]
b = np.array([0,1,2])
c = b
for i in range(2):
c = c[a]
print(c)
# [1 0 2]
# [0 1 2]
We swap 0 and 1, then we swap them back.
How about two disjoint cycles? Let's try a cycle of length 3 on the first three elements, simultaneously with swapping the last two elements:
a = [2,0,1,3,4,5,7,6]
b = np.array([0,1,2,3,4,5,6,7])
c = b
for i in range(6):
c = c[a]
print(c)
# [2 0 1 3 4 5 7 6]
# [1 2 0 3 4 5 6 7]
# [0 1 2 3 4 5 7 6]
# [2 0 1 3 4 5 6 7]
# [1 2 0 3 4 5 7 6]
# [0 1 2 3 4 5 6 7]
As you can see by carefully examining the intermediary results, there is a period of length 3 on the first three elements, and a period of length 2 on the last two elements. The overall period is the least common multiple of the two periods, which is 6.
What is k in general? A well-known theorem of algebra states: every permutation can be written as a product of disjoint cycles. The rank of a cycle is the length of the cycle. The rank of a product of disjoint cycles is the least common multiple of the ranks of cycles.
A coincidence in your code: sorting [2,1,4,7,6,5,0,3,8,9]
Let us go back to your python code.
a = np.array([2,1,4,7,6,5,0,3,8,9])
a = a[a]
a = a[a]
print(a)
# [0 1 2 3 4 5 6 7 8 9]
How many times did you apply permutation a? Note that because of the assignment a =, array a changed between the first and the second lines a = a[a]. Let us dissipate some confusion by using a different variable name for every different value. Your code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a2 = a[a]
a4 = a2[a2]
print(a4)
Or equivalently:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = (a[a])[a[a]]
This last line looks a little bit complicated. However, a cool result of algebra is that composition of permutations is associative. You already knew that addition and multiplication were associative: x+(y+z) == (x+y)+z and x(yz) == (xy)z. Well, it turns out that composition of permutations is associative as well! Using numpy's masks, this means that:
a[b[c]] == (a[b])[c]
Thus your python code is equivalent to:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = ((a[a])[a])[a]
print(a4)
Or without the unneeded parentheses:
a = np.array([2,1,4,7,6,5,0,3,8,9])
a4 = a[a][a][a]
print(a4)
Since a4 is the identity permutation, this tells us that the rank of a divides 4. Thus the rank of a is 1, 2 or 4. This tells us that a can be written as a product of swaps and length-4 cycles. The only permutation of rank 1 is the identity itself. Permutations of rank 2 are products of disjoint swaps, and we can see that this is not the case of a. Thus the rank of a must be exactly 4.
You can find the cycles by choosing an element, and following its orbit: what values is that element successively transformed into? Here we see that:
0 is transformed into 2; 2 is transformed into 4; 4 is transformed into 6; 6 is transformed into 0;
1 remains untouched;
3 becomes 7; 7 becomes 3;
5 is untouched; 8 and 9 are untouched.
Conclusion: Your numpy array represents the permutation (0 -> 2 -> 4 -> 6 -> 0)(3 <-> 7), and its rank is the least common multiple of 4 and 2, lcm(4,2) == 4.
it's took some time but I figure a way to do it.
numpy doesn't have this fiture but panda does have.
by using df.reindex I can sort a data frame by it indexes
import pandas as pd
import numpy as np
train_df = pd.DataFrame(range(10))
train_df = train_df.reindex(np.random.permutation(train_df.index))
print(train_df) # random dataframe contaning all values up to 9
train_df = train_df.reindex(range(10))
print(train_df) # sort data frame

Split an array into several arrays by defined boundaries, python

I have a numpy array which consists of 64 columns and 49 rows. Each row stands for a separate message and contains several pieces of information. When an information starts or ends can be recognized by the change of the value. In the following an excerpt of the numpy array:
[[1 1 0 0 2 2 2 2 1 0 0 0 0 2 ... 2 2 2]
[0 0 0 2 2 2 2 2 2 2 2 2 2 2 ... 2 2 2]
[2 0 0 1 2 0 0 0 0 0 0 0 0 0 ... 1 1 0]
.
.
.
[0 1 0 1 0 1 0 1 0 0 0 0 0 0 ... 2 2 2]]
The first information of the first signal therefore takes the first two positions [11]. By changing the value from 1 to 0 I know that the second information is in the third and fourth position [00]. The third information occupies the following four positions [2222]. The next information consists only of [1]. And so on...
Once I have identified the positions of each information of a signal I have to apply these boundaries to my signal numpy arrays. My first binary signal numpy array consists of 64 columns and 3031 rows:
[[1 1 0 0 0 0 0 1 0 1 0 1 0 0 ... 1 0 0 1]
[1 0 1 0 1 1 1 1 1 0 0 1 1 0 ... 1 1 1 0]
[0 1 0 1 1 1 0 0 1 0 0 1 1 1 ... 1 1 1 0]
.
.
.
[1 0 1 0 0 1 0 0 0 0 1 1 0 1 ... 1 1 1 0]]
My first array (first information from the first signal) consists of the first two positions as determined by the previous array. The output should look like this:
[[11]
[10]
[01]
.
.
.
[10]]
The output of the second array (third and fourth position) should be the following:
[[00]
[10]
[01]
.
.
.
[10]]
The output of the third array:
[[0001]
[1111]
[1100]
.
.
.
[0100]]
Unfortunately I do not know how to create and apply the initial boundaries of the first array to the binary arrays. Does anyone have a solution for this?
Thanks for the help!
Sorry, I placed the hint of where you should create a loop at the wrong place. See if this code works: (I tried to explain numpy slicing a little in comments but can learn more here: Numpy indexing and slicing
import itertools
import numpy as np
# Def to reshape signals according to message
def reshape(lst1, lst2):
iterator = iter(lst2)
return [[next(iterator) for _ in sublist]
for sublist in lst1]
# Arrays
array_1 = np.array([[1,1,0,0,2,2,2,2,1,0,0,0,0,2],
[0,0,0,2,2,2,2,2,2,2,2,2,2,2],
[2,0,0,1,2,0,0,0,0,0,0,0,0,0],
[0,1,0,1,0,1,0,1,0,0,0,0,0,0]])
array_2 = np.array([[1,1,0,0,0,0,0,1,0,1,0,1,0,0],
[1,0,1,0,1,1,1,1,1,0,0,1,1,0],
[0,1,0,1,1,1,0,0,1,0,0,1,1,1],
[1,0,1,0,0,1,0,0,0,0,1,1,0,1]])
#Group messages into pieces of information
signal_list = []
for lists in array_1:
signal_list.append([list(group) for key, group in itertools.groupby(lists)])
#Index for all message
All_messages={}
#Do this for each message:
for rows in range(len(array_1)):
#Reshapes each signal according to current message
signals_reshape = (np.array([reshape(signal_list[rows], array_2[i]) for i in range(len(array_2))]))
# Create list to append all signals in current message
all_signal = []
# Do this for each information block
for i in range(len(signals_reshape[rows])):
'''
Append information blocks
1st [:] = retrieve in all signals
2nd [:] = retrieve the whole signal
3rd [:,i] = retrieve information block from specific column
Example: signals_reshape[0][0][0] retrieves the first information element of first information block of the fisrt signal
signals_reshape[0][0][:] retrieves all the information elements from the first information block from the first signal
signals_reshape[:][:][:,0] retrieves the first information block from all the signals
'''
all_signal.append(signals_reshape[:][:][:,i].flatten())
# add message information to dictionary (+ 1 is so that the names starts at Message1 and not Message0
All_messages["Message{0}".format(rows+1)] = all_signal
print(All_messages['Message1'])
print(All_messages['Message2'])
print(All_messages['Message3'])
print(All_messages['Message4'])
See if this can help you. This example returns the information for the 1st message, but you should be able to create a loop for all 49 messages and assign it to a new list.
import itertools
import numpy as np
# Def to reshape signals according to message
def reshape(lst1, lst2):
iterator = iter(lst2)
return [[next(iterator) for _ in sublist]
for sublist in lst1]
# Arrays
array_1 = np.array([[1,1,0,0,2,2,2,2,1,0,0,0,0,2],
[0,0,0,2,2,2,2,2,2,2,2,2,2,2],
[2,0,0,1,2,0,0,0,0,0,0,0,0,0]])
array_2 = np.array([[1,1,0,0,0,0,0,1,0,1,0,1,0,0],
[1,0,1,0,1,1,1,1,1,0,0,1,1,0],
[0,1,0,1,1,1,0,0,1,0,0,1,1,1]])
#Group messages into pieces of information
signal_list = []
for lists in array_1:
signal_list.append([list(group) for key, group in itertools.groupby(lists)])
# reshapes signals for each message to be used
signals_reshape = np.array([reshape(signal_list[0], array_2[i]) for i in range(len(array_2))])
print(signals_reshape[0])
# Get information from first message (Can create a loop to do the same for all 49 messages)
final_list_1 = []
for i in range(len(signals_reshape[0])):
final_list_1.append(signals_reshape[:][:][:,i].flatten())
print(final_list_1[0])
print(final_list_1[1])
print(final_list_1[2])
Output:
final_list_1[0]
[list([1, 1]), list([1, 0]), list([0, 1])]
final_list_1[1]
[list([0, 0]), list([1, 0]), list([0, 1])]
final_list_1[2]
[list([0, 0, 0, 1]) list([1, 1, 1, 1]) list([1, 1, 0, 0])]
Credits to #Kempie. He has solved the problem. I just adapted his code to my needs, shortened it a bit and fixed some small bugs.
import itertools
import numpy as np
# Def to reshape signals according to message
def reshape(lst1, lst2):
iterator = iter(lst2)
return [[next(iterator) for _ in sublist]
for sublist in lst1]
# Arrays
array_1 = np.array([[1,1,0,0,2,2,2,2,1,0,0,0,0,2],
[0,0,0,2,2,2,2,2,2,2,2,2,2,2],
[2,0,0,1,2,0,0,0,0,0,0,0,0,0],
[0,1,0,1,0,1,0,1,0,0,0,0,0,0]])
#in my case, array_2 was a list (see difference of code to Kempies solutions)
array_2 = np.array([[1,1,0,0,0,0,0,1,0,1,0,1,0,0],
[1,0,1,0,1,1,1,1,1,0,0,1,1,0],
[0,1,0,1,1,1,0,0,1,0,0,1,1,1],
[1,0,1,0,0,1,0,0,0,0,1,1,0,1]])
#Group messages into pieces of information
signal_list = []
for lists in array_1:
signal_list.append([list(group) for key, group in itertools.groupby(lists)])
signals_reshape_list = []
#Do this for each message (as array_2 is a list, we must work with indices):
for rows in range(len(array_1)):
#Reshapes each signal according to current message
signals_reshape = (np.array([reshape(signal_list[rows], array_2[rows][i]) for i in range(len(array_2[rows]))]))
signals_reshape_list.append(signals_reshape)
#print first signal of third message e.g.
print(signals_reshape_list[2][:,0]

Pandas : determine mapping from unique rows to original dataframe

Given the following inputs:
In [18]: input
Out[18]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
2 1 5 9 1
3 1 5 9 1
In [26]: df = input.drop_duplicates()
Out[26]:
1 2 3 4
0 1 5 9 1
1 2 6 10 2
How would I go about getting an array that has the indices of the rows from the subset that are equivalent, eg:
resultant = [0, 1, 0, 0]
I.e. the '1' here is basically stating that (row[1] in input) == (row[1] in df). Since there will be fewer unique rows than there will be multiple values in 'resultant' that will equate to similar values in df. i.e (row[k] in input == row[k+N] in input) == (row[1] in df) could be a case.
I am looking for actual row number mapping from input:df.
While this example is trivial in my case i have a ton of dropped mappings that might map to one index as an example.
Why do I want this? I am training an autoencoder type system where the target sequence is non-unique.
One way would be to treat it as a groupby on all columns:
>> df.groupby(list(df.columns)).groups
{(1, 5, 9, 1): [0, 2, 3], (2, 6, 10, 2): [1]}
Another would be to sort and then compare, which is less efficient in theory but could very well be faster in some cases and is definitely easier to make more tolerant of error:
>>> ds = df.sort(list(df.columns))
>>> eqs = (ds != ds.shift()).all(axis=1).cumsum()
>>> ds.index.groupby(eqs)
{1: [0, 2, 3], 2: [1]}
This seems the right datastructure to me, but if you really do want an array with the group ids, that's easy too, e.g.
>>> eqs.sort_index() - 1
0 0
1 1
2 0
3 0
dtype: int64
Don't have pandas installed on this computer, but I think you could use df.iterrows() like:
def find_matching_row(row, df_slimmed):
for index, slimmed_row in df_slimmed.iterrows():
if slimmed_row.equals(row[slimmed_row.columns]):
return index
def rows_mappings(df, df_slimmed):
for _, row in df.iterrows():
yield find_matching_row(row, df_slimmed)
list(rows_mappings(df, input))
This is if you are interested in generating the resultant list in your example, I don't quite follow the latter part of your reasoning.

indexing numpy multidimensional arrays

I need to access this numpy array, sometimes with only the rows where the last column is 0, and sometimes the rows where the value of the last column is 1.
y = [0 0 0 0
1 2 1 1
2 -6 0 1
3 4 1 0]
I have to do this over and over, but would prefer to shy away from creating duplicate arrays or having to recalculate each time. Is there someway that I can identify the indices concerned and just call them? So that I can do this:
>>print y[LAST_COLUMN_IS_0]
[0 0 0 0
3 4 1 0]
>>print y[LAST_COLUMN_IS_1]
[1 2 1 1
2 -6 0 1]
P.S. The number of columns in the array never changes, it's always going to have 4 columns.
You can use numpy's boolean indexing to identify which rows you want to select, and numpy's fancy indexing/slicing to select the whole row.
print y[y[:,-1] == 0, :]
print y[y[:,-1] == 1, :]
You can save y[:,-1] == 0 and ... == 1 as usual, since they are just numpy arrays.
(The y[:,-1] selects the whole of the last column, and the == equality check happens element-wise, resulting in an array of booleans.)

Categories