Dataframe slicing with more than two dimensions - python

So I'm going through a machine learning tutorial and I'm met with this line of code:
pred_list = []
batch = train[-n_input:].reshape((1, n_input, n_features))
for i in range(n_input):
    pred_list.append(model.predict(batch)[0])
    batch = np.append(batch[:,1:,:], [[pred_list[i]]], axis=1)
Specifically, I don't understand what happens inside the for loop. I understand that the first line of code grabs the first value of whatever is predicted; this is only one value. Next it appends that value to the end of batch, and this is where I'm confused.
Why is batch written as batch[:,1:,:] in the second line of code? What does that mean? I'm not too sure about array slicing; can someone explain what the second line of code in the for loop means? It would be very much appreciated. Here's the article in question. Thank you for reading.

It seems to me batch is a NumPy array with 3 dimensions of shape (1, n_input, n_features): 1 row, n_input columns, and n_features depths. batch[:,1:,:] is a slice of batch that takes the second through last columns of batch (Python uses 0-based indexing). I am guessing these columns represent inputs, i.e. all the features of inputs 1 to last.
batch = np.append(batch[:,1:,:],[[pred_list[i]]],axis=1) appends [[pred_list[i]]] to that slice of batch along axis=1, which is the column axis. So I am guessing it removes the first input from batch, appends the new [[pred_list[i]]] as the last input to batch, and redoes this for all inputs in batch.
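To make the sliding-window behaviour concrete, here is a minimal sketch with made-up numbers (the shapes mirror the (1, n_input, n_features) batch; the values and the stand-in prediction are invented):

```python
import numpy as np

n_input, n_features = 3, 1
# a batch holding the last 3 observations: 10, 20, 30
batch = np.array([10., 20., 30.]).reshape((1, n_input, n_features))

pred = 40.0  # stand-in for model.predict(batch)[0]

# drop the oldest column, append the prediction as the newest one
batch = np.append(batch[:, 1:, :], [[[pred]]], axis=1)

print(batch.shape)    # (1, 3, 1) -- the window length is unchanged
print(batch.ravel())  # [20. 30. 40.]
```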

An ndarray can be indexed in two ways:
arr = np.array([[[1,2,3],
                 [3,4,5],
                 [7,8,9]]])
Either
arr[0][1][2] #row, col, layer
or
arr[0,1,2] #row, col, layer
The first index gives you the row, the second the column, the third the layer, and so on. Both methods give you the element present in the 1st row, 2nd column and 3rd layer, which is 5.
batch[:,1:,:] means you want all the rows, all the columns following the 1st column and all the layers.
P.S. I have used the word layers here; if you know a better word, do suggest.
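To make the slice concrete, here is what batch[:,1:,:]-style indexing does to the small array above:

```python
import numpy as np

arr = np.array([[[1, 2, 3],
                 [3, 4, 5],
                 [7, 8, 9]]])      # shape (1, 3, 3)

print(arr[0][1][2], arr[0, 1, 2])  # 5 5 -- the same element, two syntaxes

# all rows, all columns after the first, all layers
print(arr[:, 1:, :].shape)         # (1, 2, 3)
print(arr[:, 1:, :])
# [[[3 4 5]
#   [7 8 9]]]
```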

Related

Changing the values of sliced numpy array doesn't change the original data in it

I have a numpy array total_weights which is an IxI array of floats. Each row/column corresponds to one of I items.
During my main loop I acquire another real float array weights of size NxM (N, M < I) where each row/column also corresponds to one of the original I items (duplicates may also exist).
I want to add this array to total_weights. However, the sizes and order of the two arrays are not aligned. Therefore, I maintain a position map, a pandas Series called pos_df whose index is item IDs and whose values are their proper index/position in total_weights.
In order to properly make the addition I want I perform the following operation inside the loop:
candidate_pos = pos_df.loc[candidate_IDs] # don't worry about how I get these
rated_pos = pos_df.loc[rated_IDs] # ^^
total_weights[candidate_pos, :][:, rated_pos] += weights
Unfortunately, the above operation must be editing a copy of the original total_weights matrix and not a view of it, since after the loop the total_weights array is still full of zeroes. How do I make it change the original data?
Edit:
I want to clarify that candidate_IDs are the N IDs of items and rated_IDs are the M IDs of items in the NxM array called weights. Through pos_df I can get their total order in all of I items.
Also, my guess as to the reason a copy is returned is that candidate_IDs and thus candidate_pos will probably contain duplicates e.g. [0, 1, 3, 1, ...]. So the same rows will sometimes have to be pulled into the new array/view.
Your first problem is in how you are using indexing. As candidate_pos is an array, total_weights[candidate_pos, :] is a fancy indexing operation that returns a new array. When you apply indexing again, i.e. ...[:, rated_pos], you are assigning elements to that newly created array rather than to total_weights.
The second problem, as you have already spotted, is in the actual logic you are trying to apply. If I understand your example correctly, you have an I x I matrix of weights, and you want to update the weights for a sequence of pairs ((Ix_1, Iy_1), ..., (Ix_N, Iy_N)), with repetitions, in a single line of code. This can't be done this way with the += operator: you'd end up having added to weights[Ix_n, Iy_n] only the weight corresponding to the last time (Ix_n, Iy_n) appears in your sequence. You have to first merge all the repeating elements in your sequence of weight updates, and then perform the update of your weights matrix with the new "unique" sequence of updates. Alternatively, you can collect your updates as an I x I matrix and directly sum it to total_weights.
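A small sketch of the duplicate-index problem, with invented sizes. Note that np.add.at is NumPy's unbuffered variant of += that does accumulate at repeated indices, which is one way to perform the "merge the repetitions" step:

```python
import numpy as np

total = np.zeros(4)
idx = np.array([0, 1, 1, 3])  # index 1 repeats
w = np.ones(4)

# buffered +=: the two updates to index 1 collapse into one
total[idx] += w
print(total)  # [1. 1. 0. 1.]

# unbuffered np.add.at: repeated indices accumulate
total = np.zeros(4)
np.add.at(total, idx, w)
print(total)  # [1. 2. 0. 1.]
```

For the 2-D case, np.add.at(total_weights, (candidate_pos[:, None], rated_pos), weights) should apply the same idea, assuming the positions are plain integer arrays rather than pandas Series.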
After @rveronese pointed out that it's impossible to do it in one go because of the duplicates in candidate_pos, I believe I have managed to do what I want with a for-loop over them:
candidate_pos = pos_df.loc[candidate_IDs] # don't worry about how I get these
rated_pos = pos_df.loc[rated_IDs] # ^^
for i, c in enumerate(candidate_pos):
    total_weights[c, rated_pos] += weights[i, :]
In this case, the indexing does not create a copy and the assignment should be working as expected...

Mask certain indices for every entry in a batch, when using torch.max()

I am incrementally sampling a batch of size torch.Size([n, 8]).
I also have a list valid_indices of length n which contains tuples of indices that are valid for each entry in the batch.
For instance valid_indices[0] may look like this: (0,1,3,4,5,7), which suggests that indices 2 and 6 should be excluded from the first entry in the batch along dim 1.
In particular, I need to exclude these values when I use torch.max(batch, dim=1, keepdim=True).
Indices to be excluded (if any) may differ from entry to entry within the batch.
Any ideas? Thanks in advance.
I assume that you are getting the good old
IndexError: too many indices for tensor of dimension 1
error when you use your tuple indices directly on the tensor.
At least that was the error that I was able to reproduce when I executed the following line,
t[0][valid_idx0]
where t is a random tensor of size (10,8) and valid_idx0 is a tuple with 4 elements.
However, the same line works just fine when you convert your tuple to a list, as follows:
t[0][list(valid_idx0)]
>>> tensor([0.1847, 0.1028, 0.7130, 0.5093])
But when it comes to applying these indices to 2D tensors, things get a bit different, since we need to preserve the structure of our tensor for batch processing.
Therefore, it would be reasonable to convert our indices to masks.
Let's say we have a list of tuples valid_indices at hand. The first step is converting it to a list of lists:
valid_idx_list = [list(tup) for tup in valid_indices]
The second step is converting those index lists to mask tensors:
masks = torch.zeros(t.size())
for i, indices in enumerate(valid_idx_list):
    masks[i][indices] = 1
Done. Now we can apply the mask and use torch.max on the masked tensor.
torch.max(t*masks, dim=1, keepdim=True)
Kindly see the colab notebook that I've used to reproduce the problem.
https://colab.research.google.com/drive/1BhKKgxk3gRwUjM8ilmiqgFvo0sfXMGiK?usp=sharing
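One caveat with a multiplicative 0/1 mask: excluded positions become 0, which still wins the max whenever all valid values in a row are negative. The same masking idea, sketched here in NumPy with -inf so excluded entries can never be the maximum (the toy values are made up):

```python
import numpy as np

batch = np.array([[ 0.5, -2.0,  3.0, -1.0],
                  [-4.0, -3.0, -5.0, -6.0]])
valid_indices = [(0, 2, 3), (0, 1)]  # per-row indices to keep

# start from -inf and copy only the valid entries back in
masked = np.full_like(batch, -np.inf)
for i, idx in enumerate(valid_indices):
    masked[i, list(idx)] = batch[i, list(idx)]

print(masked.max(axis=1, keepdims=True))
# [[ 3.]
#  [-3.]]
```

For tensors, torch.Tensor.masked_fill with float('-inf') gives the same effect.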

Can someone explain the code happening inside the for loop? LSTM prediction

So I'm going through a machine learning model and I'm met with this code.
n_input=12
n_features=1
pred_list = []
batch = train[-n_input:].reshape(1,n_input,n_features)
for i in range(n_input):
    pred_list.append(model.predict(batch)[0]) #predict one value
    batch = np.append(batch[:,1:,:],[[pred_list[i]]],axis=1) #append to the end of the batch list
I understand that batch takes the values from index -n_input to the end, denoted by ":", then reshapes the array into (1,n_input,n_features), which in this case is (1,12,1).
To make a prediction, a for loop is used, and it loops 12 times, equal to n_input; n_input is the number of periods into the future I want to forecast.
That's where my confusion begins: I don't quite understand the code inside the for loop. Can someone explain the code inside the loop? Thanks for reading.
The article in question:https://medium.com/swlh/a-quick-example-of-time-series-forecasting-using-long-short-term-memory-lstm-networks-ddc10dc1467d
Let's say you have a list of 5 values:
[1,2,3,4,5]
You want to predict up to the 10th value.
The first line of that code makes a single prediction, 6.
The second line adds it to the previous list: batch[:,1:,:] drops the oldest value so the window slides forward by one, and [[pred_list[i]]] appends the most recent prediction to the end. So you end up with:
[2,3,4,5,6]
Then you'd restart the loop with the new batch to predict the 7th value.
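The loop can be traced with a stub in place of model.predict (the stub just returns last value + 1, purely to show the window moving; it is not the real model):

```python
import numpy as np

batch = np.array([1., 2., 3., 4., 5.]).reshape((1, 5, 1))

def fake_predict(b):
    # stand-in for model.predict: pretend the next value is last + 1
    return np.array([[b[0, -1, 0] + 1.0]])

pred_list = []
for i in range(5):
    pred_list.append(fake_predict(batch)[0])
    batch = np.append(batch[:, 1:, :], [[pred_list[i]]], axis=1)

print(batch.ravel())  # [ 6.  7.  8.  9. 10.]
```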

Python Code not clear (arrays)

I have the following lines:
Xtest = numpy.arange(-15,15,0.1)
Xtest = numpy.array([Xtest,Xtest*0+1]).T
Why does the second line look like this, i.e. why "Xtest*0+1"? I've tried
Xtest = numpy.array([Xtest,1]).T
and I get the same output except that at the end of the array I have "dtype=object". Why is that?
Also, it's not clear to me what happens when I try
Xtest = numpy.array([Xtest,Xtest*0]).T
The output is unclear to me. I thought I would get the Xtest column next to a column of 0's...
Finally, with
Xtest = numpy.array([Xtest,0]).T
why am I getting the second column with ones instead of zeros?
Since Xtest is an array, it has more than one entry. When you multiply it by zero, you get that many zeros; adding one then turns it into an array full of ones. In contrast, when you directly put in the scalar 1, you end up with a single 1, which is different: NumPy can't stack an array and a scalar into a regular 2-D numeric array, so it falls back to an array of Python objects, which is why you see dtype=object.
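A short demonstration of the three cases (range shortened so the arrays stay readable; note that recent NumPy versions require dtype=object explicitly for the mixed case, where older ones produced it silently):

```python
import numpy as np

X = np.arange(-2, 2, 1.0)          # [-2. -1.  0.  1.]

a = np.array([X, X*0 + 1]).T       # second column: X*0+1 -> all ones
print(a.dtype, a.shape)            # float64 (4, 2)

b = np.array([X, 1], dtype=object) # an array and a scalar: lengths differ,
print(b.dtype)                     # object -- NumPy can't build a 2-D
                                   # numeric array, so it stores the two
                                   # Python objects instead

c = np.array([X, X*0]).T           # second column: all zeros
print(c[:, 1])                     # [0. 0. 0. 0.]
```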

CNTK RuntimeError: AddSequence: Sequences must be a least one frame long

I am getting error in following code:
x = cntk.input_variable(shape=(8,3,1))
y = cntk.sequence.slice(x,1,0)
x0 = np.reshape(np.arange(48.0,dtype=np.float32),(2,8,1,3))
y.eval({x:x0})
Error : Sequences must be a least one frame long
But when I run this it runs fine :
x = cntk.input_variable(shape=(3,2)) #change
y = cntk.sequence.slice(x,1,0)
x0 = np.reshape(np.arange(24.0,dtype=np.float32),(1,8,1,3)) #change
y.eval({x:x0})
I am not able to understand a few things about the slice method:
At what array level is it going to slice?
According to the documentation, the second argument is begin_index, and the next is end_index. How can begin_index be greater than end_index?
There are two versions of slice(), one for slicing tensors, and one for slicing sequences. Your example uses the one for sequences.
If your inputs are sequences (e.g. words), the first form, cntk.slice(), would individually slice every element of a sequence and create a sequence of the same length that consists of sliced tensors. The second form, cntk.sequence.slice(), will slice out a range of entries from the sequence. E.g. cntk.sequence.slice(x, 13, 42) will cut out sequence items 13..41 from x, and create a new sequence of length (42-13).
If you intended to experiment with the first form, please change to cntk.slice(). If you meant the sequence version, please try to enclose x0 in an additional [...]. The canonical form of passing minibatch data is as a list of batch entries (e.g. an MB size of 128 --> a list with 128 entries), where each batch entry is a tensor of shape (Ti,) + input_shape, where Ti is the sequence length of the respective sequence. This:
x0 = [ np.reshape(np.arange(48.0,dtype=np.float32),(2,8,1,3)) ]
would denote a minibatch with a single entry (1 list entry), where the entry is a sequence of 2 sequence items, where each sequence item has shape (8,1,3).
The begin and end indices can be negative, in order to index from the end (similar to Python slices). Unlike Python however, 0 is a valid end index that refers to the end.
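cntk.sequence.slice itself needs a CNTK graph to evaluate, but the index semantics described above can be mimicked with a small plain-Python helper (this helper is made up for illustration; it is not part of the CNTK API):

```python
def sequence_slice(seq, begin, end):
    """Mimic the described cntk.sequence.slice semantics on a plain list:
    negative indices count from the end, and end == 0 refers to the end."""
    n = len(seq)
    if begin < 0:
        begin += n
    if end <= 0:
        end += n
    return seq[begin:end]

seq = ['a', 'b', 'c', 'd', 'e']
print(sequence_slice(seq, 1, 0))   # ['b', 'c', 'd', 'e'] -- like slice(x, 1, 0)
print(sequence_slice(seq, -1, 0))  # ['e'] -- the last item
print(sequence_slice(seq, 1, 3))   # ['b', 'c'] -- items 1..2
```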
