Vectorize Sequences explanation


While studying Deep Learning with Python, I can't understand the following short piece of code, which encodes the integer sequences into a binary matrix.
def vectorize_sequences(sequences, dimension=10000):
    # Create an all-zero matrix of shape (len(sequences), dimension)
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.  # set specific indices of results[i] to 1s
    return results
(train_data, train_labels), (test_data, test_labels) = imdb.load_data(num_words=10000)
x_train = vectorize_sequences(train_data)
And the output of x_train looks something like this:
x_train[0]
array([0., 1., 1., ..., 0., 0., 0.])
Can someone shed some light on why there are 0.'s in the x_train array when only 1.'s are assigned in each iteration over i?
I mean, shouldn't it be all 1's?

The script transforms your dataset into a binary vector space model. Let's dissect things one by one.
First, if we examine the content of train_data, we see that each review is represented as a sequence of word ids. Each word id corresponds to one specific word:
print(train_data[0]) # print the first review
[1, 14, 22, 16, 43, 530, 973, ..., 5345, 19, 178, 32]
Now, this would be very difficult to feed to the network. The lengths of the reviews vary, and fractional values between integers have no meaning (e.g. if the output is 43.5, what does that mean?).
So what we can do is create a single long vector the size of the entire dictionary (dimension=10000 in your example). We then associate each element/index of this vector with one word/word_id, so the word with word id 14 is now represented by the element at index 14 of this vector.
Each element will either be 0 (word is not present in the review) or 1 (word is present in the review). And we can treat this as a probability, so we even have meaning for values in between 0 and 1. Furthermore, every review will now be represented by this very long (sparse) vector which has a constant length for every review.
So on a smaller scale if:
word word_id
I -> 0
you -> 1
he -> 2
be -> 3
eat -> 4
happy -> 5
sad -> 6
banana -> 7
a -> 8
the sentences would then be processed in the following way.
I be happy -> [0,3,5] -> [1,0,0,1,0,1,0,0,0]
I eat a banana. -> [0,4,8,7] -> [1,0,0,0,1,0,0,1,1]
Note the word sparse above. It means there will be A LOT MORE zeros than ones, and we can take advantage of that. Instead of checking for every word whether it is contained in a review or not, we only go through the substantially smaller list of words that DO appear in the review.
Therefore, we can make things easy for ourselves and create a reviews × vocabulary matrix of zeros right away with np.zeros((len(sequences), dimension)), and then just go through the words in each review and flip the indicator to 1.0 at the position corresponding to that word:
result[review_id][word_id] = 1.0
So instead of doing 25000 x 10000 = 250 000 000 operations, we only do one per word occurrence, 5 967 841 in total. That's just ~2.4% of the original number of operations.
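To see that the zeros survive, here is a minimal sketch that runs the same function on the toy vocabulary above. The two toy sequences and dimension=9 come from the small-scale example, not from the IMDB data, and the names toy_sequences and toy_matrix are only for illustration.

```python
import numpy as np

def vectorize_sequences(sequences, dimension=10000):
    # Start from an all-zero matrix: one row per sequence, one column per word id
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        # Only the columns listed in `sequence` are set to 1; all others stay 0
        results[i, sequence] = 1.
    return results

# "I be happy" and "I eat a banana" encoded with the 9-word toy vocabulary
toy_sequences = [[0, 3, 5], [0, 4, 8, 7]]
toy_matrix = vectorize_sequences(toy_sequences, dimension=9)
print(toy_matrix)
```

Every column whose word id does not occur in a sentence is never touched by the loop, so it keeps the 0 it was initialised with.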

The for loop here does not visit every cell of the matrix. As you can see, it enumerates the elements of sequences, so it loops over only one dimension.
Let's take a simple example:
t = np.array([1,2,3,4,5,6,7,8,9])
r = np.zeros((len(t), 10))
Output:
array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
Then we modify elements the same way you do:
for i, s in enumerate(t):
    r[i, s] = 1.
array([[0., 1., 0., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 0., 0., 0., 0., 0., 1.]])
You can see that the for loop modified only len(t) elements, those at index [i, s] (in this case (0, 1), (1, 2), (2, 3), and so on).
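In the example above each s is a single integer, while in the original function each sequence is a whole list of word ids. A minimal sketch of what one such assignment does, using a made-up list word_ids (not taken from the dataset):

```python
import numpy as np

row = np.zeros(10)
word_ids = [1, 3, 3, 7]   # hypothetical review; the repeated id 3 is harmless
row[word_ids] = 1.0       # integer-array indexing: sets positions 1, 3 and 7 at once

touched = int(row.sum())          # number of cells flipped to 1
untouched = row.size - touched    # cells that keep their initial 0
print(row, touched, untouched)
```

Only the listed positions are written, which is why most of each row remains 0.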

import numpy as np
def vectorize_sequences(sequences, dimension=10000):
    results = np.zeros((len(sequences), dimension))
    for i, sequence in enumerate(sequences):
        results[i, sequence] = 1.
    return results

Related

Trying to compare different sized one-hot-encoded lists

I have run an autoencoder model on FashionMNIST and returned a dictionary with each output and its label. My goal is to print 10 images only for the dress and coat classes (class labels 3 and 4). I have one-hot-encoded the labels such that the dress class appears as [0.,0.,0.,1.,0.,0.,0.,0.,0.]. My dictionary output is:
print(pa)  # dictionary is called pa
{'output': array([[1.5346111e-04, 2.3307074e-04, 2.8705355e-04, ..., 1.9890528e-04,
1.8257453e-04, 2.0764180e-04],
[1.9767908e-03, 1.5839143e-03, 1.7811939e-03, ..., 1.7838757e-03,
1.4038634e-03, 2.3405524e-03],
[5.8998094e-06, 6.9388111e-06, 5.8752844e-06, ..., 5.1715115e-06,
4.4670110e-06, 1.2018012e-05],
...,
[2.1034568e-05, 3.0344427e-05, 7.0048365e-05, ..., 9.4724113e-05,
8.9003828e-05, 4.1828611e-05],
[2.7930623e-06, 3.0393956e-06, 4.5835086e-06, ..., 3.8765144e-04,
3.6324131e-05, 5.6411723e-06],
[1.2453397e-04, 1.1948447e-04, 2.0121646e-04, ..., 1.0773790e-03,
2.9582143e-04, 1.7229551e-04]], dtype=float32),
'label': array([[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 1., 0.],
[0., 0., 0., ..., 1., 0., 0.],
...,
[1., 0., 0., ..., 0., 0., 0.],
[0., 0., 1., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], dtype=float32)}
I am trying to run a for loop, where if the pa['label'] is equal to a certain one-hot-encoded array, I plot the corresponding pa['output'].
for i in range(len(pa['label'])):
    if pa['label'][i] == np.array([0.,0.,0.,1.,0.,0.,0.,0.,0.]):
        print(pa['label'][i])
        # plt.imshow(pa['output'][i].reshape(28,28))
        # plt.show()
However, I get a warning(?):
/opt/conda/lib/python3.7/site-packages/ipykernel_launcher.py:2: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
I have also tried making an array of the one-hot-encoded arrays I want to plot and comparing my dictionary labels to this array (the arrays have different sizes):
clothing_array = np.array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
                           [0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])
for i in range(len(pa['label'])):
    if (pa['label'][i] == clothing_array[i]).any():
        plt.imshow(pa['output'][i].reshape(28,28))
        plt.show()
However, it plots a picture of a T-shirt, then a bag, and then I get the error
IndexError: index 2 is out of bounds for axis 0 with size 2
which I understand, since clothing_array only has two rows. But obviously this code is wrong, since I want to plot ONLY dresses and coats. I don't know why it's plotting these images and I don't know how to fix it. Any help or clarifying questions are more than welcome.
Here are the first ten arrays of my dictionary labels:
array([[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 1., 0.],
[0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 1.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
[0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]], dtype=float32)

I will post an example here.
Here we have two arrays: x is the label array and y is the clothing array. z marks the positions where a row of x matches y, and matching_indexes holds the indexes of those rows. Finally, you can use matching_indexes to collect the rows you want from output and plot them.
x = np.array([[1., 0., 0., 0., 0., 0., 0.],
              [0., 1., 0., 0., 0., 0., 0.],
              [0., 0., 0., 0., 1., 0., 0.],
              [1., 0., 0., 0., 0., 0., 0.],
              [0., 0., 1., 0., 0., 0., 0.],
              [0., 0., 0., 1., 0., 0., 0.]])
y = np.array([[1., 0., 0., 0., 0., 0., 0.]])
z = np.multiply(x, y)
matching_indexes = np.where(z.any(axis=1))[0]
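As an alternative to multiplying one-hot rows, one could convert the labels back to class ids and filter on those. The sketch below swaps in argmax and np.isin for the multiplication above; it rebuilds the first ten labels shown in the question, and the names wanted_classes and class_ids are only illustrative.

```python
import numpy as np

# Class ids of the first ten one-hot labels listed in the question
first_ten_ids = [3, 8, 3, 9, 0, 4, 5, 5, 4, 0]
labels = np.eye(10, dtype=np.float32)[first_ten_ids]   # one-hot rows, shape (10, 10)

wanted_classes = [3, 4]                     # dress and coat
class_ids = labels.argmax(axis=1)           # position of the 1 in each row
matching_indexes = np.where(np.isin(class_ids, wanted_classes))[0]
print(matching_indexes)                     # [0 2 5 8]
```

The resulting indexes can then be used to pick rows from the output array, for example the first ten of them for plotting.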

How to replace the value of multiple cells in multiple rows in a Pytorch tensor?

I have a tensor
import torch
torch.zeros((5,10))
>>> tensor([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
How can I replace the values of X random cells in each row with random inputs (torch.rand())?
That is, if X = 2, in each row, 2 random cells should be replaced with torch.rand().
Since I need it to not break backpropagation I found here that replacing the .data attribute of the cells should work.
The only approach I'm familiar with is a for loop, but that's not efficient for a large tensor.

You can try tensor.scatter_().
x = torch.zeros(3,4)
n_replace = 3 # number of cells to be replaced with random number
src = torch.randn(x.size())
index = torch.stack([torch.randperm(x.size()[1]) for _ in range(x.size()[0])])[:,:n_replace]
x.scatter_(1, index, src)
Out[22]:
tensor([[ 0.0000, 0.5769, 0.7432, -0.1776],
[-2.1673, -1.0802, 0.0000, 0.6241],
[-0.6421, 0.1315, 0.0000, -2.7224]])
To sample without repetition (assuming tensor is the tensor to sample rows from and k is the number of rows to keep):
perm = torch.randperm(tensor.size(0))
idx = perm[:k]
samples = tensor[idx]

Python - 2D Array: Finding Coordinates

I have a python based battleship game that allows players to pass in a .csv containing a 10x10 grid of 0s and 1s, where 1s correspond to locations of ships. For example,
In [1]: board = np.genfromtxt(filename, delimiter=',')
Out[2]: array([
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 1., 0., 0., 0., 1., 0., 0., 0.],
[0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
[0., 0., 0., 1., 1., 1., 1., 0., 0., 0.],
[1., 0., 0., 0., 0., 1., 1., 1., 1., 1.],
[1., 0., 0., 0., 0., 0., 0., 0., 0., 0.]])
(5) unit ship located at board[8,5:10]
(4) unit ship located at board[7,3:7]
(3) unit ship located at board[3:6,2]
(3) unit ship located at board[3:6,6]
(2) unit ship located at board[8:10,0]
The game restricts players to 5 ships with sizes of (5, 4, 3, 3, 2). Given this information how can I come up with a method for determining the coordinates for each ship?
I'm currently applying a for loop that finds indexes where 1s are present, see below.
for ri, ci in zip(range(board.shape[0]), range(board.shape[1])):
    # get row indexes of ships
    rlocs = len(np.where(board[ri,:]==1)[0])
    # get col indexes of ships
    clocs = len(np.where(board[:,ci]==1)[0])
    # skip empty row and col
    if (rlocs == 0) and (clocs == 0): continue
    # check if consecutives <int>
    rcons = is_consecutive(rlocs)
    ccons = is_consecutive(clocs)
    # if more than one consecutive is found then assume more than 1 ship in row/col
    if (rcons > 1) or (ccons > 1):
        # .... ?
At this point I'm unsure what the next step would be... any help or advice is welcomed!
FYI: is_consecutive returns an int of the number of values in a list that are in a consecutive order. For example, [0,1,2,9,10] would return 2 (i.e., 0-2 and 9-10).
The output that I'm looking for is a dictionary that looks similar to this:
{'ship_05_01': [(x0,y0), (x1,y1)], 'ship_03_01': [(x0,y0), (x1,y1)], 'ship_03_02': [(x0,y0), (x1,y1)], ...}
where ship_xx_nn --> xx = number of spaces; nn = index
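The is_consecutive helper described above is not shown in the question. A minimal sketch consistent with that description, assuming it receives a sorted list of integer indexes and counts the runs of consecutive values (how a single isolated value should be counted is an assumption here; this version treats it as its own run):

```python
def is_consecutive(values):
    """Count runs of consecutive integers in a sorted list (hypothetical reconstruction)."""
    runs = 0
    previous = None
    for value in values:
        # A new run starts whenever the value does not directly follow the previous one
        if previous is None or value != previous + 1:
            runs += 1
        previous = value
    return runs

print(is_consecutive([0, 1, 2, 9, 10]))  # 2
```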

Add homogeneous coordinate (x0=1) to images in numpy

I have 7 images of size 29*29. I want to add one homogeneous coordinate (augment them with the feature x0=1) to all 7 images, but I am not sure how to do it.
My original image dimensions are
images.shape
#(7, 29, 29)
What I have tried is zipping with np.ones(), but it ends up creating a separate array for the first feature, resulting in shape (7, 2):
np.array([list(a) for a in zip(np.ones([7,1]),images_all[:,:])]).shape
#(7,2)
#
# [[array([1.]),
#   array([[0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#          [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
#           0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.],
#          ....
As you can see, it adds the 1 as a separate array and does not prepend it as the first element.
I also tried to loop through the images and insert 1 as the first element, but that makes the row length 30 and gives an error:
for i in range(len(images)):
    images[i][0] = np.insert(images[i][0], 0, 1., axis=0)
ValueError: could not broadcast input array from shape (30) into shape (29)

First create a larger array of ones, reshape the original array and update the larger array.
padded_images = np.ones((7,29*29+1))
padded_images[:,1:] = images.reshape(7,29*29)
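A self-contained sketch of the same idea, with the sizes read from the array instead of hard-coded. The random images array is only a stand-in with the shape from the question, and the name flat is illustrative.

```python
import numpy as np

# Stand-in data with the shape from the question: 7 images of 29 x 29 pixels
images = np.random.rand(7, 29, 29)

n_images = images.shape[0]
flat = images.reshape(n_images, -1)        # one row of 841 pixels per image

# Column 0 holds the constant feature x0 = 1, the remaining columns hold the pixels
padded_images = np.ones((n_images, flat.shape[1] + 1))
padded_images[:, 1:] = flat

print(padded_images.shape)  # (7, 842)
```

Note that this flattens each image; the extra coordinate is added per image, not per pixel row.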

Removing NaN rows from a three dimensional array

How can I remove the NaN rows from the array below using indices (since I will need to remove the same rows from a different array)?
array([[[nan, 0., 0., 0.],
[ 0., 0., 0., 0.],
[ 0., 0., 0., 0.]],
[[ 0., 0., 0., 0.],
[ 0., nan, 0., 0.],
[ 0., 0., 0., 0.]]])
I get the indices of the rows to be removed by using the command
a[np.isnan(a).any(axis=2)]
But using what I would normally use on a 2D array does not produce the desired result, losing the array structure.
a[~np.isnan(a).any(axis=2)]
array([[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.],
[0., 0., 0., 0.]])
How can I remove the rows I want using the indices obtained from my first command?

You need to reshape:
a[~np.isnan(a).any(axis=2)].reshape(a.shape[0], -1, a.shape[2])
But be aware that the number of NaN rows in each 2D subarray must be the same, otherwise the result cannot be reshaped into a new 3D array.
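Since the question also asks to drop the same rows from a different array, here is a minimal sketch that stores the boolean mask once and reuses it. The second array b is hypothetical, and the explicit check on the number of kept rows is an added safeguard, not part of the answer above.

```python
import numpy as np

a = np.zeros((2, 3, 4))
a[0, 0, 0] = np.nan
a[1, 1, 1] = np.nan
b = np.arange(24).reshape(2, 3, 4)   # hypothetical second array with the same shape

keep = ~np.isnan(a).any(axis=2)      # shape (2, 3): True for rows without NaN

# The reshape only works if every 2D subarray keeps the same number of rows
rows_kept = keep.sum(axis=1)
if not (rows_kept == rows_kept[0]).all():
    raise ValueError("subarrays keep different numbers of rows")

a_clean = a[keep].reshape(a.shape[0], -1, a.shape[2])
b_clean = b[keep].reshape(b.shape[0], -1, b.shape[2])
print(a_clean.shape, b_clean.shape)  # (2, 2, 4) (2, 2, 4)
```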

We Keep Coding
