How to handle very sparse vectors in Tensorflow - python

What is the best way to handle sparse vectors of size (about) 30,000, where all indices are zero except one index with value one (a 1-HOT vector)?
In my dataset I have a sequence of values, and I convert each value to a 1-HOT vector. Here is what I currently do:
# Create some queues to read data from .csv files
...
# Parse example(/line) from the data file
example = tf.decode_csv(value, record_defaults=record_defaults)
# example now looks like (e.g) [[5], [1], [4], [38], [571], [9]]
# [5] indicates the length of the sequence
# 1, 4, 38, 571 is the input sequence
# 4, 38, 571, 9 is the target sequence
# Create 1-HOT vectors for each value in the sequence
sequence_length = example[0]
one_hots = example[1:]
one_hots = tf.reshape(one_hots, [-1])
one_hots = tf.one_hot(one_hots, depth=n_classes)
# Grab the first values as the input features and the last values as target
features = one_hots[:-1]
targets = one_hots[1:]
...
# The sequence_length, features and targets are added to a list
# and the list is sent into a batch with tf.train_batch_join(...).
# So now I can get batches and feed into my RNN
...
This works, but I am convinced that it could be done in a more efficient way. I looked at SparseTensor, but I could not figure out how to create SparseTensors from the example tensor I get from tf.decode_csv. And I read somewhere that it is best to parse the data after it is retrieved as a batch; is this still true?
Here is a pastebin of the full code; from line 32 on is my current way of creating 1-HOT vectors.

Instead of converting your inputs to sparse 1-hot vectors, it is preferable to use tf.nn.embedding_lookup, which simply selects the relevant rows of the matrix you would otherwise multiply by. This is equivalent to multiplying the matrix by the 1-hot vector.
Here is a usage example:
import numpy as np
import tensorflow as tf

embed_dim = 3
vocab_size = 10

# A random embedding matrix: one row per vocabulary entry
E = np.random.rand(vocab_size, embed_dim)
print(E)

embeddings = tf.Variable(E)
# A batch of integer ids to look up
examples = tf.Variable(np.array([4, 5, 2, 9]).astype('int32'))
examples_embedded = tf.nn.embedding_lookup(embeddings, examples)

s = tf.InteractiveSession()
s.run(tf.initialize_all_variables())
print(examples_embedded.eval())
Also see this example in the im2txt project for how to feed this kind of data into an RNN (the line reading seq_embeddings = tf.nn.embedding_lookup(embedding_map, self.input_seqs)).
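As a rough sketch of that pattern, using the same TF 1.x-era API as above (the names ids and embedding_map and the layer sizes are illustrative, not taken from im2txt):
import tensorflow as tf

vocab_size = 30000
embed_dim = 128

# ids: a batch of integer sequences, shape [batch_size, max_time]
ids = tf.placeholder(tf.int32, [None, None])
embedding_map = tf.get_variable("embedding_map", [vocab_size, embed_dim])
# seq_embeddings: shape [batch_size, max_time, embed_dim]
seq_embeddings = tf.nn.embedding_lookup(embedding_map, ids)
cell = tf.nn.rnn_cell.LSTMCell(num_units=256, state_is_tuple=True)
outputs, state = tf.nn.dynamic_rnn(cell, seq_embeddings, dtype=tf.float32)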

Related

How to convert a large string data array from file to np.array with float data type

My task is to feed a potentially large set of elements into a neural network for training. I am trying to use tf.data.experimental.CsvDataset and tf.data.experimental.make_csv_dataset, but I keep getting stuck.
My dataset is a text file containing strings with numbers separated by ';'. This is how it looks:
14;14;14;55;55;20;20...33;34;34
20;20;20;15;15;15;26...10;10;10
....
10;10;10;30;30;35;35...23;23;23
Each line contains 2500 numbers separated by semicolons. I tried to use this code:
dataset = tf.data.experimental.CsvDataset(pathAsk,
                                          record_defaults=[tf.float32],
                                          field_delim=";",
                                          na_value='NA')
for element in dataset.as_numpy_iterator():
    print(element)
But I get an error saying there are more elements in the row than I specified in record_defaults. I also tried this:
dataset = tf.data.experimental.make_csv_dataset(pathAsk, batch_size=2, field_delim=';')
iterator = dataset.as_numpy_iterator()
print(dict(next(iterator)))
But I get error:
Cannot have duplicate column names.
My goal is to use this dataset as the input of a neural network built along these lines:
inputs = keras.Input(shape=(2500), name="ask")
x = keras.layers.Embedding(1000, 64)  # note: this layer is created but never connected
x = keras.layers.Dense(64, activation=keras.activations.relu)(inputs)
x = keras.layers.Dense(32, activation=keras.activations.relu)(x)
outputs = keras.layers.Dense(6, activation=keras.activations.relu)(x)
model = keras.Model(inputs, outputs)
model.compile("adam", "binary_crossentropy", metrics=["accuracy"])
It looks like I found the answer to my question: I specified one default value per column in the record_defaults property.
datasetAsk = tf.data.experimental.CsvDataset(
    pathAsk,
    record_defaults=[tf.constant(0, dtype=tf.float32)] * 2500,
    header=False,
    field_delim=";",
)
Now the number of defaults matches the number of fields per line, and iteration runs without errors.
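As a side note on the make_csv_dataset error: it most likely appears because the first data line is treated as a header by default, and its repeated values become duplicate column names; passing header=False together with unique column_names should avoid it. To actually feed this dataset into the model from the question, the 2500 per-column scalars that CsvDataset yields for each line also need to be packed into a single vector. A rough sketch (the stacking and batch size are assumptions, not part of the original answer):
# Sketch: pack the 2500 per-column scalars into one (2500,) vector per line
packed = datasetAsk.map(lambda *fields: tf.stack(fields))
# Batch for training; labels would still need to be zipped in, e.g. with
# tf.data.Dataset.zip((packed.batch(32), labels.batch(32)))
batched = packed.batch(32)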

Dealing with batch size and time step in 1D CNN

I have a batch generator which gives me data in the shape of (500, 1, 12) (i.e. corresponding to (batch size, time steps, features)).
def batch_generator(batch_size, gen_x, gen_y):
    batch_features = np.zeros((batch_size, 1, 12))
    batch_labels = np.zeros((batch_size, 9))
    while True:
        for i in range(batch_size):
            batch_features[i] = next(gen_x)
            batch_labels[i] = next(gen_y)
        yield batch_features, batch_labels
def generate_X():
    while True:
        with open("/my_path/my_data.csv") as f:
            for line in f:
                currentline = line.rstrip('\n').split(",")
                currentline = np.asarray(currentline)
                currentline = currentline.reshape(1, 1, 12)
                yield currentline
def generate_y():
    while True:
        for i in range(len(y_train)):
            y = y_train[i]
            yield y
I then try to feed this into a 1D-CNN:
model = Sequential()
model.add(Conv1D(filters=100, kernel_size=1, activation='relu', input_shape=(1,12), data_format="channels_last"))
But now I am not able to use a kernel size of more than 1 (i.e. I am stuck with kernel_size=1). This is probably because my time-step dimension is equal to 1.
How can I use the whole batch size as input to the 1D-CNN and increase the kernel_size?
Keep in mind that 1D convolution is used when each input sample is a sequence, i.e. data in which the order of values is important, like stock market values over a week, temperature values over a month, or a sequence of genomes or words. With that said, considering your data, there are three different scenarios:
If each line in your csv file is a sequence of length 12, then you are dealing with samples of shape (12, 1), i.e. each sample has 12 timesteps where each timestep has only one feature. So you should reshape it accordingly (i.e. to (12, 1), not (1, 12)); a sketch follows this list.
However, if each line is not a sequence by itself, but a group of consecutive lines forms a sequence, then you must generate your data accordingly: each sample would consist of multiple consecutive lines, e.g. if we take the number of timesteps to be 10, then lines #1 to #10 would be one sample, lines #2 to #11 another sample, and so on. In this case each sample has shape (number_of_timesteps, 12) (in the example above, (10, 12)). You can create and generate these samples with a custom function, or alternatively load all of the data as a numpy array and let TimeseriesGenerator do it for you (also sketched below).
If neither of the two cases above applies, then it's very likely that your data is not sequential at all, and using a 1D CNN (or any other sequence-processing model like an RNN) does not make sense for it. Instead, you should use other suitable architectures.
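A rough sketch of the first scenario, reshaping each line to (12, 1) so that kernel_size can exceed 1 (the layer sizes and random data are illustrative only):
import numpy as np
from keras.models import Sequential
from keras.layers import Conv1D, Flatten, Dense

# 500 samples, each a 12-step sequence with one feature per step
X = np.random.random((500, 12, 1))
y = np.random.random((500, 9))

model = Sequential()
model.add(Conv1D(filters=100, kernel_size=3, activation='relu',
                 input_shape=(12, 1)))  # kernel_size > 1 now works
model.add(Flatten())
model.add(Dense(9, activation='softmax'))
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(X, y, epochs=1)
And for the second scenario, TimeseriesGenerator can build the sliding windows (again a sketch; the window length and batch size are assumptions):
from keras.preprocessing.sequence import TimeseriesGenerator

data = np.random.random((500, 12))     # all csv lines loaded as one array
targets = np.random.random((500, 9))
gen = TimeseriesGenerator(data, targets, length=10, batch_size=32)
# each input batch has shape (32, 10, 12)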

BERT sentence embedding by summing last 4 layers

I used Chris McCormick's tutorial on BERT with pytorch-pretrained-bert to get a sentence embedding, as follows:
tokenized_text = tokenizer.tokenize(marked_text)
indexed_tokens = tokenizer.convert_tokens_to_ids(tokenized_text)
segments_ids = [1] * len(tokenized_text)
tokens_tensor = torch.tensor([indexed_tokens])
segments_tensors = torch.tensor([segments_ids])
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()
with torch.no_grad():
    encoded_layers, _ = model(tokens_tensor, segments_tensors)
batch_i = 0  # only one sentence in the batch
# Holds the list of 12 layer embeddings for each token
# Will have the shape: [# tokens, # layers, # features]
token_embeddings = []
# For each token in the sentence...
for token_i in range(len(tokenized_text)):
    # Holds 12 layers of hidden states for each token
    hidden_layers = []
    # For each of the 12 layers...
    for layer_i in range(len(encoded_layers)):
        # Lookup the vector for `token_i` in `layer_i`
        vec = encoded_layers[layer_i][batch_i][token_i]
        hidden_layers.append(vec)
    token_embeddings.append(hidden_layers)
Now, I am trying to get the final sentence embedding by summing the last 4 layers as follows:
summed_last_4_layers = [torch.sum(torch.stack(layer)[-4:], 0) for layer in token_embeddings]
But instead of getting a single torch vector of length 768 I get the following:
[tensor([-3.8930e+00, -3.2564e+00, -3.0373e-01, 2.6618e+00, 5.7803e-01,
-1.0007e+00, -2.3180e+00, 1.4215e+00, 2.6551e-01, -1.8784e+00,
-1.5268e+00, 3.6681e+00, ...., 3.9084e+00]), tensor([-2.0884e+00, -3.6244e-01, ....2.5715e+00]), tensor([ 1.0816e+00,...-4.7801e+00]), tensor([ 1.2713e+00,.... 1.0275e+00]), tensor([-6.6105e+00,..., -2.9349e-01])]
What did I get here? How do I pool the sum of the last four layers?
Thank you!
You create a list using a list comprehension that iterates over token_embeddings. It is a list that contains one tensor per token, not one tensor per layer as you probably thought (judging from your for layer in token_embeddings). You thus get a list with a length equal to the number of tokens. For each token, you have a vector that is the sum of the BERT embeddings from the last 4 layers.
It would be more efficient to avoid the explicit for-loops and list comprehensions altogether:
summed_last_4_layers = torch.stack(encoded_layers[-4:]).sum(0)
Now, variable summed_last_4_layers contains the same data, but in the form of a single tensor of dimension: length of the sentence × 768.
To get a single (i.e., pooled) vector, you can pool over the first dimension of the tensor. Max-pooling or average-pooling might make much more sense here than summing all the token embeddings: when summing, vectors of sentences of different lengths lie in different ranges and are not really comparable.
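For instance, a sketch building on summed_last_4_layers above (the squeeze(0) drops the batch dimension of size 1, assuming a single sentence per batch):
# Pool over the token dimension to obtain one 768-d sentence vector
token_vectors = summed_last_4_layers.squeeze(0)  # shape: (seq_len, 768)
sentence_mean = token_vectors.mean(dim=0)        # average-pooling, shape: (768,)
sentence_max = token_vectors.max(dim=0).values   # max-pooling, shape: (768,)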

How to reshape input for keras LSTM?

I have a numpy array of some 5000 rows and 4 columns (temp, pressure, speed, cost), so its shape is (5000, 4). Each row is an observation at a regular interval. This is the first time I'm doing time-series prediction, and I'm stuck on the input shape. I'm trying to predict a value 1 timestep beyond the last data point. How do I reshape it into the 3D form a keras LSTM model expects?
Also, it would be much more helpful if a small sample program were included. There doesn't seem to be any example/tutorial where the input has more than one feature (and that is also not NLP).
The first question you should ask yourself is:
What is the timescale in which the input features encode relevant information for the value you want to predict?
Let's call this timescale prediction_context.
You can now create your dataset:
import numpy as np

recording_length = 5000
n_features = 4
prediction_context = 10  # Change here

# The data you already have
X_data = np.random.random((recording_length, n_features))
to_predict = np.random.random((recording_length, 1))

# Make lists of training examples
X_in = []
Y_out = []

# Append examples to the lists (input and expected output)
for i in range(recording_length - prediction_context):
    X_in.append(X_data[i:i+prediction_context, :])
    Y_out.append(to_predict[i+prediction_context])

# Convert them to numpy arrays
X_train = np.array(X_in)
Y_train = np.array(Y_out)
At the end:
X_train.shape = (recording_length - prediction_context, prediction_context, n_features)
So you will need to make a trade-off between the length of your prediction context and the number of examples you will have to train your network. A small sample model is sketched below.
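Since the question asked for a small sample program, here is a minimal sketch of a model consuming X_train and Y_train (the layer sizes and training settings are arbitrary assumptions):
from keras.models import Sequential
from keras.layers import LSTM, Dense

model = Sequential()
# input_shape = (timesteps, features) = (prediction_context, n_features)
model.add(LSTM(32, input_shape=(prediction_context, n_features)))
model.add(Dense(1))  # predict the single value one step ahead
model.compile(optimizer='adam', loss='mse')
model.fit(X_train, Y_train, epochs=10, batch_size=32)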

Retrieving last value of LSTM sequence in Tensorflow

I have sequences of different lengths that I want to classify using LSTMs in Tensorflow. For the classification I just need the LSTM output of the last timestep of each sequence.
max_length = 10
n_dims = 2
layer_units = 5
input = tf.placeholder(tf.float32, [None, max_length, n_dims])
lengths = tf.placeholder(tf.int32, [None])
cell = tf.nn.rnn_cell.LSTMCell(num_units=layer_units, state_is_tuple=True)
sequence_outputs, last_states = tf.nn.dynamic_rnn(cell, sequence_length=lengths, inputs=input)
I would like to get, in NumPy notation: output = sequence_outputs[:,lengths]
Is there any way or workaround to get this behaviour in Tensorflow?
---UPDATE---
Following this post, How to select rows from a 3-D Tensor in TensorFlow?, it seems that it is possible to solve the problem efficiently with tf.gather by manipulating the indices. The only requirement is that the batch size must be known in advance. Here is the adaptation of the referenced post to this concrete problem:
max_length = 10
n_dims = 2
layer_units = 5
batch_size = 2
input = tf.placeholder(tf.float32, [batch_size, max_length, n_dims])
lengths = tf.placeholder(tf.int32, [batch_size])
cell = tf.nn.rnn_cell.LSTMCell(num_units=layer_units, state_is_tuple=True)
sequence_outputs, last_states = tf.nn.dynamic_rnn(cell, sequence_length=lengths,
                                                  inputs=input)
# Code adapted from @mrry's response on StackOverflow:
# https://stackoverflow.com/questions/36088277/how-to-select-rows-from-a-3-d-tensor-in-tensorflow
rows_per_batch = tf.shape(input)[1]
indices_per_batch = 1
# Offset to add to each row in indices. We use `tf.expand_dims()` to make
# this broadcast appropriately.
offset = tf.range(0, batch_size) * rows_per_batch
# Convert indices and logits into appropriate form for `tf.gather()`.
flattened_indices = lengths - 1 + offset
flattened_sequence_outputs = tf.reshape(sequence_outputs,
                                        tf.concat(0, [[-1], tf.shape(sequence_outputs)[2:]]))
selected_rows = tf.gather(flattened_sequence_outputs, flattened_indices)
last_output = tf.reshape(selected_rows,
                         tf.concat(0, [tf.pack([batch_size, indices_per_batch]),
                                       tf.shape(sequence_outputs)[2:]]))
@petrux's option (Get the last output of a dynamic_rnn in TensorFlow) also seems to work, but building a list within a for loop may be less optimized, although I did not perform any benchmark to support this statement.
This could be an answer. I don't think there is anything similar to the NumPy notation you pointed out, but the effect is the same.
Here's a solution, using gather_nd, where batch size does not need to be known ahead of time.
def extract_axis_1(data, ind):
    """
    Get specified elements along the first axis of tensor.
    :param data: TensorFlow tensor that will be subsetted.
    :param ind: Indices to take (one for each element along axis 0 of data).
    :return: Subsetted tensor.
    """
    batch_range = tf.range(tf.shape(data)[0])
    indices = tf.stack([batch_range, ind], axis=1)
    res = tf.gather_nd(data, indices)
    return res
output = extract_axis_1(sequence_outputs, lengths - 1)
Now output is a tensor of dimension [batch_size, num_cells].
