One Hot Encoding in Tensorflow

I've been following the TensorFlow walkthrough here to create my own categorical one-hot encoding (OHE) layer. The suggested layer is below, and I've followed the preceding steps of the guide very closely:
def get_category_encoding_layer(name, dataset, dtype, max_tokens=None):
    # Create a StringLookup layer which will turn strings into integer indices
    if dtype == 'string':
        index = preprocessing.StringLookup(max_tokens=max_tokens)
    else:
        index = preprocessing.IntegerLookup(max_tokens=max_tokens)
    # Prepare a Dataset that only yields our feature
    feature_ds = dataset.map(lambda x, y: x[name])
    # Learn the set of possible values and assign them a fixed integer index.
    index.adapt(feature_ds)
    # Create a Discretization for our integer indices.
    encoder = preprocessing.CategoryEncoding(num_tokens=index.vocabulary_size())
    # Apply one-hot encoding to our indices. The lambda function captures the
    # layer so we can use them, or include them in the functional model later.
    return lambda feature: encoder(index(feature))
However, the output doesn't match the guide. When the input to the layer is a list of n strings, I expect an output of shape (n, vocabulary size), but I receive an output of shape (1, vocabulary size) with multiple categories incorrectly marked '1'.
e.g. using n=2 and vocabulary size=3
Instead of getting an OHE of [[1, 0, 0], [0, 1, 0]], I am getting [1, 1, 0].
My code is exactly the same as the guide's, but it looks like the layer is "merging" the encodings of the elements of my input. Is there something wrong with the layer they provided, or could someone give me a pointer on what I could test?

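By default, CategoryEncoding uses output_mode="multi_hot", which collapses all the indices of a sample into a single vector with a 1 for every category present. That's why you're getting output of size (1, vocab_size). To get an OHE of size (n, vocab_size), make this change in your code: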
encoder = preprocessing.CategoryEncoding(num_tokens=index.vocabulary_size(), output_mode='one_hot')
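To make the difference between the two modes concrete, here is a rough NumPy sketch of what each mode computes for the example in the question. It is not the Keras implementation: the helper names one_hot and multi_hot and the example indices [0, 1] are assumptions made up for illustration.

```python
import numpy as np

def one_hot(indices, num_tokens):
    """One row per input index: output shape (n, num_tokens)."""
    indices = np.asarray(indices)
    out = np.zeros((indices.shape[0], num_tokens), dtype=np.float32)
    out[np.arange(indices.shape[0]), indices] = 1.0
    return out

def multi_hot(indices, num_tokens):
    """One vector for the whole sample: output shape (num_tokens,)."""
    out = np.zeros(num_tokens, dtype=np.float32)
    out[np.asarray(indices)] = 1.0
    return out

# Hypothetical integer indices for two input strings (n=2, vocabulary size=3).
indices = [0, 1]
one_hot_result = one_hot(indices, num_tokens=3)      # rows [1, 0, 0] and [0, 1, 0]
multi_hot_result = multi_hot(indices, num_tokens=3)  # single vector [1, 1, 0]
print(one_hot_result)
print(multi_hot_result)
```

The multi-hot result is exactly the "merged" vector described in the question, which is why switching the mode restores one row per input element.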

Related

How to build an embedding layer in Tensorflow RNN?

I'm building an RNN LSTM network to classify texts based on the writers' age (binary classification - young / adult).
Seems like the network does not learn and suddenly starts overfitting:
Red: train
Blue: validation
One possibility could be that the data representation is not good enough. I just sorted the unique words by their frequency and gave them indices. E.g.:
unknown -> 0
the -> 1
a -> 2
. -> 3
to -> 4
So I'm trying to replace that with word embedding.
I saw a couple of examples but I'm not able to implement it in my code. Most of the examples look like this:
embedding = tf.Variable(tf.random_uniform([vocab_size, hidden_size], -1, 1))
inputs = tf.nn.embedding_lookup(embedding, input_data)
Does this mean we're building a layer that learns the embedding? I thought that one should download some Word2Vec or Glove and just use that.
Anyway let's say I want to build this embedding layer...
If I use these 2 lines in my code I get an error:
TypeError: Value passed to parameter 'indices' has DataType float32 not in list of allowed values: int32, int64
So I guess I have to change the input_data type to int32. So I do that (it's all indices after all), and I get this:
TypeError: inputs must be a sequence
I tried wrapping inputs (argument to tf.contrib.rnn.static_rnn) with a list: [inputs] as suggested in this answer, but that produced another error:
ValueError: Input size (dimension 0 of inputs) must be accessible via
shape inference, but saw value None.
Update:
I was unstacking the tensor x before passing it to embedding_lookup. I moved the unstacking after the embedding.
Updated code:
MIN_TOKENS = 10
MAX_TOKENS = 30
x = tf.placeholder("int32", [None, MAX_TOKENS, 1])
y = tf.placeholder("float", [None, N_CLASSES]) # 0.0 / 1.0
...
seqlen = tf.placeholder(tf.int32, [None]) #list of each sequence length*
embedding = tf.Variable(tf.random_uniform([VOCAB_SIZE, HIDDEN_SIZE], -1, 1))
inputs = tf.nn.embedding_lookup(embedding, x) #x is the text after converting to indices
inputs = tf.unstack(inputs, MAX_POST_LENGTH, 1)
outputs, states = tf.contrib.rnn.static_rnn(lstm_cell, inputs, dtype=tf.float32, sequence_length=seqlen) #---> Produces error
*seqlen: I zero-padded the sequences so all of them have the same list size, but since the actual size differ, I prepared a list describing the length without the padding.
New error:
ValueError: Input 0 of layer basic_lstm_cell_1 is incompatible with
the layer: expected ndim=2, found ndim=3. Full shape received: [None,
1, 64]
64 is the size of each hidden layer.
It's obvious that I have a problem with the dimensions... How can I make the inputs fit the network after embedding?

According to the tf.nn.static_rnn documentation, the inputs argument must be:
A length T list of inputs, each a Tensor of shape [batch_size, input_size]
So your code should be something like:
x = tf.placeholder("int32", [None, MAX_TOKENS])
...
inputs = tf.unstack(inputs, axis=1)
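The shape bookkeeping behind this fix can be checked with a small NumPy sketch. Here NumPy integer indexing stands in for tf.nn.embedding_lookup and a list comprehension stands in for tf.unstack(..., axis=1); the batch size of 4 and vocabulary size of 100 are arbitrary values chosen for illustration, not taken from the question.

```python
import numpy as np

# Illustrative sizes; only MAX_TOKENS=30 and HIDDEN_SIZE=64 come from the question.
BATCH, MAX_TOKENS, VOCAB_SIZE, HIDDEN_SIZE = 4, 30, 100, 64

rng = np.random.default_rng(0)
embedding = rng.uniform(-1, 1, size=(VOCAB_SIZE, HIDDEN_SIZE))

# Indices with a trailing dimension of 1, as in the question's placeholder.
x_3d = rng.integers(0, VOCAB_SIZE, size=(BATCH, MAX_TOKENS, 1))
looked_up_3d = embedding[x_3d]                               # (4, 30, 1, 64)
steps_3d = [looked_up_3d[:, t] for t in range(MAX_TOKENS)]   # each (4, 1, 64)

# Indices without the trailing dimension, as in the answer.
x_2d = x_3d[:, :, 0]
looked_up_2d = embedding[x_2d]                               # (4, 30, 64)
steps_2d = [looked_up_2d[:, t] for t in range(MAX_TOKENS)]   # each (4, 64)

print(steps_3d[0].shape, steps_2d[0].shape)
```

The extra axis of size 1 in the placeholder survives the lookup, so every time step ends up with ndim=3, which is what the LSTM cell rejects.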

tf.squeeze removes dimensions of size 1 from a tensor. If the end goal is an input of shape [None, 64], adding a line similar to inputs = tf.squeeze(inputs) would fix your problem. Passing the axis argument explicitly is safer, because without it a batch dimension that happens to be 1 is squeezed away as well.

TensorFlow, batchwise indexing (first dimension) and sorting

I've got a params tensor with shape (?,368,5), as well as a query tensor with shape (?,368). The query tensor stores indices for sorting the first tensor.
The required output has shape: (?,368,5). Since I need it for a loss function in a neural network, the used operations should stay differentiable. Also, at runtime the size of the first axis ? corresponds to the batchsize.
So far I experimented with tf.gather and tf.gather_nd, however
tf.gather(params,query) results in a tensor with shape (?,368,368,5).
The query tensor is obtained by performing:
query = tf.nn.top_k(params[:, :, 0], k=params.shape[1], sorted=True).indices
Overall, I am trying to sort the params tensor by the first element along the third axis (for a kind of chamfer distance). Finally, I am working with the Keras framework.

You need to add the indices of the first dimension to query in order to use it with tf.gather_nd. Here is a way to do it:
import tensorflow as tf
import numpy as np

np.random.seed(100)
with tf.Graph().as_default(), tf.Session() as sess:
    params = tf.placeholder(tf.float32, [None, 368, 5])
    query = tf.nn.top_k(params[:, :, 0], k=params.shape[1], sorted=True).indices
    n = tf.shape(params)[0]
    # Make tensor of indices for the first dimension
    ii = tf.tile(tf.range(n)[:, tf.newaxis], (1, params.shape[1]))
    # Stack indices
    idx = tf.stack([ii, query], axis=-1)
    # Gather reordered tensor
    result = tf.gather_nd(params, idx)
    # Test
    out = sess.run(result, feed_dict={params: np.random.rand(10, 368, 5)})
    # Check the order is correct
    print(np.all(np.diff(out[:, :, 0], axis=1) <= 0))
    # True
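The same indexing idea can be tried outside a TensorFlow session. The following is a NumPy sketch of the approach, not a replacement for the graph code above: np.argsort on the negated values stands in for tf.nn.top_k, and paired integer indexing stands in for tf.gather_nd. The random input is a stand-in for real data.

```python
import numpy as np

rng = np.random.default_rng(100)
params = rng.random((10, 368, 5))          # stand-in for a batch of 10

# Per-sample ordering by the first feature, largest first.
query = np.argsort(-params[:, :, 0], axis=1)                 # shape (10, 368)

# Pair every sorted index with its batch index, then gather.
ii = np.tile(np.arange(params.shape[0])[:, np.newaxis], (1, params.shape[1]))
result = params[ii, query]                                   # shape (10, 368, 5)

# NumPy-only shortcut: the index is broadcast over the last axis.
shortcut = np.take_along_axis(params, query[:, :, np.newaxis], axis=1)

print(result.shape)
print(np.all(np.diff(result[:, :, 0], axis=1) <= 0))
```

Both routes reorder whole rows of length 5, so the output keeps the shape of params while the first feature becomes non-increasing along the second axis.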

Tensorflow shape inference static RNN compiler error

I am working on OCR software optimized for phone camera images.
Currently, each 300 x 1000 x 3 (RGB) image is reformatted as a 900 x 1000 numpy array. I have plans for a more complex model architecture, but for now I just want to get a baseline working. I want to get started by training a static RNN on the data that I've generated.
Formally, I am feeding in n_t at each timestep t for T timesteps, where n_t is a 900-vector and T = 1000 (similar to reading the whole image left to right). Here is the Tensorflow code in which I create batches for training:
sequence_dataset = tf.data.Dataset.from_generator(
    example_generator, (tf.int32, tf.int32))
sequence_dataset = sequence_dataset.batch(experiment_params['batch_size'])
iterator = sequence_dataset.make_initializable_iterator()
x_batch, y_batch = iterator.get_next()
The tf.nn.static_bidirectional_rnn documentation claims that the input must be a "length T list of inputs, each a tensor of shape [batch_size, input_size], or a nested tuple of such elements." So, I go through the following steps in order to get the data into the correct format.
# Dimensions go from [batch, n , t] -> [t, batch, n]
x_batch = tf.transpose(x_batch, [2, 0, 1])
# Unpack such that x_batch is a length T list with element dims [batch_size, n]
x_batch = tf.unstack(x_batch, experiment_params['example_t'], 0)
Without altering the batch any further, I make the following call:
output, _, _ = tf.nn.static_rnn(lstm_fw_cell, x_batch, dtype=tf.int32)
Note that I do not explicitly tell TensorFlow the dimensions of the matrices (this could be the problem). They all have the same dimensionality, yet I am getting the following error:
ValueError: Input size (dimension 0 of inputs) must be accessible via shape
inference, but saw value None.
At which point in my stack should I be declaring the dimensions of my input? Because I am using a Dataset and hoping to get its batches directly to the RNN, I am not sure that the "placeholder -> feed_dict" route makes sense. If that in fact is the method that makes the most sense, let me know what that looks like (I definitely do not know). Otherwise, let me know if you have any other insights to the problem. Thanks!

The reason for the absence of static shape information is that TensorFlow doesn't understand enough about the example_generator function to determine the shapes of the arrays it yields, and so it assumes the shapes can be completely different from one element to the next. The best way to constrain this is to specify the optional output_shapes argument to tf.data.Dataset.from_generator(), which accepts a nested structure of shapes matching the structure of the yielded elements (and the output_types argument).
In this case you'd pass a tuple of two shapes, which can be partially specified. For example, if the x elements are 900 x 1000 arrays and the y elements are scalars:
sequence_dataset = tf.data.Dataset.from_generator(
    example_generator, (tf.int32, tf.int32),
    output_shapes=([900, 1000], []))
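Once each x element is known to be 900 x 1000, the transpose and unstack steps from the question produce a list of 1000 tensors of shape [batch_size, 900]. A NumPy sketch of that shape bookkeeping follows, assuming an arbitrary batch size of 2 and using np.transpose and plain iteration as stand-ins for tf.transpose and tf.unstack.

```python
import numpy as np

BATCH, N, T = 2, 900, 1000                 # BATCH is an arbitrary illustrative value
x_batch = np.zeros((BATCH, N, T), dtype=np.int32)

# [batch, n, t] -> [t, batch, n], mirroring tf.transpose(x_batch, [2, 0, 1])
x_tbn = np.transpose(x_batch, (2, 0, 1))

# Split along the first axis, mirroring tf.unstack(x_batch, T, 0)
steps = list(x_tbn)

print(len(steps), steps[0].shape)
```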

(Keras) Apply pad_sequences for deeper levels // Variable label length

I have label data of shape (2000, 2, x), where x varies between 100 and 250 across the 2000 samples and the 2 corresponds to the x and y coordinates. To my understanding, fitting my model as in the code below would only pad along the coordinate dimension.
model.fit(
    x=train_data,
    y=keras.preprocessing.sequence.pad_sequences(train_labels, maxlen=250),
    epochs=EPOCHS,
    batch_size=BATCH_SIZE)
So how can I bring all of these labels to the same length since that seems necessary in order to use them to train the model?

I imagine the labels are going to be a somewhat sparse matrix of shape (2000, 2, 250) once you account for padding, right? And for each example you're trying to predict a 2D matrix of shape (2, 250)?
In any case, the padding you currently have will only affect the coordinate dimension.
A hack to get padding on the last dimension would be to permute the axes of the data, add the padding, and then permute back to the original shape:
# sequence refers to keras.preprocessing.sequence
perm_y = np.moveaxis(y, 1, 2)
padded_perm_y = sequence.pad_sequences(perm_y, maxlen=250, padding='post',
                                       truncating='post')
padded_y = np.moveaxis(padded_perm_y, 2, 1)

It turned out that np.pad works here (while np.moveaxis followed by Keras sequence padding didn't). So I'm iterating over my input twice: once to get the maximum length, and a second time to apply np.pad and write the result into a new array of shape (training_samples, coordinates, maximum_sequence_length).
While I don't know whether padding distorts the output of the CNN-LSTM, I'm glad that the above error no longer arises.
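A minimal sketch of that two-pass np.pad approach, assuming each label is an array of shape (2, length) and that zero-padding at the end is acceptable. The toy train_labels list below is made up for illustration.

```python
import numpy as np

# Hypothetical ragged labels: each sample is (2, length) with its own length.
train_labels = [np.ones((2, 3)), np.ones((2, 5)), np.ones((2, 4))]

# First pass: find the longest sequence.
max_len = max(label.shape[1] for label in train_labels)

# Second pass: pad only the last axis, at the end, with zeros.
padded = np.stack([
    np.pad(label, ((0, 0), (0, max_len - label.shape[1])), mode="constant")
    for label in train_labels
])

print(padded.shape)
```

The pad widths ((0, 0), (0, k)) leave the coordinate axis untouched and append k zeros to the sequence axis, so the result has shape (training_samples, coordinates, maximum_sequence_length).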

To pad sequences with deeper nesting (lists of lists of lists, ...), you can build a ragged tensor and convert it to a dense tensor or array. For example:
import tensorflow as tf
padded_y = tf.ragged.constant(train_labels).to_tensor(0.)
This pads with 0.

Use batch_size in model_fn in skflow

I need to create a random variable inside my model_fn(), having shape [batch_size, 20].
I do not want to pass batch_size as an argument, because then I cannot use a different batch size for prediction.
Removing the parts which do not concern this question, my model_fn() is:
def model(inp, out):
    # batch_size is the value I do not want to hardcode
    eps = tf.random_normal([batch_size, 20], 0, 1, name="eps")
    # dummy example
    predictions = tf.add(inp, eps)
    return predictions, 1
If I replace [batch_size, 20] by inp.get_shape(), I get
ValueError: Cannot convert a partially known TensorShape to a Tensor: (?, 20)
when running myclf.setup_training().
If I try
def model(inp, out):
    batch_size = tf.placeholder("float", [])
    eps = tf.random_normal([batch_size.eval(), 20], 0, 1, name="eps")
    # dummy example
    predictions = tf.add(inp, eps)
    return predictions, 1
I get ValueError: Cannot evaluate tensor using eval(): No default session is registered. Use with sess.as_default() or pass an explicit session to eval(session=sess) (understandably, because I have not provided a feed_dict).
How can I access the value of batch_size inside model_fn(), while remaining able to change it during prediction?

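I wasn't aware of the difference between Tensor.get_shape() and tf.shape(Tensor). The former returns the static shape known at graph construction time, which may be only partially defined, while the latter is a tensor evaluated at run time. The latter works:
eps = tf.random_normal(tf.shape(inp), 0, 1, name="eps")
As mentioned in the TensorFlow 0.8 FAQ: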
How do I build a graph that works with variable batch sizes?
It is often useful to build a graph that works with variable batch sizes, for example so that the same code can be used for (mini-)batch training, and single-instance inference. The resulting graph can be saved as a protocol buffer and imported into another program.
When building a variable-size graph, the most important thing to remember is not to encode the batch size as a Python constant, but instead to use a symbolic Tensor to represent it. The following tips may be useful:
Use batch_size = tf.shape(input)[0] to extract the batch dimension from a Tensor called input, and store it in a Tensor called batch_size.
Use tf.reduce_mean() instead of tf.reduce_sum(...) / batch_size.
If you use placeholders for feeding input, you can specify a variable batch dimension by creating the placeholder with tf.placeholder(..., shape=[None, ...]). The None element of the shape corresponds to a variable-sized dimension.
