I tried to write my own estimator model_fn() for a GCP ML Engine package. I decoded a sequence of outputs using embedding_rnn_decoder as shown below:
outputs, state = tf.contrib.legacy_seq2seq.embedding_rnn_decoder(
    decoder_inputs=decoder_inputs,
    initial_state=curr_layer,
    cell=tf.contrib.rnn.GRUCell(hidden_units),
    num_symbols=n_classes,
    embedding_size=embedding_dims,
    feed_previous=False)
I know that outputs is "A list of the same length as decoder_inputs of 2D Tensors", but I am wondering how I can use this list to calculate the loss for the entire sequence.
I know that if I grab outputs[0] (i.e. only the first output in the sequence), then I could compute a loss as follows:
logits = tf.layers.dense(
    outputs[0],
    n_classes)
loss = tf.reduce_mean(
    tf.nn.sparse_softmax_cross_entropy_with_logits(
        logits=logits, labels=labels))
Is it appropriate to generate a loss value for each of the items in outputs and then pass these all to tf.reduce_mean? This feels inefficient, especially for long sequences -- are there other ways to calculate the softmax at each step of the sequence that would be more efficient?
I think you're looking for sequence_loss (currently in contrib/, a kind of incubation area).
It looks like the solution to my problem is to use sequence_loss_by_example
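For reference, here is a minimal sketch of how that could look. It assumes labels is a list (of the same length as decoder_inputs) of 1D integer tensors, and shares a single projection layer across timesteps; adjust to match your actual label format:
projection = tf.layers.Dense(n_classes)  # shared output projection across timesteps
logits_list = [projection(step_output) for step_output in outputs]
# weight every timestep equally; apply a padding mask here if needed
weights = [tf.ones_like(lbl, dtype=tf.float32) for lbl in labels]

# averages the per-step cross-entropy over timesteps and batch in one call
loss = tf.contrib.legacy_seq2seq.sequence_loss(logits_list, labels, weights)

# or keep a per-example loss and reduce it yourself
per_example_loss = tf.contrib.legacy_seq2seq.sequence_loss_by_example(
    logits_list, labels, weights)
loss = tf.reduce_mean(per_example_loss)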
My data set is a 3D array of the size (M,t,N) where M is the number of samples, t is the number of timesteps in a sequence and N is the number of possible events that can happen at time t. By selecting a specific M we have a 2D array of size (t,N) where each row is a timestep and each column is an event. Each column is set to 1 if that event happened at time t, otherwise it's set to 0. Only 1 event can happen at any given timestep.
I want to try and build an auto-encoder for anomaly detection, and in the tutorials and blogs I have read, the last activation layer is 'relu' and the loss function is 'mse'. But since I am trying to basically reconstruct a classification with N classes, would 'softmax' as the last layer and 'categorical_crossentropy' be better?
inputs = Input(shape = (timesteps,n_features))
# Encoder
lstm_enc_1 = LSTM(32, activation='relu', input_shape=(timesteps, n_features), return_sequences=True)(inputs)
lstm_enc_2 = LSTM(latent_dim, activation='relu', return_sequences=False)(lstm_enc_1)
repeater = RepeatVector(timesteps)
# Decoder
lstm_dec_1 = LSTM(latent_dim, activation='relu', return_sequences=True)
lstm_dec_2 = LSTM(32, activation='relu', return_sequences=True)
time_dis = TimeDistributed(Dense(n_features,activation='softmax')) #<-- Does this make sense here?
z = repeater(lstm_enc_2)
h = lstm_dec_1(z)
decoded_h = lstm_dec_2(h)
decoded = time_dis(decoded_h)
ae = Model(inputs,decoded)
ae.compile(loss='categorical_crossentropy', optimizer='adam') #<-- Does this make sense here?
Or should I, for some reason, still use 'relu' and 'mse' as the last activation function and loss function?
Any input is appreciated.
If I read it correctly, N is one-hot encoded, and it sounds like you want to do classification, not regression.
Since your y is one-hot encoded, using categorical_crossentropy is correct.
If you have more than 4 classes in y, you may instead use integer encodings together with sparse_categorical_crossentropy, which handles the integer-to-one-hot conversion of your y values for you under the hood.
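For illustration, a minimal sketch of what that would look like with the model above, where X and y_int are hypothetical arrays and y_int holds integer class indices of shape (samples, timesteps) instead of one-hot vectors:
# hypothetical integer-encoded targets instead of one-hot targets;
# depending on the Keras version, 3D outputs may need targets with a
# trailing axis of size 1, i.e. shape (samples, timesteps, 1)
ae.compile(loss='sparse_categorical_crossentropy', optimizer='adam')
ae.fit(X, y_int[..., None], epochs=10, batch_size=32)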
mse is better suited to regression.
As the last activation, since you have a classification task, you may want to use softmax, which outputs a probability for each of your y classes.
As far as I know, you normally do not use relu in the last layer; if you have a regression task, sigmoid is generally preferred.
I want to use the BERT model to do multi-label classification with Tensorflow.
To do so, I want to adapt the example run_classifier.py from the BERT GitHub repository, which is an example of how to use BERT for simple classification using the pre-trained weights provided by Google Research (for example, BERT-Base, Cased).
I have X different labels, each with a value of either 0 or 1, so I want to add a new dense layer of size X on top of the original BERT model and use the sigmoid_cross_entropy_with_logits loss function.
So, for the theoretical part I think I am OK.
The problem is that I don't know how I can append a new output layer and retrain only this new layer with my dataset, using the existing BertModel class.
Here is the original create_model() function from run_classifier.py where I guess I have to do my modifications. But I am a bit lost on what to do.
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels, use_one_hot_embeddings):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids,
      use_one_hot_embeddings=use_one_hot_embeddings)

  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    one_hot_labels = tf.one_hot(labels, depth=num_labels, dtype=tf.float32)

    per_example_loss = -tf.reduce_sum(one_hot_labels * log_probs, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
And here is the same function with some of my modifications, but where things are missing (and possibly wrong, too):
def create_model(bert_config, is_training, input_ids, input_mask, segment_ids,
                 labels, num_labels):
  """Creates a classification model."""
  model = modeling.BertModel(
      config=bert_config,
      is_training=is_training,
      input_ids=input_ids,
      input_mask=input_mask,
      token_type_ids=segment_ids)

  output_layer = model.get_pooled_output()
  hidden_size = output_layer.shape[-1].value

  output_weights = tf.get_variable(
      "output_weights", [num_labels, hidden_size],
      initializer=tf.truncated_normal_initializer(stddev=0.02))

  output_bias = tf.get_variable(
      "output_bias", [num_labels], initializer=tf.zeros_initializer())

  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)
    probabilities = tf.nn.softmax(logits, axis=-1)
    log_probs = tf.nn.log_softmax(logits, axis=-1)

    per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=labels, logits=logits)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)
The other things I have adapted in the code, and for which I had no problems:
DataProcessor to load and parse my custom dataset
Changing the type of the labels variable from numerical values to arrays everywhere it is used
So, if anyone knows what I should do to resolve my problem, or even point out some obvious mistake I may have done, I would be glad to hear it.
Notes :
I found this article that corresponds pretty well to what I am trying to do, but it uses PyTorch and I cannot translate it into Tensorflow.
You want to replace the softmax, which models a single distribution over the possible outputs (all scores sum up to one), with a sigmoid, which models an independent distribution for each class (there is a yes/no distribution for each output).
So, you correctly change the loss function, but you also need to change how you compute the probabilities. It should be:
probabilities = tf.sigmoid(logits)
In this case, you don't need the log_probs.
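Putting that together, a minimal sketch of the modified loss block could look like the following, assuming labels is a tensor of shape [batch_size, num_labels] with 0/1 entries:
  with tf.variable_scope("loss"):
    if is_training:
      # I.e., 0.1 dropout
      output_layer = tf.nn.dropout(output_layer, keep_prob=0.9)

    logits = tf.matmul(output_layer, output_weights, transpose_b=True)
    logits = tf.nn.bias_add(logits, output_bias)

    # independent per-class probabilities instead of a softmax distribution
    probabilities = tf.sigmoid(logits)

    per_example_loss = tf.nn.sigmoid_cross_entropy_with_logits(
        labels=tf.cast(labels, tf.float32), logits=logits)
    # sum over the classes, then average over the batch
    per_example_loss = tf.reduce_sum(per_example_loss, axis=-1)
    loss = tf.reduce_mean(per_example_loss)

    return (loss, per_example_loss, logits, probabilities)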
I am trying to implement an autoencoder in Keras that not only minimizes the reconstruction error but whose learned features also maximize a measure I define. I don't really have an idea of how to do this.
Here's a snippet of what I have so far:
corrupt_data = self._corrupt(self.data, 0.1)

# define encoder-decoder network structure
# create input layer
input_layer = Input(shape=(corrupt_data.shape[1], ))
encoded = Dense(self.encoding_dim, activation="relu")(input_layer)
decoded = Dense(self.data.shape[1], activation="sigmoid")(encoded)

# create autoencoder
dae = Model(input_layer, decoded)

# define custom multitask loss with wlm measure
def multitask_loss(y_true, y_pred):
    # extract learned features from hidden layer
    learned_fea = Model(input_layer, encoded).predict(self.data)
    # additional measure I want to optimize from an external function
    wlm_measure = wlm.measure(learned_fea, self.labels)
    cross_entropy = losses.binary_crossentropy(y_true, y_pred)
    return wlm_measure + cross_entropy

# create optimizer
dae.compile(optimizer=self.optimizer, loss=multitask_loss)

dae.fit(corrupt_data, self.data,
        epochs=self.epochs, batch_size=20, shuffle=True,
        callbacks=[tensorboard])

# separately create an encoder model
encoder = Model(input_layer, encoded)
Currently this does not work properly. Looking at the training history, the model seems to ignore the additional measure and train only on the cross-entropy loss. Also, if I change the loss function to consider only the wlm measure, I get the error "'numpy.float64' object has no attribute 'get_shape'" (I don't know if changing my wlm function's return type to a tensor would help).
There are a few places where I think things may have gone wrong. I don't know if I am extracting the outputs of the hidden layer correctly in my custom loss function. Also, I don't know whether my wlm.measure function is outputting the right thing: should it return a numpy.float32 or a 1-dimensional tensor of type float32?
Basically a conventional loss function only cares about the output layer's predicted labels and the true labels. In my case, I also need to consider the hidden layer's output (activation), which is not that straightforward to implement in Keras.
Thanks for the help!
You don't want to define your learned_fea Model inside your custom loss function. Rather, you could define a single model upfront with two outputs: the output of the decoder (the reconstruction) and the output of the encoder (the feature representation):
multi_output_model = Model(inputs=input_layer, outputs=[decoded, encoded])
Now you can write a custom loss function that only applies to the output of the encoder:
def custom_loss(y_true, y_pred):
return wlm.measure(y_pred, y_true)
Upon compiling the model, you pass a list of loss functions (or a dictionary if you name your tensors):
model.compile(loss=['binary_crossentropy', custom_loss], optimizer=...)
And fit the model by passing a list of outputs:
model.fit(x=X, y=[data_to_be_reconstructed, labels_for_wlm_measure])
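Putting the pieces together, a minimal end-to-end sketch under these assumptions (hypothetical names such as n_features and encoding_dim; note that wlm.measure must be written with Keras/TensorFlow ops so that it returns a tensor, not a numpy float):
from keras.layers import Input, Dense
from keras.models import Model

input_layer = Input(shape=(n_features,))
encoded = Dense(encoding_dim, activation='relu')(input_layer)
decoded = Dense(n_features, activation='sigmoid')(encoded)

# one model, two outputs: the reconstruction and the feature representation
multi_output_model = Model(inputs=input_layer, outputs=[decoded, encoded])

def custom_loss(y_true, y_pred):
    return wlm.measure(y_pred, y_true)

# loss_weights lets you balance the reconstruction term against the wlm term
multi_output_model.compile(optimizer='adam',
                           loss=['binary_crossentropy', custom_loss],
                           loss_weights=[1.0, 1.0])

multi_output_model.fit(x=corrupt_data,
                       y=[data_to_be_reconstructed, labels_for_wlm_measure],
                       epochs=10, batch_size=20)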
Although not new to machine learning, I am still relatively new to neural networks, more specifically to how to implement them (in Keras/Python). Feedforward and convolutional architectures are fairly straightforward, but I am having trouble with RNNs.
My X data consists of variable-length sequences, with each data point in a sequence having 26 features. My y data is also of variable length, but each X/y pair has the same length, e.g.:
X_train[0].shape: (226,26)
y_train[0].shape: (226,)
X_train[1].shape: (314,26)
y_train[1].shape: (314,)
X_train[2].shape: (189,26)
y_train[2].shape: (189,)
And my objective is to classify each item in the sequence into one of 39 categories.
What I can gather thus far from reading example code is that we do something like the following:
encoder_inputs = Input(shape=(None, 26))
encoder = GRU(256, return_state=True)
encoder_outputs, state_h = encoder(encoder_inputs)
decoder_inputs = Input(shape=(None, 39))
decoder_gru= GRU(256, return_sequences=True)
decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
decoder_dense = Dense(39, activation='softmax')
decoder_outputs = decoder_dense(decoder_outputs)
model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(loss=keras.losses.categorical_crossentropy,
              optimizer=keras.optimizers.Adadelta(),
              metrics=['accuracy'])
Which makes sense to me, because each of the sequences has a different length.
So with a for loop that loops over all sequences, we use None in the input shape of the first GRU layer because we are unsure what the sequence length will be, and then return the hidden state state_h of that encoder. With the second GRU layer returning sequences, and the initial state being the state returned from the encoder, we then pass the outputs to a final softmax activation layer.
Obviously something is flawed here because I get:
    decoder_outputs, _ = decoder_gru(decoder_inputs, initial_state=state_h)
  File "/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/ops.py", line 458, in __iter__
    "Tensor objects are only iterable when eager execution is "
TypeError: Tensor objects are only iterable when eager execution is enabled. To iterate over this tensor use tf.map_fn.
This link points to a proposed solution, but I don't understand why you would add encoder states to a tuple for as many layers as you have in the network.
I'm really looking for help in being able to successfully write this RNN to do this task, but also in understanding it. I am very interested in RNNs and want to understand them more in depth so I can apply them to other problems.
As an extra note, each sequence is of shape (sequence_length, 26), but I expand the dimension to be (1, sequence_length, 26) for X and (1, sequence_length) for y, and then pass them in a for loop to be fit, with the decoder_target_data one step ahead of the current input:
for idx in range(X_train.shape[0]):
    X_train_s = np.expand_dims(X_train[idx], axis=0)
    y_train_s = np.expand_dims(y_train[idx], axis=0)
    y_train_s1 = np.expand_dims(y_train[idx+1], axis=0)

    encoder_input_data = X_train_s
    decoder_input_data = y_train_s
    decoder_target_data = y_train_s1

    model.fit([encoder_input_data, decoder_input_data], decoder_target_data,
              epochs=50,
              validation_split=0.2)
With other networks I have written (feedforward and CNN), I specify the model by adding layers on top of Keras's Sequential class. Because of the inherent complexity of RNNs, I see the general format of using Keras's Input class like above and retrieving hidden states (and cell states for LSTM), etc., as logical, but I have also seen RNNs built using Keras's Sequential class. Although those were many-to-one tasks, I would be interested in how you would write it that way too.
The problem is that the decoder_gru layer does not return its state, so you should not use _ as a return value for the state (i.e. just remove , _):
decoder_outputs = decoder_gru(decoder_inputs, initial_state=state_h)
Since the input and output lengths are the same and there is a one-to-one mapping between the elements of the input and output, you can alternatively construct the model this way:
inputs = Input(shape=(None, 26))
gru = GRU(64, return_sequences=True)(inputs)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
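Since your y holds an integer class index per timestep, one possible way to compile and fit this model is sketched below (training one sequence at a time because the lengths differ; epochs, batch handling, etc. are illustrative):
from keras.utils import to_categorical
import numpy as np

model.compile(loss='categorical_crossentropy', optimizer='adam',
              metrics=['accuracy'])

for idx in range(len(X_train)):
    X_s = np.expand_dims(X_train[idx], axis=0)                 # (1, t, 26)
    y_s = np.expand_dims(to_categorical(y_train[idx], 39), 0)  # (1, t, 39)
    model.fit(X_s, y_s, epochs=1, verbose=0)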
Now you can make this model more complex (i.e. increase its capacity) by stacking multiple GRU layers on top of each other:
inputs = Input(shape=(None, 26))
gru = GRU(256, return_sequences=True)(inputs)
gru = GRU(128, return_sequences=True)(gru)
gru = GRU(64, return_sequences=True)(gru)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)
Further, instead of GRU layers you can use LSTM layers, which have more representational capacity (of course, this may come at a higher computational cost). And don't forget that when you increase the capacity of the model, you also increase the chance of overfitting, so you must keep that in mind and consider solutions that prevent overfitting (e.g. adding regularization).
Side note: if you have a GPU available, you can use the CuDNNGRU (or CuDNNLSTM) layer instead, which has been optimized for GPUs and runs much faster than GRU.
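For example, a hypothetical GPU version of the stacked model above (note that CuDNNGRU does not take an activation argument and requires a GPU with cuDNN available):
from keras.layers import Input, Dense, CuDNNGRU
from keras.models import Model

inputs = Input(shape=(None, 26))
gru = CuDNNGRU(256, return_sequences=True)(inputs)
gru = CuDNNGRU(128, return_sequences=True)(gru)
gru = CuDNNGRU(64, return_sequences=True)(gru)
outputs = Dense(39, activation='softmax')(gru)
model = Model(inputs, outputs)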
Why is the pred variable calculated before any of the training iterations occur? I would expect pred to be generated (through the RNN() function) on each pass through the data, for every iteration.
There must be something I am missing. Is pred something like a function object? I have looked at the docs for tf.matmul() and that returns a tensor, not a function.
Full source: https://github.com/aymericdamien/TensorFlow-Examples/blob/master/examples/3_NeuralNetworks/recurrent_network.py
Here is the code:
def RNN(x, weights, biases):
    # Prepare data shape to match `rnn` function requirements
    # Current data input shape: (batch_size, n_steps, n_input)
    # Required shape: 'n_steps' tensors list of shape (batch_size, n_input)

    # Unstack to get a list of 'n_steps' tensors of shape (batch_size, n_input)
    x = tf.unstack(x, n_steps, 1)

    # Define a lstm cell with tensorflow
    lstm_cell = rnn.BasicLSTMCell(n_hidden, forget_bias=1.0)

    # Get lstm cell output
    outputs, states = rnn.static_rnn(lstm_cell, x, dtype=tf.float32)

    # Linear activation, using rnn inner loop last output
    return tf.matmul(outputs[-1], weights['out']) + biases['out']

pred = RNN(x, weights, biases)

# Define loss and optimizer
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(logits=pred, labels=y))
optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate).minimize(cost)

# Evaluate model
correct_pred = tf.equal(tf.argmax(pred, 1), tf.argmax(y, 1))
accuracy = tf.reduce_mean(tf.cast(correct_pred, tf.float32))

# Initializing the variables
init = tf.global_variables_initializer()
Tensorflow code has two distinct phases. First, you build a "dependency graph", which contains all of the operations that you will use. Note that during this phase you are not processing any data. Instead, you are simply defining the operations you want to occur. Tensorflow is taking note of the dependencies between the operations.
For example, in order to compute the accuracy, you'll need to first compute correct_pred, and to compute correct_pred you'll need to first compute pred, and so on.
So all you have done in the code shown is to tell tensorflow what operations you want. You've saved those in a "graph" data structure (that's a tensorflow data structure that basically is a bucket that contains all the mathematical operations and tensors).
Later you will run operations on the data using calls to sess.run([ops], feed_dict={inputs}).
When you call sess.run notice that you have to tell it what you want from the graph. If you ask for accuracy:
sess.run(accuracy, feed_dict={inputs})
Tensorflow will try to compute accuracy. It will see that accuracy depends on correct_pred, so it will try to compute that, and so on through the dependency graph that you defined.
The error you're making is that you think pred in the code you listed is computing something. It's not. The line:
pred = RNN(x, weights, biases)
only defined the operation and its dependencies.
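A minimal sketch of that second, execution phase, using the placeholder and tensor names from the linked example (mnist, x, y, batch_size, etc. are assumed to be defined as in that script):
with tf.Session() as sess:
    sess.run(init)
    for step in range(training_iters):
        batch_x, batch_y = mnist.train.next_batch(batch_size)
        batch_x = batch_x.reshape((batch_size, n_steps, n_input))
        # only here is pred actually evaluated, as a dependency of cost and accuracy
        _, loss_val, acc = sess.run([optimizer, cost, accuracy],
                                    feed_dict={x: batch_x, y: batch_y})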