Simple autoencoder keeps predicting a constant tensor in Keras - python

I'm new to Keras and deep learning. I want to build a dense vector for each document in my data, so I built a simple autoencoder with the Keras library.
The input data are embedded with Word2vec using an embedding size of 200, and all features lie between -1 and 1. I prepared a 3D tensor that contains 137 samples (the number of documents) with 469 columns (the maximum number of words); the third dimension is the embedding size. I used the mse loss function and a GRU as the recurrent layer. The autoencoder predicts the same vector for every document, while the loss starts at a very low value and becomes constant after a few epochs.
I tried different numbers of epochs but got the same result. I also tried changing the batch size, with no change. Can anyone help me find the problem, please?
from keras.layers import Input, GRU, Dense, RepeatVector, TimeDistributed
from keras.models import Model
input = Input(shape=(469,200))
encoder = GRU(120,activation='sigmoid',dropout=0.2)(input)
neck = Dense(20)(encoder)
decoder1 = RepeatVector(469)(neck)
decoder1 = GRU(120,return_sequences=True,activation='sigmoid',dropout=0.2)(decoder1)
decoder1 = TimeDistributed(Dense(200,activation='tanh'))(decoder1)
model = Model(inputs=input, outputs=decoder1)
model.compile(optimizer='adam', loss='mse')
history = model.fit(x_train, x_train, validation_data=(x_test, x_test), epochs=10, batch_size=8)
This is the input data "x_train":
print(model.predict(x_train)) returns these values (the same vector every time):
Why does model.predict(x_train) return the same vector for all 137 samples?
Thank you in advance.
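One way to confirm that the outputs really are identical, rather than just similar in the printed digits, is to measure how much the predictions vary across samples. The sketch below assumes the predictions are available as a NumPy array shaped (samples, timesteps, features); the arrays `collapsed` and `varied` are hypothetical stand-ins for `model.predict(x_train)`, with small sizes to keep it quick.

```python
import numpy as np

def prediction_spread(preds):
    """Largest (max - min) over the sample axis, taken across all output elements."""
    return float((preds.max(axis=0) - preds.min(axis=0)).max())

rng = np.random.default_rng(0)

# Hypothetical stand-ins for model.predict(x_train): 137 samples, 5 timesteps, 4 features
collapsed = np.tile(rng.uniform(-1, 1, size=(1, 5, 4)), (137, 1, 1))  # same output for every sample
varied = rng.uniform(-1, 1, size=(137, 5, 4))                         # outputs differ per sample

print(prediction_spread(collapsed))  # exactly zero when every sample gets the same output
print(prediction_spread(varied))
```

A spread of zero (or very close to it) means the decoder is ignoring the bottleneck and emitting one fixed sequence, which is consistent with a loss that flattens out early.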


LSTM with Attention getting weights?? Classifying documents based on sentence embedding

I'm really stuck building a NN for text classification with Keras using an LSTM and adding an attention layer on top. I'm sure I'm pretty close, but I'm confused:
Do I have to add a TimeDistributed dense layer after the LSTM?
And how do I retrieve the attention weights from my network (for visualization purposes), so that I know which sentence was 'responsible' for the document being classified as good or bad?
Say I have 10 documents consisting of 100 sentences each, and each sentence is represented as a 500-element vector. So my document matrix containing the sentence sequences looks like: X = np.array(Matrix).reshape(10, 100, 500)
The documents should be classified by sentiment, 1=good; 0=bad, so
y= [1,0,0,1,1]
yy= np.array(y)
I don't need an embedding layer because each sentence of each document is already a sparse vector.
The attention layer is taken from:
review_input = Input(shape=(MAX_SENTS, MAX_SENT_LENGTH))
l_lstm_sent = LSTM(100, activation='tanh', return_sequences=True)(review_input)
l_att_sent = AttLayer(100)(l_lstm_sent)
preds = Dense(1, activation='softmax')(l_att_sent)
model = Model(review_input, preds)
model.compile(..., metrics=['acc'])
model.fit(X, yy, nb_epoch=10, batch_size=50)
So I think my model is set up correctly, but I'm not quite sure. How do I get the attention weights from it (e.g. so I know which sentence caused a classification as 1)? Help is much appreciated.
1. Time distributed
In this case, you don't have to wrap Dense into TimeDistributed, although it may be a little bit faster if you do, especially if you can provide a mask that masks out a large part of the LSTM output.
However, Dense operates in the last dimension no matter what the shape before the last dimension is.
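The claim that a dense projection acts on the last dimension can be checked with plain NumPy. This is only a sketch: `W` and `b` are hypothetical stand-ins for the kernel and bias of a `Dense(8)` layer, and the input shape follows the (10, 100, 500) example from the question.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(10, 100, 500))   # (documents, sentences, features)
W = rng.normal(size=(500, 8))         # hypothetical kernel of a Dense(8) layer
b = np.zeros(8)                       # hypothetical bias

# Applying the kernel to the whole 3D tensor at once ...
all_at_once = x @ W + b               # shape (10, 100, 8)

# ... gives the same numbers as applying it to each timestep separately,
# which is what wrapping the layer in TimeDistributed expresses.
per_step = np.stack([x[:, t, :] @ W + b for t in range(x.shape[1])], axis=1)

print(all_at_once.shape)
print(np.allclose(all_at_once, per_step))
```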
2. Attention weights
Yes, it is as you suggest in the comment. You need to modify the AttLayer so that it returns both its output and the attention weights:
return output, ait
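Then create a model that outputs both the prediction and the attention weight tensors, and get the predictions from it:
l_att_sent, att_weights_sent = AttLayer(100)(l_lstm_sent)
preds = Dense(1, activation='sigmoid')(l_att_sent)  # sigmoid: a softmax over a single unit always outputs 1
attmodel = Model(review_input, [preds, att_weights_sent])
predictions, att_weights = attmodel.predict(X)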
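To make it concrete what "returning the output and the weights" means, here is a NumPy sketch of a generic additive attention pooling step. It is an assumption that the AttLayer in question works along these lines; the parameter names `W`, `b` and `u` are hypothetical and not taken from the original layer.

```python
import numpy as np

def attention_pool(h, W, b, u):
    """Return (context, weights) for hidden states h of shape (batch, steps, dim)."""
    scores = np.tanh(h @ W + b) @ u                       # (batch, steps)
    scores = scores - scores.max(axis=1, keepdims=True)   # for numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=1, keepdims=True)  # softmax over the steps
    context = (h * weights[..., None]).sum(axis=1)        # weighted sum: (batch, dim)
    return context, weights

rng = np.random.default_rng(0)
h = rng.normal(size=(10, 100, 100))   # e.g. LSTM outputs: 10 documents, 100 sentences, 100 units
W = rng.normal(size=(100, 100)) * 0.1
b = np.zeros(100)
u = rng.normal(size=100)

context, weights = attention_pool(h, W, b, u)
print(context.shape, weights.shape)
print(weights[0].argmax())   # index of the most heavily weighted sentence in document 0
```

Each row of `weights` sums to 1 over the sentences, so it can be plotted directly as a per-sentence importance profile for that document.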

Neural network returns classes instead of probabilities

I have 70k samples of dimension 25000 that I'm trying to feed into a neural network to classify them into 20 different classes. The system runs out of memory before I can do anything, so I came up with this idea: I divide the set of features into 5 subsets of 5000 features each, then feed each of the 5 subsets into its own neural network. The final prediction is the average of the 5 predictions.
Here is how I separated the data:
Then I feed each of them into a neural network:
model1 = Sequential()
model1.add(layers.Dense(300, activation = "relu", input_shape=(5000,)))
# Hidden - Layers
model1.add(layers.Dropout(0.4, noise_shape=None, seed=None))
model1.add(layers.Dense(20, activation = "softmax"))
model1.compile(..., metrics=['accuracy'])
model1.fit(np.array(vectorized_training1), np.array(y_train_neralnettr),
           validation_data=(np.array(vectorized_validation1), np.array(y_validation_neralnet)))
predict1= model1.predict(np.array(vectorized_validation1))
I have this same code for model2 neural network trained on feature2 dataset and so on.
And in the end, I took the average of the predictions.
Here is my question: the prediction from the neural network gives a unit (one-hot style) vector and not a probability, so averaging the predictions returns a vector built from 1s and 0s. How can I change this so that I actually get the probability of each of the 20 classes for each prediction?
Is my method a reasonable thing to try?
Can you give me some references?
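For reference, the output of a softmax layer is already a probability distribution over the classes, so averaging the outputs of several models gives another valid distribution. The NumPy sketch below illustrates this; the random arrays are hypothetical stand-ins for predict1 through predict5, and the sample count is kept small.

```python
import numpy as np

def softmax(z):
    """Row-wise softmax."""
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

rng = np.random.default_rng(0)
n_models, n_samples, n_classes = 5, 4, 20

# Hypothetical stand-ins for predict1 ... predict5 (one softmax output per model)
predictions = [softmax(rng.normal(size=(n_samples, n_classes))) for _ in range(n_models)]

avg_proba = np.mean(predictions, axis=0)      # still one distribution per sample
predicted_class = avg_proba.argmax(axis=1)    # hard labels, only if they are needed

print(avg_proba.shape)
print(avg_proba.sum(axis=1))
```

If the real predictions look like exact 0s and 1s, it may be worth checking whether the network is extremely confident or whether the values were rounded or converted to class labels somewhere before averaging.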

Keras: Share a layer of weights across Training Examples (Not between layers)

The problem is the following. I have a categorical prediction task with a vocabulary size of 25K. On one of them (input vocab 10K, output dim, i.e. embedding, 50), I want to introduce a trainable weight matrix for a matrix multiplication between the input embedding (shape (1, 50)) and the weights (shape (50, 128)), with no bias. The resulting score vector is an input for a prediction task along with other features.
The crux is that I think the trainable weight matrix varies for each input if I simply add it in. I want this weight matrix to be common across all inputs.
I should clarify: by input here I mean training examples. So all examples would learn some example-specific embedding and be multiplied by a shared weight matrix.
After every so many epochs, I intend to do a batch update to learn these common weights (or use other target variables to do multiple-output prediction).
LSTM? Is that something I should look into here?
With the exception of an Embedding layer, layers apply to all examples in the batch.
Take as an example a very simple network:
inp = Input(shape=(4,))
h1 = Dense(2, activation='relu', use_bias=False)(inp)
out = Dense(1)(h1)
model = Model(inp, out)
This is a simple network with 1 input layer, 1 hidden layer and an output layer. Take the hidden layer as an example: it has a weight matrix of shape (4, 2). At each iteration the input data, a matrix of shape (batch_size, 4), is multiplied by the hidden layer weights (feed-forward phase). Thus the same weights produce the h1 activations for every sample in the batch. The loss is also computed per batch. The output has shape (batch_size, 1). Since every sample in the batch passes through the same weights in the forward phase, every sample also contributes to backprop and the gradient updates.
When one is dealing with text, the problem is often specified as predicting a specific label from a sequence of words. This is modelled as a shape of (batch_size, sequence_length, word_index). Let's take a very basic example:
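The forward pass of that hidden layer can be written out in NumPy to show that one weight matrix serves every example in the batch. This is a sketch with random values; `W` is a hypothetical stand-in for the (4, 2) kernel of h1.

```python
import numpy as np

rng = np.random.default_rng(0)
batch = rng.normal(size=(3, 4))   # 3 training examples, 4 features each
W = rng.normal(size=(4, 2))       # the single hidden-layer kernel (no bias)

h1 = np.maximum(batch @ W, 0.0)   # ReLU activation, shape (3, 2)

# Each row of h1 is that one example multiplied by the very same W
row_by_row = np.stack([np.maximum(x @ W, 0.0) for x in batch])

print(h1.shape)
print(np.allclose(h1, row_by_row))
```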
from tensorflow import keras
from tensorflow.keras.layers import Input, Embedding, Reshape, Dense, LSTM
from tensorflow.keras.models import Model

sequence_length = 80
emb_vec_size = 100
vocab_size = 10_000

def make_model():
    inp = Input(shape=(sequence_length, 1))
    emb = Embedding(vocab_size, emb_vec_size)(inp)
    emb = Reshape((sequence_length, emb_vec_size))(emb)
    h1 = Dense(64)(emb)
    recurrent = LSTM(32)(h1)
    output = Dense(1)(recurrent)
    model = Model(inp, output)
    model.compile('adam', 'mse')
    return model

model = make_model()
You can copy and paste this into colab and see the summary.
What this example does:
1. Transforms a sequence of word indices into a sequence of word embedding vectors.
2. Applies a Dense layer called h1 to every example in the batch (and every element in the sequence); this layer reduces the dimension of the embedding vector. It is not a typical element of a text-processing network (in isolation), but it seemed to match your question.
3. Uses a recurrent layer to reduce the sequence to a single vector per example.
4. Predicts a single label from the "sentence" vector.
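The first two steps can be traced in NumPy to see the shapes involved. This is a sketch under assumptions: `emb_table` and `W_dense` are hypothetical stand-ins for the Embedding and Dense weights, the Dense bias is omitted, and the recurrent step is left out.

```python
import numpy as np

rng = np.random.default_rng(0)
batch_size, sequence_length = 2, 80
vocab_size, emb_vec_size = 10_000, 100

word_ids = rng.integers(0, vocab_size, size=(batch_size, sequence_length))
emb_table = rng.normal(size=(vocab_size, emb_vec_size))  # one trainable row per word
W_dense = rng.normal(size=(emb_vec_size, 64))            # shared by every position and example

emb = emb_table[word_ids]   # embedding lookup: (2, 80, 100)
h1 = emb @ W_dense          # dense projection on the last axis: (2, 80, 64)

print(emb.shape, h1.shape)
```

The lookup is the only place where different examples touch different parameters (different rows of the table); the projection matrix is the same for all of them.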
If I understand the problem correctly, you can reuse layers or even models inside another model.
Here is an example with a Dense layer. Let's say you have 10 inputs:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# defining 10 inputs in a List with (X,) shape
inputs = [Input(shape=(X,), name='input_{}'.format(k)) for k in range(10)]
# defining a common Dense layer
D = Dense(64, name='one_layer_to_rule_them_all')
nets = [D(inp) for inp in inputs]
model = Model(inputs = inputs, outputs = nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
This code is not going to work if the inputs have different shapes. The first call to D defines its properties. In this example, outputs are set directly to nets. But of course you can concatenate, stack, or whatever you want.
Now if you have some trainable model you can use it instead of the D:
import tensorflow as tf
from tensorflow.keras.layers import Input, Dense
from tensorflow.keras.models import Model
# defining 10 inputs in a List with (X,) shape
inputs = [Input(shape=(X,), name='input_{}'.format(k)) for k in range(10)]
# defining a shared model with the same weights for all inputs
nets = [special_model(inp) for inp in inputs]
model = Model(inputs = inputs, outputs = nets)
model.compile(optimizer='adam', loss='categorical_crossentropy')
The weights of this model are shared among all inputs.
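What sharing means for training can be illustrated with a toy NumPy example: when one kernel is applied to several inputs, the gradient with respect to that kernel is the sum of the contributions from every use. The loss below is a hypothetical toy loss chosen only because its gradient is easy to write down; the input size of 6 is also arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
inputs = [rng.normal(size=(1, 6)) for _ in range(10)]  # 10 inputs of identical shape
W = rng.normal(size=(6, 3))                            # the one shared kernel

def loss(W):
    # Toy loss: half the squared norm of every output, summed over the inputs
    return sum(0.5 * np.sum((x @ W) ** 2) for x in inputs)

# Analytic gradient: the contributions from all 10 uses of W add up
grad = sum(x.T @ (x @ W) for x in inputs)

# Central finite-difference check on a single entry of W
eps = 1e-6
W_plus, W_minus = W.copy(), W.copy()
W_plus[0, 0] += eps
W_minus[0, 0] -= eps
numeric = (loss(W_plus) - loss(W_minus)) / (2 * eps)

print(grad.shape)
print(np.isclose(grad[0, 0], numeric, rtol=1e-4))
```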

RNN fails to fit a linear trend (Keras BPTT issue?)

I am trying to train a simple LSTM to fit a line. My hypothesis is that I should be able to fit a linearly decreasing trend with zero input since the LSTM can decide how much it listens to its input vs. internal state, and can thus learn to just operate on the internal state. Basically a degenerate case for testing whether the LSTM can fit an expected result with zero input.
I create my input and target data:
import numpy as np
seq_len = 1000
x_train = np.zeros((1, seq_len, 1)) # [batch_size, seq_len, num_feat]
target = np.linspace(100, 0, num=seq_len).reshape(1, -1, 1)
I create a pretty simple network:
from keras.models import Model
from keras.layers import LSTM, Dense, Input, TimeDistributed
x_in = Input((seq_len, 1))
seq1 = LSTM(8, return_sequences=True)(x_in)
dense1 = TimeDistributed(Dense(8))(seq1)
seq2 = LSTM(8, return_sequences=True)(dense1)
dense2 = TimeDistributed(Dense(8))(seq2)
out = TimeDistributed(Dense(1))(dense2)
model = Model(inputs=x_in, outputs=out)
model.compile(optimizer='adam', loss='mean_squared_error')
history = model.fit(x_train, target, batch_size=1, epochs=1000)
I also created a custom callback that calls model.predict(x_train) after every epoch and adds the results to an array so I can see how my model's output is evolving over time. Basically the model just learns to predict a constant value which gradually (asymptotically) approaches the mean of my target line (target line is in red, not sure why the legend didn't show):
So basically nothing is driving my response to fit the actual line, I'm just gradually approaching the mean of the line. I suspect I am not getting any gradient with respect to time (data index), just an average gradient over time. But I would have thought LSTM losses would automagically give you gradient through time.
I've tried:
different activation functions for the LSTM layers (None, 'relu' for both the regular activation and recurrent activation)
different optimizers ('nadam', 'adadelta', 'rmsprop')
the 'mean_absolute_error' loss function, which I didn't expect to improve the results, and it acted about the same
passing sequences of random numbers drawn from a normal distribution as input
replacing LSTM with GRU
Nothing seems to do it.
Anybody have a suggestion as to how I can force this thing to train on the gradient as a function of my sequence index, i.e. g(t)? Or any other suggestions on how I can get this to work?
Note: with the trend as shown, if the LSTM outputs a constant value at exactly the mean (50), the minimum mean absolute error will be about 25 and the minimum mean squared error will be about 835.0. So if we don't see anything better than that, we probably aren't fitting the line, just the mean.
Just some references in case you run this yourself.
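The baseline for a constant prediction can be recomputed directly. This sketch assumes the same target as in the question (a line from 100 down to 0 over 1000 steps) and a model that outputs the mean at every step.

```python
import numpy as np

seq_len = 1000
target = np.linspace(100, 0, num=seq_len)
constant = np.full_like(target, target.mean())   # a model stuck at the mean of the line

mae = np.mean(np.abs(target - constant))
mse = np.mean((target - constant) ** 2)

print(round(float(mae), 2), round(float(mse), 1))
```

Any training run whose loss plateaus near these values is most likely predicting a constant rather than following the trend.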

Tensorflow RNN for classification with single output

I want to create an RNN in TensorFlow that classifies short texts by analyzing them on a per-letter basis. For that I created a 2D numpy array in which each piece of text is either padded or truncated and each element is a character code. The output is a vector of classes represented as a one-hot encoded 2D numpy array.
Here is an example:
train_x.shape, train_y.shape
((91845, 50), (91845, 5))
The input consists of 90K rows of 50 chars each; the output is 90K rows with 5 classes. Next, I want to build the network shown in the figure below.
The structure looks trivial, but I definitely lack knowledge of TensorFlow and run into all kinds of problems just trying to get training to work. Here is the part of the code I use to build the network:
chars = sequence_categorical_column_with_identity('chars', params['domain_size']+1)
chars_emb = tf.feature_column.embedding_column(chars, dimension=10)
columns = [chars_emb]
input_layer, sequence_length = sequence_input_layer(features, columns)
hidden_units = 32
lstm = tf.nn.rnn_cell.LSTMCell(hidden_units, state_is_tuple=True)
rnn_outputs, state = tf.nn.dynamic_rnn(lstm,
                                       inputs=input_layer,
                                       sequence_length=sequence_length,
                                       dtype=tf.float32)
output = rnn_outputs[:,-1,:]
logits = tf.layers.dense(output, params['n_classes'], activation=tf.nn.tanh)
# apply projection to every timestep.
# Compute predictions.
predicted_classes = tf.nn.softmax(logits)
# Compute loss.
loss = tf.nn.softmax_cross_entropy_with_logits_v2(labels=labels, logits=logits)
# Compute evaluation metrics.
accuracy = tf.metrics.accuracy(labels=labels,
But I get an error
InvalidArgumentError (see above for traceback): Input to reshape is a tensor with 8 values, but the requested shape has 1
[[Node: Reshape = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/device:CPU:0"](softmax_cross_entropy_with_logits, sequence_input_layer/chars_embedding/assert_equal/Const)]]
You can find a fuller minimal example here. You would most likely need TensorFlow 1.8.0.
Adding
loss = tf.reduce_mean(loss)
now allows the network to train, but the results are underwhelming.
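The error message mentions a tensor with 8 values where 1 was expected, which would be consistent with an unreduced loss holding one value per example; the batch size of 8 is an assumption here, as it is not stated in the question. The NumPy sketch below mimics a per-example softmax cross-entropy and shows how taking the mean turns it into a single scalar.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    """Per-example cross-entropy: one value per row, not yet reduced."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -(labels * log_probs).sum(axis=1)

rng = np.random.default_rng(0)
batch_size, n_classes = 8, 5          # batch size of 8 is assumed from the error message
logits = rng.normal(size=(batch_size, n_classes))
labels = np.eye(n_classes)[rng.integers(0, n_classes, size=batch_size)]  # one-hot rows

per_example = softmax_cross_entropy(labels, logits)
scalar_loss = per_example.mean()      # the analogue of tf.reduce_mean(loss)

print(per_example.shape, np.ndim(scalar_loss))
```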
