This is a follow-up to the following question: Confused about how to implement time-distributed LSTM + LSTM
The current draft structure that is working well:
The basic idea is that there is a TimeDistributed deep LSTM input layer that works on each epoch of raw time series data and outputs a vector of features for each epoch. Then, the "outer" deep LSTM layer takes 7 of those sequential outputs and tries to classify the center epoch (the assumption being that 1 epoch does not have enough information to be classified by itself and needs the surrounding epochs). I say this is a draft because I haven't yet explored the feature space required for this to work well on many subjects.
There are several issues that still need to be resolved, but the one I haven't found any clear-cut examples of online is training this model in two parts: 1) the TimeDistributed layer and 2) the "outer" layer. The reason is that as I increase the number of epochs needed to classify one (currently 7, but I expect it may get up to 21 or higher), more duplicated data is loaded and training slows down quickly.
One might propose an autoencoder for the first layer. However, I don't think this is the best solution. The reason is that the features necessary to reproduce the input might very well be different from the features necessary, in combination with the surrounding epochs, to classify the center epoch. To expand: this is likely because the time series is semi-periodic, with most of the epoch providing little information other than the current period from important feature to important feature (and the number and location of these important features varies in each epoch).
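For reference, a minimal Keras sketch of that draft structure (all sizes below are placeholders, not the values actually used):

from tensorflow.keras import layers, models

EPOCH_LEN = 3000      # samples per raw epoch (placeholder)
N_CHANNELS = 1        # channels per sample (placeholder)
N_EPOCHS = 7          # surrounding epochs fed to the outer LSTM
N_FEATURES = 64       # feature vector produced per epoch (placeholder)
N_CLASSES = 5         # target classes (placeholder)

# Inner encoder: one raw epoch -> one feature vector
inner_in = layers.Input(shape=(EPOCH_LEN, N_CHANNELS))
x = layers.LSTM(128, return_sequences=True)(inner_in)
x = layers.LSTM(N_FEATURES)(x)
encoder = models.Model(inner_in, x, name="epoch_encoder")

# Outer classifier: 7 encoded epochs -> class of the centre epoch
outer_in = layers.Input(shape=(N_EPOCHS, EPOCH_LEN, N_CHANNELS))
feats = layers.TimeDistributed(encoder)(outer_in)   # (batch, N_EPOCHS, N_FEATURES)
h = layers.LSTM(64)(feats)
out = layers.Dense(N_CLASSES, activation="softmax")(h)
model = models.Model(outer_in, out)
model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

Keeping the encoder as its own Model is one way to make the two-part training possible: it can be pre-trained (or frozen with encoder.trainable = False) and then wrapped in TimeDistributed for the outer stage.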
Related
I am training an LSTM to forecast the next value of a time series. Let's say I have training data with shape (2345, 95) and a total of 15 files with this data; this means I have 2345 windows with 50% overlap between them (the time series was divided into windows). Each window has 95 timesteps. If I use the following model:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

input1 = Input(shape=(95, 1))
lstm1 = LSTM(units=100, return_sequences=False, activation="tanh")(input1)
outputs = Dense(1, activation="sigmoid")(lstm1)
model = Model(inputs=input1, outputs=outputs)
model.compile(loss=keras.losses.BinaryCrossentropy(),
              optimizer=tf.keras.optimizers.Adam(learning_rate=0.01))
I am feeding this data using a generator that passes a whole file each time, so one epoch will have 15 steps. Now my question is: in a given epoch, for a given step, does the LSTM remember the previous window that it saw, or is the memory of the LSTM reset after seeing each window? If it remembers the previous windows, is the memory reset only at the end of an epoch?
I have seen similar questions, like TensorFlow: Remember LSTM state for next batch (stateful LSTM) or https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm, but I either did not quite understand the explanation or was unsure whether what I wanted was explained. I'm looking for more of a technical explanation of where in the LSTM architecture the memory/hidden state is reset.
EDIT:
So from my understanding there are two concepts we can call "memory" here: the weights that are updated through BPTT, and the hidden state of the LSTM cell. For a given window of timesteps the LSTM can remember what the previous timestep was; this is what the hidden state is for, I think. The weight update does not directly reflect memory, if I'm understanding this correctly.
The size of the hidden state, in other words how much the LSTM remembers, is determined by the batch size, which in this case is one whole file, but other questions/answers (https://datascience.stackexchange.com/questions/27628/sliding-window-leads-to-overfitting-in-lstm and https://stackoverflow.com/a/50235563/13469674) state that if we have two windows, for instance [1,2,3] and [4,5,6], the LSTM does not know that 4 comes after 3 because they are in different windows, even though they belong to the same batch. So I'm still unsure how exactly memory is maintained in the LSTM.
It makes some sense that the hidden state is reset between windows when we look at the LSTM cell diagram. But then the weights are only updated after each step, so where does the hidden state come into play?
What you are describing is called "Back Propagation Through Time", you can google that for tutorials that describe the process.
Your concern is justified in one respect and unjustified in another respect.
The LSTM is capable of learning across multiple training iterations (e.g. multiple 15-step intervals). This is because the LSTM state is passed forward from one iteration to the next, feeding information forward across multiple training iterations.
Your concern is justified in that the model's weights are only updated with respect to the 15 steps (plus any batch size you have). As long as 15 steps is long enough for the model to catch valuable patterns, it will generally learn a good set of weights that generalize well beyond 15 steps. A good example of this is the character-level Shakespeare model described in Karpathy's "The Unreasonable Effectiveness of Recurrent Neural Networks".
In summary, the model is learning to create a good hidden state for the next step, averaged over sets of 15 steps as you have defined them. It is common for an LSTM to produce a good, generalized solution by looking at data in these limited segments. This is akin to batch training, but sequential over time.
I might note that 100 is a more typical upper limit for the number of steps in an LSTM. At ~100 steps you start to see a vanishing gradient problem in which the earlier steps contribute nearly nothing to the gradient.
Note that it is important to ensure you are passing the LSTM state forward from training step to training step over the course of an episode (any contiguous sequence). If this step was missed the model would certainly suffer.
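In Keras specifically, the default behaviour is that the hidden/cell state is reset after each batch; one way to actually carry state forward across the consecutive windows of one file, as described above, is a stateful LSTM with an explicit reset at file boundaries. A minimal sketch (the shapes, the epoch loop and file_generator are assumptions standing in for the asker's data pipeline):

import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

# stateful=True: the state left over after one batch becomes the initial
# state of the next batch, instead of being reset to zeros.
inp = Input(shape=(95, 1), batch_size=1)     # stateful layers need a fixed batch size
h = LSTM(100, stateful=True, activation="tanh")(inp)
out = Dense(1, activation="sigmoid")(h)
model = Model(inp, out)
model.compile(loss="binary_crossentropy", optimizer="adam")

for epoch in range(10):
    # file_generator is hypothetical: it yields (windows, labels) for one whole
    # file, with windows of shape (n_windows, 95, 1).
    for windows, labels in file_generator():
        for w, y in zip(windows, labels):
            model.train_on_batch(w[np.newaxis], np.array([y]))   # state carries from window to window
        model.reset_states()   # a new file is an independent sequence, so reset here

Without stateful=True (the default), each window is treated independently and the hidden state is zeroed at the start of every batch, which matches the behaviour the linked answers describe.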
I'm trying to train my neural network. Its aim is to predict video quality. I am using VGG16+LSTM networks, but the VGG16 is not trainable. The total number of trainable parameters is 700,000.
I have three questions:
Are 700,000 trainable parameters enough for training on the 68,000 input frames?
Is it essential to train VGG16?
How many epochs are needed to get the best results?
I haven't been into machine learning in a while, but my understanding is that:
depends, but the only way to find out is to train it and look for over/underfitting
depends on the network layout. It might also be useful to bypass some information around the VGG16, in case the VGG16 hides some of the information you actually need about 'video quality'
depends. You would have to split your data into a training and a test set in order to find that out.
As most things in machine learning and especially deep learning the answers aren't obvious and depend heavily on the problem and the exact network layout. There will be much trial and error involved.
The most important takeaway, I think, is to have two (or even three) different datasets for the training/validation/test step, so you can answer those questions yourself.
For more information, read the wikipedia entry about splitting your datasets.
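As a concrete illustration of such a split (X and y here are hypothetical arrays of inputs and quality labels, and the split ratios are arbitrary):

from sklearn.model_selection import train_test_split

# Hold out a test set first, then split the remainder into train/validation.
X_trainval, X_test, y_trainval, y_test = train_test_split(X, y, test_size=0.15, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_trainval, y_trainval, test_size=0.15, random_state=0)

# Train on the training set, choose the number of epochs / architecture on the
# validation set, and report final numbers on the test set only.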
You start with one epoch and see what impact it had.
Even one epoch will take a long time, and computing the error also takes a bit of time.
I'm building a Sequential NN model in Keras for binary classification. The training data has about 600,000 rows and 2,000 features, so every epoch and every layer is very time consuming. I believe many of the features are not relevant to the model and can be dropped altogether to make the model thinner, so it would be faster to work with.
I run a simple model with one hidden layer of 200 neurons. How can I tell which of the features (which are actually the nodes in the input layer) are meaningless, so I can drop them from the data set and re-run the model without them?
There is a very big topic in machine learning called feature selection. Neural networks are considered to choose, to an extent, the best features for the problem automatically, by using their weights to give some features more consideration and others less. Neural networks also need a lot of experience to be tuned correctly. I would definitely suggest you increase the number of layers in the network, because you have a lot of data and features, and use l1 regularisation in order to get sparse weights and exclude most of the features. This advice is only indicative, since I do not know anything about your dataset or your network architecture. Lastly, I would suggest you study more of the basics of machine learning and then continue learning about neural networks, before practicing with real data.
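As one concrete way to act on the l1 suggestion, here is a minimal sketch in Keras (layer sizes, the regularisation strength and the pruning threshold are placeholders); after training, input features whose outgoing weights have been driven to (near) zero are candidates for dropping:

import numpy as np
from tensorflow.keras import layers, models, regularizers

model = models.Sequential([
    layers.Dense(200, activation="relu", input_shape=(2000,),
                 kernel_regularizer=regularizers.l1(1e-4)),   # l1 pushes many weights to (near) zero
    layers.Dense(200, activation="relu",
                 kernel_regularizer=regularizers.l1(1e-4)),
    layers.Dense(1, activation="sigmoid"),
])
model.compile(loss="binary_crossentropy", optimizer="adam", metrics=["accuracy"])
# model.fit(X_train, y_train, validation_split=0.1, ...)

# The first layer's kernel has shape (2000, 200): one row per input feature.
# Features whose whole row is close to zero contribute little to the model.
w = np.abs(model.layers[0].get_weights()[0])
feature_importance = w.sum(axis=1)
low_value_features = np.where(feature_importance < 1e-3)[0]   # threshold is a placeholder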
How can I use Long Short-Term Memory (LSTM) to predict a future value x(t+1) (out-of-sample prediction) based on a historical dataset? I have read and tried many web tutorials for forecasting and prediction using LSTM, but I am still far from the point. What is the exact procedure for doing this prediction? Is it as simple as shifting the target array n steps, where n is the number of future predictions, and running the prediction, or are there other techniques?
Please help or leave a suggestion.
Can you provide the framework you are using? TensorFlow? PyTorch? Which web tutorials specifically?
Assuming you are going with TensorFlow, you can copy and paste code from one of these, test that it works on the provided dataset, then modify the input encoding functions to fit your dataset, then run it on your data.
https://github.com/llSourcell/How-to-Predict-Stock-Prices-Easily-Demo (best)
https://github.com/sebastianheinz/stockprediction
https://github.com/talolard/MarketVectors/blob/master/preparedata.ipynb (you will have to replace fc layers with lstm, and fiddle with inputs)
In general, the procedure is something like the following (assuming TensorFlow); a minimal sketch follows the list:
Download Dataset
Create a function to load batches of data
Create a function to encode a batch of data (normalization, other transforms)
Create an LSTM layer to receive the series of inputs.
Create an output layer (usually fully connected) to take the last LSTM state and predict an output of your desired size.
Create a tf session to wire everything together, and hit run.
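The steps above mention the old tf session style; here is a minimal sketch of the same steps using tf.keras instead (the window length, layer sizes and normalisation are placeholders, and `series` is a hypothetical 1-D array of prices):

import numpy as np
from tensorflow.keras.layers import Input, LSTM, Dense
from tensorflow.keras.models import Model

WINDOW = 100       # how far back we look (placeholder)
N_FEATURES = 1     # e.g. just the closing price (placeholder)

# Steps 2-3: load and encode batches - a stub that windows and normalises a 1-D series
def make_batches(series, window=WINDOW):
    x = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = series[window:]
    mean, std = x.mean(), x.std()
    return (x - mean) / std, (y - mean) / std

# Steps 4-5: an LSTM layer over the input series, then a fully connected output of size 1
inp = Input(shape=(WINDOW, N_FEATURES))
last_state = LSTM(64)(inp)             # output at the last timestep
out = Dense(1)(last_state)             # predicted x(t+1)
model = Model(inp, out)
model.compile(loss="mse", optimizer="adam")

# Step 6: wire everything together and run
# x, y = make_batches(series)
# model.fit(x[..., np.newaxis], y, epochs=20, batch_size=32)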
Some questions to ask conceptually about which network to use:
How many inputs to how many outputs - see these excellent slides by Karpathy: http://cs231n.stanford.edu/slides/2016/winter1516_lecture10.pdf
How far back do you consider the stock prices, e.g. {t-100... t} or {t-10 ...t}, which may dictate the size of the hidden layers.
What other information do you think is relevant to the model? Does stock A influence stock B? In that case you may have 2 LSTMs outputting a state to your fully connected layer...
Is there a way to learn unsupervised features from a set of images, similar to word2vec or doc2vec, where a neural network is learnt and, given a new document, we get its features?
I am expecting something similar to this example, which shows that it can load a learnt nn-model and predict features for new images.
Any simple example of how to implement a CNN over images and get their features back would help!
Suppose in this example
If I want to get CNN features for all X_train and X_test ... is there any way?
Also, if we can get weights per layer per image, we can stack them and use them as features. In that case, is there a way to get the same?
Using those features for an unsupervised task would be easier if we consider them as vectors.
If I understood your question correctly, this task is quite common in the deep learning field. In the case of images, what I consider the best option is a convolutional autoencoder. You may read about this architecture e.g. here:
http://people.idsia.ch/~ciresan/data/icann2011.pdf
Previous versions of Keras supported this architecture as one of the core layers, though I noticed that from version 1.0 it disappeared from the documentation. But it's still quite easy to build from scratch :)
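A minimal sketch of such a convolutional autoencoder in Keras (the image size, filter counts and bottleneck size are placeholders); once trained, the encoder half on its own turns new images into features:

from tensorflow.keras import layers, models

# Encoder: image -> compact feature map
inp = layers.Input(shape=(28, 28, 1))                        # e.g. MNIST-sized grayscale images
x = layers.Conv2D(16, 3, activation="relu", padding="same")(inp)
x = layers.MaxPooling2D(2)(x)
x = layers.Conv2D(8, 3, activation="relu", padding="same")(x)
encoded = layers.MaxPooling2D(2)(x)                          # (7, 7, 8) bottleneck

# Decoder: feature map -> reconstructed image
x = layers.Conv2D(8, 3, activation="relu", padding="same")(encoded)
x = layers.UpSampling2D(2)(x)
x = layers.Conv2D(16, 3, activation="relu", padding="same")(x)
x = layers.UpSampling2D(2)(x)
decoded = layers.Conv2D(1, 3, activation="sigmoid", padding="same")(x)

autoencoder = models.Model(inp, decoded)
autoencoder.compile(loss="binary_crossentropy", optimizer="adam")
# autoencoder.fit(X_train, X_train, ...)      # trained to reproduce its own input

# The encoder alone then gives features for any images, e.g. the X_train/X_test from the question:
encoder = models.Model(inp, encoded)
# features = encoder.predict(X_test).reshape(len(X_test), -1)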
In non-image cases there are also other approaches, e.g. Restricted Boltzmann Machines.
UPDATE :
When it comes to which activations are best for obtaining new features from a neural network, from my personal experience it depends on the size of the net you use. If you use a network whose last layer is wide (has a lot of nodes), it might be sufficient to take only the last layer (due to the number of parameters, considering previous layers as well may harm the performance of learning). But if (as in the case of some MNIST networks) your last layer is not sufficient for this task, you may try also using previous layers' activations, or even the whole net's activity. To be honest, I'm not expecting much improvement in that case, but you may try. I think you should use both approaches: start by taking only the last layer's activations, and then check the behaviour of your new classifier when you add activations from previous layers.
What I would also strongly advise is getting some insight into what sort of features the network is learning, by using t-SNE embeddings of its activations. In many cases I have found it useful, e.g. for checking whether the size of a layer is sufficient. Using t-SNE you can check whether the features obtained from the last layer are good discriminators of your classes. It may also give you good insights about your data and about what neural networks are really learning (alongside some amazing visualizations :) )
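As a sketch of both suggestions together, assuming `trained_model` is an already-trained Keras classifier and `X` are the images of interest (both hypothetical names here):

import numpy as np
from tensorflow.keras.models import Model
from sklearn.manifold import TSNE

# Activations of the last hidden layer as features
feature_model = Model(inputs=trained_model.input, outputs=trained_model.layers[-2].output)
features = feature_model.predict(X).reshape(len(X), -1)

# Optionally stack in an earlier layer's activations as well
earlier = Model(trained_model.input, trained_model.layers[-3].output).predict(X)
features = np.hstack([features, earlier.reshape(len(X), -1)])

# 2-D t-SNE embedding, to check visually whether these features separate the classes
embedding = TSNE(n_components=2).fit_transform(features)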