How to split text data for LSTM training - python

I'm using a Colab Pro TPU, which offers up to 35 GB of memory. My dataset contains 650,000 sequences, and I'm trying to use a bidirectional LSTM to predict the next word.
When I attempt to generate the binary class vector using to_categorical(), it crashes because of memory limits. I took the first 200k sequences and trained the model, but accuracy plateaus at around 65%. Before tweaking the hyperparameters, I wanted to feed in the whole dataset and train on that. Is it possible to split the dataset, generate the sequences in chunks, and join them for training?
Appreciate any suggestions.
Thanks.
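One memory-friendly pattern (a minimal sketch, assuming a Keras-style setup with integer-encoded sequences; the vocabulary size, sequence length, and layer sizes below are invented) is to drop to_categorical() entirely: sparse_categorical_crossentropy takes integer targets directly, and a generator feeds batches on the fly, so the full one-hot matrix is never built:

```python
import numpy as np
import tensorflow as tf

VOCAB_SIZE = 20_000   # assumption: your tokenizer's vocabulary size
SEQ_LEN = 20          # assumption: length of each input sequence
BATCH_SIZE = 256

# Hypothetical stand-in for the 650k tokenized sequences; each row's
# last token is the next-word target.
sequences = np.random.randint(1, VOCAB_SIZE, size=(650_000, SEQ_LEN + 1))

def batch_generator(seqs, batch_size=BATCH_SIZE):
    """Yield (inputs, integer targets); no one-hot matrix is ever built."""
    while True:
        idx = np.random.randint(0, len(seqs), batch_size)
        batch = seqs[idx]
        yield batch[:, :-1], batch[:, -1]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(SEQ_LEN,)),
    tf.keras.layers.Embedding(VOCAB_SIZE, 100),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(128)),
    tf.keras.layers.Dense(VOCAB_SIZE, activation="softmax"),
])

# sparse_categorical_crossentropy takes integer labels directly, so the
# 650,000 x VOCAB_SIZE matrix that to_categorical() would allocate never exists.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(batch_generator(sequences),
          steps_per_epoch=len(sequences) // BATCH_SIZE,
          epochs=5)
```

With 650k sequences and a sizeable vocabulary, the one-hot target matrix alone can run to tens of GB, which matches the crash described above.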

Related

Training an RNN with multiple files

I am trying to train an LSTM model to predict speech veracity using video as input. I have one dataset per video, and each dataset contains its own timestamps. I have been trying to prepare each dataset as a time series for model training. I tried combining all the datasets into a single one, but I am not sure that is the right procedure, because the videos are not ordered in the final dataset the way a time series should be. I have also tried fitting the model with one dataset at a time.
(The original question included screenshots of the raw video datasets and of a dataset after some preprocessing.)
What is the best way to feed multiple datasets to an LSTM model?
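One common pattern (a minimal sketch, assuming each video reduces to a per-frame feature matrix; the feature size, lengths, and labels below are invented stand-ins for the per-video CSVs) is to treat every video as its own independent sequence, pad the sequences to a common length, and stack them, so no ordering between videos is ever implied:

```python
import numpy as np
import tensorflow as tf

# Hypothetical: one feature matrix per video, shape (n_frames_i, n_features).
per_video = [np.random.rand(np.random.randint(50, 200), 8) for _ in range(100)]
labels = np.random.randint(0, 2, size=len(per_video))  # one veracity label per video

# Pad every video to the same length; each video is its own time series,
# so the order of videos within a batch does not matter.
X = tf.keras.preprocessing.sequence.pad_sequences(
    per_video, padding="post", dtype="float32")

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=X.shape[1:]),
    tf.keras.layers.Masking(mask_value=0.0),   # ignore the padded frames
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X, labels, batch_size=16, epochs=5)
```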

Is there a difference between training a model on small amounts of data multiple times and on a large dataset once?

I already have a model that has been trained on 130,000 sentences. I want to categorize sentences with a bidirectional LSTM, and we plan to use this in a service. However, the model must keep being trained while the service runs. So my plan is: until the model's accuracy is high enough, I will review the sentences the model has categorized, correct them myself, and train the model on each corrected sentence.
Is there a difference between training on the sentences one by one, each time I receive one, and merging them into one file and training on that all at once? Does one-by-one training matter?
Yes, there is a difference. Suppose you have a dataset of 10,000 sentences.
If you train on one sentence at a time, an optimization step (backpropagation) takes place for every single sentence. This consumes far more time and is not a good choice; it is simply impractical for a large dataset. The gradient computed on a single instance is also very noisy, so convergence is slow.
If you train in batches, say with a batch size of 1,000, then you have 10 batches. The gradients are computed over each batch as it goes through the network, so every update averages out most of the per-sentence noise while still retaining enough stochasticity to help escape poor local minima. It is also memory-efficient and converges faster.
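To make that concrete (a sketch using Keras, which the question doesn't specify; the shapes and labels are invented), the two regimes differ only in the batch_size argument:

```python
import numpy as np
import tensorflow as tf

# Hypothetical stand-in for 10,000 encoded sentences and binary labels.
X = np.random.rand(10_000, 50, 16)          # (samples, timesteps, features)
y = np.random.randint(0, 2, size=10_000)

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(50, 16)),
    tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(32)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

# batch_size=1    -> one noisy, slow gradient update per sentence.
# batch_size=1000 -> 10 updates per epoch, each averaged over 1,000 sentences.
model.fit(X, y, batch_size=1000, epochs=5)
```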

Dataset too big for RAM, how to do efficient epochs

I am currently working with a dataset of about 2 million objects. Before training on them I have to load them from disk and perform some preprocessing (which makes the dataset much bigger, so it would be inefficient to save the post-processed data).
Right now I just load and train in small batches, but if I want to train for multiple epochs on the full dataset, I have to reload and re-preprocess all of the data every epoch, which takes a lot of time. The alternative is training for multiple epochs on each small batch of data before moving on to the next batch. Could the second method cause any issues (like overfitting)? And is there a better way to do this? I'm using tflearn with Python 3, in case there are any built-in methods for this.
tl;dr: Is it okay to train for multiple epochs on subsets of the data before training for a single epoch on all the data?
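One way to stream full-dataset epochs without holding everything in RAM is a lazy input pipeline (a sketch using tf.data rather than tflearn's own helpers; the file layout and preprocessing step are invented placeholders) that reloads and preprocesses from disk on the fly each epoch:

```python
import os
import numpy as np
import tensorflow as tf

# Hypothetical on-disk layout: one .npy file per raw object.
# (A few dummy files are created here so the sketch actually runs.)
os.makedirs("data", exist_ok=True)
for i in range(100):
    np.save(f"data/object_{i}.npy",
            np.random.randint(0, 256, (32, 32), dtype=np.uint8))
file_paths = [f"data/object_{i}.npy" for i in range(100)]

def load_and_preprocess(path):
    """Load one raw object and apply the expensive preprocessing on the fly,
    so the blown-up version never has to be stored on disk."""
    raw = np.load(path.numpy().decode())
    return (raw / 255.0).astype(np.float32)

ds = (tf.data.Dataset.from_tensor_slices(file_paths)
        .shuffle(len(file_paths))            # reshuffled every epoch
        .map(lambda p: tf.py_function(load_and_preprocess, [p], tf.float32),
             num_parallel_calls=tf.data.AUTOTUNE)
        .batch(16)
        .prefetch(tf.data.AUTOTUNE))

# model.fit(ds, epochs=10) would then see the whole dataset every epoch
# while only a few batches ever live in memory at once.
```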

Training the same model with different data sets in TensorFlow

The problem:
I have a model that I would like to train on several independent datasets. Afterwards, I would like to extract the weights of each trained instance (the model is the same for each instance, but each is trained on a different dataset) and compute an average of these weights. Basically, my intention is to mimic TensorFlow running on multiple devices and then average their weights so that they can be used by one model.
My solution:
I added this model to the TensorFlow graph multiple times and am currently training each copy separately on its own dataset, but this uses GBs of memory. Is there a better way to do this?
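For the averaging step itself, here is a minimal sketch (assuming Keras models with identical architecture; the toy data and layers are invented) that trains the copies sequentially and keeps only their weight lists, so N full graphs never live in memory at once:

```python
import numpy as np
import tensorflow as tf

def build_model():
    """Same architecture for every instance (stand-in layers)."""
    return tf.keras.Sequential([
        tf.keras.layers.Input(shape=(10,)),
        tf.keras.layers.Dense(32, activation="relu"),
        tf.keras.layers.Dense(1),
    ])

# Hypothetical independent datasets.
datasets = [(np.random.rand(100, 10), np.random.rand(100)) for _ in range(3)]

trained_weights = []
for X, y in datasets:
    model = build_model()
    model.compile(optimizer="adam", loss="mse")
    model.fit(X, y, epochs=3, verbose=0)
    trained_weights.append(model.get_weights())
    # The model object can be garbage-collected here, so only the
    # weight arrays of previous runs are kept around.

# Element-wise average of the weight tensors across all trained copies.
avg_weights = [np.mean(layer_stack, axis=0)
               for layer_stack in zip(*trained_weights)]

final_model = build_model()
final_model.set_weights(avg_weights)
```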
One possible solution is to fine-tune your network's weights from another, similar network (one trained on a similar dataset; e.g., if your dataset is images, you can use AlexNet weights). Don't be afraid if your network doesn't have the same architecture: you can load the weights of only as many layers as you need with the load_with_skip function from
https://github.com/joelthchao/tensorflow-finetune-flickr-style/blob/master/network.py
Fine-tuning takes much less time than training a network from scratch.
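The linked load_with_skip helper predates tf.keras; if you are on tf.keras, the built-in partial load does the same thing (a sketch; "pretrained.h5" is a hypothetical checkpoint from a similar network, and the layers are stand-ins):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(128,)),
    tf.keras.layers.Dense(64, activation="relu", name="fc1"),
    tf.keras.layers.Dense(10, name="logits"),
])

# Load only the layers whose names (and shapes) match the checkpoint,
# silently skipping the rest -- analogous to load_with_skip.
model.load_weights("pretrained.h5", by_name=True, skip_mismatch=True)
```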

How to counteract the huge negative effect of my training data distribution on subsequent neural network classification?

I need to train my network on data whose labels follow a normal distribution. Comparing its predictions with the actual labels in a CSV file I exported, I've noticed that my neural net has a very strong tendency to predict only the most frequent class label.
What are some suggestions (other than cleaning the data to produce an evenly distributed training set) that would help my neural net stop predicting only the most frequent label?
UPDATE: the suggestions made in the comments did indeed work. I found, however, that adding an extra layer to my NN also mitigated the problem.
Assuming the NN is trained using mini-batches, it is possible to simulate (rather than generate) evenly distributed training data by making sure each mini-batch is evenly distributed.
For example, for a 3-class classification problem with a mini-batch size of 30, construct each mini-batch by randomly selecting 10 samples per class (with repetition, if necessary).
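A minimal sketch of that sampler (assuming in-memory NumPy arrays; the imbalanced toy data is invented):

```python
import numpy as np

rng = np.random.default_rng()

def balanced_batch(X, y, n_classes=3, per_class=10):
    """Build one mini-batch with exactly `per_class` samples of each class,
    sampling with replacement so rare classes can still fill their quota."""
    idx = np.concatenate([
        rng.choice(np.where(y == c)[0], size=per_class, replace=True)
        for c in range(n_classes)
    ])
    rng.shuffle(idx)
    return X[idx], y[idx]

# Hypothetical imbalanced data: class 0 dominates.
X = np.random.rand(1000, 20)
y = rng.choice(3, size=1000, p=[0.90, 0.07, 0.03])

Xb, yb = balanced_batch(X, y)   # 30 samples, 10 per class
print(np.bincount(yb))          # -> [10 10 10]
```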
