This is a repost from the PyBrain Google group: https://groups.google.com/forum/#!topic/pybrain/J9qv0nHuxVY.
I've been tinkering with OpenNN and FANN and have yet to find an ANN library that does what I need.
I'll break this up into short-term and medium-term goals, but first a little background...
Background
I want to use an ANN to do time-series prediction of a 1000-2000 item vector varying over time. Each element is a Boolean value that represents the presence of a cluster of visual properties in a computer vision system at a particular moment in time.
The idea is to feed the network the vector at time t-1 with the vector at time t as a target value.
The output of the network will then be a prediction of what is expected to happen in the current time (t) based on the previous state (t-1) of the vector.
Short Term
I would like to train an ANN so that it can learn to predict these vectors over time. That is, I would like to feed it an arbitrary vector and have it return the vector it has been trained to expect next. It's fine to use a finite dataset for now, and I expect to start with normal epoch learning. I'm starting with a normal supervised data set where the inputs and targets are offset by one time unit. Thus far I have not been seeing results as good as those I got in FANN in terms of MSE (no significant decrease in error after the first epoch), as described here: https://groups.google.com/forum/#!topic/pybrain/QSfVHsFRXz0.
In FANN I was just using a simple MLP with 1026 inputs, 103 hidden units, and 1026 outputs. Boolean inputs were scaled to the range -1 to 1 and weights were initialized to random values between -1 and 1. (This was done because learning is apparently faster with inputs in [-1, 1] than in [0, 1].) The network was reproducing input patterns quite well and ended up with a small MSE.
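For reference, the scaling step described above could look like this when parsing one line of 0/1 values (a minimal sketch, not the actual FANN preprocessing code):

def scale_boolean_line(line):
    # map each 0/1 token to -1/1 before feeding it to the network
    return [2 * int(token) - 1 for token in line.split()]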
In PyBrain this is the current version of the code:
#!/usr/bin/python
# First try at using PyBrain for building an ANN. We'll start with an MLP for obvious reasons.
from pybrain.tools.shortcuts import buildNetwork
from pybrain.datasets import SupervisedDataSet
from pybrain.supervised.trainers import BackpropTrainer
from pybrain.tools.validation import ModuleValidator
import pickle, time

# Load the dataset (parse data)
print "Loading and parsing dataset..."
inputFile = open("../MLP/data/113584_backgroundState-filter-1.fann", 'r')
rawData = inputFile.readlines()
inputFile.close()

processedData = list()
for inputLine in rawData[1:]:  # skip the first (header) line
    data = inputLine.split()  # split() also discards the trailing newline
    scaleData = list()
    for item in data:
        scaleData.append(int(item))
    processedData.append(scaleData)

# Create dataset from parsed data
inputData = SupervisedDataSet(1027, 1027)
for index in xrange(0, len(processedData) - 1, 2):  # lines come in (input, target) pairs
    inputData.addSample(processedData[index], processedData[index + 1])
del processedData  # no longer needed

# Build the same network as in FANN
net = buildNetwork(1027, 103, 1027, bias=True)

# Train the network
print "Training network..."
trainer = BackpropTrainer(net, inputData, verbose=True)
for i in xrange(5):
    startTime = time.time()
    error = trainer.train()  # one epoch; returns the training error
    print "ERROR " + str(i) + " " + str(error)
    print "ProcessingTime " + str(i) + " " + str(time.time() - startTime)

# Save results
print "Saving network..."
fileObject = open('network.pybrain', 'wb')
pickle.dump(net, fileObject)
fileObject.close()

# For each input pattern, what is the output?
# Compatible with FANN output
print "Testing network and generating results..."
i = 0
for inputPattern in inputData['input']:
    outputPattern = net.activate(inputPattern)
    for j in xrange(len(inputPattern)):
        print "RESULT " + str(i) + " " + str(j) + " " + str(inputPattern[j]) + " " + str(outputPattern[j])
    i += 1
print "Done."
Any recommendations for how best to set this up for learning? (I expect I'll need recurrence, but I wanted to compare PyBrain to my previous FANN results with a plain MLP.)
I actually tried building this network with recurrent=True, but in all my tests Python ended up using all the available RAM and crashing (there is 8 GB of RAM on this machine). I'm not sure how I can enable recurrence without a huge increase in memory footprint.
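For comparison, an explicitly constructed recurrent network in PyBrain looks roughly like the sketch below (it follows the pattern from the PyBrain documentation; the layer sizes match the MLP above, and whether this avoids the memory blow-up is untested):

from pybrain.structure import RecurrentNetwork, LinearLayer, SigmoidLayer, FullConnection

rnet = RecurrentNetwork()
rnet.addInputModule(LinearLayer(1027, name='in'))
rnet.addModule(SigmoidLayer(103, name='hidden'))
rnet.addOutputModule(LinearLayer(1027, name='out'))
rnet.addConnection(FullConnection(rnet['in'], rnet['hidden']))
rnet.addConnection(FullConnection(rnet['hidden'], rnet['out']))
# the recurrent connection feeds the hidden state back into itself at the next time step
rnet.addRecurrentConnection(FullConnection(rnet['hidden'], rnet['hidden']))
rnet.sortModules()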
Medium Term
Eventually the system will have to run on-line, where inputs are fed on the fly and are constantly changing. This means epoch training will not be possible, so I need to be able to run a single iteration of the learning algorithm at a time. I realize it will be hard for the ANN to learn this way, but the good news is there will be no shortage of data points (at least hundreds of thousands). Because there is no fixed set of data, there is no need for convergence; I expect the error to rise and fall as new or stable patterns are presented.
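A minimal sketch of what a single on-line update could look like, reusing the same SupervisedDataSet and BackpropTrainer pieces from the script above (one-sample dataset per time step; learning rate and whether this is fast enough are open questions):

def online_step(net, prev_vec, curr_vec):
    # train on a single (t-1 -> t) pair and return the error for that step
    ds = SupervisedDataSet(len(prev_vec), len(curr_vec))
    ds.addSample(prev_vec, curr_vec)
    step_trainer = BackpropTrainer(net, ds)
    return step_trainer.train()  # one pass over the one-sample dataset

# usage at each time step: error = online_step(net, vector_t_minus_1, vector_t)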
Thanks for any comments and suggestions.
Related
I have a dataframe consisting of over 1M articles that I have cleaned for the purpose of training a BERT text summarization model, which in turn gets fed into a QA pipeline to ultimately train a T5 closed-book QA model. I used multithreading before to vastly improve scraping times. However, is there an equivalent of multiprocessing for BERT?
The current flow looks like this:
import torch
import pandas as pd
from transformers import BartTokenizer, BartForConditionalGeneration

def longBatch(context):
    model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn')
    tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
    # tokenize without truncation, then split the token ids into model-sized chunks
    inputs_no_trunc = tokenizer(context, max_length=None, return_tensors='pt', truncation=False)
    chunk_start = 0
    chunk_end = tokenizer.model_max_length  # == 1024 for BART
    inputs_batch_lst = []
    while chunk_start <= len(inputs_no_trunc['input_ids'][0]):
        inputs_batch = inputs_no_trunc['input_ids'][0][chunk_start:chunk_end]
        inputs_batch = torch.unsqueeze(inputs_batch, 0)
        inputs_batch_lst.append(inputs_batch)
        chunk_start += tokenizer.model_max_length
        chunk_end += tokenizer.model_max_length
    # generate a summary on each chunk
    summary_ids_lst = [model.generate(inputs, num_beams=4, max_length=100, early_stopping=True)
                       for inputs in inputs_batch_lst]
    summary_batch_lst = []
    for summary_id in summary_ids_lst:
        summary_batch = [tokenizer.decode(g, skip_special_tokens=True, clean_up_tokenization_spaces=False)
                         for g in summary_id]
        summary_batch_lst.append(summary_batch[0])
    summary_all = '\n'.join(summary_batch_lst)
    print(summary_all)
    return summary_all

listarray = []
for row in x["0"]:
    try:
        pd.DataFrame(listarray).to_csv("/drive/somepath/sum.csv")  # checkpoint what we have so far
        print(len(listarray))
        listarray.append(longBatch(row))
        print("Done with line")
    except:
        print("error:" + " " + row)
Now this works fine as is, with each article in its own row being processed one at a time, but it is fairly slow. Multithreading took my scraping throughput into the realm of roughly 500,000 articles a day, but this current summarization setup peaks at roughly 5,500 a day. I've tried defining the processes and running them in a pool, but the times actually increased by 6 seconds per summarization. The system doesn't strain when fed one article at a time, so I'd imagine it could handle quite a bit more. How would I go about parallelizing the process?
For reference, I am testing each section of the script in Colab Pro+, which has a Tesla V100 GPU and an 8-core CPU, before moving to my local machine.
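One direction that might help before reaching for multiprocessing is keeping the model loaded once, moving it to the GPU, and batching several articles into each generate call. A minimal sketch under those assumptions (it truncates each article to a single 1024-token window instead of reproducing longBatch's chunking, and the batch size of 8 is just a guess to be tuned):

import torch
from transformers import BartTokenizer, BartForConditionalGeneration

device = 'cuda' if torch.cuda.is_available() else 'cpu'
tokenizer = BartTokenizer.from_pretrained('facebook/bart-large-cnn')
model = BartForConditionalGeneration.from_pretrained('facebook/bart-large-cnn').to(device)
model.eval()

def summarize_batch(texts, batch_size=8):
    summaries = []
    for start in range(0, len(texts), batch_size):
        chunk = texts[start:start + batch_size]
        # pad/truncate the whole batch so it can be generated in one call
        enc = tokenizer(chunk, return_tensors='pt', padding=True, truncation=True,
                        max_length=tokenizer.model_max_length)
        with torch.no_grad():
            ids = model.generate(enc['input_ids'].to(device),
                                 attention_mask=enc['attention_mask'].to(device),
                                 num_beams=4, max_length=100, early_stopping=True)
        summaries.extend(tokenizer.batch_decode(ids, skip_special_tokens=True))
    return summaries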
I'm trying to do a basic tweet sentiment analysis using word2vec and TF-IDF scores on a dataset consisting of 1.6M tweets, but my 6 GB Nvidia GeForce fails to handle it. Since this is my first machine learning practice project, I'm wondering what I'm doing wrong: the dataset is all text, so it shouldn't take this much RAM, yet my laptop freezes in the tweet2vec function or gives a MemoryError in the scaling part. Below is the part of my code where everything collapses.
The last thing is that I've tried with up to 1M tweets and it worked, so I'm curious what causes the problem.
# imports implied by the snippet
import numpy as np
from tqdm import tqdm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import scale

# --------------- calculating word weight for using later in word2vec model & bringing words together ---------------
def word_weight(data):
    vectorizer = TfidfVectorizer(sublinear_tf=True, use_idf=True)
    d = dict()
    for index in tqdm(data, total=len(data), desc='Assigning weight to words'):
        # --------- try/except catches the empty indexes ----------
        try:
            matrix = vectorizer.fit_transform([w for w in index])
            tfidf = dict(zip(vectorizer.get_feature_names(), vectorizer.idf_))
            d.update(tfidf)
        except ValueError:
            continue
    print("every word has weight now\n"
          "--------------------------------------")
    return d

# ------------------- bringing tokens with weight to recreate tweets ----------------
def tweet2vec(tokens, size, tfidf):
    count = 0
    for index in tqdm(tokens, total=len(tokens), desc='creating sentence vectors'):
        # ---------- size is the dimension of word2vec model (200) ---------------
        vec = np.zeros(size)
        for word in index:
            try:
                vec += model[word] * tfidf[word]
            except KeyError:
                continue
        tokens[count] = vec.tolist()
        count += 1
    print("tweet vectors are ready for scaling for ML algorithm\n"
          "-------------------------------------------------")
    return tokens

dataset = read_dataset('training.csv', ['target', 't_id', 'created_at', 'query', 'user', 'text'])
dataset = delete_unwanted_col(dataset, ['t_id', 'created_at', 'query', 'user'])
dataset_token = [pre_process(t) for t in tqdm(map(lambda t: t, dataset['text']),
                                              desc='cleaning text', total=len(dataset['text']))]
print('pre_process completed, list of tweet tokens is returned\n'
      '--------------------------------------------------------')
X = np.array(tweet2vec(dataset_token, 200, word_weight(dataset_token)))
print('scaling vectors ...')
X_scaled = scale(X)
print('features scaled!')
The data given to the word_weight function is a (1599999, 200)-shaped list, where each index consists of pre-processed tweet tokens.
I appreciate your time and answers in advance, and of course I'm glad to hear about better approaches for handling big datasets.
If I understood correctly, it works with 1M tweets, but fails with 1.6M tweets? So you know the code is correct.
If the GPU is running out of memory when you think it shouldn't, memory may still be held by a previous process. Use nvidia-smi to check which processes are using the GPU and how much memory. If (before you run your code) you spot Python processes holding a big chunk, it could be a crashed process, a Jupyter window still open, etc.
I find it useful to watch nvidia-smi (not sure if there is a Windows equivalent) to see how GPU memory changes as training progresses. Normally a chunk is reserved at the start and then stays fairly constant. If you see it rising linearly, something could be wrong with the code (are you re-loading the model on each iteration, or something like that?).
My problem was solved when I changed the code (the tweet2vec function) to this (w is the word weight):
def tweet2vec(tokens, size, tfidf):
    # ------------- size is the dimension of word2vec model (200) ---------------
    vec = np.zeros(size).reshape(1, size)
    count = 0
    for word in tokens:
        try:
            vec += model[word] * tfidf[word]
            count += 1
        except KeyError:
            continue
    if count != 0:
        vec /= count
    return vec

X = np.concatenate([tweet2vec(token, 200, w) for token in tqdm(map(lambda token: token, dataset_token),
                                                               desc='creating tweet vectors',
                                                               total=len(dataset_token))])
I have no idea why!!!!
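One plausible explanation (an assumption, not verified against this exact run): the original version stored 1.6M separate Python lists of boxed floats via vec.tolist(), whose per-object overhead dwarfs the raw payload, while the new version concatenates (1, 200) NumPy rows into one contiguous array. A rough back-of-the-envelope comparison:

import numpy as np

n_tweets, dim = 1_600_000, 200
contiguous = np.zeros((n_tweets, dim))   # single float64 block
print(contiguous.nbytes / 1e9)           # ~2.56 GB of payload
# A Python list of 200 distinct float objects costs several KB per tweet
# (list object + pointers + boxed floats), so 1.6M of them can easily exceed 8-10 GB.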
As the title mentions, how could I train a model to classify whether sentences like the following are logical or illogical?
“He has two legs”–logical
“He has six legs”–illogical
Solutions I have tried:
1: Train a classifier with a CNN
I have done this before, and it works very well if you have enough data. The problem is that I do not have a huge data set with "logical" or "illogical" labels for this case.
2: Use a language model
Train a language model (introduced by gluonnlp) on a data set like Wikipedia, and use it to estimate the probability of a sentence. If the probability of a sentence is high, mark it as logical, and vice versa. The problem is that the results are not good.
The way I estimate the probability:
def __predict(self):
    lines = self.__text_edit_input.toPlainText().split("\n")
    result = ""
    for line in lines:
        result += str(self.__sentence_prob(line, 10)) + "\n"
    self.__text_edit_output.setPlainText(result)

def __prepare_sentence(self, text, max_len):
    result = mx.nd.zeros([max_len, 1], dtype='float32')
    max_len = min(len(text), max_len)
    i = max(max_len - len(text), 0)
    j = 0
    for index in range(i, max_len):
        result[index][0] = self.__vocab[text[j]]
        j = j + 1
    return result

def __sentence_prob(self, text, max_len):
    hiddens = self.__model.begin_state(1, func=mx.nd.zeros, ctx=self.__context)
    tokens = self.__tokenizer(text)
    data = self.__prepare_sentence(tokens, max_len)
    output, _ = self.__model(data, hiddens)
    prob = 0
    for i in range(max_len):
        total_prob = mx.nd.softmax(output[i][0])
        prob += total_prob[self.__vocab[i]].asscalar()
    return prob / max_len
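As a side note, a more common way to score a sentence with a language model is to average log-probabilities (equivalently, compute perplexity) rather than averaging raw softmax probabilities. A minimal framework-agnostic sketch, assuming token_probs holds the model's probabilities for the actual next tokens:

import numpy as np

def sentence_score(token_probs):
    # average log-probability of the observed tokens; higher means "more expected"
    logp = np.log(np.clip(token_probs, 1e-12, 1.0))
    return float(np.mean(logp))

# np.exp(-sentence_score(token_probs)) would be the per-token perplexity of that sentence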
Possible issues with the language model approach:
1. I am not splitting the sentences the correct way (I am using jieba to split the Chinese sentences).
2. The vocabulary size is too small/big (tested 10000, 15000 and 30000).
3. The loss is too high (perplexity around 190) after 50 epochs?
4. The sentence length should be larger/smaller (tried 10, 20, 35).
5. The data I use does not meet my requirements (not every sentence is logical).
6. A language model is not appropriate for this task?
Any suggestions? Thanks.
Issue 6 ("A language model is not appropriate for this task?") is the main problem. Language models are built to make sense of input text with respect to language usage (syntax, semantics, etc.), not to draw logical conclusions. So you may not get good results even with a large amount of data or very deep models.
The problem you're trying to solve is extremely difficult. Something you may want to look at is Symbolic AI. There's a lot of ongoing research in this area.
I'm trying to implement this paper (TreeLSTM) using TensorFlow Fold. There is actually already a TreeLSTM example in TensorFlow Fold, but it is a BinaryTreeLSTM version; here's the tutorial: https://github.com/tensorflow/fold/blob/master/tensorflow_fold/g3doc/sentiment.ipynb
What I'm trying to do now is implement a real N-ary TreeLSTM, meaning that an LSTM node can be the parent of any number of children, not just two as in the above tutorial.
This is my attempt at folding the tree; it is a modified version of logits_and_state() from the above example:
def logits_and_state():
    """Creates a block that goes from tokens to (logits, state) tuples."""
    word2vec = (td.GetItem(0) >> td.InputTransform(lookup_word) >>
                td.Scalar('int32') >> word_embedding)

    children_num =
    children2vec_list = list()
    children2vec_list.append(embed_subtree())
    for i in range(children_num):
        children2vec_list.append(embed_subtree())
    children2vec = tuple(children2vec_list)

    # Trees are binary, so the tree layer takes two states as its input_state.
    zero_state = td.Zeros((tree_lstm.state_size,) * 2)
    # Input is a word vector.
    zero_inp = td.Zeros(word_embedding.output_type.shape[0])

    # word_case =
    word_case = td.AllOf(word2vec, zero_state)
    children_case = td.AllOf(zero_inp, children2vec)

    tree2vec = td.OneOf(lambda x: 1 if len(x) == 1 else 2, [(1, word_case), (2, children_case)])
    return tree2vec >> tree_lstm >> (output_layer, td.Identity())
The children_num is what I'm struggling with at the moment: I have no idea how to get that number out, even though I know the children can be obtained with td.GetItem(1), which produces a block containing an array of children. How do I get the actual length out of that block?
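A possible direction (a sketch only, not verified against the full sentiment example): the raw Python input is still available before it becomes tensors, so the child count can be read with td.InputTransform, and a variable number of child subtrees is usually handled by mapping embed_subtree over the children rather than branching on an explicit count:

# assumes the same embed_subtree() and input format as in the sentiment tutorial
num_children = td.GetItem(1) >> td.InputTransform(len) >> td.Scalar('int32')
child_states = td.GetItem(1) >> td.Map(embed_subtree())  # one (c, h) state per child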
You may say that I should try PyTorch or some other DL framework that also provides dynamic computation graphs, but in my case the requirement is strictly TensorFlow Fold.
I'm in a situation where I'm constantly hitting my memory limit (I have 20 GB of RAM). Somehow I managed to get the huge array into memory and carry on with my processing. Now the data needs to be saved to disk, and I need to save it in LevelDB format.
This is the code snippet responsible for saving the normalized data onto the disk:
print 'Outputting training data'
leveldb_file = dir_des + 'svhn_train_leveldb_normalized'
batch_size = size_train

# create the leveldb file
db = leveldb.LevelDB(leveldb_file)
batch = leveldb.WriteBatch()
datum = caffe_pb2.Datum()

for i in range(size_train):
    if i % 1000 == 0:
        print i

    # save in datum
    datum = caffe.io.array_to_datum(data_train[i], label_train[i])
    keystr = '{:0>5d}'.format(i)
    batch.Put(keystr, datum.SerializeToString())

    # write batch
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)
        batch = leveldb.WriteBatch()
        print (i + 1)

# write last batch
if (i + 1) % batch_size != 0:
    db.Write(batch, sync=True)
    print 'last batch'
    print (i + 1)
Now, my problem is that I hit my memory limit pretty much at the very end of saving to disk (at 495k out of the 604k items that need to be saved).
To get around this issue, I thought that after writing each batch I could release the corresponding memory from the numpy array (data_train), since it seems leveldb writes the data in a transactional manner and nothing is flushed to disk until all the data has been written!
My second thought is to somehow make the write non-transactional, so that when each batch is written using db.Write, it actually saves the content to disk.
I don't know if any of these ideas are applicable.
Try reducing batch_size to something smaller than the entire dataset, for example, 100000.
Converted to Community Wiki from #ren's comment.
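A minimal sketch of that suggestion, reusing the variables from the code above (the 100000 figure is just an example value):

batch_size = 100000  # instead of size_train, so batches are flushed periodically

batch = leveldb.WriteBatch()
for i in range(size_train):
    datum = caffe.io.array_to_datum(data_train[i], label_train[i])
    batch.Put('{:0>5d}'.format(i), datum.SerializeToString())
    if (i + 1) % batch_size == 0:
        db.Write(batch, sync=True)      # flush this chunk to disk
        batch = leveldb.WriteBatch()    # start a fresh batch so memory can be released

# flush whatever is left in the final partial batch
if size_train % batch_size != 0:
    db.Write(batch, sync=True)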