How do I allow a text input to a TensorFlow model? - python

I'm working on a custom text classification model in TensorFlow, and would now like to set it up with TensorFlow serving for production deployment. The model predicts on the basis of a text embedding that's computed via a separate model, and that model requires the raw text to be encoded as a vector.
I have this working in a somewhat disjointed way right now, where one service does all the text preprocessing and then computes the embedding, which is then sent to the text classifier as the embedded text vector. It would be nice if we could bundle this all into one TensorFlow serving model, especially the initial text preprocessing step.
And that's where I'm stuck. How do you construct a Tensor (or other TensorFlow primitive) that is a raw text input? And do you need to do anything special to earmark the lookup table for the token-vector component mapping so that it gets saved out as part of the model bundle?
For reference, here's a rough approximation of what I have now:
input = tf.placeholder(tf.float32, [None, 510], name='input')
# lots of steps omitted for brevity/clarity
outputs = tf.linalg.matmul(outputs, terminal_layer, transpose_b=True, name='output')
sess = tf.Session()
tf.saved_model.simple_save(sess,
                           'model.pb',
                           inputs={'input': input},
                           outputs={'output': outputs})

This turns out to be relatively straightforward, thanks to the tf.lookup.StaticVocabularyTable that's available as part of the TensorFlow standard library.
My model is using a bag of words approach, rather than preserving order, though that would be a pretty simple change to the code.
Assuming you have a list object that encodes your vocabulary (which I've called vocab) and a matrix of corresponding term/token embeddings you want to use (which I've called raw_term_embeddings, since I'm coercing that into a Tensor), the code will look something like this:
import numpy as np
import tensorflow as tf

initializer = tf.lookup.KeyValueTensorInitializer(vocab, np.arange(len(vocab)))
lut = tf.lookup.StaticVocabularyTable(initializer, 1)  # the 1 here is the number of out-of-vocabulary buckets
lut.initializer.run(session=sess)  # initializes the lookup table in the session
input = tf.placeholder(tf.string, [None, None], name='input')
ones_at = lut.lookup(input)
encoded_text = tf.math.reduce_sum(tf.one_hot(ones_at, tf.dtypes.cast(lut.size(), np.int32)), axis=0, keepdims=True)
# I didn't build an embedding for the out-of-vocabulary token
term_embeddings = tf.convert_to_tensor(np.vstack([raw_term_embeddings]), dtype=tf.float32)
embedded_text = tf.linalg.matmul(encoded_text, term_embeddings)
# then use embedded_text for the remainder of the model
The one small trick is making sure to pass legacy_init_op=tf.tables_initializer() to the save function, which hints TensorFlow Serving to initialize the lookup table for the text encoding when the model is loaded.
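For reference, a minimal sketch of what that save call might look like (the export directory name here is just a placeholder):

tf.saved_model.simple_save(sess,
                           'export/1',  # placeholder export directory
                           inputs={'input': input},
                           outputs={'output': outputs},
                           legacy_init_op=tf.tables_initializer())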

Related

sentence transformer using huggingface/transformers pre-trained model vs SentenceTransformer

This page has two scripts
When should one use the 1st method shown below vs the 2nd? Since nli-distilroberta-base-v2 was trained specifically for finding sentence embeddings, won't it always be better than the first method?
training_stsbenchmark.py -
import sys

from sentence_transformers import SentenceTransformer, LoggingHandler, losses, models, util

# You can specify any huggingface/transformers pre-trained model here, for example, bert-base-uncased, roberta-base, xlm-roberta-base
model_name = sys.argv[1] if len(sys.argv) > 1 else 'distilbert-base-uncased'

# Use Huggingface/transformers model (like BERT, RoBERTa, XLNet, XLM-R) for mapping tokens to embeddings
word_embedding_model = models.Transformer(model_name)

# Apply mean pooling to get one fixed sized sentence vector
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension(),
                               pooling_mode_mean_tokens=True,
                               pooling_mode_cls_token=False,
                               pooling_mode_max_tokens=False)

model = SentenceTransformer(modules=[word_embedding_model, pooling_model])
training_stsbenchmark_continue_training.py -
from sentence_transformers import SentenceTransformer, LoggingHandler, losses, util, InputExample
model_name = 'nli-distilroberta-base-v2'
model = SentenceTransformer(model_name)
You are comparing 2 different things:
training_stsbenchmark.py - This example shows how to create a SentenceTransformer model from scratch by using a pre-trained transformer model together with a pooling layer.
In other words, you are building your own SentenceTransformer model and training it on your own data, i.e. fine-tuning it.
training_stsbenchmark_continue_training.py - This example shows how to continue training on STS data for a previously created & trained SentenceTransformer model.
In that example, they load a model trained on NLI data.
So, to answer "won't that always be better than the first method?":
It depends on your final results. Try both methods and check for yourself which delivers better cross-validation results.
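If it helps to get a feel for the difference before committing to a full evaluation, here is a rough sketch (not part of the original scripts; the sentence pair is made up) that scores one pair with both setups:

from sentence_transformers import SentenceTransformer, models, util

sentences = ["A man is eating food.", "A man is eating a piece of bread."]

# Method 1: plain pre-trained transformer + mean pooling, no sentence-level fine-tuning yet.
word_embedding_model = models.Transformer('distilbert-base-uncased')
pooling_model = models.Pooling(word_embedding_model.get_word_embedding_dimension())
method_1 = SentenceTransformer(modules=[word_embedding_model, pooling_model])

# Method 2: a model already fine-tuned on NLI data.
method_2 = SentenceTransformer('nli-distilroberta-base-v2')

for name, model in [('method 1', method_1), ('method 2', method_2)]:
    emb = model.encode(sentences, convert_to_tensor=True)
    print(name, float(util.pytorch_cos_sim(emb[0], emb[1])))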

How do I preprocess and tokenize a TensorFlow CsvDataset inside the map method?

I made a TensorFlow CsvDataset, and I'm trying to tokenize the data as such:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

os.chdir('/home/nicolas/Documents/Datasets')
fname = 'rotten_tomatoes_reviews.csv'

def preprocess(target, inputs):
    tok = Tokenizer(num_words=5_000, lower=True)
    tok.fit_on_texts(inputs)
    vectors = tok.texts_to_sequences(inputs)
    return vectors, target

dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True).map(preprocess)
Running this gives the following error:
ValueError: len requires a non-scalar tensor, got one of shape Tensor("Shape:0", shape=(0,), dtype=int32)
What I've tried: just about anything in the realm of possibilities. Note that everything runs if I remove the preprocessing step.
What the data looks like:
(<tf.Tensor: shape=(), dtype=int32, numpy=1>,
<tf.Tensor: shape=(), dtype=string, numpy=b" Some movie critic review...">)
First of all, let's find out the problems in your code:
The first problem, which is also the reason behind the given error, is that the fit_on_texts method accepts a list of texts, not a single text string. Therefore, it should be: tok.fit_on_texts([inputs]).
After fixing that and running the code again, you would get another error: AttributeError: 'Tensor' object has no attribute 'lower'. This is due to the fact that the elements in the dataset are Tensor objects, and the map function should be able to handle them; however, the Tokenizer class is not designed to handle Tensor objects (there is a fix for this problem, but I won't address it now because of the next problem).
The biggest problem is that each time the map function, i.e. preprocess, is called, a new instance of the Tokenizer class is created and fit on a single text document.
Update: As @Princy correctly pointed out in the comments section, the fit_on_texts method actually performs a partial fit (i.e. it updates or augments the internal vocabulary stats instead of starting from scratch). So if we create the Tokenizer instance outside the preprocess function, and assuming the vocabulary set is known beforehand (otherwise, you can't filter the most frequent words in a partial-fit scheme unless you have or build the vocabulary set first), then it would be possible to use this approach (i.e. based on the Tokenizer class) after applying the above fixes as well. However, personally, I prefer the solution below.
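For completeness, a rough sketch of that Tokenizer-based route (not the preferred solution; it assumes dataset is the raw CsvDataset from the question, without the .map call, and that it can be iterated eagerly up front to build the vocabulary) might look like this:

import tensorflow as tf
from tensorflow.keras.preprocessing.text import Tokenizer

# Fit one Tokenizer up front, then only convert texts to sequences inside the map function.
tok = Tokenizer(num_words=5_000, lower=True)
tok.fit_on_texts([text.numpy().decode('utf-8') for _, text in dataset])

def to_sequences(target, text):
    vector = tok.texts_to_sequences([text.numpy().decode('utf-8')])[0]
    return tf.constant(vector, dtype=tf.int32), target

def to_sequences_pyfn(target, text):
    vector, target = tf.py_function(to_sequences, inp=[target, text],
                                    Tout=(tf.int32, tf.int32))
    vector.set_shape([None])
    target.set_shape([])
    return vector, target

dataset = dataset.map(to_sequences_pyfn)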
So, what should we do? As mentioned above, in almost all of the models which deal with text data, we first need to convert the texts into numerical features, i.e. encode them. For performing encoding, first we need a vocabulary set or a dictionary of tokens. Therefore, the steps we should take are as follows:
If there is a pre-built vocabulary available, then skip to the next step. Otherwise, tokenize all the text data first and build the vocabulary.
Encode the text data using the vocabulary set.
For performing the first step, we use tfds.features.text.Tokenizer to tokenize text data and build the vocabulary by iterating over the dataset.
For the second step, we use tfds.features.text.TokenTextEncoder to encode the text data using the vocabulary set built in previous step. Note that, for this step we are using map method; however, since map only functions in graph mode, we have wrapped our encode function in tf.py_function so that it could be used with map.
Here is the code (please read the comments in the code for additional points; I have not included them in the answer because they are not directly relevant, but they are useful and practical):
import tensorflow as tf
import tensorflow_datasets as tfds
from collections import Counter

fname = "rotten_tomatoes_reviews.csv"
dataset = tf.data.experimental.CsvDataset(filenames=fname,
                                          record_defaults=[tf.int32, tf.string],
                                          header=True)

# Create a tokenizer instance to tokenize text data.
tokenizer = tfds.features.text.Tokenizer()

# Find unique tokens in the dataset.
lowercase = True  # set this to `False` if case-sensitivity is important.
vocabulary = Counter()
for _, text in dataset:
    if lowercase:
        text = tf.strings.lower(text)
    tokens = tokenizer.tokenize(text.numpy())
    vocabulary.update(tokens)

# Select the most common tokens as final vocabulary set.
# Note: if you want all the tokens to be included,
# set `vocab_size = len(vocabulary)` instead.
vocab_size = 5000
vocabulary, _ = zip(*vocabulary.most_common(vocab_size))

# Create an encoder instance given our vocabulary set.
encoder = tfds.features.text.TokenTextEncoder(vocabulary,
                                              lowercase=lowercase,
                                              tokenizer=tokenizer)

# Set this to a non-zero integer if you want the texts
# to be truncated when they have more than `max_len` tokens.
max_len = None

def encode(target, text):
    text_encoded = encoder.encode(text.numpy())
    if max_len:
        text_encoded = text_encoded[:max_len]
    return text_encoded, target

# Wrap `encode` function inside `tf.py_function` so that
# it could be used with `map` method.
def encode_pyfn(target, text):
    text_encoded, target = tf.py_function(encode,
                                          inp=[target, text],
                                          Tout=(tf.int32, tf.int32))
    # (optional) Set the shapes for efficiency.
    text_encoded.set_shape([None])
    target.set_shape([])
    return text_encoded, target

# Apply encoding and then padding.
# Note: if you want the sequences in all the batches
# to have the same length, set `padded_shapes` argument accordingly.
dataset = dataset.map(encode_pyfn).padded_batch(batch_size=3,
                                                padded_shapes=([None,], []))

# Important Note: probably this dataset would be used as input to a model
# which uses an Embedding layer. Therefore, don't forget that you
# should set the vocabulary size for this layer properly, i.e. the
# current value of `vocab_size` does not include the padding (added
# by `padded_batch` method) and also the OOV token (added by encoder).
Side note for future readers: notice that the order of arguments, i.e. target, text, and the data types are based on the OP's dataset. Adapt as needed based on your own dataset/task (although, at the end, i.e. return text_encoded, target, we adjusted this to make it compatible with expected format of fit method).

How to predict with Word2Vec?

I'm doing Arabic dialect text classification and I've used Word2Vec to train a model. This is what I have so far:
import logging
import gensim

def read_input(input_file):
    with open(input_file, 'rb') as f:
        for i, line in enumerate(f):
            yield gensim.utils.simple_preprocess(line)

documents = list(read_input(data_file))
logging.info("Done reading data file")

model = gensim.models.Word2Vec(documents, size=150, window=10, min_count=2, workers=10)
model.train(documents, total_examples=len(documents), epochs=10)
What do I do now to predict which of my 5 dialects a new text belongs to?
Also, I looked around and found this code:
import numpy
from keras.preprocessing import text, sequence

# load the pre-trained word-embedding vectors
embeddings_index = {}
for i, line in enumerate(open('w2vmodel.vec', encoding='utf-8')):
    values = line.split()
    embeddings_index[values[0]] = numpy.asarray(values[1:], dtype='float32')

# create a tokenizer
token = text.Tokenizer()
token.fit_on_texts(trainDF['text'])
word_index = token.word_index

# convert text to sequence of tokens and pad them to ensure equal length vectors
train_seq_x = sequence.pad_sequences(token.texts_to_sequences(train_x), maxlen=70)
valid_seq_x = sequence.pad_sequences(token.texts_to_sequences(valid_x), maxlen=70)

# create token-embedding mapping
embedding_matrix = numpy.zeros((len(word_index) + 1, 300))
for word, i in word_index.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embedding_matrix[i] = embedding_vector
But it gives me this error when I run it and load my trained word2vec model:
ValueError: could not convert string to float: '\x00\x00\x00callbacksq\x04)X\x04\x00\x00\x00loadq\x05cgensim.utils'
Note:
Actually, there's other code that I didn't post here. I wanted to use word2vec with a neural network; I have the code for the neural network, but I don't know how to feed the features I got from word2vec into the network as inputs, with the labels as outputs. Is it possible to connect word2vec to a deep neural network, and how?
Word2vec alone isn't something that would classify texts into dialects, so you haven't sketched out a plausible full approach here.
What made you think word2vec could or should be used as part of this task? (If there's a motivating theory-of-operation, or some other published precedent for this approach that gave you the idea, that could help guide what else should be done.)
What is your training data like?
If it's a large number of example texts, with accurate labels as to which dialect each text is from, have you tried classifiers working on simple bag-of-words or bag-of-character-n-grams representations of the texts? (By discovering relationships between the dialects, and the exact words, groups-of-words, or word-fragments in the texts, such a classifier might work far better than word2vec. Word2vec ignores word fragments, and drives the vectors of similar-meaning words close together, obscuring small differences in word-spelling or word-choice.)
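As a concrete starting point for that kind of baseline, here is a minimal sketch using scikit-learn (assumed to be available; texts and labels are placeholder names for your documents and their dialect labels):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Character n-grams often capture dialect-specific spellings well;
# switch to analyzer='word' for a plain bag-of-words baseline.
clf = make_pipeline(
    TfidfVectorizer(analyzer='char_wb', ngram_range=(2, 4)),
    LogisticRegression(max_iter=1000),
)
print(cross_val_score(clf, texts, labels, cv=5).mean())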
You might also try:
FastText in classification mode, in which the word-vectors (and optionally, word-fragment-vectors) are trained to specifically be good at classifying amongst a set of known labels, rather than just at predicting nearby words as in classic word2vec (see the sketch after this list)
the technique of using multiple word2vec-models (one per dialect) as classifiers, as demonstrated in a notebook included with gensim: Deep Inverse Regression with Yelp Reviews
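A minimal sketch of the FastText route, assuming the fasttext Python package is installed and a training file exists in its __label__ format (the file name is a placeholder):

import fasttext

# Each line of the training file looks like: "__label__<dialect> <document text>"
model = fasttext.train_supervised(input='dialects.train.txt', wordNgrams=2)
print(model.predict("some new text to classify"))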
(Separately regarding your shown code:
you don't need to call train(documents,...) if you already supplied documents to the class-instantiation call – that will have already done training, as enabling INFO logging and watching the logs should make clear
you shouldn't need to use such code that tries to open/read your w2vmodel.vec file directly, because gensim includes methods for reading such files directly, such as .load_word2vec_format() or (if a full model was natively .save()d from gensim) just .load(); see the sketch below.
)
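For example, a minimal sketch of those loading routes (the .model file name is a placeholder for a natively saved gensim model):

from gensim.models import KeyedVectors, Word2Vec

# If w2vmodel.vec is in the plain word2vec text format:
word_vectors = KeyedVectors.load_word2vec_format('w2vmodel.vec', binary=False)

# If instead the model was saved natively from gensim with model.save(...):
word_vectors = Word2Vec.load('w2vmodel.model').wv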

BucketIterator not returning batches of correct size

I'm implementing a simple LSTM language model in PyTorch, and wanted to check out the BucketIterator that is provided by torchtext.
It turns out that the batch that is returned has the size of my entire corpus, so I must be doing something wrong during its initialisation.
I've already got the BPTTIterator working, but as I want to be able to train on batches of complete sentences as well, I thought the BucketIterator should be the way to go.
I use the following setup, with my corpus a simple txt file containing sentences at each line.
field = Field(use_vocab=True, batch_first=True)
corpus = PennTreebank('project_2_data/train_lines.txt', field)
field.build_vocab(corpus)

iterator = BucketIterator(corpus,
                          batch_size=64,
                          repeat=False,
                          sort_key=lambda x: len(x.text),
                          sort_within_batch=True)
I expect a batch from this iterator to have the shape (batch_size, max_len), but it appends the entire corpus into 1 tensor of shape (1, corpus_size).
What am I missing in my setup?
Edit: it seems the PennTreebank object is not compatible with a BucketIterator (it contains only 1 Example as noted here http://mlexplained.com/2018/02/15/language-modeling-tutorial-in-torchtext-practical-torchtext-part-2/). Using a TabularDataset with only 1 Field got it working.
If someone has an idea how language modelling with padded sentence batches can be done in torchtext in a more elegant manner I'd love to hear it!
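For anyone landing here, a rough sketch of that TabularDataset route (using torchtext's classic data API; the path follows the question, and the default whitespace tokenizer is assumed):

from torchtext.data import Field, TabularDataset, BucketIterator

field = Field(use_vocab=True, batch_first=True)
# With one sentence per line and no tab characters, 'tsv' treats each line as a single text column.
corpus = TabularDataset('project_2_data/train_lines.txt',
                        format='tsv',
                        fields=[('text', field)])
field.build_vocab(corpus)

iterator = BucketIterator(corpus,
                          batch_size=64,
                          sort_key=lambda x: len(x.text),
                          sort_within_batch=True)

batch = next(iter(iterator))
print(batch.text.shape)  # expected: (batch_size, longest_sentence_in_batch)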

Is evaluation (detection) on multiple GPUs possible when the model was trained on a single GPU?

We are trying to recognize numbers on number plates, given pictures of them.
We have trained the model on a single GPU.
Is it possible to evaluate the data on multiple GPUs without modifying the model?
We use the TensorFlow library for training and evaluation.
It is really difficult to help you without any code.
I can give you some insight.
To go from a single GPU to multiple GPUs, you have to:
1) Split your data according to the number of GPUs (be careful with the sizes; each split must be a valid matrix)
2) Build your graph on each GPU with a loop looking like:
with tf.variable_scope(tf.get_variable_scope()) as outer_scope:
    for i in range(nb_of_GPU):
        name = 'tower_{}'.format(i)
        with tf.device("/gpu:" + str(i)), tf.name_scope(name):
            logits = self.build_graph(splitted_inputs[i])
            batch_loss = self.compute_loss(logits, splitted_labels[i])
            batch_acc = self.compute_acc(logits, splitted_labels[i])
            losses.append(batch_loss)
            accs.append(batch_acc)
            tf.summary.scalar("loss", batch_loss)
            gradient = optimizer.compute_gradients(batch_loss)
            tower_grads.append(gradient)
        outer_scope.reuse_variables()

average_grads = average_gradients(tower_grads)
train_op = optimizer.apply_gradients(average_grads)
You also have to change the size of the placeholder for inputs.
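For step 1, a minimal sketch of the splitting might look like this (the shapes and values are made up placeholders; the batch size must be divisible by the number of GPUs):

import tensorflow as tf

nb_of_GPU = 2      # example value; must match the loop above
batch_size = 64    # must be divisible by nb_of_GPU

inputs = tf.placeholder(tf.float32, [batch_size, 224, 224, 3])  # example image shape
labels = tf.placeholder(tf.int32, [batch_size])

# One slice per GPU along the batch dimension.
splitted_inputs = tf.split(inputs, nb_of_GPU, axis=0)
splitted_labels = tf.split(labels, nb_of_GPU, axis=0)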
So the answer to your question, "Is it possible to evaluate the data on multiple GPUs without modifying the model?", is no.
You will have to change some things.
