In Keras, I can have the following code:
docs
Out[9]:
['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
labels = array([1,1,1,1,1,0,0,0,0,0])
voc_size = 50
encoded = [one_hot(d, voc_size) for d in docs]
max_length = 4
padded_docs = pad_sequences(encoded, maxlen=max_length, padding='post')
My understanding is that the one_hot encoding already creates vectors of equal length for each doc, based on the vocabulary size. So why does each doc need to be padded again?
EDIT: another example for more clarification:
A one-hot encoding is a representation of categorical variables (e.g. cat, dog, rat) as binary vectors (e.g. [1,0,0], [0,1,0], [0,0,1]).
So in this case, cat, dog and rat are encoded as vectors of equal length. How is this different from the example above?
TL;DR: one_hot guarantees that each index comes from a fixed range, not that the resulting list has a fixed length.
In order to understand the issue, one needs to understand what the one_hot function actually does. It transforms a document into a sequence of integer indices whose length is roughly the number of words (tokens) in the document. E.g.:
'one hot encoding' -> [0, 2, 17]
where each index is the index of a word in the vocabulary (e.g. one has index 0). This means that when you apply one_hot to a sequence of texts (as in the piece of code you have provided), you get a list of lists of indices, where each list might have a different length. This is a problem for Keras and NumPy, which expect the list of lists to be in array-like form, meaning that each sublist should have the same fixed length.
That is what the pad_sequences function does: it makes each of the sublists have a fixed length.
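A minimal sketch that makes the difference visible (the exact indices will differ on your machine, since one_hot hashes words):
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences

docs = ['Well done!', 'Could have done better.']
encoded = [one_hot(d, 50) for d in docs]
# encoded might look like [[14, 31], [7, 22, 31, 45]] -- one index per word, so the lengths differ
padded = pad_sequences(encoded, maxlen=4, padding='post')
# padded is a (2, 4) numpy array, e.g.
# [[14 31  0  0]
#  [ 7 22 31 45]]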
I am interested in how to get the similarity of word embeddings for the same word appearing in different sentences from a BERT model (that is, the same word can have different meanings in different contexts).
For example:
sent1 = 'I like living in New York.'
sent2 = 'New York is a prosperous city.'
I want to get the value of cos(New York, New York) between sent1 and sent2: even though the phrase 'New York' is the same, it appears in different sentences. I got some intuition from https://discuss.huggingface.co/t/generate-raw-word-embeddings-using-transformer-models-like-bert-for-downstream-process/2958/2
But I still do not know which layer's embedding I need to extract and how to calculate the cosine similarity for my example above.
Thanks in advance for any suggestions!
Okay let's do this.
First you need to understand that BERT-base gives you 13 hidden states. The first one is basically just the output of the embedding layer that BERT gets passed during the initial training. You can use it, but you probably don't want to, since it is essentially a static embedding and you're after a dynamic (contextual) embedding. For simplicity I'm going to use only the last hidden layer of BERT.
Here you're using two words: "New" and "York". You could treat this as one word during preprocessing and combine it into "New-York" or something if you really wanted. In this case I'm going to treat it as two separate words and average the embeddings that BERT produces.
This can be described in a few steps:
Tokenize the inputs
Determine where the tokenizer has word_ids for New and York (suuuuper important)
Pass through BERT
Average
Cosine similarity
First, what you need to import (numpy and torch are used further down as well):
import numpy as np
import torch
from transformers import AutoTokenizer, AutoModel
Now we can create our tokenizer and our model:
tokenizer = AutoTokenizer.from_pretrained('bert-base-cased')
model = AutoModel.from_pretrained('bert-base-cased', output_hidden_states=True).eval()
Make sure to use the model in evaluation mode unless you're trying to fine tune!
Next we need to tokenize (step 1):
tok1 = tokenizer(sent1, return_tensors='pt')
tok2 = tokenizer(sent2, return_tensors='pt')
Step 2. Determine where the indices of the words match:
# This is where "New" and "York" can be found in sent1 and sent2, respectively
sent1_idxs = [4, 5]
sent2_idxs = [0, 1]
tok1_ids = [np.where(np.array(tok1.word_ids()) == idx) for idx in sent1_idxs]
tok2_ids = [np.where(np.array(tok2.word_ids()) == idx) for idx in sent2_idxs]
The above code checks where the word_ids() produced by the tokenizer overlap the word indices from the original sentence. This is necessary because the tokenizer splits rare words. So if you have something like "aardvark", when you tokenize it and look at it you actually get this:
In [90]: tokenizer.convert_ids_to_tokens( tokenizer('aardvark').input_ids)
Out[90]: ['[CLS]', 'a', '##ard', '##var', '##k', '[SEP]']
In [91]: tokenizer('aardvark').word_ids()
Out[91]: [None, 0, 0, 0, 0, None]
Step 3. Pass through BERT
Now we grab the embeddings that BERT produces across the token ids that we've produced:
with torch.no_grad():
    out1 = model(**tok1)
    out2 = model(**tok2)
# Only grab the last hidden state
states1 = out1.hidden_states[-1].squeeze()
states2 = out2.hidden_states[-1].squeeze()
# Select the tokens that we're after corresponding to "New" and "York"
embs1 = states1[[tup[0][0] for tup in tok1_ids]]
embs2 = states2[[tup[0][0] for tup in tok2_ids]]
Now you will have two embeddings. Each has shape (2, 768). The first dimension is 2 because there are two words we're looking at: "New" and "York". The second dimension is the embedding size of BERT.
Step 4. Average
Okay, so this isn't necessarily what you want to do; it depends on how you treat these embeddings. What we have are two (2, 768)-shaped embeddings. You can either compare New to New and York to York, or you can combine New York into an average. I'll do the latter, but you can easily do the other one if it works better for your task.
avg1 = embs1.mean(axis=0)
avg2 = embs2.mean(axis=0)
Step 5. Cosine sim
Cosine similarity is pretty easy using torch:
torch.cosine_similarity(avg1.reshape(1,-1), avg2.reshape(1,-1))
# tensor([0.6440])
This is good! They point in the same direction. They're not exactly 1 but that can be improved in several ways.
You can fine tune on a training set
You can experiment with averaging several layers rather than just the last hidden layer like I did (a rough sketch of this is shown below)
You can try to be creative in combining New and York. I took the average but maybe there's a better way for your exact needs.
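For instance, a minimal sketch of the second idea, averaging the last four hidden states instead of just the last one (whether four is the right number is an assumption you would need to validate for your task):
# Stack the last four hidden states, average them, then select the "New"/"York" tokens as before
states1_avg = torch.stack(out1.hidden_states[-4:]).mean(dim=0).squeeze()
states2_avg = torch.stack(out2.hidden_states[-4:]).mean(dim=0).squeeze()
embs1 = states1_avg[[tup[0][0] for tup in tok1_ids]]
embs2 = states2_avg[[tup[0][0] for tup in tok2_ids]]
# then average and compare exactly as in steps 4 and 5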
I have a dataframe like this with columns - ["A","B","C","D"]
A --> Categorical feature with 2 values, say Yes or No
B --> Categorical feature with 10 unique values, like "AAXX-10","BBYY-20" etc
C --> A date-time field
D --> Text-based column, describing whether a person was interested in the movie or not, based on a short text (basically their comments after coming out of the theatre)
Sample df
A | B | C | D
------------------------------------------------------------------------------
Yes|AAXX-10|8/10/2018|"Yes I liked the movie, it was great"
------------------------------------------------------------------------------
Yes|BBYY-20|8/10/2017|"I liked the performance of the cast in the movie but as a whole, It was just average"
------------------------------------------------------------------------------
No |AANN-88|8/10/2013|"Never seen a ridiculous movie like this"
I have two questions here -
I want to make a fifth column, say "Interest", based on the column "D" which would have 4 categories ["Liked", "Didn't like", "Average", "Cannot comment"]. How could I do that?
--On the basis of "D", the "Interest" column should have ["Liked", "Average", "Didn't like"]--.
Since most of the columns are categorical or date-time, and one column is text, how should I go about feature engineering in this scenario so the data can be fed to KMeans?
How do I get features out of column "D", which is a text feature?
Should I convert column A to binary 0s and 1s?
Should I do one hot encoding/label encoding to the second column?
How to make use of the date-time feature in the clustering?
Things I have tried -
I preprocessed and feature-engineered column A (converted to binary), B (label encoding), C (converted dates to year and month features) and D (ignored this feature, as I did not know how I could use it).
Based on this, I got clusters using kmeans.labels_, but those clusters are numeric 1,2,3,4.
How can I actually map those to ["Liked", "Didn't like", "Average", "Cannot comment"]?
How can I use the text column efficiently to make the clusters?
Just short answers to my query would do. I don't need any implementation.
To answer the second question first:
A: can be turned to binary
B: what information can you get from a list of unique strings by encoding them? After encoding you are left with either the identity matrix (one-hot) or a list of monotonically increasing ints (label encoding)
C: you might be better off transforming it to a Unix epoch timestamp if the date range allows it; this lets you calculate distances properly (a short sketch for A and C is at the end of this answer).
D: This is the bread and butter of the project. The processing is quite involved, but here is a short summary:
A basic recipe includes, but is not limited to, the following (a code sketch of these steps follows the tf-idf example below):
Text normalization:
convert to lower or upper case
converting numbers into words or removing numbers,
removing punctuations, accent marks and other diacritics,
removing leading or trailing white spaces
Corpus tokenization (split each row into a list of single words)
Remove stop words (a, the, ...); they carry very little information and are very common
Stemming or lemmatization. These reduce the words to a base form. Stemming is quite crude and can produce invalid words, but is fast. Lemmatization produces valid words based on a dictionary, but is slower
... and many more things
n. Feature extraction with tf-idf: this is a sort of encoding that gives each word an importance score. It works by increasing the weight of a word when it appears many times in a document, and lowering its weight when it is common across many documents.
Example for tf-idf:
from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
'This is the first document.',
'This document is the second document.',
'And this is the third one.',
'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)
print(vectorizer.get_feature_names())
print(X.shape)
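And here is a rough sketch of the normalization, tokenization, stop-word removal and lemmatization steps listed above, using NLTK (only one possible implementation; it assumes the NLTK 'punkt', 'stopwords' and 'wordnet' resources have been downloaded and that df['D'] is the text column from the question):
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # lower-case, strip leading/trailing whitespace and remove punctuation
    text = text.lower().strip()
    text = text.translate(str.maketrans('', '', string.punctuation))
    # tokenize, drop stop words, lemmatize to a base form
    tokens = [lemmatizer.lemmatize(tok) for tok in word_tokenize(text) if tok not in stop_words]
    return ' '.join(tokens)

clean_corpus = [preprocess(doc) for doc in df['D']]
X_text = TfidfVectorizer().fit_transform(clean_corpus)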
After these n steps you have a numeric representation of column D; clustering on it together with the other features is what gives you the answer to your first question.
You can find code on how to do all this stuff here (with NLTK). You might not be able to use NLTK, however, in which case you will have a harder time doing all these steps.
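For completeness, a minimal sketch of the preprocessing for A and C mentioned above (it assumes the DataFrame is called df and that C parses as a date):
import pandas as pd

df['A'] = (df['A'] == 'Yes').astype(int)                       # Yes/No -> 1/0
df['C'] = pd.to_datetime(df['C']).map(pd.Timestamp.timestamp)  # date -> Unix epoch seconds
Scaling the epoch values (e.g. with StandardScaler) before KMeans is usually a good idea, since their magnitude dwarfs the binary features.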
I am building a Tensorflow model to perform inference on text phrases.
For the sake of simplicity, assume I need a classifier with a fixed number of output classes but variable-length text as input. In other words, my mini-batch would be a sequence of phrases, but not all phrases have the same length.
data = ['hello',
'my name is Mark',
'What is your name?']
My first preprocessing step was to build a vocabulary of all possible words in the corpus and map each word to its integer word id. The input becomes:
data = [[1],
        [2, 3, 4, 5],
        [6, 4, 7, 3]]
What's the best way to handle this kind of input? Can tf.placeholder() handle variable-size input within the same batch of data?
Or should I pad all strings such that they all have the same length, equal to the length of the longest string, using some placeholder for the missing words? This seems very memory inefficient if some strings are much longer than most of the others.
-- EDIT --
Here is a concrete example.
When I know the size of my datapoints (and all the datapoint have the same length, eg. 3) I normally use something like:
input = tf.placeholder(tf.int32, shape=(None, 3))
with tf.Session() as sess:
    print(sess.run([...], feed_dict={input: [[1, 2, 3], [1, 2, 3]]}))
where the first dimension of the placeholder is the minibatch size.
What if the input sequences are words in sentences of different length?
feed_dict={input:[[1, 2, 3], [1]]}
The other two answers are correct, but low on details. I was just looking at how to do this myself.
There is machinery in TensorFlow to do all of this (for some parts it may be overkill).
Starting from a string tensor (shape [3]):
import tensorflow as tf
lines = tf.constant([
'Hello',
'my name is also Mark',
'Are there any other Marks here ?'])
vocabulary = ['Hello', 'my', 'name', 'is', 'also', 'Mark', 'Are', 'there', 'any', 'other', 'Marks', 'here', '?']
The first thing to do is split this into words (note the space before the question mark.)
words = tf.string_split(lines," ")
words will now be a sparse tensor of shape [3,7], where the two dimensions of the indices are [line number, position]. This is represented as:
indices values
0 0 'Hello'
1 0 'my'
1 1 'name'
1 2 'is'
...
Now you can do a word lookup:
table = tf.contrib.lookup.index_table_from_tensor(vocabulary)
word_indices = table.lookup(words)
This returns a sparse tensor with the words replaced by their vocabulary indices.
Now you can read out the sequence lengths by looking at the maximum position on each line :
line_number = word_indices.indices[:,0]
line_position = word_indices.indices[:,1]
lengths = tf.segment_max(data=line_position,
                         segment_ids=line_number) + 1
So if you're processing variable-length sequences you probably want to put them into an LSTM... so let's use a word embedding for the input (it requires a dense input):
EMBEDDING_DIM = 100
dense_word_indices = tf.sparse_tensor_to_dense(word_indices)
e_layer = tf.contrib.keras.layers.Embedding(len(vocabulary), EMBEDDING_DIM)
embedded = e_layer(dense_word_indices)
Now embedded will have a shape of [3,7,100], [lines, words, embedding_dim].
Then a simple lstm can be built:
LSTM_SIZE = 50
lstm = tf.nn.rnn_cell.BasicLSTMCell(LSTM_SIZE)
And run it across the sequence, handling the padding:
outputs, final_state = tf.nn.dynamic_rnn(
    cell=lstm,
    inputs=embedded,
    sequence_length=lengths,
    dtype=tf.float32)
Now outputs has a shape of [3,7,50], or [line,word,lstm_size]. If you want to grab the state at the last word of each line you can use the (hidden! undocumented!) select_last_activations function:
from tensorflow.contrib.learn.python.learn.estimators.rnn_common import select_last_activations
final_output = select_last_activations(outputs,tf.cast(lengths,tf.int32))
That does all the index shuffling to select the output from the last timestep. This gives a size of [3,50] or [line, lstm_size]
init_t = tf.tables_initializer()
init = tf.global_variables_initializer()
with tf.Session() as sess:
    init_t.run()
    init.run()
    print(final_output.eval().shape)
I haven't worked out the details yet but I think this could probably all be replaced by a single tf.contrib.learn.DynamicRnnEstimator.
How about this? (I didn't implement it, but maybe the idea will work.)
This method is based on BOW representation.
Get your data as tf.string
Split it using tf.string_split
Find indexes of your words using tf.contrib.lookup.string_to_index_table_from_file or tf.contrib.lookup.string_to_index_table_from_tensor. Length of this tensor can vary.
Find embeddings of your indexes.
word_embeddings = tf.get_variable("word_embeddings",
                                  [vocabulary_size, embedding_size])
embedded_word_ids = tf.nn.embedding_lookup(word_embeddings, word_ids)
Sum up the embeddings, and you will get a tensor of fixed length (= embedding size). Maybe you can choose a method other than sum (mean, max, or something else); see the sketch below.
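A minimal sketch of that last step (TF 1.x style, continuing from the embedded_word_ids above):
# embedded_word_ids has shape [num_words_in_sentence, embedding_size]; the first
# dimension varies per sentence, so reducing over it gives a fixed-length vector.
sentence_vector = tf.reduce_sum(embedded_word_ids, axis=0)
# or, for an average instead of a sum:
# sentence_vector = tf.reduce_mean(embedded_word_ids, axis=0)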
Maybe it’s too late :) Good luck.
I was building a sequence to sequence translator the other day. What I decided to do was make it for a fixed length of 32 words (which was a bit above the average sentence length), although you can make it as long as you want. I then added a NULL word to the dictionary and padded all my sentence vectors with it. That way I could tell the model where the end of my sequence was, and the model would just output NULL at the end of its output. For instance, take the expression "Hi what is your name?" This would become "Hi what is your name? NULL NULL NULL NULL ... NULL". It worked pretty well, but your loss and accuracy during training will appear a bit better than they actually are, since the model usually gets the NULLs right, which counts towards the cost.
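A minimal sketch of that padding in plain Python (the NULL id of 0 and the length of 32 are just assumptions; use whatever id you reserved for NULL):
MAX_LEN = 32
NULL_ID = 0  # the id you added to the dictionary for the NULL word

def pad(seq, max_len=MAX_LEN, null_id=NULL_ID):
    # truncate overly long sequences, otherwise append NULLs up to max_len
    return seq[:max_len] + [null_id] * (max_len - len(seq))

padded_data = [pad(s) for s in [[1], [2, 3, 4, 5], [6, 4, 7, 3]]]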
There is another approach called masking. This too allows you to build a model for a fixed-length sequence, but only evaluate the cost up to the end of a shorter sequence. You could search for the first instance of NULL in the output sequence (or the expected output, whichever is longer) and only evaluate the cost up to that point. Also, I think some TensorFlow functions like tf.dynamic_rnn support masking, which may be more memory efficient. I am not sure, since I have only tried the first approach of padding.
Finally, I think in the TensorFlow Seq2Seq example they use buckets for different-sized sequences. This would probably solve your memory issue. I think you could share the variables between the different-sized models.
So here is what I did (not sure if it's 100% the right way, to be honest):
In your vocab dict, where each key is a number pointing to one particular word, add another key, say K, which points to "<PAD>" (or any other representation you want to use for padding)
Now your placeholder for input would look something like this:
x_batch = tf.placeholder(tf.int32, shape=(batch_size, None))
where None represents the largest phrase/sentence/record in your mini batch.
Another small trick I used was to store the length of each phrase in my mini batch. For example:
If my input was: x_batch = [[1], [1,2,3], [4,5]]
then I store: len_batch = [1, 3, 2]
Later I use this len_batch and the max size of a phrase(l_max) in my minibatch to create a binary mask. Now l_max=3 from above, so my mask would look something like this:
mask = [
[1, 0, 0],
[1, 1, 1],
[1, 1, 0]
]
Now if you multiply this with your loss you would basically eliminate all loss introduced as a result of padding.
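A minimal sketch of building that mask with tf.sequence_mask (TF 1.x; per_token_loss stands for whatever per-token loss tensor of shape (batch_size, l_max) your model produces):
len_batch_ph = tf.placeholder(tf.int32, shape=(batch_size,))
# mask has shape (batch_size, l_max): 1.0 for real tokens, 0.0 for padding
mask = tf.sequence_mask(len_batch_ph, maxlen=tf.reduce_max(len_batch_ph), dtype=tf.float32)
masked_loss = tf.reduce_sum(per_token_loss * mask) / tf.reduce_sum(mask)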
Hope this helps.
I'm having issues with presenting my data in a form which sklearn will accept
My raw data is a few hundred strings, each classified into one of 5 classes. I have a list of the strings I'd like to classify and a parallel list of their respective classes. I'm using GaussianNB().
Example Data:
For such a large, successful business, I really feel like they need to be
either choosier in their employee selection or teach their employees to
better serve their customers.|||Class:4
Which represents a given "feature" and a classification
Naturally, the strings themselves have to be converted to vectors before they can be used in the classifier. I've attempted to use DictVectorizer to perform this task:
dictionaryTraining = convertListToSentence(data)
vec = DictVectorizer()
print(dictionaryTraining)
vec.fit_transform(dictionaryTraining)
However, in order to do it, I have to attach the actual classification of the data to the dictionary, otherwise I get the error 'str' object has no attribute 'items'. I understand this is because .fit_transform requires features and indices, but I don't fully understand the purpose of the indices:
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
My question is: how can I take a list of strings and a list of numbers representing their classifications, and provide these to a GaussianNB() classifier, such that I can present it with a similar string in the future and it will estimate the string's class?
Since your input data are raw text and not already a dictionary like {"word": number_of_occurrences}, I believe you should go with a CountVectorizer, which will split your input text on whitespace and transform it into the input vectors you need.
A simple example of such a transformation would be:
from sklearn.feature_extraction.text import CountVectorizer
corpus = ['This is the first document.', 'This is the second second document.',
'And the third one.', 'Is this the first document?',]
x = CountVectorizer().fit_transform(corpus)
print(x.todense())  # x holds your features; here I am only visualizing it
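To connect that to GaussianNB, a minimal sketch (texts and classes stand in for your two parallel lists; note that GaussianNB expects a dense array, hence the toarray() calls):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

texts = ['For such a large, successful business, ...', 'another review ...']  # your raw strings
classes = [4, 2]                                                              # parallel class labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()   # dense features for GaussianNB
clf = GaussianNB().fit(X, classes)

# later: classify a new string with the *same* fitted vectorizer
new_X = vectorizer.transform(['a similar string to classify']).toarray()
print(clf.predict(new_X))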
I am a beginner with Python and the scikit-learn library.
I currently need to work on an NLP project which first needs to represent a large corpus with one-hot encoding.
I have read scikit-learn's documentation about preprocessing.OneHotEncoder; however, it does not seem to match my understanding of the term.
Basically, the idea is as below:
1000000 Sunday;
0100000 Monday;
0010000 Tuesday;
...
0000001 Saturday;
If the corpus only has 7 different words, then I only need a 7-digit vector to represent every single word. A complete sentence can then be represented by the concatenation of all the vectors, which forms a sentence matrix.
However, when I tried this in Python, it did not seem to work...
How can I work this out? My corpus has a very large number of different words.
Btw, it also seems that if the vectors are mostly filled with zeros, we can use scipy.sparse to keep the storage small, for example with CSR.
Hence, my entire question will be:
how can the sentences in a corpus be represented with OneHotEncoder and stored in a sparse matrix?
Thank you guys.
In order to use the OneHotEncoder, you can split your documents into tokens and then map every token to an id (that is always the same for the same string). Then apply the OneHotEncoder to that list. The result is by default a sparse matrix.
Example code for two simple documents A B and B B:
from sklearn.preprocessing import OneHotEncoder
import itertools
# two example documents
docs = ["A B", "B B"]
# split documents to tokens
tokens_docs = [doc.split(" ") for doc in docs]
# convert list of of token-lists to one flat list of tokens
# and then create a dictionary that maps word to id of word,
# like {'A': 0, 'B': 1} here
all_tokens = itertools.chain.from_iterable(tokens_docs)
word_to_id = {token: idx for idx, token in enumerate(set(all_tokens))}
# convert token lists to token-id lists, e.g. [[0, 1], [1, 1]] here
token_ids = [[word_to_id[token] for token in tokens_doc] for tokens_doc in tokens_docs]
# convert list of token-id lists to one-hot representation
vec = OneHotEncoder(n_values=len(word_to_id))
X = vec.fit_transform(token_ids)
print(X.toarray())
Prints (one hot vectors in concatenated form per document):
[[ 1. 0. 0. 1.]
[ 0. 1. 0. 1.]]
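To the storage part of the question: X is already a SciPy sparse matrix, so it can be kept (and saved) in CSR form without materializing the zeros, e.g. (save_npz needs a reasonably recent SciPy):
from scipy import sparse

X_csr = sparse.csr_matrix(X)                # ensure CSR format
sparse.save_npz('one_hot_docs.npz', X_csr)  # store compactly on disk
X_loaded = sparse.load_npz('one_hot_docs.npz')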