Should data fed into the Universal Sentence Encoder be normalized? - python

I am currently working with TensorFlow's Universal Sentence Encoder (https://arxiv.org/pdf/1803.11175.pdf) for my B.Sc. thesis, where I study extractive summarisation techniques.
In the vast majority of techniques for this task (like https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11225/10855), the sentences are first normalized (lowercasing, stop word removal, lemmatisation), but I couldn't find a hint whether sentences fed into the USE should first be normalized. Is that the case? Does it matter?

The choice really depends on the application and design.
Regarding stop word removal and lemmatization: these operations generally remove some content from the text, and with it, potentially, some information. However, if removing them doesn't hurt your results, you can do so. (It is always best to try both; in general the performance differences shouldn't be large.)
Lowercasing depends on the pre-trained model that you use (for example, BERT comes as bert-base-uncased and bert-base-cased) and on the application. One simple way to verify is: input a text into the USE model and obtain its sentence embedding, then lowercase the same input text and obtain its sentence embedding. If they are the same, your model is case-insensitive; if the embeddings differ, it is case-sensitive. (By running the program provided here, it appears that USE is case-sensitive.) The choice of lowercasing is, again, application dependent.
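A minimal sketch of that check, assuming tensorflow and tensorflow_hub are installed and using the TF Hub USE v4 module URL (substitute whichever USE version you actually use):

import numpy as np
import tensorflow_hub as hub

# Load the Universal Sentence Encoder from TF Hub (requires network access on first run).
embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

original = ["The Quick Brown Fox Jumps Over The Lazy Dog"]
lowered = [s.lower() for s in original]

emb_original = embed(original).numpy()
emb_lowered = embed(lowered).numpy()

# If the two embeddings differ, the model distinguishes case.
print("identical:", np.allclose(emb_original, emb_lowered))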

Related

How to find common adjectives related to a word using word2vec?

I need to train a model in Python based on word2vec or other models in order to get adjectives which are semantically close to a word.
For example, give a word like 'cat' to the model and receive adjectives like 'cute', 'nice', etc.
Is there any way?
With any word2vec model – whether you train it on your own data, or download someone else's pre-trained model – you can give it a word like cat and receive back a ranked list of words that are considered 'similar' in its coordinate system.
However, these won't normally be limited to adjectives, as typical word2vec models don't take any note of a word's part-of-speech. So to filter to just adjectives, some options could include:
use a typical word2vec set-of-vectors that is oblivious to part-of-speech, but use some external reference (say, WordNet) to check each returned word, and discard those that can't be adjectives (see the sketch after this list)
preprocess a suitable training corpus to label words with their part-of-speech before word2vec training, as is sometimes done. Then your model's tokens will include within them a declared part-of-speech. For example, you'd then no longer have the word good alone as a token, but (depending on what conventions you use) tagged-tokens like good/NOUN & good/ADJ instead. Then, filtering the closest-words to just adjectives is a simple matter of checking for the desired string pattern.
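A sketch of the first option, assuming gensim and nltk are installed, the WordNet corpus has been fetched via nltk.download('wordnet'), and using gensim's downloader with the glove-wiki-gigaword-100 vectors as a stand-in for whatever word2vec-style vectors you actually have:

import gensim.downloader as api
from nltk.corpus import wordnet as wn

# Any KeyedVectors-style model with a most_similar() method would work here.
kv = api.load("glove-wiki-gigaword-100")

def similar_adjectives(word, topn=50, keep=10):
    """Return nearest neighbours of `word` that WordNet lists as adjectives."""
    adjectives = []
    for neighbour, score in kv.most_similar(word, topn=topn):
        # WordNet tags adjectives as 'a' (adjective) or 's' (satellite adjective).
        if any(s.pos() in ("a", "s") for s in wn.synsets(neighbour)):
            adjectives.append((neighbour, score))
        if len(adjectives) >= keep:
            break
    return adjectives

print(similar_adjectives("cat"))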
However, the words you receive from any such process based on word2vec might not be precisely what you're looking for. The kinds of 'semantic similarity' captured by word2vec coordinates are driven by how well words predict other nearby words under the model's limitations. Whether these will meet your needs is something you'll have to try; there could be surprises.
For example, words that humans consider antonyms, like hot & cold, will still be relatively close to each other in word2vec models, as they both describe the same aspect of something (its temperature), and often appear in the same surrounding-word contexts.
And, depending on training texts & model training parameters, different word2vec models can sometimes emphasize different kinds of similarity in their rankings. Some have suggested, for example, that a smaller window tends to place words that are direct replacements for each other (same syntactic roles) closer together, whereas a larger window brings together words used in the same topical domains somewhat more (even if they aren't of the same type). Which kind of similarity would be better for your need? I'm not sure; if you have the time/resources, you could compare the quality of results from multiple contrasting models.

Which document embedding model for document similarity

First, I want to explain my task. I have a dataset of 300k documents with an average of 560 words each (no stop word removal yet): 75% in German, 15% in English, and the rest in other languages. The goal is to recommend similar documents based on an existing one. At the beginning I want to focus on the German and English documents.
To achieve this goal I looked into several methods of feature extraction for document similarity; the word embedding methods especially impressed me because they are context aware, in contrast to simple TF-IDF feature extraction and the calculation of cosine similarity.
I'm overwhelmed by the number of methods I could use and I haven't found a proper evaluation of them yet. I know for sure that the size of my documents is too big for BERT, but there are FastText, Sent2Vec, Doc2Vec and the Universal Sentence Encoder from Google. My favorite method based on my research is Doc2Vec, even though there are no pre-trained models (or only old ones), which means I have to do the training on my own.
Now that you know my task and goal, I have the following questions:
Which method should I use for feature extraction based on the rough overview of my data?
My dataset is too small to train Doc2Vec on. Would I achieve good results if I trained the model on the English/German Wikipedia instead?
You really have to try the different methods on your data, with your specific user tasks, with your time/resources budget to know which makes sense.
Your 225K German documents and 45k English documents are each plausibly large enough to use Doc2Vec, as they match or exceed the corpus sizes of some published results. So you wouldn't necessarily need to train on something else (like Wikipedia) instead, and whether adding that to your data would help or hurt is another thing you'd need to determine experimentally.
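For orientation, a minimal gensim (4.x) Doc2Vec training sketch; `documents` is a hypothetical iterable of (doc_id, token_list) pairs built from your own corpus:

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Wrap each pre-tokenized document with a unique tag so it can be looked up later.
corpus = [TaggedDocument(words=tokens, tags=[doc_id]) for doc_id, tokens in documents]

model = Doc2Vec(vector_size=300, window=5, min_count=5, epochs=20, workers=4)
model.build_vocab(corpus)
model.train(corpus, total_examples=model.corpus_count, epochs=model.epochs)

# Recommend documents similar to an existing one by its tag ("some-doc-id" is a placeholder).
print(model.dv.most_similar("some-doc-id", topn=10))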
(There might be special challenges in German given compound words using common-enough roots but being individually rare, I'm not sure. FastText-based approaches that use word-fragments might be helpful, but I don't know a Doc2Vec-like algorithm that necessarily uses that same char-ngrams trick. The closest that might be possible is to use Facebook FastText's supervised mode, with a rich set of meaningful known-labels to bootstrap better text vectors - but that's highly speculative and that mode isn't supported in Gensim.)

Is there a pretrained Gensim phrase model?

Is there a pretrained Gensim Phrases model? If not, would it be possible to reverse-engineer and create a phrases model from a pretrained word embedding?
I am trying to use GoogleNews-vectors-negative300.bin with Gensim's Word2Vec. First, I need to map my words into phrases so that I can look up their vectors in Google's pretrained embedding.
I searched the official Gensim documentation but could not find any info. Thanks!
I'm not aware of anyone sharing a Phrases model. Any such model would be very sensitive to the preprocessing/tokenization step, and the specific parameters, the creator used.
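If you end up training your own, a minimal gensim sketch; `tokenized_sentences` is a hypothetical iterable of token lists from your own corpus:

from gensim.models.phrases import Phrases, Phraser

# Learn which adjacent token pairs co-occur often enough to be treated as phrases.
phrases = Phrases(tokenized_sentences, min_count=5, threshold=10.0)
bigram = Phraser(phrases)  # frozen, lighter-weight version for fast application

# Tokens that form a detected phrase come back joined with '_'.
print(bigram[["machine", "learning", "is", "fun"]])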
Other than the high-level algorithm description, I haven't seen the exact tokenization/canonicalization/phrase-combination choices Google applied to the data behind the GoogleNews 2013 word-vectors documented anywhere. Some guesses about the preprocessing can be made by reviewing the tokens present, but I'm unaware of any code to apply similar choices to other text.
You could try to mimic their unigram tokenization, then speculatively combine strings of unigrams into ever-longer multigrams up to some maximum, check whether those combinations are present, and, when not present, revert to the unigrams (or the largest combination present). This might be expensive if done naively, but could be amenable to optimizations if really important, especially for some subset of the more-frequent words, as the GoogleNews set appears to obey the convention of listing words in descending frequency.
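A speculative sketch of that greedy combination, assuming gensim 4.x and a local copy of GoogleNews-vectors-negative300.bin; the strategy itself is only an approximation of whatever Google actually did:

from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)

def combine_phrases(tokens, max_len=4):
    out, i = [], 0
    while i < len(tokens):
        # Try the longest candidate first; GoogleNews joins phrase words with '_'.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = "_".join(tokens[i:i + n])
            if candidate in kv.key_to_index:
                out.append(candidate)
                i += n
                break
        else:
            # Nothing matched (not even the unigram); keep the raw token.
            out.append(tokens[i])
            i += 1
    return out

print(combine_phrases(["New", "York", "City", "is", "big"]))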
(In general, though it's a quick & easy starting set of word-vectors, I think GoogleNews is a bit over-relied upon. It will lack words/phrases and new senses that have developed since 2013, and any meanings it does capture are determined by news articles in the years leading up to 2013... which may not match the dominant senses of words in other domains. If your domain isn't specifically news, and you have sufficient data, deciding your own domain-specific tokenization/combination will likely perform better.)

Creating a Python program that takes in a short description and returns a solution from a given set (using nlp)

I am trying to take a person's ailment, and return what they should do (from a predetermined set of "solutions").
For example,
person's ailment
My head is not bleeding
predetermined set of "solutions"
[take medicine, go to a doctor, call the doctor]
I know I need to first remove common words from the sentence (such as 'my' and 'is') but also preserve "common" words such as 'not,' which are crucial to the solution and important to the context.
Next, I'm pretty sure I'll need to train a set of processed inputs and match them to outputs to train a model which will attempt to identify the "solution" for the given string.
Are there any other libraries I should be using (other than nltk, and scikit-learn)?
You should check out gensim. You might also want to look into these keywords for a start: tokenization, word stemming, lemmatization. Good luck!
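For a concrete starting point, a minimal sketch of the train-inputs-to-outputs idea using scikit-learn; the ailments and solutions below are made up purely for illustration:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

ailments = [
    "my head is bleeding",
    "my head is not bleeding but it hurts",
    "I have a mild fever",
    "I have a very high fever and chest pain",
]
solutions = ["go to a doctor", "take medicine", "take medicine", "call the doctor"]

# No stop-word list is passed, so words like 'not' survive; word bigrams help
# the model pick up negations such as "not bleeding".
pipeline = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
pipeline.fit(ailments, solutions)

print(pipeline.predict(["my head hurts but is not bleeding"]))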

Use proxy sentences from cleaned data

Gensim's Word2Vec model takes as input a list of lists, with each inner list containing the individual tokens/words of a sentence. As I understand it, Word2Vec is used to "quantify" the context of words within a text using vectors.
I am currently dealing with a corpus of text that has already been split into individual tokens and no longer contains an obvious sentence format (punctuation has been removed). I was wondering how I should input this into the Word2Vec model?
Say if I simply split the corpus into "sentences" of uniform length (10 tokens per sentence for example), would this be a good way of inputting the data into the model?
Essentially, I am wondering how the format of the input sentences (list of lists) affects the output of Word2Vec?
That sounds like a reasonable solution. If you have access to data that is similar to your cleaned data, you could get the average sentence length from that data set. Otherwise, you could find other data in the language you are working with (from Wikipedia or another source) and get the average sentence length from there.
Of course your output vectors will not be as reliable as if you had the correct sentence boundaries, but it sounds like word order was preserved so there shouldn't be too much noise from incorrect sentence boundaries.
Most typically, text is passed to Word2Vec in logical units (like sentences or paragraphs). Also, the published papers and early demo code tended to convert punctuation into tokens, as well.
But text without punctuation, and arbitrary breaks between texts, are a reasonable workaround and still give pretty good results.
For example, the text8/text9 corpuses often used in demos (including the word2vec intro Jupyter notebook bundled in gensim) are just giant runs-of-words, lacking punctuation and line-breaks. So, the utility LineSentence class used in gensim will break them into individual 10,000-token texts.
It's probably better to go larger with your arbitrary breaks (e.g. 10,000) rather than smaller (e.g. 10), for a couple of reasons:
source texts are usually longer than 10 words
often the source material that was run-together was still semantically-related across its original boundaries
the optimized algorithms work better on larger chunks of data
the harm of "false context windows" (created by the concatenation) is probably just noise with no net biasing effect, while having more "true windows" (by creating as few false splits as possible) likely retains more of the original corpus's learnable word-to-word relationship signal
you can always simulate more-conservative contexts with a smaller window parameter (if the original source really did have tiny sentences that weren't sequentially-related)
But, gensim's cython-optimized training path has an implementation limit of 10,000 tokens per text – with any more being silently ignored – so you wouldn't intentionally want to supply longer texts for any reason.
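A small sketch of that workaround, assuming `all_tokens` is your already-tokenized corpus as one long list of strings: chunk it into 10,000-token texts (the implementation limit mentioned above) and train Word2Vec on the chunks:

from gensim.models import Word2Vec

MAX_TOKENS_PER_TEXT = 10_000

def chunked(tokens, size=MAX_TOKENS_PER_TEXT):
    for start in range(0, len(tokens), size):
        yield tokens[start:start + size]

texts = list(chunked(all_tokens))

model = Word2Vec(sentences=texts, vector_size=100, window=5, min_count=5, workers=4)

# Sanity check with any word you expect to be in the vocabulary ("example" is a placeholder).
print(model.wv.most_similar("example"))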
