I have trained the imdb_lstm.py example on my PC.
Now I want to test the trained network by inputting some text of my own. How do I do it?
Thank you!
So what you basically need to do is as follows:
1. Tokenize sequences: convert each string into words (features). For example: "hello my name is georgio" becomes ["hello", "my", "name", "is", "georgio"].
2. Next, you want to remove stop words (check Google for what stop words are).
3. Optionally, stem your words (features); that way you reduce the number of features, which leads to a faster run. This stage may lead to faulty results, but I think it is worth a try. For example, if you stem the word 'parking' you get 'park', which has a different meaning.
4. Next, create a dictionary (check Google for that). Each word gets a unique number, and from this point on we will use that number only.
5. Computers understand numbers only, so we need to talk in their language. Take the dictionary from stage 4 and replace each word in the corpus with its matching number.
6. Now split the data set into two groups: training and testing. One (training) will train our NN model and the second (testing) will help us figure out how good our NN is. You can use Keras' cross-validation function.
7. Next, define the maximum number of features our NN can take as input. Keras calls this parameter 'maxlen'. You don't really have to do this manually; Keras can do it automatically just by searching for the longest sentence in your corpus.
8. Finally, say Keras found that the longest sentence in your corpus has 20 words (features), and one of your sentences is the example from the first stage, whose length is 5 (it would be shorter still after removing stop words). In that case we need to add zeros, 15 zeros actually. This is called padding sequences; we do it so that every input sequence has the same length.
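Here is a minimal sketch of stages 1, 4, 5 and the padding step, assuming you use Keras' own Tokenizer and pad_sequences helpers (the values num_words=20000 and maxlen=20 are just illustrative, and import paths may differ across Keras versions, e.g. tensorflow.keras.preprocessing):
# A rough sketch of the tokenize -> dictionary -> index -> pad pipeline.
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

corpus = ["hello my name is georgio", "another example sentence"]

# Stages 1 + 4: tokenize and build the word -> unique-number dictionary.
tokenizer = Tokenizer(num_words=20000)
tokenizer.fit_on_texts(corpus)

# Stage 5: replace each word with its number from the dictionary.
sequences = tokenizer.texts_to_sequences(corpus)

# Padding: make every input sequence the same length (here 20).
padded = pad_sequences(sequences, maxlen=20)
print(padded.shape)  # (2, 20)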
This might help.
http://keras.io/models/
Here is a sample usage:
How to use keras for XOR
You probably have to convert your corpus into an ndarray first and pass it to model.predict.
From what I can see so far, the input to model.predict for this trained model should be a 100-word sequence, where each entry is the index of a word in the dictionary. So if you want to run it on your own corpus, you have to convert your corpus according to that dictionary and see whether the result is 0 or 1.
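A hedged sketch of that idea, assuming model is the network trained by imdb_lstm.py (with maxlen=100) and that you reuse the dataset's own word index from keras.datasets.imdb:
# Sketch: convert your own text into the word indices the IMDB model was
# trained on, then call model.predict. Assumes maxlen=100 and the default
# index_from=3 offset used by keras.datasets.imdb.load_data().
from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

word_index = imdb.get_word_index()  # word -> rank (1-based)
maxlen = 100

def encode(text):
    # +3 because load_data() reserves 0 (padding), 1 (start) and 2 (unknown).
    ids = [word_index[w] + 3 for w in text.lower().split() if w in word_index]
    return pad_sequences([ids], maxlen=maxlen)

x = encode("this movie was surprisingly good")
print(model.predict(x))  # close to 1 -> positive review, close to 0 -> negative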
I am currently working with TensorFlow's Universal Sentence Encoder (https://arxiv.org/pdf/1803.11175.pdf) for my B.Sc. thesis, where I study extractive summarisation techniques.
In the vast majority of techniques for this task (like https://www.aaai.org/ocs/index.php/IJCAI/IJCAI15/paper/view/11225/10855), the sentences are first normalized (lowercasing, stop word removal, lemmatisation), but I couldn't find any hint as to whether sentences fed into the USE should first be normalized. Is that the case? Does it matter?
The choice really depends on the application and its design.
Regarding stop word removal and lemmatization: in general these operations remove some content from the text, and hence can remove information. However, if they don't make an impact on your results, you can apply them. (It is always best to try both; in general the performance differences shouldn't be too large.)
Lowercasing depends on the pre-trained model that you use (for example, in BERT you have bert-base-uncased and bert-base-cased) and on the choice of application. One simple way to verify is: input a text into the USE model and obtain its sentence embedding, then lowercase the same input text and obtain its sentence embedding. If they are the same, your model is case-insensitive; if the embeddings differ, it is case-sensitive. (By running the program provided here, it appears that USE is case-sensitive.) The choice of lowercasing is again application dependent.
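For example, a minimal sketch of that check (assuming TensorFlow 2.x and tensorflow_hub; the module URL is the standard USE v4 handle and may need adjusting):
# Compare embeddings of the original and lowercased text; if they differ,
# the model is case-sensitive.
import numpy as np
import tensorflow_hub as hub

embed = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

text = "The Quick Brown Fox Jumps Over The Lazy Dog"
emb_original = embed([text]).numpy()
emb_lower = embed([text.lower()]).numpy()

print("case-sensitive:", not np.allclose(emb_original, emb_lower))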
I have a text file with millions of rows that I want to convert into word vectors, so that I can later compare these vectors with a search keyword and see which texts are closest to it.
My dilemma is that all the training files I have seen for word2vec are in the form of paragraphs, so that each word has some contextual meaning within that file. My file, however, is different: each row contains an independent keyword.
My question is whether it is possible to create word embeddings using this text file, and if not, what is the best approach for finding a matching search keyword in these millions of texts?
**My File Structure:**
Walmart
Home Depot
Home Depot
Sears
Walmart
Sams Club
GreenMile
Walgreen
**Expected:**
Search text: 'WAL'
Result from my file:
WALGREEN
WALMART
WALMART
Embeddings
Let's step back and understand what word2vec is. Word2vec (like GloVe, FastText, etc.) is a way to represent words as vectors. ML models don't understand words, they only understand numbers, so when we are dealing with words we want to convert them into numbers (vectors). One-hot encoding is one naive way of encoding words as vectors, but for a large vocabulary one-hot encodings become too long. Also, there is no semantic relationship between one-hot encoded words.
With DL came the distributed representation of words (called word embeddings). One important property of these word embeddings is that the vector distance between related words is small compared to the distance between unrelated words, i.e. distance(apple, orange) < distance(apple, cat).
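You can see that property with a small pretrained model, for example via gensim's downloader (a sketch; 'glove-wiki-gigaword-50' is one of the standard gensim-data model names and is fetched over the network):
# Higher cosine similarity means smaller distance, so similarity(apple, orange)
# should come out higher than similarity(apple, cat).
import gensim.downloader as api

wv = api.load("glove-wiki-gigaword-50")
print(wv.similarity("apple", "orange"))
print(wv.similarity("apple", "cat"))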
So how are these embedding models trained? They are trained on (very) huge corpora of text. With a huge corpus, the model will see that apple and orange are used (many times) in the same context, and so it will learn that apple and orange are related. Hence, to train a good embedding model you need a huge corpus of text (not independent words, because independent words have no context).
However, one rarely trains a word embedding model from scratch, because good embedding models are available in open source. If your text is domain specific (say medical), you can instead do transfer learning on openly available word embeddings.
Out of vocabulary (OOV) words
Word embeddings like word2vec and GloVe cannot return an embedding for OOV words. However, embeddings like FastText (thanks to #gojom for pointing it out) handle OOV words by breaking them into character n-grams and building a vector by summing up the subword vectors that make up the word.
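A toy sketch of that behaviour with gensim's FastText (the two-sentence corpus is only a placeholder; as discussed above, real training needs a large corpus of text in context):
# FastText builds a vector for an out-of-vocabulary word from its character
# n-grams, where plain Word2Vec would raise a KeyError. Toy corpus only.
from gensim.models import FastText

sentences = [["walmart", "store", "shopping"],
             ["walgreen", "pharmacy", "store"]]

model = FastText(sentences, vector_size=50, min_count=1, epochs=10)

print(model.wv["wal"])                       # OOV fragment still gets a vector
print(model.wv.most_similar("wal", topn=2))  # similarity driven by shared n-grams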
Problem
Coming to your problem:
Case 1: Let's say the user enters the word WAL. First of all, it is not a valid English word, so it will not be in the vocabulary and it is hard to assign a meaningful vector to it. Embeddings like FastText handle this by breaking the word into n-grams. This approach gives good embeddings for misspelled words or slang.
Case 2: Let's say the user enters the word WALL. If you plan to use vector similarity to find the closest word, it will never be close to Walmart, because semantically they are not related; it will rather be close to words like window, paint and door.
Conclusion
If your search is for semantically similar words, then a solution using vector embeddings will serve you well. On the other hand, if your search is based on lexical (string) matching, then vector embeddings will be of no help.
If you wanted to find walmart from a fragment like wal, you'd more likely use something like:
a substring or prefix search through all entries; or
a reverse-index-of-character-n-grams; or
some sort of edit-distance calculated against all entries or a subset of likely candidates
That is, from your example desired output, this is not really a job for word-vectors, even though some algorithms, like FastText, will be able to provide rough vectors for word-fragments based on their overlap with trained words.
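For completeness, a minimal sketch of the first and third options combined (plain substring matching with a difflib fallback for near matches), using the example entries from the question:
# Lexical search: substring/prefix match first, then approximate matches
# (difflib's similarity ratio) as a fallback when nothing matches exactly.
import difflib

entries = ["Walmart", "Home Depot", "Sears", "Sams Club", "GreenMile", "Walgreen"]

def search(query, entries):
    q = query.lower()
    hits = [e for e in entries if q in e.lower()]
    if hits:
        return hits
    return difflib.get_close_matches(query, entries, n=5, cutoff=0.6)

print(search("WAL", entries))  # ['Walmart', 'Walgreen']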
If in fact you want to find similar stores, word-vectors might theoretically be useful. But the problem given your example input is that such word-vector algorithms require examples of tokens used in context, from sequences-of-tokens that co-appear in natural-language-like relationships. And you want lots of data featuring varied examples-in-context, to capture subtle gradations of mutual relationships.
While your existing single-column of short entity-names (stores) can't provide that, maybe you have something applicable elsewhere, if you have richer data sources. Some ideas might be:
lists of stores visited by a single customer
lists of stores carrying the same product/UPC
text from a much larger corpus (such as web-crawled text, or maybe Wikipedia) in which there are sufficient in-context usages of each store-name. (You'd just throw out all the other word-vectors created by such training - but the vectors for your tokens-of-interest might still be of use in your domain.)
Is there a way to find similar docs, as we do for words in word2vec?
Like:
model2.most_similar(positive=['good', 'nice', 'best'],
                    negative=['bad', 'poor'],
                    topn=10)
I know we can use infer_vector and then look up similar documents for the inferred vector, but I want to feed many positive and negative examples, as we do in word2vec.
Is there any way we can do that? Thanks!
The doc-vectors part of a Doc2Vec model works just like word-vectors, with respect to a most_similar() call. You can supply multiple doc-tags or full vectors inside both the positive and negative parameters.
So you could call...
sims = d2v_model.docvecs.most_similar(positive=['doc001', 'doc009'], negative=['doc102'])
...and it should work. The elements of the positive or negative lists could be doc-tags that were present during training, or raw vectors (like those returned by infer_vector(), or your own averages of multiple such vectors).
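For example, you could also mix trained doc-tags with a freshly inferred vector (a sketch, assuming d2v_model is an already-trained gensim Doc2Vec model; in gensim 4.x the attribute is .dv rather than .docvecs):
# Combine known doc-tags and a raw inferred vector in one similarity query.
new_vec = d2v_model.infer_vector(["good", "nice", "best"])

sims = d2v_model.docvecs.most_similar(
    positive=['doc001', 'doc009', new_vec],  # tags and raw vectors can be mixed
    negative=['doc102'],
    topn=10,
)
print(sims)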
I don't believe there is a pre-written function for this.
One approach would be to write a function that iterates through each word in the positive list to get top n words for a particular word.
So for positive words in your question example, you would end up with 3 lists of 10 words.
You could then identify the words that are common across the 3 lists as the top n most similar to your positive list. Since not all words will be common across the 3 lists, you probably need to get the top 20 similar words when iterating, so that you end up with the top 10 words as in your example.
Then do the same for negative words.
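A sketch of that approach for the positive words (assuming model2 is a gensim Word2Vec model as in the question; handling of the negative words, e.g. dropping candidates that also rank high for them, is left out for brevity):
# Take a generous top-n per positive word, keep only candidates common to
# every per-word list, then rank them by mean similarity to the positives.
def common_similar(model, positive, per_word_topn=20, topn=10):
    per_word = [{w for w, _ in model.wv.most_similar(p, topn=per_word_topn)}
                for p in positive]
    common = set.intersection(*per_word)
    scored = [(w, sum(model.wv.similarity(w, p) for p in positive) / len(positive))
              for w in common]
    return sorted(scored, key=lambda t: t[1], reverse=True)[:topn]

print(common_similar(model2, ['good', 'nice', 'best']))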
Gensim's Word2Vec model takes as input a list of lists, with each inner list containing the individual tokens/words of a sentence. As I understand it, Word2Vec is used to "quantify" the context of words within a text using vectors.
I am currently dealing with a corpus of text that has already been split into individual tokens and no longer contains an obvious sentence format (punctuation has been removed). I was wondering how should I input this into the Word2Vec model?
Say if I simply split the corpus into "sentences" of uniform length (10 tokens per sentence for example), would this be a good way of inputting the data into the model?
Essentially, I am wondering how the format of the input sentences (list of lists) affects the output of Word2Vec?
That sounds like a reasonable solution. If you have access to data that is similar to your cleaned data you could get average sentence length from that data set. Otherwise, you could find other data in the language you are working with (from wikipedia or another source) and get average sentence length from there.
Of course your output vectors will not be as reliable as if you had the correct sentence boundaries, but it sounds like word order was preserved so there shouldn't be too much noise from incorrect sentence boundaries.
Most typically, text is passed to Word2Vec in logical units (like sentences or paragraphs). Also, the published papers and early demo code tended to convert punctuation into tokens, as well.
But text without punctuation, and arbitrary breaks between texts, are a reasonable workaround and still give pretty good results.
For example, the text8/text9 corpuses often used in demos (including the word2vec intro Jupyter notebook bundled in gensim) are just giant runs-of-words, lacking punctuation and line-breaks. So, the utility LineSentence class used in gensim will break them into individual 10,000-token texts.
It's probably better to go larger in your arbitrary breaks (eg 10,000), rather than smaller (eg 10), for a couple reasons:
source texts are usually longer than 10 words
often the source material that was run-together was still semantically-related across its original boundaries
the optimized algorithms work better on larger chunks of data
the harm of "false context windows" (created by the concatenation) is probably just noise with no net biasing effect, while keeping more "true windows" (by creating as few false splits as possible) likely retains more of the original corpus' learnable word-to-word relationship signal
you can always simulate more-conservative contexts with a smaller window parameter (if the original source really did have tiny sentences that weren't sequentially-related)
But, gensim's cython-optimized training path has an implementation limit of 10,000 tokens per text – any extra tokens are silently ignored – so you wouldn't want to supply texts longer than that.
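A sketch of feeding such an already-tokenized, boundary-less stream to gensim in fixed-size chunks (10,000 matches the per-text limit mentioned above; min_count=1 is only for the toy example):
# Split one long run of tokens into <=10,000-token "texts" before training,
# similar to what gensim's Text8Corpus/LineSentence helpers do internally.
from gensim.models import Word2Vec

def chunk(tokens, size=10000):
    for i in range(0, len(tokens), size):
        yield tokens[i:i + size]

tokens = ["some", "very", "long", "run", "of", "tokens"]  # your real corpus here
sentences = list(chunk(tokens))

model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)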
I'm trying to use NLTK to perform some NLP classification for Arabic phrases. If I enter the native words as-is into the classifier, it complains about non-ASCII characters. Currently, I'm doing word.decode('utf-8') and then entering that as input to the trainer.
When I test the classifier, the results make some sense if there is an exact match. However, if I test a substring of one of the original training words, the results look somewhat random.
I just want to figure out whether this is a bad classifier or whether there is something fundamental in the encoding that degrades its performance. Is this a reasonable way to feed non-ASCII text to classifiers?
#!/usr/bin/python
# -*- coding: utf-8 -*-
from textblob.classifiers import NaiveBayesClassifier
x = "الكتاب".decode('utf-8')
...
train = [
    (x, 'pos'),
]
cl = NaiveBayesClassifier(train)
t = "كتاب".decode('utf-8')
cl.classify(t)
The word in t is simply x with the first two letters removed. I'm running this with a much bigger dataset of course.
Your post contains, basically, two questions. The first is concerned with encoding, the second one is about predicting substrings of words seen in training.
For encoding, you should use unicode literals directly, so you can omit the decode() part. Like this:
x = u"الكتاب"
Then you will have a decoded representation already.
Concerning substrings, the classifier won't do that for you. If you ask for predictions for a token that wasn't included in the training in exactly the same spelling, then it will be treated as an unknown word – no matter if it's a substring of a word that occurred in training or not.
The substring case wouldn't be well-defined, anyway: Let's say you look up the single letter Alif – probably, a whole lot of words seen in training contain it. Which one should be used? A random one? The one with the highest probability? The sum of the probabilities of all matching ones? There's no easy answer to this.
I suspect that you are trying to match morphological variants of the same root. If this is the case, then you should try using a lemmatiser. So, before training, and also before prediction, you preprocess all tokens by converting them to their lemma (which is usually the root in Arabic, I think). I doubt that NLTK ships with a morphological model for Arabic, though, so you probably need to look for that elsewhere (but this is beyond the scope of this answer now).
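As a rough illustration of that idea: NLTK does ship a light Arabic root stemmer (ISRIStemmer), which is not a full lemmatiser or morphological analyser, but it can serve as an approximation in a sketch like this:
# Approximate root-normalisation before training and prediction, using
# NLTK's ISRI stemmer (a root stemmer, not a true lemmatiser).
from nltk.stem.isri import ISRIStemmer
from textblob.classifiers import NaiveBayesClassifier

stemmer = ISRIStemmer()

def normalize(text):
    return " ".join(stemmer.stem(token) for token in text.split())

train = [
    (normalize(u"الكتاب"), 'pos'),
    # ... more labelled examples ...
]
cl = NaiveBayesClassifier(train)
print(cl.classify(normalize(u"كتاب")))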