Find similarity with doc2vec like word2vec - python

Is there a way to find similar docs with doc2vec, like we do in word2vec?
For example:
model2.most_similar(positive=['good', 'nice', 'best'],
                    negative=['bad', 'poor'],
                    topn=10)
I know we can use infer_vector and feed the result in to get similar docs, but I want to feed many positive and negative examples, as we do in word2vec.
Is there any way we can do that? Thanks!

The doc-vectors part of a Doc2Vec model works just like word-vectors, with respect to a most_similar() call. You can supply multiple doc-tags or full vectors inside both the positive and negative parameters.
So you could call...
sims = d2v_model.docvecs.most_similar(positive=['doc001', 'doc009'], negative=['doc102'])
...and it should work. The elements of the positive or negative lists could be doc-tags that were present during training, or raw vectors (like those returned by infer_vector(), or your own averages of multiple such vectors).
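For example, a sketch mixing trained doc-tags with vectors freshly inferred from raw token lists (the doc-tags here are just hypothetical placeholders):

good_vec = d2v_model.infer_vector(['good', 'nice', 'best'])
bad_vec = d2v_model.infer_vector(['bad', 'poor'])

sims = d2v_model.docvecs.most_similar(
    positive=['doc001', good_vec],   # trained doc-tags and raw vectors can be mixed
    negative=[bad_vec],
    topn=10)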

I don't believe there is a pre-written function for this.
One approach would be to write a function that iterates through each word in the positive list and gets the top-n similar words for each.
So for the positive words in your example, you would end up with 3 lists of 10 words.
You could then take the words common to all 3 lists as the top-n most similar to your positive list. Since not every word will appear in all 3 lists, you probably need to fetch the top 20 similar words per query so that you still end up with the top 10 you want.
Then do the same for negative words.
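A rough sketch of that iteration, assuming a gensim-style most_similar() that returns (word, score) pairs:

def common_similar(model, positive, topn=10, per_word=20):
    candidate_sets = []
    for word in positive:
        sims = model.most_similar(word, topn=per_word)    # list of (word, score) pairs
        candidate_sets.append({w for w, _ in sims})
    common = set.intersection(*candidate_sets)            # words shared by every list
    return list(common)[:topn]

# e.g. common_similar(model2, ['good', 'nice', 'best'])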

Related

Find the closest word to a set of words

I would need to find something like the opposite of model.most_similar()
While most_similar() returns an array of words most similar to the one given as input, I need to find a sort of "center" of a list of words.
Is there a function in gensim or any other tool that could help me?
Example:
Given {'chimichanga', 'taco', 'burrito'}, the center might be mexico or food, depending on the corpus the model was trained on.
If you supply a list of words as the positive argument to most_similar(), it will report words closest to their mean (which would seem to be one reasonable interpretation of the words' 'center').
For example:
sims = model.most_similar(positive=['chimichanga', 'taco', 'burrito'])
(I somewhat doubt the top result sims[0] here will be either 'mexico' or 'food'; it's most likely to be another mexican-food word. There isn't necessarily a "more generic"/hypernym relation to be found either between word2vec words, or in certain directions... but some other embedding techniques, such as hyperbolic embeddings, might provide that.)
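If you want the "center" explicitly, you could also average the vectors yourself and query with the result. A sketch assuming a gensim KeyedVectors-style interface (on older versions these calls may live directly on the model object):

import numpy as np

words = ['chimichanga', 'taco', 'burrito']
center = np.mean([model.wv[w] for w in words], axis=0)   # average of the word vectors
sims = model.wv.most_similar(positive=[center], topn=10)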

Testing the Keras sentiment classification with model.predict

I have trained the imdb_lstm.py example on my PC.
Now I want to test the trained network by inputting some text of my own. How do I do it?
Thank you!
So what you basically need to do is as follows:
1) Tokenize sequences: convert the string into words (features). For example: "hello my name is georgio" becomes ["hello", "my", "name", "is", "georgio"].
2) Remove stop words (check Google for what stop words are).
3) Optionally, stem your words (features); that way you reduce the number of features, which leads to a faster run. This stage may lead to faulty results, but I think it is worth a try. For example: if you stem the word 'parking' you get 'park', which has a different meaning.
4) Create a dictionary (check Google for that): each word gets a unique number, and from this point on we will use this number only.
5) Computers understand numbers only, so we need to talk in their language. Take the dictionary from stage 4 and replace each word in the corpus with its matching number.
6) Split your data set into two groups: a training set and a testing set. One (training) trains the NN model and the second (testing) helps us figure out how good the NN is. You can use Keras' cross-validation function.
7) Define the maximum number of features the NN can take as input. Keras calls this parameter 'maxlen'. You don't really have to work it out manually; Keras can do it automatically just by searching for the longest sentence in your corpus.
8) Say Keras finds that the longest sentence in your corpus has 20 words (features), and one of your sentences is the example from the first stage, whose length is 5 (shorter still if we remove stop words). In that case we need to add zeros, 15 of them. This is called padding the sequences; we do it so that every input sequence has the same length.
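Here is a minimal sketch of the last few stages under the default Keras imdb conventions (0 = padding, 1 = start, 2 = unknown, real word indices shifted by 3); maxlen and num_words are assumptions that must match whatever you used when training:

from keras.datasets import imdb
from keras.preprocessing.sequence import pad_sequences

word_index = imdb.get_word_index()
maxlen, num_words = 80, 20000                     # must match the values used when training

def encode(text):
    ids = [1]                                     # 1 = start token in the imdb convention
    for w in text.lower().split():
        idx = word_index.get(w)
        if idx is None or idx + 3 >= num_words:
            ids.append(2)                         # 2 = unknown word
        else:
            ids.append(idx + 3)                   # shift past the reserved indices 0, 1, 2
    return ids

# model is the LSTM you trained in imdb_lstm.py
x = pad_sequences([encode("this movie was surprisingly good")], maxlen=maxlen)
print(model.predict(x))   # close to 1.0 -> positive, close to 0.0 -> negative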
This might help.
http://keras.io/models/
Here is a sample usage:
How to use keras for XOR
You probably have to convert your corpus into an ndarray first and pass it to model.predict.
From what I can tell so far, the input to model.predict should be a 100-word sequence in which each entry is the index of a word in the dictionary. So if you want to test it with your own corpus, you have to convert your corpus according to that dictionary and see whether the result is 0 or 1.

What's best WordNet function for similarity between words?

I aim to find the similarities between about 10,000 words. I'm using the word.path_similarity(otherword) method of the WordNet library, but the results I'm getting for path_similarity are in the range 0-0.1 as opposed to being distributed over 0-1. How is it possible that similarities between 10,000 random words all end up in that narrow range?
Is there a better way to use WordNet for finding similarity between two words?
For context, here's how this is calculated:
Calculate the length of the shortest path between the two synsets/words (inclusive).
Return the score as 1/pathlen.
Therefore a score < 0.2 indicates a path length of more than 5 steps. Since that count includes the two input synsets, it means there are at least 4 synsets between them.
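For example, with NLTK's WordNet interface (the synsets are just for illustration):

from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
rock = wn.synset('rock.n.01')

print(dog.path_similarity(cat))    # short path through shared hypernyms, relatively high score
print(dog.path_similarity(rock))   # long path, score well below 0.2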
With that said: your complaint seems to be "according to this metric, two words chosen at random are pretty consistently unrelated! What's going on?" Well, your similarity metric is telling you that random words are generally not closely related. That shouldn't be surprising. Why are you calculating similarities between random words to begin with?

Looking for a better evaluation method for a genetic algorithm

I'm currently trying to solve the hard Challenge #151 on reddit with an unusual method, a genetic algorithm.
In short, after separating a string into consonants and vowels and removing spaces, I need to put it back together without knowing which character comes first.
hello world is separated into hllwrld and eoo and needs to be put together again. One solution, for example, would be hlelworlod, but that doesn't make much sense. The exhaustive approach that tries all possible solutions works, but isn't feasible for longer problem sets.
What I already have
A database with the frequency of English words
An algorithm that constructs a relative cost database using Zipf's law and can consistently separate words from sentences without spaces (borrowed from this question/answer)
A method that puts the consonants and vowels into two stacks and randomly takes a character from one of them, encoding this as a string of 1s and 2s, effectively encoding the construction in a gene. The correct gene for the example would be 1211212111 (see the small decoding sketch after this list).
A method that mutates such a string, randomly swapping characters around
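A tiny sketch of that decoding, just to make the encoding concrete: reading the gene left to right, 1 takes the next unused consonant and 2 the next unused vowel.

def decode(gene, consonants, vowels):
    c, v = list(consonants), list(vowels)
    out = []
    for g in gene:
        out.append(c.pop(0) if g == '1' else v.pop(0))
    return ''.join(out)

print(decode('1211212111', 'hllwrld', 'eoo'))   # -> helloworld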
What I tried
Generating 500 random sequences, using the infer_spaces() method and evaluating fitness via the cost of all the words, then taking the best 25% and mutating 4 new sequences from those, works for small strings, but falls into local minima very often, especially for longer sequences. Hello World is found already in the first generation, while thisisnotworkingverygood (which is correctly separated and has a cost of 41.223) converges to th iss n ti wo or king v rye good (cost 270) already in the second generation.
What I need
Clearly, using the calculated cost as an evaluation method only works for the separation of sentences that are grammatically correct, not for this genetic algorithm. Do you have better ideas I could try? Or is another part of the solution, for example the representation of the gene, the problem?
I would simplify the problem into two parts:
Finding candidate words to split the string into (so hllwrld => hll wrld)
How to then expand those words by adding vowels.
I would first take your dictionary of word frequencies and process it to create a second list of words without vowels, along with a list of the possible vowel sequences for each collapsed word (and the associated frequency). You technically don't need a GA to solve this (and I think it would be easier to solve without one), but as you asked, I will provide 2 answers:
Without GA: you should be able to solve the first problem using a depth first search, matching substrings of the word against that dictionary, and doing so with the remaining word parts, only accepting partitions of the word into words (without vowels) where all words are in the second dictionary. Then you have to substitute in the vowels. Given that second dictionary, and the partition you already have, this should be easy. You can also use the list of vowels to further constrain the partitioning, as valid words in the partitions can only be made whole using vowels from the vowel list that is input into the algorithm. Starting at the left hand side of the string and iterating over all valid partitions in a depth first manner should solve this problem relatively quickly.
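A rough sketch of just the partition step of that search; novowel_words is a hypothetical set of dictionary words with their vowels stripped out, and the vowel-list constraint is left out for brevity:

def segment(consonants, novowel_words, max_word_len=10):
    if not consonants:
        return [[]]                                   # one valid partition: the empty one
    partitions = []
    for i in range(1, min(max_word_len, len(consonants)) + 1):
        head = consonants[:i]
        if head in novowel_words:                     # only recurse on valid no-vowel words
            for rest in segment(consonants[i:], novowel_words, max_word_len):
                partitions.append([head] + rest)
    return partitions

# e.g. segment('hllwrld', {'hll', 'wrld'}) -> [['hll', 'wrld']]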
With GA: To solve this with a GA, I would create the dictionary of words without vowels. Then using the GA, create binary strings (as your chromosomes) of the same length as the input string of consonants, where 1 = split a word at that position and 0 = leave unchanged. These strings will all be the same length. Then create a fitness function that returns the proportion of words obtained after performing a split using the chromosome that are valid words without vowels, according to that dictionary. Create a second fitness function that takes the valid no-vowel words and computes the proportion of overlap between the vowels missing from all these valid no-vowel words and the original vowel list. Combine both fitness functions into one by multiplying the value from the first one by ten (assuming the second one returns a value between 0 and 1). That will force the algorithm to focus on the segmentation problem first and the vowel insertion problem second, and will also favor segmentations of equal quality whose set of missing vowels is closer to the original list. I would also include crossover in the solution; as all your chromosomes are the same length, this should be trivial. Once you have a solution that scores perfectly on the fitness function, it should be trivial to recreate the original sentence given that dictionary of words without vowels (provided you maintain a second dictionary that lists the possible missing vowel sets for each no-vowel word - there could be multiple for each, as some words will look the same with the vowels removed).
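A rough sketch of that combined fitness; novowel_words is a hypothetical dictionary mapping each vowel-stripped word to its candidate missing-vowel sequences:

from collections import Counter

def split_by_chromosome(consonants, chromosome):
    words, start = [], 0
    for i, bit in enumerate(chromosome):
        if bit == 1:                                          # 1 = cut after this position
            words.append(consonants[start:i + 1])
            start = i + 1
    words.append(consonants[start:])
    return [w for w in words if w]

def fitness(consonants, vowels, chromosome, novowel_words):
    words = split_by_chromosome(consonants, chromosome)
    valid = [w for w in words if w in novowel_words]
    f1 = len(valid) / len(words)                              # proportion of valid no-vowel words
    missing = Counter()
    for w in valid:
        missing += Counter(novowel_words[w][0])               # one candidate vowel set per word
    overlap = sum((missing & Counter(vowels)).values())
    f2 = overlap / max(len(vowels), 1)                        # share of the original vowel list covered
    return 10 * f1 + f2                                       # segmentation dominates, vowels break ties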
Let's say you have several generations and you plot the cost of the best specimen in each generation (considering a long sentence). Does this graph keep going down, or does it converge to a specific value after 2-3 generations (let the algorithm run for, say, 10 generations)? Can you run your algorithm several times with various initial conditions (random sequences) and see whether you sometimes get good results or not?
Depending on the results, you may try the following (this graph is a really good tool for improving performance):
1) If the graph goes up and down too much all the time, you have too much mutation (average number of swaps per gene, for example); try to decrease it.
2) If you get stuck in a local minimum (the cost of the best specimen doesn't change much after some time), try to increase mutation, or run several isolated populations (3-4) of, say, 100 specimens at the beginning of your algorithm for a few generations. Then select the best population (the one closest to the global minimum) and try to improve it as much as possible through mutation.
PS: Interesting problem, by the way. I tried to figure out how to use crossover to improve the algorithm but haven't managed it.
The fitness function is the key to the success of a GA (which I kind of agree is suitable here).
I agree with @Simon that the vowel/non-vowel separation is not that important; just strip the vowels from your text corpus.
What is important in the fitness:
matched word frequency (frequent words are better)
grammar - the structure of the sentence (you might need to use NLTK to get related information)
And don't forget to update us on the end result ^^

Scoring a string based on how English-like it is

I'm not sure how exactly to word this question, so here's an example:
string1 = "THEQUICKBROWNFOX"
string2 = "KLJHQKJBKJBHJBJLSDFD"
I want a function that would score string1 higher than string2 and a million other gibberish strings. Note the lack of spaces, so this is a character-by-character function, not word-by-word.
In the 90s I wrote a trigram-scoring function in Delphi and populated it with trigrams from Huck Finn, and I'm considering porting the code to C or Python or kludging it into a stand-alone tool, but there must be more efficient ways by now. I'll be doing this millions of times, so speed is nice. I tried the Reverend.Thomas.Bayes() Python library and trained it with some all-caps strings, but it seems to require spaces between words and thus returns a score of []. I found some Markov chain libraries, but they also seemed to require spaces between words. Though from my understanding of them, I don't see why that should be the case...
Anyway, I do a lot of cryptanalysis, so in the future scoring functions that use spaces and punctuation would be helpful, but right now I need just ALLCAPITALLETTERS.
Thanks for the help!
I would start with a simple probability model for how likely each letter is, given the previous (possibly-null, at start-of-word) letter. You could build this based on a dictionary file. You could then expand this to use 2 or 3 previous letters as context to condition the probabilities if the initial model is not good enough. Then multiply all the probabilities to get a score for the word, and possibly take the Nth root (where N is the length of the string) if you want to normalize the results so you can compare words of different lengths.
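A minimal sketch of the bigram version of this, trained from a word list with add-one smoothing (the dictionary path in the comment is just an example):

import math
from collections import defaultdict

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"

def train_bigrams(words):
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        word = "^" + word.upper()                 # "^" marks start-of-word
        for prev, cur in zip(word, word[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev, nxt in counts.items():
        total = sum(nxt.values())
        probs[prev] = {c: (nxt.get(c, 0) + 1) / (total + len(ALPHABET))   # add-one smoothing
                       for c in ALPHABET}
    return probs

def score(text, probs):
    text = "^" + text.upper()
    logp = 0.0
    for prev, cur in zip(text, text[1:]):
        logp += math.log(probs.get(prev, {}).get(cur, 1.0 / len(ALPHABET)))
    return logp / (len(text) - 1)                 # per-letter average, i.e. log of the Nth root

# probs = train_bigrams(open('/usr/share/dict/words').read().split())
# score('THEQUICKBROWNFOX', probs) > score('KLJHQKJBKJBHJBJLSDFD', probs)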
I don't see why a Markov chain couldn't be modified to work. I would create a text file dictionary of sorts, and read that in to initially populate the data structure. You would just be using a chain of n letters to predict the next letter, rather than n words to predict the next word. Then, rather than randomly generating a letter, you would likely want to pull out the probability of the next letter. For instance if you had the current chain of "TH" and the next letter was "E", you would go to your map, and see the probability that an "E" would follow "TH". Personally I would simply add up all of these probabilities while looping through the string, but how to exactly create a score from the probability is up to you. You could normalize it for string length, to let you compare short and long strings.
Now that I think about it, this method would favor strings with longer words, since a dictionary would not include phrases. Then again, you could populate the dictionary not only with single words, but with short phrases with the spaces removed as well. Then the scoring would reflect not only how English the separate words are, but how English a series of words is. It's not a perfect system, but it would provide consistent scoring.
I don't know how it works, but Mail::SpamAssassin::Plugin::TextCat analyzes email and guesses what language it is (with dozens of languages supported).
The Index of Coincidence might be of help here, see https://en.wikipedia.org/wiki/Index_of_coincidence.
For a start, just compute the difference between the IC and the expected value of 1.73 (see the Wikipedia article above). For advanced usage you might want to calculate the expected value yourself using an example corpus of the language.
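A small sketch of the normalized (x26) index of coincidence, whose expected value for English text is roughly 1.73 versus about 1.0 for uniformly random letters:

from collections import Counter

def index_of_coincidence(text):
    letters = [c for c in text.upper() if c.isalpha()]
    n = len(letters)
    counts = Counter(letters)
    return 26 * sum(k * (k - 1) for k in counts.values()) / (n * (n - 1))

# The closer the result is to ~1.73, the more English-like the letter distribution;
# note the statistic is only reliable for reasonably long strings.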
I'm thinking that maybe you could apply some text-to-speech synthesis ideas here. In particular, if a speech synthesis program is able to produce a pronunciation for a word, then that can be considered "English."
The pre-processing step is called grapheme-to-phoneme conversion, and typically leads to probabilities of mapping strings to sounds.
Here's a paper that describes some approaches to this problem. (I don't claim this paper is authoritative, as it just was a highly ranked search result, and I don't really have expertise in this area.)
