I have a large dataset containing a mix of words and short phrases, such as:
dataset = [
"car",
"red-car",
"lorry",
"broken lorry",
"truck owner",
"train",
...
]
I am trying to find a way to determine the most similar word from a short sentence, such as:
input = "I love my car that is red" # should map to "red-car"
input = "I purchased a new lorry" # should map to "lorry"
input = "I hate my redcar" # should map to "red-car"
input = "I will use my truck" # should map to "truck owner"
input = "Look at that yellow lorri" # should map to "lorry"
I have tried a number of approaches to this, to no avail, including:
Vectorizing the dataset and the input using TfidfVectorizer, then calculating the cosine similarity of the vectorized input against each individual vectorized item from the dataset.
The problem is, this only really works if the input contains the exact word(s) that are in the dataset. So, for example, if the input = "trai", it would have a cosine value of 0, whereas I am trying to get it to map to the value "train" in the dataset.
The most obvious solution would be to perform a simple spell check, but that may not be a valid option, because I still want to choose the most similar result, even when the words are slightly different, i.e.:
input = "broke" # should map to "broken lorry" given the above dataset
If someone could suggest another potential approach I could try, that would be much appreciated.
As #Aaalok has suggested in the comments, one idea is to use a different distance/similarity function. Possible candidates include (a sketch of both follows this list):
Levenshtein distance (measures the number of changes to transform one string into the other)
N-gram similarity (measures the number of shared n-grams between both strings)
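A minimal sketch of both ideas in plain Python (the helper names are just illustrative, not from any particular library):
def levenshtein(a, b):
    """Number of single-character edits needed to turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(current[j - 1] + 1,             # insertion
                               previous[j] + 1,                # deletion
                               previous[j - 1] + (ca != cb)))  # substitution
        previous = current
    return previous[-1]

def ngram_similarity(a, b, n=3):
    """Jaccard overlap of character n-grams."""
    grams = lambda s: {s[i:i + n] for i in range(max(len(s) - n + 1, 1))}
    ga, gb = grams(a), grams(b)
    return len(ga & gb) / len(ga | gb)

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
print(min(dataset, key=lambda item: levenshtein("trai", item)))  # -> train
print(ngram_similarity("redcar", "red-car"))                     # ~0.29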
Another possibility is feature generation, i.e. enhancing the items in your dataset with additional strings. These could be n-grams, stems, or whatever suits your needs. For example, you could (automatically) expand red-car into
red-car red car
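A rough sketch of that expansion step (the splitting rules here are just one possible choice):
import re

def expand(entry):
    variants = {entry}
    parts = re.split(r"[-\s]+", entry)
    variants.update(parts)         # "red-car" -> "red", "car"
    variants.add(" ".join(parts))  # "red car"
    variants.add("".join(parts))   # "redcar", so a squashed input can still hit it
    return variants

print(expand("red-car"))  # {'red-car', 'red', 'car', 'red car', 'redcar'}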
Paragraph vectors (doc2vec) should solve your problem, provided you have a large enough and suitable dataset. Of course, you'll have to do a lot of tuning to get your results right. You could try gensim or deeplearning4j. You may still need other methods to handle spelling mistakes.
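If you go down that route, a minimal gensim sketch might look like this (gensim 4.x API assumed; with a dataset of short phrases the vectors will be weak, so treat this purely as a mechanical illustration):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

dataset = ["car", "red-car", "lorry", "broken lorry", "truck owner", "train"]
docs = [TaggedDocument(words=item.replace("-", " ").split(), tags=[item])
        for item in dataset]

model = Doc2Vec(documents=docs, vector_size=50, min_count=1, epochs=100)

query = "I love my car that is red".lower().split()
inferred = model.infer_vector(query, epochs=100)
print(model.dv.most_similar([inferred], topn=1))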
Related
Questions like 1 and 2 give answers for retrieving vocabulary frequencies from gensim word2vec models.
For some reason, they actually just give a descending counter from n (the size of the vocabulary) down to 1, alongside the tokens ordered from most to least frequent.
For example:
for idx, w in enumerate(model.vocab):
    print(idx, w, model.vocab[w].count)
Gives:
0 </s> 111051
1 . 111050
2 , 111049
3 the 111048
4 of 111047
...
111050 tokiwa 2
111051 muzorewa 1
Why is it doing this? How can I extract term frequencies from the model, given a word?
Those answers are correct for reading the declared token-counts out of a model which has them.
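For reference, a word's declared count can be read directly from the vocabulary (a hedged sketch; the exact attribute depends on your Gensim version, and you drop the .wv if your object is already a plain KeyedVectors, as in the question's loop):
# Gensim 3.x style
count = model.wv.vocab["the"].count
# Gensim 4.x removed the vocab dict in favour of per-key attributes
count = model.wv.get_vecattr("the", "count")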
But in some cases, your model may only have been initialized with a fake, descending-by-1 count for each word. In Gensim, this is most likely if the model was loaded from a source where the counts either weren't available or weren't used.
In particular, if you created the model using load_word2vec_format(), that simple vectors-only format (whether binary or plain-text) inherently contains no word counts. But, by convention, the words in such files are almost always sorted from most frequent to least frequent.
So, Gensim has chosen, when frequencies are not present, to synthesize fake counts, with linearly descending int values, where the (first) most-frequent word begins with the count of all unique words, and the (last) least-frequent word has a count of 1.
(I'm not sure this is a good idea, but Gensim's been doing it for a while, and it ensures code relying on the per-token count won't break, and will preserve the original order, though obviously not the unknowable original true-proportions.)
In some cases, the original source of the file may have saved a separate .vocab file with the word-frequencies alongside the word2vec_format vectors. (In Google's original word2vec.c code release, this is the file generated by the optional -save-vocab flag. In Gensim's .save_word2vec_format() method, the optional fvocab parameter can be used to generate this side file.)
If so, that 'vocab' frequencies filename may be supplied, when you call .load_word2vec_format(), as the fvocab parameter - and then your vector-set will have true counts.
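A hedged sketch of that load (the file names are placeholders for wherever your vectors and the side vocab file live):
from gensim.models import KeyedVectors

kv = KeyedVectors.load_word2vec_format(
    "vectors.bin",           # placeholder: the word2vec_format vectors
    fvocab="vectors.vocab",  # placeholder: the side file with real word counts
    binary=True,
)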
If your word-vectors were originally created in Gensim from a corpus giving actual frequencies, and were always saved/loaded using the Gensim native functions .save()/.load() which use an extended form of Python-pickling, then the original true count info will never have been lost.
If you've lost the original frequency data, but you know the data was from a real natural-language source, and you want a more realistic (but still faked) set of frequencies, an option could be to use the Zipfian distribution. (Real natural-language usage frequencies tend to roughly fit this 'tall head, long tail' distribution.) A formula for creating such more-realistic dummy counts is available in the answer:
Gensim: Any chance to get word frequency in Word2Vec format?
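A rough sketch of the idea (this is the generic count-proportional-to-1/rank form, not necessarily the exact formula from the linked answer; both totals below are made up):
vocab_size = 100_000       # illustrative
total_words = 10_000_000   # illustrative assumed corpus size
harmonic = sum(1.0 / r for r in range(1, vocab_size + 1))
zipf_counts = [max(1, int(total_words / (rank * harmonic)))
               for rank in range(1, vocab_size + 1)]
print(zipf_counts[:3], zipf_counts[-1])  # tall head, long tail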
I'm working with a kind of unique situation. I have words in Language1 that I've defined in English. I then took each English word, took its word vector from a pretrained GoogleNews w2v model, and averaged the vectors for every definition. The result looks like this (an example with 3-dimensional vectors):
L1_words={
'word1': array([ 5.12695312e-02, -2.23388672e-02, -1.72851562e-01], dtype=float32),
'word2': array([ 5.09211312e-02, -2.67828571e-01, -1.49875201e-03], dtype=float32)
}
What I want to do is cluster (using K-means probably, but I'm open to other ideas) the keys of the dict by their numpy-array values.
I've done this before with standard w2v models, but the issue I'm having is that this is a dictionary. Is there another data structure I can convert this to? I'm inclined to write it to a csv/make it into a pandas dataframe and use Pandas or R to work on it like that, but I'm told that floats are a problem when it comes to things requiring binary (as in: they lose information in unpredictable ways). I tried saving my dictionary to hdf5, but dictionaries are not supported.
Thanks in advance!
If I understand your question correctly, you want to cluster words according to their W2V representation, but you are saving it as a dictionary. If that's the case, I don't think it is a unique situation at all. All you have to do is convert the dictionary into a matrix and then perform clustering on the matrix. If each row of the matrix corresponds to one word in your dictionary, you can map the cluster assignments back to the words afterwards.
I couldn't test the code below, so it may not be completely functional, but the idea is the following:
from nltk.cluster import KMeansClusterer
import nltk

# make the matrix with the words
words = list(L1_words.keys())  # list() so the keys can be indexed later
X = []
for w in words:
    X.append(L1_words[w])

# perform the clustering on the matrix
NUM_CLUSTERS = 3
kclusterer = KMeansClusterer(NUM_CLUSTERS, distance=nltk.cluster.util.cosine_distance)
assigned_clusters = kclusterer.cluster(X, assign_clusters=True)

# print the cluster each word belongs to
for i in range(len(X)):
    print(words[i], assigned_clusters[i])
You can read more details at this link.
I have 12 million company names in my database. I want to match them against a list offline.
I want to know the best algorithm to do so. I have tried Levenshtein distance, but it is not giving the expected results. Could you please suggest some algorithms for this? The problem is matching companies like:
G corp. ---- this needs to be mapped to G corporation
water Inc ---- Water Incorporated
You should probably start by expanding the known suffixes in both lists (the database and the list). This will take some manual work to figure out the correct mapping, e.g. with regexps:
\s+inc\.?$ -> Incorporated
\s+corp\.?$ -> Corporation
You may want to do other normalization as well, such as lower-casing everything, removing punctuation, etc.
You can then use Levenshtein distance or another fuzzy matching algorithm.
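A hedged sketch of that pipeline (the suffix list is just the two examples above, and difflib's ratio stands in for whatever fuzzy scorer you choose):
import re
from difflib import SequenceMatcher

SUFFIXES = [
    (re.compile(r"\s+inc\.?$", re.IGNORECASE), " incorporated"),
    (re.compile(r"\s+corp\.?$", re.IGNORECASE), " corporation"),
]

def normalize(name):
    name = name.lower().strip()
    for pattern, replacement in SUFFIXES:
        name = pattern.sub(replacement, name)
    return re.sub(r"[^\w\s]", "", name)  # drop remaining punctuation

def similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()

print(similarity("water Inc", "Water Incorporated"))  # 1.0 after normalization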
You can use fuzzyset: put all your company names in the fuzzy set, then match a new term to get matching scores. An example:
import fuzzyset
fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)
#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
Also, if you want to work with semantics instead of just the strings (although string matching tends to work better in scenarios like this), have a look at spaCy similarity. An example from the spaCy docs:
import spacy
nlp = spacy.load('en_core_web_md') # make sure to use larger model!
tokens = nlp(u'dog cat banana')
for token1 in tokens:
    for token2 in tokens:
        print(token1.text, token2.text, token1.similarity(token2))
Interzoid's Company Name Match Advanced API generates similarity keys to help solve this: you call the API to generate a similarity key that strips out the noise using known synonyms, Soundex, ML, etc., and then match on the similarity key rather than on the data itself for much higher match rates. (Commercial API; disclaimer: I work for Interzoid.)
https://interzoid.com/services/getcompanymatchadvanced
Use MatchKraft to fuzzy match company names on two lists.
http://www.matchkraft.com/
Levenshtein distance alone is not enough to solve this problem. You also need the following:
Heuristics to improve execution time
Information retrieval (Lucene) and SQL
Company names database
It is better to use an existing tool rather than writing your own program in Python.
When I train Doc2vec (using Gensim's Doc2vec in Python) on a corpus of about 10k documents (each a few hundred words long) and then infer document vectors using the same documents, they are not at all similar to the trained document vectors. I would expect them to be at least somewhat similar.
That is, I compare model.docvecs['some_doc_id'] and model.infer_vector(documents['some_doc_id']).
Cosine distances between trained and inferred vectors for the first few documents:
0.38277733326
0.284007549286
0.286488652229
0.173178792
0.370117008686
0.275438070297
0.377647638321
0.171194493771
0.350615143776
0.311795353889
0.342757165432
As you can see, they are not really similar. If the similarity is so terrible even for documents used for training, I can't even begin to try to infer unseen documents.
Training configuration:
model = Doc2Vec(documents=documents, dm=1, size=100, window=6, alpha=0.1, workers=4,
seed=44, sample=1e-5, iter=15, hs=0, negative=8, dm_mean=1, min_alpha=0.01, min_count=2)
Inferring:
model.infer_vector(tokens, steps=20, alpha=0.025)
Side note: documents are always preprocessed the same way (I checked that the same list of tokens goes into training and into inferring).
I also played around with the parameters a bit, and the results were similar. So if your suggestion is something like "try increasing or decreasing this or that training parameter", I've most likely tried it. Maybe I just didn't come across the 'correct' parameters, though.
Thanks for any suggestions as to what I can do to make it work better.
EDIT: I am willing and able to use any other available Python implementation of paragraph vectors (doc2vec); it doesn't have to be this one, if you know of another that can achieve better results.
EDIT: Minimal working example
import fnmatch
import os
from scipy.spatial.distance import cosine
from gensim.models import Doc2Vec
from gensim.models.doc2vec import TaggedDocument
from keras.preprocessing.text import text_to_word_sequence
files = {}
folder = 'some path' # each file contains few regular sentences
for f in fnmatch.filter(os.listdir(folder), '*.sent'):
    files[f] = open(folder + '/' + f, 'r', encoding="UTF-8").read()
documents = []
for k, v in files.items():
    words = text_to_word_sequence(v, lower=True)  # converts string to list of words, removes commas etc.
    documents.append(TaggedDocument(tags=[k], words=words))
d2 = Doc2Vec(size=200, documents=documents)
for doc in documents:
    trained = d2.docvecs[doc.tags[0]]
    inferred = d2.infer_vector(doc.words, steps=50)
    print(cosine(trained, inferred))  # cosine distance from scipy
What is the type of your documents object, and are you sure that it is a multiply-iterable object, so that the model can do all of its 16 passes over the set of TaggedDocument-shaped text examples? That is, does iter(documents) always return a fresh iterator, with all items as TaggedDocument-shaped objects with the right list-of-words in words and list-of-tags in tags? (A common error is to supply a corpus that can be iterated over only once, and then to ignore any logged hints/warnings that no real training has happened. The inference/similarity results from such a model will be essentially random.)
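For example, a corpus object whose __iter__ re-reads the files each time it is called is safely multi-iterable (a sketch only; the path and tokenizer are placeholders for whatever you actually use):
import os
from gensim.models.doc2vec import TaggedDocument

class SentCorpus:
    def __init__(self, folder):
        self.folder = folder
    def __iter__(self):
        for fname in os.listdir(self.folder):
            if not fname.endswith('.sent'):
                continue
            with open(os.path.join(self.folder, fname), encoding='utf-8') as fh:
                yield TaggedDocument(words=fh.read().lower().split(), tags=[fname])

documents = SentCorpus('some path')  # iter(documents) is fresh on every call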
Then for infer_vector(), does documents[tag] really return just the list-of-words it expects (not TaggedDocument or string)? (Users often supply strings, rather than lists-of-tokens, for training or inference words and get results that are just noise.)
Was there evaluation-guided reason for changing various defaults, either a little (window=6, negative=8) or a lot (alpha=0.1, min_count=2)? Such tweaks may not be a major factor in your problem, and there's nothing magical about the class defaults. But until you have the basics working, it's best to stick close to common configuration. (And then even after the basics are working, limit changes to those that can be demonstrated as better via a repeatable scoring process.)
Some report needing much higher steps values – 100 or more – to get better inference results, though that would be most crucial for very small documents (a handful to a couple dozen words) rather than the few-hundred-word documents you describe.
A corpus of 10k documents is on the small side for Paragraph Vectors (Doc2Vec), but with your smallish vector-size (100) and larger number of iterations (15), it might be workable.
If you're still having problems, you should expand your question with more code showing how documents works, some suggestive example documents, and your cosine-similarity evaluation process – to see if there are any oversights at each of those steps.
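One common sanity check (a sketch, using the d2 model and documents list from your minimal example and the gensim 3.x-style docvecs API): re-infer a vector for each training document and see whether the document's own tag is ranked at or near the top.
import collections

ranks = []
for doc in documents:
    inferred = d2.infer_vector(doc.words, steps=50)
    sims = d2.docvecs.most_similar([inferred], topn=len(d2.docvecs))
    ranks.append([tag for tag, _ in sims].index(doc.tags[0]))

print(collections.Counter(ranks))  # ideally most documents rank themselves at 0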
I'm trying to solve a problem of sentence comparison using a naive approach of summing up word vectors and comparing the results. My goal is to match people by interest, so the dataset consists of names and short sentences describing their hobbies. The batches are fairly small, a few hundred people, so I wanted to give it a try before digging into doc2vec.
I prepare the data by cleaning it thoroughly, removing stop words, tokenizing, and lemmatizing. I use a pre-trained model for word vectors, which returns adequate results when finding similarities for some test words. I also tried summing up the sentence words to find similarities in the original model - the matches do make sense, and the similarities reflect the general sense of the phrase.
For sentence matching I'm trying the following: create an empty model
b = gs.models.Word2Vec(min_count=1, size=300, sample=0, hs=0)
Build vocab out of names (or person id's), no training
#first create vocab with an empty vector
test = [['test']]
b.build_vocab(test)
b.wv.syn0[b.wv.vocab['test'].index] = b.wv.syn0[b.wv.vocab['test'].index]*0
#populate vocab from an array
b.build_vocab([personIds], update=True)
Sum each sentence's word vectors and store the result in the model for the corresponding id
# sentences are pulled from pandas dataset df. 'a' is a pre-trained model I use to get vectors for each word
def summ(phrase, start_model):
    '''
    vector addition function
    '''
    # starting with a vector of 0's
    sum_vec = start_model.word_vec("cat_NOUN") * 0
    for word in phrase:
        sum_vec += start_model.word_vec(word)
    return sum_vec

for i, row in df.iterrows():
    try:
        personId = row["ID"]
        summVec = summ(df.iloc[i, 1], a)
        # updating syn0 for each name/id in vocabulary
        b.wv.syn0[b.wv.vocab[personId].index] = summVec
    except:
        pass
I understand that I shouldn't be expecting much accuracy here, but the t-SNE plot doesn't show any clustering whatsoever. The similarity lookups also fail to find matches (a similarity coefficient below 0.2 for basically everything).
Wondering if anyone has an idea of where I went wrong? Is my approach valid at all?
Your code, as shown, neither does any train() of word-vectors (using your local text), nor does it pre-load any vectors from elsewhere. So any vectors which do exist – created by the build_vocab() calls – will still just be in their randomly-initialized starting locations, and be useless for any semantic purposes.
Suggestions:
either (a) train your own vectors from your text, which makes sense if you have a good quantity of text; or (b) load vectors from elsewhere. But don't try to do both. (Or, in the case of the code above, neither.)
The update=True option for build_vocab() should be considered an expert, experimental option – only worth tinkering with if you've already had things working in simpler modes, and you're sure you need it and understand all the implications.
Normal use won't ever explicitly re-assign new values into the Word2Vec model's syn0 property - those are managed by the class's training routines, so you never need to zero them out or modify them. You should tally up your own text summary vectors, based on word-vectors, outside the model in your own data structures.
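A hedged sketch of that last point, reusing the question's df and pre-trained model a: keep the per-person summary vectors in an ordinary dict instead of writing them into a Word2Vec model.
import numpy as np

person_vecs = {}
for i, row in df.iterrows():
    phrase = df.iloc[i, 1]  # the tokenized hobby sentence, as in the question
    vecs = [a.word_vec(w) for w in phrase if w in a]  # skip out-of-vocabulary words
    if vecs:
        person_vecs[row["ID"]] = np.mean(vecs, axis=0)

def cos_sim(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# e.g. cos_sim(person_vecs[id1], person_vecs[id2]) to compare two people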