Best match for input query from a set of documents - python

I have 8 documents and I ran TF-IDF on it to get an array. I don't understand how I find out which is the best document match for a given input query?
all_documents = [doc1, doc2, ...., doc7]
sklearn_tfidf = TfidfVectorizer(norm='l2',min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents).toarray()

Transform the input to tf-idf format using TfidfVectorizer. You can then use a distance metric (cosine, euclidean, manhattan, ...) to calculate the document that is closest to your input.
Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The sklearn_tfidf object that you created has an attribute vocabulary_ that contains all words that are used in the vectors. Your input query should be reduced to only containing those words.
Example
Document1: dogs are cute
Document2: cats are awful
Leads to a vocabulary of [dogs, cats, are, cute, awful]. A query containing other words than these 5 cannot be used. For example if your query is cute animals, the animals has no meaning, because it cannot be found in one of the documents. The query thus reduces to following vector: [0,0,0,1,0] since cute is the only word that can be found in the documents.

Related

Word2Vec/Doc2vec + K-Means clustering - how to extract "semantically meaningful" centroids?

I have a list of unique terms, say for example:
['Dollars','Cash','International Currency','Credit card','Comics','loans','David Beckham','soccer','Iron Man','checks','Euros','World Cup','Marvel Cinematic Universe','Champions league','Superman']
Ultimately I want to achieve the following mapping:
['Dollars','Cash','International Currency','Credit card','loans','checks','Euros','World']: 'Money and finance'
['Comics','Iron Man','Marvel Cinematic Universe','Superman']: 'Comics and Superherores'
['David Beckham','soccer','World Cup','Champions league']: 'Soccer, Football'
My idea is to use a text embedding like word2vec or doc2vec, and then cluster the embeddings using K-Means. So far this is very straightforward. But then I would like to map the resulting embedding centroids to 2 or 3 relevant terms. Is there a way to go from the numerical embedding centroid to semantically meaningful terms?
If there is a better way to do this other than Embedding > Clustering > Extract meaning from centroid I could try that as well.
A couple of things to note about this is that: The terms in my lists are unique - either individual words, compound terms, or vert short sentences, not paragraphs or documents - so frequency or word count based methods are not applicable. And the lists also contain a lot of noise, e.g. "xxx thx u" and "Hello Mr. Johnson", etc...
So my two asks are:
What is the best way to achieve this mapping?
And how can we map a centroid from an embedding space to a small set of meaningful terms?

Gensim Doc2Vec model only generates a limited number of vectors

I am using gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (I have explained what my problem is in the code):
cores = multiprocessing.cpu_count()
# creating a list of tagged documents
training_docs = []
# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
# 'doc' is in unicode format and I have already preprocessed it
training_docs.append(TaggedDocument(doc.split(), str(index+1)))
# at this point, I have 53 strings in my 'training_docs' list
model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)
# now that I print the vectors, I only have 10 vectors while I should have 53 vectors for the 53 documents that I have in my training_docs list.
print(len(model.docvecs))
# output: 10
I am just wondering if I am doing a mistake or if there is any other parameter that I should set?
UPDATE: I was playing with the tags parameter in TaggedDocument, and when I changed it to a mixture of text and numbers like: Doc1, Doc2, ... I see a different number for the count of generated vectors, but still I do not have the same number of feature vectors as expected.
Look at the actual tags it has discovered in your corpus:
print(model.docvecs.offset2doctag)
Do you see a pattern?
The tags property of each document should be a list of tags, not a single tag. If you supply a simple string-of-an-integer, it will see it as a list-of-digits, and thus only learn the tags '0', '1', ..., '9'.
You could replace str(index+1) with [str(index+1)] and get the behavior you were expecting.
But, since your document IDs are just ascending integers, you can also just use plain Python ints as your doctags. This will save some memory, buy avoiding the creation of a lookup dict from string-tag to array-slot (int). To do this, replace the str(index+1) with [index]. (This starts the doc-IDs from 0 – which is a teensy bit more Pythonic, and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)

Python - tf-idf predict a new document similarity

Inspired by this answer, I'm trying to find cosine similarity between a trained trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector and not a new query
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea.
How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])
This problem can be partially addressed by combining the vector space model (which is the tf-idf & cosine similarity) together with the boolean model. These are concepts of information theory and they are used (and nicely explained) in ElasticSearch- a pretty good search engine.
The idea is simple: you store your documents as inverted indices. Which is comparable to the words present at the end of a book, which hold a reference to the pages (documents) they were mentioned in.
Instead of calculating the tf-idf vector for all document it will only calculate it for documents that have at least one (or specify a threshold) of the words in common. This can be simply done by looping over the words in the queried document, finding the documents that also have this word using the inverted index and calculate the similarity for those.
You should take a look at gensim. Example starting code looks like this:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
corpus = [dictionary.doc2bow(line.lower().split()) for line in open('corpus.txt')]
tfidf = models.TfidfModel(corpus)
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=12)
At prediction time you first get the vector for the new doc:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]
Then get the similarities (sorted by most similar):
sims = index[vec_tfidf] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
This does a linear scan like you wanted to do but they have a more optimized implementation. If the speed is not enough then you can look into approximate similarity search (Annoy, Falconn, NMSLIB).
For huge data sets, there is a solution called Text Clustering By Concept. search engines use this Technic,
At first step, you cluster your documents to some groups(e.g 50 cluster), then each cluster has a representative document(that contain some words that have some useful information about it's cluster)
At second step, for calculating cosine similarity between New Document and you data set, you loop through all representative(50 numbers) and find top near representatives(e.g 2 representative)
At final step, you can loop through all documents in selected representative and find nearest cosine similarity
With this Technic, you can reduce the number of loops and improve performace,
You can read more tecnincs in some chapter of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Matching words and vectors in gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Problem:
Match words to embedding vectors created by Word2Vec () -- how do I do it?
My approach:
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0.
Thus
for i in range (0, newmod.syn0.shape[0]): #iterate over all words in model
print i
word= [k for k in newmod.vocab if newmod.vocab[k].__dict__['index']==i] #get the word out of the internal dicationary by its index
wordvector= newmod.syn0[i] #get the vector with the corresponding index
print wordvector == newmod[word] #testing: compare result of looking up the word in the model -- this prints True
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings
model.save("newmodel")
I found this question which is similar but has not really been answered.
I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary... here is the answer : use model.index2word which is simply the list of words in the right order !
This is not in the official documentation (why ?) but it can be found directly inside the source code : https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441
If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] will give you the vector corresponding to hello.
If you need to recover a word from a vector you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:
import gensim
documents = [['human', 'interface', 'computer'],
['survey', 'user', 'computer', 'system', 'response', 'time'],
['eps', 'user', 'interface', 'system'],
['system', 'human', 'system', 'eps'],
['user', 'response', 'time'],
['trees'],
['graph', 'trees'],
['graph', 'minors', 'trees'],
['graph', 'minors', 'survey']]
model = gensim.models.Word2Vec(documents, min_count=1)
print model.similar_by_vector(model["survey"], topn=1)
which outputs:
[('survey', 1.0000001192092896)]
where the number represents the similarity.
However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.
So I found an easy way to do this, where nmodel is the name of your model.
#zip the two lists containing vectors and words
zipped = zip(nmodel.wv.index2word, nmodel.wv.syn0)
#the resulting list contains `(word, wordvector)` tuples. We can extract the entry for any `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if i[1] == vector]
This is based on the gensim code. For older versions of gensim, you might need to drop the wv after the model.
As #bpachev mentioned, gensim does have an option of searching by vector, namely similar_by_vector.
It however implements a brute force linear search, i.e. computes cosine similarity between given vector and vectors of all words in vocabulary, and gives off the top neighbours. An alternate option, as mentioned in the other answer is to use an approximate nearest neighbour search algorithm like FLANN.
Sharing a gist demonstrating the same:
https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b

Use sklearn to find string similarity between two texts with large group of documents

Given a large set of documents (book titles, for example), how to compare two book titles that are not in the original set of documents, or without recomputing the entire TF-IDF matrix?
For example,
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
book_titles = ["The blue eagle has landed",
"I will fly the eagle to the moon",
"This is not how You should fly",
"Fly me to the moon and let me sing among the stars",
"How can I fly like an eagle",
"Fixing cars and repairing stuff",
"And a bottle of rum"]
vectorizer = TfidfVectorizer(stop_words='english', norm='l2', sublinear_tf=True)
tfidf_matrix = vectorizer.fit_transform(book_titles)
To check the similarity between the first and the second book titles, one would do
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix[1:2])
and so on. This considers that the TF-IDF will be calculated with respect all the entries in the matrix, so the weights will be proportional to the number of times a token appears in all corpus.
Let's say now that two titles should be compared, title1 and title2, that are not in the original set of book titles. The two titles can be added to the book_titles collection and compared afterwards, so the word "rum", for example, will be counted including the one in the previous corpus:
title1="The book of rum"
title2="Fly safely with a bottle of rum"
book_titles.append(title1, title2)
tfidf_matrix = vectorizer.fit_transform(book_titles)
index = tfidf_matrix.shape()[0]
cosine_similarity(tfidf_matrix[index-3:index-2], tfidf_matrix[index-2:index-1])
what is really impratical and very slow if documents grow very large or need to be stored out of memory. What can be done in this case? If I compare only between title1 and title2, the previous corpus will not be used.
Why do you append them to the list and recompute everything? Just do
new_vectors = vectorizer.transform([title1, title2])

Categories