How to compare two text documents with TfidfVectorizer?


I have two different texts that I want to compare using TF-IDF vectorization.
What I am doing is:
1. tokenizing each document
2. vectorizing using TfidfVectorizer.fit_transform(tokens_list)
The vectors that I get after step 2 have different shapes.
But as I understand the concept, both vectors need the same shape before they can be compared.
What am I doing wrong? Please help.
Thanks in advance.

As G. Anderson already pointed out, and to help future readers: when we call fit() of TfidfVectorizer on document D1, the vocabulary (the bag of words) and the IDF weights are learned from D1.
The transform() function then computes the TF-IDF weight of each word in that vocabulary.
Our aim is to compare document D2 with D1, that is, to see how many words of D1 also occur in D2. That's why we call fit_transform() on D1 and only transform() on D2: transform() reuses the vocabulary and IDF weights learned from D1 and computes the TF-IDF weights of D2's tokens against them, so both vectors have the same number of columns. Words in D2 that never occur in D1 are ignored.
This gives a relative comparison of D2 against D1.
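To make the fit/transform split concrete, here is a minimal sketch. The two example sentences are made up, and cosine similarity is just one reasonable way to compare the resulting vectors; the answer above does not prescribe it.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical example documents (not from the question)
d1 = "the cat sat on the mat"
d2 = "the dog sat on the log"

vectorizer = TfidfVectorizer()
v1 = vectorizer.fit_transform([d1])   # learn vocabulary and IDF from D1
v2 = vectorizer.transform([d2])       # reuse D1's vocabulary for D2

print(v1.shape, v2.shape)             # both have the same number of columns
sim = cosine_similarity(v1, v2)[0, 0]
print(sim)
```

If the comparison should be symmetric, fitting once on both documents, fit_transform([d1, d2]), gives a shared vocabulary instead of D1's only.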

I'm one of those later people :)
So my understanding of TF-IDF is that the IDF is computed from how frequently a word (or n-gram) occurs across both documents? So comparing what matches in each doesn't really account for how common a word is in both documents, for weeding out common words? Is there a way to do that with n-grams without the index error?
ValueError: Shape of passed values is (26736, 1), indices imply (60916, 1)
# Applying TFIDF to vectors
#instantiate tfidVectorizers()
ngram_vectorizer1 = TfidfVectorizer(ngram_range = (2,2)) #bigrams 1st vector
ngram_vectorizer2 = TfidfVectorizer(ngram_range = (2,2)) #bigrams 2nd
ngram_vectorizert = TfidfVectorizer(ngram_range = (2,2)) #bigrams total
# fit model
ngram_vector1 = ngram_vectorizer1.fit_transform(text)
ngram_vector2 = ngram_vectorizer2.fit_transform(text2)
ngram_vectort = ngram_vectorizert.fit_transform(total)
ngramfeatures1 = (ngram_vectorizer1.get_feature_names()) #save feature names
ngramfeatures2 = (ngram_vectorizer2.get_feature_names()) #save feature names
ngramfeaturest = (ngram_vectorizert.get_feature_names())
print("\n\nngramfeatures1 : \n", ngramfeatures1)
print("\n\nngramfeatures2 : \n", ngramfeatures2)
print("\n\nngram_vector1 : \n", ngram_vector1.toarray())
print("\n\nngram_vector2 : \n", ngram_vector2.toarray())
#Compute the IDF values
first_tfidf_transformer_ngram=TfidfTransformer(smooth_idf=True,use_idf=True)
second_tfidf_transformer_ngram=TfidfTransformer(smooth_idf=True,use_idf=True)
total_tfidf_transformer_ngram=TfidfTransformer(smooth_idf=True,use_idf=True)
first_tfidf_transformer_ngram.fit(ngram_vector1)
second_tfidf_transformer_ngram.fit(ngram_vector2)
total_tfidf_transformer_ngram.fit(ngram_vectort)
# print 1st idf values
ngram_first_idf = pd.DataFrame(first_tfidf_transformer_ngram.idf_, index=ngram_vectorizer1.get_feature_names(),columns=["idf_weights"])
# sort ascending
ngram_first_idf.sort_values(by=['idf_weights']) # this one should really be looking at something from the "total" calculations, if I'm understanding it correctly?

Related

Calculating TF-IDF Score of a Single String

I do string matching using TF-IDF and cosine similarity, and it works well for finding the similarity between strings in a list of strings.
Now I want to match a new string against the previously calculated matrix. I calculate the TF-IDF scores using the code below.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(min_df=1, analyzer=ngrams)
tf_idf_matrix = vectorizer.fit_transform(list_string)
How can I calculate the TF-IDF score of a new string against previous matrix? I can add the new string to the series and recalculate the matrix like below, but it will be inefficient since I only want the last index of the matrix and don't need the matrix of the old series to be recalculated.
list_string = list_string.append(new_string)
single_matrix = vectorizer.fit_transform(list_string)
single_matrix = single_matrix[len(list_string) - 1:]
After reading a while about TF-IDF calculation, I am thinking about saving the IDF value of each term and manually calculate the TF-IDF of the new string without using the matrix, but I don't know how to do that. How can I do this? Or is there any better way?

Refitting the TF-IDF in order to calculate the score of a single entry is not the way to go; simply call the .transform() method of the existing fitted vectorizer on your new string (not on the whole matrix). Note that transform() expects an iterable of documents, so wrap the string in a list:
single_entry = vectorizer.transform([new_string])
See the scikit-learn docs for TfidfVectorizer.transform().
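A small end-to-end sketch of that idea follows. The strings are made up, and a built-in character n-gram analyzer stands in for the custom ngrams function from the question, which isn't shown.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical data; the question uses its own `ngrams` analyzer instead
list_string = ["apple inc", "apple incorporated", "banana corp"]
new_string = "apple inc."

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tf_idf_matrix = vectorizer.fit_transform(list_string)   # fit once

single_entry = vectorizer.transform([new_string])       # no refit needed
scores = cosine_similarity(single_entry, tf_idf_matrix).ravel()
best = scores.argmax()
print(list_string[best], scores[best])
```

The IDF weights stay those of the original corpus, so n-grams that only occur in the new string are silently ignored.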

TFIDF Vectorizer within SciKit-Learn only returning 5 results

I am currently working with the TFIDF Vectorizer within SciKit-Learn. The Vectorizer is supposed to apply a formula to detect the most frequent word pairs (bigrams) within a Pandas DataFrame.
The below code section however only returns the frequency analysis for five bigrams while the dataset includes thousands of bigrams for which the frequencies should be calculated.
Does anyone have a smart idea to get rid of my error that limits the number of calculations to 5 responses? I have been researching regarding a solution but have not found the right tweak yet.
The relevant code section is shown below:
def get_top_n_bigram_Group2(corpus, n=None):
    # settings that you use for count vectorizer will go here
    tfidf_vectorizer=TfidfVectorizer(ngram_range=(2, 2), stop_words='english', use_idf=True).fit(corpus)
    # just send in all your docs here
    tfidf_vectorizer_vectors=tfidf_vectorizer.fit_transform(corpus)
    # get the first vector out (for the first document)
    first_vector_tfidfvectorizer=tfidf_vectorizer_vectors[0]
    # place tf-idf values in a pandas data frame
    df1 = pd.DataFrame(first_vector_tfidfvectorizer.T.todense(), index=tfidf_vectorizer.get_feature_names(), columns=["tfidf"])
    df2 = df1.sort_values(by=["tfidf"],ascending=False)
    return df2
And the output code looks like this:
for i in ['txt_pro','txt_con','txt_adviceMgmt','txt_main']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500)
    common_words.to_csv('output.csv')

The proposed changes to achieve what you asked for, also taking your comments into account, are as follows:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

def get_top_n_bigram_Group2(corpus, n=None, my_vocabulary=None):
    # settings that you use for count vectorizer will go here
    count_vectorizer = CountVectorizer(ngram_range=(2, 2),
                                       stop_words='english',
                                       vocabulary=my_vocabulary,
                                       max_features=n)
    # just send in all your docs here
    count_vectorizer_vectors = count_vectorizer.fit_transform(corpus)
    # create a list of (bigram, frequency) tuples
    sum_bigrams = count_vectorizer_vectors.sum(axis=0)
    bigram_freq = [(bigram, sum_bigrams[0, idx]) for bigram, idx in count_vectorizer.vocabulary_.items()]
    # place bigrams and their frequencies in a pandas data frame, sorted by frequency
    df1 = pd.DataFrame(bigram_freq, columns=["bigram", "frequency"]).set_index("bigram")
    df1 = df1.sort_values(by=["frequency"], ascending=False)
    return df1

# a list of predefined bigrams
my_vocabulary = ['bigram 1', 'bigram 2', 'bigram 3']
for i in ['text']:
    # Loop over the common words inside the JSON object
    common_words = get_top_n_bigram_Group2(df[i], 500, my_vocabulary)
    common_words.to_csv('output.csv')
If you do not provide the my_vocabulary argument in the get_top_n_bigram_Group2() then the CountVectorizer will count all bigrams without any restriction and will return only the top 500 (or whatever number you request in the second argument).
Please let me know if this is what you were looking for.
Note that the TFIDF is not returning frequencies but rather scores (or if you prefer 'weights').
I would understand the necessity to use TFIDF if you did not have a predefined list of bigrams and you were looking for a way to score among all possible bigrams and wanted to reject those that appear in all documents and have little information power (for example the bigram "it is" appears very frequently in texts but means very little).

Remove duplicate tweets that are 90% similar

I have extracted the tweets and I want to remove duplicate tweets. If I use pandas drop_duplicates(inplace=True), it only removes tweets that are 100% identical.
I want to know whether there is a way to remove tweets that differ slightly but are 90% similar to each other.
Example
When will this year end? There is only misery and bad stuff! I hate 2020!
When will this year end? There are miseries and bad stuff only! I hate 2020!
These tweets are almost identical; how can I remove them?

There's no simple answer to your question, but a couple of naive solutions might look something like the following.
Approach 1
Firstly, you need to define a similarity metric. A common (character-based) string comparison metric is Levenshtein distance, but I'd recommend having a look through fuzzywuzzy's README to find one that's appropriate for your use case. For this micro-demo, instead of using the fuzzywuzzy package, I'm using python-levenshtein.
Secondly, compare every tweet to every other tweet and compute the string similarity between them. Note that this is totally impractical if you're dealing with a large number of tweets, but let's roll with it. After comparing the strings, you can filter to get the indexes of the other strings which are close matches.
Using those indexes, we can create a graph of strings, for which I use the networkx package. This is necessary so we can extract the connected components of the graph, where each connected component represents a network of similar strings. That is not strictly true, since for deep graphs a string at one end won't necessarily be all that similar to a string at the other end. But in practice, it turns out to work pretty well.
Setup
import itertools
import random
import Levenshtein
import networkx as nx
import pandas as pd

df = pd.DataFrame({
    "tweet": ["When will this year end? There is only misery and bad stuff! I hate 2020!",
              "When will this year end? There are miseries and bad stuff only! I hate 2020!",
              "I am a tweet with no obvious duplicates",
              "Tweeeeeet!",
              "Tweeet",
              "Tweet tweet!"]
})
Logic
def compare(tweet1, threshold=0.7):
    # compare tweets using Levenshtein distance (or whatever string comparison metric)
    matches = df['tweet'].apply(lambda tweet2: (Levenshtein.ratio(tweet1, tweet2) >= threshold))
    # get positive matches
    matches = matches[matches].index.tolist()
    # convert to list of tuples
    return [*zip(iter(matches[:-1]), iter(matches[1:]))]

# create graph objects
nodes = df.index.tolist()
edges = [*itertools.chain(*df["tweet"].apply(compare))]
# create graphs
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_edges_from(edges)
# get connected component indexes
grouped_indexes = [*nx.connected_components(G)]
# get a random choice index from each group
filtered_indexes = [random.choice([*_]) for _ in grouped_indexes]
df.loc[filtered_indexes]
Output
A filtered subset of the original tweets DataFrame.
tweet
0 When will this year end? There is only misery ...
2 I am a tweet with no obvious duplicates
5 Tweet tweet!
Approach 2
We can use an unsupervised learning algorithm to cluster strings together, something like k-means. This is the bread and butter of unsupervised algorithms, and it has the drawback that you have to know the optimal number of clusters in advance, or more typically work it out via testing. But it has the great advantage that if you're adding more tweets to your dataset, you can quickly apply your clustering model and find similar tweets.
There's a million and one tutorials on how to do this, but the basic approach here would be to (1) clean your text, (2) convert your text into a TFIDF, (3) compute a similarity metric (cosine similarity is common) between each document pair, (4) then train your k-means (or similar) model.
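To give a feel for steps (2) and (4), here is a bare-bones sketch. The tweets and the choice of 3 clusters are made up, and cleaning is skipped. TfidfVectorizer L2-normalises its rows by default, so plain Euclidean k-means on them behaves much like clustering by cosine similarity, which is why there is no separate similarity step here.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical tweets: the first two are near-duplicates
tweets = [
    "When will this year end? There is only misery and bad stuff! I hate 2020!",
    "When will this year end? There are miseries and bad stuff only! I hate 2020!",
    "I am a tweet with no obvious duplicates",
    "Completely unrelated message about football scores",
]

tfidf = TfidfVectorizer().fit_transform(tweets)   # rows are L2-normalised

# n_clusters has to be chosen up front; 3 is a guess for this toy data
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(tfidf)

# keep the first tweet seen in each cluster
seen, kept = set(), []
for tweet, label in zip(tweets, labels):
    if label not in seen:
        seen.add(label)
        kept.append(tweet)
print(kept)
```

On real data the number of clusters is the hard part; with too few clusters, tweets that merely share a topic get dropped as "duplicates".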
If you're interested in this approach, here's a few random tutorials to follow I found after a quick Google.
https://towardsdatascience.com/k-means-clustering-8e1e64c1561c
http://brandonrose.org/clustering
https://iq.opengenus.org/document-clustering-nlp-kmeans/
...
Hope this helps!

You can use cosine similarity between two tweets:
from nltk.tokenize import word_tokenize  # needs NLTK's tokenizer data to be downloaded
X = tweet1
Y = tweet2
# tokenization (as sets, so that union and membership tests work)
X_set = set(word_tokenize(X))
Y_set = set(word_tokenize(Y))
l1 = []; l2 = []
# form a set containing keywords of both strings
rvector = X_set.union(Y_set)
for w in rvector:
    if w in X_set: l1.append(1)  # create a vector
    else: l1.append(0)
    if w in Y_set: l2.append(1)
    else: l2.append(0)
c = 0
# cosine formula
for i in range(len(rvector)):
    c += l1[i] * l2[i]
cosine = c / float((sum(l1) * sum(l2)) ** 0.5)
print("similarity: ", cosine)
if cosine >= 0.90:
    print("Similar")

Python - tf-idf predict a new document similarity

Inspired by this answer, I'm trying to find the cosine similarity between a trained tf-idf vectorizer and a new document, and return the similar documents.
The code below finds the cosine similarity of the first vector, not of a new query:
>>> from sklearn.metrics.pairwise import linear_kernel
>>> cosine_similarities = linear_kernel(tfidf[0:1], tfidf).flatten()
>>> cosine_similarities
array([ 1. , 0.04405952, 0.11016969, ..., 0.04433602,
0.04457106, 0.03293218])
Since my train data is huge, looping through the entire trained vectorizer sounds like a bad idea.
How can I infer the vector of a new document, and find the related docs, same as the code below?
>>> related_docs_indices = cosine_similarities.argsort()[:-5:-1]
>>> related_docs_indices
array([ 0, 958, 10576, 3277])
>>> cosine_similarities[related_docs_indices]
array([ 1. , 0.54967926, 0.32902194, 0.2825788 ])

This problem can be partially addressed by combining the vector space model (which is the tf-idf & cosine similarity) with the boolean model. These are concepts from information retrieval, and they are used (and nicely explained) in ElasticSearch, a pretty good search engine.
The idea is simple: you store your documents as inverted indices. This is comparable to the index at the end of a book, which maps each word to the pages (documents) it is mentioned in.
Instead of calculating the similarity against the tf-idf vectors of all documents, you only calculate it for documents that have at least one word (or some threshold number of words) in common with the query. This can be done simply by looping over the words in the queried document, finding the documents that also contain each word via the inverted index, and calculating the similarity for those.
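The candidate-filtering step might look roughly like the sketch below. The toy corpus, the whitespace tokenisation and the candidates helper are all made up for illustration; a real search engine does much more. The tf-idf and cosine step would then run only on the returned ids.

```python
from collections import defaultdict

# Hypothetical toy corpus; tokenisation is a plain lowercase split
docs = [
    "human computer interaction",
    "graph of trees",
    "computer system response time",
    "graph minors survey",
]

# inverted index: word -> ids of the documents that contain it
inverted = defaultdict(set)
for doc_id, text in enumerate(docs):
    for word in text.lower().split():
        inverted[word].add(doc_id)

def candidates(query, min_shared=1):
    """Ids of documents sharing at least `min_shared` distinct words with the query."""
    hits = defaultdict(int)
    for word in set(query.lower().split()):
        for doc_id in inverted.get(word, ()):
            hits[doc_id] += 1
    return sorted(d for d, n in hits.items() if n >= min_shared)

print(candidates("computer graphics survey"))
```

Raising min_shared trades recall for speed: fewer documents survive the filter, so fewer similarities have to be computed.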

You should take a look at gensim. Example starting code looks like this:
from gensim import corpora, models, similarities
dictionary = corpora.Dictionary(line.lower().split() for line in open('corpus.txt'))
corpus = [dictionary.doc2bow(line.lower().split()) for line in open('corpus.txt')]
tfidf = models.TfidfModel(corpus)
# num_features must equal the vocabulary size
index = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
At prediction time you first get the vector for the new doc:
doc = "Human computer interaction"
vec_bow = dictionary.doc2bow(doc.lower().split())
vec_tfidf = tfidf[vec_bow]
Then get the similarities (sorted by most similar):
sims = index[vec_tfidf] # perform a similarity query against the corpus
print(list(enumerate(sims))) # print (document_number, document_similarity) 2-tuples
This does a linear scan like you wanted to do but they have a more optimized implementation. If the speed is not enough then you can look into approximate similarity search (Annoy, Falconn, NMSLIB).

For huge data sets, there is a technique called text clustering by concept, which search engines use.
In the first step, you cluster your documents into groups (e.g. 50 clusters); each cluster then has a representative document (one that contains words carrying useful information about its cluster).
In the second step, to calculate the cosine similarity between the new document and your data set, you loop through all representatives (50 of them) and find the nearest ones (e.g. the top 2).
In the final step, you loop through all documents in the clusters of the selected representatives and find the ones with the highest cosine similarity.
With this technique, you can reduce the number of loops and improve performance.
You can read about more techniques in some chapters of this book: http://nlp.stanford.edu/IR-book/html/htmledition/irbook.html

Matching words and vectors in gensim Word2Vec model

I have had the gensim Word2Vec implementation compute some word embeddings for me. Everything went quite fantastically as far as I can tell; now I am clustering the word vectors created, hoping to get some semantic groupings.
As a next step, I would like to look at the words (rather than the vectors) contained in each cluster. I.e. if I have the vector of embeddings [x, y, z], I would like to find out which actual word this vector represents. I can get the words/Vocab items by calling model.vocab and the word vectors through model.syn0. But I could not find a location where these are explicitly matched.
This was more complicated than I expected and I feel I might be missing the obvious way of doing it. Any help is appreciated!
Problem:
Match words to embedding vectors created by Word2Vec () -- how do I do it?
My approach:
After creating the model (code below*), I would now like to match the indexes assigned to each word (during the build_vocab() phase) to the vector matrix outputted as model.syn0.
Thus
for i in range(0, newmod.syn0.shape[0]):  # iterate over all words in model
    print(i)
    word = [k for k in newmod.vocab if newmod.vocab[k].__dict__['index'] == i]  # get the word out of the internal dictionary by its index
    wordvector = newmod.syn0[i]  # get the vector with the corresponding index
    print(wordvector == newmod[word])  # testing: compare result of looking up the word in the model -- this prints True
Is there a better way of doing this, e.g. by feeding the vector into the model to match the word?
Does this even get me correct results?
*My code to create the word vectors:
model = Word2Vec(size=1000, min_count=5, workers=4, sg=1)
model.build_vocab(sentencefeeder(folderlist)) #sentencefeeder puts out sentences as lists of strings
model.save("newmodel")
I found this question which is similar but has not really been answered.

I have been searching for a long time to find the mapping between the syn0 matrix and the vocabulary. Here is the answer: use model.index2word, which is simply the list of words in the right order! (In gensim 4.x this list is called model.wv.index_to_key.)
This is not in the official documentation (why ?) but it can be found directly inside the source code : https://github.com/RaRe-Technologies/gensim/blob/3b9bb59dac0d55a1cd6ca8f984cead38b9cb0860/gensim/models/word2vec.py#L441

If all you want to do is map a word to a vector, you can simply use the [] operator, e.g. model["hello"] (model.wv["hello"] in recent gensim versions) will give you the vector corresponding to hello.
If you need to recover a word from a vector, you could loop through your list of vectors and check for a match, as you propose. However, this is inefficient and not pythonic. A convenient solution is to use the similar_by_vector method of the word2vec model, like this:
import gensim
documents = [['human', 'interface', 'computer'],
             ['survey', 'user', 'computer', 'system', 'response', 'time'],
             ['eps', 'user', 'interface', 'system'],
             ['system', 'human', 'system', 'eps'],
             ['user', 'response', 'time'],
             ['trees'],
             ['graph', 'trees'],
             ['graph', 'minors', 'trees'],
             ['graph', 'minors', 'survey']]
model = gensim.models.Word2Vec(documents, min_count=1)
# in recent gensim versions the word vectors live on model.wv
print(model.wv.similar_by_vector(model.wv["survey"], topn=1))
which outputs:
[('survey', 1.0000001192092896)]
where the number represents the similarity.
However, this method is still inefficient, as it still has to scan all of the word vectors to search for the most similar one. The best solution to your problem is to find a way to keep track of your vectors during the clustering process so you don't have to rely on expensive reverse mappings.
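Keeping track of the vectors can be as simple as holding the word list and the vector matrix in the same row order and clustering by row. In the sketch below, the words, the 2-d vectors and the fixed centroids are all made up; in practice the first two would come from the trained model, and the labels from whatever clustering algorithm you use.

```python
from collections import defaultdict
import numpy as np

# Hypothetical stand-ins, aligned row by row: vectors[i] belongs to words[i]
words = ["cat", "dog", "car", "bus"]
vectors = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]])

# Any clustering works as long as it returns one label per row, in row order.
# Here: assignment to the nearest of two fixed, made-up centroids
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
dists = np.linalg.norm(vectors[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)

# Row order is preserved, so labels[i] is the cluster of words[i]
clusters = defaultdict(list)
for word, label in zip(words, labels):
    clusters[int(label)].append(word)
print(dict(clusters))
```

Since the labels come back in row order, no reverse lookup from vector to word is ever needed.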

So I found an easy way to do this, where nmodel is the name of your model.
import numpy as np
# zip the two lists containing vectors and words (list() so the result can be iterated more than once)
zipped = list(zip(nmodel.wv.index2word, nmodel.wv.syn0))
# the resulting list contains `(word, wordvector)` tuples. We can extract the entry for any `word` or `vector` (replace with the word/vector you're looking for) using a list comprehension:
wordresult = [i for i in zipped if i[0] == word]
vecresult = [i for i in zipped if np.array_equal(i[1], vector)]
This is based on the gensim code. For older versions of gensim, you might need to drop the wv after the model; in gensim 4.x the attributes are named index_to_key and vectors instead of index2word and syn0.

As @bpachev mentioned, gensim does have an option for searching by vector, namely similar_by_vector.
However, it implements a brute-force linear search, i.e. it computes the cosine similarity between the given vector and the vectors of all words in the vocabulary, and returns the top neighbours. An alternative, as mentioned in the other answer, is to use an approximate nearest neighbour search algorithm like FLANN.
Sharing a gist demonstrating the same:
https://gist.github.com/kampta/139f710ca91ed5fabaf9e6616d2c762b
