Re-calculate similarity matrix given new documents - python

I'm running an experiment that include text documents that I need to calculate the (cosine) similarity matrix between all of them (to use for another calculation). For that I use sklearn's TfidfVectorizer:
corpus = [doc1, doc2, doc3, doc4]
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tfidf = vect.fit_transform(corpus)
similarities = tfidf * tfidf.T
pairwise_similarity_matrix = similarities.A
The problem is that with each iteration of my experiment I discover new documents that I need to add to my similarity matrix, and given the number of documents I'm working with (tens of thousands and more) - it is very time consuming.
I wish to find a way to calculate only the similarities between the new batch of documents and the existing ones, without computing it all again one the entire data set.
Note that I'm using a term-frequency (tf) representation, without using inverse-document-frequency (idf), so in theory I don't need to re-calculate the whole matrix each time.

OK, I got it.
The idea is, as I said, to calculate the similarity only between the new batch of files and the existing ones, which their similarity is unchanged. The problem is to keep the TfidfVectorizer's vocabulary updated with the newly seen terms.
The solution has 2 steps:
Update the vocabulary and the tf matrices.
Matrix multiplications and stacking.
Here's the whole script - we first got the original corpus and the trained and calculated objects and matrices:
corpus = [doc1, doc2, doc3]
# Build for the first time:
vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
tf_matrix = vect.fit_transform(corpus)
similarities = tf_matrix * tf_matrix.T
similarities_matrix = similarities.A # just for printing
Now, given new documents:
new_docs_corpus = [docx, docy, docz] # New documents
# Building new vectorizer to create the parsed vocabulary of the new documents:
new_vect = TfidfVectorizer(min_df=1, stop_words="english", use_idf=False)
new_vect.fit(new_docs_corpus)
# Merging old and new vocabs:
new_terms_count = 0
for k, v in new_vect.vocabulary_.items():
if k in vect.vocabulary_.keys():
continue
vect.vocabulary_[k] = np.int64(len(vect.vocabulary_)) # important not to assign a simple int
new_terms_count = new_terms_count + 1
new_vect.vocabulary_ = vect.vocabulary_
# Build new docs represantation using the merged vocabulary:
new_tf_matrix = new_vect.transform(new_docs_corpus)
new_similarities = new_tf_matrix * new_tf_matrix.T
# Get the old tf-matrix with the same dimentions:
if new_terms_count:
zero_matrix = csr_matrix((tfidf.shape[0],new_terms_count))
tf_matrix = hstack([tf_matrix, zero_matrix])
# tf_matrix = vect.transform(corpus) # Instead, we just append 0's for the new terms and stack the tf_matrix over the new one, to save time
cross_similarities = new_tf_matrix * tf_matrix.T # Calculate cross-similarities
tf_matrix = vstack([tf_matrix, new_tfidf])
# Stack it all together:
similarities = vstack([hstack([similarities, cross_similarities.T]), hstack([cross_similarities, new_similarities])])
similarities_matrix = similarities.A
# Updating the corpus with the new documents:
corpus = corpus + new_docs_corpus
We can check this by comparing the calculated similarities_matrix we got, with the one we get when we train a TfidfVectorizer on the joint corpus: corpus + new_docs_corpus.
As discussed in the the comments, we can do all that only because we are not using the idf (inverse-document-frequency) element, that will change the representation of existing documents given new ones.

Related

Solving memory issues when using Gensim LDA Multicore

For my project I am trying to use unsupervised learning to identify different topics from application descriptions, but I am running into a strange problem. Firstly, I have 3 different datasets, one with 15k documents another with 50k documents and last with 2m documents. I am trying to test models with different number of topics (k) ranging from 5 to 100 with a step size of 5. This is in order to check which k results in the best model assessed with initially with the highest coherence score. For each k, I also build 3 different models with chunksize 10, 100 and 1000.
So now moving onto the problem I am having. Obviously my own machine is too slow and does not have enough cores for this kind of computation hence I am using my university's server. The problem here is my program seems to be consuming too much memory and I am unsure of the reason. I already made some adjustments such that the corpus is not loaded entirely to memory (or atleast I think I did). The dataset with 50k entries already at iteration k=50 (so halfway) seems to have consumed the alloted 100GB of memory, which seems very huge.
I would appreciate any help in the right direction and thanks for taking the time to look at this. Below is the code from my topic_modelling.py file. Comments on the file are a bit outdated, sorry about that.
class MyCorpus:
texts: list
dictionary: dict
def __init__(self, descriptions, dictionary):
self.texts = descriptions
self.dictionary = dictionary
def __iter__(self):
for line in self.texts:
try:
# assume there's one document per line, tokens separated by whitespace
yield self.dictionary.doc2bow(line)
except StopIteration:
pass
# Function given a dataframe creates a dictionary and corupus
# These are used to create an LDA model. Here we automatically use the Descriptionb column
# from each dataframe
def create_dict_and_corpus(df):
text_descriptions = remove_characters_and_create_list(df, 'Description')
# print(text_descriptions)
dictionary = gensim.corpora.Dictionary(text_descriptions)
corpus = MyCorpus(text_descriptions, dictionary)
return text_descriptions, dictionary, corpus
# Given a dataframe remove and a column name in the data frame, extract all words and return a list
# Also to remove all chracters that are not alphanumeric or spaces
def remove_characters_and_create_list(df, column_name, split=True):
df[column_name] = df[column_name].astype(str)
texts = []
for x in range(df[column_name].size):
current_string = df[column_name][x]
filtered_string = re.sub(r'[^A-Za-z0-9 ]+', '', current_string)
if split:
texts.append(filtered_string.split())
else:
texts.append(filtered_string)
return texts
# This function given the parameters creates an LDA model for each number between
# the start limit and the end limit. After this the coherence and perplexity is calulated
# for each of those models and saved in a csv file to analyze later.
def test_lda_models(text, corpus, dictionary, start_limit, end_limit, path):
results = []
print("============Starting topic modelling============")
for k in range(start_limit, end_limit+1, 5):
for p in range(1, 4):
chunk = pow(10, p)
t0 = time.time()
lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
num_topics=k,
id2word=dictionary,
passes=p,
chunksize=chunk)
# To calculate the goodness of the model
perplexity = lda_model.bound(corpus)
coherence_model_lda = CoherenceModel(model=lda_model, texts=text, dictionary=dictionary, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
t1 = time.time()
print(f"=====Done K={k} model with passes={p} and chunksize={chunk}, took {t1-t0} seconds=====")
results.append((k, chunk, coherence_lda, perplexity))
# Storing teh results in a csv file except the actual lda model (this would not make sense)
path = make_dir_if_not_exists(path)
list_tuples_to_csv(results, ['#OfTopics', 'ChunkSize', 'CoherenceScore', 'Perplexity'], f"{path}/K={start_limit}to{end_limit}.csv")
return results
# Function plot the visualization of an LDA model. This visualization is then
# saved as an html file inside the given path
def single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path):
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.save_html(vis, f"{path}/visualization.html")
# Given the results produced by test_lda_models, loop though the models and save the
# topic words of each model and the visualization of the topics in the given path
def save_lda_result(k, c, lda_model, corpus, dictionary, path):
list_tuples_to_csv(lda_model.print_topics(num_topics=k), ['Topic#', 'Associated Words'], f"{path}/associated_words.csv")
single_lda_model_visualization(k, c, corpus, dictionary, lda_model, path)
# This is the entire pipeline that needs to be performed for a single dataset,
# which includes computing the LDA models from start to end limit and calculating
# and saving the topic words and visual graphs for the top n topics with the highest
# coherence score.
def perform_topic_modelling_single_df(df, start_limit, end_limit, path):
# Extracting the necessary data required for LDA model computation
text_descriptions,dictionary, corpus = create_dict_and_corpus(df)
results_lda = test_lda_models(text_descriptions, corpus, dictionary, start_limit, end_limit, path)
# Sorting the results based on the 2nd tuple value returned which is 'coherence'
results_lda.sort(key=lambda x:x[2],reverse=True)
# Getting the top 5 results to save pass to save_lda_results function
results = results_lda[:5]
corpus_for_saving = [dictionary.doc2bow(text) for text in text_descriptions]
texts = remove_characters_and_create_list(df, 'Description', split=False)
# Perfrom application to topic modelling for the best lda model based on the
# coherence score (TODO maybe test with other lda models?)
print("getting descriptions for csv")
for k, c, _, _ in results:
dir_path = make_dir_if_not_exists(f"{path}/k={k}_chunk={c}")
p = int(math.log10(c))
lda_model = gensim.models.ldamulticore.LdaMulticore(corpus,
num_topics=k,
id2word=dictionary,
passes=p,
chunksize=c)
print(f"=====REDOING K={k} model with passes={p} and chunksize={c}=====")
save_lda_result(k,c, lda_model, corpus_for_saving, dictionary, dir_path)
application_to_topic_modelling(df, k, c, lda_model, corpus_for_saving, texts, dir_path)
# Performs the whole topic modelling pipeline taking different genre data sets
# and the entire dataset as a whole
def perform_topic_modelling_pipeline(path_ex):
# entire_df = pd.read_csv("../data/preprocessed_data/preprocessed_10000_trial.csv")
entire_df = pd.read_csv(os.path.join(ROOT_DIR, f"data/preprocessed_data/preprocessedData_{path_ex}.csv"))
print("size of df")
print(entire_df.shape)
# For entire df go from start limit to ngenres to find best LDA model
nGenres = row_counter(os.path.join(ROOT_DIR, f"data/genre_wise_data/data{path_ex}/genre_frequency.csv"))
nGenres_rounded = math.ceil(nGenres / 5) * 5
print(f"Original number of genres should be {nGenres}, but we are rounding to {nGenres_rounded}")
path = make_dir_if_not_exists(os.path.join(ROOT_DIR, f"results/data{path_ex}/aall_data"))
perform_topic_modelling_single_df(entire_df, 5, 100, path)

SciSpacy equivalent of Gensim's functions/parameters

With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus,size=100,min_count=5)
The output from gensim, but I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
(The output is a model of embeddings (too big to add here))).
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (size parameter), or the minimum number of times the word should be in the corpus (min_count) in SciSpacy (e.g. I looked here and here), but I'm not sure if I'm missing them?
A possible way to achieve your goal would be to:
parse you documents via nlp.pipe
collect all the words and pairwise similarities
process similarities to get the desired results
Let's prepare some data:
import spacy
nlp = spacy.load("en_core_web_md", disable = ['ner', 'tagger', 'parser'])
Then, to get a vector, like in model.wv['car'] one would do:
nlp("car").vector
To get most similar words like model.most_similar(positive=['car']) let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence aboout train"
, "And this is a sentence about a bike"]
docs = nlp.pipe(corpus)
tokens = []
tokens_orth = []
for doc in docs:
for tok in doc:
if tok.orth_ not in tokens_orth:
tokens.append(tok)
tokens_orth.append(tok.orth_)
sims = np.zeros((len(tokens),len(tokens)))
for i, tok in enumerate(tokens):
sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth = tokens_orth, sims=sims, top=3):
tokens_orth = np.array(tokens_orth)
id_word = np.where(tokens_orth == word)[0][0]
sim = sims[id_word]
id_ms = np.argsort(sim)[:-top-1:-1]
return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked for specification of dimension and frequency. Embedding length is fixed at the time the model is initialized, so it can't be changed after that. You can start from a blank model if you wish so, and feed embeddings you're comfortable with. As for the frequency, it's doable, via counting all the words and throwing away anything that is below desired threshold. But again, underlying embeddings will be from a not filtered text. SpaCy is different from Gensim in that it uses readily available embeddings whereas Gensim trains them.

How to call a corpus file in python?

I am currently working on gensim doc2vec model to implement sentence similarity.
I came across this sample code by William Bert where he has mentioned that to train this model I need to provide my own background corpus. The code is copied below for convenience:
import logging, sys, pprint
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
### Generating a training/background corpus from your own source of documents
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
# gensim docs: "Provide a filename or a file-like object as input and TextCorpus will be initialized with a
# dictionary in `self.dictionary`and will support the `iter` corpus method. For other kinds of corpora, you only
# need to override `get_texts` and provide your own implementation."
background_corpus = TextCorpus(input=YOUR_CORPUS)
# Important -- save the dictionary generated by the corpus, or future operations will not be able to map results
# back to original words.
background_corpus.dictionary.save(
"my_dict.dict")
MmCorpus.serialize("background_corpus.mm",
background_corpus) # Uses numpy to persist wiki corpus in Matrix Market format. File will be several GBs.
### Generating a large training/background corpus using Wikipedia
from gensim.corpora import WikiCorpus, wikicorpus
articles = "enwiki-latest-pages-articles.xml.bz2" # available from http://en.wikipedia.org/wiki/Wikipedia:Database_download
# This will take many hours! Output is Wikipedia in bucket-of-words (BOW) sparse matrix.
wiki_corpus = WikiCorpus(articles)
wiki_corpus.dictionary.save("wiki_dict.dict")
MmCorpus.serialize("wiki_corpus.mm", wiki_corpus) # File will be several GBs.
### Working with persisted corpus and dictionary
bow_corpus = MmCorpus("wiki_corpus.mm") # Revive a corpus
dictionary = Dictionary.load("wiki_dict.dict") # Load a dictionary
### Transformations among vector spaces
from gensim.models import LsiModel, LogEntropyModel
logent_transformation = LogEntropyModel(wiki_corpus,
id2word=dictionary) # Log Entropy weights frequencies of all document features in the corpus
tokenize_func = wikicorpus.tokenize # The tokenizer used to create the Wikipedia corpus
document = "Some text to be transformed."
# First, tokenize document using the same tokenization as was used on the background corpus, and then convert it to
# BOW representation using the dictionary created when generating the background corpus.
bow_document = dictionary.doc2bow(tokenize_func(
document))
# converts a single document to log entropy representation. document must be in the same vector space as corpus.
logent_document = logent_transformation[[
bow_document]]
# Transform arbitrary documents by getting them into the same BOW vector space created by your training corpus
documents = ["Some iterable", "containing multiple", "documents", "..."]
bow_documents = (dictionary.doc2bow(
tokenize_func(document)) for document in documents) # use a generator expression because...
logent_documents = logent_transformation[
bow_documents] # ...transformation is done during iteration of documents using generators, so this uses constant memory
### Chained transformations
# This builds a new corpus from iterating over documents of bow_corpus as transformed to log entropy representation.
# Will also take many hours if bow_corpus is the Wikipedia corpus created above.
logent_corpus = MmCorpus(corpus=logent_transformation[bow_corpus])
# Creates LSI transformation model from log entropy corpus representation. Takes several hours with Wikipedia corpus.
lsi_transformation = LsiModel(corpus=logent_corpus, id2word=dictionary,
num_features=400)
# Alternative way of performing same operation as above, but with implicit chaining
# lsi_transformation = LsiModel(corpus=logent_transformation[bow_corpus], id2word=dictionary,
# num_features=400)
# Can persist transformation models, too.
logent_transformation.save("logent.model")
lsi_transformation.save("lsi.model")
### Similarities (the best part)
from gensim.similarities import Similarity
# This index corpus consists of what you want to compare future queries against
index_documents = ["A bear walked in the dark forest.",
"Tall trees have many more leaves than short bushes.",
"A starship may someday travel across vast reaches of space to other stars.",
"Difference is the concept of how two or more entities are not the same."]
# A corpus can be anything, as long as iterating over it produces a representation of the corpus documents as vectors.
corpus = (dictionary.doc2bow(tokenize_func(document)) for document in index_documents)
index = Similarity(corpus=lsi_transformation[logent_transformation[corpus]], num_features=400, output_prefix="shard")
print "Index corpus:"
pprint.pprint(documents)
print "Similarities of index corpus documents to one another:"
pprint.pprint([s for s in index])
query = "In the face of ambiguity, refuse the temptation to guess."
sims_to_query = index[lsi_transformation[logent_transformation[dictionary.doc2bow(tokenize_func(query))]]]
print "Similarities of index corpus documents to '%s'" % query
pprint.pprint(sims_to_query)
best_score = max(sims_to_query)
index = sims_to_query.tolist().index(best_score)
most_similar_doc = documents[index]
print "The document most similar to the query is '%s' with a score of %.2f." % (most_similar_doc, best_score)
Where and how should I provide my own corpus in the code?
Thanks in advance for your help.

Python nltk classify with large feature set (Replicate Go Et Al 2009)

I'm trying to replicate Go Et Al. Twitter sentiment Analysis which can be found here http://help.sentiment140.com/for-students
The problem I'm having is the number of features is 364464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this where tweets holds a replication of the 1,600,000 tweets and there polarity:
for tweet in tweets:
tweet[0] = extract_features(tweet[0], features)
classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Everything doesn't take very long apart from the extract_features function
def extract_features(tweet, featureList):
tweet_words = set(tweet)
features = {}
for word in featureList:
features['contains(%s)' % word] = (word in tweet_words)
return features
This is because for each tweet it's creating a dictionary of size 364,464 to represent whether something is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find herehttp://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function but it now works with the huge sizes without memory issues.
Here's a lowdown of the function description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
tweet_words = set(tweet)
global featureList
features = {}
for word in featureList:
features[word] = False
for word in tweet_words:
if word in featureList:
features[word] = True
return features

how to add tokens to gensim dictionary

I use gensim to build dictionary from a collection of documents. Each document is a list of tokens. this my code
def constructModel(self, docTokens):
""" Given document tokens, constructs the tf-idf and similarity models"""
#construct dictionary for the BOW (vector-space) model : Dictionary = a mapping between words and their integer ids = collection of (word_index,word_string) pairs
#print "dictionary"
self.dictionary = corpora.Dictionary(docTokens)
# prune dictionary: remove words that appear too infrequently or too frequently
print "dictionary size before filter_extremes:",self.dictionary#len(self.dictionary.values())
#self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
#self.dictionary.compactify()
print "dictionary size after filter_extremes:",self.dictionary
#construct the corpus bow vectors; bow vector = collection of (word_id,word_frequency) pairs
corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]
#construct the tf-idf model
self.model = models.TfidfModel(corpus_bow,normalize=True)
corpus_tfidf = self.model[corpus_bow] # first transform each raw bow vector in the corpus to the tfidf model's vector space
self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf) # construct the term-document index
my question is how to add a new doc (tokens) to this dictionary and update it. I searched in gensim documents but I didn't find a solution
There is documentation for how to do this on the gensim webpage here
The way to do it is create another dictionary with the new documents and then merge them.
from gensim import corpora
dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)
According to the docs, this will map "same tokens to the same ids and new tokens to new ids".
You can use the add_documents method:
from gensim import corpora
text = [["aaa", "aaa"]]
dictionary = corpora.Dictionary(text)
dictionary.add_documents([['bbb','bbb']])
print(dictionary)
After run the code above, you will get this:
Dictionary(2 unique tokens: ['aaa', 'bbb'])
Read the document for more details.
METHOD 1:
You can just use keyedvectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)
METHOD 2:
AND if you already have built a model using gensim.models.Word2Vec you can just do this. suppose I want to add the token <UKN> with a random vector.
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)

Categories