1. For the test text below,
test=['test test', 'test toy']
the tf-idf scores without normalisation (smartirs='ntn') are
[['test', 1.17]]
[['test', 0.58], ['toy', 1.58]]
This doesn't seem to tally with what I get via direct computation of
tfidf(w, d) = tf x idf
where idf(term) = log(total number of documents / number of documents containing the term)
and tf = (number of occurrences of the word in document d) / (total number of words in document d)
E.g.
doc 1: 'test test'
for the word "test":
tf = 2/2 = 1
idf= log(2/2) = 0
tf-idf = 0
Can someone show me the computation using my above test text?
2. When I change to cosine normalisation (smartirs='ntc'), I get
[['test', 1.0]]
[['test', 0.35], ['toy', 0.94]]
Can someone show me the computation too?
Thank you
import gensim
from gensim import corpora
from gensim import models
import numpy as np
from gensim.utils import simple_preprocess
test=['test test', 'test toy']
texts = [simple_preprocess(doc) for doc in test]
mydict= corpora.Dictionary(texts)
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in texts]
tfidf = models.TfidfModel(mycorpus, smartirs='ntn')
for doc in tfidf[mycorpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])
If you care to know the details of the implementation of models.TfidfModel, you can check them directly in the GitHub repository for gensim. The particular calculation scheme corresponding to smartirs='ntn' is described on the Wikipedia page for the SMART Information Retrieval System, and the exact calculations differ from the ones you use, hence the difference in the results.
E.g. the particular discrepancy you are referring to:
idf= log(2/2) = 0
should actually use idf = log2((N + 1) / n_k), which for "test" (n_k = 2, N = 2) gives
idf = log2(3/2) ≈ 0.58
I suggest that you review both the implementation and the page mentioned above to ensure your manual checks follow the implementation for the chosen smartirs flags.
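For reference, here is a minimal sketch of that computation (assuming the idf form above, raw term counts for tf, and cosine normalisation of each document's weight vector for 'ntc'); it reproduces the numbers reported in the question:

import numpy as np

docs = [['test', 'test'], ['test', 'toy']]
N = len(docs)                    # 2 documents
n_k = {'test': 2, 'toy': 1}      # document frequencies

for doc in docs:
    # 'ntn': raw count * idf, no normalisation
    weights = {w: doc.count(w) * np.log2((N + 1) / n_k[w]) for w in set(doc)}
    print('ntn:', {w: round(v, 2) for w, v in weights.items()})
    # 'ntc': divide each weight by the Euclidean norm of the document's weight vector
    norm = np.sqrt(sum(v ** 2 for v in weights.values()))
    print('ntc:', {w: round(v / norm, 2) for w, v in weights.items()})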
With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus, size=100, min_count=5)
I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
(The output from gensim is a model of embeddings, too big to add here.)
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (size parameter), or the minimum number of times the word should be in the corpus (min_count) in SciSpacy (e.g. I looked here and here), but I'm not sure if I'm missing them?
A possible way to achieve your goal would be to:
1. parse your documents via nlp.pipe
2. collect all the words and pairwise similarities
3. process the similarities to get the desired results
Let's prepare some data:
import spacy
nlp = spacy.load("en_core_web_md", disable = ['ner', 'tagger', 'parser'])
Then, to get a vector, like in model.wv['car'] one would do:
nlp("car").vector
To get most similar words like model.most_similar(positive=['car']) let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence aboout train"
, "And this is a sentence about a bike"]
docs = nlp.pipe(corpus)
import numpy as np

tokens = []
tokens_orth = []
for doc in docs:
    for tok in doc:
        if tok.orth_ not in tokens_orth:  # keep one token per unique surface form
            tokens.append(tok)
            tokens_orth.append(tok.orth_)

sims = np.zeros((len(tokens), len(tokens)))
for i, tok in enumerate(tokens):
    sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth=tokens_orth, sims=sims, top=3):
    tokens_orth = np.array(tokens_orth)
    id_word = np.where(tokens_orth == word)[0][0]
    sim = sims[id_word]
    id_ms = np.argsort(sim)[:-top-1:-1]
    return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked about specifying the dimension and frequency. The embedding length is fixed at the time the model is initialized, so it can't be changed after that. You can start from a blank model if you wish and feed it embeddings you're comfortable with. As for the frequency, it's doable by counting all the words and throwing away anything below the desired threshold (a rough sketch follows). But again, the underlying embeddings will come from unfiltered text. SpaCy differs from Gensim in that it uses readily available embeddings, whereas Gensim trains them.
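To illustrate the frequency filtering mentioned above, here is a minimal sketch; the threshold of 5 is just an assumed value, analogous to gensim's min_count:

from collections import Counter

min_count = 5  # assumed threshold, analogous to gensim's min_count
counts = Counter(tok.orth_ for doc in nlp.pipe(corpus) for tok in doc)
frequent_tokens = [tok for tok in tokens if counts[tok.orth_] >= min_count]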
I have two lists of words, say,
list 1 : future proof
list 2 : house past foo bar
I would like to calculate the semantic distance between each word of list 1 with each word of list 2.
Fasttext has a nice function to display the nearest neighbours, but it would be nice if there was a way to read out the semantic distance between two given words.
Can anyone help, please?
Thanks
Unfortunately, there's no direct word similarity function in NLTK, although there is support for synset similarities through the WordNet API in NLTK.
Though not exhaustive, here's a list of pre-trained word embeddings that can be used to compute the cosine similarity of word vectors: https://github.com/alvations/vegetables
To use them, here's an example using the HLBL embeddings (from Turian et al. 2011): https://www.kaggle.com/alvations/vegetables-hlbl-embeddings (scroll down to the data explorer and download the directory directly; the top download button on the dataset page seems to lead to some corrupted data).
After downloading, you can load the embeddings using numpy:
>>> import pickle
>>> import numpy as np
>>> embeddings = np.load('hlbl.rcv1.original.50d.npy')
>>> tokens = [line.strip() for line in open('hlbl.rcv1.original.50d.txt')]
>>> embeddings[tokens.index('hello')]
array([-0.21167406, -0.04189226, 0.22745571, -0.09330438, 0.13239339,
0.25136262, -0.01908735, -0.02557277, 0.0029353 , -0.06194451,
-0.22384156, 0.04584747, 0.03227248, -0.13708033, 0.17901117,
-0.01664691, 0.09400477, 0.06688628, -0.09019949, -0.06918809,
0.08437972, -0.01485273, -0.12062263, 0.05024147, -0.00416972,
0.04466985, -0.05316647, 0.00998635, -0.03696947, 0.10502578,
-0.00190554, 0.03435732, -0.05715087, -0.06777468, -0.11803425,
0.17845355, 0.18688948, -0.07509124, -0.16089943, 0.0396672 ,
-0.05162677, -0.12486628, -0.03870481, 0.0928738 , 0.06197058,
-0.14603543, 0.04026282, 0.14052328, 0.1085517 , -0.15121481])
To compute the similarity of two numpy arrays, you can try the approach from Cosine Similarity between 2 Number Lists:
import numpy as np
cos_similarity = lambda a, b: np.dot(a, b)/(np.linalg.norm(a)*np.linalg.norm(b))
x, y = np.array([1,2,3]), np.array([2,2,1])
cos_similarity(x,y)
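Putting the pieces together for the two word lists in the question, a sketch that assumes each word is actually present in the HLBL vocabulary loaded above:

list1 = ['future', 'proof']
list2 = ['house', 'past', 'foo', 'bar']

for w1 in list1:
    for w2 in list2:
        v1 = embeddings[tokens.index(w1)]
        v2 = embeddings[tokens.index(w2)]
        print(w1, w2, cos_similarity(v1, v2))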
I am currently working on gensim doc2vec model to implement sentence similarity.
I came across this sample code by William Bert where he has mentioned that to train this model I need to provide my own background corpus. The code is copied below for convenience:
import logging, sys, pprint
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
### Generating a training/background corpus from your own source of documents
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
# gensim docs: "Provide a filename or a file-like object as input and TextCorpus will be initialized with a
# dictionary in `self.dictionary`and will support the `iter` corpus method. For other kinds of corpora, you only
# need to override `get_texts` and provide your own implementation."
background_corpus = TextCorpus(input=YOUR_CORPUS)
# Important -- save the dictionary generated by the corpus, or future operations will not be able to map results
# back to original words.
background_corpus.dictionary.save(
"my_dict.dict")
MmCorpus.serialize("background_corpus.mm",
background_corpus) # Uses numpy to persist wiki corpus in Matrix Market format. File will be several GBs.
### Generating a large training/background corpus using Wikipedia
from gensim.corpora import WikiCorpus, wikicorpus
articles = "enwiki-latest-pages-articles.xml.bz2" # available from http://en.wikipedia.org/wiki/Wikipedia:Database_download
# This will take many hours! Output is Wikipedia in bucket-of-words (BOW) sparse matrix.
wiki_corpus = WikiCorpus(articles)
wiki_corpus.dictionary.save("wiki_dict.dict")
MmCorpus.serialize("wiki_corpus.mm", wiki_corpus) # File will be several GBs.
### Working with persisted corpus and dictionary
bow_corpus = MmCorpus("wiki_corpus.mm") # Revive a corpus
dictionary = Dictionary.load("wiki_dict.dict") # Load a dictionary
### Transformations among vector spaces
from gensim.models import LsiModel, LogEntropyModel
logent_transformation = LogEntropyModel(wiki_corpus,
id2word=dictionary) # Log Entropy weights frequencies of all document features in the corpus
tokenize_func = wikicorpus.tokenize # The tokenizer used to create the Wikipedia corpus
document = "Some text to be transformed."
# First, tokenize document using the same tokenization as was used on the background corpus, and then convert it to
# BOW representation using the dictionary created when generating the background corpus.
bow_document = dictionary.doc2bow(tokenize_func(
document))
# converts a single document to log entropy representation. document must be in the same vector space as corpus.
logent_document = logent_transformation[[
bow_document]]
# Transform arbitrary documents by getting them into the same BOW vector space created by your training corpus
documents = ["Some iterable", "containing multiple", "documents", "..."]
bow_documents = (dictionary.doc2bow(
tokenize_func(document)) for document in documents) # use a generator expression because...
logent_documents = logent_transformation[
bow_documents] # ...transformation is done during iteration of documents using generators, so this uses constant memory
### Chained transformations
# This builds a new corpus from iterating over documents of bow_corpus as transformed to log entropy representation.
# Will also take many hours if bow_corpus is the Wikipedia corpus created above.
logent_corpus = MmCorpus(corpus=logent_transformation[bow_corpus])
# Creates LSI transformation model from log entropy corpus representation. Takes several hours with Wikipedia corpus.
lsi_transformation = LsiModel(corpus=logent_corpus, id2word=dictionary,
num_features=400)
# Alternative way of performing same operation as above, but with implicit chaining
# lsi_transformation = LsiModel(corpus=logent_transformation[bow_corpus], id2word=dictionary,
# num_features=400)
# Can persist transformation models, too.
logent_transformation.save("logent.model")
lsi_transformation.save("lsi.model")
### Similarities (the best part)
from gensim.similarities import Similarity
# This index corpus consists of what you want to compare future queries against
index_documents = ["A bear walked in the dark forest.",
"Tall trees have many more leaves than short bushes.",
"A starship may someday travel across vast reaches of space to other stars.",
"Difference is the concept of how two or more entities are not the same."]
# A corpus can be anything, as long as iterating over it produces a representation of the corpus documents as vectors.
corpus = (dictionary.doc2bow(tokenize_func(document)) for document in index_documents)
index = Similarity(corpus=lsi_transformation[logent_transformation[corpus]], num_features=400, output_prefix="shard")
print "Index corpus:"
pprint.pprint(documents)
print "Similarities of index corpus documents to one another:"
pprint.pprint([s for s in index])
query = "In the face of ambiguity, refuse the temptation to guess."
sims_to_query = index[lsi_transformation[logent_transformation[dictionary.doc2bow(tokenize_func(query))]]]
print "Similarities of index corpus documents to '%s'" % query
pprint.pprint(sims_to_query)
best_score = max(sims_to_query)
index = sims_to_query.tolist().index(best_score)
most_similar_doc = documents[index]
print "The document most similar to the query is '%s' with a score of %.2f." % (most_similar_doc, best_score)
Where and how should I provide my own corpus in the code?
Thanks in advance for your help.
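For what it's worth, a minimal sketch of how one's own documents could be supplied at the YOUR_CORPUS placeholder, following the docstring quoted in the code above (which says a filename or file-like object works); the filename my_documents.txt is hypothetical, with one document per line assumed:

from gensim.corpora import TextCorpus

# TextCorpus accepts a filename or file-like object and builds self.dictionary from it
background_corpus = TextCorpus(input="my_documents.txt")
background_corpus.dictionary.save("my_dict.dict")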
I am using the Gensim HDP module on a set of documents.
>>> hdp = models.HdpModel(corpusB, id2word=dictionaryB)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> hdp = models.HdpModel(corpusA, id2word=dictionaryA)
>>> topics = hdp.print_topics(topics=-1, topn=20)
>>> len(topics)
150
>>> len(corpusA)
1113
>>> len(corpusB)
17
Why is the number of topics independent of corpus length?
@Aaron's code above is broken due to gensim API changes. I rewrote and simplified it as follows. It works as of June 2017 with gensim v2.1.0:
import pandas as pd

def topic_prob_extractor(gensim_hdp):
    shown_topics = gensim_hdp.show_topics(num_topics=-1, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
@Aaron's and @Roko Mijic's approaches neglect the fact that show_topics returns, by default, only the top 20 words of each topic. If one returns all the words that compose a topic, all the approximated topic probabilities in that case will be 1 (or 0.999999). I experimented with the following code, which is an adaptation of @Roko Mijic's:
def topic_prob_extractor(gensim_hdp, t=-1, w=25, isSorted=True):
    """
    Input the gensim model to get the rough topics' probabilities
    """
    shown_topics = gensim_hdp.show_topics(num_topics=t, num_words=w, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    if isSorted:
        return pd.DataFrame({'topic_id': topics_nos, 'weight': weights}).sort_values(by="weight", ascending=False)
    else:
        return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
A better, yet I'm not sure if 100% valid, approach is the one mentioned here. You can get the topics' true weights (alpha vector) of the HDP model as:
alpha = hdpModel.hdp_to_lda()[0]
Examining the topics' equivalent alpha values is more logical than tallying up the weights of the first 20 words of each topic to approximate its probability of usage in the data.
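For illustration, a small sketch (assuming hdpModel is a trained gensim HdpModel) that turns the alpha vector into a sorted table of topic weights; normalising by the sum is just for readability:

import pandas as pd

alpha = hdpModel.hdp_to_lda()[0]  # per-topic weights from the HDP-to-LDA approximation
topic_weights = pd.DataFrame({'topic_id': range(len(alpha)), 'weight': alpha / alpha.sum()})
print(topic_weights.sort_values(by='weight', ascending=False).head(10))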
There is apparently a bug in Gensim (version 3.8.3) in which passing -1 to show_topics doesn't return anything at all, so I have tweaked the answers by Roko Mijic and Aaron.
def topic_prob_extractor(gensim_hdp):
    shown_topics = gensim_hdp.show_topics(num_topics=gensim_hdp.m_T, formatted=False)
    topics_nos = [x[0] for x in shown_topics]
    weights = [sum([item[1] for item in shown_topics[topicN][1]]) for topicN in topics_nos]
    return pd.DataFrame({'topic_id': topics_nos, 'weight': weights})
@user3907335 is exactly correct here: HDP will calculate as many topics as the assigned truncation level. However, it may be the case that many of these topics have essentially zero probability of occurring. To help with this in my own work, I wrote a handy little function that performs a rough estimate of the probability weight associated with each topic. Note that this is a rough metric only: it does not account for the probability associated with each word. Even so, it provides a pretty good indication of which topics are meaningful and which aren't:
import pandas as pd
import numpy as np

def topic_prob_extractor(hdp=None, topn=None):
    topic_list = hdp.show_topics(topics=-1, topn=topn)
    topics = [int(x.split(':')[0].split(' ')[1]) for x in topic_list]
    split_list = [x.split(' ') for x in topic_list]
    weights = []
    for lst in split_list:
        sub_list = []
        for entry in lst:
            if '*' in entry:
                sub_list.append(float(entry.split('*')[0]))
        weights.append(np.asarray(sub_list))
    sums = [np.sum(x) for x in weights]
    return pd.DataFrame({'topic_id': topics, 'weight': sums})
I assume that you already know how to calculate an HDP model. Once you have an hdp model calculated by gensim you call the function as follows:
topic_weights = topic_prob_extractor(hdp, 500)
I think you misunderstood the operation performed by the called method. Directly from the documentation you can see:
Alias for show_topics() that prints the top n most probable words for topics number of topics to log. Set topics=-1 to print all topics.
You trained the model without specifying the truncation level on the number of topics, and the default is 150. Calling print_topics with topics=-1, you'll get the top 20 words for each topic, in your case 150 topics.
I'm still a newbie with the library, so maybe I'm wrong.
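For reference, a minimal sketch of setting a lower truncation level at training time (T is HdpModel's top-level truncation parameter; 50 is just an example value, and corpusA / dictionaryA are the names from the question):

from gensim.models import HdpModel

# cap the number of candidate topics at 50 instead of the default 150
hdp = HdpModel(corpus=corpusA, id2word=dictionaryA, T=50)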
I haven't used gensim for HDPs, but is it possible that most of the topics in the smaller corpus have an extremely low probability of occurring? Can you try printing the topic probabilities? Maybe the length of the topics array doesn't necessarily mean that all those topics were actually found in the corpus.
Deriving the average coherence of HDP topics from their coherence at the individual text level is a way to order (and potentially truncate) them. The following function does just that:
def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
    dirichlet_model : gensim.models.hdpmodel.HdpModel
    bow_corpus : list of lists (contains (id, freq) tuples)
    num_topics : int (default=10)
    num_keywords : int (default=10)

    Returns
    -------
    ordered_topics : list of lists containing topic tokens
    """
    shown_topics = dirichlet_model.show_topics(num_topics=150,  # return all topics
                                               num_words=num_keywords,
                                               formatted=False)
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0)  # cutoff probability to 0

    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences]))  # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus)
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i: i[1])[::-1]]
    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics]  # truncate if desired

    return ordered_topics
A version of this function that also outputs the average coherences associated with the topics, for keyword (tag) generation for a corpus, can be found in this answer. A similar process for keywords for individual texts can be found in this answer.
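A possible usage sketch, assuming hdp_model is a trained HdpModel and bow_corpus is the bag-of-words corpus it was trained on:

ordered_topics = order_subset_by_coherence(hdp_model, bow_corpus, num_topics=5, num_keywords=10)
for i, topic in enumerate(ordered_topics):
    print("Topic {}: {}".format(i, ", ".join(topic)))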
I use gensim to build a dictionary from a collection of documents. Each document is a list of tokens. This is my code:
def constructModel(self, docTokens):
    """ Given document tokens, constructs the tf-idf and similarity models"""

    # construct dictionary for the BOW (vector-space) model:
    # Dictionary = a mapping between words and their integer ids = collection of (word_index, word_string) pairs
    #print "dictionary"
    self.dictionary = corpora.Dictionary(docTokens)

    # prune dictionary: remove words that appear too infrequently or too frequently
    print "dictionary size before filter_extremes:", self.dictionary  # len(self.dictionary.values())
    #self.dictionary.filter_extremes(no_below=1, no_above=0.9, keep_n=100000)
    #self.dictionary.compactify()
    print "dictionary size after filter_extremes:", self.dictionary

    # construct the corpus bow vectors; bow vector = collection of (word_id, word_frequency) pairs
    corpus_bow = [self.dictionary.doc2bow(doc) for doc in docTokens]

    # construct the tf-idf model
    self.model = models.TfidfModel(corpus_bow, normalize=True)
    corpus_tfidf = self.model[corpus_bow]  # first transform each raw bow vector in the corpus to the tfidf model's vector space
    self.similarityModel = similarities.MatrixSimilarity(corpus_tfidf)  # construct the term-document index
My question is how to add a new doc (tokens) to this dictionary and update it. I searched the gensim documentation but didn't find a solution.
There is documentation for how to do this on the gensim webpage here.
The way to do it is to create another dictionary with the new documents and then merge them.
from gensim import corpora
dict1 = corpora.Dictionary(firstDocs)
dict2 = corpora.Dictionary(moreDocs)
dict1.merge_with(dict2)
According to the docs, this will map "same tokens to the same ids and new tokens to new ids".
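As a follow-up sketch (using the firstDocs / moreDocs names from above), the merged dictionary can then be used to vectorise the new documents, e.g. before rebuilding a tf-idf model:

# BoW vectors for the new documents, using ids from the merged dictionary
new_bow = [dict1.doc2bow(doc) for doc in moreDocs]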
You can use the add_documents method:
from gensim import corpora
text = [["aaa", "aaa"]]
dictionary = corpora.Dictionary(text)
dictionary.add_documents([['bbb','bbb']])
print(dictionary)
After running the code above, you will get this:
Dictionary(2 unique tokens: ['aaa', 'bbb'])
Read the documentation for more details.
METHOD 1:
You can just use keyedvectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)
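A self-contained version of METHOD 1 as a sketch (the two words and their random 50-dimensional vectors are made up for illustration; this assumes the gensim 3.x API):

import numpy as np
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors

new_words = ['cat', 'dog']              # hypothetical vocabulary to add
their_new_vecs = np.random.rand(2, 50)  # one 50-dimensional vector per word

w2v = WordEmbeddingsKeyedVectors(50)    # 50 = vec length
w2v.add(new_words, their_new_vecs)
print(w2v['cat'][:5])                   # first few components of the stored vector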
METHOD 2:
And if you have already built a model using gensim.models.Word2Vec, you can just do this. Suppose I want to add the token <UNK> with a random vector.
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)