SciSpacy equivalent of Gensim's functions/parameters

With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus,size=100,min_count=5)
This returns a model of embeddings (too large to include here), but I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (the size parameter) or the minimum number of times a word must appear in the corpus (min_count) in SciSpacy (e.g. I looked here and here), so I'm not sure if I'm missing them.

A possible way to achieve your goal would be to:
parse your documents via nlp.pipe
collect all the words and pairwise similarities
process similarities to get the desired results
Let's prepare some data:
import spacy
nlp = spacy.load("en_core_web_md", disable = ['ner', 'tagger', 'parser'])
Then, to get a vector like the one from model.wv['car'], one would do:
nlp("car").vector
To get most similar words like model.most_similar(positive=['car']) let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence aboout train"
, "And this is a sentence about a bike"]
import numpy as np

docs = nlp.pipe(corpus)
tokens = []
tokens_orth = []
for doc in docs:
    for tok in doc:
        if tok.orth_ not in tokens_orth:
            tokens.append(tok)
            tokens_orth.append(tok.orth_)

# pairwise similarities between all collected tokens
sims = np.zeros((len(tokens), len(tokens)))
for i, tok in enumerate(tokens):
    sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth=tokens_orth, sims=sims, top=3):
    tokens_orth = np.array(tokens_orth)
    id_word = np.where(tokens_orth == word)[0][0]
    sim = sims[id_word]
    id_ms = np.argsort(sim)[:-top-1:-1]
    return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked about specifying the dimension and frequency. The embedding length is fixed at the time the model is initialized, so it can't be changed after that. You can start from a blank model if you wish and feed in embeddings you're comfortable with. As for the frequency, it is doable by counting all the words and throwing away anything below the desired threshold. But again, the underlying embeddings will still have been trained on unfiltered text: SpaCy differs from Gensim in that it uses readily available embeddings, whereas Gensim trains them.
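For that frequency filtering, a rough sketch (untested; it reuses the nlp and corpus objects defined above, and the threshold is just illustrative) could be:
from collections import Counter

min_count = 5  # hypothetical threshold, mirroring Gensim's min_count
counts = Counter(tok.orth_ for doc in nlp.pipe(corpus) for tok in doc)
frequent_words = [w for w, c in counts.items() if c >= min_count]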

Related

Topic wise document distribution in Gensim LDA

Is there a way in Python to map documents belonging to a certain topic? For example, a list of documents that are primarily "Topic 0". I know there are ways to list the topics for each document, but how do I do it the other way around?
Edit:
I am using the following script for LDA:
import os
import gensim
import textract
from gensim import corpora
# (tokenizer, stopwords and p_stemmer are defined elsewhere in my script,
#  as are files and my_path)

doc_set = []
for file in files:
    newpath = os.path.join(my_path, file)
    newpath1 = textract.process(newpath)
    newpath2 = newpath1.decode("utf-8")
    doc_set.append(newpath2)

texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if i not in stopwords.words()]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0,
                                           id2word=dictionary, passes=1)
You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.
But you want the reverse: a list of documents, for a topic.
Essentially, you'll want to build the reverse-mapping yourself.
Fortunately, Python's native dicts & idioms for working with mappings make this pretty simple - just a few lines of code - as long as you're working with data that fully fits in memory.
Very roughly the approach would be:
create a new structure (dict or list) for mapping topics to lists-of-documents
iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
finally, look up (& perhaps sort) those lists-of-docs, for each topic of interest
If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse-mapping you'd need.
Update for your code update:
OK, if your model is in ldamodel and your BOW-formatted docs in corpus, you'd do something like:
# setup: get the model's topics in their native ordering...
all_topics = ldamodel.print_topics()
# ...then create an empty list per topic to collect the docs:
docs_per_topic = [[] for _ in all_topics]
# now, for every doc...
for doc_id, doc_bow in enumerate(corpus):
    # ...get its topics...
    doc_topics = ldamodel.get_document_topics(doc_bow)
    # ...& for each of its topics...
    for topic_id, score in doc_topics:
        # ...add the doc_id & its score to the topic's doc list
        docs_per_topic[topic_id].append((doc_id, score))
After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):
print(docs_per_topic[0])
If you're interested in the top docs per topic, you can further sort each list's pairs by their score:
for doc_list in docs_per_topic:
    doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)
Then, you could get the top-10 docs for topic 0 like:
print(docs_per_topic[0][:10])
Note that this does everything using all-in-memory lists, which might become impractical for very-large corpuses. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
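For example, a rough sketch of the per-topic files option (untested, with hypothetical file names) could be:
import json

# spill each topic's (doc_id, score) list to its own file instead of keeping it all in memory
for topic_id, doc_list in enumerate(docs_per_topic):
    with open("topic_%d_docs.json" % topic_id, "w") as f:
        json.dump([[int(doc_id), float(score)] for doc_id, score in doc_list], f)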

Gensim's `model.wv.most_similar` returns phonologically similar words

gensim's wv.most_similar returns phonologically close words (similar sounds) instead of semantically similar ones. Is this normal? Why might this happen?
Here's the documentation on most_similar: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
In [144]: len(vectors.vocab)
Out[144]: 32966
...
In [140]: vectors.most_similar('fight')
Out[140]:
[('Night', 0.9940935373306274),
('knight', 0.9928507804870605),
('fright', 0.9925899505615234),
('light', 0.9919329285621643),
('bright', 0.9914385080337524),
('plight', 0.9912853240966797),
('Eight', 0.9912533760070801),
('sight', 0.9908033013343811),
('playwright', 0.9905624985694885),
('slight', 0.990411102771759)]
In [141]: vectors.most_similar('care')
Out[141]:
[('spare', 0.9710584878921509),
('scare', 0.9626247882843018),
('share', 0.9594929218292236),
('prepare', 0.9584596157073975),
('aware', 0.9551078081130981),
('negare', 0.9550014138221741),
('glassware', 0.9507938027381897),
('Welfare', 0.9489598274230957),
('warfare', 0.9487678408622742),
('square', 0.9473209381103516)]
The training data contains academic papers and this was my training script:
from gensim.models.fasttext import FastText as FT_gensim
import gensim.models.keyedvectors as word2vec
dim_size = 300
epochs = 10
model = FT_gensim(size=dim_size, window=3, min_count=1)
model.build_vocab(sentences=corpus_reader, progress_per=1000)
model.train(sentences=corpus_reader, total_examples=total_examples, epochs=epochs)
# saving vectors to disk
path = "/home/ubuntu/volume/my_vectors.vectors"
model.wv.save_word2vec_format(path, binary=True)
# loading vectors
vectors = word2vec.KeyedVectors.load_word2vec_format(path)
You've chosen to use the FastText algorithm to train your vectors. That algorithm specifically makes use of subword fragments (like 'ight' or 'are') to have a chance of synthesizing good guess-vectors for 'out-of-vocabulary' words that weren't in the training set, and that could be one contributor to the results you're seeing.
However, usually words' unique meanings predominate, with the influence of such subwords only coming into play for unknown words. And, it's rare for the most-similar lists of any words in a healthy set of word-vectors to have so many 0.99+ similarities.
So, I suspect there's something weird or deficient in your training data.
What kind of text is it, and how many total words of example usages does it contain?
Were there any perplexing aspects of training progress/speed shown in INFO-level logs during training?
(300 dimensions may also be a bit excessive with a vocabulary of only 33K unique words; that's a vector-size that's common in work with hundreds of thousands to millions of unique words, and plentiful training data.)
That's a good call-out on the dimension size. Reducing that param definitely did make a difference.
1. Reproducing the original behavior (where dim_size=300) with a larger corpus (33k --> 275k unique vocab):
(Note: I've also tweaked a few other params, like min_count, window, etc.)
from gensim.models.fasttext import FastText as FT_gensim
fmodel0 = FT_gensim(size=300, window=5, min_count=3, workers=10)  # window is the maximum distance between the current and predicted word within a sentence
fmodel0.build_vocab(sentences=corpus)
fmodel0.train(sentences=corpus, total_examples=fmodel0.corpus_count, epochs=5)
fmodel0.wv.vocab['cancer'].count # number of times the word occurred in the corpus
fmodel0.wv.most_similar('cancer')
fmodel0.wv.most_similar('care')
fmodel0.wv.most_similar('fight')
# -----------
# cancer
[('breastcancer', 0.9182084798812866),
('noncancer', 0.9133851528167725),
('skincancer', 0.898530900478363),
('cancerous', 0.892244279384613),
('cancers', 0.8634265065193176),
('anticancer', 0.8527657985687256),
('Cancer', 0.8359113931655884),
('lancer', 0.8296531438827515),
('Anticancer', 0.826178252696991),
('precancerous', 0.8116365671157837)]
# care
[('_care', 0.9151567816734314),
('încălcare', 0.874087929725647),
('Nexcare', 0.8578598499298096),
('diacare', 0.8515325784683228),
('încercare', 0.8445525765419006),
('fiecare', 0.8335763812065125),
('Mulcare', 0.8296753168106079),
('Fiecare', 0.8292017579078674),
('homecare', 0.8251558542251587),
('carece', 0.8141698837280273)]
# fight
[('Ifight', 0.892048180103302),
('fistfight', 0.8553390502929688),
('dogfight', 0.8371964693069458),
('fighter', 0.8167843818664551),
('bullfight', 0.8025394678115845),
('gunfight', 0.7972971200942993),
('fights', 0.790093183517456),
('Gunfight', 0.7893823385238647),
('fighting', 0.775499701499939),
('Fistfight', 0.770946741104126)]
2. Reducing the dimension size to 5:
_fmodel = FT_gensim(size=5, window=5, min_count=3, workers=10)
_fmodel.build_vocab(sentences=corpus)
_fmodel.train(sentences=corpus, total_examples=_fmodel.corpus_count, epochs=5) # workers is specified in the constructor
_fmodel.wv.vocab['cancer'].count # number of times the word occurred in the corpus
_fmodel.wv.most_similar('cancer')
_fmodel.wv.most_similar('care')
_fmodel.wv.most_similar('fight')
# cancer
[('nutrient', 0.999614417552948),
('reuptake', 0.9987781047821045),
('organ', 0.9987629652023315),
('tracheal', 0.9985960721969604),
('digestion', 0.9984923601150513),
('cortes', 0.9977986812591553),
('liposomes', 0.9977765679359436),
('adder', 0.997713565826416),
('adrenals', 0.9977011680603027),
('digestive', 0.9976763129234314)]
# care
[('lappropriate', 0.9990135431289673),
('coping', 0.9984776973724365),
('promovem', 0.9983049035072327),
('requièrent', 0.9982239603996277),
('diverso', 0.9977829456329346),
('feebleness', 0.9977156519889832),
('pathetical', 0.9975940585136414),
('procure', 0.997504472732544),
('delinking', 0.9973599910736084),
('entonces', 0.99733966588974)]
# fight
[('decied', 0.9996457099914551),
('uprightly', 0.999250054359436),
('chillies', 0.9990670680999756),
('stuttered', 0.998710036277771),
('cries', 0.9985755681991577),
('famish', 0.998246431350708),
('immortalizes', 0.9981046915054321),
('misled', 0.9980905055999756),
('whore', 0.9980045557022095),
('chanted', 0.9978444576263428)]
It's not GREAT, but it's no longer returning words that merely contain the subwords.
3. And for good measure, benchmark against Word2Vec:
from gensim.models.word2vec import Word2Vec
wmodel300 = Word2Vec(corpus, size=300, window=5, min_count=2, workers=10)
wmodel300.total_train_time # 187.1828162111342
wmodel300.wv.most_similar('cancer')
[('cancers', 0.6576876640319824),
('melanoma', 0.6564366817474365),
('malignancy', 0.6342018842697144),
('leukemia', 0.6293295621871948),
('disease', 0.6270142197608948),
('adenocarcinoma', 0.6181445121765137),
('Cancer', 0.6010828614234924),
('tumors', 0.5926551222801208),
('carcinoma', 0.5917977094650269),
('malignant', 0.5778893828392029)]
^ Better captures distributional similarity + much more realistic similarity measures.
But with a smaller dim_size, the result is somewhat worse (also the similarities are less realistic, all around .99):
wmodel5 = Word2Vec(corpus, size=5, window=5, min_count=2, workers=10)
wmodel5.total_train_time # 151.4945764541626
wmodel5.wv.most_similar('cancer')
[('insulin', 0.9990534782409668),
('reaction', 0.9970406889915466),
('embryos', 0.9970351457595825),
('antibiotics', 0.9967449903488159),
('supplements', 0.9962579011917114),
('synthesize', 0.996055543422699),
('allergies', 0.9959680438041687),
('gadgets', 0.9957243204116821),
('mild', 0.9953152537345886),
('asthma', 0.994774580001831)]
Therefore, increasing the dimension size seems to help Word2Vec, but not fastText...
I'm sure this contrast has to do with the fact that the fastText model is learning subword info and somehow that's interacting with the param in a way increasing its value is hurtful. But I'm not sure how exactly... I'm trying to reconcile this finding with the intuition that increasing the size of the vectors should help in general because larger vectors capture more information.
I had the same issue with a corpus of 366k words. I think the problem lies in the min_n and max_n parameters. Try using
word_ngrams = 0
which is equivalent to word2vec according to the documentation. Or try setting min_n and max_n to bigger values.
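A rough sketch of both suggestions, reusing the gensim-3.x-style constructor from the question (the variable names and parameter values here are only illustrative):
from gensim.models.fasttext import FastText as FT_gensim

# word_ngrams=0 turns off character n-grams, which the docs describe as word2vec-equivalent
model_no_subwords = FT_gensim(size=300, window=3, min_count=1, word_ngrams=0)
# ...or keep subwords but only learn longer character n-grams
model_long_ngrams = FT_gensim(size=300, window=3, min_count=1, min_n=5, max_n=6)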

how to get words of clusters

How can I get the words of each cluster?
I divided my documents into groups as follows:
import gensim
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
# (train is assumed to be a pandas DataFrame with a 'KARMA' column)

LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content_train = []
j = 0
for em in train['KARMA'].values:
    all_content_train.append(LabeledSentence1(em, [j]))
    j += 1
print('Number of texts processed: ', j)

d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500,
                    workers=7, dm=1, alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10,
                start_alpha=0.002, end_alpha=-0.016)

kmeans_model = KMeans(n_clusters=10, init='k-means++', max_iter=100)
X = kmeans_model.fit(d2v_model.docvecs.doctag_syn0)
labels = kmeans_model.labels_.tolist()
l = kmeans_model.fit_predict(d2v_model.docvecs.doctag_syn0)
pca = PCA(n_components=2).fit(d2v_model.docvecs.doctag_syn0)
datapoint = pca.transform(d2v_model.docvecs.doctag_syn0)
I can get each text and its cluster, but how can I learn which words mainly created those groups?
It's not an inherent feature of Doc2Vec to list words most-related to any document or doc-vector. (Other algorithms, such as LDA, will offer that.)
So, you could potentially write your own code, once you've split your documents into clusters, to report the words that are "most over-represented" in each cluster.
For example, calculate every word's frequency in the entire corpus, then each word's frequency in each cluster. For each cluster, report the N words whose in-cluster-frequency is the largest multiple of the full-corpus-frequency. Would this give helpful results on your data, for your needs? You'd have to try it.
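A minimal sketch of that idea (untested; it assumes you already have each cluster's documents as plain lists of words):
from collections import Counter

def overrepresented_words(cluster_docs, all_docs, top_n=10):
    # cluster_docs / all_docs: tokenized documents (lists of words)
    corpus_counts = Counter(w for doc in all_docs for w in doc)
    cluster_counts = Counter(w for doc in cluster_docs for w in doc)
    corpus_total = sum(corpus_counts.values())
    cluster_total = sum(cluster_counts.values())
    # ratio of each word's in-cluster frequency to its full-corpus frequency
    ratios = {w: (c / cluster_total) / (corpus_counts[w] / corpus_total)
              for w, c in cluster_counts.items()}
    return sorted(ratios, key=ratios.get, reverse=True)[:top_n]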
Separately, regarding your use of Doc2Vec:
there's no good reason to alias the existing class TaggedDocument to a strange class name like LabeledSentence1. Just use TaggedDocument directly.
if you supply your corpus, all_content_train, to the object-initialization – as your code does – then you don't need to also call train(). Training will have already happened automatically. If you do want more than the default amount of training (epochs=5), just supply a larger epochs value to the initialization (see the sketch below).
the learning-rate values you've supplied to train() – start_alpha=0.002, end_alpha=-0.016 – are nonsensical & destructive. Few users should need to tinker with these alpha values at all, but especially, alpha should never increase or go negative over the course of a training cycle, and the negative end_alpha here does exactly that.
If you were running with logging enabled at the INFO level, and/or watching the output closely, you would likely see readouts and warnings indicating that excessive training was happening, or problematic values used.
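Putting that advice together, a minimal sketch (untested, reusing the question's own parameter values) would drop the explicit train() call and the custom alpha schedule:
from gensim.models.doc2vec import Doc2Vec

# training happens inside the constructor when a corpus is supplied
d2v_model = Doc2Vec(all_content_train, vector_size=100, window=10, min_count=500,
                    workers=7, dm=1, epochs=10)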

How to call a corpus file in python?

I am currently working on a gensim doc2vec model to implement sentence similarity.
I came across this sample code by William Bert where he has mentioned that to train this model I need to provide my own background corpus. The code is copied below for convenience:
import logging, sys, pprint
logging.basicConfig(stream=sys.stdout, level=logging.INFO)
### Generating a training/background corpus from your own source of documents
from gensim.corpora import TextCorpus, MmCorpus, Dictionary
# gensim docs: "Provide a filename or a file-like object as input and TextCorpus will be initialized with a
# dictionary in `self.dictionary` and will support the `iter` corpus method. For other kinds of corpora, you only
# need to override `get_texts` and provide your own implementation."
background_corpus = TextCorpus(input=YOUR_CORPUS)
# Important -- save the dictionary generated by the corpus, or future operations will not be able to map results
# back to original words.
background_corpus.dictionary.save("my_dict.dict")
MmCorpus.serialize("background_corpus.mm", background_corpus)  # Uses numpy to persist wiki corpus in Matrix Market format. File will be several GBs.
### Generating a large training/background corpus using Wikipedia
from gensim.corpora import WikiCorpus, wikicorpus
articles = "enwiki-latest-pages-articles.xml.bz2" # available from http://en.wikipedia.org/wiki/Wikipedia:Database_download
# This will take many hours! Output is Wikipedia in bag-of-words (BOW) sparse matrix.
wiki_corpus = WikiCorpus(articles)
wiki_corpus.dictionary.save("wiki_dict.dict")
MmCorpus.serialize("wiki_corpus.mm", wiki_corpus) # File will be several GBs.
### Working with persisted corpus and dictionary
bow_corpus = MmCorpus("wiki_corpus.mm") # Revive a corpus
dictionary = Dictionary.load("wiki_dict.dict") # Load a dictionary
### Transformations among vector spaces
from gensim.models import LsiModel, LogEntropyModel
logent_transformation = LogEntropyModel(wiki_corpus, id2word=dictionary)  # Log Entropy weights frequencies of all document features in the corpus
tokenize_func = wikicorpus.tokenize # The tokenizer used to create the Wikipedia corpus
document = "Some text to be transformed."
# First, tokenize document using the same tokenization as was used on the background corpus, and then convert it to
# BOW representation using the dictionary created when generating the background corpus.
bow_document = dictionary.doc2bow(tokenize_func(document))
# converts a single document to log entropy representation. document must be in the same vector space as corpus.
logent_document = logent_transformation[[bow_document]]
# Transform arbitrary documents by getting them into the same BOW vector space created by your training corpus
documents = ["Some iterable", "containing multiple", "documents", "..."]
bow_documents = (dictionary.doc2bow(tokenize_func(document)) for document in documents)  # use a generator expression because...
logent_documents = logent_transformation[bow_documents]  # ...transformation is done during iteration of documents using generators, so this uses constant memory
### Chained transformations
# This builds a new corpus from iterating over documents of bow_corpus as transformed to log entropy representation.
# Will also take many hours if bow_corpus is the Wikipedia corpus created above.
logent_corpus = MmCorpus(corpus=logent_transformation[bow_corpus])
# Creates LSI transformation model from log entropy corpus representation. Takes several hours with Wikipedia corpus.
lsi_transformation = LsiModel(corpus=logent_corpus, id2word=dictionary, num_features=400)
# Alternative way of performing same operation as above, but with implicit chaining
# lsi_transformation = LsiModel(corpus=logent_transformation[bow_corpus], id2word=dictionary,
# num_features=400)
# Can persist transformation models, too.
logent_transformation.save("logent.model")
lsi_transformation.save("lsi.model")
### Similarities (the best part)
from gensim.similarities import Similarity
# This index corpus consists of what you want to compare future queries against
index_documents = ["A bear walked in the dark forest.",
                   "Tall trees have many more leaves than short bushes.",
                   "A starship may someday travel across vast reaches of space to other stars.",
                   "Difference is the concept of how two or more entities are not the same."]
# A corpus can be anything, as long as iterating over it produces a representation of the corpus documents as vectors.
corpus = (dictionary.doc2bow(tokenize_func(document)) for document in index_documents)
index = Similarity(corpus=lsi_transformation[logent_transformation[corpus]], num_features=400, output_prefix="shard")
print "Index corpus:"
pprint.pprint(index_documents)
print "Similarities of index corpus documents to one another:"
pprint.pprint([s for s in index])
query = "In the face of ambiguity, refuse the temptation to guess."
sims_to_query = index[lsi_transformation[logent_transformation[dictionary.doc2bow(tokenize_func(query))]]]
print "Similarities of index corpus documents to '%s'" % query
pprint.pprint(sims_to_query)
best_score = max(sims_to_query)
index = sims_to_query.tolist().index(best_score)
most_similar_doc = index_documents[index]
print "The document most similar to the query is '%s' with a score of %.2f." % (most_similar_doc, best_score)
Where and how should I provide my own corpus in the code?
Thanks in advance for your help.
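For what it's worth, the gensim docstring quoted in the sample says to provide a filename or a file-like object as input, so a minimal sketch (with a hypothetical path) would be:
from gensim.corpora import TextCorpus

# "my_documents.txt" is a hypothetical plain-text file; TextCorpus treats each line as one document
background_corpus = TextCorpus(input="my_documents.txt")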

Python nltk classify with large feature set (Replicate Go Et Al 2009)

I'm trying to replicate the Go et al. Twitter sentiment analysis, which can be found here: http://help.sentiment140.com/for-students
The problem I'm having is that the number of features is 364,464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds the 1,600,000 tweets and their polarity:
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Nothing takes very long apart from the extract_features function:
def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
This is because for each tweet it's creating a dictionary of size 364,464 to represent whether something is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find here: http://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function but it now works with the huge sizes without memory issues.
Here's a lowdown of the function description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
    tweet_words = set(tweet)
    global featureList
    features = {}
    for word in featureList:
        features[word] = False
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features
