1. For the test text below,
test=['test test', 'test toy']
the tf-idf scores [without normalisation (smartirs: 'ntn')] are
[['test', 1.17]]
[['test', 0.58], ['toy', 1.58]]
This doesn't seem to tally with what I get via direct computation of
tfidf(w, d) = tf x idf
where idf(term) = log(total number of documents / number of documents containing the term)
tf = number of occurrences of the word in document d / total number of words in document d
Eg
doc 1: 'test test'
for "test" word
tf= 1
idf= log(2/2) = 0
tf-idf = 0
Can someone show me the computation using my above test text?
2. When I change to cosine normalisation (smartirs: 'ntc'), I get
[['test', 1.0]]
[['test', 0.35], ['toy', 0.94]]
Can someone show me the computation too?
Thank you
import gensim
from gensim import corpora
from gensim import models
import numpy as np
from gensim.utils import simple_preprocess
test=['test test', 'test toy']
texts = [simple_preprocess(doc) for doc in test]
mydict= corpora.Dictionary(texts)
mycorpus = [mydict.doc2bow(doc, allow_update=True) for doc in texts]
tfidf = models.TfidfModel(mycorpus, smartirs='ntn')
for doc in tfidf[mycorpus]:
    print([[mydict[id], np.around(freq, decimals=2)] for id, freq in doc])
If you want to know the details of the implementation of models.TfidfModel, you can check them directly in the GitHub repository for gensim. The particular calculation scheme corresponding to smartirs='ntn' is described on the Wikipedia page for the SMART Information Retrieval System, and the exact calculations differ from the ones you use, hence the difference in the results.
E.g. the particular discrepancy you are referring to:
idf= log(2/2) = 0
should actually be idf = log2((N+1)/n_k), with tf taken as the raw term count rather than a ratio:
idf = log2((2+1)/2) ≈ 0.58, so tf-idf for 'test' in doc 1 is 2 × 0.58 ≈ 1.17
I suggest that you review both the implementation and the mentioned page to ensure your manual checks follow the implementation for the chosen smartirs flags.
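As a cross-check, here is a minimal sketch of the manual computation, assuming raw term counts for tf and idf = log2((N+1)/n_k) as described above; the last step reproduces the cosine-normalised 'ntc' numbers for the second document:
import numpy as np

N = 2                        # total number of documents
df = {'test': 2, 'toy': 1}   # number of documents containing each term
idf = {w: np.log2((N + 1) / n) for w, n in df.items()}

# 'ntn': raw count * idf, no normalisation
doc1 = {'test': 2}
doc2 = {'test': 1, 'toy': 1}
ntn1 = {w: c * idf[w] for w, c in doc1.items()}   # ≈ {'test': 1.17}
ntn2 = {w: c * idf[w] for w, c in doc2.items()}   # ≈ {'test': 0.58, 'toy': 1.58}

# 'ntc': same weights, then divide by the Euclidean norm of the document vector
norm2 = np.sqrt(sum(v ** 2 for v in ntn2.values()))
ntc2 = {w: v / norm2 for w, v in ntn2.items()}    # ≈ {'test': 0.35, 'toy': 0.94}
# doc 1 contains a single distinct term, so after normalisation its weight is exactly 1.0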
With Gensim, there are three functions I use regularly, for example this one:
model = gensim.models.Word2Vec(corpus,size=100,min_count=5)
This gives me output from gensim, but I cannot understand how to set the size and min_count parameters in the equivalent SciSpacy command of:
model = spacy.load('en_core_web_md')
(The output is a model of embeddings, too big to add here.)
This is another command I regularly use:
model.most_similar(positive=['car'])
and this is the output from gensim/Expected output from SciSpacy:
[('vehicle', 0.7857330441474915),
('motorbike', 0.7572781443595886),
('train', 0.7457204461097717),
('honda', 0.7383008003234863),
('volkswagen', 0.7298516035079956),
('mini', 0.7158907651901245),
('drive', 0.7093928456306458),
('driving', 0.7084407806396484),
('road', 0.7001082897186279),
('traffic', 0.6991947889328003)]
This is the third command I regularly use:
print(model.wv['car'])
Output from Gensim/Expected output from SciSpacy (in reality this vector is length 100):
[ 1.0942473 2.5680697 -0.43163642 -1.171171 1.8553845 -0.3164575
1.3645878 -0.5003705 2.912658 3.099512 2.0184739 -1.2413547
0.9156444 -0.08406237 -2.2248871 2.0038593 0.8751471 0.8953876
0.2207374 -0.157277 -1.4984075 0.49289042 -0.01171476 -0.57937795...]
Could someone show me the equivalent commands for SciSpacy? For example, for 'gensim.models.Word2Vec' I can't find how to specify the length of the vectors (size parameter), or the minimum number of times the word should be in the corpus (min_count) in SciSpacy (e.g. I looked here and here), but I'm not sure if I'm missing them?
A possible way to achieve your goal would be to:
parse your documents via nlp.pipe
collect all the words and pairwise similarities
process similarities to get the desired results
Let's prepare some data:
import spacy
import numpy as np

nlp = spacy.load("en_core_web_md", disable=['ner', 'tagger', 'parser'])
Then, to get a vector, like in model.wv['car'] one would do:
nlp("car").vector
To get most similar words like model.most_similar(positive=['car']) let's process the corpus:
corpus = ["This is a sentence about cars. This a sentence aboout train"
, "And this is a sentence about a bike"]
docs = nlp.pipe(corpus)
tokens = []
tokens_orth = []
for doc in docs:
    for tok in doc:
        if tok.orth_ not in tokens_orth:
            tokens.append(tok)
            tokens_orth.append(tok.orth_)
sims = np.zeros((len(tokens), len(tokens)))
for i, tok in enumerate(tokens):
    sims[i] = [tok.similarity(tok_) for tok_ in tokens]
Then to retrieve top=3 most similar words:
def most_similar(word, tokens_orth=tokens_orth, sims=sims, top=3):
    tokens_orth = np.array(tokens_orth)
    id_word = np.where(tokens_orth == word)[0][0]
    sim = sims[id_word]
    id_ms = np.argsort(sim)[:-top-1:-1]
    return list(zip(tokens_orth[id_ms], sim[id_ms]))
most_similar("This")
[('this', 1.0000001192092896), ('This', 1.0), ('is', 0.5970357656478882)]
PS
I have also noticed you asked about specifying the dimension and frequency. The embedding length is fixed when the model is initialised, so it can't be changed after that. You can start from a blank model if you wish, and feed it embeddings you're comfortable with. As for the frequency, it's doable by counting all the words and throwing away anything below the desired threshold, but the underlying embeddings will still have been trained on unfiltered text. SpaCy differs from Gensim in that it uses readily available embeddings, whereas Gensim trains them.
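For the frequency part, a minimal sketch of that counting-and-filtering idea, reusing the corpus and tokens from above (min_count here is just an illustrative threshold):
from collections import Counter

min_count = 5
counts = Counter(tok.orth_ for doc in nlp.pipe(corpus) for tok in doc)
frequent = {w for w, c in counts.items() if c >= min_count}
# keep only tokens whose surface form passes the threshold
filtered_tokens = [tok for tok in tokens if tok.orth_ in frequent]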
gensim's wv.most_similar returns phonologically close words (similar sounds) instead of semantically similar ones. Is this normal? Why might this happen?
Here's the documentation on most_similar: https://radimrehurek.com/gensim/models/keyedvectors.html#gensim.models.keyedvectors.WordEmbeddingsKeyedVectors.most_similar
In [144]: len(vectors.vocab)
Out[144]: 32966
...
In [140]: vectors.most_similar('fight')
Out[140]:
[('Night', 0.9940935373306274),
('knight', 0.9928507804870605),
('fright', 0.9925899505615234),
('light', 0.9919329285621643),
('bright', 0.9914385080337524),
('plight', 0.9912853240966797),
('Eight', 0.9912533760070801),
('sight', 0.9908033013343811),
('playwright', 0.9905624985694885),
('slight', 0.990411102771759)]
In [141]: vectors.most_similar('care')
Out[141]:
[('spare', 0.9710584878921509),
('scare', 0.9626247882843018),
('share', 0.9594929218292236),
('prepare', 0.9584596157073975),
('aware', 0.9551078081130981),
('negare', 0.9550014138221741),
('glassware', 0.9507938027381897),
('Welfare', 0.9489598274230957),
('warfare', 0.9487678408622742),
('square', 0.9473209381103516)]
The training data contains academic papers and this was my training script:
from gensim.models.fasttext import FastText as FT_gensim
import gensim.models.keyedvectors as word2vec
dim_size = 300
epochs = 10
model = FT_gensim(size=dim_size, window=3, min_count=1)
model.build_vocab(sentences=corpus_reader, progress_per=1000)
model.train(sentences=corpus_reader, total_examples=total_examples, epochs=epochs)
# saving vectors to disk
path = "/home/ubuntu/volume/my_vectors.vectors"
model.wv.save_word2vec_format(path, binary=True)
# loading vectors
vectors = word2vec.KeyedVectors.load_word2vec_format(path, binary=True)
You've chosen to use the FastText algorithm to train your vectors. That algorithm specifically makes use of subword fragments (like 'ight' or 'are') to have a chance of synthesizing good guess-vectors for 'out-of-vocabulary' words that weren't in the training set, and that could be one contributor to the results you're seeing.
However, usually words' unique meanings predominate, with the influence of such subwords only coming into play for unknown words. And, it's rare for the most-similar lists of any words in a healthy set of word-vectors to have so many 0.99+ similarities.
So, I suspect there's something weird or deficient in your training data.
What kind of text is it, and how many total words of example usages does it contain?
Were there any perplexing aspects of training progress/speed shown in INFO-level logs during training?
(300 dimensions may also be a bit excessive with a vocabulary of only 33K unique words; that's a vector-size that's common in work with hundreds of thousands to millions of unique words, and plentiful training data.)
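To see the subword mechanism directly, here is a small hedged check you could run against the trained FastText model (assuming a gensim 3.x-style model named model; 'fightish' is just a made-up probe word):
# 'fightish' never appears in the training data
'fightish' in model.wv.vocab        # False
model.wv['fightish']                # still returns a vector, assembled from character n-grams
model.wv.most_similar('fightish')   # neighbours tend to share n-grams such as 'fight' / 'ight'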
That's a good call-out on the dimension size. Reducing that param definitely did make a difference.
1. Reproducing the original behavior (where dim_size=300) with a larger corpus (33k --> 275k unique vocab):
(Note: I've also tweaked a few other params, like min_count, window, etc.)
from gensim.models.fasttext import FastText as FT_gensim
fmodel0 = FT_gensim(size=300, window=5, min_count=3, workers=10) # window is The maximum distance between the current and predicted word within a sentence.
fmodel0.build_vocab(sentences=corpus)
fmodel0.train(sentences=corpus, total_examples=fmodel0.corpus_count, epochs=5)
fmodel0.wv.vocab['cancer'].count # number of times the word occurred in the corpus
fmodel0.wv.most_similar('cancer')
fmodel0.wv.most_similar('care')
fmodel0.wv.most_similar('fight')
# -----------
# cancer
[('breastcancer', 0.9182084798812866),
('noncancer', 0.9133851528167725),
('skincancer', 0.898530900478363),
('cancerous', 0.892244279384613),
('cancers', 0.8634265065193176),
('anticancer', 0.8527657985687256),
('Cancer', 0.8359113931655884),
('lancer', 0.8296531438827515),
('Anticancer', 0.826178252696991),
('precancerous', 0.8116365671157837)]
# care
[('_care', 0.9151567816734314),
('încălcare', 0.874087929725647),
('Nexcare', 0.8578598499298096),
('diacare', 0.8515325784683228),
('încercare', 0.8445525765419006),
('fiecare', 0.8335763812065125),
('Mulcare', 0.8296753168106079),
('Fiecare', 0.8292017579078674),
('homecare', 0.8251558542251587),
('carece', 0.8141698837280273)]
# fight
[('Ifight', 0.892048180103302),
('fistfight', 0.8553390502929688),
('dogfight', 0.8371964693069458),
('fighter', 0.8167843818664551),
('bullfight', 0.8025394678115845),
('gunfight', 0.7972971200942993),
('fights', 0.790093183517456),
('Gunfight', 0.7893823385238647),
('fighting', 0.775499701499939),
('Fistfight', 0.770946741104126)]
2. Reducing the dimension size to 5:
_fmodel = FT_gensim(size=5, window=5, min_count=3, workers=10)
_fmodel.build_vocab(sentences=corpus)
_fmodel.train(sentences=corpus, total_examples=_fmodel.corpus_count, epochs=5) # workers is specified in the constructor
_fmodel.wv.vocab['cancer'].count # number of times the word occurred in the corpus
_fmodel.wv.most_similar('cancer')
_fmodel.wv.most_similar('care')
_fmodel.wv.most_similar('fight')
# cancer
[('nutrient', 0.999614417552948),
('reuptake', 0.9987781047821045),
('organ', 0.9987629652023315),
('tracheal', 0.9985960721969604),
('digestion', 0.9984923601150513),
('cortes', 0.9977986812591553),
('liposomes', 0.9977765679359436),
('adder', 0.997713565826416),
('adrenals', 0.9977011680603027),
('digestive', 0.9976763129234314)]
# care
[('lappropriate', 0.9990135431289673),
('coping', 0.9984776973724365),
('promovem', 0.9983049035072327),
('requièrent', 0.9982239603996277),
('diverso', 0.9977829456329346),
('feebleness', 0.9977156519889832),
('pathetical', 0.9975940585136414),
('procure', 0.997504472732544),
('delinking', 0.9973599910736084),
('entonces', 0.99733966588974)]
# fight
[('decied', 0.9996457099914551),
('uprightly', 0.999250054359436),
('chillies', 0.9990670680999756),
('stuttered', 0.998710036277771),
('cries', 0.9985755681991577),
('famish', 0.998246431350708),
('immortalizes', 0.9981046915054321),
('misled', 0.9980905055999756),
('whore', 0.9980045557022095),
('chanted', 0.9978444576263428)]
It's not GREAT, but it's no longer returning words that merely contain the subwords.
3. And for good measure, benchmark against Word2Vec:
from gensim.models.word2vec import Word2Vec
wmodel300 = Word2Vec(corpus, size=300, window=5, min_count=2, workers=10)
wmodel300.total_train_time # 187.1828162111342
wmodel300.wv.most_similar('cancer')
[('cancers', 0.6576876640319824),
('melanoma', 0.6564366817474365),
('malignancy', 0.6342018842697144),
('leukemia', 0.6293295621871948),
('disease', 0.6270142197608948),
('adenocarcinoma', 0.6181445121765137),
('Cancer', 0.6010828614234924),
('tumors', 0.5926551222801208),
('carcinoma', 0.5917977094650269),
('malignant', 0.5778893828392029)]
^ Better captures distributional similarity + much more realistic similarity measures.
But with a smaller dim_size, the result is somewhat worse (also the similarities are less realistic, all around .99):
wmodel5 = Word2Vec(corpus, size=5, window=5, min_count=2, workers=10)
wmodel5.total_train_time # 151.4945764541626
wmodel5.wv.most_similar('cancer')
[('insulin', 0.9990534782409668),
('reaction', 0.9970406889915466),
('embryos', 0.9970351457595825),
('antibiotics', 0.9967449903488159),
('supplements', 0.9962579011917114),
('synthesize', 0.996055543422699),
('allergies', 0.9959680438041687),
('gadgets', 0.9957243204116821),
('mild', 0.9953152537345886),
('asthma', 0.994774580001831)]
Therefore, increasing the dimension size seems to help Word2Vec, but not fastText...
I'm sure this contrast has to do with the fact that the fastText model is learning subword info, and somehow that's interacting with the param in a way that makes increasing its value hurtful. But I'm not sure how exactly... I'm trying to reconcile this finding with the intuition that increasing the size of the vectors should help in general, because larger vectors capture more information.
I had the same issue with a corpus of 366k words. I think the problem is in the min_n and max_n parameters. Try using
word_ngrams = 0
which is equivalent to word2vec according to the documentation. Or try setting min_n and max_n to bigger values; both options are sketched below.
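A minimal sketch of both options with gensim's FastText constructor (parameter names as in gensim 3.x; the other values are just placeholders):
from gensim.models.fasttext import FastText

# Option 1: disable character n-grams entirely (the model then behaves like plain Word2Vec)
model_plain = FastText(size=300, window=5, min_count=3, word_ngrams=0)

# Option 2: keep subword information but only use longer n-grams,
# so short fragments like 'ight' or 'are' no longer dominate similarity
model_long = FastText(size=300, window=5, min_count=3, min_n=5, max_n=8)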
How can I get the words of each cluster?
I divided the documents into groups as follows:
import gensim
from gensim.models.doc2vec import Doc2Vec
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

LabeledSentence1 = gensim.models.doc2vec.TaggedDocument
all_content_train = []
j = 0
for em in train['KARMA'].values:
    all_content_train.append(LabeledSentence1(em, [j]))
    j += 1
print('Number of texts processed: ', j)
d2v_model = Doc2Vec(all_content_train, vector_size = 100, window = 10, min_count = 500, workers=7, dm = 1,alpha=0.025, min_alpha=0.001)
d2v_model.train(all_content_train, total_examples=d2v_model.corpus_count, epochs=10, start_alpha=0.002, end_alpha=-0.016)
kmeans_model = KMeans(n_clusters=10, init='k-means++', max_iter=100)
X = kmeans_model.fit(d2v_model.docvecs.doctag_syn0)
labels=kmeans_model.labels_.tolist()
l = kmeans_model.fit_predict(d2v_model.docvecs.doctag_syn0)
pca = PCA(n_components=2).fit(d2v_model.docvecs.doctag_syn0)
datapoint = pca.transform(d2v_model.docvecs.doctag_syn0)
I can get the text and its cluster, but how can I learn the words which mainly created those groups?
It's not an inherent feature of Doc2Vec to list words most-related to any document or doc-vector. (Other algorithms, such as LDA, will offer that.)
So, you could potentially write your own code, once you've split your documents into clusters, to report the words that are "most over-represented" in each cluster.
For example, calculate every word's frequency in the entire corpus, then each word's frequency in each cluster. For each cluster, report the N words whose in-cluster-frequency is the largest multiple of the full-corpus-frequency. Would this give helpful results on your data, for your needs? You'd have to try it.
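A rough sketch of that idea, assuming texts is a list of tokenized documents and labels holds the KMeans cluster id for each document (both hypothetical names; min_count just skips very rare words):
from collections import Counter

def top_words_per_cluster(texts, labels, n=10, min_count=5):
    # texts: list of token lists; labels: cluster id per document
    corpus_freq = Counter(w for doc in texts for w in doc)
    corpus_total = sum(corpus_freq.values())
    result = {}
    for cluster in set(labels):
        cluster_freq = Counter(w for doc, lab in zip(texts, labels) if lab == cluster for w in doc)
        cluster_total = sum(cluster_freq.values())
        # score = in-cluster relative frequency / full-corpus relative frequency
        scores = {w: (c / cluster_total) / (corpus_freq[w] / corpus_total)
                  for w, c in cluster_freq.items() if corpus_freq[w] >= min_count}
        result[cluster] = sorted(scores, key=scores.get, reverse=True)[:n]
    return result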
Separately, regarding your use of Doc2Vec:
there's no good reason to alias the existing class TaggedDocument to a strange class name like LabeledSentence1. Just use TaggedDocument directly.
if you supply your corpus, all_content_train, to the object-initialization – as your code does – then you don't need to also call train(). Training will have already happened automatically. If you do want more than the default amount of training (epochs=5), just supply a larger epochs value to the initialization.
the learning-rate values you've supplied to train() – start_alpha=0.002, end_alpha=-0.016 – are nonsensical & destructive. Few users should need to tinker with these alpha values at all, but especially, they should never increase from the beginning to end of a training cycle, as these values do.
If you were running with logging enabled at the INFO level, and/or watching the output closely, you would likely see readouts and warnings indicating that excessive training was happening, or problematic values used.
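Putting those points together, a minimal sketch of the simplified setup (parameter values carried over from your code; train['KARMA'] is assumed to hold tokenized texts):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

documents = [TaggedDocument(words=em, tags=[j]) for j, em in enumerate(train['KARMA'].values)]

# training happens during this initialization; no separate train() call,
# and no manual start_alpha/end_alpha tweaking
d2v_model = Doc2Vec(documents, vector_size=100, window=10, min_count=500,
                    workers=7, dm=1, epochs=10)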
I have a dataset of 6000 observations; a sample of it is the following:
job_id job_title job_sector
30018141 Secondary Teaching Assistant Education
30006499 Legal Sales Assistant / Executive Sales
28661197 Private Client Practitioner Legal
28585608 Senior hydropower mechanical project manager Engineering
28583146 Warehouse Stock Checker - Temp / Immediate Start Transport & Logistics
28542478 Security Architect Contract IT & Telecoms
The goal is to predict the job sector of each row based on the job title.
Firstly, I apply some preprocessing on the job_title column:
import re
from nltk.stem import WordNetLemmatizer, PorterStemmer, LancasterStemmer, SnowballStemmer

def preprocess(document):
    lemmatizer = WordNetLemmatizer()
    stemmer_1 = PorterStemmer()
    stemmer_2 = LancasterStemmer()
    stemmer_3 = SnowballStemmer(language='english')

    # Remove all the special characters
    document = re.sub(r'\W', ' ', document)

    # Remove all single characters
    document = re.sub(r'\b[a-zA-Z]\b', ' ', document)

    # Substitute multiple spaces with a single space
    document = re.sub(r' +', ' ', document, flags=re.I)

    # Convert to lowercase
    document = document.lower()

    # Tokenisation
    document = document.split()

    # Stemming
    document = [stemmer_3.stem(word) for word in document]
    document = ' '.join(document)

    return document
df_first = pd.read_csv('../data.csv', keep_default_na=True)
for index, row in df_first.iterrows():
    df_first.loc[index, 'job_title'] = preprocess(row['job_title'])
Then I do the following with Gensim and Doc2Vec:
X = df_first.loc[:, 'job_title'].values
y = df_first.loc[:, 'job_sector'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
tagged_train = TaggedDocument(words=X_train.tolist(), tags=y_train.tolist())
tagged_train = list(tagged_train)
tagged_test = TaggedDocument(words=X_test.tolist(), tags=y_test.tolist())
tagged_test = list(tagged_test)
model = Doc2Vec(vector_size=5, min_count=2, epochs=30)
training_set = [TaggedDocument(sentence, tag) for sentence, tag in zip(X_train.tolist(), y_train.tolist())]
model.build_vocab(training_set)
model.train(training_set, total_examples=model.corpus_count, epochs=model.epochs)
test_set = [TaggedDocument(sentence, tag) for sentence, tag in zip(X_test.tolist(), y_test.tolist())]
predictors_train = []
for sentence in X_train.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.01)
    predictors_train.append(predictor.tolist())
predictors_test = []
for sentence in X_test.tolist():
    sentence = sentence.split()
    predictor = model.infer_vector(doc_words=sentence, steps=20, alpha=0.025)
    predictors_test.append(predictor.tolist())
sv_classifier = SVC(kernel='linear', class_weight='balanced', decision_function_shape='ovr', random_state=0)
sv_classifier.fit(predictors_train, y_train)
score = sv_classifier.score(predictors_test, y_test)
print('accuracy: {}%'.format(round(score*100, 1)))
However, the result which I am getting is 22% accuracy.
This makes me quite suspicious, especially because when I use the TfidfVectorizer instead of Doc2Vec (both with the same classifier) I get 88% accuracy (!).
Therefore, I guess that I must be doing something wrong in how I apply the Doc2Vec of Gensim.
What is it and how can I fix it?
Or is it simply that my dataset is relatively small, while more advanced methods such as word embeddings require way more data?
You don't mention the size of your dataset - in rows, total words, unique words, or unique classes. Doc2Vec works best with lots of data. Most published work trains on tens-of-thousands to millions of documents, of dozens to thousands of words each. (Your data appears to only have 3-5 words per document.)
Also, published work tends to train on data where every document has a unique-ID. It can sometimes make sense to use known-labels as tags instead of, or in addition to, unique-IDs. But it isn't necessarily a better approach. By using known-labels as the only tags, you're effectively only training one doc-vector per label. (It's essentially similar to concatenating all rows with the same tag into one document.)
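For illustration only, a minimal hedged sketch of tagging each row with a unique ID plus its known label as an extra tag (variable names taken from your code; i, title and label are just loop variables introduced here):
from gensim.models.doc2vec import TaggedDocument

# one TaggedDocument per row: a unique ID tag, plus the known label as a second tag
training_set = [TaggedDocument(words=title.split(), tags=[str(i), label])
                for i, (title, label) in enumerate(zip(X_train.tolist(), y_train.tolist()))]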
You're inexplicably using fewer steps in inference than epochs in training, when in fact these are analogous values. In recent versions of gensim, inference will by default use the same number of epochs as the model was configured to use for training, and it's more common to use more epochs during inference than during training. (Also, you're inexplicably using different starting alpha values when inferring the vectors for classifier-training and for classifier-testing.)
But the main problem is likely your choice of tiny size=5 doc vectors. Instead of the TfidfVectorizer, which will summarize each row as a vector of width equal to the unique-word count – perhaps hundreds or thousands of dimensions – your Doc2Vec model summarizes each document as just 5 values. You've essentially lobotomized Doc2Vec. Usual values here are 100-1000 – though if the dataset is tiny smaller sizes may be required.
Finally, the lemmatization/stemming may not be strictly necessary and may even be destructive. Lots of Word2Vec/Doc2Vec work doesn't bother to lemmatize/stem - often because there's plentiful data, with many appearances of all word forms.
These steps are most likely to help with smaller data, by making sure rarer word forms are combined with related longer forms to still get value from words that would otherwise be too rare to be retained (or get useful vectors).
But I can see many ways they might hurt for your domain. Manager and Management won't have exactly the same implications in this context, but could both be stemmed to manag. Similarly, Security and Securities both become secur, and so on for other words. I'd only perform these steps if you can prove through evaluation that they're helping. (Are the words passed to the TfidfVectorizer being lemmatized/stemmed?)
Usually, training doc2vec/word2vec requires lots of generalised data (word2vec was trained on 3 million Wikipedia articles). Since it's performing poorly with doc2vec, consider experimenting with a pre-trained doc2vec model, refer this.
Or you can try using word2vec and averaging the word vectors over the entire document, since word2vec gives a vector for each word; a sketch of that follows below.
Let me know how this helps?
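A minimal sketch of the averaging idea, assuming a trained gensim Word2Vec model named w2v and a tokenized document tokens (both hypothetical names):
import numpy as np

def doc_vector(tokens, w2v):
    # average the vectors of the in-vocabulary words; zeros if none are known
    vecs = [w2v.wv[w] for w in tokens if w in w2v.wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)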
The tools you are using are not suitable for classification. I'd suggest you look into something like a char-rnn.
https://pytorch.org/tutorials/intermediate/char_rnn_classification_tutorial.html
This tutorial works on a similar problem, where it classifies names.