nodevectors not returning all nodes - python

I'm trying to use nodevector's Node2Vec class to get an embedding for my graph. I can't show the entire code, but basically this is what I'm doing:
import networkx as nx
import pandas as pd
import nodevectors
n2v = nodevectors.Node2Vec(n_components=128,
                           walklen=80,
                           epochs=3,
                           return_weight=1,
                           neighbor_weight=1,
                           threads=4)
G = nx.from_pandas_edgelist(df, 'customer', 'item', edge_attr='weight', create_using=nx.Graph)
n2v.fit(G)
model = n2v.model
shape = model.wv.vectors.shape
I know G contains all the nodes in my scope. But after fitting the model, model.wv.vectors has fewer rows than my number of nodes.
I can't figure out why the number of nodes represented in the embedding (the rows of model.wv.vectors) is lower than the actual number of nodes in G.
Does anyone know why this happens?

TL;DR: Your non-default epochs=3 can result in some nodes appearing only 3 times – but the inner Word2Vec model by default ignores tokens appearing fewer than 5 times. Upping to epochs=5 may be a quick fix - but read on for the reasons & tradeoffs with various defaults.
--
If you're using the nodevectors package described here, it seems to be built on Gensim's Word2Vec – which uses a default min_count=5.
That means any tokens – in this case, nodes – which appear fewer than 5 times are ignored. Especially in the natural-language contexts where Word2Vec was pioneered, discarding such rare words entirely usually has multiple benefits:
from only a few idiosyncratic examples, such rare words themselves get peculiar vectors less-likely to generalize to downstream uses (other texts)
compared to other frequent words, each gets very little training effort overall, & thus provides only a little pushback on shared model weights (based on their peculiar examples) - so the vectors are weaker & retain more arbitrary influence from random-initialization & relative positioning in the corpus. (More-frequent words provide more varied, numerous examples to extract their unique meaning.)
because of the Zipfian distribution of word-frequencies in natural language, there are a lot of such low-frequency words – often even typos – and altogether they take up a lot of the model's memory & training-time. But they don't individually get very good vectors, or have generalizable beneficial influences on the shared model. So they wind up serving a lot like noise that weakens other vectors for more-frequent words, as well.
So typically in Word2Vec, discarding rare words only gives up low-value vectors while simultaneously speeding training, shrinking memory requirements, & improving the quality of the remaining vectors: a big win.
Although the distribution of node-names in graph random-walks may be very different from natural-language word-frequencies, some of the same concerns still apply for nodes that appear rarely. On the other hand, if a node truly only appears at the end of a long chain of nodes, every walk to or from it will include the exact same neighbors - and maybe extra appearances in more walks would add no new variety-of-information (at least within the inner Word2Vec window of analysis).
You may be able to confirm if the default min_count is your issue by using the Node2Vec keep_walks parameter to store the generated walks, then checking: are exactly the nodes that are 'missing' appearing fewer than min_count times in the walks?
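A minimal sketch of that check, assuming keep_walks=True stores the generated walks on the fitted object as n2v.walks (the attribute name and walk format may vary by nodevectors version; walk entries may be stringified node labels):
from collections import Counter
import nodevectors

n2v = nodevectors.Node2Vec(n_components=128, walklen=80, epochs=3,
                           return_weight=1, neighbor_weight=1,
                           threads=4, keep_walks=True)
n2v.fit(G)

# Count how often each node appears across all generated walks
counts = Counter(node for walk in n2v.walks for node in walk)

MIN_COUNT = 5  # Gensim Word2Vec's default min_count
missing = [n for n in G.nodes()
           if counts.get(str(n), counts.get(n, 0)) < MIN_COUNT]
print(f"{len(missing)} of {G.number_of_nodes()} nodes appear fewer than {MIN_COUNT} times")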
If so, a few options may be:
override min_count using the Node2Vec w2vparams option to something like min_count=1 (see the sketch after this list). As noted above, this is always a bad idea in traditional natural-language Word2Vec - but maybe it's not so bad in a graph application, where for rare/outer-edge nodes one walk is enough, and then at least you have whatever strange/noisy vector results from that minimal training.
try to influence the walks to ensure all nodes appear enough times. I suppose some values of the Node2Vec walklen, return_weight, & neighbor_weight could improve coverage - but I don't think they could guarantee all nodes appear in at least N (say, 5, to match the default min_count) different walks. But it looks like the Node2Vec epochs parameter controls how many times every node is used as a starting point – so epochs=5 would guarantee every node appears at least 5 times, as the start of 5 separate walks. (Notably: the Node2Vec default is epochs=20 - which would never trigger a bad interaction with the internal Word2Vec min_count=5. But setting your non-default epochs=3 risks leaving some nodes with only 3 appearances.)
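For example, a minimal sketch combining both fixes (note that the exact w2vparams defaults differ by nodevectors version, and passing this dict may replace rather than merge with them, so include any defaults you still want):
n2v = nodevectors.Node2Vec(
    n_components=128,
    walklen=80,
    epochs=5,                    # every node now starts at least 5 walks
    return_weight=1,
    neighbor_weight=1,
    threads=4,
    w2vparams={"window": 10,     # illustrative values
               "negative": 5,
               "min_count": 1})  # keep even rarely-visited nodes
n2v.fit(G)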

Related

Can doc2vec training results change with the same input data and the same parameters?

I'm using Doc2Vec from the gensim library to find similar movies, using a movie's name as input.
model = doc2vec.Doc2Vec(vector_size=100, alpha=0.025, min_alpha=0.025, window=5)
model.build_vocab(tagged_corpus_list)
model.train(tagged_corpus_list, total_examples=model.corpus_count, epochs=50)
I set the parameters like this, and didn't change the preprocessing of the input data or the original data itself.
similar_doc = model.dv.most_similar(input)
I also used this code to find the most similar movie.
When I re-ran the code to train this model, the most similar movie changed, along with its score.
Is this possible? Why? If so, how can I make the training result stable?
Yes, this sort of change from run to run is normal. It's well-explained in question 11 of the Gensim FAQ:
Q11: I've trained my Word2Vec / Doc2Vec / etc model repeatedly using the exact same text corpus, but the vectors are different each time. Is there a bug or have I made a mistake? (*2vec training non-determinism)
Answer: The *2vec models (word2vec, fasttext, doc2vec…) begin with random initialization, then most modes use additional randomization during training. (For example, the training windows are randomly truncated as an efficient way of weighting nearer words higher. The negative examples in the default negative-sampling mode are chosen randomly. And the downsampling of highly-frequent words, as controlled by the sample parameter, is driven by random choices. These behaviors were all defined in the original Word2Vec paper's algorithm description.)
Even when all this randomness comes from a pseudorandom-number-generator that's been seeded to give a reproducible stream of random numbers (which gensim does by default), the usual case of multi-threaded training can further change the exact training-order of text examples, and thus the final model state. (Further, in Python 3.x, the hashing of strings is randomized each re-launch of the Python interpreter - changing the iteration ordering of vocabulary dicts from run to run, and thus making even the same string-of-random-number-draws pick different words in different launches.)
So, it is to be expected that models vary from run to run, even trained on the same data. There's no single "right place" for any word-vector or doc-vector to wind up: just positions that are at progressively more-useful distances & directions from other vectors co-trained inside the same model. (In general, only vectors that were trained together in an interleaved session of contrasting uses become comparable in their coordinates.)
Suitable training parameters should yield models that are roughly as useful, from run-to-run, as each other. Testing and evaluation processes should be tolerant of any shifts in vector positions, and of small "jitter" in the overall utility of models, that arises from the inherent algorithm randomness. (If the observed quality from run-to-run varies a lot, there may be other problems: too little data, poorly-tuned parameters, or errors/weaknesses in the evaluation method.)
You can try to force determinism, by using workers=1 to limit training to a single thread – and, if in Python 3.x, using the PYTHONHASHSEED environment variable to disable its usual string hash randomization. But training will be much slower than with more threads. And, you'd be obscuring the inherent randomness/approximateness of the underlying algorithms, in a way that might make results more fragile and dependent on the luck of a particular setup. It's better to tolerate a little jitter, and use excessive jitter as an indicator of problems elsewhere in the data or model setup – rather than impose a superficial determinism.
If the change between runs is small – nearest neighbors mostly the same, with a few in different positions – it's best to tolerate it.
If the change is big, there's likely some other problem, like insufficient training data or poorly-chosen parameters.
Notably, min_alpha=0.025 isn't a sensible value - the training is supposed to use a gradually-decreasing value, and the usual default (min_alpha=0.0001) usually doesn't need changing. (If you copied this from an online example: that's a bad example! Don't trust that site unless it explains why it's doing an odd thing.)
Increasing the number of training epochs, from the default epochs=5 to something like 10 or 20 may also help make run-to-run results more consistent, especially if you don't have plentiful training data.
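For example, a sketch of the adjusted setup (values are illustrative, reusing your tagged_corpus_list):
from gensim.models import doc2vec

model = doc2vec.Doc2Vec(
    vector_size=100,
    alpha=0.025,
    min_alpha=0.0001,   # let the learning-rate decay to the usual tiny default
    window=5,
    epochs=20)          # more passes than the default 5, for steadier results
model.build_vocab(tagged_corpus_list)
model.train(tagged_corpus_list, total_examples=model.corpus_count, epochs=model.epochs)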

How to work with a range of n-grams on Facebook FastText?

I am trying to train a text classifier with FastText. It has a bunch of options along with a facility to train from the command line. One of the options is wordNgrams.
In my particular dataset, I discovered that many irrelevant queries are being classified with high confidence because they share similar tokens. So my plan was to ignore unigram tokens and start from bigrams. Currently I go from 1-grams up to 5-grams by setting wordNgrams = 5, but my plan is to go from 2-grams to 5-grams only. It seems that FastText doesn't support this. Is there any way to achieve it? This is required to minimize these false positives.
As far as I can tell, even though Facebook's fasttext lets users set a range for character-n-grams (subword information), using -minn & -maxn, it only offers a single -wordNgrams parameter setting the maximum length of word-multigrams.
However, it is also the case that the -supervised mode combines all the given tokens in an order-oblivious way. Thus, you could in your own preprocessing create whatever mix of n-grams (or other token-represented features) you'd like, then pass those to fasttext (which it would consider as all unigrams). As long as you apply the same preprocessing in training as in later classification, the effect should be the same.
(You could even use the sklearn CountVectorizer's preprocessing.)
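For example, a rough sketch of such preprocessing (plain Python, not any fastText option; the label/text format shown is just illustrative):
def to_ngram_tokens(tokens, n_min=2, n_max=5):
    # Replace raw tokens with their 2-gram..5-gram combinations, joined with '_'
    # so that fastText treats each n-gram as a single 'unigram' token.
    return ["_".join(tokens[i:i + n])
            for n in range(n_min, n_max + 1)
            for i in range(len(tokens) - n + 1)]

line = "__label__billing how do i update my card"
label, *tokens = line.split()
print(label, " ".join(to_ngram_tokens(tokens)))
# Apply the exact same transformation to any text you later classify.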
Keep in mind the warning from ~Erwan in a comment, though: adding so many distinct features increases the risk of overfitting, which could show up as your stated problem "many irrelevant queries are being classified with high confidence because they share similar tokens". (The model, made large by the inclusion of so many n-grams, has memorized idiosyncratic minutia from the training-data. That then leads it astray applying non-generalizable inferences to out-of-training data.)

very different values from normed_vector cosine similarity and most_similar

I have a certain Doc2Vec model built on website data. I am trying to use the embeddings to find websites that are most similar to each other. To do so, I am doing a cosine similarity over the matrix of document vectors. I am also comparing this to the output of most_similar().
The problem: they provide substantively different matches (not just slightly different ones).
To make this concrete, for a firm at index value 791, with its text in the variable text, I compare
text = self.website_info.iloc[791].text
tokens = text.split()
vec = self.word2vec_model.infer_vector(tokens,negative=0)
most_similar = self.word2vec_model.docvecs.most_similar([vec])
to
self.word2vec_model.init_sims()
mat = self.word2vec_model.docvecs.get_normed_vectors()
w2v_sim = np.dot(mat, mat.T)
sims = pd.DataFrame(pd.Series(w2v_sim[791]))
sims.rename(columns={0:'sim'}, inplace = True)
sims.sort_values(by='sim',ascending=False,inplace=True)
most_similar = sims.head(20)
I also see that the real and inferred embedding vectors are substantively different - not just in normalization or magnitude, but with big differences in the signs of the components.
There's a bunch in your code that doesn't quite make sense.
If you're using Gensim-4.0 or higher – where .get_normed_vectors() exists – there's never a need to call .init_sims(). (In fact, it should be showing a deprecation warning.)
Only a Doc2Vec model supports .infer_vector() – so it's odd to name your variable word2vec_model.
The .infer_vector() method doesn't take a negative=0 argument - so that code would generate an error in the Gensim library that you otherwise appear to be using. (And, if it did somehow take a negative argument changing the inference to use 0 negative-examples, that'd break inference - which should use the same negative value as during model training.)
I'm also not sure about your alternate calculation - in particular, driving it through Pandas instead of native NumPy operations seems unnecessary, and an all-to-all comparison would often be very expensive in any model with a sufficiently large number of documents.
But also: the inferred vector will essentially never be identical to the vector for the same text during training. It'll just be 'close', if everything about the model is working well, with sufficient training data & parameters (especially enough epochs & not too many vector_size dimensions). (See this FAQ item for more details on how there's an inherent 'jitter' between runs/inferences.)
So I'd suggest:
1st, check how similar the inferred-vector for your item #791 is to the vector created by bulk-training (see the sketch after this list). You could compare them directly to each other, or compare the list of top-N .most_similar() results for each. If they're very different, there may be other problems with the model training (data/parameters) that make the model underpowered. (In some cases, more epochs or fewer vector_size dimensions can help a little to make a model more consistent, run-to-run, but if your data is thin there will be limits to how well Doc2Vec can work.)
Check that your alternate calculation of the nearest-neighbors exactly matches what's returned by .most_similar(), when using the exact same (not inferred) origin-vector. If it doesn't, that'd be a separate issue from any looseness/variance between the vector from bulk-training and that from later re-inference.
Try to evaluate the actual quality of the .most_similar() results - either by ad hoc eyeballing, or some sort of rigorous domain-expert golden-standard of which docs 'should' be judged alike. The calculation by the .most_similar() method is a typical approach, and usually what people want - so knowing if that's helpful for your data/model/goals may be more interesting than whether you can match it with a separate external calculation.
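For example, a minimal sketch of the first two checks (Gensim 4.x API assumed; d2v_model stands in for your trained Doc2Vec model, and 791 for your document's tag):
import numpy as np
from gensim import matutils

trained_vec = d2v_model.dv[791]                # vector learned during bulk training
inferred_vec = d2v_model.infer_vector(tokens)  # re-inferred from the same text

cos = np.dot(matutils.unitvec(trained_vec), matutils.unitvec(inferred_vec))
print(f"trained-vs-inferred similarity for doc 791: {cos:.3f}")

# Neighbors of the *trained* vector: a manual cosine ranking over
# d2v_model.dv.get_normed_vectors() should reproduce this list exactly.
print(d2v_model.dv.most_similar([trained_vec], topn=10))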
If you're still having problems, be sure in any followup comments, question edits, or new questions to say a bit more about:
the size of your training set, in documents/words-per-doc/unique-vocabulary;
the model-parameters you've chosen; and…
the code process you used to train the model (to be sure it doesn't mimic some serious errors common in poor-quality online guides).
Those can help determine if something else more foundational is wrong/weak with your model.

How to calculate the distance between 2 node2vec models

I have 2 node2vec models from different timestamps. I want to calculate the distance between the 2 models. The two models have the same vocab, and we update the models.
My models look like this:
model1:
"1": 0.1, 0.5, ...
"2": 0.3, -0.4, ...
"3": 0.2, 0.5, ...
...
model2:
"1": 0.15, 0.54, ...
"2": 0.24, -0.35, ...
"3": 0.24, 0.47, ...
...
Assuming you've used a standard word2vec library to train your models, each run bootstraps a wholly-separate model whose coordinates are not necessarily comparable to any other model.
(Due to some inherent randomness in the algorithm, or in the multi-threaded handling of training input, even running two training sessions on the exact same data will result in different models. They should each be about as useful for downstream applications, but individual tokens could be in arbitrarily-different positions.)
That said, you could try to synthesize some measures of how much two models are different. For example, you might:
Pick a bunch of random (or domain-significant) word-pairs. Check the similarity between each pair, in each model individually, then compare those values between models. (That is, compare model1.similarity(token_a, token_b) with model2.similarity(token_a, token_b).) Consider the difference between the models to be some weighted combination of all the tested similarity-differences. (A rough sketch follows below.)
For some significant set of relevant tokens, collect the top-N most-similar tokens in each model. Compare these lists via some sort of rank-correlation measure, to see how much one model has changed the 'neighborhoods' of each token.
For each of these, I'd suggest verifying their operation against a baseline case of the exact-same training data that's been shuffled and/or trained with a different starting random seed. Do they show such models as being "nearly equivalent"? If not, you'd need to adjust the training parameters or synthetic measure until it does have the expected result - that models from the same data are judged as alike, even though tokens have very different coordinates.
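For instance, a rough sketch of the first (pairwise-similarity) measure, assuming Gensim-style KeyedVectors under model1.wv / model2.wv and placeholder token pairs:
import numpy as np

pairs = [("1", "2"), ("1", "3"), ("2", "3")]  # random or domain-significant pairs

diffs = [abs(model1.wv.similarity(a, b) - model2.wv.similarity(a, b))
         for a, b in pairs]
print(f"mean per-pair similarity shift: {np.mean(diffs):.3f}")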
Another option might be to train one giant combined model from a synthetic corpus where:
all the original unmodified 'texts' from both eras appear once
texts from each separate era appear again, but with some random proportion of their tokens modified with an era-specific suffix. (For example, 'foo' sometimes becomes 'foo_1' in first-era texts, and sometimes becomes 'foo_2' in second-era texts.) (You don't want to convert all tokens in any one text to era-specific tokens, because only tokens that co-appear with each other influence each other, and you thus want tokens from either era to sometimes appear with common/shared variants, but also often appear with era-specific variants.)
At the end, the original token 'foo' will get three vectors: 'foo', 'foo_1', and 'foo_2'. They should all be quite similar, but the era-specific variants will be relatively more-influenced by the era-specific contexts. Thus the differences between those three (and relative movement in the now common coordinate space) will be an indication of the magnitude and kinds of changes that happened between the two eras' data.
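A rough sketch of building such a combined corpus (era1_texts / era2_texts are placeholders for your two collections of token-lists; the 50% tagging proportion is arbitrary):
import random

def era_variant(tokens, suffix, p=0.5):
    # Tag a random proportion of tokens with an era-specific suffix, so shared
    # and era-specific variants still co-occur and stay comparable.
    return [t + suffix if random.random() < p else t for t in tokens]

combined = list(era1_texts) + list(era2_texts)          # unmodified texts, once each
combined += [era_variant(t, "_1") for t in era1_texts]  # first-era variants
combined += [era_variant(t, "_2") for t in era2_texts]  # second-era variants
# Train one model on `combined`, then compare 'foo', 'foo_1' & 'foo_2'
# within the now-shared coordinate space.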

Gensim: quality of word2vec model seems not to correlate with num of iterations in training

I'm playing with gensim's word2vec and am trying to build a model using the terms from a large medical thesaurus as sentences. There are about 1 million terms (most of them multiword terms, which I treat as sentences), and the hope is that if word2vec sees terms like "breast cancer" and "breast tumor" etc. it will be able to conclude that "cancer" and "tumor" are somewhat similar.
I run experiments in which I track how similar such terms are when using different numbers of iterations, but it seems that the results don't correlate. I'd expect that when considering word pairs like (wound, lesion), (thorax, lung), (cancer, tumor) etc., going from 5 to 100 iterations there'd be a tendency (even if small) for one word in the pair to become "more similar" to the other as the number of iterations grows. But no, results appear pretty random or even get worse.
Specifically: I loop with 1,5,10,20,50,100 iterations and train a w2v model and then for my word pairs above check the rank of the 2nd word in the list (say "lung") of similar words (as returned by w2v) for the first word (say "thorax"), then sum up and build the average. And the average rank is growing (!) not decreasing, meaning as training proceeds, the vectors for "lung" and "thorax" move further and further away from each other.
I didn't expect gensim to detect the clean synonyms and also perhaps 'only' 1 million terms (sentences) is not enough, but still I am puzzled by this effect.
Does anyone have a suspicion?
====================================================
Added after comments and feedback came in:
Thanks for the detailed feedback, gojomo. I had checked many of these issues before:
yes, the thesaurus terms ("sentences") come in the right format, e.g. ['breast', 'cancer']
yes, of the ~1 million terms more than 850,000 are multiword. It's clear that 1-word terms won't provide any context, but there should be ample evidence from the multiword terms
the examples I gave ('clinic', 'cancer', 'lung', ...) occur in many hundreds of terms, often many thousands. This is what I find odd: that not even for words this frequent are really good similar words suggested.
you ask for the code: Here it is https://www.dropbox.com/s/fo3fazl6frj99ut/w2vexperiment.py?dl=0 It expects to be called (python3) with the name of the model and then the SKOS-XML files of a large thesaurus like Snomed
python w2vexperiment.py snomed-w2v.model SKOS/*.skos
In the code you can see that I create a new model with each new experiment (with a different number of iterations), so there should be no effect of one run polluting another (wrong learning rate etc...)
I have set min_count to 10
Still: the models don't get better but often worse as number of iterations grows. And even the better ones (5 or 10 iterations) give me strange results for my test words...
I suspect there's something wrong with your corpus prep, or training – usually word2vec can rank such similarities well.
Are you supplying the terms alone (eg ['breast', 'tumor'] or ['prophylaxis'] as very tiny sentences), or the terms plus definitions/synonyms as somewhat longer sentences?
The latter would be better.
If the former, then 1-word 'sentences' achieve nothing: there's no neighboring 'context' for word2vec to learn anything, and they're essentially skipped.
And mere 2-word sentences might get some effect, but don't necessarily provide the kind of diverse contexts helpful for training to induce the useful vector arrangements.
Also if it's 1-million 'sentences' of just 1-4 words each, it's kind of a small dataset, and individual words might not be appearing often enough, in sufficiently slightly-varied contexts, for them to get good vectors. You should check the words/tokens of interest, in the model.wv.vocab dict, for a count value that indicates there were enough examples to induce a good vector - ideally 10+ occurrences each (and more is better).
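For example, a quick sketch of that frequency check (the .wv.vocab dict exists in Gensim 3.x; in Gensim 4+ use model.wv.get_vecattr(word, "count") instead):
for word in ["cancer", "tumor", "thorax", "lung", "wound", "lesion"]:
    if word in model.wv.vocab:
        print(word, model.wv.vocab[word].count)   # number of training occurrences
    else:
        print(word, "not in the surviving vocabulary")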
So: more data, and more diverse usages from the relevant domain, are always a good idea. A thesaurus with synonyms in each 'sentence', that are many words (5 to dozens), might be enough.
You don't show your code or training-parameters, but people tweaking the defaults, or following outdated online examples, can often sabotage the algorithm's effectiveness.
For example, it's distressingly common to see people who call train() multiple times, in their own iteration loop, to mismanage the learning-rate alpha such that some iterations run with a negative alpha – meaning every backpropagation serves to drive the context-vectors towards lower target-word predictiveness, the exact opposite of what should be happening. (It's best to either supply the corpus & iter on Word2Vec initialization, or call train() just once. Only advanced tinkerers should need to call train() multiple times.)
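For example, a sketch of the recommended single-call pattern (Gensim 4.x parameter names - older versions use iter/size instead of epochs/vector_size; values are illustrative):
from gensim.models import Word2Vec

model = Word2Vec(
    sentences=corpus,   # an iterable of token-lists, e.g. ['breast', 'cancer']
    vector_size=100,
    min_count=10,
    epochs=20,          # Word2Vec manages the alpha decay across all epochs itself
    workers=4)
# No further train() calls are needed; the model is fully trained here.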
Similarly, while naive intuition is often "keeping more words/info must be better", and thus people lower min_count to 1 or 0, such low-frequency words can't get good vectors with just 1 (or a few) occurrences, but since they are very numerous (in total), they can interfere with other words' meaningful training. (The surviving, more-frequent words get better vectors when low-frequency words are discarded.)
Good luck!
