Gensim Doc2Vec model only generates a limited number of vectors - python

I am using gensim Doc2Vec model to generate my feature vectors. Here is the code I am using (I have explained what my problem is in the code):
import multiprocessing
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

cores = multiprocessing.cpu_count()
# creating a list of tagged documents
training_docs = []
# all_docs: a list of 53 strings which are my documents and are very long (not just a couple of sentences)
for index, doc in enumerate(all_docs):
# 'doc' is in unicode format and I have already preprocessed it
training_docs.append(TaggedDocument(doc.split(), str(index+1)))
# at this point, I have 53 strings in my 'training_docs' list
model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)
# now that I print the vectors, I only have 10 vectors while I should have 53 vectors for the 53 documents that I have in my training_docs list.
print(len(model.docvecs))
# output: 10
I am just wondering if I am making a mistake, or if there is another parameter that I should set?
UPDATE: I was playing with the tags parameter in TaggedDocument, and when I changed it to a mixture of text and numbers like Doc1, Doc2, ..., I saw a different count of generated vectors, but I still do not have the expected number of feature vectors.

Look at the actual tags it has discovered in your corpus:
print(model.docvecs.offset2doctag)
Do you see a pattern?
The tags property of each document should be a list of tags, not a single tag. If you supply a simple string-of-an-integer, it will see it as a list-of-digits, and thus only learn the tags '0', '1', ..., '9'.
You could replace str(index+1) with [str(index+1)] and get the behavior you were expecting.
But, since your document IDs are just ascending integers, you can also just use plain Python ints as your doctags. This will save some memory, by avoiding the creation of a lookup dict from string-tag to array-slot (int). To do this, replace the str(index+1) with [index]. (This starts the doc-IDs from 0 – which is a teensy bit more Pythonic, and also avoids wasting an unused 0 position in the raw array that holds the trained vectors.)
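For illustration, here is a minimal sketch of both options, reusing all_docs and cores from the question (note that in gensim 4+ the size parameter is named vector_size and model.docvecs is model.dv):
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Option 1: keep string tags, but wrap each one in a one-element list
training_docs = [TaggedDocument(doc.split(), [str(index + 1)])
                 for index, doc in enumerate(all_docs)]

# Option 2: plain int doctags starting at 0 (avoids the string-tag -> array-slot lookup dict)
training_docs = [TaggedDocument(doc.split(), [index])
                 for index, doc in enumerate(all_docs)]

model = Doc2Vec(training_docs, size=400, window=8, min_count=1, workers=cores)
print(len(model.docvecs))  # should now report 53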

Gensim- KeyError: 'word not in vocabulary'

I am trying to achieve something similar to the product-similarity calculation used in this example: how-to-build-recommendation-system-word2vec-python/
I have a dictionary where the key is the item_id and the value is the product associated with it. For example: dict_items([('100018', ['GRAVY MIX PEPPER']), ('100025', ['SNACK CHEEZIT WHOLEGRAIN']), ('100040', ['CAULIFLOWER CELLO 6 CT.']), ('100042', ['STRIP FRUIT FLY ELIMINATOR'])....)
The data structure is the same as in the example (as far as I know). However, I am getting KeyError: "word '100018' not in vocabulary" when calling the similarity function on the model using the key present in the dictionary.
from gensim.models import Word2Vec

# train word2vec model
model = Word2Vec(window=10, sg=1, hs=0,
                 negative=10,  # for negative sampling
                 alpha=0.03, min_alpha=0.0007,
                 seed=14)
model.build_vocab(purchases_train, progress_per=200)
model.train(purchases_train, total_examples=model.corpus_count,
            epochs=10, report_delay=1)
def similar_products(v, n=6):  # similarity function
    # extract most similar products for the input vector
    ms = model.similar_by_vector(v, topn=n+1)[1:]
    # extract name and similarity score of the similar products
    new_ms = []
    for j in ms:
        pair = (products_dict[j[0]][0], j[1])
        new_ms.append(pair)
    return new_ms
I am calling the function using:
similar_products(model['100018'])
Note: I was able to run the example code with the very similar data structure input which was also a dictionary. Can someone tell me what I am missing here?
If you get a KeyError telling you a word isn't in your model, then the word genuinely isn't in the model.
If you've trained the model yourself, and expected the word to be in the resulting model, but it isn't, something went wrong with training.
You should look at the corpus (purchases_train in your code) to make sure each item is of the form the model expects: a list of words. You should enable logging during training, and watch the output to confirm the expected amount of word-discovery and training is happening. You can also look at the exact list-of-words known-to-the-model (in model.wv.key_to_index) to make sure it has all the words you expect.
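For reference, a small sketch of enabling gensim's standard logging (before training) and then inspecting the learned vocabulary, assuming gensim 4.x; in 3.x the vocabulary lives in model.wv.vocab instead:
import logging

# enable INFO-level logging before calling build_vocab()/train()
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

# after training, inspect the words the model actually learned
print(len(model.wv.key_to_index))
print(list(model.wv.key_to_index)[:20])  # a peek at the first 20 known words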
One common gotcha is that, for the best operation of the word2vec algorithm, the Word2Vec class uses a default min_count=5. (Word2vec only works well with multiple varied examples of a word's usage; a word appearing just once, or just a few times, usually won't get a good vector, and further, might make other surrounding words' vectors worse. So the usual best practice is to discard very-rare words.)
Does the (pseudo-)word '100018' appear in your corpus fewer than 5 times? If so, the model will ignore it as a word too rare to get a good vector, or to have any positive influence on other word-vectors.
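As a quick check, a minimal sketch of counting how often that token appears, assuming purchases_train is a list of token lists as the model expects:
from collections import Counter

token_counts = Counter(token for purchase in purchases_train for token in purchase)
# if this count is below min_count (default 5), Word2Vec silently drops the token
print(token_counts['100018'])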
Separately, the site you're copying example code from may not be a quality source. It's changed a bunch of default values for no good reason, such as setting alpha and min_alpha to peculiar non-standard values, with no comment as to why. This is usually a sign that the author has copied someone else's odd choices without understanding them.

Extract large spacy vectors quickly in Python

I am trying to extract some word vectors from a transformer-based model. The steps are:
Run the text through the pipeline using nlp().
Break text into sentences (to be separately classified into binary categories).
Save sentence vectors to a list (to be pickled and used as model inputs once classification has occurred).
The actual model runs quickly enough, but (unlike with the skip-gram vectors) the simple act of extracting the vectors from the spacy.tokens.span.Span object and saving them to a list or array object is very slow.
Here is a reproducible example:
import spacy
import requests
import numpy as np
nlp = spacy.load('en_stsb_roberta_large') # from https://github.com/MartinoMensio/spacy-sentence-bert
# Get some random text
def get_text(
        url="https://raw.githubusercontent.com/erikthiem/pride-and-prejudice/master/starwars4.txt",
        num_chars=10_000):
    response = requests.get(url)
    text = response.content.decode().replace("\n", "")
    text = text[0:num_chars]
    return text
Running the actual nlp() command takes very little time:
# This takes 0.2 seconds to run
text = get_text()
doc = nlp(text)
sents = list(doc.sents) # doc.sents is a generator
However, if I want to save the vectors from the first sentence (36 words) to a list, it is extremely slow:
# Get vectors - takes 5 seconds
example_sent = sents[0]
sent_vectors = [word.vector for word in example_sent]
Obviously this does not scale well.
I do not understand why it is so slow: if you run type(example_sent[0].vector), it returns numpy.ndarray, i.e. not a generator, so the vector has already been computed.
Why does moving it from a span to a list take so long? The arrays are dimension (1024,) for each word, rather than (300,) for skip-gram vectors, which copy almost instantly. That is the only difference as far as I can tell.
I have also tried creating an empty array of the required dimensions (rather than a list), but that makes no difference.
Any ideas?

How to build a content-based recommender system that uses multiple attributes?

I want to build a content-based recommender system in Python that uses multiple attributes to decide whether two items are similar. In my case, the "items" are packages hosted by the C# package manager (example) that have various attributes such as name, description, tags that could help to identify similar packages.
I have a prototype recommender system here that currently uses only a single attribute, the description, to decide whether packages are similar. It computes TF-IDF rankings for the descriptions and prints out the top 10 recommendations based on that:
# Code mostly stolen from http://blog.untrod.com/2016/06/simple-similar-products-recommendation-engine-in-python.html
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

def train(dataframe):
    tfidf = TfidfVectorizer(analyzer='word',
                            ngram_range=(1, 3),
                            min_df=0,
                            stop_words='english')
    tfidf_matrix = tfidf.fit_transform(dataframe['description'])
    cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
    for idx, row in dataframe.iterrows():
        similar_indices = cosine_similarities[idx].argsort()[:-10:-1]
        similar_items = [(dataframe['id'][i], cosine_similarities[idx][i])
                         for i in similar_indices]
        id = row['id']
        similar_items = [it for it in similar_items if it[0] != id]
        # This 'sum' turns a list of tuples into a single tuple:
        # [(1,2), (3,4)] -> (1,2,3,4)
        flattened = sum(similar_items, ())
        try_print("Top 10 recommendations for %s: %s" % (id, flattened))
How can I combine cosine_similarities with other similarity measures (based on same author, similar names, shared tags, etc.) to give more context to my recommendations?
For some context, my work with content-based recommenders has revolved primarily around raw text and categorical data/features. Here's a high-level approach I've taken that has worked out nicely and is pretty simple to implement.
Suppose I have three feature columns that I can potentially use to make recommendations: description, name, and tags. To me, the path of least resistance entails combining these three feature sets in a useful way.
You're off to a good start, using TF-IDF to encode description. So why not treat name and tags in a similar way by creating a feature "corpus" consisting of description, name, and tags? Literally, this would mean concatenating the contents of each of the three columns into one long text column.
Be wise about the concatenation, though, as it's probably to your advantage to preserve which column a given word comes from, in the case of features like name and tags, which are assumed to have much lower cardinality than description. To put it more explicitly: instead of just creating your corpus column like this:
df['corpus'] = (pd.Series(df[['description', 'name', 'tags']]
                          .fillna('')
                          .values.tolist())
                .str.join(' '))
You might try preserving information about where particular data points in name and tags come from. Something like this:
df['name_feature'] = ['name_{}'.format(x) for x in df['name']]
df['tags_feature'] = ['tags_{}'.format(x) for x in df['tags']]
And after you do that, I would take things a step further by considering how the default tokenizer (which you're using above) works in TfidfVectorizer. Suppose you have the name of a given package's author: "Johnny 'Lightning' Thundersmith". If you just concatenate that literal string, the tokenizer will split it up and roll each of "Johnny", "Lightning", and "Thundersmith" into separate features, which could potentially diminish the information added by that row's value for name. I think it's best to try to preserve that information. So I would do something like this to each of your lower-cardinality text columns (e.g. name or tags):
import string

def raw_text_to_feature(s, sep=' ', join_sep='x', to_include=string.ascii_lowercase):
    def filter_word(word):
        return ''.join([c for c in word if c in to_include])
    return join_sep.join([filter_word(word) for word in s.split(sep)])

df['name_feature'] = df['name'].apply(raw_text_to_feature)
The same sort of critical thinking should be applied to tags. If you've got a comma-separated "list" of tags, you'll probably have to parse those individually and figure out the right way to use them.
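For example, a minimal sketch of one way to handle comma-separated tags, turning each tag into a single prefixed token (the column name and prefix here are illustrative, not from the question):
def tags_to_features(tag_string, prefix='tags'):
    # "web, http client" -> "tags_web tags_http_client"
    tags = [t.strip().lower().replace(' ', '_') for t in tag_string.split(',') if t.strip()]
    return ' '.join('{}_{}'.format(prefix, t) for t in tags)

df['tags_feature'] = df['tags'].fillna('').apply(tags_to_features)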
Ultimately, once you've got all of your <x>_feature columns created, then you can create your final "corpus" and plug that into your recommender system as inputs.
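Concretely, that last step might look something like this sketch, using the illustrative column names from above:
from sklearn.feature_extraction.text import TfidfVectorizer

# combine the raw description with the engineered low-cardinality features
df['corpus'] = (df['description'].fillna('') + ' '
                + df['name_feature'].fillna('') + ' '
                + df['tags_feature'].fillna(''))

tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), stop_words='english')
tfidf_matrix = tfidf.fit_transform(df['corpus'])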
This whole system takes some engineering, to be sure, but I've found it's the easiest way to introduce new information from other columns that have different cardinalities.
As I understand your question, there are two ways this can be done:
Combine the other features with tfidf_matrix and then calculate the cosine similarity
Calculate the similarity of other features using other methods and then somehow combine them with the cosine similarity of tfidf_matrix to get a meaningful metric.
I was talking about the first one.
For example, let's say that for your data the tfidf_matrix (for only the 'description' column) has shape (3000, 4000),
where 3000 is the number of rows in the data and 4000 is the number of unique words (the vocabulary) found by the TfidfVectorizer.
Now let's say you do some feature processing on the other columns ('authors', 'id', etc.) and that produces 5 columns. So the shape of that data is (3000, 5).
I was saying to combine the two matrices (combine the columns) so that the new shape of your data is (3000, 4005) and then calculate the cosine_similarity.
See below example:
from scipy import sparse
from sklearn.metrics.pairwise import linear_kernel

# This is your original matrix
tfidf_matrix = tfidf.fit_transform(dataframe['description'])
# These are the other features, e.g. shape (3000, 5)
other_matrix = some_processing_on_other_columns()
combined_matrix = sparse.hstack((tfidf_matrix, other_matrix))
cosine_similarities = linear_kernel(combined_matrix, combined_matrix)
You have a vector for a user $\gamma_u$ and an item $\gamma_i$. The scoring function for your recommendation is the dot product of the two: $f(u, i) = \gamma_u \cdot \gamma_i$.
Right now you said your feature vector has only 1 item, but once you get more, this model will scale for that.
In this case you already engineered your vectors, but typically in recommenders the features are learned through matrix factorization. This is called a latent factor model, whereas you have a hand-crafted model.
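As a tiny illustration of that scoring, here is a sketch with made-up placeholder vectors (whether hand-crafted or learned):
import numpy as np

gamma_u = np.array([0.2, 0.7, 0.1])  # user vector
gamma_i = np.array([0.5, 0.3, 0.9])  # item vector

score = np.dot(gamma_u, gamma_i)  # higher score -> stronger recommendation
print(score)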

Best match for input query from a set of documents

I have 8 documents and I ran TF-IDF on them to get an array. I don't understand how to find out which document is the best match for a given input query.
all_documents = [doc1, doc2, ...., doc7]
from sklearn.feature_extraction.text import TfidfVectorizer

sklearn_tfidf = TfidfVectorizer(norm='l2', min_df=0, use_idf=True, smooth_idf=False, sublinear_tf=True, tokenizer=tokenize)
sklearn_representation = sklearn_tfidf.fit_transform(all_documents).toarray()
Transform the input to tf-idf format using TfidfVectorizer. You can then use a distance metric (cosine, euclidean, manhattan, ...) to calculate the document that is closest to your input.
Each of the documents should use the same vocabulary. I assume that your 8 document vectors have the same length? The sklearn_tfidf object that you created has an attribute vocabulary_ that contains all words that are used in the vectors. Your input query should be reduced to only containing those words.
Example
Document1: dogs are cute
Document2: cats are awful
This leads to a vocabulary of [dogs, cats, are, cute, awful]. A query containing words other than these 5 cannot make use of them. For example, if your query is cute animals, the word animals carries no information, because it cannot be found in any of the documents. The query thus reduces to the following vector: [0,0,0,1,0], since cute is the only word that can be found in the documents.
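Putting that together, a minimal sketch reusing sklearn_tfidf, sklearn_representation, and all_documents from the question:
from sklearn.metrics.pairwise import cosine_similarity

query = "cute animals"
query_vector = sklearn_tfidf.transform([query])  # reuse the fitted vectorizer, do not re-fit

# cosine similarity of the query against every document vector
similarities = cosine_similarity(query_vector, sklearn_representation).flatten()
best_match_index = similarities.argmax()
print(all_documents[best_match_index])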

Classify strings via class using sklearn

I'm having issues with presenting my data in a form which sklearn will accept.
My raw data is a few hundred strings, each classified into one of 5 classes. I have a list of the strings I'd like to classify and a parallel list of their respective classes. I'm using GaussianNB().
Example Data:
For such a large, successful business, I really feel like they need to be
either choosier in their employee selection or teach their employees to
better serve their customers.|||Class:4
Which represents a given "feature" and a classification
Naturally, the strings themselves have to be converted to vectors before they can be used in the classifier. I've attempted to use DictVectorizer to perform this task:
dictionaryTraining = convertListToSentence(data)
vec = DictVectorizer()
print(dictionaryTraining)
vec.fit_transform(dictionaryTraining)
However, in order to do this, I have to attach the actual classification of the data to the dictionary, otherwise I get the error 'str' object has no attribute 'items'. I understand this is because .fit_transform requires features and indices, but I don't fully understand the purpose of the index.
fit_transform(X[, y]) Learn a list of feature name -> indices mappings and transform X.
My question is: how can I take a list of strings and a list of numbers representing their classifications, and provide these to a GaussianNB() classifier, such that I can present it with a similar string in the future and it will estimate that string's class?
Since your input data are raw text and not dictionaries like {"word": number_of_occurrences}, I believe you should go with a CountVectorizer, which will split your input text on whitespace and transform it into the input vectors you need.
A simple example of such a transformation would be:
from sklearn.feature_extraction.text import CountVectorizer

corpus = ['This is the first document.', 'This is the second second document.',
          'And the third one.', 'Is this the first document?']
x = CountVectorizer().fit_transform(corpus)
print(x.todense())  # x holds your features. Here I am only visualizing it
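To connect this back to GaussianNB from the question, here is a minimal sketch with made-up strings and labels (GaussianNB needs a dense array, hence the toarray() calls):
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import GaussianNB

texts = ['great service and friendly staff', 'terrible experience, rude employees']
labels = [4, 1]  # the parallel list of class labels

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(texts).toarray()  # GaussianNB requires dense input

clf = GaussianNB()
clf.fit(X, labels)

# classify a new, unseen string using the same fitted vectorizer
new_X = vectorizer.transform(['friendly and great staff']).toarray()
print(clf.predict(new_X))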
