I am using gensim to train a word2vec model. The problem is that my data is very large (about 10 million documents) so my session is crashing when I try to estimate the model.
Note that I am able to load all the data at once in the RAM in a Pandas dataframe df, which looks like:
text id
long long text 1
another long one 2
... ...
My simple approach is to do the following:
tokens = df['text'].str.split(r'[\s]+')
model = Word2Vec(tokens, min_count = 50)
However, my session crashes when it tries to create the tokens all at once. Is there a better way to proceed in gensim, such as feeding the data line by line?
Thanks!
Iterate over your dataframe row by row, tokenizing just one row at a time. Write each tokenized text to a file in turn, with spaces between the tokens, and a line-end at the end of each text.
You can then use the LineSentence utility class in Gensim to provide a read-from-disk iterable corpus to the Word2Vec model.
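The two steps above can be sketched roughly as follows. This is only a minimal sketch: the helper names `write_corpus` and `train_from_file` and the file name `corpus.txt` are made up for illustration, and the gensim imports are kept inside the training function so that the writing step can run on its own.

```python
import re

def write_corpus(texts, path):
    """Tokenize one text at a time and write it as one space-separated line."""
    count = 0
    with open(path, "w", encoding="utf-8") as out:
        for text in texts:
            # Split on runs of whitespace and drop empty tokens.
            tokens = [t for t in re.split(r"\s+", str(text).strip()) if t]
            out.write(" ".join(tokens) + "\n")
            count += 1
    return count

def train_from_file(path, min_count=50):
    """Train Word2Vec on the file, streamed from disk by LineSentence."""
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence
    return Word2Vec(LineSentence(path), min_count=min_count)
```

With the dataframe from the question, usage would presumably look like `write_corpus(df['text'], 'corpus.txt')` followed by `model = train_from_file('corpus.txt')`. Only one tokenized row is held in memory at a time, instead of a second full copy of the data.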
Related
After doing a lot of research on AI and sentiment analysis, I found two ways to do text analysis.
After the text is pre-processed, we must build a classifier to separate positive from negative texts. My question is which of the following is better:
first way:
100 training records with two fields: the text and
a status field that indicates whether it is positive (1) or negative (0).
second way:
100 training records used to build a bag-of-words vocabulary, which is then used to train and to compare the test records.
If my question is mistaken, please tell me and correct it.
I think you are missing something here. To train a sentiment analysis model, you need training data in which every row has a label (positive or negative) and a raw text. A computer cannot work with raw text directly, so the text has to be represented as numbers. One way to do that is a bag of words (other representations include TF-IDF, word2vec, etc.). So when you train the model on the training data, the program should preprocess the raw text and then build (in this case) a bag-of-words mapping in which each position represents one vocabulary word. The value at that position is the number of times the word occurs in the text (1 or more), or 0 if it does not occur.
Once training has finished, the program produces a model. This model is what you save, so that you don't need to retrain whenever you want to test new data. When you test, yes, you use the bag-of-words mapping built from the training data. If a word in the test dataset never occurred in the training dataset, just treat it as 0.
In short:
when you test, you have to use the bag-of-words mapping built from the training data.
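As an illustration of reusing the training vocabulary at test time, here is a small pure-Python sketch. The function names and the toy sentences are invented for the example, and a real project would more likely use a library vectorizer.

```python
from collections import Counter

def build_vocabulary(train_texts):
    """Map each distinct training word to a fixed column index."""
    vocab = {}
    for text in train_texts:
        for word in text.lower().split():
            if word not in vocab:
                vocab[word] = len(vocab)
    return vocab

def to_bag_of_words(text, vocab):
    """Count occurrences of known words; words outside the vocabulary are ignored."""
    vector = [0] * len(vocab)
    for word, n in Counter(text.lower().split()).items():
        if word in vocab:
            vector[vocab[word]] = n
    return vector

train_texts = ["good good movie", "bad movie"]
vocab = build_vocabulary(train_texts)  # built once, from the training data only
train_vectors = [to_bag_of_words(t, vocab) for t in train_texts]
test_vector = to_bag_of_words("good plot", vocab)  # "plot" was never seen in training
```

The test text is encoded with the same columns as the training texts, and the unseen word "plot" simply contributes nothing.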
I want to train word2vec and fasttext to get vectors for a specific dataset that I have.
What should my model take as input?
My file is like this:
Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec
Agent_9: Sure
Customer_4: When is the least expensive to fly
Now, how should I prepare my data for word2vec to run? Does the word2vec model take inter-sentence similarity into account, i.e. should I avoid preparing the corpus sentence by sentence?
One way would be to first split your document into lines, then split each line into tokens. You end up with a corpus that is a list of lists of tokens, which you can then feed into the gensim word2vec model.
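A rough sketch of that preparation, using the dialogue from the question. Stripping the leading speaker label (`Customer_4:`) and lowercasing are assumptions made for this example, not something word2vec requires, and `dialogue_to_corpus` is a made-up helper name.

```python
import re

def dialogue_to_corpus(raw_text, strip_speaker=True):
    """Split text into lines, then each line into lowercase tokens."""
    corpus = []
    for line in raw_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if strip_speaker:
            # Drop a leading "Name_4:" style label (assumed format).
            line = re.sub(r"^\w+:\s*", "", line)
        tokens = re.findall(r"[a-z0-9']+", line.lower())
        if tokens:
            corpus.append(tokens)
    return corpus

raw = """Customer_4: I want to book a ticket to New York.
Agent_9: Okay, when do you want the tickets for
Customer_4: hmm, wait a sec"""
corpus = dialogue_to_corpus(raw)
```

The resulting `corpus` is a list of token lists, which could then be passed to gensim, for example as `Word2Vec(corpus, min_count=1)`.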
I am new to doc2vec. I was initially trying to understand doc2vec, and my code below uses Gensim. As expected, I get a trained model and document vectors for the two documents.
However, I would like to know the benefits of training the model for several epochs and how to do it in Gensim. Can we do it using the iter or alpha parameter, or do we have to train it in a separate for loop? Please let me know how I should change the following code to train the model for 20 epochs.
Also, I am interested in knowing whether multiple training iterations are needed for a word2vec model as well.
# Import libraries
from gensim.models import doc2vec
from collections import namedtuple
# Load data
doc1 = ["This is a sentence", "This is another sentence"]
# Transform data
docs = []
analyzedDocument = namedtuple('AnalyzedDocument', 'words tags')
for i, text in enumerate(doc1):
    words = text.lower().split()
    tags = [i]
    docs.append(analyzedDocument(words, tags))
# Train model
model = doc2vec.Doc2Vec(docs, size = 100, window = 300, min_count = 1, workers = 4)
# Get the vectors
model.docvecs[0]
model.docvecs[1]
Word2Vec and related algorithms (like 'Paragraph Vectors' aka Doc2Vec) usually make multiple training passes over the text corpus.
Gensim's Word2Vec/Doc2Vec allows the number of passes to be specified by the iter parameter, if you're also supplying the corpus in the object initialization to trigger immediate training. (Your code above does this by supplying docs to the Doc2Vec(docs, ...) constructor call.)
If unspecified, the default iter value used by gensim is 5, to match the default used by Google's original word2vec.c release. So your code above is already using 5 training passes.
Published Doc2Vec work often uses 10-20 passes. If you wanted to do 20 passes instead, you could change your Doc2Vec initialization to:
model = doc2vec.Doc2Vec(docs, iter=20, ...)
Because Doc2Vec often uses unique identifier tags for each document, more iterations can be more important, so that every doc-vector comes up for training multiple times over the course of the training, as the model gradually improves. On the other hand, because the words in a Word2Vec corpus might appear anywhere throughout the corpus, each word's associated vector will get multiple adjustments, early and middle and late in the process as the model improves, even with just a single pass. (So with a giant, varied Word2Vec corpus, it can be reasonable to use fewer than the default number of passes.)
You don't need to do your own loop, and most users shouldn't. If you do manage the separate build_vocab() and train() steps yourself, instead of the easier step of supplying the docs corpus in the initializer call to trigger immediate training, then you must supply an epochs argument to train() – and it will perform that number of passes, so you still only need one call to train().
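For reference, the explicit two-step route might look like the sketch below. It assumes Gensim 4.x, where the constructor parameters are named `vector_size` and `epochs` (older releases call them `size` and `iter`, as in the code above). The `window=5` value and the helper names are choices made for this example only.

```python
from collections import namedtuple

AnalyzedDocument = namedtuple("AnalyzedDocument", "words tags")

def make_docs(texts):
    """Turn raw strings into (words, tags) records, one integer tag per document."""
    return [AnalyzedDocument(text.lower().split(), [i])
            for i, text in enumerate(texts)]

def train_doc2vec(docs, epochs=20):
    """Explicit build_vocab() followed by a single train() call."""
    from gensim.models.doc2vec import Doc2Vec  # Gensim 4.x parameter names assumed
    model = Doc2Vec(vector_size=100, window=5, min_count=1, workers=4, epochs=epochs)
    model.build_vocab(docs)
    # One call is enough: train() itself makes `epochs` passes over the corpus.
    model.train(docs, total_examples=model.corpus_count, epochs=model.epochs)
    return model
```

Calling `train_doc2vec(make_docs(doc1))` would then run 20 passes without any hand-written loop.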
I'm looking for a way to dynamically add pre-trained word vectors to a gensim word2vec model.
I have a pre-trained word2vec model in a txt (words and their embedding) and I need to get Word Mover's Distance (for example via gensim.models.Word2Vec.wmdistance) between documents in a specific corpus and a new document.
To avoid loading the whole vocabulary, I want to load only the subset of the pre-trained model's words that are found in the corpus. But if the new document has words that are not found in the corpus but are in the original model's vocabulary, I want to add them to the model so that they are considered in the computation.
What I want is to save RAM, so possible things that would help me:
Is there a way to add the word vectors directly to the model?
Is there a way to load to gensim from a matrix or another object? I could have that object in RAM and append to it the new words before loading them in the model
I don't need it to be on gensim, so if you know a different implementation for WMD that gets the vectors as input that would work (though I do need it in Python)
Thanks in advance.
METHOD 1:
You can just use keyedvectors from gensim.models.keyedvectors. They are very easy to use.
from gensim.models.keyedvectors import WordEmbeddingsKeyedVectors
w2v = WordEmbeddingsKeyedVectors(50) # 50 = vec length
w2v.add(new_words, their_new_vecs)
METHOD 2:
And if you have already built a model using gensim.models.Word2Vec, you can just do the following. Suppose I want to add the token <UNK> with a random vector.
model.wv["<UNK>"] = np.random.rand(100) # 100 is the vectors length
The complete example would be like this:
import numpy as np
import gensim.downloader as api
from gensim.models import Word2Vec
dataset = api.load("text8") # load dataset as iterable
model = Word2Vec(dataset)
model.wv["<UNK>"] = np.random.rand(100)
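To load only the words that are actually needed, Method 1 could be combined with a small reader for the text file. The sketch below assumes the file is in the word2vec text format (an optional `<vocab_size> <dimensions>` header, then one `word v1 v2 ...` line per word) and uses the Gensim 4.x names `KeyedVectors(vector_size=...)` and `add_vectors`. Both helper function names are made up for the example.

```python
def read_subset(path, wanted):
    """Read a word2vec-format text file, keeping only the words in `wanted`."""
    wanted = set(wanted)
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f):
            parts = line.rstrip().split(" ")
            if line_no == 0 and len(parts) == 2:
                continue  # header line: "<vocab_size> <dimensions>"
            if parts[0] in wanted:
                words.append(parts[0])
                vectors.append([float(x) for x in parts[1:]])
    return words, vectors

def to_keyed_vectors(words, vectors):
    """Build a KeyedVectors object from the subset (Gensim 4.x API assumed)."""
    import numpy as np
    from gensim.models import KeyedVectors
    kv = KeyedVectors(vector_size=len(vectors[0]))
    kv.add_vectors(words, np.array(vectors, dtype=np.float32))
    return kv
```

When a new document brings words that are not loaded yet, `read_subset` can be run again for just those words and the result passed to `kv.add_vectors(...)`. Distances could then be computed with `kv.wmdistance(tokens_a, tokens_b)`.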
I am looking for a way to load vectors I generated previously using scikit-learn's TfidfVectorizer. In general what I wish is to get a better understanding of the TfidfVectorizer's data persistence.
For instance, what I did so far is:
vectorizer = TfidfVectorizer(stop_words=stop)
vect_train = vectorizer.fit_transform(corpus)
Then I wrote 2 functions in order to be able to save and load my vectorizer:
def save_model(model, name):
    '''
    Function that enables us to save a trained model
    '''
    joblib.dump(model, '{}.pkl'.format(name))

def load_model(name):
    '''
    Function that enables us to load a saved model
    '''
    return joblib.load('{}.pkl'.format(name))
I checked posts like the one below, but I still couldn't make much sense of it.
How do I store a TfidfVectorizer for future use in scikit-learn?
What I ultimately want is to run a training session once, then load the set of produced vectors, transform some new text input using the fitted vectorizer, and compute cosine_similarity between the old vectors and the new ones.
One of the reasons that I wish to do this is because the vectorization in such a large dataset takes approximately 10 minutes and I wish to do this once and not every time a new query comes in.
I guess what I should be saving is vect_train, right? But then what is the correct way to first save it and then load it into a newly created instance of TfidfVectorizer?
The first time I tried to save vect_train with joblib, as the kind people at scikit-learn advise, I got 4 files: tfidf.pkl, tfidf.pkl_01.npy, tfidf.pkl_02.npy, tfidf.pkl_03.npy. It would be great to know what exactly those are and how I could load them into a new instance of
vectorizer = TfidfVectorizer(stop_words=stop)
created in a different script.
Thank you in advance.
The result of your vect_train = vectorizer.fit_transform(corpus) is twofold: (i) the vectorizer fits your data, that is it learns the corpus vocabulary and the idf for each term, and
(ii) vect_train is instantiated with the vectors of your corpus.
The save_model and load_model functions you propose persist and load the vectorizer, that is the internal parameters that it has learned such as the vocabulary and the idfs. Having loaded the vectorizer, all you need to get vectors is to transform a list with data. It can be unseen data, or the raw data you used during the fit_transform. Therefore, all you need is:
vectorizer = load_model(name)
vect_train = vectorizer.transform(corpus) # (1) or any unseen data
At this point, you have everything you had before saving, but the transformation call (1) will take some time depending on your corpus. If you want to skip this, you also need to save the content of vect_train, as you correctly wonder in your question. This is a sparse matrix and can be saved/loaded using scipy; you can find information in this question, for example. Copying from that question, to actually save the csr matrices you also need:
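import numpy as np
from scipy.sparse import csr_matrix

def save_sparse_csr(filename, array):
    # np.savez appends '.npz' to filename if it is missing
    np.savez(filename, data=array.data, indices=array.indices,
             indptr=array.indptr, shape=array.shape)

def load_sparse_csr(filename):
    # pass the file name including the '.npz' extension
    loader = np.load(filename)
    return csr_matrix((loader['data'], loader['indices'], loader['indptr']),
                      shape=loader['shape'])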
In conclusion, the functions above can be used to save and load your vect_train, whereas the ones you provided save and load the vectorizer, which you need in order to vectorize new data.
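As a possible alternative for the matrix part, SciPy also ships `scipy.sparse.save_npz` and `scipy.sparse.load_npz`. The sketch below wraps them and adds a small ranking helper. The helper names are invented for the example. `top_matches` relies on the rows being L2-normalised (TfidfVectorizer's default, `norm='l2'`), in which case the dot product equals the cosine similarity.

```python
import numpy as np
from scipy import sparse

def save_matrix(path, matrix):
    """Persist a sparse matrix; `path` should end in '.npz'."""
    sparse.save_npz(path, sparse.csr_matrix(matrix))

def load_matrix(path):
    """Load a sparse matrix written by save_matrix."""
    return sparse.load_npz(path)

def top_matches(query_vec, doc_matrix, k=3):
    """Rank documents by dot product with a 1-row sparse query vector.

    For L2-normalised rows this is the cosine similarity.
    """
    scores = (doc_matrix @ query_vec.T).toarray().ravel()
    order = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in order]
```

In the scenario from the question, `doc_matrix` would be the reloaded vect_train and `query_vec` would come from `vectorizer.transform([new_text])`, using the vectorizer restored with `load_model`.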