memory error when using gensim for loading word2vec - python

I am using the gensim library to load pre-trained word vectors from the GoogleNews dataset. This dataset contains 3,000,000 word vectors, each of 300 dimensions. When I try to load the GoogleNews dataset, I receive a memory error. I have run this code before without a memory error, and I don't know why I receive this error now.
I have checked a lot of sites for solving this issue, but I can't figure it out.
This is my code for loading GoogleNews:
import gensim.models.keyedvectors as word2vec
model = word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
And this is the error I received:
File "/home/mahsa/PycharmProjects/tensor_env_project/word_embedding_DUC2007/inspect_word2vec-master/word_embeddings_GoogleNews.py", line 8, in <module>
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
File "/home/mahsa/anaconda3/envs/tensorflow_env/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 212, in load_word2vec_format
result.syn0 = zeros((vocab_size, vector_size), dtype=datatype)
MemoryError
Can anybody help me? Thanks.

Loading just the raw vectors will take...
3,000,000 words * 300 dimensions * 4 bytes/dimension = 3.6GB
...of addressable memory (plus some overhead for the word-key to index-position map).
Additionally, as soon as you want to do a most_similar()-type operation, unit-length normalized versions of the vectors will be created – which will require another 3.6GB. (If you'll only be doing cosine-similarity comparisons between the unit-normed vectors, you can instead clobber the raw vectors in place, saving that extra memory, by first making a forced, explicit call to model.init_sims(replace=True).)
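For example, a minimal sketch of that in-place normalization (using the model loaded as above; after this call the raw vector magnitudes are gone, so only cosine-style comparisons remain meaningful):
# overwrite the raw vectors with unit-normed versions instead of keeping a second ~3.6GB copy
model.init_sims(replace=True)
print(model.most_similar('king', topn=5))  # now works directly against the unit-normed vectors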
So you'll generally only want to do full operations on a machine with at least 8GB of RAM. (Any swapping at all during full-array most_similar() lookups will make operations very slow.)
If anything else was using Python heap space, that could have accounted for the MemoryError you saw.
The load_word2vec_format() method also has an optional limit argument which will only load the supplied number of vectors – so you could use limit=500000 to cut the memory requirements by about 5/6ths. (And, since the GoogleNews and other vector sets are usually ordered from most- to least-frequent words, you'll get the 500K most-frequent words. Lower-frequency words generally have much less value and even not-as-good vectors, so it may not hurt much to ignore them.)

To load the whole model you need more RAM.
You can use the following code. Set the limit to whatever your system can handle; it will load that many vectors from the top of the file.
from gensim import models
w = models.KeyedVectors.load_word2vec_format(r"GoogleNews-vectors-negative300.bin.gz", binary=True, limit = 100000)
I set the limit as 100,000. It worked on my 4GB RAM laptop.

Try closing all your browser tabs and everything else that is eating up RAM. For me that worked.

You should increase the RAM; then it would work.

Related

SCISPACY - Maximum length exceeded

I am getting the following error while trying to use a spaCy pipeline for biomedical data.
ValueError: [E088] Text of length 36325726 exceeds the maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in the number of characters, so you can check whether your inputs are too long by checking `len(text)`.
Note: When I am reducing the size, it works fine. But, NLP is all about big data :) (mostly)
Update:
So, the ValueError is resolved. But sciSpaCy is using too much processing power and thus forcing the Kaggle kernel to restart.
For now, I have split my dataset (1,919 articles) into 15 separate chunks, just to get a result.
But please let me know if there is some other way or if I am missing something. Here is the latest kernel: Cord-19
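For reference, a sketch of the two usual workarounds (the model name en_core_sci_sm and the chunk size below are assumptions, not taken from the kernel):
import spacy

nlp = spacy.load("en_core_sci_sm")   # assumed sciSpaCy model; substitute whichever model the kernel uses

# Option A: raise the character limit (per the error text, only safe if the parser/NER
# aren't needed over the full text)
nlp.max_length = 40_000_000

# Option B: feed the pipeline chunks that stay under the default 1,000,000-character limit
def chunks(text, size=900_000):
    for i in range(0, len(text), size):
        yield text[i:i + size]

docs = [nlp(chunk) for chunk in chunks(full_text)]   # full_text is a placeholder for the combined articles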

Training time of gensim word2vec

I'm training word2vec from scratch on a 34 GB pre-processed MS MARCO corpus (the original corpus is 22 GB; the pre-processed version is SentencePiece-tokenized, which is why it is larger). I'm training my word2vec model using the following code:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

class Corpus():
    """Iterate over sentences from the corpus."""
    def __init__(self):
        self.files = [
            "sp_cor1.txt",
            "sp_cor2.txt",
            "sp_cor3.txt",
            "sp_cor4.txt",
            "sp_cor5.txt",
            "sp_cor6.txt",
            "sp_cor7.txt",
            "sp_cor8.txt"
        ]

    def __iter__(self):
        for fname in self.files:
            for line in open(fname):
                words = line.split()
                yield words

sentences = Corpus()

model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")
My model has now been running for more than 30 hours. This is suspicious since, on my i5 laptop with 8 cores, all 8 cores are being used at 100% the whole time. Also, my program seems to have read more than 100 GB of data from disk so far. I don't know if anything is wrong here, but the main reason for my doubt about the training is this 100 GB of disk reads: the whole corpus is 34 GB, so why has my code read 100 GB of data from disk? Does anyone know how much time it should take to train word2vec on 34 GB of text, with 8 i5 CPU cores all running in parallel? Thank you. For more information, I'm also attaching a screenshot of my process from the system monitor.
I want to know why my model has read 112 GB from disk, even though my corpus is 34 GB in total. Will my training ever finish? I'm also a bit worried about the health of my laptop, since it has been running constantly at its peak capacity for the last 30 hours, and it is really hot now.
Should I add any additional parameters to Word2Vec for quicker training without much loss of quality?
Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes (a default of 5) to perform vector training. So you should expect to see about 6x your data size in disk reads, just from the model training.
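As a rough check, assuming the default of 5 training epochs:
34GB corpus * (1 vocabulary pass + 5 training epochs) = ~204GB of expected disk reads
So the ~112GB read so far would put the run roughly halfway through its expected total reads.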
(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)
If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
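For example, before constructing the model:
import logging
# emit gensim's per-step progress reports and ETA estimates
logging.basicConfig(format="%(asctime)s : %(levelname)s : %(message)s", level=logging.INFO)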
I believe most laptops should throttle their own CPU if it's becoming so hot as to become unsafe or risk extreme wear on the CPU/components, but whether yours does, I can't say, and definitely make sure its fans work & vents are unobstructed.
I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
Some of your shown parameters aren't optimal for speedy training. In particular:
min_count=1 retains every word seen in the corpus-survey, including those with only a single occurrence. This results in a much, much larger model, potentially risking a model that doesn't fit into RAM and forces disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word-frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of the remaining vectors, a larger min_count is desirable. The default of min_count=5 is a better choice for most projects than min_count=1 – this is a parameter that should only really be changed if you're sure you know the effects. And, when you have plentiful data – as with your 34GB – the min_count can go much higher to keep the model size manageable.
hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling' – and in that case, negative=0 should also be set to disable 'negative-sampling'. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled it in addition to negative-sampling, likely more-than-doubling the required training time.
Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.)
The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".
With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).
The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
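Putting those suggestions together, a sketch of a more conservative configuration (the particular min_count and sample values are illustrative starting points, not tuned recommendations):
from gensim.models import Word2Vec

model = Word2Vec(
    sentences,       # the same Corpus() iterable as above
    size=300,
    window=5,
    min_count=20,    # illustrative: discard rare words on a corpus this large
    sample=1e-5,     # downsample very-frequent words more aggressively than the 1e-3 default
    sg=1,
    hs=0,            # keep the default negative-sampling mode
    negative=5,      # the default
    workers=8,
)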
Finally, to the extent all your data (for either a full run, or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then, the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads – which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase your throughput. (I'd put this off until after trying other things, then check if your best setup still leaves some of your 8 threads often idle.)
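For example (assuming the eight files have been concatenated into a single space-delimited, one-sentence-per-line text file, here called all_sp_cor.txt, and a gensim version of 3.6 or later where corpus_file is available):
model = Word2Vec(corpus_file="all_sp_cor.txt", size=300, window=5, min_count=20, sg=1, workers=8)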

How to make word2vec model's loading time and memory use more efficient?

I want to use Word2vec in a web server (production) in two different variants, where I fetch two sentences from the web and compare them in real-time. For now, I am testing it on a local machine which has 16GB of RAM.
Scenario:
w2v = load w2v model
If condition 1 is true:
    if normalize:
        reverse normalize by w2v.init_sims(replace=False) (not sure if it will work)
    Loop through some items:
        calculate their vectors using w2v
else if condition 2 is true:
    if not normalized:
        w2v.init_sims(replace=True)
    Loop through some items:
        calculate their vectors using w2v
I have already read the solution about reducing the vocabulary size to a small size but I would like to use all the vocabulary.
Are there new workarounds for handling this? Is there a way to initially load a small portion of the vocabulary for the first 1-2 minutes and, in parallel, keep loading the whole vocabulary?
As a one-time delay that you should be able to schedule to happen before any service-requests, I would recommend against worrying too much about the first-time load() time. (It's going to inherently take a lot of time to load a lot of data from disk to RAM – but once there, if it's being kept around and shared between processes well, the cost is not spent again for an arbitrarily long service-uptime.)
It doesn't make much sense to "load a small portion of the vocabulary for the first 1-2 minutes and in parallel keep loading the whole vocabulary" – as soon as any similarity calculation is needed, the whole set of vectors needs to be accessed for any top-N results. (So the "half-loaded" state isn't very useful.)
Note that if you do init_sims(replace=True), the model's original raw vector magnitudes are clobbered with the new unit-normed (all-same-magnitude) vectors. So looking at your pseudocode, the only difference between the two paths is the explicit init_sims(replace=True). But if you're truly keeping the same shared model in memory between requests, as soon as condition 2 occurs, the model is normalized, and thereafter calls under condition 1 are also occurring with normalized vectors. And further, additional calls under condition 2 just redundantly (and expensively) re-normalize the vectors in-place. So if normalized-comparisons are your only focus, best to do one in-place init_sims(replace=True) at service startup - not at the mercy of order-of-requests.
If you've saved the model using gensim's native save() (rather than save_word2vec_format()), and as uncompressed files, there's the option to 'memory-map' the files on a future re-load. This means rather than immediately copying the full vector array into RAM, the file-on-disk is simply marked as providing the addressing-space. There are two potential benefits to this: (1) if you only even access some limited ranges of the array, only those are loaded, on demand; (2) many separate processes all using the same mapped files will automatically reuse any shared ranges loaded into RAM, rather than potentially duplicating the same data.
(1) isn't much of an advantage as soon as you need to do a full-sweep over the whole vocabulary – because they're all brought into RAM then, and further at the moment of access (which will have more service-lag than if you'd just pre-loaded them). But (2) is still an advantage in multi-process webserver scenarios. There's a lot more detail on how you might use memory-mapped word2vec models efficiently in a prior answer of mine, at How to speed up Gensim Word2vec model load time?
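A sketch of that combination, following the linked answer (filenames are placeholders, and this assumes a gensim version from the syn0/init_sims era):
from gensim.models import KeyedVectors

# one-time preparation: normalize in place, then re-save uncompressed in gensim's native format
kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
kv.init_sims(replace=True)          # do the one normalization up front, at preparation time
kv.save("vectors_normed.kv")        # placeholder filename

# in each web worker process: memory-map the saved arrays read-only
kv = KeyedVectors.load("vectors_normed.kv", mmap="r")
kv.syn0norm = kv.syn0               # mark the vectors as already unit-normed, per the linked answer
print(kv.most_similar("apple", topn=3))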

Doc2vec MemoryError

I am using the doc2vec model from the gensim framework to represent a corpus of 15,500,000 short documents (up to 300 words each):
model = gensim.models.Doc2Vec(sentences, size=400, window=10, min_count=1, workers=8)
After creating the vectors, there are more than 18,000,000 vectors representing words and documents.
I want to find the most similar items (words or documents) for a given item:
similarities = model.most_similar('uid_10693076')
but I get a MemoryError when the similarities are computed:
Traceback (most recent call last):
File "article/test_vectors.py", line 31, in <module>
similarities = model.most_similar(item)
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 639, in most_similar
self.init_sims()
File "/usr/local/lib/python2.7/dist-packages/gensim/models/word2vec.py", line 827, in init_sims
self.syn0norm = (self.syn0 / sqrt((self.syn0 ** 2).sum(-1))[..., newaxis]).astype(REAL)
I have an Ubuntu machine with 60GB of RAM and 70GB of swap. I checked the memory allocation (in htop) and observed that the memory was never completely used. I also set the maximum address space which may be locked in memory to unlimited in Python:
resource.getrlimit(resource.RLIMIT_MEMLOCK)
Could someone explain the reason for this MemoryError? In my opinion the available memory should be enough for these computations. Could there be some memory limits in Python or the OS?
Thanks in advance!
18M vectors * 400 dimensions * 4 bytes/float = 28.8GB for the model's syn0 array (trained vectors)
The syn1 array (hidden weights) will also be 28.8GB – even though syn1 doesn't really need entries for doc-vectors, which are never target-predictions during training.
The vocabulary structures (vocab dict and index2word table) will likely add another GB or more. So that's all your 60GB RAM.
The syn0norm array, used for similarity calculations, will need another 28.8GB, for a total usage of around 90GB. It's the syn0norm creation where you're getting the error. But even if syn0norm creation succeeded, being that deep into virtual memory would likely ruin performance.
Some steps that might help:
Use a min_count of at least 2: words appearing once are unlikely to contribute much, but likely use a lot of memory. (But since words are a tiny portion of your syn0, this will only save a little.)
After training but before triggering init_sims(), discard the syn1 array (see the sketch after this list). You won't be able to train more, but your existing word/doc vectors remain accessible.
After training but before calling most_similar(), call init_sims() yourself with a replace=True parameter, to discard the non-normalized syn0 and replace it with the syn0norm. Again you won't be able to train more, but you'll save the syn0 memory.
In-progress work separating out the doc and word vectors, which will appear in gensim after version 0.11.1, should also eventually offer some relief. (It'll shrink syn1 to only include word entries, and allow doc-vectors to come from a file-backed (memmap'd) array.)
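A sketch of the second and third suggestions above (assuming a gensim version of that era, where these arrays are attributes directly on the model; depending on training mode the hidden weights live in syn1 for hierarchical softmax or syn1neg for negative sampling):
# after training has finished: the model cannot be trained further past this point
model.syn1 = None               # or model.syn1neg = None, depending on training mode
model.init_sims(replace=True)   # overwrite the raw syn0 with the unit-normed vectors
similarities = model.most_similar('uid_10693076')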

Memory Error Using scikit-learn on Predicting but Not Fitting

I'm classifying text data using TfidfVectorizer from scikit-learn.
I transform my dataset and it turns into a 75k by 172k sparse matrix with 865k stored elements. (I used ngram ranges of 1-3.)
Fitting the data takes a long time, but it does fit.
However, when I try to predict on the test set, I get memory issues. Why is this? I would think that the most memory-intensive part would be fitting, not predicting.
I've tried doing a few things to circumvent this but have had no luck. First I tried dumping the data locally with joblib.dump, quitting python and restarting. This unfortunately didn't work.
Then I tried switching over to a HashingVectorizer, but ironically the hashing vectorizer causes memory issues on the same dataset. I was under the impression a HashingVectorizer would be more memory-efficient?
hashing = HashingVectorizer(analyzer='word',ngram_range=(1,3))
tfidf = TfidfVectorizer(analyzer='word',ngram_range=(1,3))
xhash = hashing.fit_transform(x)
xtfidf = tfidf.fit_transform(x)
pac.fit(xhash,y) # causes memory error
pac.fit(xtfidf,y) # works fine
I am using scikit-learn 0.15 (bleeding edge) and Windows 8.
I have 8 GB of RAM and a hard drive with 100 GB of free space. I set my virtual memory to 50 GB for this project, and I can set it even higher if needed, but I'm trying to understand the problem before just brute-force trying solutions like I have been for the past couple of days. I've tried a few different classifiers: mostly PassiveAggressiveClassifier, Perceptron, MultinomialNB, and LinearSVC.
I should also note that at one point I was using a 350k by 472k sparse matrix with 12M stored elements. I was still able to fit the data, despite it taking some time, but I had memory errors when predicting.
The scikit-learn library is strongly optimized (and uses NumPy and SciPy). TfidfVectorizer stores sparse matrices, which are relatively small compared with standard dense matrices.
If you think the issue is memory, you can set the max_features parameter when you create the TfidfVectorizer. It may be useful for checking your assumptions
(for more detail about TfidfVectorizer, see the documentation).
Also, I recommend that you reduce the training set and check again; that can also be useful for checking your assumptions.
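For example (the max_features value here is only an illustration; tune it to your memory budget):
from sklearn.feature_extraction.text import TfidfVectorizer

# cap the vocabulary to the most frequent terms to bound matrix width and memory use
tfidf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), max_features=100000)
xtfidf = tfidf.fit_transform(x)   # x and pac as in the question's code
pac.fit(xtfidf, y)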
