SCISPACY - Maximum length exceeded


I am getting the following error while trying to use a spaCy pipeline for biomedical data.
ValueError: [E088] Text of length 36325726 exceeds the maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in the number of characters, so you can check whether your inputs are too long by checking `len(text)`.
Note: When I reduce the size of the text, it works fine. But NLP is (mostly) all about big data :)
Update:
So, the ValueError is resolved. But SciSpacy is using too much processing power and thus forcing the Kaggle kernel to restart.
For now, I have split my dataset (1919 articles) into 15 separate items, just to get a result.
But please let me know if there is some other way or if I am missing something. Here is the latest Kernel: Cord-19

Related

How to handle a large dataset in spaCy

I use the following code to clean my dataset and print all tokens (words).
import re
import spacy

with open(".data.csv", "r", encoding="utf-8") as file:
    text = file.read()

# Replace everything except letters, digits, ß and basic punctuation with spaces.
text = re.sub(r"[^a-zA-Z0-9ß\.,!\?-]", " ", text)
text = text.lower()

nlp = spacy.load("de_core_news_sm")
doc = nlp(text)
for token in doc:
    print(token.text)
When I execute this code with a small string, it works fine. But when I use a 50-megabyte CSV, I get the message:
Text of length 62235045 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
When I increase the limit to this size, my computer runs into serious problems.
How can I fix this? Wanting to tokenize this amount of data can't be that unusual.

de_core_news_sm isn't just tokenizing. It is running a number of pipeline components including a parser and NER, where you are more likely to run out of RAM on long texts. This is why spaCy includes this default limit.
If you only want to tokenize, use spacy.blank("de") and then you can probably increase nlp.max_length to a fairly large limit without running out of RAM. (You'll still eventually run out of RAM if the text gets extremely long, but this takes much much longer than with the parser or NER.)
If you want to run the full de_core_news_sm pipeline, then you'd need to break your text up into smaller units. Meaningful units like paragraphs or sections can make sense. The linguistic analysis from the provided pipelines mostly depends on local context within a few neighboring sentences, so having longer texts isn't helpful. Use nlp.pipe to process batches of text more efficiently, see: https://spacy.io/usage/processing-pipelines#processing
If you have CSV input, then it might make sense to use individual text fields as the units?
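A minimal sketch of the two suggestions above, assuming the CSV has a header row and a column named `text` (a hypothetical name, not taken from the question). The helper names are also made up for illustration. It feeds one field at a time to a tokenizer-only pipeline via `nlp.pipe` instead of passing one giant string:

```python
import csv
import io


def iter_text_fields(csv_text, column):
    """Yield one non-empty text field per CSV row instead of one giant string."""
    reader = csv.DictReader(io.StringIO(csv_text))
    for row in reader:
        value = (row.get(column) or "").strip()
        if value:
            yield value


def tokenize_fields(texts, lang="de", batch_size=256):
    """Tokenize each text with a blank (tokenizer-only) pipeline."""
    import spacy  # imported lazily so the CSV helper works without spaCy

    nlp = spacy.blank(lang)  # no parser or NER, so far less memory per text
    for doc in nlp.pipe(texts, batch_size=batch_size):
        yield [token.text for token in doc]


# Usage (requires spaCy to be installed):
# with open("data.csv", encoding="utf-8") as f:
#     for tokens in tokenize_fields(iter_text_fields(f.read(), "text")):
#         print(tokens)
```

For a file too large to read at once, the same idea should work by passing the open file object straight to `csv.DictReader` rather than reading it into a string first.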

Training time of gensim word2vec

I'm training word2vec from scratch on a 34 GB pre-processed MS_MARCO corpus (22 GB before preprocessing). (The preprocessed corpus is SentencePiece-tokenized, so it is larger.) I'm training my word2vec model using the following code:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

class Corpus():
    """Iterate over sentences from the corpus."""
    def __init__(self):
        self.files = [
            "sp_cor1.txt",
            "sp_cor2.txt",
            "sp_cor3.txt",
            "sp_cor4.txt",
            "sp_cor5.txt",
            "sp_cor6.txt",
            "sp_cor7.txt",
            "sp_cor8.txt"
        ]

    def __iter__(self):
        for fname in self.files:
            # Each line is treated as one sentence of whitespace-separated tokens.
            with open(fname) as f:
                for line in f:
                    words = line.split()
                    yield words

sentences = Corpus()
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")
My model has now been running for more than 30 hours. This seems doubtful, since on my i5 laptop with 8 cores, all 8 cores are at 100% the whole time. Plus, my program seems to have read more than 100 GB of data from the disk by now. I don't know if anything is wrong here, but the main reason for my doubt about the training is this 100 GB read from the disk. The whole corpus is 34 GB, so why has my code read 100 GB of data from the disk? Does anyone know how long it should take to train word2vec on 34 GB of text, with 8 cores of an i5 CPU all running in parallel? Thank you. For more information, I'm also attaching a photo of my process from the system monitor.
I want to know why my model has read 112 GB from memory, even though my corpus is 34 GB in total. Will my training ever finish? I'm also a bit worried about the health of my laptop, since it has been running constantly at peak capacity for the last 30 hours. It is really hot now.
Should I add any additional parameters to Word2Vec for quicker training without much performance loss?

Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes, with a default of 5, to perform vector training. So, you should expect to see about 6x your data size in disk-reads, just from the model training.
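As a quick worked version of that arithmetic (a sketch only; the function name is made up for illustration):

```python
def expected_read_gb(corpus_gb, epochs=5):
    """One vocabulary-discovery pass plus `epochs` training passes."""
    return corpus_gb * (1 + epochs)


print(expected_read_gb(34))  # 204
```

On that estimate, 112 GB read from a 34 GB corpus corresponds to a bit over three full passes, which would be consistent with a run that has finished the vocabulary scan and is partway through training, assuming no swapping is inflating the number.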
(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)
If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
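A minimal way to turn that logging on, using only the standard library. The format string is just one reasonable choice, and the helper name is hypothetical:

```python
import logging


def enable_info_logging():
    """Show INFO-level progress messages, including gensim's."""
    logging.basicConfig(
        format="%(asctime)s : %(levelname)s : %(message)s",
        level=logging.INFO,
    )
    # basicConfig does nothing if the root logger is already configured,
    # so also set the level on the "gensim" logger explicitly.
    logging.getLogger("gensim").setLevel(logging.INFO)


enable_info_logging()  # call this before constructing Word2Vec
```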
I believe most laptops should throttle their own CPU if it's becoming so hot as to become unsafe or risk extreme wear on the CPU/components, but whether yours does, I can't say, and definitely make sure its fans work & vents are unobstructed.
I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
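One way to carve out such a subset without loading anything into memory is to keep each line with a fixed probability while streaming. This is only a sketch: the function names are hypothetical, and a `keep_fraction` of about 0.03 is my rough guess for getting ~1 GB out of 34 GB.

```python
import random


def sample_lines(lines, keep_fraction, seed=0):
    """Yield each line independently with probability `keep_fraction`."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    for line in lines:
        if rng.random() < keep_fraction:
            yield line


def write_subset(in_paths, out_path, keep_fraction, seed=0):
    """Stream several corpus files into one smaller file; return lines kept."""
    def all_lines():
        for path in in_paths:
            with open(path, encoding="utf-8") as src:
                yield from src

    kept = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for line in sample_lines(all_lines(), keep_fraction, seed):
            out.write(line)
            kept += 1
    return kept
```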
Some of your shown parameters aren't optimal for speedy training. In particular:
min_count=1 retains every word seen in the corpus-survey, including those with only a single occurrence. This results in a much, much larger model - potentially risking a model that doesn't fit into RAM, forcing disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word-frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of remaining vectors, a larger min_count is desirable. The default of min_count=5 is better for more projects than min_count=1 – this is a parameter that should only really be changed if you're sure you know the effects. And, when you have plentiful data – as with your 34GB – the min_count can go much higher to keep the model size manageable.
hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling' – and in that case, negative=0 should also be set to disable 'negative-sampling'. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled it in addition to negative-sampling, likely more-than-doubling the required training time.
Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.)
The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".
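To make the comparison concrete, here are the question's settings next to a variant that goes back to the defaults mentioned above. The vector-size argument is left out on purpose because its name differs between gensim versions (`size` in 3.x, `vector_size` in 4.x). The `train` helper is hypothetical, and the right `min_count` for a corpus this large is something to tune, not a recommendation:

```python
# Settings from the question, for comparison.
question_params = dict(window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)

# Same run with min_count, hs and negative returned to their defaults.
suggested_params = dict(window=5, min_count=5, workers=8, sg=1, hs=0, negative=5)

# Which settings actually differ between the two runs.
changed = sorted(k for k in question_params if question_params[k] != suggested_params[k])
print(changed)  # ['hs', 'min_count', 'negative']


def train(sentences, params):
    """Train with the given keyword arguments (requires gensim)."""
    from gensim.models import Word2Vec  # lazy import

    return Word2Vec(sentences, **params)
```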
With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).
The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
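To get a feel for how much more aggressive a smaller value is, here is the keep-probability formula as I understand it from the original word2vec.c (and, I believe, mirrored in gensim). Treat the exact formula as an assumption and check it against your gensim version. `freq` is the word's share of all tokens in the corpus:

```python
import math


def keep_probability(freq, sample):
    """Probability that one occurrence of a word is kept during training.

    Assumed formula: (sqrt(freq / sample) + 1) * (sample / freq), capped at 1.
    """
    prob = (math.sqrt(freq / sample) + 1) * (sample / freq)
    return min(prob, 1.0)


# A very common word making up 5% of all tokens (illustrative number):
for sample in (1e-3, 1e-5):
    print(sample, round(keep_probability(0.05, sample), 4))
```

Under this formula, rare words are always kept (the probability is capped at 1), so only the most frequent words lose examples.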
Finally, to the extent all your data (for either a full run, or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then, the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads – which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase your throughput. (I'd put this off until after trying other things, then check if your best setup still leaves some of your 8 threads often idle.)
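A sketch of what that could look like for the eight files in the question, assuming they already contain one space-delimited sentence per line. Both helper names are hypothetical. The merge step is plain Python, and only `train_from_file` needs gensim:

```python
def merge_corpus_files(in_paths, out_path):
    """Concatenate tokenized files into one, keeping one sentence per line."""
    with open(out_path, "wb") as out:
        for path in in_paths:
            with open(path, "rb") as src:
                for line in src:
                    # Guard against a file whose last line lacks a newline,
                    # which would otherwise glue two sentences together.
                    out.write(line if line.endswith(b"\n") else line + b"\n")
    return out_path


def train_from_file(corpus_path, **params):
    """Train via the corpus_file path instead of a Python iterable."""
    from gensim.models import Word2Vec  # lazy import; requires gensim

    return Word2Vec(corpus_file=corpus_path, **params)
```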

Is it feasible to run a Support Vector Machine Kernel on a device with <= 1 MB RAM and <= 10 MB ROM?

Some preliminary testing shows that a project I'm working on could potentially benefit from the use of a support vector machine to solve a tricky problem. The concern that I have is that there will be major memory constraints. Prototyping and testing are being done in Python with scikit-learn. The final version will be custom-written in C. The model would be pre-trained and only the decision function would be stored on the final product. There would be <= 10 training features and <= 5000 training data points. I've been reading mixed things regarding SVM memory, and I know the default sklearn memory cache is 200 MB (much larger than what I have available). Is this feasible? I know there are multiple different types of SVM kernel and that the kernels can also be custom-written. What kernel types could this potentially work with, if any?

If you're that strapped for space, you'll probably want to skip scikit and simply implement the math yourself. That way, you can cycle through the data in structures of your own choosing. Memory requirements depend on the class of SVM you're using; a two-class linear SVM can be done with a single pass through the data, considering only one observation at a time as you accumulate sum-of-products, so your command logic would take far more space than the data requirements.
If you need to keep the entire data set in memory for multiple passes, that's "only" 5000*10*8 bytes for 8-byte floats, or 400 KB of your 1 MB, which might be enough room to do your manipulations. Also consider a slower training process that re-reads the data on each pass, as this reduces the 400 KB to a triviality at the cost of wall-clock time.
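To put numbers on that, and on the "only the decision function is stored" case from the question, here is a small sketch in Python (the prototyping language mentioned in the question). It assumes a linear kernel, and the function names are made up:

```python
def dataset_bytes(n_samples, n_features, bytes_per_value=8):
    """Memory needed to hold the full training set in RAM."""
    return n_samples * n_features * bytes_per_value


def linear_decision(weights, bias, x):
    """Sign of w.x + b: all a trained linear SVM needs at inference time."""
    score = bias
    for w_i, x_i in zip(weights, x):
        score += w_i * x_i
    return 1 if score >= 0 else -1


print(dataset_bytes(5000, 10))  # 400000
```

For a linear kernel the stored model is just the weights plus a bias, so 11 values for 10 features. A non-linear kernel such as RBF would instead need the support vectors themselves on the device, which in the worst case approaches the full 400 KB, so the kernel choice largely determines whether this fits.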
All of this is under your control if you look up a usable SVM implementation and alter the I/O portions as needed.
Does that help?

memory error when using gensim for loading word2vec

I am using the gensim library to load pre-trained word vectors from the GoogleNews dataset. This dataset contains 3,000,000 word vectors, each of 300 dimensions. When I try to load the GoogleNews dataset, I receive a memory error. I have run this code before without a memory error, and I don't know why I receive this error now.
I have checked a lot of sites for a solution to this issue, but I can't understand it.
This is my code for loading GoogleNews:
import gensim.models.keyedvectors as word2vec
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
And this is the error I received:
File "/home/mahsa/PycharmProjects/tensor_env_project/word_embedding_DUC2007/inspect_word2vec-master/word_embeddings_GoogleNews.py", line 8, in <module>
model=word2vec.KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin",binary=True)
File "/home/mahsa/anaconda3/envs/tensorflow_env/lib/python3.5/site-packages/gensim/models/keyedvectors.py", line 212, in load_word2vec_format
result.syn0 = zeros((vocab_size, vector_size), dtype=datatype)
MemoryError
Can anybody help me? Thanks.

Loading just the raw vectors will take...
3,000,000 words * 300 dimensions * 4 bytes/dimension = 3.6GB
...of addressable memory (plus some overhead for the word-key to index-position map).
Additionally, as soon as you want to do a most_similar()-type operation, unit-length normalized versions of the vectors will be created – which will require another 3.6GB. (You may instead clobber the raw vectors in place, saving that extra memory, if you'll only be doing cosine-similarity comparisons between the unit-normed vectors, by first doing a forced explicit model.init_sims(replace=True).)
So you'll generally only want to do full operations on a machine with at least 8GB of RAM. (Any swapping at all during full-array most_similar() lookups will make operations very slow.)
If anything else was using Python heap space, that could have accounted for the MemoryError you saw.
The load_word2vec_format() method also has an optional limit argument which will only load the supplied number of vectors – so you could use limit=500000 to cut the memory requirements by about 5/6ths. (And, since the GoogleNews and other vector sets are usually ordered from most- to least-frequent words, you'll get the 500K most-frequent words. Lower-frequency words generally have much less value and even not-as-good vectors, so it may not hurt much to ignore them.)
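The estimates above can be reproduced with a few lines of arithmetic. This is a sketch with a made-up helper name, and it ignores the overhead of the word-to-index map:

```python
def vectors_gb(n_words, dims=300, bytes_per_value=4, normalized_copy=False):
    """Approximate RAM for the vector array, in GB (1e9 bytes)."""
    raw = n_words * dims * bytes_per_value
    total = raw * 2 if normalized_copy else raw  # extra unit-normed copy
    return total / 1e9


print(vectors_gb(3_000_000))                        # 3.6
print(vectors_gb(3_000_000, normalized_copy=True))  # 7.2
print(vectors_gb(500_000))                          # 0.6
```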

To load the whole model, you need more RAM.
You may use the following code. Set the limit to whatever your system can handle. It will load the vectors at the top of the file.
from gensim import models
w = models.KeyedVectors.load_word2vec_format(r"GoogleNews-vectors-negative300.bin.gz", binary=True, limit = 100000)
I set the limit to 100,000. It worked on my laptop with 4 GB of RAM.

Try closing all your browser tabs and everything else that is eating up RAM. For me that worked.

You should increase the RAM; then it should work.

How to make word2vec model's loading time and memory use more efficient?

I want to use Word2vec in a web server (production) in two different variants, where I fetch two sentences from the web and compare them in real time. For now, I am testing it on a local machine which has 16GB RAM.
Scenario:
w2v = load w2v model
If condition 1 is true:
    if normalize:
        reverse normalize by w2v.init_sims(replace=False) (not sure if it will work)
    Loop through some items:
        calculate their vectors using w2v
else if condition 2 is true:
    if not normalized:
        w2v.init_sims(replace=True)
    Loop through some items:
        calculate their vectors using w2v
I have already read the suggestion to reduce the vocabulary to a smaller size, but I would like to use the whole vocabulary.
Are there new workarounds on how to handle this? Is there a way to initially load a small portion of the vocabulary for first 1-2 minutes and in parallel keep loading the whole vocabulary?

As a one-time delay that you should be able to schedule to happen before any service-requests, I would recommend against worrying too much about the first-time load() time. (It's going to inherently take a lot of time to load a lot of data from disk to RAM – but once there, if it's being kept around and shared between processes well, the cost is not spent again for an arbitrarily long service-uptime.)
It doesn't make much sense to "load a small portion of the vocabulary for first 1-2 minutes and in parallel keep loading the whole vocabulary" – as soon as any similarity-calc is needed, the whole set of vectors needs to be accessed for any top-N results. (So the "half-loaded" state isn't very useful.)
Note that if you do init_sims(replace=True), the model's original raw vector magnitudes are clobbered with the new unit-normed (all-same-magnitude) vectors. So looking at your pseudocode, the only difference between the two paths is the explicit init_sims(replace=True). But if you're truly keeping the same shared model in memory between requests, as soon as condition 2 occurs, the model is normalized, and thereafter calls under condition 1 are also occurring with normalized vectors. And further, additional calls under condition 2 just redundantly (and expensively) re-normalize the vectors in-place. So if normalized-comparisons are your only focus, best to do one in-place init_sims(replace=True) at service startup - not at the mercy of order-of-requests.
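The "normalize once at startup" idea can be illustrated without gensim. In this sketch a plain dict of lists stands in for the model, and `unit_normalize_in_place` plays the role of init_sims(replace=True). All names are hypothetical:

```python
import math


def unit_normalize_in_place(vectors):
    """Overwrite each vector with its unit-length version; magnitudes are lost."""
    for vec in vectors.values():
        norm = math.sqrt(sum(v * v for v in vec))
        if norm > 0:
            for i in range(len(vec)):
                vec[i] /= norm


class VectorService:
    """Holds one shared set of vectors for the lifetime of the service."""

    def __init__(self, vectors):
        self.vectors = vectors
        unit_normalize_in_place(self.vectors)  # once, at startup

    def cosine(self, a, b):
        # For unit-length vectors the dot product is the cosine similarity.
        return sum(x * y for x, y in zip(self.vectors[a], self.vectors[b]))
```

Because the normalization happens in the constructor, no request handler has to check or change the normalized state, so results cannot depend on the order in which requests arrive.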
If you've saved the model using gensim's native save() (rather than save_word2vec_format()), and as uncompressed files, there's the option to 'memory-map' the files on a future re-load. This means rather than immediately copying the full vector array into RAM, the file-on-disk is simply marked as providing the addressing-space. There are two potential benefits to this: (1) if you only ever access some limited ranges of the array, only those are loaded, on demand; (2) many separate processes all using the same mapped files will automatically reuse any shared ranges loaded into RAM, rather than potentially duplicating the same data.
(1) isn't much of an advantage as soon as you need to do a full-sweep over the whole vocabulary – because they're all brought into RAM then, and further at the moment of access (which will have more service-lag than if you'd just pre-loaded them). But (2) is still an advantage in multi-process webserver scenarios. There's a lot more detail on how you might use memory-mapped word2vec models efficiently in a prior answer of mine, at How to speed up Gensim Word2vec model load time?
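A rough sketch of the save-then-memory-map flow described above, assuming the vectors start out in word2vec format. The function names and paths are placeholders, and the code needs gensim installed to actually run:

```python
def convert_once(word2vec_path, native_path, binary=True):
    """One-time conversion to gensim's native (uncompressed) format."""
    from gensim.models import KeyedVectors  # lazy import; requires gensim

    kv = KeyedVectors.load_word2vec_format(word2vec_path, binary=binary)
    kv.save(native_path)  # large arrays are written as separate raw files


def load_shared(native_path):
    """Re-load with the big arrays memory-mapped read-only."""
    from gensim.models import KeyedVectors  # lazy import; requires gensim

    return KeyedVectors.load(native_path, mmap="r")
```

Each web-server worker process would call `load_shared` on the same path, so the operating system can share the mapped pages between them instead of each process holding its own copy.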
