Improve efficiency for ranking in Bag of words model - python

I am creating a text summarizer and using a basic Bag of Words model to start with.
The code uses the nltk library.
The file being read is large, with over 2,500,000 words.
Below is the loop I am working with, but it takes over 2 hours to run to completion. Is there a way to optimize this code?
import nltk
from nltk import sent_tokenize, word_tokenize
from collections import defaultdict
from heapq import nlargest

f = open('Complaints.csv', 'r')
raw = f.read()
len(raw)
tokens = nltk.word_tokenize(raw)
len(tokens)
freq = nltk.FreqDist(tokens)       # frequency distribution over all tokens
top_words = freq.most_common(100)  # the 100 most common (word, count) pairs
print(top_words)
sentences = sent_tokenize(raw)
# score each sentence by summing the corpus frequencies of its words
ranking = defaultdict(int)
for i, sent in enumerate(sentences):
    for word in word_tokenize(sent.lower()):
        if word in freq:
            ranking[i] += freq[word]
top_sentences = nlargest(10, ranking, key=ranking.get)
print(top_sentences)
This is only one file, and the actual deployment has 10-15 files of similar size.
How can we improve this?
Please note that the text comes from a chat bot and consists of actual sentences, so there was no need for whitespace removal, stemming, or other text pre-processing steps.

Firstly, you read the whole file into memory at once, so it has to fit into your RAM. If you do not have a really powerful machine, this may be the first performance bottleneck. Read the file line by line instead, or use some kind of IO buffer.
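As a rough sketch of that idea (reusing the Complaints.csv file and nltk tokenizer from the question), the frequency distribution can be built up line by line instead of tokenizing one huge string:

import nltk
from nltk import word_tokenize

# Build the frequency distribution incrementally so the whole file
# never has to be held and tokenized as one giant string.
freq = nltk.FreqDist()
with open('Complaints.csv', 'r') as f:
    for line in f:
        freq.update(word_tokenize(line))

print(freq.most_common(100))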
What CPU do you have? If you have enough cores, you can get a lot of extra performance by parallelizing the program with an async Pool from the multiprocessing library, because then you really use the full power of all cores (choose the number of processes according to the number of available threads; with this method I reduced a model run on 2500 data sets from ~5 minutes to ~17 seconds on 12 threads). You would have to implement the worker processes so that each returns a dict, and then merge those dicts after the processes have finished; a sketch of that pattern follows.
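A minimal sketch of that pattern, assuming the sentences list and freq distribution from the question have already been built (score_chunk, the chunk size, and the process count are illustrative):

from multiprocessing import Pool
from collections import Counter
from nltk import word_tokenize

def score_chunk(args):
    # Score one chunk of sentences; returns a dict {sentence_index: score}.
    # freq is passed in explicitly so the worker function is self-contained.
    chunk, offset, freq = args
    scores = {}
    for i, sent in enumerate(chunk):
        scores[offset + i] = sum(freq.get(w, 0) for w in word_tokenize(sent.lower()))
    return scores

def parallel_ranking(sentences, freq, processes=8, chunk_size=5000):
    chunks = [(sentences[i:i + chunk_size], i, freq)
              for i in range(0, len(sentences), chunk_size)]
    ranking = Counter()
    with Pool(processes) as pool:
        for partial in pool.imap_unordered(score_chunk, chunks):
            ranking.update(partial)  # merge each worker's dict into the final ranking
    return ranking

# usage, equivalent to the single-process loop in the question:
# ranking = parallel_ranking(sentences, freq)
# top_sentences = nlargest(10, ranking, key=ranking.get)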
Otherwise, there are machine learning approaches for text summarization (sequence-to-sequence RNNs). With a TensorFlow implementation, you can use a dedicated GPU on your local machine (even a decent 10xx or a 2060 from Nvidia will help) to speed up your model.
https://docs.python.org/2/library/multiprocessing.html
https://arxiv.org/abs/1602.06023
Hope this helps.

Use Google Colab, it provides its own GPU.

Related

How to handle a large dataset in spacy

I use the following code to clean my dataset and print all tokens (words).
with open(".data.csv", "r", encoding="utf-8") as file:
text = file.read()
text = re.sub(r"[^a-zA-Z0-9ß\.,!\?-]", " ", text)
text = text.lower()
nlp = spacy.load("de_core_news_sm")
doc = nlp(text)
for token in doc:
print(token.text)
When I execute this code with a small string, it works fine. But when I use a 50 megabyte CSV, I get the message
Text of length 62235045 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
When I increase the limit to this size, my computer runs into serious problems.
How can I fix this? It can't be anything unusual to want to tokenize this amount of data.
de_core_news_sm isn't just tokenizing. It is running a number of pipeline components, including a parser and NER, where you are more likely to run out of RAM on long texts. This is why spaCy includes this default limit.
If you only want to tokenize, use spacy.blank("de") and then you can probably increase nlp.max_length to a fairly large limit without running out of RAM. (You'll still eventually run out of RAM if the text gets extremely long, but this takes much much longer than with the parser or NER.)
If you want to run the full de_core_news_sm pipeline, then you'd need to break your text up into smaller units. Meaningful units like paragraphs or sections can make sense. The linguistic analysis from the provided pipelines mostly depends on local context within a few neighboring sentences, so having longer texts isn't helpful. Use nlp.pipe to process batches of text more efficiently, see: https://spacy.io/usage/processing-pipelines#processing
If you have CSV input, then it might make sense to use individual text fields as the units?
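As a rough sketch of the tokenize-only route combined with nlp.pipe (assuming the same .data.csv, with the text to tokenize in the first column; the column index, max_length, and batch size are illustrative):

import csv
import spacy

# Tokenizer-only pipeline: no parser or NER, so memory use stays low.
nlp = spacy.blank("de")
nlp.max_length = 10_000_000  # safe to raise when only tokenizing

with open(".data.csv", newline="", encoding="utf-8") as f:
    texts = [row[0] for row in csv.reader(f)]  # one text field per row (assumed)

# Process the fields in batches instead of one giant string.
for doc in nlp.pipe(texts, batch_size=1000):
    for token in doc:
        print(token.text)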

Training time of gensim word2vec

I'm training word2vec from scratch on a 34 GB pre-processed MS_MARCO corpus (the original corpus is 22 GB; the pre-processed version is SentencePiece-tokenized, which is why it is larger). I'm training my word2vec model using the following code:
from gensim.test.utils import common_texts, get_tmpfile
from gensim.models import Word2Vec

class Corpus():
    """Iterate over sentences from the corpus."""
    def __init__(self):
        self.files = [
            "sp_cor1.txt",
            "sp_cor2.txt",
            "sp_cor3.txt",
            "sp_cor4.txt",
            "sp_cor5.txt",
            "sp_cor6.txt",
            "sp_cor7.txt",
            "sp_cor8.txt"
        ]

    def __iter__(self):
        for fname in self.files:
            for line in open(fname):
                words = line.split()
                yield words

sentences = Corpus()
model = Word2Vec(sentences, size=300, window=5, min_count=1, workers=8, sg=1, hs=1, negative=10)
model.save("word2vec.model")
My model has now been running for more than 30 hours. This seems doubtful, since on my i5 laptop with 8 cores, all 8 cores are at 100% utilization the whole time. Plus, the program seems to have read more than 100 GB of data from disk so far. I don't know if anything is wrong here, but the main reason for my doubt about the training is this 100 GB read from disk. The whole corpus is 34 GB, so why has my code read 100 GB of data from disk? Does anyone know how much time it should take to train word2vec on 34 GB of text, with 8 cores of an i5 CPU running in parallel? Thank you. For more information, I'm also attaching a screenshot of the process from my system monitor.
I want to know why my model has read 112 GB from disk, even though my corpus is 34 GB in total. Will my training ever finish? I'm also a bit worried about the health of my laptop, since it has been running constantly at peak capacity for the last 30 hours, and it is really hot now.
Should I add any additional parameters to Word2Vec for quicker training without much performance loss?
Completing a model requires one pass over all the data to discover the vocabulary, then multiple passes, with a default of 5, to perform vector training. So, you should expect to see about 6x your data size in disk-reads, just from the model training.
(If your machine winds up needing to use virtual-memory swapping during the process, there could be more disk activity – but you absolutely do not want that to happen, as the random-access pattern of word2vec training is nearly a worst-case for virtual memory usage, which will slow training immensely.)
If you'd like to understand the code's progress, and be able to estimate its completion time, you should enable Python logging to at least the INFO level. Various steps of the process will report interim results (such as the discovered and surviving vocabulary size) and estimated progress. You can often tell if something is going wrong before the end of a run by studying the logging outputs for sensible values, and once the 'training' phase has begun the completion time will be a simple projection from the training completed so far.
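For example, the standard way to turn on that logging before building the model is:

import logging

# gensim reports vocabulary-survey results and training progress via the logging module
logging.basicConfig(
    format="%(asctime)s : %(levelname)s : %(message)s",
    level=logging.INFO,
)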
I believe most laptops should throttle their own CPU if it's becoming so hot as to become unsafe or risk extreme wear on the CPU/components, but whether yours does, I can't say, and definitely make sure its fans work & vents are unobstructed.
I'd suggest you choose some small random subset of your data – maybe 1GB? – to be able to run all your steps to completion, becoming familiar with the Word2Vec logging output, resource usage, and results, and tinkering with settings to observe changes, before trying to run on your full dataset, which might require days of training time.
Some of your shown parameters aren't optimal for speedy training. In particular:
min_count=1 retains every word seen in the corpus-survey, including those with only a single occurrence. This results in a much, much larger model - potentially risking a model that doesn't fit into RAM, forcing disastrous swapping. But also, words with just a few usage examples can't possibly get good word vectors, as the process requires seeing many subtly-varied alternate uses. Still, via typical 'Zipfian' word-frequencies, the number of such words with just a few uses may be very large in total, so retaining all those words takes a lot of training time/effort, and even serves a bit like 'noise' making the training of other words, with plenty of usage examples, less effective. So for model size, training speed, and quality of remaining vectors, a larger min_count is desirable. The default of min_count=5 is a better choice for most projects than min_count=1 – this is a parameter that should only really be changed if you're sure you know the effects. And, when you have plentiful data – as with your 34GB – the min_count can go much higher to keep the model size manageable.
hs=1 should only be enabled if you want to use the 'hierarchical-softmax' training mode instead of 'negative-sampling' – and in that case, negative=0 should also be set to disable 'negative-sampling'. You probably don't want to use hierarchical-softmax: it's not the default for a reason, and it doesn't scale as well to larger datasets. But here you've enabled it in addition to negative-sampling, likely more-than-doubling the required training time.
Did you choose negative=10 because you had problems with the default negative=5? Because this non-default choice, again, would slow training noticeably. (But also, again, a non-default choice here would be more common with smaller datasets, while larger datasets like yours are more likely to experiment with a smaller negative value.)
The theme of the above observations is: "only change the defaults if you've already got something working, and you have a good theory (or way of testing) how that change might help".
With a large-enough dataset, there's another default parameter to consider changing to speed up training (& often improve word-vector quality, as well): sample, which controls how-aggressively highly-frequent words (with many redundant usage-examples) may be downsampled (randomly skipped).
The default value, sample=0.001 (aka 1e-03), is very conservative. A smaller value, such as sample=1e-05, will discard many-more of the most-frequent-words' redundant usage examples, speeding overall training considerably. (And, for a corpus of your size, you could eventually experiment with even smaller, more-aggressive values.)
Finally, to the extent all your data (for either a full run, or a subset run) can be in an already-space-delimited text file, you can use the corpus_file alternate method of specifying the corpus. Then, the Word2Vec class will use an optimized multithreaded IO approach to assign sections of the file to alternate worker threads – which, if you weren't previously seeing full saturation of all threads/CPU-cores, could increase your throughput. (I'd put this off until after trying other things, then check if your best setup still leaves some of your 8 threads often idle.)
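Putting those suggestions together, a rough sketch of a configuration along these lines (the exact min_count and sample values are illustrative and worth tuning on a subset first; corpus_file assumes all the sp_cor*.txt files have been concatenated into one space-delimited text file):

from gensim.models import Word2Vec

# Assumes e.g. `cat sp_cor*.txt > all_sp_cor.txt` has produced one
# space-delimited file containing the whole corpus.
model = Word2Vec(
    corpus_file="all_sp_cor.txt",  # optimized multithreaded reader
    size=300,
    window=5,
    min_count=5,    # default; could go much higher for a 34 GB corpus
    sample=1e-05,   # more aggressive downsampling of very frequent words
    sg=1,
    hs=0,           # stick with negative sampling only
    negative=5,     # default
    workers=8,
)
model.save("word2vec.model")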

Improving spaCy memory usage and run time when running on 400K+ documents?

I currently have around 400K+ documents, each with an associated group and id number. They average around 24K characters and 350 lines each. In total, there is about 25 GB worth of data. Currently, they are split up by group, reducing the number of documents that need to be processed at one time to around 15K. I have run into problems with both memory usage and segmentation faults (I believe the latter is a result of the former) when running on a machine with 128 GB of memory. I have changed how I process the documents by using batching to handle them in chunks.
Batch Code
def batchGetDoc(raw_documents):
    out = []
    reports = []
    infos = []
    # Each item in raw_documents is a tuple of 2 items, where the first item is all
    # information (report number, tags) that correlate with said document. The second
    # item is the raw text of the document itself
    for info, report in raw_documents:
        reports.append(report)
        infos.append(info)
    # Using en_core_web_sm as the model
    docs = list(SPACY_PARSER.pipe(reports))
    for i in range(len(infos)):
        out.append([infos[i], docs[i]])
    return out
I use a batch size of 500, and even then, it still takes a while. Are these issues in both speed and memory due to using .pipe() on full documents rather than sentences? Would it be better to go through and run SPACY_PARSER(report) individually?
I am using spaCy to get the named entities, their linked entities, the dependency graphs, and knowledge bases from each document. Will doing it this way risk losing information that will be important for spaCy later on when it comes to getting said data?
Edit: I should mention that I do need the document info for later use in predicting the accuracy based on the document's text
There was a memory leak in the parser and NER, which is fixed in v2.1.9 and v2.2.2, so update if necessary. If you have very long documents, you might want to split them into paragraphs or sections for processing. (You'll get an error for texts longer than 1,000,000 characters.)
Definitely use nlp.pipe() for faster processing. You could use the as_tuples option with nlp.pipe() to pass in (text, context) tuples and get (doc, context) tuples back, so you won't need multiple loops. You'd have to handle reversing your tuples from the code above, but once you have (text, context) tuples, you'd just need something like:
out = nlp.pipe(raw_documents, as_tuples=True)
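For instance, a rough sketch of that reshaping, given the (info, report) ordering from the batch code above (the batch_size value is illustrative):

def batchGetDoc(nlp, raw_documents, batch_size=500):
    # raw_documents holds (info, report) pairs; as_tuples expects (text, context),
    # so swap the order before piping.
    text_context = ((report, info) for info, report in raw_documents)
    out = []
    for doc, info in nlp.pipe(text_context, as_tuples=True, batch_size=batch_size):
        out.append([info, doc])
    return out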

Speed up Spacy Named Entity Recognition

I'm using spacy to recognize street addresses on web pages.
My model is initialized basically using spacy's new entity type sample code found here:
https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py
My training data consists of plain text webpages with their corresponding Street Address entities and character positions.
I was able to quickly build a model in spacy to start making predictions, but I found its prediction speed to be very slow.
My code works by iterating through several raw HTML pages, feeding each page's plain-text version into spaCy as it iterates. For reasons I can't get into, I need to make predictions with spaCy page by page, inside the iteration loop.
After the model is loaded, I'm using the standard way of making predictions, which I'm referring to as the prediction/evaluation phase:
doc = nlp(plain_text_webpage)
if len(doc.ents) > 0:
    print("found entity")
Questions:
How can I speed up the entity prediction / recognition phase? I'm using a c4.8xlarge instance on AWS and all 36 cores are constantly maxed out when spaCy is evaluating the data. spaCy turns processing a few million webpages from a 1-minute job into a 1-hour+ job.
Will the speed of entity recognition improve as my model becomes more accurate?
Is there a way to remove pipelines like tagger during this phase, can ER be decoupled like that and still be accurate? Will removing other pipelines affect the model itself or is it just a temporary thing?
I saw that you can use GPU during the ER training phase, can it also be used in this evaluating phase in my code for faster predictions?
Update:
I managed to significantly cut down the processing time by:
Using a custom tokenizer (used the one in the docs)
Disabling other pipelines that aren't for Named Entity Recognition
Instead of feeding the whole body of text from each webpage into spacy, I'm only sending over a maximum of 5,000 characters
My updated code to load the model:
nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(text)
However, it is still too slow (20X slower than I need it to be).
Questions:
Are there any other improvements I can make to speed up the Named Entity Recognition? Any fat I can cut from spacy?
I'm still looking to see if a GPU based solution would help - I saw that GPU use is supported during the Named Entity Recognition training phase, can it also be used in this evaluation phase in my code for faster predictions?
Please see here for details about speed troubleshooting: https://github.com/explosion/spaCy/issues/1508
The most important things:
1) Check which BLAS library numpy is linked against, and make sure it's compiled well for your machine. Using conda is helpful, as then you get Intel's MKL.
2)
c4.8xlarge instance on AWS and all 36 cores are constantly maxed out when spacy is evaluating the data.
That's probably bad. We can only really parallelise the matrix multiplications at the moment, because we're using numpy, so there's no way to thread larger chunks. This means the BLAS library is probably launching too many threads. In general you can only profitably use 3-4 cores per process. Try setting the environment variables for your BLAS library to restrict the number of threads (see the sketch at the end of this answer).
3) Use nlp.pipe(), to process batches of data. This makes the matrix multiplications bigger, making processing more efficient.
4) Your outer loop of "feed data through my processing pipeline" is probably embarrassingly parallel. So, parallelise it. Either use Python's multiprocessing, or something like joblib, or something like Spark, or just fire off 10 bash scripts in parallel. But take the outermost, highest level chunk of work you can, and run it as independently as possible.
It's actually usually better to run multiple smaller VMs instead of one large VM. It's annoying operationally, but it means less resource sharing.
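A small sketch of points 2 and 3 together, with illustrative thread-count and batch-size values (the environment variables cover the common BLAS backends and must be set before numpy/spaCy are imported):

import os

# Restrict the BLAS thread pools so each process only uses a few cores.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import spacy

nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])

def extract_entities(texts):
    # nlp.pipe batches the documents, making the matrix multiplications
    # larger and more efficient than calling nlp() one page at a time.
    for doc in nlp.pipe(texts, batch_size=256):
        yield [(ent.text, ent.label_) for ent in doc.ents]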

Python NLTK FreqDist() Reduce Memory Usage By Writing k,v to disk?

I have a small program that uses NLTK to get the frequency distribution of a rather large dataset. The problem is that after a few million words I start to eat up all the RAM on my system. Here's what I believe to be the relevant lines of code:
freq_distribution = nltk.FreqDist(filtered_words) # get the frequency distribution of all the words
top_words = freq_distribution.most_common(10)         # get the top used words
bottom_words = freq_distribution.most_common()[-10:]  # get the least used words
There must be a way to write the key/value store to disk; I'm just not sure how. I'm trying to stay away from a document store like MongoDB and stay purely pythonic. If anyone has some suggestions I would appreciate it.
By coincidence, I had the same problem in the past month. I was trying to use NLTK and FreqDist to create n-gram frequency tables from large datasets (eg. the English Wikipedia and Gutenberg datasets). My 8GB machine could store a unigram model in memory, but not a bigram one.
My solution was to use BerkeleyDB, which stores a k,v database to disk, but also keeps an in-memory table cache for speed. For frequency distributions, this is VERY slow, so I also created my own sub-tables in memory using FreqDist, and then periodically saved them to BerkeleyDB (typically every 1000 or so input files). This greatly reduces the BerkeleyDB writes because it removes a lot of duplicates - e.g. "the" in a unigram model is only written once instead of many hundreds of thousands of times. I wrote it up here:
https://www.winwaed.com/blog/2012/05/17/using-berkeleydb-to-create-a-large-n-gram-table/
The problem with using pickle is that you have to store the entire distribution in memory. The only way of being purely pythonic is to write your own implementation, with its own k,v disk database and probably your own in-memory cache. Using BerkeleyDB is an awful lot easier, and efficient!
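A rough sketch of that periodic-flush pattern, using Python's standard-library shelve as a stand-in for BerkeleyDB (the file names and flush interval are illustrative):

import shelve
from nltk import FreqDist

def flush_to_disk(db_path, subtable):
    # Merge an in-memory FreqDist into a disk-backed key/value store.
    with shelve.open(db_path) as db:
        for word, count in subtable.items():
            db[word] = db.get(word, 0) + count

subtable = FreqDist()
with open('big_corpus.txt') as f:
    for n, line in enumerate(f, 1):
        subtable.update(line.split())
        if n % 100000 == 0:            # flush periodically to cap memory use
            flush_to_disk('freqs.db', subtable)
            subtable = FreqDist()

flush_to_disk('freqs.db', subtable)    # flush whatever remains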
I've used the JSON module to store large dictionaries (or other data structures) in these kinds of situations. I think pickle or cpickle may be more efficient, unless you want to store the data in human-readable form (often useful for nlp).
Here's how I do it:
import json
d = {'key': 'val'}
with open('file.txt', 'w') as f:
    json.dump(d, f)
Then to retrieve,
with open('file.txt', 'r') as f:
    d = json.loads(f.read())
