How to handle a large dataset in spaCy - Python

I use the following code to clean my dataset and print all tokens (words).
import re
import spacy

with open(".data.csv", "r", encoding="utf-8") as file:
    text = file.read()

# Keep only letters, digits, ß and basic punctuation
text = re.sub(r"[^a-zA-Z0-9ß\.,!\?-]", " ", text)
text = text.lower()

nlp = spacy.load("de_core_news_sm")
doc = nlp(text)
for token in doc:
    print(token.text)
When I execute this code with a small string, it works fine. But when I use a 50 MB CSV file, I get this message:
Text of length 62235045 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
When I increase the limit to this size, my computer runs into serious problems.
How can I fix this? Wanting to tokenize this amount of data can't be anything unusual.

de_core_news_sm isn't just tokenizing. It is running a number of pipeline components, including a parser and NER, where you are more likely to run out of RAM on long texts. This is why spaCy includes this default limit.
If you only want to tokenize, use spacy.blank("de") and then you can probably increase nlp.max_length to a fairly large limit without running out of RAM. (You'll still eventually run out of RAM if the text gets extremely long, but this takes much much longer than with the parser or NER.)
If you want to run the full de_core_news_sm pipeline, then you'd need to break your text up into smaller units. Meaningful units like paragraphs or sections can make sense. The linguistic analysis from the provided pipelines mostly depends on local context within a few neighboring sentences, so having longer texts isn't helpful. Use nlp.pipe to process batches of text more efficiently, see: https://spacy.io/usage/processing-pipelines#processing
If you have CSV input, then it might make sense to use individual text fields as the units?
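As a rough sketch of the tokenizer-only route, assuming the CSV has one text field per row (the column name "text" and the file name here are just placeholders):
import csv
import spacy

# Tokenizer-only pipeline: no parser or NER, so long texts need far less RAM
nlp = spacy.blank("de")

texts = []
with open("data.csv", newline="", encoding="utf-8") as f:
    reader = csv.DictReader(f)
    for row in reader:
        texts.append(row["text"])  # placeholder column name

# nlp.pipe streams the rows in batches instead of processing one giant string
for doc in nlp.pipe(texts, batch_size=100):
    for token in doc:
        print(token.text)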

Related

SCISPACY - Maximum length exceeded

I am getting the following error while trying to use a spaCy pipeline for biomedical data.
ValueError: [E088] Text of length 36325726 exceeds the maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in the number of characters, so you can check whether your inputs are too long by checking `len(text)`.
Note: when I reduce the size, it works fine. But NLP is all about big data :) (mostly)
Update:
So, the ValueError is resolved. But SciSpacy is using too much processing power and is forcing the Kaggle kernel to restart.
For now, I have split my dataset of 1919 articles into 15 separate batches, just to get a result.
But please let me know if there is some other way or if I am missing something. Here is the latest kernel: Cord-19
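For what it's worth, one common pattern (not from the original post; the model name and variable names are placeholders) is to stream article-sized texts through nlp.pipe instead of building one giant Doc:
import spacy

# Placeholder model name; swap in the scispacy model you actually use
nlp = spacy.load("en_core_sci_sm")

def entities_in_batches(articles, batch_size=64):
    # nlp.pipe streams the articles, so the whole corpus never has to
    # fit into a single Doc or exceed nlp.max_length
    for doc in nlp.pipe(articles, batch_size=batch_size):
        yield [(ent.text, ent.label_) for ent in doc.ents]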

Improving spaCy memory usage and run time when running on 400K+ documents?

I currently have around 400K+ documents, each with an associated group and ID number. They average around 24K characters and 350 lines each; in total, there is about 25 GB worth of data. Currently, they are split up by group, reducing the number of documents that need to be processed at one time to around 15K. I have run into problems with both memory usage and segmentation faults (I believe the latter is a result of the former) when running on a machine with 128 GB of memory. I have changed how I process the documents, using batching to handle several at a time.
Batch Code
def batchGetDoc(raw_documents):
    out = []
    reports = []
    infos = []
    # Each item in raw_documents is a tuple of 2 items, where the first item is all
    # information (report number, tags) that correlates with said document. The second
    # item is the raw text of the document itself
    for info, report in raw_documents:
        reports.append(report)
        infos.append(info)
    # Using en_core_web_sm as the model
    docs = list(SPACY_PARSER.pipe(reports))
    for i in range(len(infos)):
        out.append([infos[i], docs[i]])
    return out
I use a batch size of 500, and even then, it still takes a while. Are these issues in both speed and memory due to using .pipe() on full documents rather than sentences? Would it be better to go through and run SPACY_PARSER(report) individually?
I am using spaCy to get the named entities, their linked entities, the dependency graphs, and knowledge bases from each document. Will doing it this way risk losing information that will be important for spaCy later on when it comes to getting said data?
Edit: I should mention that I do need the document info for later use, when predicting accuracy based on the document's text.
There was a memory leak in the parser and NER, which is fixed in v2.1.9 and v2.2.2, so update if necessary. If you have very long documents, you might want to split them into paragraphs or sections for processing. (You'll get an error for texts longer than 1,000,000 characters.)
Definitely use nlp.pipe() for faster processing. You could use the as_tuples option with nlp.pipe() to pass in (text, context) tuples and get (doc, context) tuples back, so you won't need multiple loops. You'd have to handle reversing your tuples from the code above, but once you have (text, context) tuples, you'd just need something like:
out = nlp.pipe(raw_documents, as_tuples=True)
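A rough sketch of how that could replace the batching function above (assuming raw_documents still holds (info, report) tuples, as in the question's code):
def batchGetDoc(raw_documents):
    # nlp.pipe with as_tuples=True expects (text, context) pairs,
    # so flip each (info, report) tuple around first
    text_context_pairs = [(report, info) for info, report in raw_documents]
    out = []
    for doc, info in SPACY_PARSER.pipe(text_context_pairs, as_tuples=True, batch_size=500):
        out.append([info, doc])
    return out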

Improve efficiency for ranking in Bag of words model

I am creating a text summarizer and am using a basic bag-of-words model.
The code uses the nltk library.
The file being read is large, with over 2,500,000 words.
Below is the loop I am working with, but it takes over 2 hours to run to completion. Is there a way to optimize this code?
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from collections import defaultdict
from heapq import nlargest

with open('Complaints.csv', 'r') as f:
    raw = f.read()

tokens = nltk.word_tokenize(raw)
freq = nltk.FreqDist(tokens)       # word frequencies over the whole file
top_words = freq.most_common(100)  # 100 most frequent words
print(top_words)

sentences = sent_tokenize(raw)

# Score each sentence by the summed frequency of its words
ranking = defaultdict(int)
for i, sent in enumerate(sentences):
    for word in word_tokenize(sent.lower()):
        if word in freq:
            ranking[i] += freq[word]

top_sentences = nlargest(10, ranking, ranking.get)
print(top_sentences)
This is only one file, and the actual deployment has more than 10-15 files of similar size.
How can we improve this?
Please note that these are texts from a chat bot and are actual sentences, so there was no requirement to remove whitespace, apply stemming, or do other text preprocessing.
Firstly, you open the large file all at once, so it needs to fit entirely into your RAM. If you do not have a really good computer, this might be the first performance bottleneck. Read each line separately, or try to use some I/O buffer.
What CPU do you have? If you have enough cores, you can get a lot of extra performance by parallelizing the program with an async Pool from the multiprocessing library, because then you really use the full power of all cores (choose the number of processes according to the number of threads; with this method, I reduced a model run on 2500 data sets from ~5 minutes to ~17 seconds on 12 threads). You would have each process return its own dict and merge them after the processes have finished.
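A minimal sketch of that pattern, using a plain Pool.map rather than the async variant and merging per-process frequency dicts at the end (the function names and chunk size are placeholders):
from collections import Counter
from multiprocessing import Pool
from nltk import word_tokenize

def count_words(lines):
    # Each worker tokenizes its own chunk and returns a local frequency dict
    counts = Counter()
    for line in lines:
        counts.update(word_tokenize(line.lower()))
    return counts

def parallel_freq(path, processes=12, chunk_size=10000):
    with open(path, "r") as f:
        lines = f.readlines()
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with Pool(processes) as pool:
        partial_counts = pool.map(count_words, chunks)
    # Merge the per-process dicts after the workers have finished
    total = Counter()
    for counts in partial_counts:
        total.update(counts)
    return total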
Otherwise, there are machine learning approaches to text summarization (sequence-to-sequence RNNs). With a TensorFlow implementation, you can use a dedicated GPU on your local machine (even a decent 10xx or a 2060 from Nvidia will help) to speed up your model.
https://docs.python.org/2/library/multiprocessing.html
https://arxiv.org/abs/1602.06023
hope this helps
Use Google Colab, it provides its own GPU.

How to speed up slow POS tagging?

Before you redirect me to another stackoverflow page since I know there are a few questions about speeding up POS tagging, I've already browsed through and sped up my code with the suggestions here: Slow performance of POS tagging. Can I do some kind of pre-warming?
I'm using Python 3.6. I have lists containing ~100,000 words that have been tokenized using nltk. These are pretty hefty lists so I know that tagging all of these words will inherently take some amount of time. I've loaded the tagger outside, as follows:
def tag_wordList(tokenizedWordList):
    from nltk.tag.perceptron import PerceptronTagger
    tagger = PerceptronTagger()  # load outside
    for words in tokenizedWordList:
        taggedList = tagger.tag(tokenizedWordList)  # add POS to words
    return taggedList
Taking this step has sped things up a significant amount, but to get through 100,000+ words, it's still taking over 1.5 hours (and it's still running). The code works fine on a smaller set of data. I believe I tried converting the list to a set at one point without much improvement, though I'm going to try again for good measure. Anyone have any other tips for improving efficiency?
If that's really your tagging code, you are tagging each ten-word sentence ten times before you go to the next one. Understand how your tools work before you complain that they are too slow.
You can get a further speedup by calling pos_tag_sents() on your full list of word-tokenized sentences, instead of launching it separately for each sentence (even just once).
tagged_sents = nltk.pos_tag_sents(tokenized_sentences)
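Putting both suggestions together, a corrected version of the question's function might look roughly like this (the function name is illustrative):
import nltk
from nltk.tag.perceptron import PerceptronTagger

tagger = PerceptronTagger()  # load the tagger once, outside any loop

def tag_word_list(tokenized_word_list):
    # Tag the whole token list exactly once instead of once per word
    return tagger.tag(tokenized_word_list)

# For many sentences, hand them all to NLTK in a single call:
# tagged_sents = nltk.pos_tag_sents(tokenized_sentences)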

Storing tokenized text in the db?

I have a simple question. I'm doing some light crawling, so new content arrives every few days. I've written a tokenizer and would like to use it for some text mining purposes. Specifically, I'm using Mallet's topic modeling tool, and one of its pipes tokenizes the text before further processing can be done. With the amount of text in my database, it takes a substantial amount of time to tokenize the text (I'm using regexes here).
As such, is it the norm to store the tokenized text in the DB, so that the tokenized data is readily available and tokenizing can be skipped when I need it for other text mining purposes, such as topic modeling or POS tagging? What are the cons of this approach?
Caching Intermediate Representations
It's pretty normal to cache the intermediate representations created by slower components in your document processing pipeline. For example, if you needed dependency parse trees for all the sentences in each document, it would be pretty crazy to do anything except parsing the documents once and then reusing the results.
Slow Tokenization
However, I'm surprised that tokenization is really slow for you, since the stuff downstream from tokenization is usually the real bottleneck.
What package are you using to do the tokenization? If you're using Python and you wrote your own tokenization code, you might want to try one of the tokenizers included in NLTK (e.g., TreebankWordTokenizer).
Another good tokenizer, albeit one that is not written in Python, is the PTBTokenizer included with the Stanford Parser and the Stanford CoreNLP end-to-end NLP pipeline.
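As a rough illustration of caching the tokenized form (SQLite and the table layout here are just placeholders for whatever database you actually use):
import json
import sqlite3
from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
conn = sqlite3.connect("corpus.db")  # placeholder database
conn.execute("CREATE TABLE IF NOT EXISTS docs (id INTEGER PRIMARY KEY, text TEXT, tokens TEXT)")

def cache_tokens(doc_id, text):
    # Tokenize once and store the token list as JSON alongside the raw text
    tokens = tokenizer.tokenize(text)
    conn.execute("INSERT OR REPLACE INTO docs (id, text, tokens) VALUES (?, ?, ?)",
                 (doc_id, text, json.dumps(tokens)))
    conn.commit()

def load_tokens(doc_id):
    # Skip re-tokenizing if the tokens are already cached
    row = conn.execute("SELECT tokens FROM docs WHERE id = ?", (doc_id,)).fetchone()
    return json.loads(row[0]) if row else None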
I store tokenized text in a MySQL database. While I don't always like the overhead of communication with the database, I've found that there are lots of processing tasks that I can ask the database to do for me (like search the dependency parse tree for complex syntactic patterns).
