I have a simple question. I'm doing some light crawling, so new content arrives every few days. I've written a tokenizer and would like to use it for some text mining purposes. Specifically, I'm using Mallet's topic modeling tool, and one of the pipes is to tokenize the text into tokens before further processing can be done. With the amount of text in my database, it takes a substantial amount of time to tokenize the text (I'm using regex here).
As such, is it the norm to store the tokenized text in the DB so that the tokenized data is readily available and tokenizing can be skipped when I need it for other text mining purposes such as topic modeling or POS tagging? What are the cons of this approach?
Caching Intermediate Representations
It's pretty normal to cache the intermediate representations created by slower components in your document processing pipeline. For example, if you needed dependency parse trees for all the sentences in each document, it would be pretty crazy to do anything except parsing the documents once and then reusing the results.
Slow Tokenization
However, I'm surprised that tokenization is really slow for you, since the stuff downstream from tokenization is usually the real bottleneck.
What package are you using to do the tokenization? If you're using Python and you wrote your own tokenization code, you might want to try one of the tokenizers included in NLTK (e.g., TreebankWordTokenizer).
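For instance, a minimal sketch of swapping in NLTK's TreebankWordTokenizer (assuming NLTK is installed and you only need word tokens):

from nltk.tokenize import TreebankWordTokenizer

tokenizer = TreebankWordTokenizer()
# returns a plain list of token strings, ready to hand off to MALLET or a DB
tokens = tokenizer.tokenize("MALLET's pipes expect a stream of tokens.")
print(tokens)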
Another good tokenizer, albeit one that is not written in Python, is the PTBTokenizer included with the Stanford Parser and the Stanford CoreNLP end-to-end NLP pipeline.
I store tokenized text in a MySQL database. While I don't always like the overhead of communication with the database, I've found that there are lots of processing tasks that I can ask the database to do for me (like search the dependency parse tree for complex syntactic patterns).
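As a rough sketch of that caching pattern (using sqlite3 here just to keep the example self-contained; the same idea applies to MySQL), you could store the space-joined tokens keyed by document id, so the tokenizer only ever runs once per document:

import sqlite3

conn = sqlite3.connect("corpus.db")
conn.execute("CREATE TABLE IF NOT EXISTS docs (doc_id TEXT PRIMARY KEY, tokens TEXT)")

def cache_tokens(doc_id, raw_text, tokenize):
    # tokenize is your own function, e.g. the regex tokenizer you already wrote
    tokens = tokenize(raw_text)
    conn.execute("INSERT OR REPLACE INTO docs VALUES (?, ?)", (doc_id, " ".join(tokens)))
    conn.commit()

def load_tokens(doc_id):
    row = conn.execute("SELECT tokens FROM docs WHERE doc_id = ?", (doc_id,)).fetchone()
    return row[0].split(" ") if row else None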
Related
I use the following code to clean my dataset and print all tokens (words).
with open(".data.csv", "r", encoding="utf-8") as file:
text = file.read()
text = re.sub(r"[^a-zA-Z0-9ß\.,!\?-]", " ", text)
text = text.lower()
nlp = spacy.load("de_core_news_sm")
doc = nlp(text)
for token in doc:
print(token.text)
When I execute this code with a small string, it works fine. But when I use a 50-megabyte CSV, I get the message:
Text of length 62235045 exceeds maximum of 1000000. The parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the `nlp.max_length` limit. The limit is in number of characters, so you can check whether your inputs are too long by checking `len(text)`.
When I increase the limit to that size, my computer runs into serious memory problems.
How can I fix this? Wanting to tokenize this amount of data can't be anything unusual.
de_core_news_sm isn't just tokenizing. It is running a number of pipeline components, including a parser and NER, where you are more likely to run out of RAM on long texts. This is why spaCy includes this default limit.
If you only want to tokenize, use spacy.blank("de") and then you can probably increase nlp.max_length to a fairly large limit without running out of RAM. (You'll still eventually run out of RAM if the text gets extremely long, but this takes much much longer than with the parser or NER.)
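A minimal sketch of that tokenizer-only approach (the max_length value here is just an assumption sized to your ~62M-character input):

import spacy

nlp = spacy.blank("de")          # tokenizer only: no tagger, parser or NER
nlp.max_length = 70_000_000      # assumption: comfortably above len(text)

with open(".data.csv", "r", encoding="utf-8") as file:
    text = file.read()

doc = nlp(text)
for token in doc:
    print(token.text)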
If you want to run the full de_core_news_sm pipeline, then you'd need to break your text up into smaller units. Meaningful units like paragraphs or sections can make sense. The linguistic analysis from the provided pipelines mostly depends on local context within a few neighboring sentences, so having longer texts isn't helpful. Use nlp.pipe to process batches of text more efficiently, see: https://spacy.io/usage/processing-pipelines#processing
If you have CSV input, then it might make sense to use individual text fields as the units?
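For example, a sketch that treats each row's text field as one unit and streams it through nlp.pipe (assuming a column called "text"; adjust to your actual CSV layout):

import csv
import spacy

nlp = spacy.load("de_core_news_sm")

with open(".data.csv", newline="", encoding="utf-8") as f:
    texts = (row["text"] for row in csv.DictReader(f))
    # nlp.pipe processes the rows in batches instead of building one giant Doc
    for doc in nlp.pipe(texts, batch_size=50):
        for token in doc:
            print(token.text)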
I have a big text file from which I want to count the occurrences of known phrases. I currently read the whole text file line by line into memory and use the 'find' function to check whether a particular phrase exists in the text file or not:
found = txt.find(phrase)
This is very slow for large files. Building an index of all possible phrases and storing them in a dict would help, but the problem is that it's challenging to create all meaningful phrases myself. I know that the Lucene search engine supports phrase search. When using Lucene to create an index for a text set, do I need to come up with my own tokenization method, especially for my phrase search purpose above? Or does Lucene have an efficient way to automatically create an index for all possible phrases, without my needing to worry about how to create the phrases?
My main purpose is to find a good way to count occurrences in a big text.
Summary: Lucene will take care of allocating higher matching scores to indexed text which more closely match your input phrases, without you having to "create all meaningful phrases" yourself.
Start Simple
I recommend you start with a basic Lucene analyzer, and see what effect that has. There is a reasonably good chance that it will meet your needs.
If that does not give you satisfactory results, then you can certainly investigate more specific/targeted analyzers/tokenizers/filters (for example if you need to analyze non-Latin character sets).
It is hard to be more specific without looking at the source data and the phrase matching requirements in more detail.
But, having said that, here are two examples (and I am assuming you have basic familiarity with how to create a Lucene index, and then query it).
All of the code is based on Lucene 8.4.
CAVEAT - I am not familiar with Python implementations of Lucene. So, with apologies, my examples are in Java - not Python (as your question is tagged). I would imagine that the concepts are somewhat translatable. Apologies if that's a showstopper for you.
A Basic Multi-Purpose Analyzer
Here is a basic analyzer - using the Lucene "service provider interface" syntax and a CustomAnalyzer:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;
...
Analyzer analyzer = CustomAnalyzer.builder()
        .withTokenizer("icu")
        .addTokenFilter("lowercase")
        .addTokenFilter("asciiFolding")
        .build();
The above analyzer tokenizes your input text using Unicode word-break rules, as implemented in the ICU libraries. It then standardizes on lowercase, and maps accents/diacritics/etc. to their ASCII equivalents.
An Example Using Shingles
If the above approach proves to be weak for your specific phrase matching needs (i.e. false positives scoring too highly), then one technique you can try is to use shingles as your tokens. Read more about shingles here (Elasticsearch has great documentation).
Here is an example analyzer using shingles, and using the more "traditional" syntax:
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.shingle.ShingleFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;
...
Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        final Tokenizer source = new StandardTokenizer();
        TokenStream tokenStream = source;
        tokenStream = new LowerCaseFilter(tokenStream);
        tokenStream = new ASCIIFoldingFilter(tokenStream);
        // default shingle size is 2:
        tokenStream = new ShingleFilter(tokenStream);
        return new TokenStreamComponents(source, tokenStream);
    }
};
In this example, the default shingle size is 2 (two words per shingle) - which is a good place to start.
Finally...
Even if you think this is a one-time exercise, it is probably still worth going to the trouble to build some Lucene indexes in a repeatable/automated way (which may take a while depending on the amount of data you have).
That way, it will be fast to run your set of known phrases against the index, to see how effective each index is.
I have deliberately not said anything about your ultimate objective ("to count occurrences"), because that part should be relatively straightforward, assuming you really do want to find exact matches for known phrases. It's possible I have misinterpreted your question - but at a high level I think this is what you need.
I have a spaCy-based NLP pipeline that processes large corpora, then searches for matches on specific words (lemma plus part of speech), then records detailed lexical information for every match. What results is a Python dictionary structured as in the following (simplified) snippet, where w is the matched token and sent_id is a unique hash to the sentence/corpus in which the token was found:
this_match = {"tag": w.tag_,
"dep": w.dep_,
"parent": w.head_,
"children": [{"lemma": c.lemma_,
"tag": c.tag_}
for c in w.children if not c.is_punct]
"sent_id": sent_id}
These smaller dictionaries, each of which represents the linguistic context in which a word appears, are then added to a top-level dictionary that is keyed to the lemma/POS of each match token w. That dictionary is to serve as a store of lexical information for calculating various statistics on the linguistic characteristics of words.
As you can see, our data structure is a straightforward JSON, with dictionaries and arrays of dictionaries of text. How should I be storing it for follow-on analysis? A purely local solution is all I require at the moment, but if (like Elasticsearch, at 2. below) the solution is clearly scalable, then even better. Here are some of the options I've considered:
1. Just save them all as JSONs and load them every time you work on the statistics (see the sketch after this list).
2. Load them into an Elasticsearch index. (The data points are not documents, per se, so Elasticsearch seems like overkill.)
3. Use some other lighter, simpler database format (TinyDB?).
4. Pickle to a .pkl archive. (Way too big; not easily transferable to other developers/environments.)
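To make option 1 concrete, here is a rough sketch of what I mean by "just save them as JSONs": one JSON object per line ("JSON Lines"), appended as matches are found and reloaded for analysis (the file name is illustrative):

import json

def append_match(lemma_pos, this_match, path="matches.jsonl"):
    # one JSON object per line, keyed by the lemma/POS of the matched token
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps({"key": lemma_pos, "match": this_match}) + "\n")

def load_matches(path="matches.jsonl"):
    store = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            store.setdefault(record["key"], []).append(record["match"])
    return store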
What have you used in the past for storing lexically indexed language data? Is there a go-to utility for this use case? And how do I ensure that my data store is scalable?
This question is more open-ended than I'd normally like to ask or see on Stackoverflow, but I think that it will be useful to enough other developers working in NLP that it's worth asking anyway.
I'm using spacy to recognize street addresses on web pages.
My model is initialized basically using spacy's new entity type sample code found here:
https://github.com/explosion/spaCy/blob/master/examples/training/train_new_entity_type.py
My training data consists of plain text webpages with their corresponding Street Address entities and character positions.
I was able to quickly build a model in spacy to start making predictions, but I found its prediction speed to be very slow.
My code works by iterating through several raw HTML pages and feeding each page's plain-text version into spaCy as it iterates. For reasons I can't get into, I need to make predictions with spaCy page by page, inside the iteration loop.
After the model is loaded, I'm using the standard way of making predictions, which I'm referring to as the prediction/evaluation phase:
doc = nlp(plain_text_webpage)
if len(doc.ents) > 0:
    print("found entity")
Questions:
How can I speed up the entity prediction/recognition phase? I'm using a c4.8xlarge instance on AWS and all 36 cores are constantly maxed out when spaCy is evaluating the data. With spaCy, processing a few million webpages has gone from a 1-minute job to a 1+ hour job.
Will the speed of entity recognition improve as my model becomes more accurate?
Is there a way to remove pipeline components like the tagger during this phase? Can ER be decoupled like that and still be accurate? Will removing other pipeline components affect the model itself, or is it just a temporary thing?
I saw that you can use GPU during the ER training phase, can it also be used in this evaluating phase in my code for faster predictions?
Update:
I managed to significantly cut down the processing time by:
Using a custom tokenizer (used the one in the docs)
Disabling other pipelines that aren't for Named Entity Recognition
Instead of feeding the whole body of text from each webpage into spacy, I'm only sending over a maximum of 5,000 characters
My updated code to load the model:
nlp = spacy.load('test_model/', disable=['parser', 'tagger', 'textcat'])
nlp.tokenizer = WhitespaceTokenizer(nlp.vocab)
doc = nlp(text)
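For reference, the whitespace tokenizer from the spaCy docs that I'm using looks roughly like this (paraphrased; see the custom-tokenizer section of the docs for the exact version matching your spaCy release):

from spacy.tokens import Doc

class WhitespaceTokenizer(object):
    def __init__(self, vocab):
        self.vocab = vocab

    def __call__(self, text):
        words = text.split(" ")
        # each token "owns" a trailing space in this scheme
        spaces = [True] * len(words)
        return Doc(self.vocab, words=words, spaces=spaces)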
However, it is still too slow (20X slower than I need it)
Questions:
Are there any other improvements I can make to speed up the Named Entity Recognition? Any fat I can cut from spacy?
I'm still looking to see if a GPU based solution would help - I saw that GPU use is supported during the Named Entity Recognition training phase, can it also be used in this evaluation phase in my code for faster predictions?
Please see here for details about speed troubleshooting: https://github.com/explosion/spaCy/issues/1508
The most important things:
1) Check which BLAS library numpy is linked against, and make sure it's compiled well for your machine. Using conda is helpful, as you then get Intel's MKL.
2)
c4.8xlarge instance on AWS and all 36 cores are constantly maxed out
when spacy is evaluating the data.
That's probably bad. We can only really parallelise the matrix multiplications at the moment, because we're using numpy --- so there's no way to thread larger chunks. This means the BLAS library is probably launching too many threads. In general you can only profitably use 3-4 cores per process. Try setting the environment variables for your BLAS library to restrict the number of threads.
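For example, a sketch of capping BLAS threads from Python before numpy (and therefore spaCy) is imported; which variable actually matters depends on whether your numpy uses MKL or OpenBLAS:

import os

# must be set before numpy/spacy are imported
os.environ["OMP_NUM_THREADS"] = "3"
os.environ["MKL_NUM_THREADS"] = "3"        # Intel MKL builds
os.environ["OPENBLAS_NUM_THREADS"] = "3"   # OpenBLAS builds

import spacy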
3) Use nlp.pipe(), to process batches of data. This makes the matrix multiplications bigger, making processing more efficient.
4) Your outer loop of "feed data through my processing pipeline" is probably embarrassingly parallel. So, parallelise it. Either use Python's multiprocessing, or something like joblib, or something like Spark, or just fire off 10 bash scripts in parallel. But take the outermost, highest level chunk of work you can, and run it as independently as possible.
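As a rough sketch of point 4 with Python's multiprocessing (load_page_texts, chunked and all_page_paths are hypothetical stand-ins for however you read and split your webpages):

from multiprocessing import Pool

def process_chunk(paths):
    # each worker loads its own copy of the model; spaCy pipelines
    # should not be shared across processes
    import spacy
    nlp = spacy.load("test_model/", disable=["parser", "tagger", "textcat"])
    found = []
    for doc in nlp.pipe(load_page_texts(paths), batch_size=50):  # hypothetical helper
        found.extend((ent.text, ent.label_) for ent in doc.ents)
    return found

if __name__ == "__main__":
    chunks = chunked(all_page_paths, n_chunks=8)  # hypothetical helper
    with Pool(processes=8) as pool:
        results = pool.map(process_chunk, chunks)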
It's actually usually better to run multiple smaller VMs instead of one large VM. It's annoying operationally, but it means less resource sharing.
I have a large text, and I want to parse it and identify the names (e.g. Wikipedia entries) that exist within it.
I thought of using regular expression, something like:
pattern='New York|Barak Obama|Russian Federation|Olympic Games'
re.findall(pattern,text)
... etc, but this would be millions of characters long, and re doesn't accept that...
The other way I thought about was to tokenize my text and search wikipedia entries for each token, but this doesn't look very efficient, especially if my text is too big...
Any ideas how to do this in Python?
Another way would be to get all Wikipedia articles and pages and then use the sentence tokenizer from NLTK.
Put the created sentences, sentence by sentence, into a Lucene index, so that each sentence represents its own "document" in the Lucene index.
Then you can look up, for example, all sentences containing "Barak Obama" to find patterns in the sentences.
Access to Lucene is pretty fast; I myself use a Lucene index containing over 42,000,000 sentences from Wikipedia.
To get a clean Wikipedia text file, you can download Wikipedia as an XML file from here: http://en.wikipedia.org/wiki/Wikipedia:Database_download
and then use the WikipediaExtractor from the Università di Pisa.
http://medialab.di.unipi.it/wiki/Wikipedia_Extractor
I would use NLTK to tokenize the text and look for valid Wikipedia entries among the tokens. If you don't want to store the whole text in memory, you can work line by line or in chunks of text.
Do you have to do this with Python? grep --fixed-strings is a good fit for what you want to do, and should do it fairly efficiently: http://www.gnu.org/savannah-checkouts/gnu/grep/manual/grep.html#index-g_t_0040command_007bgrep_007d-programs-175
If you want to do it in pure Python, you'll probably have a tough time getting faster than:
for name in articles:
    if name in text:
        print('found name')
The algorithm used by fgrep is called the Aho-Corasick algorithm, but a pure Python implementation of it is likely to be slow.
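If you do want to stay in Python, there are C-backed Aho-Corasick bindings; here is a sketch using the pyahocorasick package (my own suggestion, not something mentioned above, so treat the details as an assumption):

import ahocorasick  # pip install pyahocorasick

automaton = ahocorasick.Automaton()
for name in articles:
    automaton.add_word(name, name)
automaton.make_automaton()

# count every occurrence of every known phrase in a single pass over the text
counts = {}
for end_index, name in automaton.iter(text):
    counts[name] = counts.get(name, 0) + 1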
The Gensim library has a threaded iterator for the ~13GB Wikipedia dump. So if you're after specific terms (n-grams), you can write a custom regex and process each article of text. It may take a day of CPU power to do the search.
You may need to adjust the library if you're after the URI source.