I wrote a couple of user-defined functions to remove named entities (using NLTK) in Python from a list of text sentences/paragraphs. The problem I'm having is that my method is very slow, especially for large amounts of data. Does anyone have a suggestion for how to optimize this to make it run faster?
import nltk
import string

# Function to reverse tokenization
def untokenize(tokens):
    return "".join([" " + i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()

# Remove named entities
def ne_removal(text):
    tokens = nltk.word_tokenize(text)
    chunked = nltk.ne_chunk(nltk.pos_tag(tokens))
    tokens = [leaf[0] for leaf in chunked if type(leaf) != nltk.Tree]
    return untokenize(tokens)
To use the code, I typically have a text list and call the ne_removal function through a list comprehension. Example below:
text_list = ["Bob Smith went to the store.", "Jane Doe is my friend."]
named_entities_removed = [ne_removal(text) for text in text_list]
print(named_entities_removed)
## OUT: ['went to the store.', 'is my friend.']
UPDATE: I tried switching to a batch version with this code, but it's only slightly faster. Will keep exploring. Thanks for the input so far.
def extract_nonentities(tree):
    tokens = [leaf[0] for leaf in tree if type(leaf) != nltk.Tree]
    return untokenize(tokens)

def fast_ne_removal(text_list):
    token_list = [nltk.word_tokenize(text) for text in text_list]
    tagged = nltk.pos_tag_sents(token_list)
    chunked = nltk.ne_chunk_sents(tagged)
    non_entities = []
    for tree in chunked:
        non_entities.append(extract_nonentities(tree))
    return non_entities
Every time you call ne_chunk(), it needs to initialize a chunker object and load the statistical model for chunking from disk. Ditto for pos_tag(). So instead of calling them on one sentence at a time, call their batch versions on the complete list of texts:
all_data = [ nltk.word_tokenize(sent) for sent in list_of_all_sents ]
tagged = nltk.pos_tag_sents(all_data)
chunked = nltk.ne_chunk_sents(tagged)
This should give you a considerable speed-up. If that's still too slow for your needs, try profiling your code and consider whether you need to switch to more high-powered tools, as @Lenz suggested.
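If you do profile, a quick way to see where the time goes is the standard-library profiler. A minimal sketch, assuming the fast_ne_removal and text_list definitions from above:

import cProfile
import pstats

# profile one batch run and print the ten most expensive calls by cumulative time
cProfile.run('result = fast_ne_removal(text_list)', 'ne_profile')
pstats.Stats('ne_profile').sort_stats('cumulative').print_stats(10)

That usually makes it obvious whether tokenization, tagging, or chunking dominates the runtime.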
I have to analyse a large text dataset using spaCy. The dataset contains about 120,000 records with a typical text length of about 1,000 words. Lemmatizing the text takes quite some time, so I looked for methods to reduce it. This article describes how to speed up the computation using joblib. That works reasonably well: 16 cores reduce the CPU time by a factor of 10, and hyperthreading cuts it by an extra 7%.
Recently I realized that I wanted to compute similarities between docs and probably more analyses with docs later on. So I decided to generate a Spacy document instance (<class 'spacy.tokens.doc.Doc'>) for all documents and use that for analyses (lemmatizing, vectorizing, and probably more) later on. This is where the trouble started.
The analysis done by the parallel lemmatizer takes place in the function below:
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc if tok.is_alpha]
    return lemma_list
(The full demo code can be found at the end of the post.) All I have to do is return doc instead of lemma_list and I'm done. Or so I thought.
def lemmatize_pipe(doc):
    return doc
The sequential version runs in 73 seconds, the parallel version returning lemma_list takes 7 seconds, while the version returning doc runs in 127 seconds: almost twice as long as the sequential version. Full code below.
import time
import pandas as pd
from joblib import Parallel, delayed
import gensim.downloader as api
import spacy
from pdb import set_trace as breakpoint
# Initialize spacy with the small english language model
nlp = spacy.load('en', disable=['parser', 'ner', 'tagger'])
nlp.add_pipe(nlp.create_pipe('sentencizer'))
# Import the dataset and get the text
dataset = api.load("text8")
data = [d for d in dataset]
doc_requested = False
print(len(data), 'documents in original data')
df_data = pd.DataFrame(columns=['content'])
df_data['content'] = df_data['content'].astype(str)
# Content is a list of words; convert it to strings
for doc in data:
    sentence = ' '.join([word for word in doc])
    df_data.loc[len(df_data)] = [sentence]
### === Sequential processing ===
def lemmatize(text):
    doc = nlp(text)
    lemma_list = [str(tok.lemma_).lower() for tok in doc if tok.is_alpha]
    return doc if doc_requested else lemma_list
cpu = time.time()
df_data['sequential'] = df_data['content'].apply(lemmatize)
print('\nSequential processing in {:.0f} seconds'.format(time.time() - cpu))
df_data.head(3)
### === Parallel processing ===
def lemmatize_pipe(doc):
    lemma_list = [str(tok.lemma_).lower() for tok in doc if tok.is_alpha]
    return doc if doc_requested else lemma_list

def chunker(iterable, total_length, chunksize):
    return (iterable[pos: pos + chunksize] for pos in range(0, total_length, chunksize))

def process_chunk(texts):
    preproc_pipe = []
    for doc in nlp.pipe(texts, batch_size=20):
        preproc_pipe.append(lemmatize_pipe(doc))
    return preproc_pipe

def preprocess_parallel(data, chunksize):
    executor = Parallel(n_jobs=31, backend='multiprocessing', prefer="processes")
    do = delayed(process_chunk)
    tasks = (do(chunk) for chunk in chunker(data, len(data), chunksize=chunksize))
    result = executor(tasks)
    flattened = [item for sublist in result for item in sublist]
    return flattened
cpu = time.time()
df_data['parallel'] = preprocess_parallel(df_data['content'], chunksize=1)
print('\nParallel processing in {:.0f} seconds'.format(time.time() - cpu))
I have searched and tried all kinds of things but could not find a solution. In the end I decided to compute the similarities together with the lemmas, but that is a workaround. What is actually causing the time increase? And is there a way to get the docs back without losing that much time?
A pickled doc is quite large and contains a lot of data that isn't needed to reconstruct the doc itself, including the entire model vocab. Using doc.to_bytes() will be a major improvement, and you can improve it a bit more by using exclude to exclude data that you don't need, like doc.tensor:
data = doc.to_bytes(exclude=["tensor"])
...
reloaded_doc = Doc(nlp.vocab)
reloaded_doc.from_bytes(data)
To compare:
doc = nlp("test")
len(pickle.dumps(doc)) # 1749721
len(pickle.dumps(doc.to_bytes())) # 750
len(pickle.dumps(doc.to_bytes(exclude=["tensor"]))) # 316
You can also use doc.to_array() instead of doc.to_bytes() to export only the annotation layers that you need, but reloading the doc from the array is slightly more complicated.
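For illustration, a minimal sketch of the array route, assuming the nlp object from the question and an arbitrary choice of four attributes:

from spacy.attrs import LOWER, POS, ENT_TYPE, IS_ALPHA
from spacy.tokens import Doc

doc = nlp("This is a test.")
# export only the selected attributes as a numpy array of integer IDs
np_array = doc.to_array([LOWER, POS, ENT_TYPE, IS_ALPHA])
# reconstructing needs the words as well, since the array stores no text
doc2 = Doc(doc.vocab, words=[t.text for t in doc])
doc2.from_array([LOWER, POS, ENT_TYPE, IS_ALPHA], np_array)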
See:
https://spacy.io/usage/saving-loading#docs
https://spacy.io/api/doc#serialization-fields
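Applied to the question's code, a rough sketch could serialize in the workers and rebuild the docs in the parent. Here process_chunk_bytes and preprocess_parallel_docs are hypothetical names, and the nlp object and chunker helper are assumed from the question:

from joblib import Parallel, delayed
from spacy.tokens import Doc

def process_chunk_bytes(texts):
    # in the worker: return compact byte strings instead of full Doc objects
    return [doc.to_bytes(exclude=["tensor"])
            for doc in nlp.pipe(texts, batch_size=20)]

def preprocess_parallel_docs(data, chunksize):
    executor = Parallel(n_jobs=31, backend='multiprocessing', prefer="processes")
    tasks = (delayed(process_chunk_bytes)(chunk)
             for chunk in chunker(data, len(data), chunksize=chunksize))
    results = executor(tasks)
    # in the parent: rebuild the Doc objects from the byte strings
    return [Doc(nlp.vocab).from_bytes(b) for chunk in results for b in chunk]

The bytes still have to be pickled to cross the process boundary, but as the numbers above show, they are orders of magnitude smaller than a pickled Doc.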
Is there a way in Python to map documents belonging to a certain topic, for example a list of documents that are primarily "Topic 0"? I know there are ways to list topics for each document, but how do I do it the other way around?
Edit:
I am using the following script for LDA:
doc_set = []
for file in files:
    newpath = (os.path.join(my_path, file))
    newpath1 = textract.process(newpath)
    newpath2 = newpath1.decode("utf-8")
    doc_set.append(newpath2)

texts = []
for i in doc_set:
    raw = i.lower()
    tokens = tokenizer.tokenize(raw)
    stopped_tokens = [i for i in tokens if not i in stopwords.words()]
    stemmed_tokens = [p_stemmer.stem(i) for i in stopped_tokens]
    texts.append(stemmed_tokens)

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
ldamodel = gensim.models.ldamodel.LdaModel(corpus, num_topics=2, random_state=0, id2word=dictionary, passes=1)
You've got a tool/API (Gensim LDA) that, when given a document, gives you a list of topics.
But you want the reverse: a list of documents, for a topic.
Essentially, you'll want to build the reverse-mapping yourself.
Fortunately Python's native dicts & idioms for working with mappings make this pretty simple - just a few lines of code - as long as you're working with data that fully fits in memory.
Very roughly the approach would be:
create a new structure (dict or list) for mapping topics to lists-of-documents
iterate over all docs, adding them (perhaps with scores) to that topic-to-docs mapping
finally, look up (& perhaps sort) those lists-of-docs, for each topic of interest
If your question could be edited to include more information about the format/IDs of your documents/topics, and how you've trained your LDA model, this answer could be expanded with more specific example code to build the kind of reverse-mapping you'd need.
Update for your code update:
OK, if your model is in ldamodel and your BOW-formatted docs in corpus, you'd do something like:
# setup: get the model's topics in their native ordering...
all_topics = ldamodel.print_topics()
# ...then create an empty list per topic to collect the docs:
docs_per_topic = [[] for _ in all_topics]
# now, for every doc...
for doc_id, doc_bow in enumerate(corpus):
    # ...get its topics...
    doc_topics = ldamodel.get_document_topics(doc_bow)
    # ...& for each of its topics...
    for topic_id, score in doc_topics:
        # ...add the doc_id & its score to the topic's doc list
        docs_per_topic[topic_id].append((doc_id, score))
After this, you can see the list of all (doc_id, score) values for a certain topic like this (for topic 0):
print(docs_per_topic[0])
If you're interested in the top docs per topic, you can further sort each list's pairs by their score:
for doc_list in docs_per_topic:
    doc_list.sort(key=lambda id_and_score: id_and_score[1], reverse=True)
Then, you could get the top-10 docs for topic 0 like:
print(docs_per_topic[0][:10])
Note that this does everything using all-in-memory lists, which might become impractical for very large corpora. In some cases, you might need to compile the per-topic listings into disk-backed structures, like files or a database.
I have a DataFrame that has a text column. I am splitting the DataFrame into two parts based on the value in another column. One of those parts is indexed into a gensim similarity model. The other part is then fed into the model to find the indexed text that is most similar. This involves a couple of search functions to enumerate over each item in the indexed part. With the toy data, it is fast, but with my real data, it is much too slow using apply. Here is the code example:
import pandas as pd
import gensim
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt')
d = {'number': [1,2,3,4,5], 'text': ['do you like python', 'do you hate python','do you like apples','who is nelson mandela','i am not interested'], 'answer':['no','yes','no','no','yes']}
df = pd.DataFrame(data=d)
df_yes = df[df['answer']=='yes']
df_no = df[df['answer']=='no']
df_no = df_no.reset_index()
docs = df_no['text'].tolist()
genDocs = [[w.lower() for w in word_tokenize(text)] for text in docs]
dictionary = gensim.corpora.Dictionary(genDocs)
corpus = [dictionary.doc2bow(genDoc) for genDoc in genDocs]
tfidf = gensim.models.TfidfModel(corpus)
sims = gensim.similarities.MatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
def search(row):
    query = [w.lower() for w in word_tokenize(row)]
    query_bag_of_words = dictionary.doc2bow(query)
    query_tfidf = tfidf[query_bag_of_words]
    return query_tfidf

def searchAll(row):
    max_similarity = max(sims[search(row)])
    index = [i for i, j in enumerate(sims[search(row)]) if j == max_similarity]
    return max_similarity, index
df_yes = df_yes.copy()
df_yes['max_similarity'], df_yes['index'] = zip(*df_yes['text'].apply(searchAll))
I have tried converting the operations to dask dataframes to no avail, as well as python multiprocessing. How would I make these functions more efficient? Is it possible to vectorize some/all of the functions?
Your code's intent and operation are not entirely clear. Assuming it works, explaining the ultimate goal and showing more example data, more example queries, and the desired results in your question could help.
Perhaps it could be improved to not repeat certain operations over and over. Some ideas could include (a rough sketch of the last two follows after the note below):
only tokenize each row once, and cache the tokenization
only doc2bow() each row once, and cache the BOW representation
don't call sims[search(row)] twice inside searchAll()
don't iterate twice – once to find the max, then again to find the index – but just once
(More generally, though, efficient text keyword search often uses specialized reverse-indexes for efficiency, to avoid a costly iteration over every document.)
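As a rough sketch of the last two ideas: search_all_fast below is a hypothetical replacement for searchAll and assumes the dictionary, tfidf, and sims objects built above.

import numpy as np
from nltk.tokenize import word_tokenize

def search_all_fast(row):
    # tokenize and query the similarity index exactly once per row
    query_tfidf = tfidf[dictionary.doc2bow([w.lower() for w in word_tokenize(row)])]
    scores = sims[query_tfidf]        # numpy array of similarities, one pass
    best = int(np.argmax(scores))     # position of the most similar indexed text
    return scores[best], best

Unlike the original, this returns only the first best-matching index; if ties matter, np.flatnonzero(scores == scores.max()) recovers all of them.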
I'm trying to replicate Go et al.'s Twitter sentiment analysis, which can be found here: http://help.sentiment140.com/for-students
The problem I'm having is that the number of features is 364,464. I'm currently using nltk and nltk.NaiveBayesClassifier to do this, where tweets holds a replication of the 1,600,000 tweets and their polarity:
for tweet in tweets:
    tweet[0] = extract_features(tweet[0], features)

classifier = nltk.NaiveBayesClassifier.train(training_set)
# print "NB Classified"
classifier.show_most_informative_features()
print(nltk.classify.util.accuracy(classifier, testdata))
Nothing takes very long apart from the extract_features function:
def extract_features(tweet, featureList):
    tweet_words = set(tweet)
    features = {}
    for word in featureList:
        features['contains(%s)' % word] = (word in tweet_words)
    return features
This is because, for each tweet, it creates a dictionary with 364,464 entries to record whether each feature word is present or not.
Is there a way to make this faster or more efficient without reducing the number of features like in this paper?
Turns out there is a wonderful function called:
nltk.classify.util.apply_features()
which you can find here: http://www.nltk.org/api/nltk.classify.html
training_set = nltk.classify.apply_features(extract_features, tweets)
I had to change my extract_features function, but it now handles the huge feature set without memory issues.
Here's the gist of the function's description:
The primary purpose of this function is to avoid the memory overhead involved in storing all the featuresets for every token in a corpus. Instead, these featuresets are constructed lazily, as-needed. The reduction in memory overhead can be especially significant when the underlying list of tokens is itself lazy (as is the case with many corpus readers).
and my changed function:
def extract_features(tweet):
    tweet_words = set(tweet)
    global featureList
    features = {}
    for word in featureList:
        features[word] = False
    for word in tweet_words:
        if word in featureList:
            features[word] = True
    return features
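For reference, a small sketch of the lazy behaviour described above, assuming tweets is the list of (token_list, label) pairs and nltk is imported:

training_set = nltk.classify.apply_features(extract_features, tweets)
print(len(training_set))    # the length is known without building anything
first = training_set[0]     # this featureset is only constructed on access
classifier = nltk.NaiveBayesClassifier.train(training_set)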
For a project I would like to measure the number of ‘human centered’ words within a text. I plan on doing this using WordNet. I have never used it and I am not quite sure how to approach this task. I want to use WordNet to count the number of words that belong to certain synsets, for example the synsets ‘human’ and ‘person’.
I came up with the following (simple) piece of code:
word = 'girlfriend'
word_synsets = wn.synsets(word)[0]
hypernyms = word_synsets.hypernym_paths()[0]
for element in hypernyms:
    print element
Results in:
Synset('entity.n.01')
Synset('physical_entity.n.01')
Synset('causal_agent.n.01')
Synset('person.n.01')
Synset('friend.n.01')
Synset('girlfriend.n.01')
My first question is, how do I properly iterate over the hypernyms? In the code above it prints them just fine. However, when using an ‘if’ statement, for example:
count_humancenteredness = 0
for element in hypernyms:
    if element == 'person':
        print 'found person hypernym'
        count_humancenteredness += 1
I get ‘AttributeError: 'str' object has no attribute '_name'’. What method can I use to iterate over the hypernyms of my word and perform an action (e.g. increase the count of human centeredness) when a word does indeed belong to the ‘person’ or ‘human’ synset?
Secondly, is this an efficient approach? I assume that iterating over several texts and over the hypernyms of each noun will take quite some time. Perhaps there is another way to use WordNet to perform my task more efficiently.
Thanks for your help!
wrt the error message
hypernyms = word_synsets.hypernym_paths() returns a list of lists of Synsets.
Hence
if element == 'person':
tries to compare a Synset object against a string. That kind of comparison is not supported by the Synset class.
Try something like
target_synsets = wn.synsets('person')
if element in target_synsets:
...
or
if u'person' in element.lemma_names():
...
instead.
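Putting one of those checks back into the question's loop, a minimal sketch (assuming from nltk.corpus import wordnet as wn) would be:

from nltk.corpus import wordnet as wn

person = wn.synset('person.n.01')    # the noun sense of "person"
hypernyms = wn.synsets('girlfriend')[0].hypernym_paths()[0]

count_humancenteredness = 0
for element in hypernyms:
    if element == person:            # Synset-to-Synset comparison works fine
        count_humancenteredness += 1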
wrt efficiency
Currently, you do a hypernym-lookup for every word inside your input text. As you note, this is not necessarily efficient. However, if this is fast enough, stop here and do not optimize what is not broken.
To speed up the lookup, you can pre-compile a list of "person related" words in advance by making use of the transitive closure over the hyponyms as explained here.
Something like
person = wn.synset('person.n.01')   # the synset whose hyponyms we collect
person_words = set(w for s in person.closure(lambda s: s.hyponyms()) for w in s.lemma_names())
should do the trick. This will return a set of ~ 10,000 words, which is not too much to store in main memory.
A simple version of the word counter then becomes something along the lines of
from collections import Counter

word_count = Counter()
# `words` is assumed to hold the tokens of the input text
for word in (w.lower() for w in words if w.lower() in person_words):
    word_count[word] += 1
You might also need to pre-process the input words using stemming or other morphologic reductions before passing the words on to WordNet, though.
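A minimal sketch of such a normalization step, assuming person_words from above and that words holds the tokens of the input text:

from collections import Counter
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()
# map inflected forms ("girlfriends" -> "girlfriend") onto WordNet noun lemmas first
normalized = (lemmatizer.lemmatize(w.lower()) for w in words)
word_count = Counter(w for w in normalized if w in person_words)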
To get all the hyponyms of a synset, you can use the following function (tested with NLTK 3.0.3, dhke's closure trick doesn't work on this version):
def get_hyponyms(synset):
    hyponyms = set()
    for hyponym in synset.hyponyms():
        hyponyms |= set(get_hyponyms(hyponym))
    return hyponyms | set(synset.hyponyms())
Example:
from nltk.corpus import wordnet
food = wordnet.synset('food.n.01')
print(len(get_hyponyms(food))) # returns 1526