Faster NER extraction using SpaCy and Pandas - python

I have a df with a column that contains comments, from which I want to extract the organisations mentioned. This article provides a great approach, but it is too slow for my problem. The df I am using has over 1,000,000 rows, and I am working in a Google Colab notebook.
Currently my approach is (from the linked article):
def get_orgs(text):
    # process the text with our SpaCy model to get named entities
    doc = nlp(text)
    # initialize list to store identified organizations
    org_list = []
    # loop through the identified entities and append ORG entities to org_list
    for entity in doc.ents:
        if entity.label_ == 'ORG':
            org_list.append(entity.text)
    # if an organization is identified more than once it will appear multiple times in the list
    # we use set() to remove duplicates, then convert back to a list
    org_list = list(set(org_list))
    return org_list
df['organizations'] = df['body'].apply(get_orgs)
Is there a faster way to process this? And would you advise applying it to a Pandas df, or are there better/faster alternatives?

There are a couple of things you can do in general to speed up spaCy. There's a section in the docs on this.
The first thing to try is creating docs in a pipe. You'll need to be a little creative to get this working with a dataframe:
org_lists = []
for doc in nlp.pipe(iter(df['body'])):
    org_lists.append(...)  # do your processing here
# now you can add a column to your dataframe
The other thing is to disable components you aren't using. Since it looks like you're only using NER you can do this:
for doc in nlp.pipe(texts, disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"]):
    # Do something with the doc here
Those together should give you a significant speedup.
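For example, a rough sketch of both combined (assuming df['body'] holds plain strings and nlp is a loaded pipeline such as en_core_web_sm):

org_lists = []
# stream the texts through the pipeline, keeping only the components NER needs
for doc in nlp.pipe(df['body'], disable=["tok2vec", "tagger", "parser", "attribute_ruler", "lemmatizer"], batch_size=200):
    # collect the unique ORG entities for this document
    org_lists.append(list({ent.text for ent in doc.ents if ent.label_ == 'ORG'}))
# nlp.pipe preserves order, so the results line up with the original rows
df['organizations'] = org_lists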

Related

How to set entity information for token which is included in more than one span in entities in SpaCy?

I'm a spaCy beginner who is doing samples for learning purposes, and I have referred to an article on how to create an address parser using spaCy.
My tutorial datasheet is as follows,
which runs perfectly.
Then I created my own data set, which contains addresses in Denmark,
but when I run the training command there is an error:
ValueError: [E1010] Unable to set entity information for token 1 which is included in more than one span in entities, blocked, missing, or outside.
As per the questions asked on StackOverflow and other platforms, the reason for the error is duplicate words in a span:
[18, Mbl Denmark A/S, Glarmestervej, 8600, Silkeborg, Denmark]
Recipient contains the word "Denmark" and Country contains the word "Denmark".
Can anyone suggest a solution to fix this?
Code to create the DocBin object for building the training/test sets:
from spacy.tokens import DocBin

db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)  # construct a Doc object
    ents = []
    for start, end, label in annotations:
        span = doc.char_span(start, end, label=label)
        ents.append(span)
    doc.ents = ents
    db.add(doc)
In general, entities can't be nested or overlapping, and if you have data like that you have to decide what kind of output you want.
If you actually want nested or overlapping annotations, you can use the spancat, which supports that.
In this case though, "Denmark" in "Mbl Denmark" is not really interesting and you probably don't want to annotate it. I would recommend you use filter_spans on your list of spans before assigning it to the Doc. filter_spans will take the longest (or first) span of any overlapping spans, resulting in a list of non-overlapping spans, which you can use for normal entity annotations.
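For example, a sketch of the DocBin loop from the question with filter_spans added (assuming the same training_data format; spans that don't align to token boundaries come back as None from char_span and are dropped as well):

from spacy.tokens import DocBin
from spacy.util import filter_spans

db = DocBin()
for text, annotations in training_data:
    doc = nlp(text)  # construct a Doc object
    spans = [doc.char_span(start, end, label=label) for start, end, label in annotations]
    spans = [span for span in spans if span is not None]  # drop misaligned spans
    doc.ents = filter_spans(spans)  # keeps the longest span where two overlap
    db.add(doc)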

spaCy doc.ents is inconsistent when running on similar data

I have a dataframe with a column called "titleFinal". It's the title of a table in a PDF. My current goal is to pull entities from the table titles and then later use them for analysis. I plan on training a model over my data (by creating a list of entities), but so far I've used the base NER. Unfortunately, when I look at what doc.ents extracts, it doesn't seem consistent at all and I'm not sure if I did something wrong or the model is simply extracting entities poorly.
I started with the small model, but there was a noticeable improvement when I switched to the large model. I'm not seeing as many inconsistencies. However, they are still there. For example:
Table 21-9 Cumulative Effects Initial Screening -> [(21)]
Table 21-18 Cumulative Effects Initial Screening – Human Occupancy and Resources -> [(21), (Cumulative, Effects, Initial, Screening, –, Human, Occupancy, and, Resources)]
These inconsistencies happen quite frequently throughout the list of entities so I'm wondering what I can do to resolve this. Is this expected?
Unfortunately, I can't share the dataset yet, but here's the code I'm currently using:
nlp = spacy.load("en_core_web_lg")  # large model for production

tokens = []
lemma = []
ents = []

for doc in nlp.pipe(df['titleFinal'].astype('unicode').values, batch_size=50, n_threads=3):
    if doc.is_parsed:
        tokens.append([n.text for n in doc])
        lemma.append([n.lemma_ for n in doc])
        ents.append([e for e in doc.ents])
    else:
        # We want to make sure that the lists of parsed results have the
        # same number of entries as the original DataFrame, so add some
        # blanks in case the parse fails
        tokens.append(None)
        lemma.append(None)
        ents.append(None)

df['titleFinal_tokens'] = tokens
df['titleFinal_lemma'] = lemma
df['titleFinal_ents'] = ents
Is my approach wrong here?
Info
spaCy version: 2.2.4
Platform: Linux-4.19.104+-x86_64-with-Ubuntu-18.04-bionic
Python version: 3.6.9
Models: en

Vectorized form of cleaning function for NLP

I made the following function to clean the text notes of my dataset:
import spacy
nlp = spacy.load("en")

def clean(text):
    """
    Text preprocessing for English text
    """
    # Apply spaCy to the text
    doc = nlp(text)
    # Lemmatization, removal of noise (stopwords, digits, punctuation and single characters)
    tokens = [token.lemma_.strip() for token in doc if
              not token.is_stop and not nlp.vocab[token.lemma_].is_stop  # remove stopwords
              and not token.is_punct  # remove punctuation
              and not token.is_digit  # remove digits
              ]
    # Recreate the text
    text = " ".join(tokens)
    return text.lower()
The problem is that cleaning the whole dataset takes hours (my dataset is 70k rows, with between 100 and 5,000 words per row).
I tried to use swifter to run the apply method on multiple threads like this: data.note_line_comment.swifter.apply(clean)
But it didn't really help, as it still took almost an hour.
I was wondering if there is any way to make a vectorized form of my function, or maybe another way to speed up the process. Any ideas?
Short answer
This type of problem inherently takes time.
Long answer
Use regular expressions
Change the spacy pipeline
The more information about the strings you need to make a decision, the longer it will take.
The good news is that if your cleaning of the text is relatively simple, a few regular expressions might do the trick.
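For example, if the goal is only lowercasing plus stripping digits and punctuation, a sketch like this (no lemmatisation or stopword removal) will be far faster than running the full pipeline:

import re

def clean_fast(text):
    # keep letters and whitespace only, collapse repeated whitespace, lowercase
    text = re.sub(r"[^A-Za-z\s]", " ", text)
    return re.sub(r"\s+", " ", text).strip().lower()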
Otherwise you are using the spaCy pipeline to help remove bits of text, which is costly since it does many things by default:
Tokenisation
Lemmatisation
Dependency parsing
NER
Chunking
Alternatively, you can try your task again and turn off the aspects of the spaCy pipeline you don't want, which may speed it up quite a bit.
For example, maybe turn off named entity recognition, tagging and dependency parsing...
nlp = spacy.load("en", disable=["parser", "tagger", "ner"])
Then try again; it should speed up.
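As a sketch, the question's clean() logic could also be streamed through nlp.pipe with the unused components disabled (assuming the same data.note_line_comment column; the new column name note_clean is just an example). Here only the parser and NER are turned off, since keeping the tagger tends to give the lemmatizer better lemmas in this version of spaCy; drop it as well if lookup lemmas are good enough:

texts = data.note_line_comment.astype(str).tolist()
cleaned = []
for doc in nlp.pipe(texts, disable=["parser", "ner"], batch_size=1000):
    # same filtering as clean(): drop stopwords, punctuation and digits, keep lemmas
    tokens = [token.lemma_.strip() for token in doc
              if not token.is_stop and not token.is_punct and not token.is_digit]
    cleaned.append(" ".join(tokens).lower())
data["note_clean"] = cleaned  # hypothetical output column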

Efficient way to get data from lotus notes view

I am trying to get all the data from a view (Lotus Notes) with LotusScript and Python (the noteslib module) and export it to csv, but the problem is that this takes too much time. I have tried two ways of looping through all documents:
import noteslib

db = noteslib.Database('database', 'file.nsf')
view = db.GetView('My View')
doc = view.GetFirstDocument()
data = list()
while doc:
    data.append(doc.ColumnValues)
    doc = view.GetNextDocument(doc)
Getting about 1,000 lines of data took me 70 seconds, but the view has about 85,000 lines, so getting all the data would take far too long, especially since manually exporting everything to csv via File->Export in Lotus Notes takes only about 2 minutes.
I also tried a second way with AllEntries, but it was even slower:
database = []
ec = view.AllEntries
ent = ec.GetFirstEntry()
while ent:
    row = []
    for v in ent.ColumnValues:
        row.append(v)
    database.append(row)
    ent = ec.GetNextEntry(ent)
Everything that I found on the Internet is based on "NextDocument" or "AllEntries". Is there any way to do it faster?
It is (or at least used to be) very expensive from a time standpoint to open a Notes document, like you are doing in your code.
Since you are saying that you want to export the data that is being displayed in the view, you could use the NotesViewEntry class instead. It should be much faster.
Set col = view.AllEntries
Set entry = col.GetFirstEntry()
Do Until entry Is Nothing
    values = entry.ColumnValues '*** Array of column values
    '*** Do stuff here
    Set entry = col.GetNextEntry(entry)
Loop
I wrote a blog about this back in 2013:
http://blog.texasswede.com/which-is-faster-columnvalues-or-getitemvalue/
Something is going on with your code "outside" the view navigation: you already chose the most performant way to navigate a view, using "GetFirstDocument" and "GetNextDocument". Using the NotesViewNavigator as mentioned in the comments will be slightly better, but not significantly so.
You might get a little bit of performance out of your code by setting view.AutoUpdate = False to prevent the view object from refreshing when something in the backend changes. But as you only read data and do not change view data, that will not give you much of a performance boost.
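In the Python/noteslib code from the question, that would look roughly like this (a sketch; AutoUpdate is the standard NotesView property, assumed here to be settable through the COM object that noteslib hands back):

view = db.GetView('My View')
view.AutoUpdate = False  # stop the view from refreshing while we read it
doc = view.GetFirstDocument()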
My suggestion: Identify the REAL bottleneck of your code by commenting out single sections to find out when it starts to get slower:
First attempt:
while doc:
    doc = view.GetNextDocument(doc)
Slow?
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    doc = view.GetNextDocument(doc)
Slow?
If yes: ColumnValues is your enemy...
If not then next attempt:
while doc:
    arr = doc.ColumnValues
    data.append(arr)
    doc = view.GetNextDocument(doc)
I would be very interested to get your results of where it starts to become slow.
I would suspect the performance issue is using COM/ActiveX in Python to access Notes databases. Transferring data via COM involves datatype 'marshalling', possibly at every step, and especially for 'out-of-process' method/property calls.
I don't think there is any way around this in COM. You should consider arranging a Notes 'agent' to do this for you instead (LotusScript or Java, maybe). Even a basic LotusScript agent can export thousands of docs per minute. A further alternative may be to look at the Notes C-API (not an easy option, and it requires API calls from Python).

Unable to run Stanford Core NLP annotator over whole data set

I have been trying to use Stanford Core NLP over a data set but it stops at certain indexes which I am unable to find.
The data set is available on Kaggle: https://www.kaggle.com/PromptCloudHQ/amazon-reviews-unlocked-mobile-phones/data
This is a function that outputs the sentiment of a paragraph by taking the mean sentiment value of individual sentences.
import json

def funcSENT(paragraph):
    all_scores = []
    output = nlp.annotate(paragraph, properties={
        "annotators": "tokenize,ssplit,parse,sentiment",
        "outputFormat": "json",
        # Only split the sentence at End Of Line. We assume that this method only takes in one single sentence.
        #"ssplit.eolonly": "true",
        # Setting enforceRequirements to skip some annotators and make the process faster
        "enforceRequirements": "false"
    })
    all_scores = []
    for i in range(0, len(output['sentences'])):
        all_scores.append(int(json.loads(output['sentences'][i]['sentimentValue'])) + 1)
    final_score = sum(all_scores) / len(all_scores)
    return round(final_score)
Now I run this function for every review in the 'Reviews' column:
import pandas as pd
from pandas import *

data_file = 'C:\\Users\\SONY\\Downloads\\Amazon_Unlocked_Mobile.csv'
data = pd.read_csv(data_file)

i = 0
my_reviews = data['Reviews'].tolist()
senti = []
while i < data.shape[0]:
    senti.append(funcSENT(my_reviews[i]))
    i = i + 1
But somehow I get this error and I am not able to find the problem. It's been many hours now; kindly help.
[1]: https://i.stack.imgur.com/qFbCl.jpg
How to avoid this error?
As I understand, you're using pycorenlp with nlp=StanfordCoreNLP(...) and a running StanfordCoreNLP server. I won't check the data you are using since it appears to require a Kaggle account.
Running with the same setup but different paragraph shows that printing "output" alone shows an error from the java server, in my case:
java.util.concurrent.ExecutionException: java.lang.IllegalArgumentException: Input word not tagged
I THINK that because there is no part-of-speech annotator, the server cannot perform the parsing. Whenever you use parse or depparse, I think you need to have the "pos" annotator as well.
I am not sure what the sentiment annotator needs, but you may need other annotators such as "lemma" to get good sentiment results.
Print output by itself. If you get the same java error, try adding the "pos" annotator to see if you get the expected json. Otherwise, try to give a simpler example, using your own small dataset maybe, and comment or adjust your question.
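For example, the annotate call from the question with "pos" added to the annotators (a sketch, not tested against the Kaggle data):

output = nlp.annotate(paragraph, properties={
    "annotators": "tokenize,ssplit,pos,parse,sentiment",
    "outputFormat": "json",
    "enforceRequirements": "false"
})
print(output)  # inspect the raw server response before indexing into it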
