Vectorized form of cleaning function for NLP - python

I made the following function to clean the text notes of my dataset:
import spacy

nlp = spacy.load("en")

def clean(text):
    """
    Text preprocessing for English text
    """
    # Apply spaCy to the text
    doc = nlp(text)
    # Lemmatization and removal of noise (stop words, digits, punctuation and single characters)
    tokens = [token.lemma_.strip() for token in doc if
              not token.is_stop and not nlp.vocab[token.lemma_].is_stop  # remove stop words
              and not token.is_punct  # remove punctuation
              and not token.is_digit  # remove digits
              ]
    # Recreate the text
    text = " ".join(tokens)
    return text.lower()
The problem is that when I want to clean all the text in my dataset, it takes hours (my dataset has 70k rows, with between 100 and 5,000 words per row).
I tried to use swifter to run the apply method on multiple threads, like this: data.note_line_comment.swifter.apply(clean)
But it didn't really help, as it still took almost an hour.
I was wondering if there is a way to make a vectorized form of my function, or some other way to speed up the process. Any ideas?

Short answer
This type of problem inherently takes time.
Long answer
Use regular expressions
Change the spacy pipeline
The more information you need about the strings in order to make a decision, the longer it will take.
The good news is that if your cleaning of the text is relatively simple, a few regular expressions might do the trick.
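For example, a rough regex-only version of the cleaning (no lemmatization or stop-word removal, just a sketch of the idea) could look like this:
import re

def clean_fast(text):
    # Lowercase, then strip digits, punctuation, single characters and
    # extra whitespace using plain regular expressions
    text = text.lower()
    text = re.sub(r"\d+", " ", text)      # remove digits
    text = re.sub(r"[^\w\s]", " ", text)  # remove punctuation
    text = re.sub(r"\b\w\b", " ", text)   # remove single characters
    return re.sub(r"\s+", " ", text).strip()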
Otherwise you are using the spaCy pipeline to help remove bits of text, which is costly since it does many things by default:
Tokenisation
Lemmatisation
Dependency parsing
NER
Chunking
Alternatively, you can try your task again with the parts of the spaCy pipeline you don't need turned off, which may speed it up quite a bit.
For example, you could turn off named entity recognition, tagging and dependency parsing:
nlp = spacy.load("en", disable=["parser", "tagger", "ner"])
Then try again; it should be noticeably faster.
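For instance, here is a sketch of the same cleaning with the parser and NER disabled and the rows fed in batches through nlp.pipe (the tagger is kept here because lemma quality can depend on it; the column name note_line_comment comes from your question):
import spacy

# Assumes the "en" model from the question is installed
nlp = spacy.load("en", disable=["parser", "ner"])

def clean_batch(texts):
    cleaned = []
    # nlp.pipe processes the texts in batches, which is much faster than
    # calling nlp(text) once per row
    for doc in nlp.pipe(texts, batch_size=100):
        tokens = [token.lemma_.strip() for token in doc
                  if not token.is_stop
                  and not token.is_punct
                  and not token.is_digit]
        cleaned.append(" ".join(tokens).lower())
    return cleaned

# data["note_line_comment"] = clean_batch(data["note_line_comment"].tolist())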

Related

Using POS and PUNCT tokens in custom sentence boundaries in spaCy

I am trying to split sentences into clauses using spaCy for classification with MLlib. I have looked at two possible solutions that I consider the best way to approach this, but haven't had much luck.
Option 1: Use the tokens in the doc, i.e. those whose token.pos_ matches SCONJ, and split there as a sentence boundary.
Option 2: Create a list from whatever spaCy uses internally as the set of values it identifies as SCONJ.
The issue with option 1 is that I only have .text and .i, and no .pos_, as the custom boundaries (as far as I am aware) need to be set before the parser runs.
The issue with option 2 is that I can't seem to find that dictionary. It is also a really hacky approach.
import spacy
import deplacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

def set_sentence_breaks(doc):
    for token in doc:
        if token.pos_ == "SCONJ":
            doc[token.i].is_sent_start = True

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)
    # for token in doc:
    #     print(token.pos_)
    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
'we need to consider speed and reliability',
'so the use of a relational DB would be better than using SQLite.',
'Though it may take extra effort to convert #Bot']
I would recommend not trying to do anything clever with is_sent_start - while it is user-accessible, it's really not intended to be used in that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the string, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ is working for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            # Close off the chunk before the subordinating conjunction
            out.append(doc[last:tok.i])
            last = tok.i + 1
    # Append whatever is left of the sentence
    out.append(doc[last:sent[-1].i])
Alternatively, if that's not good enough, you can identify subsentences using the dependency parse: find the verbs of subsentences (by their relation to a SCONJ, for example), save those subsentences, and then add another sentence based on the root.
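As a rough sketch of that dependency-parse idea (assuming en_core_web_sm and that the conjunction is tagged SCONJ in your text), you could take each SCONJ's head verb and treat that verb's subtree as a subsentence:
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We need to consider speed and reliability so use of a relational "
          "DB would be better than using SQLite.")

subsentences = []
for token in doc:
    if token.pos_ == "SCONJ":
        verb = token.head             # the verb the conjunction attaches to
        subtree = list(verb.subtree)  # all tokens governed by that verb
        subsentences.append(doc[subtree[0].i : subtree[-1].i + 1])

print([span.text for span in subsentences])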

Adding entities to a spaCy doc object using BERT's offsets

Is there any way to add entities to a spaCy doc object using BERT's offsets? The problem is that my whole pipeline is spaCy-dependent and I am using the latest PubMedBERT, for which spaCy doesn't provide support.
So at times the offsets of entities given by PubMedBERT don't result in a valid span for spaCy, as the tokenization is completely different.
What have I done so far to solve my problem?
I made a custom tokenizer by asking spaCy to split on punctuation, similar to BERT, but there are certain cases where I just can't make a rule. For example:
text = '''assessment
Exdtve age-rel mclr degn, left eye, with actv chrdl neovas
Mar-10-2020
assessment'''
PubMedBERT predicted 13:17 to be an entity, i.e. dtve,
but on adding the span as an entity to the spaCy doc object, it returns None as it is not a valid span.
span = doc.char_span(row['start'], row['end'], row['ent'])
doc.ents = list(doc.ents) + [span]
TypeError: object of type 'NoneType' has no len()
Consider row['start'] to be 13, row['end'] to be 17 and row['ent'] to be the label.
How can I solve this problem? Is there any way I can just add entities to the spaCy doc object using the start and end offsets given by PubMedBERT?
I would really appreciate any help on this. Thank you.
Because spacy stores entities internally as IOB tags on tokens in the doc, you can only add entity spans that correspond to full tokens underneath.
If you're only using this doc to store these entities (not using any other components like a tagger or parser from another model that expect a different tokenizer), you can create a doc with the same tokenization as the BERT model:
import spacy
from spacy.tokens import Doc
nlp = spacy.blank("en")
# bert_tokens = [..., "Ex", "dtve", ...]
words, spaces = spacy.util.get_words_and_spaces(bert_tokens, text)
doc = Doc(nlp.vocab, words=words, spaces=spaces)
Then you should be able to add the entity spans to the document.
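For example, something along these lines (13 and 17 are the offsets from your question; the label name is just a placeholder):
# With matching tokenization the character span should now be valid
span = doc.char_span(13, 17, label="ENTITY")
if span is not None:
    doc.ents = list(doc.ents) + [span]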
If you need the original spacy tokenization + entities based on a different tokenization, then you'll have to adjust the entity character offsets to match the spacy token boundaries in order to add them. Since this can depend a lot on the data/task (if dtve is an entity, is Exdtve also necessarily an entity of the same type?), you probably need a custom solution based on your data. If you're trying to adjust the entity spans to line up with the current tokens, you can see the character start and length for each token with token.idx and len(token).
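As a rough sketch of that adjustment (newer spaCy versions can do something similar with the alignment_mode argument of Doc.char_span), you could widen the character offsets to the nearest token boundaries before creating the span:
def snap_to_tokens(doc, start, end, label):
    # Expand the offsets so they start and end exactly on token boundaries
    for token in doc:
        if token.idx <= start < token.idx + len(token):
            start = token.idx
        if token.idx < end <= token.idx + len(token):
            end = token.idx + len(token)
    return doc.char_span(start, end, label=label)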

What are the best algorithms to determine the language of text and to correct typos in python?

I am looking for algorithms that can tell me the language of a text (e.g. Hello - English, Bonjour - French, Servicio - Spanish) and also correct typos of words in English. I have already explored Google's TextBlob; it is very relevant, but it gets a "Too many requests" error as soon as my code starts executing. I also started exploring Polyglot, but I am facing a lot of issues installing the library on Windows.
Code for TextBlob
import pandas as pd
from tkinter import filedialog
from textblob import TextBlob
import time

colnames = ['Word']
x = filedialog.askopenfilename(title='Select the word list')
print("Data to be checked: " + x)
df = pd.read_excel(x, sheet_name='Sheet1', header=0, names=colnames, na_values='?', dtype=str)
words = df['Word']
i = 0
Language_detector = pd.DataFrame(columns=['Word', 'Language', 'corrected_word', 'translated_word'])
for word in words:
    b = TextBlob(word)
    language_word = b.detect_language()
    time.sleep(0.5)
    if language_word in ['en', 'EN']:
        corrected_word = b.correct()
        time.sleep(0.5)
        Language_detector.loc[i, ['corrected_word']] = corrected_word
    else:
        translated_word = b.translate(to='en')
        time.sleep(0.5)
        Language_detector.loc[i, ['translated_word']] = translated_word
    Language_detector.loc[i, ['Word']] = word
    Language_detector.loc[i, ['Language']] = language_word
    i = i + 1
filename = "Language detector test v 1.xlsx"
Language_detector.to_excel(filename, sheet_name='Sheet1')
print("Languages identified for the word list")
A common way to classify languages is to gather summary statistics on letter or word frequencies and compare them to a known corpus. A naive Bayesian classifier would suffice. See https://pypi.org/project/Reverend/ for a way to do this in Python.
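As a toy illustration of that letter-frequency idea (in practice the profiles would be built from much larger corpora for each candidate language):
from collections import Counter
import math

# Tiny stand-in samples; real profiles need far more text per language
samples = {
    "en": "the quick brown fox jumps over the lazy dog and then some",
    "fr": "le renard brun rapide saute par dessus le chien paresseux",
    "es": "el rapido zorro marron salta sobre el perro perezoso",
}

def letter_profile(text):
    counts = Counter(c for c in text.lower() if c.isalpha())
    total = sum(counts.values()) or 1
    return {c: n / total for c, n in counts.items()}

profiles = {lang: letter_profile(text) for lang, text in samples.items()}

def guess_language(text):
    p = letter_profile(text)
    def cosine(q):
        dot = sum(p.get(c, 0.0) * q.get(c, 0.0) for c in set(p) | set(q))
        norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
        return dot / norm if norm else 0.0
    # Pick the language whose letter distribution is closest to the input's
    return max(profiles, key=lambda lang: cosine(profiles[lang]))

print(guess_language("bonjour tout le monde"))  # likely "fr" with real profiles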
Correction of typos can also be done from a corpus, using a statistical model of the most likely words versus the likelihood of a particular typo. See https://norvig.com/spell-correct.html for an example of how to do this in Python.
You could use this, but it is hardly reliable:
https://github.com/hb20007/hands-on-nltk-tutorial/blob/master/8-1-The-langdetect-and-langid-Libraries.ipynb
Alternatively, you could give the compact language detector (CLD3) or fastText a chance, or you could use a corpus to compare the frequencies of the words occurring in the target text in order to find out whether the target text belongs to the language of the respective corpus. The latter is only possible if you know the set of candidate languages in advance.
For typo correction, you could use the Levenshtein algorithm, which computes an «edit distance». You can compare your words against a dictionary and choose the most likely word. For Python, you could use: https://pypi.org/project/python-Levenshtein/
See the concept of Levenshtein edit distance here: https://en.wikipedia.org/wiki/Levenshtein_distance
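As a minimal sketch of dictionary-based correction with edit distance (assumes the python-Levenshtein package is installed; the tiny word list is only an illustration):
import Levenshtein

dictionary = ["hello", "help", "world", "would", "python"]

def correct(word, max_distance=2):
    # Choose the dictionary word with the smallest edit distance to the input
    best = min(dictionary, key=lambda w: Levenshtein.distance(word, w))
    return best if Levenshtein.distance(word, best) <= max_distance else word

print(correct("helo"))  # -> "hello"
print(correct("wrld"))  # -> "world"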

Can a token be removed from a spaCy document during pipeline processing?

I am using spaCy (a great Python NLP library) to process a number of very large documents, however, my corpus has a number of common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever objective you want to use:
from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)
When processing a Doc, you can now filter it to only get the tokens that are not excluded:
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']
You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
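Here is a minimal sketch of that reconstruction (reusing the is_excluded extension from above, assuming nlp is already loaded, and skipping the extra attributes):
from spacy.tokens import Doc

doc = nlp("Test that tokens are excluded")
# Keep only the wanted tokens, then rebuild a Doc from their texts and
# trailing-whitespace flags
kept = [t for t in doc if not t._.is_excluded]
words = [t.text for t in kept]
spaces = [bool(t.whitespace_) for t in kept]
filtered_doc = Doc(nlp.vocab, words=words, spaces=spaces)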

Text Spinner using Naive Bayes

I am writing a text spinner which is working as it should, but the readability of the output sentences is very low, as it just uses a dictionary which I am getting from a database. This returns spintax like this:
{Your} {home| house| residence| property} {is} {your} {castle| mansion| fortress| palace}
This spintax is passed to a function which randomly selects one synonym from each group and outputs a sentence based on the user's original input. For example, the input:
Your home is your castle.
will return
Your property is your mansion.
Now I want to include artificial intelligence to make my output sentences more readable. I want to know how to make a better word selection using naive Bayes. I know I will probably need to train a model to get better results.
Here is my current method for selecting a word, which is really simple right now:
import re, random
def spin(spintax):
    # Keep replacing {a|b|c} groups with a random choice until none remain
    while True:
        word, n = re.subn('{([^{}]*)}', lambda m: random.choice(m.group(1).split("|")), spintax)
        if n == 0:
            break
        spintax = word
    return word.strip()
Thank you in advance. If you guys need me to post more code, let me know.
This will probably get closed as there is no concise answer to your question, but you might want to check out nltk wordnet:
https://pythonprogramming.net/wordnet-nltk-tutorial/
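For example, here is a small sketch of pulling synonym candidates from WordNet (assumes the wordnet corpus has been fetched with nltk.download("wordnet")):
from nltk.corpus import wordnet

def synonyms(word):
    # Collect the lemma names from every synset of the word
    return {lemma.name().replace("_", " ")
            for synset in wordnet.synsets(word)
            for lemma in synset.lemmas()}

print(synonyms("home"))  # e.g. {'home', 'dwelling', 'abode', ...}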
Maybe you could download the dataset Google collected from all English books and generate random sentences using ngrams? https://books.google.com/ngrams
One way to implement this is with a Markov chain, where the downloaded data provides the probabilities for choosing the next word.
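As a minimal sketch of that Markov-chain idea (the counts below are hypothetical stand-ins for frequencies derived from the ngram data):
import random

bigram_counts = {
    "your": {"home": 5, "house": 3, "castle": 1},
    "home": {"is": 8, "was": 2},
    "is": {"your": 6, "a": 4},
}

def next_word(current):
    options = bigram_counts.get(current)
    if not options:
        return None
    words = list(options)
    weights = [options[w] for w in words]
    # Sample the next word in proportion to its observed frequency
    return random.choices(words, weights=weights, k=1)[0]

print(next_word("your"))  # e.g. "home"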
