Using POS and PUNCT tokens in custom sentence boundaries in spaCy - python

I am trying to split sentences into clauses using spaCy, for classification with MLlib. I have been looking at one of two approaches that I consider the best way to do this, but haven't had much luck.
Option 1: Use the tokens in the doc, i.e. split at every token whose token.pos_ matches SCONJ and start a new sentence there.
Option 2: Build a list from whatever dictionary of values spaCy uses to identify SCONJ tokens.
The issue with option 1 is that inside the custom boundary component I only have .text and .i, and no .pos_, since (as far as I am aware) the component needs to run before the parser.
The issue with option 2 is that I can't find that dictionary anywhere. It is also a really hacky approach.
import spacy
import deplacy
from spacy.language import Language

# Uncomment to visualise how the tokens are labelled
# deplacy.render(doc)

custom_EOS = ['.', ',', '!', '!']
custom_conj = ['then', 'so']

@Language.component("set_custom_boundaries")
def set_custom_boundaries(doc):
    for token in doc[:-1]:
        if token.text in custom_EOS:
            doc[token.i + 1].is_sent_start = True
        if token.text in custom_conj:
            doc[token.i].is_sent_start = True
    return doc

def set_sentence_breaks(doc):
    for token in doc:
        if token.pos_ == "SCONJ":
            doc[token.i].is_sent_start = True
    return doc

def main():
    text = "In the add user use case, we need to consider speed and reliability " \
           "so use of a relational DB would be better than using SQLite. Though " \
           "it may take extra effort to convert #Bot"
    nlp = spacy.load("en_core_web_sm")
    nlp.add_pipe("set_custom_boundaries", before="parser")
    doc = nlp(text)

    # for token in doc:
    #     print(token.pos_)

    print("Sentences:", [sent.text for sent in doc.sents])

if __name__ == "__main__":
    main()
Current Output
Sentences: ['In the add user use case,',
'we need to consider speed and reliability',
'so the use of a relational DB would be better than using SQLite.',
'Though it may take extra effort to convert #Bot']

I would recommend not trying to do anything clever with is_sent_start - while it is user-accessible, it's really not intended to be used in that way, and there is at least one unresolved issue related to it.
Since you just need these divisions for some other classifier, it's enough for you to just get the strings, right? In that case I recommend you run the spaCy pipeline as usual and then split sentences on SCONJ tokens (if just using SCONJ works for your use case). Something like:
out = []
for sent in doc.sents:
    last = sent[0].i
    for tok in sent:
        if tok.pos_ == "SCONJ":
            out.append(doc[last:tok.i])
            last = tok.i + 1
    # include the remainder of the sentence, through its final token
    out.append(doc[last:sent[-1].i + 1])
Alternatively, if that's not good enough, you can use the dependency parse to identify subsentences: find the verbs that head subordinate clauses (by their relation to an SCONJ, for example), save those subsentences, and then add one more span based on the root.
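For illustration, here is one rough sketch of that dependency-parse approach (my own interpretation, not production code): it treats any verb or auxiliary that governs a subordinating conjunction through the mark relation as the head of a subordinate clause, and cuts each sentence at that clause's left edge.
import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("We need to consider speed and reliability, "
          "so a relational DB would be better than SQLite, "
          "though it may take extra effort to convert.")

subsentences = []
for sent in doc.sents:
    # Verbs/auxiliaries whose clause is introduced by a subordinating conjunction
    clause_heads = [tok for tok in sent
                    if tok.pos_ in ("VERB", "AUX")
                    and any(child.pos_ == "SCONJ" and child.dep_ == "mark"
                            for child in tok.children)]
    # Cut the sentence at the left edge of each subordinate clause
    cut_points = sorted(tok.left_edge.i for tok in clause_heads)
    start = sent.start
    for cut in cut_points:
        if cut > start:
            subsentences.append(doc[start:cut])
        start = cut
    subsentences.append(doc[start:sent.end])

print([span.text for span in subsentences])
The exact relations and POS checks will vary with your text, so treat the heuristics above as a starting point.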

Related

Force spaCy lemmas to be lowercase

Is it possible to leave the token text true-cased, but force the lemmas to be lowercased? I am interested in this because I want to use the PhraseMatcher, where I run an input text through the pipeline and then search for matching phrases on that text, where each search query can be case-sensitive or not. In the case that I search by lemma, I'd like the search to be case-insensitive by default.
e.g.
doc = nlp(text)
for query in queries:
    if case1:
        attr = "LEMMA"
    elif case2:
        attr = "ORTH"
    elif case3:
        attr = "LOWER"
    phrase_matcher = PhraseMatcher(self.vocab, attr=attr)
    phrase_matcher.add(key, query)
    matches = phrase_matcher(doc)
In case 1, I expect matching to be case insensitive, and if there were something in the spaCy library to enforce that lemmas are lowercased by default, this would be much more efficient than keeping multiple versions of the doc, and forcing one to have all lowercased characters.
This part of spaCy changes from version to version; the last time I looked at lemmatization was a few versions ago. So this solution might not be the most elegant one, but it is definitely a simple one:
# Create a pipe that converts lemmas to lower case:
def lower_case_lemmas(doc):
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

# Add it to the pipeline
nlp.add_pipe(lower_case_lemmas, name="lower_case_lemmas", after="tagger")
You will need to figure out where in the pipeline to add it. The latest documentation mentions that the Lemmatizer uses POS tagging information, so I am not sure at exactly what point it is called. Placing your pipe after the tagger is safe; all the lemmas should be figured out by then.
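Note that on spaCy 3.x, components are added to the pipeline by name and must be registered first, and lemmas are assigned by a separate lemmatizer pipe. A minimal sketch of the same idea for spaCy 3 (assuming the standard en_core_web_sm pipeline, which includes a lemmatizer component):
import spacy
from spacy.language import Language

@Language.component("lower_case_lemmas")
def lower_case_lemmas(doc):
    # Lower-case every lemma in place
    for token in doc:
        token.lemma_ = token.lemma_.lower()
    return doc

nlp = spacy.load("en_core_web_sm")
# In v3 the rule-based lemmatizer is its own pipe, so add the component after it
nlp.add_pipe("lower_case_lemmas", after="lemmatizer")

doc = nlp("The Cats Were Running")
print([token.lemma_ for token in doc])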
Another option I can think of is to derive a custom lemmatizer from the Lemmatizer class and override its __call__ method, but this is likely to be quite invasive, as you will need to figure out how (and where) to plug in your own lemmatizer.

Can a token be removed from a spaCy document during pipeline processing?

I am using spaCy (a great Python NLP library) to process a number of very large documents, however, my corpus has a number of common words that I would like to eliminate in the document processing pipeline. Is there a way to remove a token from the document within a pipeline component?
spaCy's tokenization is non-destructive, so it always represents the original input text and never adds or deletes anything. This is kind of a core principle of the Doc object: you should always be able to reconstruct and reproduce the original input text.
While you can work around that, there are usually better ways to achieve the same thing without breaking the input text ↔ Doc text consistency. One solution would be to add a custom extension attribute like is_excluded to the tokens, based on whatever objective you want to use:
from spacy.tokens import Token

def get_is_excluded(token):
    # Getter function to determine the value of token._.is_excluded
    return token.text in ['some', 'excluded', 'words']

Token.set_extension('is_excluded', getter=get_is_excluded)
When processing a Doc, you can now filter it to only get the tokens that are not excluded:
doc = nlp("Test that tokens are excluded")
print([token.text for token in doc if not token._.is_excluded])
# ['Test', 'that', 'tokens', 'are']
You can also make this more complex by using the Matcher or PhraseMatcher to find sequences of tokens in context and mark them as excluded.
Also, for completeness: If you do want to change the tokens in a Doc, you can achieve this by constructing a new Doc object with words (a list of strings) and optional spaces (a list of boolean values indicating whether the token is followed by a space or not). To construct a Doc with attributes like part-of-speech tags or dependency labels, you can then call the Doc.from_array method with the attributes to set and a numpy array of the values (all IDs).
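For that last option, here is a minimal sketch (my own, reusing the same exclusion idea as above) of rebuilding a Doc without the excluded tokens; the new Doc has no linguistic annotations unless you also copy them over with Doc.from_array:
import spacy
from spacy.tokens import Doc

nlp = spacy.load("en_core_web_sm")
doc = nlp("Test that some excluded words are removed")

excluded = {"some", "excluded", "words"}
kept = [tok for tok in doc if tok.text not in excluded]

# words holds the surviving token texts; spaces says whether each token is
# followed by whitespace in the reconstructed text
words = [tok.text for tok in kept]
spaces = [bool(tok.whitespace_) for tok in kept]
new_doc = Doc(nlp.vocab, words=words, spaces=spaces)
print(new_doc.text)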

Using spaCy to replace the "topic" of a sentence

So, as a bit of a thought experiment, I coded up a function in Python that uses spaCy to find the subject of a news article and then replace it with a noun of choice. The problem is that it doesn't work very well, and I was hoping it could be improved. I don't understand spaCy all that well, and the documentation is a bit hard to follow.
First, the code:
doc = nlp(thetitle)
for text in doc:
    # subject would be
    if text.dep_ == "nsubj":
        subject = text.orth_
    # iobj for indirect object
    if text.dep_ == "iobj":
        indirect_object = text.orth_
    # dobj for direct object
    if text.dep_ == "dobj":
        direct_object = text.orth_
try:
    subject
except NameError:
    if not thetitle:  # if empty title
        thetitle = "cat"
        subject = "cat"
    else:  # if unknown subject
        try:  # do we have a direct object?
            direct_object
        except NameError:
            try:  # do we have an indirect object?
                indirect_object
            except NameError:  # still no??
                subject = random.choice(thetitle.split())
            else:
                subject = indirect_object
        else:
            subject = direct_object
else:
    thecat = "cat"  # do nothing here, everything went okay
newtitle = re.sub(r"\b%s\b" % subject, toreplace, thetitle)
if newtitle == thetitle:  # if no replacement happened due to regex
    newtitle = thetitle.replace(subject, toreplace)
return newtitle
the "cat" lines are filler lines that don't do anything. "thetitle" is a variable for a random news article title I'm pulling in from RSS feeds. "toreplace" is the variable that holds the string to replace whatever the found subject is.
Let's use an example:
"Video Games that Should Be Animated TV Shows - Screen Rant" And here's the displaCy breakdown of that: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows%20-%20Screen%20Rant&model=en&cpu=1&cph=1
The word the code decided to replace ended up being "that", which isn't even a noun in this sentence; it seems to have come from the random-word-choice fallback, since the code couldn't find a subject, indirect object, or direct object. My hope is that it would find something more like "Video Games" in this example.
I should note that if I take the last bit (which appears to be the source of the news article) out in displaCy: https://demos.explosion.ai/displacy/?text=Video%20Games%20that%20Should%20Be%20Animated%20TV%20Shows&model=en&cpu=1&cph=1 it seems to think "that" is the subject, which is incorrect.
What is a better way to parse this? Should I look for proper nouns first?
This doesn't directly answer your question, but I think the code below is far more readable, because the conditions are explicit and what happens when a condition holds is not buried in an else clause far away. This code also takes care of the cases with multiple objects.
As for your problem: any natural language processing tool will have a hard time finding the subject (or rather the topic) of a sentence fragment, because such tools are trained on complete sentences. I'm not even sure such fragments technically have subjects (I'm not an expert, though). You could try to train your own model, but then you would have to provide labeled sentences, and I don't know whether such a dataset already exists for sentence fragments.
I am not entirely sure what you want to achieve, but the common nouns and pronouns will likely contain the word you want to replace, and the first one to appear is probably the most important.
import spacy
import random
import re
from collections import defaultdict

def replace_subj(sentence, nlp):
    doc = nlp(sentence)
    tokens = defaultdict(list)
    for text in doc:
        tokens[text.dep_].append(text.orth_)
    if not sentence:
        return "cat"
    if "nsubj" in tokens:
        subject = tokens["nsubj"][0]
    elif "dobj" in tokens:
        subject = tokens["dobj"][0]
    elif "iobj" in tokens:
        subject = tokens["iobj"][0]
    else:
        subject = random.choice(sentence.split())
    return re.sub(r"\b{}\b".format(subject), "cat", sentence)

if __name__ == "__main__":
    sentence = """Video Games that Should Be Animated TV Shows - Screen Rant"""
    nlp = spacy.load("en")
    print(replace_subj(sentence, nlp))

Determine if text is in English?

I am using both NLTK and scikit-learn to do some text processing. However, within my list of documents I have some documents that are not in English. For example, the following could be true:
[ "this is some text written in English",
"this is some more text written in English",
"Ce n'est pas en anglais" ]
For the purposes of my analysis, I want all sentences that are not in English to be removed as part of pre-processing. Is there a good way to do this? I have been Googling but cannot find anything specific that will let me recognize whether a string is in English or not. Is this functionality offered in either NLTK or scikit-learn? EDIT: I've seen questions like this and this, but both are for individual words, not a "document". Would I have to loop through every word in a sentence to check whether the whole sentence is in English?
I'm using Python, so Python libraries would be preferable, but I can switch languages if needed; I just thought Python would be best for this.
There is a library called langdetect. It is a port of Google's language-detection library, available here:
https://pypi.python.org/pypi/langdetect
It supports 55 languages out of the box.
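A minimal usage sketch (langdetect's results are non-deterministic by default, so seeding the DetectorFactory makes them reproducible):
from langdetect import DetectorFactory, detect

DetectorFactory.seed = 0  # make detection deterministic across runs

docs = ["this is some text written in English",
        "this is some more text written in English",
        "Ce n'est pas en anglais"]

english_only = [d for d in docs if detect(d) == "en"]
print(english_only)  # drops the French sentence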
You might be interested in my paper The WiLI benchmark dataset for written language identification. I also benchmarked a couple of tools.
TL;DR:
CLD-2 is pretty good and extremely fast
lang-detect is a tiny bit better, but much slower
langid is good, but CLD-2 and lang-detect are much better
NLTK's Textcat is neither efficient nor effective.
You can install lidtk and classify languages:
$ lidtk cld2 predict --text "this is some text written in English"
eng
$ lidtk cld2 predict --text "this is some more text written in English"
eng
$ lidtk cld2 predict --text "Ce n'est pas en anglais"
fra
Pretrained Fast Text Model Worked Best For My Similar Needs
I arrived at your question with a very similar need. I appreciated Martin Thoma's answer. However, I found the most help from Rabash's answer part 7 HERE.
After experimenting to find what worked best for my needs, which involved checking that 60,000+ text files were in English, I found that fastText was an excellent tool.
With a little work, I had a tool that worked very fast over many files. Below is the code with comments. I believe that you and others will be able to modify this code for your more specific needs.
import fasttext

class English_Check:
    def __init__(self):
        # Don't need to train a model to detect languages. A model exists
        # that is very good. Let's use it.
        pretrained_model_path = 'location of your lid.176.ftz file from fasttext'
        self.model = fasttext.load_model(pretrained_model_path)

    def predict_languages(self, text_file):
        this_D = {}
        with open(text_file, 'r') as f:
            fla = f.readlines()  # fla = file line array.
            # fasttext doesn't like newline characters, but it can take
            # an array of lines from a file. The two list comprehensions
            # below just clean up the lines in fla.
            fla = [line.rstrip('\n').strip(' ') for line in fla]
            fla = [line for line in fla if len(line) > 0]

            for line in fla:  # Language-predict each line of the file.
                language_tuple = self.model.predict(line)
                # The next two lines simply get at the top language prediction
                # string AND the confidence value for that prediction.
                prediction = language_tuple[0][0].replace('__label__', '')
                value = language_tuple[1][0]

                # Each top language prediction for the lines in the file
                # becomes a unique key for the this_D dictionary.
                # Every time that language is found, add the confidence
                # score to the running tally for that language.
                if prediction not in this_D.keys():
                    this_D[prediction] = 0
                this_D[prediction] += value

        self.this_D = this_D

    def determine_if_file_is_english(self, text_file):
        self.predict_languages(text_file)

        # Find the max tallied confidence and the sum of all confidences.
        max_value = max(self.this_D.values())
        sum_of_values = sum(self.this_D.values())
        # Calculate a relative confidence of the max confidence to all
        # confidence scores. Then find the key with the max confidence.
        confidence = max_value / sum_of_values
        max_key = [key for key in self.this_D.keys()
                   if self.this_D[key] == max_value][0]

        # Only want to know if this is English or not.
        return max_key == 'en'
Below is the application / instantiation and use of the above class for my needs.
file_list = ...  # some tool to get my specific list of files to check for English

en_checker = English_Check()
for file in file_list:
    check = en_checker.determine_if_file_is_english(file)
    if not check:
        print(file)
This is what I used some time ago.
It works for texts longer than 3 words and with fewer than 3 unrecognized words.
Of course, you can play with the settings, but for my use case (website scraping) those worked pretty well.
from enchant.checker import SpellChecker

max_error_count = 4
min_text_length = 3

def is_in_english(quote):
    d = SpellChecker("en_US")
    d.set_text(quote)
    errors = [err.word for err in d]
    return len(errors) <= max_error_count and len(quote.split()) >= min_text_length
print(is_in_english('“中文”'))
print(is_in_english('“Two things are infinite: the universe and human stupidity; and I\'m not sure about the universe.”'))
> False
> True
Use the enchant library
import enchant
dictionary = enchant.Dict("en_US") #also available are en_GB, fr_FR, etc
dictionary.check("Hello") # prints True
dictionary.check("Helo") #prints False
This example is taken directly from their website
If you want something lightweight, letter trigrams are a popular approach. Every language has a different "profile" of common and uncommon trigrams. You can google around for it, or code your own. Here's a sample implementation I came across, which uses "cosine similarity" as a measure of distance between the sample text and the reference data:
http://code.activestate.com/recipes/326576-language-detection-using-character-trigrams/
If you know the common non-English languages in your corpus, it's pretty easy to turn this into a yes/no test. If you don't, you need to anticipate sentences from languages for which you don't have trigram statistics. I would do some testing to see the normal range of similarity scores for single-sentence texts in your documents, and choose a suitable threshold for the English cosine score.
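If you want to roll your own, here is a rough sketch of the trigram idea (the English profile below is a toy built from one sentence purely for illustration; a real profile should come from a reasonably large corpus):
from collections import Counter
from math import sqrt

def trigram_profile(text):
    # Counter of overlapping character trigrams, with padding spaces
    text = " " + text.lower() + " "
    return Counter(text[i:i + 3] for i in range(len(text) - 2))

def cosine(p, q):
    # Cosine similarity between two trigram Counters
    dot = sum(p[g] * q[g] for g in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Toy English reference profile; build this from a real corpus in practice
english_profile = trigram_profile(
    "the quick brown fox jumps over the lazy dog and this is plain English text")

for sample in ["this is some text written in English", "Ce n'est pas en anglais"]:
    print(sample, "->", round(cosine(trigram_profile(sample), english_profile), 3))
You would then threshold the cosine score against what you observe for known-English sentences in your own data.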
import enchant

def check(text):
    # Returns True only if every whitespace-separated word passes the spell check
    dictionary = enchant.Dict("en_US")  # also available are en_GB, fr_FR, etc.
    for word in text.split():
        if not dictionary.check(word):
            return False
    return True

Best way to match a large list against strings in python

I have a Python list that contains about 700 terms that I would like to use as metadata for some database entries in Django. I would like to match the terms in the list against the entry descriptions to see if any of them match, but there are a couple of issues. My first issue is that some multiword terms in the list contain words that are also separate list entries. An example is:
Intrusion
Intrusion Detection
I have not gotten very far with re.findall as it will match both Intrusion and Intrusion Detection in the above example. I would only want to match Intrusion Detection and not Intrusion.
Is there a better way to do this type of matching? I thought about trying NLTK, but it didn't look like it could help with this type of matching.
Edit:
So to add a little more clarity, I have a list of 700 terms such as firewall or intrusion detection. I would like to try to match these words in the list against descriptions that I have stored in a database to see if any match, and I will use those terms in metadata. So if I have the following string:
There are many types of intrusion detection devices in production today.
and if I have a list with the following terms:
Intrusion
Intrusion Detection
I would like to match 'intrusion detection', but not 'intrusion'. Really, I would also like to be able to match singular/plural instances, but I may be getting ahead of myself. The idea behind all of this is to take all of the matches, put them in a list, and then process them.
If you need more flexibility in matching entry descriptions, you can combine nltk and re:
from nltk.stem import PorterStemmer
import re
Let's say you have different descriptions of the same event, e.g. a rewrite of the system. You can use nltk.stem to capture rewrite, rewriting, rewrites, singular and plural forms, etc.
master_list = [
    'There are many types of intrusion detection devices in production today.',
    'The CTO approved a rewrite of the system',
    'The CTO is about to approve a complete rewrite of the system',
    'The CTO approved a rewriting',
    'Breaching of Firewalls'
]

terms = [
    'Intrusion Detection',
    'Approved rewrite',
    'Firewall'
]

stemmer = PorterStemmer()

# for each term, split it into words (could be just one word) and stem each word
stemmed_terms = ((stemmer.stem(word) for word in s.split()) for s in terms)

# add a 'match anything after it' expression to each of the stemmed words
# and join the result into a pattern string
regex_patterns = [''.join(stem + '.*' for stem in term) for term in stemmed_terms]

print(regex_patterns)
print('')

for sentence in master_list:
    match_obs = (re.search(pattern, sentence, flags=re.IGNORECASE) for pattern in regex_patterns)
    matches = [m.group(0) for m in match_obs if m]
    print(matches)
Output:
['Intrus.*Detect.*', 'Approv.*rewrit.*', 'Firewal.*']
['intrusion detection devices in production today.']
['approved a rewrite of the system']
['approve a complete rewrite of the system']
['approved a rewriting']
['Firewalls']
EDIT:
To see which of the terms caused the match:
for sentence in master_list:
    # regex_patterns maps directly onto terms (strictly speaking it's one-to-one and onto)
    for term, pattern in zip(terms, regex_patterns):
        if re.search(pattern, sentence, flags=re.IGNORECASE):
            # process term (put it in the db)
            print('TERM: {0} FOUND IN: {1}'.format(term, sentence))
Output:
TERM: Intrusion Detection FOUND IN: There are many types of intrusion detection devices in production today.
TERM: Approved rewrite FOUND IN: The CTO approved a rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO is about to approve a complete rewrite of the system
TERM: Approved rewrite FOUND IN: The CTO approved a rewriting
TERM: Firewall FOUND IN: Breaching of Firewalls
This question is unclear, but from what I understand you have a master list of terms, say one term per line. Next you have a list of test data, where some of the test data will be in the master list and some won't. You want to see whether the test data is in the master list and, if it is, perform a task.
Assuming your Master List looks like this
Intrusion Detection
Firewall
FooBar
and your Test Data Looks like this
Intrusion
Intrusion Detection
foo
bar
this simple script should lead you in the right direction
#!/usr/bin/env python
import sys

def main():
    '''usage: tester.py masterList testList'''
    # open files
    masterListFile = open(sys.argv[1], 'r')
    testListFile = open(sys.argv[2], 'r')

    # build master list
    # .strip() off the '\n' newline
    # set to lower case. Intrusion != intrusion, but should.
    masterList = [line.strip().lower() for line in masterListFile]

    # run test
    for line in testListFile:
        term = line.strip().lower()
        if term in masterList:
            print(term, "in master list!")
            # perhaps grab your metadata using a LIKE %%
        else:
            print("OH NO!", term, "not found!")

    # close files
    masterListFile.close()
    testListFile.close()

if __name__ == '__main__':
    main()
SAMPLE OUTPUT
OH NO! intrusion not found!
intrusion detection in master list!
OH NO! foo not found!
OH NO! bar not found!
There are several other ways to do this, but this should point you in the right direction. If your list is large (700 really isn't that large), consider using a dict; I believe lookups in one are quicker, especially if you plan to query a database. A dictionary structure could look like {term: information about term}, as sketched below.
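A minimal sketch of that dict idea (the terms and metadata values here are made up for illustration):
# Membership tests against dict keys (or a set) are O(1) on average, vs O(n) for a list
master = {
    "intrusion detection": "category: network security",
    "firewall": "category: network security",
    "foobar": "category: misc",
}

for term in ["intrusion", "intrusion detection", "foo"]:
    if term in master:
        print(term, "->", master[term])
    else:
        print("OH NO!", term, "not found!")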
