spaCy Matcher conditional or/and Python

I want to categorize the following keywords:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
cat_patterns = [nlp(text) for text in ('cat', 'cute', 'fat')]
dog_patterns = [nlp(text) for text in ('dog', 'fat')]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('Category1', None, *cat_patterns)
matcher.add('Category2', None, *dog_patterns)
doc = nlp("I have a white cat. It is cute and fat; I have a black dog. It is fat,too")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
    span = doc[start:end]  # get the matched slice of the doc
    print(rule_id, span.text)
#Output
#Category1 cat
#Category1 cute
#Category1 fat
#Category2 fat
#Category2 dog
#Category1 fat
#Category2 fat
However, my expected output is: if the text contains "cat" and "cute" together, or "cat" and "fat" together, it falls in the first category; if the text contains "dog" and "fat" together, it falls in the second category.
#Category1 cat cute
#Category1 cat fat
#Category2 dog fat
Is it possible to do this with a similar approach? Thank you.

From the spaCy documentation on Matchers (https://spacy.io/usage/rule-based-matching), there is no way to detect two different tokens separated by an arbitrary number of tokens. If you knew how many tokens were between "cat" and "fat", for example, you could use wildcard patterns (https://spacy.io/usage/rule-based-matching#adding-patterns-wildcard), but from your example it looks like the distance between the tokens can vary.
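For illustration, here is a minimal sketch of such a wildcard pattern, assuming the newer matcher.add(name, patterns) call signature and an example sentence of my own. The empty dict {} matches any token, so this only works when the number of tokens between the two words is fixed:
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
# {} is a wildcard token; three of them require exactly three tokens in between
pattern = [{"LOWER": "cat"}, {}, {}, {}, {"LOWER": "fat"}]
matcher.add("CatFat", [pattern])

doc = nlp("The cat is cute and fat.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # cat is cute and fat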
Two solutions that I can see to solve your problem:
1. Keep track of matches in your for loop using some sort of data structure. If all the tokens you are looking for end up being found, add that match to your final results (see the sketch after this list).
2. Use regular expressions to detect what you are looking for. spaCy has great tools for rule-based matching, but it looks like you aren't using any linguistic aspects of the words you are searching for. A simple regex like /cat.*?fat/ will find the matches you are looking for.
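Here is a minimal sketch of the first suggestion: collect which keywords were matched per category, then check the required combinations. The required combinations below are assumptions taken from your expected output, and it uses the newer matcher.add(name, patterns) call signature:
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
matcher.add('Category1', [nlp.make_doc(t) for t in ('cat', 'cute', 'fat')])
matcher.add('Category2', [nlp.make_doc(t) for t in ('dog', 'fat')])

doc = nlp("I have a white cat. It is cute and fat; I have a black dog. It is fat, too")

# collect which keywords were actually found for each category
found = {}
for match_id, start, end in matcher(doc):
    rule_id = nlp.vocab.strings[match_id]
    found.setdefault(rule_id, set()).add(doc[start:end].text.lower())

# required combinations: a category applies if any one of its keyword sets is fully present
requirements = {
    'Category1': [{'cat', 'cute'}, {'cat', 'fat'}],
    'Category2': [{'dog', 'fat'}],
}
for category, keyword_sets in requirements.items():
    for keywords in keyword_sets:
        if keywords <= found.get(category, set()):
            print(category, ' '.join(sorted(keywords)))
# Category1 cat cute
# Category1 cat fat
# Category2 dog fat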

Related

How to match repeating patterns in spacy?

I have a similar question as the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case compared to the linked post is that my pattern is defined by POS and dependency tags. As a consequence I don't think I could easily use regex to solve my problem (as is suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.
The solution / issue isn't fundamentally different from the question linked to: there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns to capture what you want.
patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
Alternately you could do something with the dependency matcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
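For the dependency route, here is a minimal sketch using spaCy 3's DependencyMatcher; the example sentence and rule name are my own. It anchors on a noun and matches any adjective attached to it with an amod arc, regardless of how many adjectives there are:
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# anchor on a noun, then match any adjective directly attached to it via an amod arc
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "adjective",
     "RIGHT_ATTRS": {"DEP": "amod"}},
]
matcher.add("ADJ_OF_NOUN", [pattern])

doc = nlp("It was a big, brown, playful dog.")
for match_id, token_ids in matcher(doc):
    print([doc[i].text for i in token_ids])  # e.g. ['dog', 'big'], ['dog', 'brown'], ...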
As a separate note, you are not handling sentences with the serial comma.

Search a string for a word/sentence and print the following word

I have a string that contains around 10 lines of text. What I am trying to do is find a sentence that contains a specific word (or words) and print the word that follows it.
Example String:
The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat
I want the script to search for 'The slow', then print the following word, so in this case, 'donkey'.
I have tried using the find() function, but that just returns the position of the word(s).
Example code:
sSearch = output.find("destination-pattern")
print(sSearch)
Any help would be greatly appreciated.
output = "The slow donkey brown fox"
patt = "The slow"
sSearch = output.find(patt)
print(output[sSearch+len(patt)+1:].split(' ')[0])
output:
donkey
You could work with regular expressions. Python has a built-in library for them called re.
Example usage:
s = "The slow donkey some more text"
finder = "The slow"
idx_finder_end = s.find(finder) + len(finder)
next_word_match = re.match(r"\s\w*\s", s[idx_finder_end:])
next_word = next_word_match.group().strip()
# donkey
I would do it using regular expressions (the re module) in the following way:
import re
txt = '''The quick brown fox
The slow donkey
The slobbery dog
The Furry Cat'''
words = re.findall(r'(?<=The slow) (\w*)',txt)
print(words) # prints ['donkey']
Note that words is now a list of words; if you are sure that there is exactly one word to be found, you could then do:
word = words[0]
print(word) # prints donkey
Explanation: I used a so-called lookbehind assertion in the first argument of re.findall, which means I am looking for something preceded by "The slow". \w* means any substring consisting of letters, digits, and underscores (_). I enclosed it in a group (parentheses) so that the space before it is not part of the returned word.
You can do it using regular expressions:
>>> import re
>>> r=re.compile(r'The slow\s+\b(\w+)\b')
>>> r.match('The slow donkey')[1]
'donkey'
>>>

New named entity class in Spacy

I need to train spaCy NER to be able to recognize 2 new classes for named entity recognition; all I have are files with lists of items that are supposed to be in the new classes.
For example: Rolling Stones, Muse, Arctic Monkeys - artists
Any idea how this can be done?
This seems like a perfect use case for Matcher or PhraseMatcher (if you care about performance).
# Note: this answer targets an old spaCy API (the entity_key/label/attrs/specs
# arguments and span.merge); recent spaCy versions use matcher.add(name, patterns)
# and the retokenizer instead.
import spacy

nlp = spacy.load('en')

def merge_phrases(matcher, doc, i, matches):
    '''
    Merge a phrase. We have to be careful here because we'll change the token indices.
    To avoid problems, merge all the phrases once we're called on the last match.
    '''
    if i != len(matches) - 1:
        return None
    spans = [(ent_id, label, doc[start:end]) for ent_id, label, start, end in matches]
    for ent_id, label, span in spans:
        span.merge('NNP' if label else span.root.tag_, span.text, nlp.vocab.strings[label])

matcher = spacy.matcher.Matcher(nlp.vocab)
matcher.add(entity_key='1', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Rolling'}, {spacy.attrs.ORTH: 'Stones'}]], on_match=merge_phrases)
matcher.add(entity_key='2', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Muse'}]], on_match=merge_phrases)
matcher.add(entity_key='3', label='ARTIST', attrs={}, specs=[[{spacy.attrs.ORTH: 'Arctic'}, {spacy.attrs.ORTH: 'Monkeys'}]], on_match=merge_phrases)

doc = nlp(u'The Rolling Stones are an English rock band formed in London in 1962. The first settled line-up consisted of Brian Jones, Ian Stewart, Mick Jagger, Keith Richards, Bill Wyman and Charlie Watts')
matcher(doc)
for ent in doc.ents:
    print(ent)
See the documentation for more details. From my experience, with 400k entities in a Matcher it would take almost a second to match each document.
PhraseMatcher is much, much faster but a bit trickier to use. Note that this is a "strict" matcher: it won't match any entities it hasn't seen before.
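For completeness, here is a minimal sketch of the PhraseMatcher route with the current spaCy API; the model, label, and example text are assumptions for illustration. It tags the matched artist names as entity spans on the doc:
import spacy
from spacy.matcher import PhraseMatcher
from spacy.tokens import Span

nlp = spacy.load("en_core_web_sm")
artists = ["Rolling Stones", "Muse", "Arctic Monkeys"]

matcher = PhraseMatcher(nlp.vocab)
matcher.add("ARTIST", [nlp.make_doc(name) for name in artists])

doc = nlp("The Rolling Stones are an English rock band formed in London in 1962.")
# overwrite doc.ents with the matched spans, labelled ARTIST
doc.ents = [Span(doc, start, end, label="ARTIST") for _, start, end in matcher(doc)]
for ent in doc.ents:
    print(ent.text, ent.label_)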

Match all characters except a few using .(dot) in multiline string using Regex

My input string is as follows:
The dog is black
and beautiful
The dog and the cat
is black and beautiful
I want to replace 'black' with 'dark' only when the cat is not described.
So my output should be
The dog is dark
and beautiful
The dog and the cat
is black and beautiful
pRegex = re.compile(r'(The.*?(?!cat)ful)', re.DOTALL)
for i in pRegex.finditer(asm_file):
    res = i.groups()
    print res
With this, 'black' is replaced in both cases.
Is there anything wrong with the regex?
I am using Python 2.7.
Thanks
Regular expressions cannot easily describe a string based on a general negative condition ("not containing Z"). In your case you tried to express something like "a string starting with X and ending with Y but NOT containing Z". What your pattern actually expresses is "a string starting with X and ending with Y and containing at least one place which is not Z", which does not help.
I suggest searching for the more general expression and then testing each match with something like if 'cat' in i:. That is straightforward and everybody will be able to understand it.
A more sophisticated way would be to search for an alternation (OR) of two regexps, the first matching such expressions with cat inside, the other matching all expressions with that start and end. If you capture the two alternatives in different groups, you can easily tell from which group is filled whether you got the variant with or without cat. But this only works if you can specify true separators between the blocks, which I figure you can't ;-) Anyway, here's an example of what I mean:
r = re.compile(r'(The[^|]*?cat[^|]*?ful)|(The[^|]*?ful)')
text = 'The dog is black and beautiful | The dog and the cat is black and beautiful'
for i in r.finditer(text):
    print i.groups()
prints:
(None, 'The dog is black and beautiful')
('The dog and the cat is black and beautiful', None)
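Here is a minimal sketch of the first suggestion (match the general pattern, then check for 'cat' before substituting). It is written to run on Python 2.7 as in the question, and the input string is hard-coded for illustration:
import re

text = '''The dog is black
and beautiful
The dog and the cat
is black and beautiful'''

block_re = re.compile(r'The.*?ful', re.DOTALL)

def fix(match):
    block = match.group(0)
    # leave blocks that mention the cat untouched, otherwise replace black -> dark
    return block if 'cat' in block else block.replace('black', 'dark')

print(block_re.sub(fix, text))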

efficient algorithm for finding co occurrence matrix of phrases

I have a list L of around 40,000 phrases and a document of around 10 million words. What I want to check is which pairs of these phrases co-occur within a window of 4 words. For example, consider L = ["brown fox", "lazy dog"]. The document contains the words "a quick brown fox jumps over the lazy dog". I want to see how many times "brown fox" and "lazy dog" appear within a window of four words and store that in a file. I have the following code for doing this:
content=open("d.txt","r").read().replace("\n"," ");
for i in range(len(L)):
for j in range(i+1,len(L)):
wr=L[i]+"\W+(?:\w+\W+){1,4}"+L[j]
wrev=L[j]+"\W+(?:\w+\W+){1,4}"+L[i]
phrasecoccur=len(re.findall(wr, content))+len(re.findall(wrev,content))
if (phrasecoccur>0):
f.write(L[i]+", "+L[j]+", "+str(phrasecoccur)+"\n")
Essentially, for each pair of phrases in the list L, I am checking how many times these phrases appear within a window of 4 words in the document content. However, this method is computationally inefficient when the list L is large, e.g. 40K elements. Is there a better way of doing this?
You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:
"brown fox", 3
"lazy dog", 8
etc.
You can either capture all of the output and post-process it, or you can process the matches as they're found.
It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it's linear in the number of input tokens, number of phrases, and number of matches.
I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.
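As an illustration of that approach, here is a minimal sketch using the third-party pyahocorasick package (pip install pyahocorasick); the package choice and the phrase list are assumptions, and converting character offsets to word positions and counting pairs is left as post-processing:
import ahocorasick

phrases = ["brown fox", "lazy dog"]

automaton = ahocorasick.Automaton()
for idx, phrase in enumerate(phrases):
    automaton.add_word(phrase, phrase)
automaton.make_automaton()

content = "a quick brown fox jumps over the lazy dog"
for end_pos, phrase in automaton.iter(content):
    start_pos = end_pos - len(phrase) + 1
    print(phrase, start_pos)  # each phrase with the character offset where it starts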
It should be possible to assemble your 40000 phrases into a big regular expression pattern, and use that to match against your document. It might not be as fast as something more job-specific, but it does work. Here's how I'd do it:
import re

class Matcher(object):
    def __init__(self, phrases):
        phrase_pattern = "|".join("(?:{})".format(phrase) for phrase in phrases)
        gap_pattern = r"\W+(?:\w+\W+){0,4}?"
        full_pattern = "({0}){1}({0})".format(phrase_pattern, gap_pattern)
        self.regex = re.compile(full_pattern)

    def match(self, doc):
        return self.regex.findall(doc)  # or use finditer to generate match objs
Here's how you can use it:
>>> L = ["brown fox", "lazy dog"]
>>> matcher = Matcher(L)
>>> doc = "The quick brown fox jumps over the lazy dog."
>>> matcher.match(doc)
[('brown fox', 'lazy dog')]
This solution does have a few limitations. One is that it won't detect overlapping pairs of phrases. So in the example, if you added the phrase "jumps over" to the phrase list, you would still only get one matched pair, ("brown fox", "jumps over"). It would miss both ("brown fox", "lazy dog") and ("jumps over", "lazy dog"), since they include some of the same words.
Expanding on Joel's answer, your iterator could be something like this:
def doc_iter(doc):
    # doc is a list of words; yield each sliding window of 4 consecutive words
    words = doc[0:4]
    yield words
    for i in range(4, len(doc)):
        words = words[1:]
        words.append(doc[i])
        yield words
Put your phrases in a dict and use the iterator over the doc, checking for the phrases at each iteration. This should give you performance between O(n) and O(n*log(n)).
