How to match repeating patterns in spaCy?

I have a similar question to the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case compared to the linked post is that my pattern is defined by POS and dependency tags. As a consequence, I don't think I could easily use regex to solve my problem (as is suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.

The solution / issue isn't fundamentally different from the linked question: there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns that capture what you want.
patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
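A quick sketch of wiring those patterns up (this uses the v3-style list signature for matcher.add; spacy.util.filter_spans, which keeps only the longest of overlapping spans, is optional but convenient here):
import spacy
from spacy.matcher import Matcher
from spacy.util import filter_spans

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)

matcher.add("AdjList", patterns)
doc = nlp("She told me that her dog was big, black, strong and playful.")
spans = [doc[start:end] for _, start, end in matcher(doc)]
print([s.text for s in filter_spans(spans)])  # keeps the longest overlapping match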
Alternatively, you could do something with the DependencyMatcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
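A hedged sketch of that idea with spaCy's DependencyMatcher (v3 API; this assumes the parser assigns amod arcs from each adjective to the noun):
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    # any adjective hanging directly off the noun
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "adj", "RIGHT_ATTRS": {"DEP": "amod"}},
]
matcher.add("ADJ_OF_NOUN", [pattern])

doc = nlp("It was a big, brown, playful dog.")
for match_id, token_ids in matcher(doc):
    noun, adj = (doc[i] for i in token_ids)
    print(adj.text, "->", noun.text)
# one match per (noun, adjective) pair, e.g. big -> dog, brown -> dog, playful -> dog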
As a separate note, you are not handling sentences with the serial comma.

Related

Spacy dependency parse: negative rules

I'm using the dependency parser to see if a sentence matches a rule with exceptions. For example, I'm trying to find all sentences whose noun subject does not have a complement word (adjective, compound, etc.).
A positive case is:
The school is built in 1978.
A negative case is:
The Blue Sky Airline is 70 years old.
My current spaCy pattern matches both cases:
[
    {"RIGHT_ID": "copula", "RIGHT_ATTRS": {"LEMMA": "be"}},
    # subject of the verb
    {
        "LEFT_ID": "copula",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
Is there a negative REL_OP? I want to exclude some relations between tokens.
There is no negative REL_OP.
I haven't seen this come up before... It's a little weird, but your best option might be to match the complements you want to exclude, and keep any sentence with no matches.
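A hedged sketch of that inversion with the DependencyMatcher (v3 API; the DEP values to exclude, amod and compound, are assumptions you would tune):
import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)
# match subjects that DO have a complement word, then keep sentences with no match
pattern = [
    {"RIGHT_ID": "copula", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {"LEFT_ID": "copula", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "subject", "REL_OP": ">", "RIGHT_ID": "complement",
     "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]
matcher.add("MODIFIED_SUBJECT", [pattern])

doc = nlp("The school is built in 1978. The Blue Sky Airline is 70 years old.")
for sent in doc.sents:
    if not matcher(sent.as_doc()):  # no complement found on the subject
        print(sent.text)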

Add known matches to Spacy document with character offsets

I would like to run some analysis on documents using different Spacy tools, though I am interested in the Dependency Matcher in particular.
It just so happens that for these documents, I already have the character offsets of some difficult-to-parse entities. A somewhat-contrived example:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets": (0, 5), "id": "apple"},
    {"offsets": (41, 54), "id": "san-francisco"},
]
# do something here so that `nlp` knows about those entities
doc = nlp(text)
I've thought about doing something like this:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [{"offsets":(0,5), "id": "apple"}, {"offsets":(41,54), "id": "san-francisco"}]
ruler = nlp.add_pipe("entity_ruler")
patterns = []
for e in already_known_entities:
    patterns.append({
        "label": "GPE",
        "pattern": text[e["offsets"][0]:e["offsets"][1]],
    })
ruler.add_patterns(patterns)
doc = nlp(text)
This technically works, and it's not the worst solution in the world, but I was still wondering if offsets can be added to the nlp object directly. As far as I can tell, the Matcher docs don't show anything like this. I also understand this might be a bit of a departure from typical Matcher behavior, where a pattern can be applied to all documents in a corpus--whereas here I want to tag entities at certain offsets only for particular documents. Offsets from one document do not apply to other documents.
You are looking for Doc.char_span.
doc = "Blah blah blah"
span = doc.char_span(0, 4, label="BLAH")
doc.ents = [span]
Note that doc.ents is a tuple, so you can't append to it, but you can convert it to a list and set the ents, for example.
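Applied to the question's example, a minimal sketch (this assumes your stored offsets line up with token boundaries; char_span returns None when they don't):
from spacy.lang.en import English

nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets": (0, 5), "id": "apple"},
    {"offsets": (41, 54), "id": "san-francisco"},
]

doc = nlp(text)
spans = []
for e in already_known_entities:
    start, end = e["offsets"]
    span = doc.char_span(start, end, label="GPE")
    if span is not None:  # misaligned offsets yield None
        spans.append(span)
doc.ents = spans
print([(ent.text, ent.label_) for ent in doc.ents])
# [('Apple', 'GPE'), ('San Francisco', 'GPE')]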

spaCy matcher returns the right answer only when two words are set as separate 'TEXT' conditions. Why is that?

I'm trying to set up a matcher that finds the phrase 'iPhone X'.
The sample code says I should do the following:
import spacy
# Import the Matcher
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
# Initialize the Matcher with the shared vocabulary
matcher = Matcher(nlp.vocab)
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone"}, {"TEXT": "X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
# Use the matcher on the doc
matches = matcher(doc)
print("Matches:", [doc[start:end].text for match_id, start, end in matches])
I tried another approach, shown below.
# Create a pattern matching two tokens: "iPhone" and "X"
pattern = [{"TEXT": "iPhone X"}]
# Add the pattern to the matcher
matcher.add("IPHONE_X_PATTERN", None, pattern)
Why is the second approach not working? I assumed that if I put the two words 'iPhone' and 'X' together, it might work the same way, because the matcher would regard the string with a space in the middle as one long unique word. But it didn't.
The possible reason I can think of is:
a matcher condition should be a single word without any whitespace.
Am I right? Or is there another reason the second approach doesn't work?
Thank you.
The answer is in how Spacy tokenizes the string:
>>> print([t.text for t in doc])
['Upcoming', 'iPhone', 'X', 'release', 'date', 'leaked', 'as', 'Apple', 'reveals', 'pre', '-', 'orders']
As you can see, iPhone and X are separate tokens. See the Matcher reference:
A pattern added to the Matcher consists of a list of dictionaries. Each dictionary describes one token and its attributes.
Thus, you cannot use them both in one token definition.
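If you want to treat a multi-word phrase as a single pattern string, the PhraseMatcher is designed for exactly that; a minimal sketch (v3 add signature, without the on_match argument):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
# the phrase is tokenized for you, so "iPhone X" works as one pattern
matcher.add("IPHONE_X_PATTERN", [nlp("iPhone X")])

doc = nlp("Upcoming iPhone X release date leaked as Apple reveals pre-orders")
print([doc[start:end].text for _, start, end in matcher(doc)])
# ['iPhone X']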

spaCy Matcher conditional or/and Python

I want to categorize the following keywords:
import spacy
from spacy.matcher import PhraseMatcher
nlp = spacy.load("en_core_web_sm")
phrase_matcher = PhraseMatcher(nlp.vocab)
cat_patterns = [nlp(text) for text in ('cat', 'cute', 'fat')]
dog_patterns = [nlp(text) for text in ('dog', 'fat')]
matcher = PhraseMatcher(nlp.vocab)
matcher.add('Category1', None, *cat_patterns)
matcher.add('Category2', None, *dog_patterns)
doc = nlp("I have a white cat. It is cute and fat; I have a black dog. It is fat,too")
matches = matcher(doc)
for match_id, start, end in matches:
    rule_id = nlp.vocab.strings[match_id]  # get the unicode ID, i.e. 'CategoryID'
    span = doc[start:end]  # get the matched slice of the doc
    print(rule_id, span.text)
#Output
#Category1 cat
#Category1 cute
#Category1 fat
#Category2 fat
#Category2 dog
#Category1 fat
#Category2 fat
However, my expected output is: if the text contains cat and cute together, or cat and fat together, it falls in the first category; if the text contains dog and fat together, it falls in the second category.
#Category1 cat cute
#Category1 cat fat
#Category2 dog fat
Is it possible to do this with a similar approach? Thank you
From the spaCy documentation on Matchers (https://spacy.io/usage/rule-based-matching), there is no way to detect 2 different tokens separated by an arbitrary number of tokens. If you knew how many tokens were between "cat" and "fat", for example, then you could use wildcard patterns (https://spacy.io/usage/rule-based-matching#adding-patterns-wildcard), but it looks like from your example that distance between tokens can vary.
Two solutions that I can see for your problem:
1. Keep track of matches in your for loop using some sort of data structure. If all the tokens you are looking for end up being found, then add that match to your final results (see the sketch after this list).
2. Use regular expressions to detect what you are looking for. spaCy does have great tools for rule-based matching, but it looks like you aren't using any linguistic aspects of the words you are searching for. A simple regex like /cat.*?fat/ will find the matches you are looking for.
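A minimal sketch of the first idea (the rules dict mapping categories to required word sets is a hypothetical structure you would adapt; it matches over the whole text rather than per sentence):
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.load("en_core_web_sm")
matcher = PhraseMatcher(nlp.vocab)
matcher.add("KEYWORDS", [nlp(t) for t in ("cat", "cute", "fat", "dog")])

# a category applies when all words in one of its sets are found
rules = {
    "Category1": [{"cat", "cute"}, {"cat", "fat"}],
    "Category2": [{"dog", "fat"}],
}

doc = nlp("I have a white cat. It is cute and fat; I have a black dog. It is fat, too")
found = {doc[start:end].text.lower() for _, start, end in matcher(doc)}
for category, combos in rules.items():
    for combo in combos:
        if combo <= found:  # all words of the combination co-occur
            print(category, " ".join(sorted(combo)))
# Category1 cat cute
# Category1 cat fat
# Category2 dog fat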

python spacy looking for two (or more) words in a window

I am trying to identify concepts in texts. Oftentimes I consider that a concept appears in a text when two or more words appear relatively close to each other.
For instance, a concept would be any of the words
forest, tree, nature
at a distance of less than 4 words from
fire, burn, overheat
I am learning spacy and so far I can use the matcher like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])
That would match hello world and hello, world (or tree firing for the above-mentioned example).
I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.
I had a look into:
https://spacy.io/usage/rule-based-matching
and the operators there described, but I am not able to put this word-window approach in "spacy" syntax.
Furthermore, I am not able to generalize that to more words as well.
Some ideas?
Thanks
For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. Wildcard means "any token", and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.
Thus, you can write your matcher as
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])
which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for
doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
it will print you
15578876784678163569 HelloWorld 0 4 Hello brave new world
And if you want to match the other order (world ? ? ? hello) as well, you need to add the second, symmetric pattern into your matcher.
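To cover both orders and make the window size configurable, you can build the patterns programmatically; a hedged sketch (window_pattern is a hypothetical helper, shown with the v3 matcher.add signature):
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

def window_pattern(w1, w2, window=5):
    # up to window - 2 arbitrary optional tokens between the two words
    gap = [{"OP": "?"}] * (window - 2)
    return [{"LOWER": w1}, *gap, {"LOWER": w2}]

# one pattern per order, so "world ... hello" matches too
matcher.add("HelloWorld", [
    window_pattern("hello", "world"),
    window_pattern("world", "hello"),
])

doc = nlp("Hello brave new world")
# a set, since optional wildcards can yield duplicate spans
print({doc[start:end].text for _, start, end in matcher(doc)})
# {'Hello brave new world'}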
I'm relatively new to spaCy, but I think the following pattern should work for any number of tokens between 'hello' and 'world' that consist of ASCII characters:
[{"LOWER": "hello"}, {'IS_ASCII': True, 'OP': '*'}, {"LOWER": "world"}]
I tested it using Explosion's rule-based match explorer and it works. Overlapping matches will return just one match (e.g., "hello and I do mean hello world").
