I am trying to construct a dependency matcher that catches certain phrases in a document and prints out the paragraphs containing them. The phrases come from a pre-existing long list of verb-noun combinations.
The wider purpose of this exercise is to pore over a large set of PDF documents to analyze what types of activities were undertaken, by whom, and with what frequency. The task is split into two parts. The first is to extract paragraphs containing these verb-noun phrases so that humans can verify a random sample and confirm the parsing is working properly. The second is to use other characteristics associated with each PDF to further analyze the types of tasks (draft/create/perform > document/task type "x") being performed, by whom, when, and so on.
One example is "draft/prepare" > "procurement and market risk assessment".
I looked at the dependency tree of a sample sentence and then set up the DependencyMatcher to match it. Please see the example below.
The sample sentence is "He drafted the procurement & market risk assessment". The dependency chain seems to be draft > assessment > procurement > risk > market.
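A quick way to see this chain is to print each token's dependency relation and head (standard token attributes; displacy.render shows the same tree graphically):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("He drafted the procurement & market risk assessment.")
for token in doc:
    # token text, its relation to its head, and the head token itself
    print(token.text, token.dep_, token.head.text)

The matcher setup I ended up with follows.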
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import DependencyMatcher
dmatcher = DependencyMatcher(nlp.vocab)
doc = nlp("""He drafted the procurement & market risk assessment.""")
lemma_list_drpr = ['draft', 'prepare']
print("----- Using Dependency Matcher -----")
deppattern22 = [
    {"SPEC": {"NODE_NAME": "drpr"}, "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
    {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
     "PATTERN": {"LEMMA": "assessment"}},
    {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
     "PATTERN": {"LEMMA": "procurement"}}
]
dmatcher.add("Pat22", patterns = [deppattern22])
for number, mylist in dmatcher(doc):
    for item in mylist:
        print(doc[item[0]].sent)
When I do this, it works.
However, there are many problems here.
When I try to add the "risk" and "market" terms to the matcher, it no longer works:
deppattern22a = [
    {"SPEC": {"NODE_NAME": "drpr"}, "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
    {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
     "PATTERN": {"LEMMA": "assessment"}},
    {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
     "PATTERN": {"LEMMA": "procurement"}},
    {"SPEC": {"NBOR_NAME": "proc2", "NBOR_RELOP": ">", "NODE_NAME": "risk2"},
     "PATTERN": {"LEMMA": "risk"}},
    {"SPEC": {"NBOR_NAME": "risk2", "NBOR_RELOP": ">", "NODE_NAME": "mkt2"},
     "PATTERN": {"LEMMA": "market"}}
]
Moreover, when I change the sentence text a little bit, replacing "&" with "and", the dependency tree changes, so my dependency matcher stops working again. The dependency becomes draft > procurement > assessment > ..., whereas in the earlier sample sentence it was draft > assessment > procurement > ...
The dependency changes back when I add other text to the sentence.
What would be a good way to find such matches that are not sensitive to minor changes in sentence structure?
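One way to loosen the pattern (a sketch, written in the newer v3 DependencyMatcher syntax rather than the SPEC/NBOR format above) is to anchor only on the verb and use the transitive >> operator, so each noun just has to appear somewhere beneath the verb rather than in a fixed chain:

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# ">>" matches any descendant, so whether the parse comes out as
# draft > assessment > procurement or draft > procurement > assessment,
# the pattern should still fire.
pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": {"IN": ["draft", "prepare"]}}},
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "assessment",
     "RIGHT_ATTRS": {"LEMMA": "assessment"}},
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "procurement",
     "RIGHT_ATTRS": {"LEMMA": "procurement"}},
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "risk",
     "RIGHT_ATTRS": {"LEMMA": "risk"}},
]
matcher.add("DRAFT_ASSESSMENT", [pattern])

for text in ("He drafted the procurement & market risk assessment.",
             "He drafted the procurement and market risk assessment."):
    doc = nlp(text)
    if matcher(doc):
        print("matched:", text)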
Related
I have a question similar to the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case is that my pattern is defined by POS and dependency tags. As a consequence, I don't think I could easily use regex to solve my problem (as suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.
The solution / issue isn't fundamentally different from the question linked to: there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns to capture what you want.
patterns = []
for ii in range(1, 5):
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
Alternatively, you could do something with the dependency matcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
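A minimal sketch of that dependency-based alternative, assuming the v3 DependencyMatcher API (illustrative only; your code above is v2-style):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Each adjective attaches to the noun via its own "amod" arc, so one
# two-node pattern covers any number of adjectives.
pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "adj",
     "RIGHT_ATTRS": {"DEP": "amod"}},
]
matcher.add("ADJ_NOUN", [pattern])

doc = nlp("It was a big, brown, playful dog.")
for match_id, (noun, adj) in matcher(doc):
    print(doc[adj].text, "->", doc[noun].text)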
As a separate note, your patterns do not handle sentences with a serial comma ("big, black, and strong").
I'm using the dependency parser to see if a sentence matches a rule with exceptions. For example, I'm trying to find all sentences whose noun subject does not have a complement word (adjective, compound, etc.).
A positive case is:
The school is built in 1978.
A negative case is:
The Blue Sky Airline is 70 years old.
My current spaCy pattern matches both cases.
[
    {"RIGHT_ID": "copula", "RIGHT_ATTRS": {"LEMMA": "be"}},
    # subject of the verb
    {
        "LEFT_ID": "copula",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
Is there a negative REL_OP? I want to exclude some relations between tokens.
There is no negative REL_OP.
I haven't seen this come up before... It's a little weird, but your best option might be to match the complements you want to exclude, and keep any sentence with no matches (sketched below).
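A sketch of that inversion, extending your pattern with a third node for the modifier (the amod/compound DEP values are assumptions, and parses vary by model):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

# Match subjects that DO have a complement word...
pattern = [
    {"RIGHT_ID": "copula", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {"LEFT_ID": "copula", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    {"LEFT_ID": "subject", "REL_OP": ">", "RIGHT_ID": "modifier",
     "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]
matcher.add("MODIFIED_SUBJECT", [pattern])

# ...then keep only the sentences where nothing matched.
for text in ["The school is built in 1978.", "The Blue Sky Airline is 70 years old."]:
    doc = nlp(text)
    if not matcher(doc):
        print("keep:", text)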
I would like to run some analysis on documents using different spaCy tools, though I am interested in the DependencyMatcher in particular.
It just so happens that for these documents, I already have the character offsets of some difficult-to-parse entities. A somewhat-contrived example:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [
    {"offsets": (0, 5), "id": "apple"},
    {"offsets": (41, 54), "id": "san-francisco"}
]
# do something here so that `nlp` knows about those entities
doc = nlp(text)
I've thought about doing something like this:
from spacy.lang.en import English
nlp = English()
text = "Apple is opening its first big office in San Francisco."
already_known_entities = [{"offsets":(0,5), "id": "apple"}, {"offsets":(41,54), "id": "san-francisco"}]
ruler = nlp.add_pipe("entity_ruler")
patterns = []
for e in already_known_entities:
    patterns.append({
        "label": "GPE",
        "pattern": text[e["offsets"][0]:e["offsets"][1]]
    })
ruler.add_patterns(patterns)
doc = nlp(text)
This technically works, and it's not the worst solution in the world, but I was still wondering whether offsets can be added to the nlp object directly. As far as I can tell, the Matcher docs don't show anything like this. I also understand this might be a bit of a departure from typical Matcher behavior, where a pattern can be applied to all documents in a corpus, whereas here I want to tag entities at certain offsets only for particular documents. Offsets from one document do not apply to other documents.
You are looking for Doc.char_span.
doc = "Blah blah blah"
span = doc.char_span(0, 4, label="BLAH")
doc.ents = [span]
Note that doc.ents is a tuple, so you can't append to it, but you can convert it to a list and set the ents, for example.
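Applied to the offsets in your question, a sketch might look like this (the ORG/GPE labels are my assumptions):

import spacy

nlp = spacy.blank("en")
doc = nlp("Apple is opening its first big office in San Francisco.")

# char_span returns None when the offsets don't align with token
# boundaries, so filter before assigning.
spans = [doc.char_span(0, 5, label="ORG"),
         doc.char_span(41, 54, label="GPE")]
doc.ents = [s for s in spans if s is not None]
print(doc.ents)  # (Apple, San Francisco)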
I am new to spaCy and to NLP overall.
To understand how spaCy works, I would like to create a function which takes a sentence and returns a dictionary, tuple, or list with the noun and the words describing it.
I know that spaCy creates a tree of the sentence and knows the role of each word (shown in displacy).
But what's the right way to get from:
"A large room with two yellow dishwashers in it"
To:
{noun:"room",adj:"large"}
{noun:"dishwasher",adj:"yellow",adv:"two"}
Or any other solution that gives me all related words in a usable bundle.
Thanks in advance!
This is a very straightforward use of the DependencyMatcher.
import spacy
from spacy.matcher import DependencyMatcher
nlp = spacy.load("en_core_web_sm")
pattern = [
    # anchor token: any noun
    {
        "RIGHT_ID": "target",
        "RIGHT_ATTRS": {"POS": "NOUN"}
    },
    # modifier: a direct child of the noun with an adjectival or numeric relation
    {
        "LEFT_ID": "target",
        "REL_OP": ">",
        "RIGHT_ID": "modifier",
        "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "nummod"]}}
    },
]
matcher = DependencyMatcher(nlp.vocab)
matcher.add("MODIFIERS", [pattern])
text = "A large room with two yellow dishwashers in it"
doc = nlp(text)
for match_id, (target, modifier) in matcher(doc):
    print(doc[modifier], doc[target], sep="\t")
Output:
large room
two dishwashers
yellow dishwashers
It should be easy to turn that into a dictionary or whatever you'd like. You might also want to modify it to take proper nouns as the target, or to support other kinds of dependency relations, but this should be a good start.
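For instance, a sketch of collecting the matches into per-noun dictionaries, reusing the matcher and doc from above (a noun with two modifiers of the same POS would overwrite; use lists of values if that matters):

from collections import defaultdict

# group each modifier under its noun, keyed by the modifier's POS
bundles = defaultdict(dict)
for match_id, (target, modifier) in matcher(doc):
    bundles[doc[target].text][doc[modifier].pos_] = doc[modifier].text
print(dict(bundles))
# e.g. {'room': {'ADJ': 'large'}, 'dishwashers': {'NUM': 'two', 'ADJ': 'yellow'}}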
You may also want to look at the noun chunks feature.
What you want to do is called "noun chunks":
import spacy
nlp = spacy.load('en_core_web_md')
txt = "A large room with two yellow dishwashers in it"
doc = nlp(txt)
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    root = chunk.root
    out[root.pos_] = root
    for tok in chunk:
        if tok != root:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'NOUN': room, 'DET': A, 'ADJ': large},
{'NOUN': dishwashers, 'NUM': two, 'ADJ': yellow},
{'PRON': it}
]
You may notice "noun chunk" doesn't guarantee the root will always be a noun. Should you wish to restrict your results to nouns only:
chunks = []
for chunk in doc.noun_chunks:
    out = {}
    noun = chunk.root
    if noun.pos_ != 'NOUN':
        continue
    out['noun'] = noun
    for tok in chunk:
        if tok != noun:
            out[tok.pos_] = tok
    chunks.append(out)
print(chunks)
[
{'noun': room, 'DET': A, 'ADJ': large},
{'noun': dishwashers, 'NUM': two, 'ADJ': yellow}
]
I am trying to identify concepts in texts. Oftentimes I consider that a concept appears in a text when two or more words appear relatively close to each other.
For instance, a concept would be any of the words
forest, tree, nature
at a distance of less than 4 words from
fire, burn, overheat
I am learning spacy and so far I can use the matcher like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])
That would match "hello world" and "hello, world" (or "tree firing" for the above-mentioned example).
I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.
I had a look at:
https://spacy.io/usage/rule-based-matching
and the operators described there, but I am not able to express this word-window approach in spaCy syntax.
Furthermore, I am not able to generalize it to more words.
Some ideas?
Thanks
For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. A wildcard means "any token", and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.
Thus, you can write your matcher as
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])
which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for
doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
string_id = nlp.vocab.strings[match_id]
span = doc[start:end]
print(match_id, string_id, start, end, span.text)
it will print:
15578876784678163569 HelloWorld 0 4 Hello brave new world
And if you want to match the other order ("world ? ? ? hello") as well, you need to add a second, symmetric pattern to your matcher; see the sketch below for a generalization.
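To generalize this to the word lists in your question, here is a sketch using the same v2-style add() call as above (the lemma lists and the 4-word window come from your question; adjust as needed):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)

concept_a = ["forest", "tree", "nature"]
concept_b = ["fire", "burn", "overheat"]
window = 4

# up to window-2 tokens of any kind between the two concept words
gap = [{"OP": "?"}] * (window - 2)
pattern_ab = [{"LEMMA": {"IN": concept_a}}] + gap + [{"LEMMA": {"IN": concept_b}}]
pattern_ba = [{"LEMMA": {"IN": concept_b}}] + gap + [{"LEMMA": {"IN": concept_a}}]
matcher.add("Concept", None, pattern_ab, pattern_ba)

doc = nlp("The forest could burn quickly.")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)  # e.g. "forest could burn"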
I'm relatively new to spaCy, but I think the following pattern should work for any number of tokens between 'hello' and 'world' that consist of ASCII characters:
[{"LOWER": "hello"}, {"IS_ASCII": True, "OP": "*"}, {"LOWER": "world"}]
I tested it using Explosion's rule-based match explorer and it works. Overlapping matches will return just one match (e.g., "hello and I do mean hello world").