I'm using the dependency parser to see if a sentence matches a rule with exceptions. For example, I'm trying to find all sentences whose noun subject does not have a complement word (adjective, compound, etc.).
A positive case is:
The school is built in 1978.
A negative case is:
The Blue Sky Airline is 70 years old.
My current spaCy pattern matches both cases.
[
    {"RIGHT_ID": "copula", "RIGHT_ATTRS": {"LEMMA": "be"}},
    # subject of the verb
    {
        "LEFT_ID": "copula",
        "REL_OP": ">",
        "RIGHT_ID": "subject",
        "RIGHT_ATTRS": {"DEP": "nsubj"},
    },
]
Is there a negative REL_OP? I want to exclude some relations between tokens.
There is no negative REL_OP.
I haven't seen this come up before... It's a little weird, but your best option might be to match the complements you want to exclude, and keep any sentence with no matches.
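A minimal sketch of that approach with spaCy 3's DependencyMatcher, reusing the pattern from the question and adding a third node for the modifiers to exclude (amod/compound are stand-ins for whatever complement words you care about, and whether a given sentence gets flagged depends on the parse):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "copula", "RIGHT_ATTRS": {"LEMMA": "be"}},
    {"LEFT_ID": "copula", "REL_OP": ">", "RIGHT_ID": "subject",
     "RIGHT_ATTRS": {"DEP": "nsubj"}},
    # extra node: any modifier of the subject that should disqualify it
    {"LEFT_ID": "subject", "REL_OP": ">", "RIGHT_ID": "modifier",
     "RIGHT_ATTRS": {"DEP": {"IN": ["amod", "compound"]}}},
]
matcher.add("MODIFIED_SUBJECT", [pattern])

doc = nlp("The school is built in 1978. The Blue Sky Airline is 70 years old.")
# sentences whose subject has one of the unwanted modifiers
flagged = {doc[token_ids[0]].sent.start for _, token_ids in matcher(doc)}
for sent in doc.sents:
    if sent.start not in flagged:
        print(sent.text)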
Related
I have a similar question to the one asked in this post: How to define a repeating pattern consisting of multiple tokens in spacy? The difference in my case, compared to the linked post, is that my pattern is defined by POS and dependency tags. As a consequence, I don't think I could easily use regex to solve my problem (as is suggested in the accepted answer of the linked post).
For example, let's assume we analyze the following sentence:
"She told me that her dog was big, black and strong."
The following code would allow me to match the list of adjectives at the end of the sentence:
import spacy # I am using spacy 2
from spacy.matcher import Matcher
nlp = spacy.load('en_core_web_sm')
# Create doc object from text
doc = nlp(u"She told me that her dog was big, black and strong.")
# Set up pattern matching
matcher = Matcher(nlp.vocab)
pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}, {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
matcher.add("AdjList", [pattern])
matches = matcher(doc)
Running this code would match "big, black and strong". However, this pattern would not find the list of adjectives in the following sentences "She told me that her dog was big and black" or "She told me that her dog was big, black, strong and playful".
How would I have to define a (single) pattern for spacy's matcher in order to find such a list with any number of adjectives? Put differently, I am looking for the correct syntax for a pattern where the part {"POS": "ADJ"}, {"IS_PUNCT": True} can be repeated arbitrarily often before the list concludes with the pattern {"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}.
Thanks for any hints.
The solution/issue isn't fundamentally different from the linked question: there's no facility for repeating multi-token patterns in a match like that. You can use a for loop to build multiple patterns to capture what you want.
patterns = []
for ii in range(0, 5):  # ii = number of leading "ADJ ," repeats; ii = 0 covers "big and black"
    pattern = [{"POS": "ADJ"}, {"IS_PUNCT": True}] * ii
    pattern += [{"POS": "ADJ"}, {"POS": "CCONJ"}, {"POS": "ADJ"}]
    patterns.append(pattern)
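For example (assuming the nlp pipeline from the question), you can register all the generated patterns under one key and run them together. The shorter patterns also fire inside the longer lists, so you may want to keep only the longest span, e.g. with spacy.util.filter_spans:

import spacy
from spacy.matcher import Matcher

matcher = Matcher(nlp.vocab)
matcher.add("AdjList", patterns)  # spaCy 3 signature; in v2 use matcher.add("AdjList", None, *patterns)

for text in ("She told me that her dog was big and black.",
             "She told me that her dog was big, black, strong and playful."):
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    for span in spacy.util.filter_spans(spans):  # keep longest, drop overlaps
        print(span.text)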
Alternatively, you could do something with the dependency matcher. In your example sentence it's not that clean, but for a sentence like "It was a big, brown, playful dog", the adjectives all have dependency arcs directly connecting them to the noun.
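That alternative might look like the following sketch (assuming spaCy 3's DependencyMatcher and the nlp pipeline loaded earlier; in this sentence each adjective attaches to the noun with an amod arc):

from spacy.matcher import DependencyMatcher

dep_matcher = DependencyMatcher(nlp.vocab)
dep_pattern = [
    {"RIGHT_ID": "noun", "RIGHT_ATTRS": {"POS": "NOUN"}},
    # each adjective hangs off the noun as an "amod" dependent
    {"LEFT_ID": "noun", "REL_OP": ">", "RIGHT_ID": "adj",
     "RIGHT_ATTRS": {"DEP": "amod"}},
]
dep_matcher.add("NOUN_WITH_ADJ", [dep_pattern])

doc = nlp("It was a big, brown, playful dog.")
for match_id, token_ids in dep_matcher(doc):  # one match per adjective
    noun, adj = doc[token_ids[0]], doc[token_ids[1]]
    print(adj.text, "->", noun.text)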
As a separate note, you are not handling sentences with the serial comma.
I am trying to construct a dependency matcher that catches certain phrases in a document and prints out the paragraphs containing those phrases. The phrases come from a pre-existing long list of verb-noun combinations.
The wider purpose of this exercise is to comb through a large set of PDF documents to analyze what types of activities were undertaken, by whom, and with what frequency. The task is split into two parts. The first is to extract paragraphs containing these phrases (verb-noun, etc.) for humans to look at, and to verify a random sample of them so we know the parsing is working properly. Then, using other characteristics associated with each PDF, do further analysis of the types of tasks (drafting/create/perform > document/task type "x") being performed, by whom, when, etc.
One example is "draft/prepare" > "procurement and market risk assessment".
I looked at the dependency tree of a sample sentence and then set up the Dependency matcher to work with that. Please see example below.
The sample sentence is "He drafted the procurement & market risk assessment". The dependency chain seems to be draft > assessment > procurement > risk > market.
import spacy
nlp = spacy.load('en_core_web_sm')
from spacy.matcher import DependencyMatcher
dmatcher = DependencyMatcher(nlp.vocab)
doc = nlp("""He drafted the procurement & market risk assessment.""")
lemma_list_drpr = ['draft', 'prepare']
print("----- Using Dependency Matcher -----")
deppattern22 = [
    {"SPEC": {"NODE_NAME": "drpr"}, "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
    {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
     "PATTERN": {"LEMMA": "assessment"}},
    {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
     "PATTERN": {"LEMMA": "procurement"}},
]
dmatcher.add("Pat22", patterns=[deppattern22])
for number, mylist in dmatcher(doc):
    for item in mylist:
        print(doc[item[0]].sent)
When I do this, it works.
However, there are many problems here.
When I try to add the "risk" and "market" terms to the matcher, it no longer works:
deppattern22a = [
    {"SPEC": {"NODE_NAME": "drpr"}, "PATTERN": {"LEMMA": {"IN": lemma_list_drpr}}},
    {"SPEC": {"NBOR_NAME": "drpr", "NBOR_RELOP": ">", "NODE_NAME": "ass2"},
     "PATTERN": {"LEMMA": "assessment"}},
    {"SPEC": {"NBOR_NAME": "ass2", "NBOR_RELOP": ">", "NODE_NAME": "proc2"},
     "PATTERN": {"LEMMA": "procurement"}},
    {"SPEC": {"NBOR_NAME": "proc2", "NBOR_RELOP": ">", "NODE_NAME": "risk2"},
     "PATTERN": {"LEMMA": "risk"}},
    {"SPEC": {"NBOR_NAME": "risk2", "NBOR_RELOP": ">", "NODE_NAME": "mkt2"},
     "PATTERN": {"LEMMA": "market"}},
]
Moreover, when I change the sentence text a little, replacing "&" with "and", the dependency parse changes, so my dependency matcher fails again. The chain becomes draft > procurement > assessment > ..., whereas in the earlier sample sentence it was draft > assessment > procurement > ...
The dependency changes back when I add other text to the sentence.
What would be a good way to find such matches that are not sensitive to minor changes in sentence structure?
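One hedged possibility, not from the original thread: spaCy 3's DependencyMatcher has a transitive operator >> ("the left node is an ancestor of the right node"), which lets you anchor each noun directly on the verb instead of on a fixed chain, so the intermediate attachments no longer matter. A minimal sketch (spaCy 3 API rather than the SPEC-style API above; whether both variants match still depends on the parses):

import spacy
from spacy.matcher import DependencyMatcher

nlp = spacy.load("en_core_web_sm")
matcher = DependencyMatcher(nlp.vocab)

pattern = [
    {"RIGHT_ID": "verb", "RIGHT_ATTRS": {"LEMMA": {"IN": ["draft", "prepare"]}}},
    # ">>" = descendant at any depth, so the exact chain is irrelevant
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "assessment",
     "RIGHT_ATTRS": {"LEMMA": "assessment"}},
    {"LEFT_ID": "verb", "REL_OP": ">>", "RIGHT_ID": "procurement",
     "RIGHT_ATTRS": {"LEMMA": "procurement"}},
]
matcher.add("DRAFT_ASSESSMENT", [pattern])

for text in ("He drafted the procurement & market risk assessment.",
             "He drafted the procurement and market risk assessment."):
    if matcher(nlp(text)):
        print("matched:", text)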
I am trying to identify concepts in texts. Oftentimes I consider that a concept appears in a text when two or more words appear relatively close to each other.
For instance a concept would be any of the words
forest, tree, nature
within a distance of less than 4 words from
fire, burn, overheat
I am learning spacy and so far I can use the matcher like this:
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}],[{"LOWER": "hello"}, {"LOWER": "world"}])
That would match hello world and hello, world (or tree firing for the above-mentioned example).
I am looking for a solution that would yield matches of the words Hello and World within a window of 5 words.
I had a look into:
https://spacy.io/usage/rule-based-matching
and the operators described there, but I am not able to put this word-window approach into "spacy" syntax.
Furthermore, I am not able to generalize that to more words as well.
Some ideas?
Thanks
For a window of K words, where K is relatively small, you can add K-2 optional wildcard tokens between your words. A wildcard means "any token", and in spaCy terms it is just an empty dict. Optional means the token may or may not be there, and in spaCy it is encoded as {"OP": "?"}.
Thus, you can write your matcher as
import spacy
from spacy.matcher import Matcher
nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None, [{"LOWER": "hello"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"}, {"LOWER": "world"}])
which means you look for "hello", then 0 to 3 tokens of any kind, then "world". For example, for
doc = nlp(u"Hello brave new world")
for match_id, start, end in matcher(doc):
    string_id = nlp.vocab.strings[match_id]
    span = doc[start:end]
    print(match_id, string_id, start, end, span.text)
it will print:
15578876784678163569 HelloWorld 0 4 Hello brave new world
And if you want to match the other order (world ? ? ? hello) as well, you need to add a second, symmetric pattern to your matcher.
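For example, reusing the matcher above:

matcher.add("WorldHello", None,
            [{"LOWER": "world"}, {"OP": "?"}, {"OP": "?"}, {"OP": "?"},
             {"LOWER": "hello"}])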
I'm relatively new to spaCy but I think the following pattern should work for any number of tokens between 'hello' and 'world' that are comprised of ASCII characters:
[{"LOWER": "hello"}, {'IS_ASCII': True, 'OP': '*'}, {"LOWER": "world"}]
I tested it using Explosion's rule-based match explorer and it works. Overlapping matches will return just one match (e.g., "hello and I do mean hello world").
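A quick way to try it in code (spaCy 2 matcher signature, matching the question's snippet; note that, unlike the demo page, the matcher in code may report overlapping spans, one per starting "hello"):

import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")
matcher = Matcher(nlp.vocab)
matcher.add("HelloWorld", None,
            [{"LOWER": "hello"}, {"IS_ASCII": True, "OP": "*"}, {"LOWER": "world"}])

doc = nlp("hello and I do mean hello world")
for match_id, start, end in matcher(doc):
    print(doc[start:end].text)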
In the following example:
"noun 1 left and right sides 左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum. 2 [after a numeral] about; or so 八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan. 3 those in close attendance; retinue 屏退左右 Píng tuì zuǒyòu order one's attendants to clear out verb master; control; influence 左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others. adverb dialect anyway; anyhow; in any case 左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you."
I would like to get the string split up based on the part of speech (noun, adjective, adverb, etc.) and also based on the sense numbers, where an entry has several.
So the final result should be:
noun
["left and right sides", "左右摇摆 zuǒ-yòu yáobǎi vacillating; unsteady; hesitant 主席台左右, 红旗迎风飘扬。 Zhǔxítái zuǒyòu, hóngqí yíngfēng piāoyáng. Red flags are fluttering on both sides of the rostrum."]
["[after a numeral] about; or so", "八点钟左右 bā diǎn zhōng zuǒyòu around eight o'clock 一个月左右 yī ge yuè zuǒyòu a month or so 身高一米七左右 Shēngāo yī mǐ qī zuǒyòu be about 1.70 metres in height 价值十元左右。 Jiàzhí shí yuán zuǒyòu. It's worth about 10 yuan."]
["those in close attendance; retinue", "屏退左右 Píng tuì zuǒyòu order one's attendants to clear out"]
verb
["master; control; influence", "左右局势 zuǒyòu júshì be master of the situation; in control 为人所左右 wéi rén suǒ zuǒyòu controlled by another; fall under another’s influence 他这个人不是别人能左右得了的。 Tā zhège rén bù shì biéren néng zuǒyòu déle de. He is not a man to be influenced by others."]
adverb
["dialect anyway; anyhow; in any case", "左右闲没事, 我就陪你走一趟吧。 Zuǒyòu xiánzhe méishì, wǒ jiù péi nǐ zǒu yī tàng ba. Ānyway I’m free now so let me go with you"]
The noun, verb, and adverb should be keys, while each value might be a dict. Since noun has three numbered senses here, it should have three distinct entries.
So the first step is to take the part-of-speech markers (noun, adjective, adverb, verb, etc.) and store them in some variables. But in this case, I fail to get the relevant result for each specific label. For example:
re.findall("(noun|verb|adverb|adjective)", s)
This returns ['noun', 'verb', 'adverb'] as it only focuses on the exact match.
So I added .+ to make it re.findall("(noun|verb|adverb|adjective).+", s) and capture the words after noun, but then the greedy .+ swallowed all the text after the first noun, including everything after verb and adverb (so it returns just ['noun']).
So I hit the wall. Is it possible to get the relevant part but also get the full result except the keyword match?
This is not a job for a regular expression. What you are trying to match is too variable.
Write a proper grammar for the dictionary entry, as if it were a programming language, and then parse your data according to that grammar.
Like this:
Your language keywords are noun, verb, adverb.
Each introduces one unnumbered or several numbered definitions.
Numbering of numbered definitions increases monotonically, so other numbers appearing inside a definition should be treated as part of the definition and not start a new one.
As a sometime lexicographer I would also recommend that you should treat labels like dialect (which are generally drawn from a standard vocabulary) as optional keywords rather than as part of the definition.
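As a rough illustration of that grammar (my own sketch, not a full parser: the keyword set and the rule that a sense number is a whitespace-delimited integer are assumptions fitted to this one entry, and s is assumed to hold the raw entry text):

import re

def split_senses(body):
    # A bare number only opens a new sense when it is the next expected
    # one, so numbers inside definitions (years, prices) are left alone.
    expected, cuts = 1, []
    for m in re.finditer(r'(?<!\S)\d+(?!\S)', body):
        if int(m.group()) == expected:
            cuts.append(m.span())
            expected += 1
    if not cuts:  # a single unnumbered definition
        return [body.strip()]
    starts = [end for _, end in cuts]
    ends = [start for start, _ in cuts[1:]] + [len(body)]
    return [body[a:b].strip() for a, b in zip(starts, ends)]

def parse_entry(s):
    # Keywords open a new part-of-speech section; what follows each
    # keyword (up to the next one) is that section's body.
    chunks = re.split(r'(?<!\S)(noun|verb|adjective|adverb)(?!\S)', s)
    return {pos: split_senses(body)
            for pos, body in zip(chunks[1::2], chunks[2::2])}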
You may use
(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))
Details
(?s) - an inline re.DOTALL equivalent
(noun|verb|adverb|adjective) - Group 1: a word noun, verb, adverb or adjective
(.*?) - Group 2: any 0+ chars as few as possible, up to (but excluding) the first occurrence of:
(?=(?:noun|verb|adverb|adjective|$)) - either noun, verb, adverb, adjective or end of string (as it is a positive lookahead, (?=...), the texts matched do not become part of a match).
In Python, use with re.findall:
re.findall(r'(?s)(noun|verb|adverb|adjective)(.*?)(?=(?:noun|verb|adverb|adjective|$))', s)
Probably the easiest thing will be to re.split the string on the part-of-speech pattern first: re.split('(noun|adjective|verb|adverb)', s). For the provided input, this includes an empty item at the start, and then the rest alternates between part-of-speech labels and the bits in between, which you can then process further.
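A sketch of that processing step (assuming s holds the entry text from the question; note this naive split has none of the monotonic-numbering safeguards described in the grammar-based answer above):

import re

parts = re.split('(noun|adjective|verb|adverb)', s)
# parts[0] is the text before the first label (empty here); after that,
# labels and their bodies alternate, so pair them back up:
entries = list(zip(parts[1::2], parts[2::2]))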
I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
    "abstract": "...long abstract here...",
    "highlights": [
        {
            "concept": "a word",
            "start": 1,
            "end": 10
        },
        {
            "concept": "cancer",
            "start": 123,
            "end": 135
        }
    ]
}
I am looping over each highlight, locating its start index in the abstract (the end doesn't really matter, as I just need a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tokenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option, but the NLTK tokenizer seems such a nice way of doing it that it would be a shame not to make use of it. Or should I somehow reset the start index by finding the number of characters since the previous full stop/exclamation mark/question mark?
You are right, the NLTK tokenizer is really what you should be using in this situation, since it is robust enough to delimit nearly all sentences correctly, including sentences that end with a "quotation." You can do something like this (paragraph from a random generator):
Start with:
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
    for highlight in highlights:
        if highlight in sentence:
            sentencesWithHighlights.append(sentence)
            break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100, 169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
    for index in highlightIndices:
        if 0 < index - subtractFromIndex < len(sentence):
            sentencesWithHighlights.append(sentence)
            break
    subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']
I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match.
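Translated to Python, that might look like the following sketch (the abstract text and highlight list are stand-ins):

import re

abstract = "Cancer rates fell. The cancer study was small. Nothing else changed."
highlights = ["cancer"]

group = "|".join(re.escape(h) for h in highlights)
pattern = re.compile(
    r"(?:[.!?]|^)\s*([^.!?]*(?:" + group + r")[^.!?]*?)(?=\s*[.!?])",
    re.IGNORECASE,
)
for m in pattern.finditer(abstract):
    print(m.group(1))  # each sentence containing a highlight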
Another option (though it's tough to say how reliable it would be with variably formatted text) would be to split the text into a list of sentences and test against them:
re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
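For example:

import re

text = "How does it work? It splits on sentence boundaries. Then you test each piece."
sentences = re.split(r'(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)
with_hits = [s for s in sentences if "test" in s]
print(with_hits)  # ['Then you test each piece.']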