Extracting subject/object in Spacy

Extracting subject/object in Spacy - python

I am very new to working with Spacy in Python and I have an issue - when identifying the subject/object, Spacy doesn’t label the whole proper noun as the subject/object. For example when working with two similar nouns in the same context (e.g. John Doe and John Smith) spacy confuses Doe and Smith because they are both Johns.
I was wondering how I could solve this issue. For example by injecting something in the lines of “if Doe follows John, then John Doe is the subject, and if Smith follows the word John, then John Smith is the subject”?
Here's what I have so far
if lang == 'en':
dir_path = r'/User/news/articles/en'
nlp = en_core_web_lg.load()
#if name comes after these words he is most likely the object
indObjCombinations = re.compile(r'\b(with|for|against|to|from|without|between)(?:\W+\w+){0,3}?\W+(%s)\b' % '|'.join(names),re.IGNORECASE)
passiveCombinations = re.compile(r'\b(by)(?:\W+\w+){0,3}?\W+(%s)\b' % '|'.join(names),re.IGNORECASE)
objCombinations = re.compile(r'\b(received|receives|welcomes|welcomed)(?:\W+\w+){0,3}?\W+(%s)\b' % '|'.join(names),re.IGNORECASE)
subjCombinations = re.compile(r'\b(%s)(?:\W+\w+){0,1}?\W+(in)\b' % '|'.join(names),re.IGNORECASE)
subjCombinations_andName = re.compile(r'\b(and)(\W+\w+){0,3}\W+(%s)(\W+\w+){0,3}\W+(are)\b' % '|'.join(names),re.IGNORECASE )
subjCombinations_nameAnd = re.compile(r'\b(%s)(\W+\w+){0,3}\W+(and)(\W+\w+){0,3}\W+(are)\b' % '|'.join(names),re.IGNORECASE )
subjBeginning = re.compile(r'\b^(%s)(\W+\w+){0,1}\W*(:)\b' % '|'.join(names),re.IGNORECASE )
def getSubsFromConjunctions(subs):
moreSubs = []
for sub in subs:
# rights is a generator
rights = list(sub.rights)
rightDeps = {tok.lower_ for tok in rights}
if lang == 'en':
if "and" in rightDeps:
moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
if len(moreSubs) > 0:
moreSubs.extend(getSubsFromConjunctions(moreSubs))
def getObjsFromConjunctions(objs):
moreObjs = []
for obj in objs:
# rights is a generator
rights = list(obj.rights)
rightDeps = {tok.lower_ for tok in rights}
if lang == 'en':
if "and" in rightDeps:
moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
if len(moreObjs) > 0:
moreObjs.extend(getObjsFromConjunctions(moreObjs))

Related

matching names across multiple lists

the following code is what I've tried to do so far:
import json
uids = {'483775843796': '"jared trav"','483843796': '"azu jared"', '483843996': '"hello azu"', '44384376': '"bitten virgo"', '48384326': '"bitten hello"', '61063868': '"charm feline voxela derp virgo"', '11136664': '"jessica"', '11485423': '"yukkixxtsuki"', '10401438': '"howen"', '29176667': '"zaku ramba char"', '36976082': '"bulma zelda dame prince"', '99661300': '"voxela"', '76923817': '"juniperrose"', '16179876': '"gnollfighter"', '45012369': '"pianist fuzz t travis blunt trav ttttttttttttttttttyt whole ryann lol tiper cuz"', '62797501': '"asriel"', '73647929': '"voxela"', '95019796': '"dao daoisms"', '70094978': '"mort"', '16233382': '"purrs"', '89270209': '"apocalevie waify"', '42873540': '"tear slash peaches attitude maso lyra juvia innocent"', '61284894': '"pup"', '68487075': '"ninja"', '66451758': '"az"', '23492247': '"vegeta"', '77980169': '"virus"'}
def _whois(string):
a = []
for i in uids:
i = json.loads(uids[i])
i = i.split()
if string in i:
a += i
for i in uids:
i = json.loads(uids[i])
i = i.split()
if bool(set(i) & set(a)) == True:
a += i
return list(set(a))
def whois(string):
a = []
ret = _whois(string)
for i in ret:
a += _whois(i)
return list(set(a))
print(whois("charm"))
I am trying to match a search term with accounts that share an id with the term in it, and then match each of those other accounts that are with the id to other accounts on other ids and so on and basically see all of the linked accounts that start from a single term.
For example, if I searched "charm" it would return: "charm feline voxela derp virgo bitten hello" from the example uids above.
After a certain way down the line of connected accounts it stops matching. How would I successfully do this so that it matches all accounts potentially infinitely?

i think i got it to work:
import json
terms = {'4837759863453450996': '"mamma riyoken"','4833480984509580996': '"mamma heika"','483775980980996': '"nemo heika"','4867568843796': '"control nemo"','4956775843796': '"t control"','483775843796': '"jared trav"','483843796': '"azu jared"', '483843996': '"hello azu"', '44384376': '"bitten virgo"', '48384326': '"bitten hello"', '61063868': '"charm feline voxela derp virgo"', '11136664': '"jessica"', '11485423': '"yukkixxtsuki"', '10401438': '"howen"', '29176667': '"zaku ramba char"', '36976082': '"bulma zelda dame prince"', '99661300': '"voxela"', '76923817': '"juniperrose"', '16179876': '"gnollfighter"', '45012369': '"pianist fuzz t travis blunt trav ttttttttttttttttttyt whole ryann lol tiper cuz"', '62797501': '"asriel"', '73647929': '"voxela"', '95019796': '"dao daoisms"', '70094978': '"mort"', '16233382': '"purrs"', '89270209': '"apocalevie waify"', '42873540': '"tear slash peaches attitude maso lyra juvia innocent"', '61284894': '"pup"', '68487075': '"ninja"', '66451758': '"az"', '23492247': '"vegeta"', '77980169': '"virus"'}
def _search(string):
a = []
for i in terms:
i = json.loads(terms[i])
i = i.split()
if string in i:
a += i
return list(set(a))
def search(string):
a = []
a.append(string)
while True:
l = len(a)
for n in a:
a += _search(n)
a = list(set(a))
if l == len(a):
break
return a
print(search("charm"))

Try this:
ids = {'483775843796': '"jared trav"','483843796': '"azu jared"', '483843996': '"hello azu"', '44384376': '"bitten virgo"', '48384326': '"bitten hello"', '61063868': '"charm feline voxela derp virgo"', '11136664': '"jessica"', '11485423': '"yukkixxtsuki"', '10401438': '"howen"', '29176667': '"zaku ramba char"', '36976082': '"bulma zelda dame prince"', '99661300': '"voxela"', '76923817': '"juniperrose"', '16179876': '"gnollfighter"', '45012369': '"pianist fuzz t travis blunt trav ttttttttttttttttttyt whole ryann lol tiper cuz"', '62797501': '"asriel"', '73647929': '"voxela"', '95019796': '"dao daoisms"', '70094978': '"mort"', '16233382': '"purrs"', '89270209': '"apocalevie waify"', '42873540': '"tear slash peaches attitude maso lyra juvia innocent"', '61284894': '"pup"', '68487075': '"ninja"', '66451758': '"az"', '23492247': '"vegeta"', '77980169': '"virus"'}
def find_word(word,dict):
for i,j in dict.items():
if word.lower() in j.lower():
print(i,j)
find_word('jared', ids)
Result:
483775843796 "jared trav"
483843796 "azu jared"

Find successively connected nouns or pronouns in string

I want to find stand-alone or successively connected nouns in a text. I put together below code, but it is neither efficient nor pythonic. Does anybody have a more pythonic way of finding these nouns with spaCy?
Below code builds a dict with all tokens and then runs through them to find stand-alone or connected PROPN or NOUN until the for-loop runs out of range. It returns a list of the collected items.
def extract_unnamed_ents(doc):
"""Takes a string and returns a list of all succesively connected nouns or pronouns"""
nlp_doc = nlp(doc)
token_list = []
for token in nlp_doc:
token_dict = {}
token_dict['lemma'] = token.lemma_
token_dict['pos'] = token.pos_
token_dict['tag'] = token.tag_
token_list.append(token_dict)
ents = []
k = 0
for i in range(len(token_list)):
try:
if token_list[k]['pos'] == 'PROPN' or token_list[k]['pos'] == 'NOUN':
ent = token_list[k]['lemma']
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if token_list[k+1]['pos'] == 'PROPN' or token_list[k+1]['pos'] == 'NOUN':
ent = ent + ' ' + token_list[k+1]['lemma']
k += 1
if ent not in ents:
ents.append(ent)
except:
pass
k += 1
return ents
Test:
extract_unnamed_ents('Chancellor Angela Merkel and some of her ministers will discuss at a cabinet '
"retreat next week ways to avert driving bans in major cities after Germany's "
'top administrative court in February allowed local authorities to bar '
'heavily polluting diesel cars.')
Out:
['Chancellor Angela Merkel',
'minister',
'cabinet retreat',
'week way',
'ban',
'city',
'Germany',
'court',
'February',
'authority',
'diesel car']

spacy has a way of doing this but I'm not sure it is giving you exactly what you are after
import spacy
text = """Chancellor Angela Merkel and some of her ministers will discuss
at a cabinet retreat next week ways to avert driving bans in
major cities after Germany's top administrative court
in February allowed local authorities to bar heavily
polluting diesel cars.
""".replace('\n', ' ')
nlp = spacy.load("en_core_web_sm")
doc = nlp(text)
print([i.text for i in doc.noun_chunks])
gives
['Chancellor Angela Merkel', 'her ministers', 'a cabinet retreat', 'ways', 'driving bans', 'major cities', "Germany's top administrative court", 'February', 'local authorities', 'heavily polluting diesel cars']
Here, however the i.lemma_ line doesn't really give you what you want (I think this might be fixed by this recent PR).
Since it isn't quite what you are after you could use itertools.groupby like so
import itertools
out = []
for i, j in itertools.groupby(doc, key=lambda i: i.pos_):
if i not in ("PROPN", "NOUN"):
continue
out.append(' '.join(k.lemma_ for k in j))
print(out)
gives
['Chancellor Angela Merkel', 'minister', 'cabinet retreat', 'week way', 'ban', 'city', 'Germany', 'court', 'February', 'authority', 'diesel car']
This should give you exactly the same output as your function (the output is slightly different here but I believe this is due to different spacy versions).
If you are feeling really adventurous you could use a list comprehension
out = [' '.join(k.lemma_ for k in j)
for i, j in itertools.groupby(doc, key=lambda i: i.pos_)
if i in ("PROPN", "NOUN")]
Note I see slightly different results with different spacy versions. The output above is from spacy-2.1.8

Spacy - Chunk NE tokens

Let's say that I have a document, like so:
import spacy
nlp = spacy.load('en')
doc = nlp('My name is John Smith')
[t for t in doc]
> [My, name, is, John, Smith]
Spacy is intelligent enough to realize that 'John Smith' is a multi-token named entity:
[e for e in doc.ents]
> [John Smith]
How can I make it chunk named entities into discrete tokens, like so:
> [My, name, is, John Smith]

Spacy documentation on NER says that you can access token entity annotations using the token.ent_iob_ and token.ent_type_ attributes.
https://spacy.io/usage/linguistic-features#accessing
Example:
import spacy
nlp = spacy.load('en')
doc = nlp('My name is John Smith')
ne = []
merged = []
for t in doc:
# "O" -> current token is not part of the NE
if t.ent_iob_ == "O":
if len(ne) > 0:
merged.append(" ".join(ne))
ne = []
merged.append(t.text)
else:
ne.append(t.text)
if len(ne) > 0:
merged.append(" ".join(ne))
print(merged)
This will print:
['My', 'name', 'is', 'John Smith']

wish to extract compound noun-adjective pairs from a sentence. So, basically I want something like :

For the adjective:
"The company's customer service was terrible."
{customer service, terrible}
For the verb:
"They kept increasing my phone bill"
{phone bill, increasing}
This is a branch questions from this posting
However I'm trying to find adj and verbs corresponding to multi-token phrases/compound nouns such as "customer service" using spacy.
I'm not sure how to do this with spacy, nltk, or any other prepackaged natural language processing software, and I'd appreciate any help!

For simple examples like this, you can use spaCy's dependency parsing with a few simple rules.
First, to identify multi-word nouns similar to the examples given, you can use the "compound" dependency. After parsing a document (e.g., sentence) with spaCy, use a token's dep_ attribute to find it's dependency.
For example, this sentence has two compound nouns:
"The compound dependency identifies compound nouns."
Each token and its dependency is shown below:
import spacy
import pandas as pd
nlp = spacy.load('en')
example_doc = nlp("The compound dependency identifies compound nouns.")
for tok in example_doc:
print(tok.i, tok, "[", tok.dep_, "]")
>>>0 The [ det ]
>>>1 compound [ compound ]
>>>2 dependency [ nsubj ]
>>>3 identifies [ ROOT ]
>>>4 compound [ compound ]
>>>5 nouns [ dobj ]
>>>6 . [ punct ]
for tok in [tok for tok in example_doc if tok.dep_ == 'compound']: # Get list of
compounds in doc
noun = example_doc[tok.i: tok.head.i + 1]
print(noun)
>>>compound dependency
>>>compound nouns
The below function works for your examples. However, it will likely not work for more complicated sentences.
adj_doc = nlp("The company's customer service was terrible.")
verb_doc = nlp("They kept increasing my phone bill")
def get_compound_pairs(doc, verbose=False):
"""Return tuples of (multi-noun word, adjective or verb) for document."""
compounds = [tok for tok in doc if tok.dep_ == 'compound'] # Get list of compounds in doc
compounds = [c for c in compounds if c.i == 0 or doc[c.i - 1].dep_ != 'compound'] # Remove middle parts of compound nouns, but avoid index errors
tuple_list = []
if compounds:
for tok in compounds:
pair_item_1, pair_item_2 = (False, False) # initialize false variables
noun = doc[tok.i: tok.head.i + 1]
pair_item_1 = noun
# If noun is in the subject, we may be looking for adjective in predicate
# In simple cases, this would mean that the noun shares a head with the adjective
if noun.root.dep_ == 'nsubj':
adj_list = [r for r in noun.root.head.rights if r.pos_ == 'ADJ']
if adj_list:
pair_item_2 = adj_list[0]
if verbose == True: # For trying different dependency tree parsing rules
print("Noun: ", noun)
print("Noun root: ", noun.root)
print("Noun root head: ", noun.root.head)
print("Noun root head rights: ", [r for r in noun.root.head.rights if r.pos_ == 'ADJ'])
if noun.root.dep_ == 'dobj':
verb_ancestor_list = [a for a in noun.root.ancestors if a.pos_ == 'VERB']
if verb_ancestor_list:
pair_item_2 = verb_ancestor_list[0]
if verbose == True: # For trying different dependency tree parsing rules
print("Noun: ", noun)
print("Noun root: ", noun.root)
print("Noun root head: ", noun.root.head)
print("Noun root head verb ancestors: ", [a for a in noun.root.ancestors if a.pos_ == 'VERB'])
if pair_item_1 and pair_item_2:
tuple_list.append((pair_item_1, pair_item_2))
return tuple_list
get_compound_pairs(adj_doc)
>>>[(customer service, terrible)]
get_compound_pairs(verb_doc)
>>>[(phone bill, increasing)]
get_compound_pairs(example_doc, verbose=True)
>>>Noun: compound dependency
>>>Noun root: dependency
>>>Noun root head: identifies
>>>Noun root head rights: []
>>>Noun: compound nouns
>>>Noun root: nouns
>>>Noun root head: identifies
>>>Noun root head verb ancestors: [identifies]
>>>[(compound nouns, identifies)]

I needed to solve a similar problem and I wanted to share my solution as Spacy.io custom component.
import spacy
from spacy.tokens import Token, Span
from spacy.language import Language
#Language.component("compound_chainer")
def find_compounds(doc):
Token.set_extension("is_compound_chain", default=False)
com_range = []
max_ind = len(doc)
for idx, tok in enumerate(doc):
if((tok.dep_ == "compound") and (idx < max_ind)):
com_range.append([idx, idx+1])
to_remove = []
intersections = []
for t1 in com_range:
for t2 in com_range:
if(t1 != t2):
s1 = set(t1)
s2 = set(t2)
if(len(s1.intersection(s2)) > 0):
to_remove.append(t1)
to_remove.append(t2)
union = list(s1.union(s2))
if union not in intersections:
intersections.append(union)
r = [t for t in com_range if t not in to_remove]
compound_ranges = r + intersections
spans = []
for cr in compound_ranges:
# Example cr [[0, 1], [3, 4], [12, 13], [16, 17, 18]]
entity = Span(doc, min(cr), max(cr)+1, label="compound_chain")
for token in entity:
token._.set("is_compound_chain", True)
spans.append(entity)
doc.ents = list(doc.ents) + spans
return doc
Github link: https://github.com/eboraks/job-description-nlp-analysis/blob/main/src/components/compound_chainer.py

How to extract subjects in a sentence and their respective dependent phrases?

I am trying to work on subject extraction in a sentence, so that I can get the sentiments in accordance with the subject. I am using nltk in python2.7 for this purpose. Take the following sentence as an example:
Donald Trump is the worst president of USA, but Hillary is better than him
He we can see that Donald Trump and Hillary are the two subjects, and sentiments related to Donald Trump is negative but related to Hillary are positive. Till now, I am able to break this sentence into chunks of noun phrases, and I am able to get the following:
(S
(NP Donald/NNP Trump/NNP)
is/VBZ
(NP the/DT worst/JJS president/NN)
in/IN
(NP USA,/NNP)
but/CC
(NP Hillary/NNP)
is/VBZ
better/JJR
than/IN
(NP him/PRP))
Now, how do I approach in finding the subjects from these noun phrases? Then how do I group the phrases meant for both the subjects together? Once I have the phrases meant for both the subjects separately, I can perform sentiment analysis on both of them separately.
EDIT
I looked into the library mentioned by #Krzysiek (spacy), and it gave me dependency trees as well in the sentences.
Here is the code:
from spacy.en import English
parser = English()
example = u"Donald Trump is the worst president of USA, but Hillary is better than him"
parsedEx = parser(example)
# shown as: original token, dependency tag, head word, left dependents, right dependents
for token in parsedEx:
print(token.orth_, token.dep_, token.head.orth_, [t.orth_ for t in token.lefts], [t.orth_ for t in token.rights])
Here are the dependency trees:
(u'Donald', u'compound', u'Trump', [], [])
(u'Trump', u'nsubj', u'is', [u'Donald'], [])
(u'is', u'ROOT', u'is', [u'Trump'], [u'president', u',', u'but', u'is'])
(u'the', u'det', u'president', [], [])
(u'worst', u'amod', u'president', [], [])
(u'president', u'attr', u'is', [u'the', u'worst'], [u'of'])
(u'of', u'prep', u'president', [], [u'USA'])
(u'USA', u'pobj', u'of', [], [])
(u',', u'punct', u'is', [], [])
(u'but', u'cc', u'is', [], [])
(u'Hillary', u'nsubj', u'is', [], [])
(u'is', u'conj', u'is', [u'Hillary'], [u'better'])
(u'better', u'acomp', u'is', [], [u'than'])
(u'than', u'prep', u'better', [], [u'him'])
(u'him', u'pobj', u'than', [], [])
This gives in depth insights into the dependencies of the different tokens of the sentences. Here is the link to the paper which describes the dependencies between different pairs. How can I use this tree to attach the contextual words for different subjects to them?

I was going through spacy library more, and I finally figured out the solution through dependency management. Thanks to this repo, I figured out how to include adjectives as well in my subjective verb object (making it SVAO's), as well as taking out compound subjects in the query. Here goes my solution:
from nltk.stem.wordnet import WordNetLemmatizer
from spacy.lang.en import English
SUBJECTS = ["nsubj", "nsubjpass", "csubj", "csubjpass", "agent", "expl"]
OBJECTS = ["dobj", "dative", "attr", "oprd"]
ADJECTIVES = ["acomp", "advcl", "advmod", "amod", "appos", "nn", "nmod", "ccomp", "complm",
"hmod", "infmod", "xcomp", "rcmod", "poss"," possessive"]
COMPOUNDS = ["compound"]
PREPOSITIONS = ["prep"]
def getSubsFromConjunctions(subs):
moreSubs = []
for sub in subs:
# rights is a generator
rights = list(sub.rights)
rightDeps = {tok.lower_ for tok in rights}
if "and" in rightDeps:
moreSubs.extend([tok for tok in rights if tok.dep_ in SUBJECTS or tok.pos_ == "NOUN"])
if len(moreSubs) > 0:
moreSubs.extend(getSubsFromConjunctions(moreSubs))
return moreSubs
def getObjsFromConjunctions(objs):
moreObjs = []
for obj in objs:
# rights is a generator
rights = list(obj.rights)
rightDeps = {tok.lower_ for tok in rights}
if "and" in rightDeps:
moreObjs.extend([tok for tok in rights if tok.dep_ in OBJECTS or tok.pos_ == "NOUN"])
if len(moreObjs) > 0:
moreObjs.extend(getObjsFromConjunctions(moreObjs))
return moreObjs
def getVerbsFromConjunctions(verbs):
moreVerbs = []
for verb in verbs:
rightDeps = {tok.lower_ for tok in verb.rights}
if "and" in rightDeps:
moreVerbs.extend([tok for tok in verb.rights if tok.pos_ == "VERB"])
if len(moreVerbs) > 0:
moreVerbs.extend(getVerbsFromConjunctions(moreVerbs))
return moreVerbs
def findSubs(tok):
head = tok.head
while head.pos_ != "VERB" and head.pos_ != "NOUN" and head.head != head:
head = head.head
if head.pos_ == "VERB":
subs = [tok for tok in head.lefts if tok.dep_ == "SUB"]
if len(subs) > 0:
verbNegated = isNegated(head)
subs.extend(getSubsFromConjunctions(subs))
return subs, verbNegated
elif head.head != head:
return findSubs(head)
elif head.pos_ == "NOUN":
return [head], isNegated(tok)
return [], False
def isNegated(tok):
negations = {"no", "not", "n't", "never", "none"}
for dep in list(tok.lefts) + list(tok.rights):
if dep.lower_ in negations:
return True
return False
def findSVs(tokens):
svs = []
verbs = [tok for tok in tokens if tok.pos_ == "VERB"]
for v in verbs:
subs, verbNegated = getAllSubs(v)
if len(subs) > 0:
for sub in subs:
svs.append((sub.orth_, "!" + v.orth_ if verbNegated else v.orth_))
return svs
def getObjsFromPrepositions(deps):
objs = []
for dep in deps:
if dep.pos_ == "ADP" and dep.dep_ == "prep":
objs.extend([tok for tok in dep.rights if tok.dep_ in OBJECTS or (tok.pos_ == "PRON" and tok.lower_ == "me")])
return objs
def getAdjectives(toks):
toks_with_adjectives = []
for tok in toks:
adjs = [left for left in tok.lefts if left.dep_ in ADJECTIVES]
adjs.append(tok)
adjs.extend([right for right in tok.rights if tok.dep_ in ADJECTIVES])
tok_with_adj = " ".join([adj.lower_ for adj in adjs])
toks_with_adjectives.extend(adjs)
return toks_with_adjectives
def getObjsFromAttrs(deps):
for dep in deps:
if dep.pos_ == "NOUN" and dep.dep_ == "attr":
verbs = [tok for tok in dep.rights if tok.pos_ == "VERB"]
if len(verbs) > 0:
for v in verbs:
rights = list(v.rights)
objs = [tok for tok in rights if tok.dep_ in OBJECTS]
objs.extend(getObjsFromPrepositions(rights))
if len(objs) > 0:
return v, objs
return None, None
def getObjFromXComp(deps):
for dep in deps:
if dep.pos_ == "VERB" and dep.dep_ == "xcomp":
v = dep
rights = list(v.rights)
objs = [tok for tok in rights if tok.dep_ in OBJECTS]
objs.extend(getObjsFromPrepositions(rights))
if len(objs) > 0:
return v, objs
return None, None
def getAllSubs(v):
verbNegated = isNegated(v)
subs = [tok for tok in v.lefts if tok.dep_ in SUBJECTS and tok.pos_ != "DET"]
if len(subs) > 0:
subs.extend(getSubsFromConjunctions(subs))
else:
foundSubs, verbNegated = findSubs(v)
subs.extend(foundSubs)
return subs, verbNegated
def getAllObjs(v):
# rights is a generator
rights = list(v.rights)
objs = [tok for tok in rights if tok.dep_ in OBJECTS]
objs.extend(getObjsFromPrepositions(rights))
potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
objs.extend(potentialNewObjs)
v = potentialNewVerb
if len(objs) > 0:
objs.extend(getObjsFromConjunctions(objs))
return v, objs
def getAllObjsWithAdjectives(v):
# rights is a generator
rights = list(v.rights)
objs = [tok for tok in rights if tok.dep_ in OBJECTS]
if len(objs)== 0:
objs = [tok for tok in rights if tok.dep_ in ADJECTIVES]
objs.extend(getObjsFromPrepositions(rights))
potentialNewVerb, potentialNewObjs = getObjFromXComp(rights)
if potentialNewVerb is not None and potentialNewObjs is not None and len(potentialNewObjs) > 0:
objs.extend(potentialNewObjs)
v = potentialNewVerb
if len(objs) > 0:
objs.extend(getObjsFromConjunctions(objs))
return v, objs
def findSVOs(tokens):
svos = []
verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
for v in verbs:
subs, verbNegated = getAllSubs(v)
# hopefully there are subs, if not, don't examine this verb any longer
if len(subs) > 0:
v, objs = getAllObjs(v)
for sub in subs:
for obj in objs:
objNegated = isNegated(obj)
svos.append((sub.lower_, "!" + v.lower_ if verbNegated or objNegated else v.lower_, obj.lower_))
return svos
def findSVAOs(tokens):
svos = []
verbs = [tok for tok in tokens if tok.pos_ == "VERB" and tok.dep_ != "aux"]
for v in verbs:
subs, verbNegated = getAllSubs(v)
# hopefully there are subs, if not, don't examine this verb any longer
if len(subs) > 0:
v, objs = getAllObjsWithAdjectives(v)
for sub in subs:
for obj in objs:
objNegated = isNegated(obj)
obj_desc_tokens = generate_left_right_adjectives(obj)
sub_compound = generate_sub_compound(sub)
svos.append((" ".join(tok.lower_ for tok in sub_compound), "!" + v.lower_ if verbNegated or objNegated else v.lower_, " ".join(tok.lower_ for tok in obj_desc_tokens)))
return svos
def generate_sub_compound(sub):
sub_compunds = []
for tok in sub.lefts:
if tok.dep_ in COMPOUNDS:
sub_compunds.extend(generate_sub_compound(tok))
sub_compunds.append(sub)
for tok in sub.rights:
if tok.dep_ in COMPOUNDS:
sub_compunds.extend(generate_sub_compound(tok))
return sub_compunds
def generate_left_right_adjectives(obj):
obj_desc_tokens = []
for tok in obj.lefts:
if tok.dep_ in ADJECTIVES:
obj_desc_tokens.extend(generate_left_right_adjectives(tok))
obj_desc_tokens.append(obj)
for tok in obj.rights:
if tok.dep_ in ADJECTIVES:
obj_desc_tokens.extend(generate_left_right_adjectives(tok))
return obj_desc_tokens
Now when you pass query such as:
from spacy.lang.en import English
parser = English()
sentence = u"""
Donald Trump is the worst president of USA, but Hillary is better than him
"""
parse = parser(sentence)
print(findSVAOs(parse))
You will get the following:
[(u'donald trump', u'is', u'worst president'), (u'hillary', u'is', u'better')]
Thank you #Krzysiek for your solution too, I actually was unable to go deep into your library to modify it. I rather tried modifying the above mentioned link to solve my problem.

I was recently just solving very similar problem - I needed to extract subject(s), action, object(s). And I open sourced my work so you can check this library:
https://github.com/krzysiekfonal/textpipeliner
This based on spacy(opponent to nltk) but it also based on sentence tree.
So for instance let's get this doc embedded in spacy as example:
import spacy
nlp = spacy.load("en")
doc = nlp(u"The Empire of Japan aimed to dominate Asia and the " \
"Pacific and was already at war with the Republic of China " \
"in 1937, but the world war is generally said to have begun on " \
"1 September 1939 with the invasion of Poland by Germany and " \
"subsequent declarations of war on Germany by France and the United Kingdom. " \
"From late 1939 to early 1941, in a series of campaigns and treaties, Germany conquered " \
"or controlled much of continental Europe, and formed the Axis alliance with Italy and Japan. " \
"Under the Molotov-Ribbentrop Pact of August 1939, Germany and the Soviet Union partitioned and " \
"annexed territories of their European neighbours, Poland, Finland, Romania and the Baltic states. " \
"The war continued primarily between the European Axis powers and the coalition of the United Kingdom " \
"and the British Commonwealth, with campaigns including the North Africa and East Africa campaigns, " \
"the aerial Battle of Britain, the Blitz bombing campaign, the Balkan Campaign as well as the " \
"long-running Battle of the Atlantic. In June 1941, the European Axis powers launched an invasion " \
"of the Soviet Union, opening the largest land theatre of war in history, which trapped the major part " \
"of the Axis' military forces into a war of attrition. In December 1941, Japan attacked " \
"the United States and European territories in the Pacific Ocean, and quickly conquered much of " \
"the Western Pacific.")
You can now create a simple pipes structure(more about pipes in readme of this project):
pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/*"),
NamedEntityFilterPipe(),
NamedEntityExtractorPipe()]),
FindTokensPipe("VERB"),
AnyPipe([SequencePipe([FindTokensPipe("VBD/dobj/NNP"),
AggregatePipe([NamedEntityFilterPipe("GPE"),
NamedEntityFilterPipe("PERSON")]),
NamedEntityExtractorPipe()]),
SequencePipe([FindTokensPipe("VBD/**/*/pobj/NNP"),
AggregatePipe([NamedEntityFilterPipe("LOC"),
NamedEntityFilterPipe("PERSON")]),
NamedEntityExtractorPipe()])])]
engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
And in the result you will get:
>>>[([Germany], [conquered], [Europe]),
([Japan], [attacked], [the, United, States])]
Actually it based strongly (the finding pipes) on another library - grammaregex. You can read about it from a post:
https://medium.com/#krzysiek89dev/grammaregex-library-regex-like-for-text-mining-49e5706c9c6d#.zgx7odhsc
EDITED
Actually the example I presented in readme discards adj, but all you need is to adjust pipe structure passed to engine according to your needs.
For instance for your sample sentences I can propose such structure/solution which give you tuple of 3 elements(subj, verb, adj) per every sentence:
import spacy
from textpipeliner import PipelineEngine, Context
from textpipeliner.pipes import *
pipes_structure = [SequencePipe([FindTokensPipe("VERB/nsubj/NNP"),
NamedEntityFilterPipe(),
NamedEntityExtractorPipe()]),
AggregatePipe([FindTokensPipe("VERB"),
FindTokensPipe("VERB/xcomp/VERB/aux/*"),
FindTokensPipe("VERB/xcomp/VERB")]),
AnyPipe([FindTokensPipe("VERB/[acomp,amod]/ADJ"),
AggregatePipe([FindTokensPipe("VERB/[dobj,attr]/NOUN/det/DET"),
FindTokensPipe("VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
]
engine = PipelineEngine(pipes_structure, Context(doc), [0,1,2])
engine.process()
It will give you result:
[([Donald, Trump], [is], [the, worst])]
A little bit complexity is in the fact you have compound sentence and the lib produce one tuple per sentence - I'll soon add possibility(I need it too for my project) to pass a list of pipe structures to engine to allow produce more tuples per sentence. But for now you can solve it just by creating second engine for compounded sents which structure will differ only of VERB/conj/VERB instead of VERB(those regex starts always from ROOT, so VERB/conj/VERB lead you to just second verb in compound sentence):
pipes_structure_comp = [SequencePipe([FindTokensPipe("VERB/conj/VERB/nsubj/NNP"),
NamedEntityFilterPipe(),
NamedEntityExtractorPipe()]),
AggregatePipe([FindTokensPipe("VERB/conj/VERB"),
FindTokensPipe("VERB/conj/VERB/xcomp/VERB/aux/*"),
FindTokensPipe("VERB/conj/VERB/xcomp/VERB")]),
AnyPipe([FindTokensPipe("VERB/conj/VERB/[acomp,amod]/ADJ"),
AggregatePipe([FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/det/DET"),
FindTokensPipe("VERB/conj/VERB/[dobj,attr]/NOUN/[acomp,amod]/ADJ")])])
]
engine2 = PipelineEngine(pipes_structure_comp, Context(doc), [0,1,2])
And now after you run both engines you will get expected result :)
engine.process()
engine2.process()
[([Donald, Trump], [is], [the, worst])]
[([Hillary], [is], [better])]
This is what you need I think. Of course I just quickly created a pipe structure for given example sentence and it won't work for every case, but I saw a lot of sentence structures and it will already fulfil quite nice percentage, but then you can just add more FindTokensPipe etc for cases which won't work currently and I'm sure after a few adjustment you will cover really good number of possible sentences(english is not too complex so...:)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.

Extracting subject/object in Spacy - python

Related

matching names across multiple lists

Find successively connected nouns or pronouns in string

Spacy - Chunk NE tokens

wish to extract compound noun-adjective pairs from a sentence. So, basically I want something like :

How to extract subjects in a sentence and their respective dependent phrases?

Categories

Resources