Retraining Spacy's Noun Chunks

Retraining Spacy's Noun Chunks - python

I have the following sentences:
sent_1 = 'The cat caught the mouse.'
sent_2 = 'The cat caught and killed the mouse.'
Now I want to know who did what to whoom. Spacy's noun_chunks work perfectly in the first case, indicating "The cat" as the "nsubj" with the chunk.root.head.text being "caught". Likewise, "the mouse" is correctly classified as being the "dobj" with again "caught" as chunk.root.head.text. So it is easy to match these two.
However, in the second case, the nsubj gets "caught" as its chunk.root.head.text while the dobj gets "killed", whereas they actually would belong together. Is there a way to account for this kind of cases?

In the second case 'killed' is the head of the 'the mouse' as it is the text connecting the noun chunk to the rest of the phrase. From the spacy documentation:
Root text: The original text of the word connecting the noun chunk to the rest of the parse.
https://spacy.io/usage/linguistic-features#noun-chunks
N.b. that link has a very similar example to yours - a sentence with multiple noun chunks with different roots. ('Autonomous cars shift insurance liability toward manufacturers')
To answer your question, if you want 'caught' to be found as the head in both instances, then really what you're asking for is to recursively check the head of the tree for each noun_chunk... something like this:
nlp = spacy.load('en_core_web_sm')
doc = nlp('The cat caught and killed the mouse.')
[x.root.head.head for x in doc.noun_chunks]
which avails:
[caught, caught]
N.b, this works for your example but if you needed to handle arbitrary sentences then you'd need to do something a bit more sophisticated, i.e. actually recursing the tree. e.g.
def get_head(x):
return x.head if x.head.head == x.head else get_head(x.head)
resulting in:
doc2 = nlp("Autonomous cars shift insurance liability toward manufacturers away from everyday users") # adapted from the spacy example with an additional NC 'everyday users' added
In [17]: [get_head(x.root.head) for x in doc.noun_chunks]
In [187]: [caught, caught]
In [18]: [get_head(x.root.head) for x in doc2.noun_chunks]
Out[18]: [shift, shift, shift, shift]

Related

Check how many words from a given list occur in list of text/strings

I have a list of text data which contains reviews, something likes this:
1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'
2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
3. 'This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
I have a seperate list of words which I want to know exists in the these reviews:
['food','science','good','buy','feedback'....]
I want to know which of these words are present in the review and select reviews which contains certain number of these words. For example, lets say only select reviews which contains atleast 3 of the words from this list, so it displays all those reviews, but also show which of those were encountered in the review while selecting it.
I have the code for selecting reviews containing at least 3 of the words, but how do I get the second part which tells me which words exactly were encountered. Here is my initial code:
keywords = list(words)
text = list(df.summary.values)
sentences=[]
for element in text:
if len(set(keywords)&set(element.split(' '))) >=3:
sentences.append(element)

To answer the second part, allow me to revisit how to approach the first part. A handy approach here is to cast your review strings into sets of word strings.
Like this:
review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))
Now the review_1 set contains one of every word. Then take your list of words, convert it to a set, and do an intersection.
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
matches = review_1.intersection(words)
The resulting set, matches, contains all the words that are common. The length of this is the number of matches.
Now, this does not work if you cared about how many of each word matches. For example, if the word "food" is found twice in the review and "science" is found once, does that count as matching three words?
If so, let me know via comment and I can write some code to update the answer to include that scenario.
EDIT: Updating to include comment question
If you want to keep a count of how many times each word repeats, then hang onto the review list. Only cast it to set when performing the intersection. Then, use the 'count' list method to count the number of times each match appears in the review. In the example below, I use a dictionary to store the results.
review_1 = "I have bought several of the Vitality canned dog food products and"
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
matches = set(review_1).intersection(words)
match_counts = dict()
for match in matches:
match_counts[match] = words.count(match)

You can use set intersection for finding the common words:
def filter_reviews(data, *, trigger_words = frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
for review in data:
words = review.split() # use whatever method is appropriate to get the words
common = trigger_words.intersection(words)
if len(common) >= 3:
yield review, common

Iterate over a text and find the distance between predefined substrings

I decided I wanted to take a text and find how close some labels were in the text. Basically, the idea is to check if two persons are less than 14 words apart and if they are we say that they are related.
My naive implementation is working, but only if the person is a single word, because I iterate over words.
text = """At this moment Robert who rises at seven and works before
breakfast came in He glanced at his wife her cheek was
slightly flushed he patted it caressingly What s the
matter my dear he asked She objects to my doing nothing
and having red hair said I in an injured tone Oh of
course he can t help his hair admitted Rose It generally
crops out once in a generation said my brother So does the
nose Rudolf has got them both I must premise that I am going
perforce to rake up the very scandal which my dear Lady
Burlesdon wishes forgotten--in the year 1733 George II
sitting then on the throne peace reigning for the moment and
the King and the Prince of Wales being not yet at loggerheads
there came on a visit to the English Court a certain prince
who was afterwards known to history as Rudolf the Third of Ruritania"""
involved = ['Robert', 'Rose', 'Rudolf the Third',
'a Knight of the Garter', 'James', 'Lady Burlesdon']
# my naive implementation
ws = text.split()
l = len(ws)
for wi,w in enumerate(ws):
# Skip if the word is not a person
if w not in involved:
continue
# Check next x words for any involved person
x = 14
for i in range(wi+1,wi+x):
# Avoid list index error
if i >= l:
break
# Skip if the word is not a person
if ws[i] not in involved:
continue
# Print related
print(ws[wi],ws[i])
Now I would like to upgrade this script to allow for multi-word names such as 'Lady Burlesdon'. I am not entirely sure what is the best way to proceed. Any hints are welcome.

You could first preprocess your text so that all the names in text are replaced with single-word ids. The ids would have to be strings that you would not expect to appear as other words in the text. As you preprocess the text, you could keep a mapping of ids to names to know which name corresponds to which id. This would allow to keep your current algorithm as is.

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
d = cmudict.dict()
text = nltk.Text(tokens)
words = [w.lower() for w in text]
regexp = "[A-Za-z]+"
exp = re.compile(regexp)
def nsyl(word):
lowercase = word.lower()
if lowercase not in d:
return 0
else:
first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
second = ''.join(first)
third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
return third
#return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])
sum1 = 0
for a in words:
if exp.match(a):
print a,nsyl(a),
sum1 = sum1 + len(str(nsyl(a)))
print "\nTotal syllables:",sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0 - . Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.

Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function but the second step would be better placed in its own function and the third step would require to be split up itself, and would probably be better tackled with an object oriented approach. If you're in academy you might be able to convince the CS faculty to lend you a post-grad for a couple of months and help you instead of some workshop requirement.
Now to your other questions:
Not loosing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.

To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.

This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
And the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
stress=""
stress_no_punct=""
print line
tokens = [words.lower() for words in nltk.word_tokenize(line)]
for word in tokens:
word_punct = strip_punctuation_stressed(word.lower())
word = word_punct['word']
punct = word_punct['punct']
#print word
if word not in prondict:
# if word is not in dictionary
# add it to the string that includes punctuation
stress= stress+"*"+word+"*"
else:
zero_bool=True
for s in prondict[word]:
# oppose the cmudict bias toward 1
# search for a zero in array returned from prondict
# if it exists use it
# print strip_letters(s),word
if strip_letters(s)=="0":
stress = stress + "0"
stress_no_punct = stress_no_punct + "0"
zero_bool=False
break
if zero_bool:
stress = stress + strip_letters(prondict[word][0])
stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])
if len(punct)>0:
stress= stress+"*"+punct+"*"
return {'stress':stress,'stress_no_punct':stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
# define punctuations
punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
my_str = word
# remove punctuations from the string
no_punct = ""
punct=""
for char in my_str:
if char not in punctuations:
no_punct = no_punct + char
else:
punct = punct+char
return {'word':no_punct,'punct':punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
#print "strip_letters"
nm = ''
for ws in ls:
#print "ws",ws
for ch in list(ws):
#print "ch",ch
if ch.isdigit():
nm=nm+ch
#print "ad to nm",nm, type(nm)
return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}

efficient algorithm for finding co occurrence matrix of phrases

I have a list L of around 40,000 phrases and a document of around 10 million words. what I want to check is which pair of these phrases co occur within a window of 4 words. For example, consider L=["brown fox","lazy dog"]. The document contains the words "a quick brown fox jumps over the lazy dog". I want to see, how many times brown fox and lazy dog appears within an window of four words and store that in a file. I have following code for doing this:
content=open("d.txt","r").read().replace("\n"," ");
for i in range(len(L)):
for j in range(i+1,len(L)):
wr=L[i]+"\W+(?:\w+\W+){1,4}"+L[j]
wrev=L[j]+"\W+(?:\w+\W+){1,4}"+L[i]
phrasecoccur=len(re.findall(wr, content))+len(re.findall(wrev,content))
if (phrasecoccur>0):
f.write(L[i]+", "+L[j]+", "+str(phrasecoccur)+"\n")
Essentially, for each pair of phrases in the list L, I am checking in the document content that how many times these phrases appear within an window of 4 words. However, this method is computationally inefficient when the list L is pretty large, like 40K elements. Is there a better way of doing this?

You could use something similar to the Aho-Corasick string matching algorithm. Build the state machine from your list of phrases. Then start feeding words into the state machine. Whenever a match occurs, the state machine will tell you which phrase matched and at what word number. So your output would be something like:
"brown fox", 3
"lazy dog", 8
etc.
You can either capture all of the output and post-process it, or you can process the matches as they're found.
It takes a little time to build the state machine (a few seconds for 40,000 phrases), but after that it's linear in the number of input tokens, number of phrases, and number of matches.
I used something similar to match 50 million YouTube video titles against the several million song titles and artist names in the MusicBrainz database. Worked great. And very fast.

It should be possible to assemble your 40000 phrases into a big regular expression pattern, and use that to match against your document. It might not be as fast as something more job-specific, but it does work. Here's how I'd do it:
import re
class Matcher(object):
def __init__(self, phrases):
phrase_pattern = "|".join("(?:{})".format(phrase) for phrase in phrases)
gap_pattern = r"\W+(?:\w+\W+){0,4}?"
full_pattern = "({0}){1}({0})".format(phrase_pattern, gap_pattern)
self.regex = re.compile(full_pattern)
def match(self, doc):
return self.regex.findall(doc) # or use finditer to generate match objs
Here's how you can use it:
>>> L = ["brown fox", "lazy dog"]
>>> matcher = Matcher(L)
>>> doc = "The quick brown fox jumps over the lazy dog."
>>> matcher.match(doc)
[('brown fox', 'lazy dog')]
This solution does have a few limitations. One is that it won't detect overlapping pairs of phrases. So in the example, if you added the phrase "jumps over" to the phrase list, you would still only get one matched pair, ("brown fox", "jumps over"). It would miss both ("brown fox", "lazy dog") and ("jumps over", "lazy dog"), since they include some of the same words.

Expanding on Joel's answer, your iterator could be something like this:
def doc_iter(doc):
words=doc[0:4]
yield words
for i in range(3,len(doc)):
words=words[1:]
words.append(doc[i])
yield words
put your phrases in a dict and use the iterator over the doc, checking the phrases at each iteration. This should give you performance between O(n) and O(n*log(n)).

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
abstract: "...long abstract here..."
highlights: [
{
concept: 'a word',
start: 1,
end: 10
}
{
concept: 'cancer',
start: 123,
end: 135
}
]
}
I am looping over each highlight, locating it's start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tonenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it.. Or somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?

You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to handle delimiting mostly all sentences including ending a sentence with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
for highlight in highlights:
if highlight in sentence:
sentencesWithHighlights.append(sentence)
break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
for index in highlightIndices:
if 0 < index - subtractFromIndex < len(sentence):
sentencesWithHighlights.append(sentence)
break
subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']

I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).

Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split('(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.