NLTK: How can I extract information based on sentence maps?

NLTK: How can I extract information based on sentence maps? - python

I know you can use noun extraction to get nouns out of sentences but how can I use sentence overlays/maps to take out phrases?
For example:
Sentence Overlay:
"First, #action; Second, Foobar"
Input:
"First, Dance and Code; Second, Foobar"
I want to return:
action = "Dance and Code"
Normal Noun Extractions wont work because it wont always be nouns
The way sentences are phrased differs so it cant be words[x] ... because the positioning of the words changes

You can slightly rewrite your string templates to turn them into regexps, and see which one (or which ones) match.
>>> template = "First, (?P<action>.*); Second, Foobar"
>>> mo = re.search(template, "First, Dance and Code; Second, Foobar")
>>> if mo:
print(mo.group("action"))
Dance and Code
You can even transform your existing strings into this kind of regexp (after escaping regexp metacharacters like .?*()).
>>> template = "First, #action; (Second, Foobar...)"
>>> re_template = re.sub(r"\\#(\w+)", r"(?P<\g<1>>.*)", re.escape(template))
>>> print(re_template)
First\,\ (?P<action>.*)\;\ \(Second\,\ Foobar\.\.\.\)

Related

Regular expression in python to extract noun phrase but exclude the adjective word like 'other'

This is code I define a noun phrase chunking method
def np_chunking(sentence):
import nltk
from nltk import word_tokenize,pos_tag, ne_chunk
from nltk import Tree
grammer = "NP: {<JJ>*<NN.*>+}\n{<NN.*>+}" # chunker rules. adjective+noun or one or more nouns
sen=sentence
cp=nltk.RegexpParser(grammer)
mychunk=cp.parse(pos_tag(word_tokenize(sen)))
result=mychunk
return result.draw()
It works like this
print(np_chunking("""I like to listen to music from musical genres,such as blues,rock and jazz."""))
But when I change the text into another sentence like
print(np_chunking("""He likes to play basketball,football and other sports."""))
I do want to extract noun phrase chunking with structure like adjective plus noun or mutiple nouns. But in the second example, the word other is in the sutructure of 'np_1, np_2 and other np_3'. After the 'and other' it often comes up with a hypernym.
In the second part
def hyponym_extract(prepared_text, hearst_patterns):
text=merge_NP(prepared_text)
hyponyms=[]
result=[]
if re.search(hearst_patterns[0][0],text)!=None:
matches=re.search(hearst_patterns[0][0],text)
NP_match=re.findall(r"NP_\w+",matches.group(0))
hyponyms=NP_match[1:]
result=[(NP_match[0],x) for x in hyponyms]
if re.search(hearst_patterns[1][0],text)!=None:
matches=re.search(hearst_patterns[1][0],text)
NP_match=re.findall(r"NP_\w+",matches.group(0))
hyponyms=NP_match[:-1]
result=[(NP_match[-1],x) for x in hyponyms]
return result
hearst_patterns = [("(NP_\w+ (, )?such as (NP_\w+ ?(, )?(and |or )?)+)", "first"),
("((NP_\w+ ?(, )?)+(and |or )?other NP_\w+)","last")] # two examples for hearst pattern
print(hyponym_extract(prepare_chunks(np_chunking("I like to listen to music from musical genres,such as blues,rock and jazz.")),hearst_patterns))
print(hyponym_extract(prepare_chunks(np_chunking("He likes to play basketball,football and other sports.")),hearst_patterns))
The other is a part of the hearst pattern to extract hypernym and hyponyms.
So how could I improve my first code to let the second one work correctly?

ordering text by the agreement with keywords

I am trying to order a number of short paragraphs by their agreement with a list of keywords. This is used to provide a user with the text ordered by interest.
Let's assume I already have the list of keywords, hopefully reflecting the users interest. I thought this is a fairly standard procedure and expected some python package for that. But so far my Google search was not very successful.
I can easily come up with a brute force solution myself, but I was wondering whether somebody knows an efficient way to do this?
EDIT:
Ok here is an example:
keywords = ['cats', 'food', 'Miau']
text1 = 'This is text about dogs'
text2 = 'This is text about food'
text3 = 'This is text about cat food'
I need a procedure which leads to the order text3, text2, text1
thanks

This is the simplest thing I can think of:
import string
input = open('document.txt', 'r')
text = input.read()
table = string.maketrans("","")
text = text.translate(table, string.punctuation)
wordlist = text.split()
agreement_cnt = 0
for word in list_of_keywords:
agreement_cnt += wordlist.count(word)
got the removing punctuation bit from here: Best way to strip punctuation from a string in Python.

Something like this might be a good starting point:
>>> keywords = ['cats', 'food', 'Miau']
>>> text1 = 'This is a text about food fed to cats'
>>> matched_word_count = len(set(text1.split()).intersection(set(keywords)))
>>> print matched_word_count
2
If you want to correct for capitalization or capture word forms (i.e. 'cat' instead of 'cats'), there's obviously more to consider, though.
Taking the above and capturing match counts for a list of different strings, and then sorting the results to find the "best" match, should be relatively simple.

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
d = cmudict.dict()
text = nltk.Text(tokens)
words = [w.lower() for w in text]
regexp = "[A-Za-z]+"
exp = re.compile(regexp)
def nsyl(word):
lowercase = word.lower()
if lowercase not in d:
return 0
else:
first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
second = ''.join(first)
third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
return third
#return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])
sum1 = 0
for a in words:
if exp.match(a):
print a,nsyl(a),
sum1 = sum1 + len(str(nsyl(a)))
print "\nTotal syllables:",sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0 - . Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.

Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function but the second step would be better placed in its own function and the third step would require to be split up itself, and would probably be better tackled with an object oriented approach. If you're in academy you might be able to convince the CS faculty to lend you a post-grad for a couple of months and help you instead of some workshop requirement.
Now to your other questions:
Not loosing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.

To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.

This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
And the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
stress=""
stress_no_punct=""
print line
tokens = [words.lower() for words in nltk.word_tokenize(line)]
for word in tokens:
word_punct = strip_punctuation_stressed(word.lower())
word = word_punct['word']
punct = word_punct['punct']
#print word
if word not in prondict:
# if word is not in dictionary
# add it to the string that includes punctuation
stress= stress+"*"+word+"*"
else:
zero_bool=True
for s in prondict[word]:
# oppose the cmudict bias toward 1
# search for a zero in array returned from prondict
# if it exists use it
# print strip_letters(s),word
if strip_letters(s)=="0":
stress = stress + "0"
stress_no_punct = stress_no_punct + "0"
zero_bool=False
break
if zero_bool:
stress = stress + strip_letters(prondict[word][0])
stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])
if len(punct)>0:
stress= stress+"*"+punct+"*"
return {'stress':stress,'stress_no_punct':stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
# define punctuations
punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
my_str = word
# remove punctuations from the string
no_punct = ""
punct=""
for char in my_str:
if char not in punctuations:
no_punct = no_punct + char
else:
punct = punct+char
return {'word':no_punct,'punct':punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
#print "strip_letters"
nm = ''
for ws in ls:
#print "ws",ws
for ch in list(ws):
#print "ch",ch
if ch.isdigit():
nm=nm+ch
#print "ad to nm",nm, type(nm)
return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
abstract: "...long abstract here..."
highlights: [
{
concept: 'a word',
start: 1,
end: 10
}
{
concept: 'cancer',
start: 123,
end: 135
}
]
}
I am looping over each highlight, locating it's start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tonenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it.. Or somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?

You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to handle delimiting mostly all sentences including ending a sentence with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
for highlight in highlights:
if highlight in sentence:
sentencesWithHighlights.append(sentence)
break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
for index in highlightIndices:
if 0 < index - subtractFromIndex < len(sentence):
sentencesWithHighlights.append(sentence)
break
subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']

I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).

Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split('(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)

How to find collocations in text, python

How do you find collocations in text?
A collocation is a sequence of words that occurs together unusually often.
python has built-in func bigrams that returns word pairs.
>>> bigrams(['more', 'is', 'said', 'than', 'done'])
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]
>>>
What's left is to find bigrams that occur more often based on the frequency of individual words. Any ideas how to put it in the code?

Try NLTK. You will mostly be interested in nltk.collocations.BigramCollocationFinder, but here is a quick demonstration to show you how to get started:
>>> import nltk
>>> def tokenize(sentences):
... for sent in nltk.sent_tokenize(sentences.lower()):
... for word in nltk.word_tokenize(sent):
... yield word
...
>>> nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
<Text: mary had a little lamb ....>
>>> text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
There are none in this small segment, but here goes:
>>> text.collocations(num=20)
Building collocations list

Here is some code that takes a list of lowercase words and returns a list of all bigrams with their respective counts, starting with the highest count. Don't use this code for large lists.
from itertools import izip
words = ["more", "is", "said", "than", "done", "is", "said"]
words_iter = iter(words)
next(words_iter, None)
count = {}
for bigram in izip(words, words_iter):
count[bigram] = count.get(bigram, 0) + 1
print sorted(((c, b) for b, c in count.iteritems()), reverse=True)
(words_iter is introduced to avoid copying the whole list of words as you would do in izip(words, words[1:])

import itertools
from collections import Counter
words = ['more', 'is', 'said', 'than', 'done']
nextword = iter(words)
next(nextword)
freq=Counter(zip(words,nextword))
print(freq)

A collocation is a sequence of tokens that are better treated as a single token when parsing e.g. "red herring" has a meaning that can't be derived from its components. Deriving a useful set of collocations from a corpus involves ranking the n-grams by some statistic (n-gram frequency, mutual information, log-likelihood, etc) followed by judicious manual editing.
Points that you appear to be ignoring:
(1) the corpus must be rather large ... attempting to get collocations from one sentence as you appear to suggest is pointless.
(2) n can be greater than 2 ... e.g. analysing texts written about 20th century Chinese history will throw up "significant" bigrams like "Mao Tse" and "Tse Tung".
What are you actually trying to achieve? What code have you written so far?

Agree with Tim McNamara on using nltk and problems with the unicode. However, I like the text class a lot - there is a hack that you can use to get the collocations as list , i discovered it looking at the source code . Apparently whenever you invoke the collocations method it saves it as a class variable!
import nltk
def tokenize(sentences):
for sent in nltk.sent_tokenize(sentences.lower()):
for word in nltk.word_tokenize(sent):
yield word
text = nltk.Text(tkn for tkn in tokenize('mary had a little lamb.'))
text.collocations(num=20)
collocations = [" ".join(el) for el in list(text._collocations)]
enjoy !

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.