Extract sentence based on regex conditions in python

Extract sentence based on regex conditions in python - python

I have a dataset containing 9000 sentences from which I need 20/20 statements based upon some conditions. However, when I try to match those conditions either the sentence is outputted or the conditions are not met. The first 20 sentences should contain one verb.
For the second part I would like to have sentences that contain more than 2 verbs.
Right now I have the following code for checking if the amount of verbs is less than 2
import re
import spacy
import en_core_web_md
nlp=en_core_web_md.load()
test = "This sentence has just 1 verb"
test2 = "I have put multiple verbs in this sentence because it is possible and I want it"
doc1 = nlp(test)
doc2 = nlp(test2)
empt = []
for item in doc1.sents:
verbs = 0
for token in item:
if token.pos_ == "VERB":
verbs += 1
if verbs < 2:
empt.append(item)
However, I end up with an empty list.
Can someone tell me what I am doing wrong so i can adjust this code for every additional condition?

You just need to pull the last two lines back two indentation levels. You only want to check the number of verbs in the entire sentence after all the tokens have been considered.

Related

Count number of texts in which a word occurs

I am building a word frequency, and relative frequency, list(s) for a collection of text files. Having discovered, by hand, that a couple of texts can overly influence the frequency of a word, one of the things I want to be able to do is count the number of times a word occurs. It strikes me that there are two ways to do this:
First, to compile a word frequency dictionary (as below -- and I'm not using the NLTK FreqDist because this code actually runs more quickly but if FreqDist has the above functionality built-in and I just didn't know it, I'll take it):
import nltk
tokenizer = nltk.tokenize.RegexpTokenizer(r'\w+')
freq_dic = {}
for text in ftexts:
words = tokenizer.tokenize(text)
for word in words:
# form dictionary
try:
freq_dic[word] += 1
except:
freq_dic[word] = 1
From there, I assume I'll need to write another loop that uses the keys above as keywords:
# This is just scratch code
for text in ftexts:
while True:
if keyword not in line:
continue
else:
break
count = count + 1
And then I'll find some way to mesh these two dictionaries into a tuple or, possibly, a pandas dataframe by word, such that:
word1, frequency, # of texts in which it occurs
word2, frequency, # of texts in which it occurs
The other thing that occurred to me as I was writing this question was to use SciKit's term frequency matrix and then count rows in which a word occurs? Is that possible?
ADDED TO CLARIFY:
Imagine three sentences:
["I need to keep count of the children.",
"If you want to know what the count is, just ask."
"There is nothing here but chickens, chickens, chickens."]
"count" occurs 2x but is in two different texts; "chickens" occurs three times, but is in only one text. What I want is a report that looks like this:
WORD, FREQ, TEXTS
count, 2, 2
chicken, 3, 1

Python Count Number of Phrases in Text

I have a list of product reviews/descriptions in excel and I am trying to classify them using Python based on words that appear in the reviews.
I import both the reviews, and a list of words that would indicate the product falling into a certain classification, into Python using Pandas and then count the number of occurrences of the classification words.
This all works fine for single classification words e.g. 'computer' but I am struggling to make it work for phrases e.g. 'laptop case'.
I have look through a few answers but none were successful for me including:
using just text.count(['laptop case', 'laptop bag']) as per the answer here: Counting phrase frequency in Python 3.3.2 but because you need to split the text up that does not work (and I think maybe text.count does not work for lists either?)
Other answers I have found only look at the occurrence of a single word. Is there something I can do to count words and phrases that does not involve the splitting of the body of text into individual words?
The code I currently have (that works for individual terms) is:
for i in df1.index:
descriptions = df1['detaileddescription'][i]
if type(descriptions) is str:
descriptions = descriptions.split()
pool.append(sum(map(descriptions.count, df2['laptop_bag'])))
else:
pool.append(0)
print(pool)

You're on the right track! You're currently splitting into single words, which facilitates finding occurrences of single words as you pointed out. To find phrases of length n you should split the text into chunks of length n, which are called n-grams.
To do that, check out the NLTK package:
from nltk import ngrams
sentence = 'I have a laptop case and a laptop bag'
n = 2
bigrams = ngrams(sentence.split(), n)
for gram in bigrams:
print(gram)

Sklearn's CountVectorizer is the standard way
from sklearn.feature_extraction import text
vectorizer = text.CountVectorizer()
vec = vectorizer.fit_transform(descriptions)
And if you want to see the counts as a dict:
count_dict = {k:v for k,v in zip(vectorizer.get_feature_names(), vec.toarray()[0]) if v>0}
print (count_dict)
The default is unigrams, you can use bigrams or higher ngrams with the ngram_range parameter

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
d = cmudict.dict()
text = nltk.Text(tokens)
words = [w.lower() for w in text]
regexp = "[A-Za-z]+"
exp = re.compile(regexp)
def nsyl(word):
lowercase = word.lower()
if lowercase not in d:
return 0
else:
first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
second = ''.join(first)
third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
return third
#return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])
sum1 = 0
for a in words:
if exp.match(a):
print a,nsyl(a),
sum1 = sum1 + len(str(nsyl(a)))
print "\nTotal syllables:",sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word - '1' - when there is not - '0 - . Which is why the output looks like this: 1101111101, when it should be the stress of iambic pentameter: 0101010101
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.

Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function but the second step would be better placed in its own function and the third step would require to be split up itself, and would probably be better tackled with an object oriented approach. If you're in academy you might be able to convince the CS faculty to lend you a post-grad for a couple of months and help you instead of some workshop requirement.
Now to your other questions:
Not loosing line breaks: as #ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.

To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permutate all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.

This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
And the code included in this question helped me, so I post what I came up with that builds on that foundation. It is one way to extract the stress as a single string, correct with a 'fudging factor' for the cmudict bias, and not lose words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
stress=""
stress_no_punct=""
print line
tokens = [words.lower() for words in nltk.word_tokenize(line)]
for word in tokens:
word_punct = strip_punctuation_stressed(word.lower())
word = word_punct['word']
punct = word_punct['punct']
#print word
if word not in prondict:
# if word is not in dictionary
# add it to the string that includes punctuation
stress= stress+"*"+word+"*"
else:
zero_bool=True
for s in prondict[word]:
# oppose the cmudict bias toward 1
# search for a zero in array returned from prondict
# if it exists use it
# print strip_letters(s),word
if strip_letters(s)=="0":
stress = stress + "0"
stress_no_punct = stress_no_punct + "0"
zero_bool=False
break
if zero_bool:
stress = stress + strip_letters(prondict[word][0])
stress_no_punct=stress_no_punct + strip_letters(prondict[word][0])
if len(punct)>0:
stress= stress+"*"+punct+"*"
return {'stress':stress,'stress_no_punct':stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
# define punctuations
punctuations = '!()-[]{};:"\,<>./?##$%^&*_~'
my_str = word
# remove punctuations from the string
no_punct = ""
punct=""
for char in my_str:
if char not in punctuations:
no_punct = no_punct + char
else:
punct = punct+char
return {'word':no_punct,'punct':punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
#print "strip_letters"
nm = ''
for ws in ls:
#print "ws",ws
for ch in list(ws):
#print "ch",ch
if ch.isdigit():
nm=nm+ch
#print "ad to nm",nm, type(nm)
return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}

Finding the surrounding sentence of a char/word in a string

I am trying to get sentences from a string that contain a given substring using python.
I have access to the string (an academic abstract) and a list of highlights with start and end indexes. For example:
{
abstract: "...long abstract here..."
highlights: [
{
concept: 'a word',
start: 1,
end: 10
}
{
concept: 'cancer',
start: 123,
end: 135
}
]
}
I am looping over each highlight, locating it's start index in the abstract (the end doesn't really matter as I just need to get a location within a sentence), and then somehow need to identify the sentence that index occurs in.
I am able to tokenize the abstract into sentences using nltk.tonenize.sent_tokenize, but by doing that I render the index location useless.
How should I go about solving this problem? I suppose regexes are an option but the nltk tokenizer seems such a nice way of doing it that it would be a shame not to make use of it.. Or somehow reset the start index by finding the number of chars since the previous full stop/exclamation mark/question mark?

You are right, the NLTK tokenizer is really what you should be using in this situation since it is robust enough to handle delimiting mostly all sentences including ending a sentence with a "quotation." You can do something like this (paragraph from a random generator):
Start with,
from nltk.tokenize import sent_tokenize
paragraph = "How does chickens harden over the acceptance? Chickens comprises coffee. Chickens crushes a popular vet next to the eater. Will chickens sweep beneath a project? Coffee funds chickens. Chickens abides against an ineffective drill."
highlights = ["vet","funds"]
sentencesWithHighlights = []
Most intuitive way:
for sentence in sent_tokenize(paragraph):
for highlight in highlights:
if highlight in sentence:
sentencesWithHighlights.append(sentence)
break
But using this method we actually have what is effectively a 3x nested for loop. This is because we first check each sentence, then each highlight, then each subsequence in the sentence for the highlight.
We can get better performance since we know the start index for each highlight:
highlightIndices = [100,169]
subtractFromIndex = 0
for sentence in sent_tokenize(paragraph):
for index in highlightIndices:
if 0 < index - subtractFromIndex < len(sentence):
sentencesWithHighlights.append(sentence)
break
subtractFromIndex += len(sentence)
In either case we get:
sentencesWithHighlights = ['Chickens crushes a popular vet next to the eater.', 'Coffee funds chickens.']

I assume that all your sentences end with one of these three characters: !?.
What about looping over the list of highlights, creating a regexp group:
(?:list|of|your highlights)
Then matching your whole abstract against this regexp:
/(?:[\.!\?]|^)\s*([^\.!\?]*(?:list|of|your highlights)[^\.!\?]*?)(?=\s*[\.!\?])/ig
This way you would get the sentence containing at least one of your highlights in the first subgroup of each match (RegExr).

Another option (though it's tough to say how reliable it would be with variably defined text), would be to split the text into a list of sentences and test against them:
re.split('(?<=\?|!|\.)\s{0,2}(?=[A-Z]|$)', text)

How can I randomize this text generator even further?

I'm working on a random text generator -without using Markov chains- and currently it works without too many problems -actually generates a good amount of random sentences by my criteria but I want to make it even more accurate to prevent as many sentence repeats as possible-. Firstly, here is my code flow:
Enter a sentence as input -this is called trigger string, is assigned to a variable-
Get longest word in trigger string
Search all Project Gutenberg database for sentences that contain this word -regardless of uppercase lowercase-
Return the longest sentence that has the word I spoke about in step 3
Append the sentence in Step 1 and Step4 together
Assign the sentence in Step 4 as the new 'trigger' sentence and repeat the process. Note that I have to get the longest word in second sentence and continue like that and so on-
And here is my code:
import nltk
from nltk.corpus import gutenberg
from random import choice
import smtplib #will be for send e-mail option later
triggerSentence = raw_input("Please enter the trigger sentence: ")#get input str
longestLength = 0
longestString = ""
longestLen2 = 0
longestStr2 = ""
listOfSents = gutenberg.sents() #all sentences of gutenberg are assigned -list of list format-
listOfWords = gutenberg.words()# all words in gutenberg books -list format-
while triggerSentence:#run the loop so long as there is a trigger sentence
sets = []
sets2 = []
split_str = triggerSentence.split()#split the sentence into words
#code to find the longest word in the trigger sentence input
for piece in split_str:
if len(piece) > longestLength:
longestString = piece
longestLength = len(piece)
#code to get the sentences containing the longest word, then selecting
#random one of these sentences that are longer than 40 characters
for sentence in listOfSents:
if sentence.count(longestString):
sents= " ".join(sentence)
if len(sents) > 40:
sets.append(" ".join(sentence))
triggerSentence = choice(sets)
print triggerSentence #the first sentence that comes up after I enter input-
split_str = triggerSentence.split()
for apiece in triggerSentence: #find the longest word in this new sentence
if len(apiece) > longestLen2:
longestStr2 = piece
longestLen2 = len(apiece)
if longestStr2 == longestString:
second_longest = sorted(split_str, key=len)[-2]#this should return the second longest word in the sentence in case it's longest word is as same as the longest word of last sentence
#print second_longest #now get second longest word if first is same
#as longest word in previous sentence
for sentence in listOfSents:
if sentence.count(second_longest):
sents = " ".join(sentence)
if len(sents) > 40:
sets2.append(" ".join(sentence))
triggerSentence = choice(sets2)
else:
for sentence in listOfSents:
if sentence.count(longestStr2):
sents = " ".join(sentence)
if len(sents) > 40:
sets.append(" ".join(sentence))
triggerSentence = choice(sets)
print triggerSentence
According to my code, once I enter a trigger sentence, I should get another one that contains the longest word of the trigger sentence I entered. Then this new sentence becomes the trigger sentence and it's longest word is picked. This is where the problem sometimes occurs. I observed that despite the code lines I placed - starting from line 47 to the end- , the algorithm still can pick the same longest word in the sentences that come along, not looking for the second longest word.
For example:
Trigger string = "Scotland is a nice place."
Sentence 1 = -This is a random sentence with the word Scotland in it-
Now, this is where the problem can occur in my code at times -doesn't matter whether it comes up in sentence 2 or 942 or zillion or whatever, but I give it in sent.2 for example's sake-
Sentence 2 = Another sentence that has the word Scotland in it but not the second longest word in sentence 1. According to my code, this sentence should have been some sentence that contained the second longest word in sentence 1, not Scotland !
How can I solve this ? I'm trying to optimize the code as much as possible and any help is welcome.

There is nothing random about your algorithm at all. It should always be deterministic.
I'm not quite sure what you want to do here. If it is to generate random words, just use a dictionary and the random module. If you want to grab random sentences from the Gutenberg project, use the random module to pick a work and then a sentence out of that work.

We Keep Coding

Python is a programming language that lets you work quickly and integrate systems more effectively.