Why is TextBlob not using / detecting the negation? - python

I am using TextBlob to perform a sentiment analysis task. I have noticed that TextBlob detects the negation in some cases but not in others.
Here are two simple examples:
>>> from textblob.sentiments import PatternAnalyzer
>>> sentiment_analyzer = PatternAnalyzer()
# example 1
>>> sentiment_analyzer.analyze('This is good')
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
>>> sentiment_analyzer.analyze('This is not good')
Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)
# example 2
>>> sentiment_analyzer.analyze('I am the best')
Sentiment(polarity=1.0, subjectivity=0.3)
>>> sentiment_analyzer.analyze('I am not the best')
Sentiment(polarity=1.0, subjectivity=0.3)
As you can see in the second example, when using the adjective best the polarity does not change. I suspect this has to do with the fact that best is a very strong indicator, but that doesn't seem right, because the negation should have reversed the polarity (in my understanding).
Can anyone explain a little bit what's going on? Is TextBlob using some negation mechanism at all, or is it just that the word not adds negative sentiment to the sentence? Either way, why does the second example have exactly the same sentiment in both cases? Is there any suggestion about how to overcome such obstacles?

(edit: my old answer was more about general classifiers and not about PatternAnalyzer)
In your code, TextBlob uses the PatternAnalyzer. Its behaviour is briefly described in this document: http://www.clips.ua.ac.be/pages/pattern-en#parser
We can see that:
The pattern.en module bundles a lexicon of adjectives (e.g., good, bad, amazing, irritating, ...) that occur frequently in product reviews, annotated with scores for sentiment polarity (positive ↔ negative) and subjectivity (objective ↔ subjective).
The sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence, based on the adjectives it contains.
Here's an example that shows the behaviour of the algorithm. The polarity depends directly on the adjective used.
sentiment_analyzer.analyze('player')
Sentiment(polarity=0.0, subjectivity=0.0)
sentiment_analyzer.analyze('bad player')
Sentiment(polarity=-0.6999998, subjectivity=0.66666)
sentiment_analyzer.analyze('worst player')
Sentiment(polarity=-1.0, subjectivity=1.0)
sentiment_analyzer.analyze('best player')
Sentiment(polarity=1.0, subjectivity=0.3)
Professional software generally uses complex tools based on neural networks and classifiers combined with lexical analysis. But as far as I can tell, TextBlob just tries to give a result based directly on the grammatical analysis (here, the polarity of the adjectives). That's the source of the problem.
It does not try to check whether the sentence as a whole is negated (with the word "not"). It only checks whether the adjective itself is negated (since it works on adjectives, not on the general sentence structure). Here, best is used as a noun and is not a negated adjective. So, the polarity stays positive.
sentiment_analyzer.analyze('not the best')
Sentiment(polarity=1.0, subjectivity=0.3)
Just swap the order of the words so that the negation applies to the adjective and not to the whole sentence.
sentiment_analyzer.analyze('the not best')
Sentiment(polarity=-0.5, subjectivity=0.3)
Here, the adjective is negated. So, the polarity is negative.
That's my explanation of this "strange behaviour".
The real implementation is defined in this file:
https://github.com/sloria/TextBlob/blob/dev/textblob/_text.py
The interesting portion is given by:
if w in self and pos in self[w]:
    p, s, i = self[w][pos]
    # Known word not preceded by a modifier ("good").
    if m is None:
        a.append(dict(w=[w], p=p, s=s, i=i, n=1, x=self.labeler.get(w)))
    # Known word preceded by a modifier ("really good").
    ...
else:
    # Unknown word may be a negation ("not good").
    if negation and w in self.negations:
        n = w
    # Unknown word. Retain negation across small words ("not a good").
    elif n and len(w.strip("'")) > 1:
        n = None
    # Unknown word may be a negation preceded by a modifier ("really not good").
    if n is not None and m is not None and (pos in self.modifiers or self.modifier(m[0])):
        a[-1]["w"].append(n)
        a[-1]["n"] = -1
        n = None
    # Unknown word. Retain modifier across small words ("really is a good").
    elif m and len(w) > 2:
        m = None
    # Exclamation marks boost previous word.
    if w == "!" and len(a) > 0:
        ...
If we enter "not a good" or "not the good", the words "not", "a" and "the" are not in the adjective lexicon, so they go through the else branch.
With "not a good", the small word "a" does not trigger elif n and len(w.strip("'")) > 1:, so the negation is retained and the polarity of "good" is reversed. With "not the good" (or "not the best"), the word "the" does trigger that elif, the negation is dropped, and the polarity stays the same as for the adjective alone.
The entire code is a succession of fine tweaks and grammatical hints (for example, an "!" boosts the previous word, a smiley indicates irony, ...). That's why some particular patterns give strange results. To handle a specific case, you must check which of the if branches in that part of the code your sentence actually matches.
I hope this helps.
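If you need the negation to affect the whole sentence, one thing to try (not part of the PatternAnalyzer) is TextBlob's other backend, the corpus-trained NaiveBayesAnalyzer. This is only a sketch: it assumes the NLTK data that python -m textblob.download_corpora installs is available, the first call is slow because the classifier trains lazily, and its output (a classification with p_pos/p_neg probabilities) differs from PatternAnalyzer's polarity score.
from textblob.sentiments import NaiveBayesAnalyzer

# Trained on the NLTK movie_reviews corpus; the first analyze() call
# triggers training, so it takes a while.
nb = NaiveBayesAnalyzer()

print(nb.analyze('I am the best'))      # Sentiment(classification=..., p_pos=..., p_neg=...)
print(nb.analyze('I am not the best'))  # the probabilities may shift here, unlike PatternAnalyzer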

Related

Faster text search in Python with complex if statement

I have a large set of long text documents with punctuation. Three short examples are provided here:
doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]
and I have sets of words like the following:
wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])
I want to search all text documents that meet the following condition:
(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)
The or not condition in each parenthesis is needed because the three sets could be empty. Please notice that before applying the condition I also need to clean the text of punctuation, transform it to lowercase, and split it into a set of words, which takes additional time.
This process would match the first text in doc but not the second or the third. Indeed, the second does not match because it contains the word "dogs", and the third because it does not include the word "seaside".
I am wondering if this general problem (with the words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding the text pre-processing for cleaning. Maybe with a fast regex solution, perhaps one that uses a trie. Is that possible? Or is there any other suggestion?
Your solution appears to be linear in the length of the document; you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in it. You could try using one loop over the entire doc, wrapped in a function so the early return works:
def check_one_pass(words):
    # words: the cleaned, lowercased words of a single document
    wand_left = set(wAND)          # work on a copy so wAND survives for the next document
    or_satisfied = not wOR         # an empty wOR is trivially satisfied
    for word in words:
        if word in wand_left: wand_left.remove(word)
        if not or_satisfied and word in wOR: or_satisfied = True
        if word in wNOT: return False
    return or_satisfied and not wand_left
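For example (hypothetical glue code, assuming the documents are cleaned and split the same way as in the question):
import re

docs_words = [re.findall(r'\w+', d.lower()) for d in doc]   # crude cleaning: lowercase words only
matching = [d for d, words in zip(doc, docs_words) if check_one_pass(words)]
print(matching)   # only the first document should match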
You can build regexps for the word bags you have, and use them:
import re

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )

wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)
def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    # count only the wAND words that actually occur in the document
    for match in wAND_re.finditer(doc):
        found.add(match.group(0).lower())
        if len(found) == expected:
            break
    return len(found) == expected
A quick timing test suggests this is about 89% faster than the original (and it passes the original "test suite"), likely due to the fact that
the documents don't need to be cleaned (since the \b anchors limit matches to whole words and re.I handles case normalization)
regexps run in native code, which tends to be faster than pure Python
name='original' iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match' iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match' iters=10000 time=0.203 iters_per_sec=49363.58
where bag_match is my original comment suggestion of using set intersections:
def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
If you have already cleaned the documents into an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life since your documents are likely to all be unique and caching wouldn't help), then bag_match is much faster:
name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60
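For reference, clean_doc itself isn't shown above; a minimal version consistent with how it is used (an assumed helper, not the author's actual code) could be:
import re

def clean_doc(doc):
    # lowercase the text, drop punctuation, and return the words
    return re.findall(r'\w+', doc.lower())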

Extracting collocates for a given word from a text corpus - Python

I am trying to find out how to extract the collocates of a specific word out of a text. As in: what are the words that make a statistically significant collocation with e.g. the word "hobbit" in the entire text corpus? I am expecting a result similar to a list of words (collocates ) or maybe tuples (my word + its collocate).
I know how to make bi- and tri-grams using nltk, and also how to select only the bi- or trigrams that contain my word of interest. I am using the following code (adapted from this StackOverflow question).
import nltk
from nltk.collocations import *
corpus = nltk.Text(text) # "text" is a list of tokens
trigram_measures = nltk.collocations.TrigramAssocMeasures()
tri_finder = TrigramCollocationFinder.from_words(corpus)
# Only trigrams that appear 3+ times
tri_finder.apply_freq_filter(3)
# Only the ones containing my word
my_filter = lambda *w: 'Hobbit' not in w
tri_finder.apply_ngram_filter(my_filter)
print tri_finder.nbest(trigram_measures.likelihood_ratio, 20)
This works fine and gives me a list of trigrams (one element of which is my word), each with its log-likelihood value. But I don't really want to select words only from a list of trigrams. I would like to consider all possible N-gram combinations in a window of my choice (for example, all words in a window of 3 to the left and 3 to the right of my word, which would mean a 7-gram), and then check which of those words co-occur with my word of interest at a statistically relevant frequency. I would like to use the log-likelihood value for that.
My idea would be:
1) Calculate all N-gram combinations of different sizes containing my word (not necessarily using nltk, unless it allows calculating units larger than trigrams, but I haven't found that option),
2) Compute the log-likelihood value for each of the words composing my N-grams, and somehow compare it against the frequency of the n-gram they appear in (?). Here is where I get lost a bit... I am not experienced in this and I don't know how to think about this step.
Does anyone have suggestions on how I should do this?
And assuming I use the pool of trigrams provided by nltk for now: does anyone have ideas how to proceed from there to get a list of the most relevant words near my search word?
Thank you
Interesting problem ...
Regarding 1), take a look at this thread; it has several nice solutions for making n-grams. Basically:
from nltk import ngrams
sentence = 'this is a foo bar sentences and i want to ngramize it'
n = 6
sixgrams = ngrams(sentence.split(), n)
for grams in sixgrams:
print (grams)
Another way could be gensim's Phrases (here doc must be an iterable of tokenised sentences; the imports are added for completeness):
from gensim import models
from gensim.models import Phrases

phrases = Phrases(doc, min_count=2)
bigram = models.phrases.Phraser(phrases)
phrases = Phrases(bigram[doc], min_count=2)
trigram = models.phrases.Phraser(phrases)
phrases = Phrases(trigram[doc], min_count=2)
quadgram = models.phrases.Phraser(phrases)
... (you could continue indefinitely)
min_count sets the minimum number of times a word or pair must occur in the corpus to be considered.
Regarding 2), it's somewhat tricky to calculate the log-likelihood for more than two variables, since you have to account for all the permutations. Look at this thesis, in which the author proposes a solution (page 26 contains a good explanation).
However, in addition to the log-likelihood function, there is the PMI (Pointwise Mutual Information) metric, which relates the co-occurrence count of a pair of words to their individual frequencies in the text. PMI is easy to understand and calculate, and you can use it for each pair of words.
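If you stay within nltk, PMI (or the likelihood ratio) over a wider window can be computed directly with the bigram finder; a sketch, assuming corpus is the token sequence from the question and using window_size=7 to mirror the 3-left/3-right idea:
from nltk.collocations import BigramAssocMeasures, BigramCollocationFinder

bigram_measures = BigramAssocMeasures()
# window_size=7 pairs each word with neighbours up to 6 positions away
finder = BigramCollocationFinder.from_words(corpus, window_size=7)
finder.apply_freq_filter(3)
# keep only pairs that contain the word of interest
finder.apply_ngram_filter(lambda w1, w2: 'hobbit' not in (w1.lower(), w2.lower()))
print(finder.score_ngrams(bigram_measures.pmi)[:20])
# bigram_measures.likelihood_ratio can be used the same way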

Finding closest spellings by the same first letter via different distance measures

I am trying to write a function that will find closest spellings for a word (which may have been incorrectly spelled) 'by the same first letter' through different n-grams and distance measures.
For what I currently have
from nltk.corpus import words
from nltk import ngrams
from nltk.metrics.distance import edit_distance, jaccard_distance
first_letters = ['A','B','C']
spellings = words.words()
def recommendation(word):
    n = 3
    # n means 'n'-grams, here I use 3 as an example
    spellings_new = [w for w in spellings if (w[0] in first_letters)]
    dists = [________(set(ngrams(word, n)), set(ngrams(w, n))) for w in spellings_new]
    # ______ is the distance measure
    return spellings_new[dists.index(min(dists))]
The rest seems straightforward, but I don't know how to specify the 'same initial letter' condition. In particular, if the misspelled word starts with the letter 'A', then the corrected word recommended from .words() with the minimum distance to the misspelled word should also start with 'A', and so on.
As you can see from the function above, I use (w[0] in first_letters) as my 'initial letter condition', but this doesn't do the trick and always returns words that start with different initials.
I have yet to find similar threads on this board addressing my question, so I would appreciate it if anyone could enlighten me on how to specify the 'initial letter condition'. If this question has somehow been asked before and is deemed inappropriate, I will remove it.
Thank you.
You're really quite close. w[0] == word[0] can be used to check whether the first letter is the same. After that, set(w) and set(word) can be used to turn the words into sets of letters. I then passed them into jaccard_distance, only because that's what you already had imported. It's possible there's a better solution.
def recommendation(word):
    n = 3
    # n means 'n'-grams, here I use 3 as an example
    spellings_new = [w for w in spellings if (w[0] == word[0])]
    dists = [jaccard_distance(set(w), set(word)) for w in spellings_new]
    return spellings_new[dists.index(min(dists))]
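If you want to keep the character n-gram idea from the question, the blank can be filled with jaccard_distance over the two n-gram sets; a sketch reusing the names defined in the question (the misspelling 'cormulent' is just a made-up test word):
def recommendation_ngrams(word, n=3):
    spellings_new = [w for w in spellings if w[0] == word[0]]
    dists = [jaccard_distance(set(ngrams(word, n)), set(ngrams(w, n))) for w in spellings_new]
    return spellings_new[dists.index(min(dists))]

print(recommendation_ngrams('cormulent'))   # picks the 'c' word whose character trigrams overlap most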

Negation handling in sentiment analysis

I am in need of a little help here. I need to identify negated phrases like "not good" / "not bad" and then determine the polarity (negative or positive) of the sentiment. I have done everything except handling the negations, and I just want to know how I can include negation handling. How do I go about it?
Negation handling is quite a broad field, with numerous different potential implementations. Here I provide sample code that negates a sequence of text and stores the negated uni/bi/trigrams in not_ form. Note that nltk isn't used here, in favor of simple text processing.
# negate_sequence(text)
# text: sentence to process (creation of uni/bi/trigrams
# is handled here)
#
# Detects negations and transforms negated words into 'not_' form
#
def negate_sequence(text):
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result
If we run this program on a sample input text = "I am not happy today, and I am not feeling well", we obtain the following sequences of unigrams, bigrams, and trigrams:
[ 'i',
'am',
'i am',
'not',
'am not',
'i am not',
'not_happy',
'not not_happy',
'am not not_happy',
'not_today',
'not_happy not_today',
'not not_happy not_today',
'and',
'not_today and',
'not_happy not_today and',
'i',
'and i',
'not_today and i',
'am',
'i am',
'and i am',
'not',
'am not',
'i am not',
'not_feeling',
'not not_feeling',
'am not not_feeling',
'not_well',
'not_feeling not_well',
'not not_feeling not_well']
We may subsequently store these n-grams in an array for future retrieval and analysis. Process the not_ words as the negative of the [sentiment, polarity] that you have defined for their counterparts.
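To connect this to polarity scoring, one simple way to consume these features is to flip the sign of a word's lexicon score whenever it appears in not_ form; a minimal sketch with a made-up toy lexicon (use your own scores in practice):
# toy lexicon: word -> polarity score (made up for illustration)
lexicon = {"happy": 1.0, "well": 0.5, "bad": -1.0}

def score(text):
    total = 0.0
    for token in negate_sequence(text):
        if " " in token:                      # ignore bigrams/trigrams in this simple scorer
            continue
        if token.startswith("not_"):
            total -= lexicon.get(token[len("not_"):], 0.0)   # negated word: flip the sign
        else:
            total += lexicon.get(token, 0.0)
    return total

print(score("I am not happy today, and I am not feeling well"))   # negative overall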
This seems to work decently well as a poor man's word-level negation detector in Python. It's definitely not perfect, but it may be useful for some cases. It takes a spaCy sentence object.
def word_is_negated(word):
    """ """
    for child in word.children:
        if child.dep_ == 'neg':
            return True
    if word.pos_ in {'VERB'}:
        for ancestor in word.ancestors:
            if ancestor.pos_ in {'VERB'}:
                for child2 in ancestor.children:
                    if child2.dep_ == 'neg':
                        return True
    return False

def find_negated_wordSentIdxs_in_sent(sent, idxs_of_interest=None):
    """ """
    negated_word_idxs = set()
    for word_sent_idx, word in enumerate(sent):
        if idxs_of_interest:
            if word_sent_idx not in idxs_of_interest:
                continue
        if word_is_negated(word):
            negated_word_idxs.add(word_sent_idx)
    return negated_word_idxs
call it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
find_negated_wordSentIdxs_in_sent(nlp("I have hope, but I do not like summer"))
EDIT:
As @Amandeep pointed out, depending on your use case, you may also want to include nouns, adjectives and adverbs in the line if word.pos_ in {'VERB'}: (see the sketch below).
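For example, the check could be broadened like this (an illustrative variant of word_is_negated above; the extra POS tags are the ones mentioned in the edit, and it hasn't been tested beyond simple cases):
def word_is_negated_broad(word):
    # same as word_is_negated, but the first POS check also admits
    # nouns, adjectives and adverbs
    for child in word.children:
        if child.dep_ == 'neg':
            return True
    if word.pos_ in {'VERB', 'NOUN', 'ADJ', 'ADV'}:
        for ancestor in word.ancestors:
            if ancestor.pos_ in {'VERB'}:
                for child2 in ancestor.children:
                    if child2.dep_ == 'neg':
                        return True
    return False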
It's been a while since I've worked on sentiment analysis, so I'm not sure what the status of this area is now, and in any case I have never used nltk for it, so I can't point you to anything there. But in general, I think it's safe to say that this is an active area of research and an essential part of NLP, and that it certainly isn't a problem that has been 'solved' yet. It's one of the finer, more interesting fields of NLP, involving irony, sarcasm, and the scope of negations. Often, coming up with a correct analysis means interpreting a lot of context/domain/discourse information, which isn't straightforward at all.
You may want to look at this topic: Can an algorithm detect sarcasm. And some googling will probably give you a lot more information.
In short; your question is way too broad to come up with a specific answer.
Also, I wonder what you mean by "I did everything except handling the negations". Do you mean you identified 'negative' words? Have you considered that this information can be conveyed by a lot more than the words not, no, etc.? Consider for example "Your solution was not good" vs. "Your solution was suboptimal".
What exactly you are looking for, and what will suffice in your situation, obviously depends on the context and domain of application.
This probably wasn't the answer you were hoping for, but I'd suggest you do a bit more research (as a lot of smart things have been done by smart people in this field).

Discovering Poetic Form with NLTK and CMU Dict

Edit: This code has been worked on and released as a basic module: https://github.com/hyperreality/Poetry-Tools
I'm a linguist who has recently picked up python and I'm working on a project which hopes to automatically analyze poems, including detecting the form of the poem. I.e. if it found a 10 syllable line with 0101010101 stress pattern, it would declare that it's iambic pentameter. A poem with 5-7-5 syllable pattern would be a haiku.
I'm using the following code, part of a larger script, but I have a number of problems which are listed below the program:
corpus in the script is simply the raw text input of the poem.
import sys, getopt, nltk, re, string
from nltk.tokenize import RegexpTokenizer
from nltk.util import bigrams, trigrams
from nltk.corpus import cmudict
from curses.ascii import isdigit
...
def cmuform():
    tokens = [word for sent in nltk.sent_tokenize(corpus) for word in nltk.word_tokenize(sent)]
    d = cmudict.dict()
    text = nltk.Text(tokens)
    words = [w.lower() for w in text]
    regexp = "[A-Za-z]+"
    exp = re.compile(regexp)

    def nsyl(word):
        lowercase = word.lower()
        if lowercase not in d:
            return 0
        else:
            first = [' '.join([str(c) for c in lst]) for lst in max(d[lowercase])]
            second = ''.join(first)
            third = ''.join([i for i in second if i.isdigit()]).replace('2', '1')
            return third
        #return max([len([y for y in x if isdigit(y[-1])]) for x in d[lowercase]])

    sum1 = 0
    for a in words:
        if exp.match(a):
            print a, nsyl(a),
            sum1 = sum1 + len(str(nsyl(a)))
    print "\nTotal syllables:", sum1
I guess that the output that I want would be like this:
1101111101
0101111001
1101010111
The first problem is that I lost the line breaks during the tokenization, and I really need the line breaks to be able to identify form. This should not be too hard to deal with though. The bigger problems are that:
I can't deal with non-dictionary words. At the moment I return 0 for them, but this will confound any attempt to identify the poem, as the syllabic count of the line will probably decrease.
In addition, the CMU dictionary often says that there is stress on a word ('1') when there is not ('0'). That's why the output looks like this: 1101111101, when it should be the stress pattern of iambic pentameter: 0101010101.
So how would I add some fudging factor so the poem still gets identified as iambic pentameter when it only approximates the pattern? It's no good to code a function that identifies lines of 01's when the CMU dictionary is not going to output such a clean result. I suppose I'm asking how to code a 'partial match' algorithm.
Welcome to stack overflow. I'm not that familiar with Python, but I see you have not received many answers yet so I'll try to help you with your queries.
First some advice: You'll find that if you focus your questions your chances of getting answers are greatly improved. Your post is too long and contains several different questions, so it is beyond the "attention span" of most people answering questions here.
Back on topic:
Before you revised your question you asked how to make it less messy. That's a big question, but you might want to use the top-down procedural approach and break your code into functional units:
split corpus into lines
For each line: find the syllable length and stress pattern.
Classify stress patterns.
You'll find that the first step is a single function call in python:
corpus.split("\n");
and can remain in the main function, but the second step would be better placed in its own function, and the third step would need to be split up further and would probably be better tackled with an object-oriented approach. If you're in academia, you might be able to convince the CS faculty to lend you a postgrad for a couple of months to help you, perhaps in lieu of some workshop requirement.
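A skeleton for the first two steps might look like this (a sketch only; stress_of is a hypothetical per-word lookup that the answers below fill in):
import nltk

lines = corpus.split("\n")
patterns = []
for line in lines:
    words = nltk.word_tokenize(line)
    # stress_of(word) is hypothetical: it should return the word's stress digits
    pattern = "".join(stress_of(w) for w in words)
    patterns.append(pattern)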
Now to your other questions:
Not losing line breaks: as @ykaganovich mentioned, you probably want to split the corpus into lines and feed those to the tokenizer.
Words not in dictionary/errors: The CMU dictionary home page says:
Find an error? Please contact the developers. We will look at the problem and improve the dictionary. (See at bottom for contact information.)
There is probably a way to add custom words to the dictionary / change existing ones, look in their site, or contact the dictionary maintainers directly.
You can also ask here in a separate question if you can't figure it out. There's bound to be someone in stackoverflow that knows the answer or can point you to the correct resource.
Whatever you decide, you'll want to contact the maintainers and offer them any extra words and corrections anyway to improve the dictionary.
Classifying input corpus when it doesn't exactly match the pattern: You might want to look at the link ykaganovich provided for fuzzy string comparisons. Some algorithms to look for:
Levenshtein distance: gives you a measure of how different two strings are as the number of changes needed to turn one string into another. Pros: easy to implement, Cons: not normalized, a score of 2 means a good match for a pattern of length 20 but a bad match for a pattern of length 3.
Jaro-Winkler string similarity measure: similar to Levenshtein, but based on how many character sequences appear in the same order in both strings. It is a bit harder to implement but gives you normalized values (0.0 - completely different, 1.0 - the same) and is suitable for classifying the stress patterns. A CS postgrad or last year undergrad should not have too much trouble with it ( hint hint ).
I think those were all your questions. Hope this helps a bit.
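As a concrete illustration of the fuzzy-matching idea above, Python's standard library difflib gives a normalized similarity ratio with no extra dependencies; a sketch (the known patterns and the threshold are made-up examples):
from difflib import SequenceMatcher

KNOWN_FORMS = {
    "iambic pentameter": "0101010101",
    "iambic tetrameter": "01010101",
}

def closest_form(stress_line, threshold=0.6):
    # return (form name, similarity) for the best match, or None if nothing is close enough
    best = max(
        ((name, SequenceMatcher(None, stress_line, pattern).ratio())
         for name, pattern in KNOWN_FORMS.items()),
        key=lambda pair: pair[1],
    )
    return best if best[1] >= threshold else None

print(closest_form("1101111101"))   # best match for the near-miss line from the question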
To preserve newlines, parse line by line before sending each line to the cmu parser.
For dealing with single-syllable words, you probably want to try both 0 and 1 for it when nltk returns 1 (looks like nltk already returns 0 for some words that would never get stressed, like "the"). So, you'll end up with multiple permutations:
1101111101
0101010101
1101010101
and so forth. Then you have to pick ones that look like a known forms.
For non-dictionary words, I'd also fudge it the same way: figure out the number of syllables (the dumbest way would be by counting the vowels), and permute all possible stresses. Maybe add some more rules like "ea is a single syllable, trailing e is silent"...
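A crude vowel-group counter along those lines might look like this (a rough heuristic sketch; the extra rules just mentioned are left out):
import re

def naive_syllables(word):
    # count runs of vowels as syllables; only a rough fallback for words missing from cmudict
    groups = re.findall(r'[aeiouy]+', word.lower())
    return max(1, len(groups))

print(naive_syllables("corinna"))   # 3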
I've never worked with other kinds of fuzzying, but you can check https://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison for some ideas.
This is my first post on stackoverflow.
And I'm a python newbie, so please excuse any deficits in code style.
But I too am attempting to extract accurate metre from poems.
The code included in this question helped me, so I'm posting what I came up with, which builds on that foundation. It is one way to extract the stress as a single string, correct the cmudict bias with a 'fudging factor', and avoid losing words that are not in the cmudict.
import nltk
from nltk.corpus import cmudict
prondict = cmudict.dict()
#
# parseStressOfLine(line)
# function that takes a line
# parses it for stress
# corrects the cmudict bias toward 1
# and returns two strings
#
# 'stress' in form '0101*,*110110'
# -- 'stress' also returns words not in cmudict '0101*,*1*zeon*10110'
# 'stress_no_punct' in form '0101110110'
def parseStressOfLine(line):
    stress = ""
    stress_no_punct = ""
    print line
    tokens = [words.lower() for words in nltk.word_tokenize(line)]
    for word in tokens:
        word_punct = strip_punctuation_stressed(word.lower())
        word = word_punct['word']
        punct = word_punct['punct']
        #print word
        if word not in prondict:
            # if word is not in dictionary
            # add it to the string that includes punctuation
            stress = stress + "*" + word + "*"
        else:
            zero_bool = True
            for s in prondict[word]:
                # oppose the cmudict bias toward 1
                # search for a zero in array returned from prondict
                # if it exists use it
                #print strip_letters(s), word
                if strip_letters(s) == "0":
                    stress = stress + "0"
                    stress_no_punct = stress_no_punct + "0"
                    zero_bool = False
                    break
            if zero_bool:
                stress = stress + strip_letters(prondict[word][0])
                stress_no_punct = stress_no_punct + strip_letters(prondict[word][0])
        if len(punct) > 0:
            stress = stress + "*" + punct + "*"
    return {'stress': stress, 'stress_no_punct': stress_no_punct}
# STRIP PUNCTUATION but keep it
def strip_punctuation_stressed(word):
    # define punctuations
    punctuations = '!()-[]{};:"\,<>./?@#$%^&*_~'
    my_str = word
    # remove punctuations from the string
    no_punct = ""
    punct = ""
    for char in my_str:
        if char not in punctuations:
            no_punct = no_punct + char
        else:
            punct = punct + char
    return {'word': no_punct, 'punct': punct}
# CONVERT the cmudict prondict into just numbers
def strip_letters(ls):
    #print "strip_letters"
    nm = ''
    for ws in ls:
        #print "ws", ws
        for ch in list(ws):
            #print "ch", ch
            if ch.isdigit():
                nm = nm + ch
                #print "ad to nm", nm, type(nm)
    return nm
# TESTING results
# i do not correct for the '2'
line = "This day (the year I dare not tell)"
print parseStressOfLine(line)
line = "Apollo play'd the midwife's part;"
print parseStressOfLine(line)
line = "Into the world Corinna fell,"
print parseStressOfLine(line)
"""
OUTPUT
This day (the year I dare not tell)
{'stress': '01***(*011111***)*', 'stress_no_punct': '01011111'}
Apollo play'd the midwife's part;
{'stress': "0101*'d*01211***;*", 'stress_no_punct': '010101211'}
Into the world Corinna fell,
{'stress': '01012101*,*', 'stress_no_punct': '01012101'}
"""
