I am in need of a little help here. I need to identify negated phrases like "not good" and "not bad" and then determine the polarity (negative or positive) of the sentiment. I have done everything except handling the negations, and I just want to know how to include them. How do I go about it?
Negation handling is quite a broad field, with numerous different potential implementations. Here I can provide sample code that negates a sequence of text and stores negated uni/bi/trigrams in not_ form. Note that nltk isn't used here in favor of simple text processing.
# negate_sequence(text)
# text: sentence to process (creation of uni/bi/trigrams
# is handled here)
#
# Detects negations and transforms negated words into 'not_' form
#
def negate_sequence(text):
    negation = False
    delims = "?.,!:;"
    result = []
    words = text.split()
    prev = None
    pprev = None
    for word in words:
        stripped = word.strip(delims).lower()
        negated = "not_" + stripped if negation else stripped
        result.append(negated)
        if prev:
            bigram = prev + " " + negated
            result.append(bigram)
            if pprev:
                trigram = pprev + " " + bigram
                result.append(trigram)
            pprev = prev
        prev = negated
        if any(neg in word for neg in ["not", "n't", "no"]):
            negation = not negation
        if any(c in word for c in delims):
            negation = False
    return result
If we run this program on a sample input text = "I am not happy today, and I am not feeling well", we obtain the following sequences of unigrams, bigrams, and trigrams:
[ 'i',
'am',
'i am',
'not',
'am not',
'i am not',
'not_happy',
'not not_happy',
'am not not_happy',
'not_today',
'not_happy not_today',
'not not_happy not_today',
'and',
'not_today and',
'not_happy not_today and',
'i',
'and i',
'not_today and i',
'am',
'i am',
'and i am',
'not',
'am not',
'i am not',
'not_feeling',
'not not_feeling',
'am not not_feeling',
'not_well',
'not_feeling not_well',
'not not_feeling not_well']
We may subsequently store these n-grams in an array for future retrieval and analysis. Process the not_ words as carrying the opposite of the [sentiment, polarity] that you have defined for their counterparts.
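For the sentiment step, one simple approach is to flip the sign of any token carrying the not_ prefix. This is only a minimal sketch: the lexicon dict and its scores are hypothetical placeholders for whatever polarity values you have defined yourself.

# Hypothetical lexicon: token -> polarity score you have defined yourself.
lexicon = {"happy": 1.0, "well": 0.5, "sad": -1.0}

def score_tokens(tokens, lexicon):
    # Sum polarity over unigrams, flipping the sign of not_ tokens.
    total = 0.0
    for token in tokens:
        if " " in token:          # skip bigrams/trigrams in this simple sketch
            continue
        if token.startswith("not_"):
            total -= lexicon.get(token[len("not_"):], 0.0)
        else:
            total += lexicon.get(token, 0.0)
    return total

print(score_tokens(negate_sequence("I am not happy today"), lexicon))  # negative total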
This seems to work decently well as a poor man's negation detector in Python. It's definitely not perfect, but it may be useful for some cases. It takes a spaCy sentence object.
def word_is_negated(word):
    """Return True if the spaCy token `word` is negated, directly or via its governing verb."""
    for child in word.children:
        if child.dep_ == 'neg':
            return True
    if word.pos_ in {'VERB'}:
        for ancestor in word.ancestors:
            if ancestor.pos_ in {'VERB'}:
                for child2 in ancestor.children:
                    if child2.dep_ == 'neg':
                        return True
    return False

def find_negated_wordSentIdxs_in_sent(sent, idxs_of_interest=None):
    """Return the sentence indices of negated tokens, optionally restricted to idxs_of_interest."""
    negated_word_idxs = set()
    for word_sent_idx, word in enumerate(sent):
        if idxs_of_interest:
            if word_sent_idx not in idxs_of_interest:
                continue
        if word_is_negated(word):
            negated_word_idxs.add(word_sent_idx)
    return negated_word_idxs
Call it like this:
import spacy
nlp = spacy.load('en_core_web_lg')
find_negated_wordSentIdxs_in_sent(nlp("I have hope, but I do not like summer"))
EDIT:
As @Amandeep pointed out, depending on your use case you may also want to include NOUNS, ADJECTIVES and ADVERBS in the line if word.pos_ in {'VERB'}:; a sketch of that variant follows.
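For reference, here is what that broader variant might look like. The function name and the exact POS set are my own choices, not part of the original answer, so adjust them to your data.

def word_is_negated_broad(word, pos_tags=frozenset({'VERB', 'NOUN', 'ADJ', 'ADV'})):
    """Like word_is_negated, but also follows ancestors for nouns, adjectives and adverbs."""
    for child in word.children:
        if child.dep_ == 'neg':
            return True
    if word.pos_ in pos_tags:
        for ancestor in word.ancestors:
            if ancestor.pos_ in pos_tags:
                for child2 in ancestor.children:
                    if child2.dep_ == 'neg':
                        return True
    return False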
It's been a while since I've worked on sentiment analysis, so I'm not sure what the current state of the area is, and in any case I have never used nltk for this, so I can't point you to anything there. But in general, it's safe to say that this is an active area of research and an essential part of NLP, and that it certainly isn't a 'solved' problem yet. It's one of the finer, more interesting subfields of NLP, involving irony, sarcasm, and the scope of negations. Often, coming up with a correct analysis means interpreting a lot of context/domain/discourse information, which isn't straightforward at all.
You may want to look at this topic: Can an algorithm detect sarcasm? Some googling will probably give you a lot more information.
In short: your question is way too broad to come up with a specific answer.
Also, I wonder what you mean by "I did everything except handling the negations". You mean you identified 'negative' words? Have you considered that this information can be conveyed by a lot more than words like not and no? Consider, for example, "Your solution was not good" vs. "Your solution was suboptimal".
What exactly you are looking for, and what will suffice in your situation, obviously depends on the context and domain of application.
This probably wasn't the answer you were hoping for, but I'd suggest you do a bit more research (as a lot of smart things have been done by smart people in this field).
I have a large set of long text documents with punctuation. Three short examples are provided here:
doc = ["My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you?", "My house, the most beautiful!, is NEAR the #seaside. I really love holidays, do you love dogs?", "My house, the most beautiful!, is NEAR the #sea. I really love holidays, do you?"]
and I have sets of words like the following:
wAND = set(["house", "near"])
wOR = set(["seaside"])
wNOT = set(["dogs"])
I want to search all text documents that meet the following condition:
(any(w in doc for w in wOR) or not wOR) and (all(w in doc for w in wAND) or not wAND) and (not any(w in doc for w in wNOT) or not wNOT)
The or not condition in each parenthesis is needed as the three lists could be empty. Please notice that before applying the condition I also need to clean text from punctuation, transform it to lowercase, and split it into a set of words, which requires additional time.
This process should match the first text in doc but not the second or the third. Indeed, the second does not match because it contains the word "dogs", and the third because it does not include the word "seaside".
I am wondering if this general problem (with the words in the wOR, wAND and wNOT lists changing) can be solved in a faster way, avoiding the text preprocessing needed for cleaning. Maybe with a fast regex solution, perhaps one that uses a Trie()? Is that possible, or is there any other suggestion?
Your solution appears to be linear in the length of the document - you won't be able to get any better than this without sorting, as the words you're looking for could be anywhere in the document. You could try using one loop over the entire doc:
def check_doc(words, wAND, wOR, wNOT):
    pending_and = set(wAND)        # copy so the original set isn't mutated
    or_satisfied = not wOR         # an empty wOR is trivially satisfied
    for word in words:
        if word in pending_and:
            pending_and.remove(word)
        if not or_satisfied and word in wOR:
            or_satisfied = True
        if word in wNOT:
            return False
    return or_satisfied and not pending_and
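A usage sketch, assuming a clean_doc helper of your own that lowercases, strips punctuation and splits the text into words (clean_doc is a placeholder here, not something defined above):

matching = [d for d in doc if check_doc(clean_doc(d), wAND, wOR, wNOT)]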
You can build regexps for the word bags you have, and use them:
import re

def make_re(word_set):
    return re.compile(
        r'\b(?:{})\b'.format('|'.join(re.escape(word) for word in word_set)),
        flags=re.I,
    )

wAND_re = make_re(wAND)
wOR_re = make_re(wOR)
wNOT_re = make_re(wNOT)

def re_match(doc):
    if not wOR_re.search(doc):
        return False
    if wNOT_re.search(doc):
        return False
    found = set()
    expected = len(wAND)
    for match in wAND_re.finditer(doc):
        found.add(match.group(0).lower())
        if len(found) == expected:
            break
    return len(found) == expected
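To apply it to the doc list from the question, something like this should work; note that no preprocessing is needed, because the regexes run on the raw text:

matching_docs = [d for d in doc if re_match(d)]
print(matching_docs)  # with the example data, only the first document matches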
A quick timing test suggests this is about 89% faster than the original (and it passes the original "test suite"), likely because (a) the documents don't need to be cleaned (the \b anchors limit matches to whole words and re.I handles case normalization), and (b) the regexps run in native code, which tends to be faster than Python.
name='original' iters=10000 time=0.206 iters_per_sec=48488.39
name='re_match' iters=20000 time=0.218 iters_per_sec=91858.73
name='bag_match' iters=10000 time=0.203 iters_per_sec=49363.58
where bag_match is my original comment suggestion of using set intersections:
def bag_match(doc):
    bag = set(clean_doc(doc))
    return (
        (bag.intersection(wOR) or not wOR) and
        (bag.issuperset(wAND) or not wAND) and
        (not bag.intersection(wNOT) or not wNOT)
    )
If you have already cleaned the documents into an iterable of words (here I just slapped @lru_cache on clean_doc, which you probably wouldn't do in real life since your documents are likely all unique and caching wouldn't help), then bag_match is much faster:
name='orig-with-cached-clean-doc' iters=50000 time=0.249 iters_per_sec=200994.97
name='re_match-with-cached-clean-doc' iters=20000 time=0.221 iters_per_sec=90628.94
name='bag_match-with-cached-clean-doc' iters=100000 time=0.265 iters_per_sec=377983.60
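For completeness, clean_doc itself isn't shown above; a minimal sketch of what it might look like, with the @lru_cache mentioned earlier (which you probably wouldn't keep in real code):

import re
from functools import lru_cache

@lru_cache(maxsize=None)
def clean_doc(doc):
    # Lowercase, keep only word characters, split into tokens.
    return tuple(re.findall(r'\w+', doc.lower()))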
I am trying to create a program that runs through a list of mental health terms, looks in a research abstract, and counts the number of times each word or phrase appears. I can get this to work with single words, but I'm struggling with multi-word terms. I tried using NLTK ngrams too, but since the number of words in the mental health terms varies (i.e., not all terms from the list are bigrams or trigrams), I couldn't get that to work either.
I want to emphasize that I know splitting on whitespace will only allow single words to be counted; I'm just stuck on how to deal with a varying number of words from my list when counting in the abstract.
Thanks!
from collections import Counter

abstracts = ['This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
             'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.']

for x2 in abstracts:
    mh_terms = ['bipolar disorder', 'anxiety', 'substance abuse disorder',
                'ptsd', 'schizophrenia', 'mental health']
    c = Counter(s.lower().replace('.', '') for s in x2.split())
    for term in mh_terms:
        term = term.replace(',', '')
        term = term.replace('.', '')
        xx = (term, c.get(term, 0))
    mh_total_occur = sum(c.get(v, 0) for v in mh_terms)
    print(mh_total_occur)
From my example, both abstracts are getting a count of 1, but I want a count of two.
The problem is that you will never match "mental health" as you are only counting occurrences of single words split by the " " character.
I don't know if using a Counter is the right solution here. If you need a highly scalable and indexable solution, then n-grams are probably the way to go, but for small to medium problems regex pattern matching should be pretty quick.
import re

abstracts = [
    'This is a mental health abstract about anxiety and bipolar disorder as well as other things.',
    'While this abstract is not about ptsd or any trauma-related illnesses, it does have a mental health focus.'
]

mh_terms = [
    'bipolar disorder', 'anxiety', 'substance abuse disorder',
    'ptsd', 'schizophrenia', 'mental health'
]

def _regex_word(text):
    """ wrap text with special regex expression for start/end of words """
    return '\\b{}\\b'.format(text)

def _normalize(text):
    """ Remove any non alpha/numeric/space character """
    return re.sub('[^a-z0-9 ]', '', text.lower())

normed_terms = [_normalize(term) for term in mh_terms]

for raw_abstract in abstracts:
    print('--------')
    normed_abstract = _normalize(raw_abstract)
    # Search for all occurrences of chosen terms
    found = {}
    for norm_term in normed_terms:
        pattern = _regex_word(norm_term)
        found[norm_term] = len(re.findall(pattern, normed_abstract))
    print('found = {!r}'.format(found))
    mh_total_occur = sum(found.values())
    print('mh_total_occur = {!r}'.format(mh_total_occur))
I tried to add helper functions and comments to make it clear what I was doing.
Using the \b word-boundary anchor is important in general use cases because it prevents search terms like "miss" from matching inside words like "dismiss".
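A quick illustration of the difference (just a throwaway snippet):

import re

text = 'Please do not dismiss the missing report, miss.'
print(re.findall(r'miss', text))      # also matches inside 'dismiss' and 'missing'
print(re.findall(r'\bmiss\b', text))  # matches only the standalone word 'miss'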
I am using TextBlob to perform a sentiment analysis task. I have noticed that TextBlob is able to detect negation in some cases but not in others.
Here are two simple examples:
>>> from textblob.sentiments import PatternAnalyzer
>>> sentiment_analyzer = PatternAnalyzer()
# example 1
>>> sentiment_analyzer.analyze('This is good')
Sentiment(polarity=0.7, subjectivity=0.6000000000000001)
>>> sentiment_analyzer.analyze('This is not good')
Sentiment(polarity=-0.35, subjectivity=0.6000000000000001)
# example 2
>>> sentiment_analyzer.analyze('I am the best')
Sentiment(polarity=1.0, subjectivity=0.3)
>>> sentiment_analyzer.analyze('I am not the best')
Sentiment(polarity=1.0, subjectivity=0.3)
As you can see in the second example, when the adjective best is used the polarity does not change. I suspect this has to do with the fact that best is a very strong indicator, but that doesn't seem right, because the negation should have reversed the polarity (in my understanding).
Can anyone explain a little bit what's going on? Is TextBlob using some negation mechanism at all, or is it just that the word not adds negative sentiment to the sentence? In either case, why does the second example have exactly the same sentiment in both cases? Is there any suggestion on how to overcome such obstacles?
(edit: my old answer was more about general classifiers and not about PatternAnalyzer)
In your code, TextBlob uses the PatternAnalyzer. Its behaviour is briefly described in this document: http://www.clips.ua.ac.be/pages/pattern-en#parser
We can see that:
The pattern.en module bundles a lexicon of adjectives (e.g., good, bad, amazing, irritating, ...) that occur frequently in product reviews, annotated with scores for sentiment polarity (positive ↔ negative) and subjectivity (objective ↔ subjective).
The sentiment() function returns a (polarity, subjectivity)-tuple for the given sentence, based on the adjectives it contains,
Here's an example that shows the behaviour of the algorithm. The polarity directly depends on the adjective used.
sentiment_analyzer.analyze('player')
Sentiment(polarity=0.0, subjectivity=0.0)
sentiment_analyzer.analyze('bad player')
Sentiment(polarity=-0.6999998, subjectivity=0.66666)
sentiment_analyzer.analyze('worst player')
Sentiment(polarity=-1.0, subjectivity=1.0)
sentiment_analyzer.analyze('best player')
Sentiment(polarity=1.0, subjectivity=0.3)
Professional software generally uses complex tools based on neural networks and classifiers combined with lexical analysis. But as far as I can tell, TextBlob just gives a result based directly on the grammar analysis (here, the polarity of the adjectives). That's the source of the problem.
It does not try to check whether the sentence as a whole is negated (by the word "not"); it only checks whether the adjective itself is negated (as it works only with adjectives, not with the general structure). Here, best is used as a noun and is not a negated adjective, so the polarity is positive.
sentiment_analyzer.analyze('not the best')
Sentiment(polarity=1.0, subjectivity=0.3)
Just change the order of the words so that the negation applies to the adjective rather than the whole sentence.
sentiment_analyzer.analyze('the not best')
Sentiment(polarity=-0.5, subjectivity=0.3)
Here, the adjective is negated. So, the polarity is negative.
That's my explanation of this "strange behaviour".
The real implementation is defined in this file:
https://github.com/sloria/TextBlob/blob/dev/textblob/_text.py
The interesting portion is:
if w in self and pos in self[w]:
    p, s, i = self[w][pos]
    # Known word not preceded by a modifier ("good").
    if m is None:
        a.append(dict(w=[w], p=p, s=s, i=i, n=1, x=self.labeler.get(w)))
    # Known word preceded by a modifier ("really good").
    ...
else:
    # Unknown word may be a negation ("not good").
    if negation and w in self.negations:
        n = w
    # Unknown word. Retain negation across small words ("not a good").
    elif n and len(w.strip("'")) > 1:
        n = None
    # Unknown word may be a negation preceded by a modifier ("really not good").
    if n is not None and m is not None and (pos in self.modifiers or self.modifier(m[0])):
        a[-1]["w"].append(n)
        a[-1]["n"] = -1
        n = None
    # Unknown word. Retain modifier across small words ("really is a good").
    elif m and len(w) > 2:
        m = None
    # Exclamation marks boost previous word.
    if w == "!" and len(a) > 0:
        ...
If we enter "not a good" or "not the good", it will match the else part because it's not a single adjective.
The "not a good" part will match elif n and len(w.strip("'")) > 1: so it will reverse polarity. not the good will not match any pattern, so, the polarity will be the same of "best".
The entire code is a succession of fine tweaks and grammar heuristics (for example, adding "!" increases polarity, adding a smiley indicates irony, ...). That's why some particular patterns give strange results. To handle a specific case, you must check which of the if branches in that part of the code your sentence will match.
I hope this helps.
I have a list
['mPXSz0qd6j0 youtube ', 'lBz5XJRLHQM youtube ', 'search OpHQOO-DwlQ ',
'sachin 47427243 ', 'alex smith ', 'birthday JEaM8Lg9oK4 ',
'nebula 8x41n9thAU8 ', 'chuck norris ',
'searcher O6tUtqPcHDw ', 'graham wXqsg59z7m0 ', 'queries K70QnTfGjoM ']
Is there some way to identify the strings in each list item that can't be spelled (i.e., aren't real words) and remove them?
You can use, e.g. PyEnchant for basic dictionary checking and NLTK to take minor spelling issues into account, like this:
import enchant
import nltk

spell_dict = enchant.Dict('en_US')  # or whatever language supported

def get_distance_limit(w):
    '''
    The word is considered good
    if it's no further from a known word than this limit.
    '''
    return len(w) / 5 + 2  # just for example, allowing around 1 typo per 5 chars.

def check_word(word):
    if spell_dict.check(word):
        return True  # a known dictionary word
    # try similar words
    max_dist = get_distance_limit(word)
    for suggestion in spell_dict.suggest(word):
        if nltk.edit_distance(suggestion, word) < max_dist:
            return True
    return False
Add case normalisation and a filter for digits and you'll get a pretty good heuristic.
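Putting that together for the list in the question, a rough sketch might look like the following; the token splitting and the digit filter are my own choices here, not part of the answer above.

items = ['mPXSz0qd6j0 youtube ', 'alex smith ', 'birthday JEaM8Lg9oK4 ']

def keep_real_words(item):
    # Drop tokens that contain digits or fail the dictionary-based check_word test.
    kept = [t for t in item.split()
            if not any(c.isdigit() for c in t) and check_word(t.lower())]
    return ' '.join(kept)

cleaned = [keep_real_words(item) for item in items]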
It is entirely possible to compare your list members against a collection of words that you believe to be valid for your input.
This can be done in many ways, partly depending on your definition of "properly spelled" and on what you end up using as a comparison list. If you decide that numbers, underscores, or mixed case preclude an entry from being valid, you could test for that with a regex match.
After the regex, you would have to decide what a valid character to split on should be. Is it spaces (are you willing to break on 'ad hoc' ('ad' is an abbreviation, 'hoc' is not a word))? Is it hyphens (this will break on hyphenated last names)?
With these criteria decided, it comes down to choosing which word, proper-name, and common-slang list to use, plus a list comprehension:
word_list[:] = [term for term in word_list if passes_my_membership_criteria(term)]
where passes_my_membership_criteria() is a function that contains the rules for staying in the list of words, returning False for things that you've decided are not valid.
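As a purely hypothetical example of such a function, combining a regex test for digits/underscores with a lookup against whatever word list you settle on (valid_words below is only a stand-in):

import re

# Stand-in for the word/proper-name/slang list you decide to use.
valid_words = {'alex', 'smith', 'birthday', 'nebula', 'chuck', 'norris',
               'graham', 'youtube', 'search', 'searcher', 'queries'}

def passes_my_membership_criteria(term):
    # Reject entries containing digits or underscores outright.
    if re.search(r'[\d_]', term):
        return False
    # Require every whitespace-separated token to be in the comparison list.
    tokens = term.strip().lower().split()
    return bool(tokens) and all(t in valid_words for t in tokens)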
I want to count unique words in a text, but I want to make sure that words followed by special characters aren't treated differently, and that the evaluation is case-insensitive.
Take this example
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
print len(set(w.lower() for w in text.split()))
The result would be 16, but I expect it to return 14. The problem is that 'boy.' and 'boy' are evaluated differently, because of the punctuation.
import re
print len(re.findall('\w+', text))
Using a regular expression makes this very simple. All you need to keep in mind is to make sure that all the characters are in lowercase, and finally combine the result using set to ensure that there are no duplicate items.
print len(set(re.findall('\w+', text.lower())))
You can use a regex here:
In [65]: text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
In [66]: import re
In [68]: set(m.group(0).lower() for m in re.finditer(r"\w+",text))
Out[68]:
set(['grown',
'boy',
'he',
'now',
'longer',
'no',
'is',
'there',
'up',
'one',
'a',
'the',
'has',
'handsome'])
I think that you have the right idea of using the Python built-in set type.
I think that it can be done if you first remove the '.' by doing a replace:
text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
punc_char= ",.?!'"
for letter in text:
if letter == '"' or letter in punc_char:
text= text.replace(letter, '')
text= set(text.split())
len(text)
That should work for you. And if you need to handle any other signs or punctuation marks, you can easily add them to punc_char and they will be filtered out.
Abraham J.
First, you need to get a list of words. You can use a regex as eandersson suggested:
import re

words = re.findall(r'\w+', text.lower())  # lowercase so 'Boy' and 'boy' count as the same word
Now, you want to get the number of unique entries. There are a couple of ways to do this. One way would be to iterate through the words list and use a dictionary to keep track of the number of times you have seen each word:
cwords = {}
for word in words:
    try:
        cwords[word] += 1
    except KeyError:
        cwords[word] = 1
Now, finally, you can get the number of unique words by
len(cwords)
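If all you need is the number of unique words, collections.Counter (or simply a set) can do the same bookkeeping for you:

import re
from collections import Counter

text = "There is one handsome boy. The boy has now grown up. He is no longer a boy now."
words = re.findall(r'\w+', text.lower())

print(len(Counter(words)))  # 14 unique words
print(len(set(words)))      # same result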