How to remove adjectives or attributive before noun? - python

Currently I am using nltk to remove all the adjectives. This is my attempt:
import nltk
def remove_adj(sentence):
    adjective_tags = ["JJ", "JJR", "JJS"]
    tokens = nltk.word_tokenize(sentence)
    tags = nltk.pos_tag(tokens)
    # Keep every word whose tag is not an adjective tag
    words = [word for word, pos in tags if pos not in adjective_tags]
    return ' '.join(words)
But what I need is different from this one. Here are some examples:
input: "who has the highest revenue" output: "who has the revenue"
input: "who earned more than average income" output: "who earned more than income"
input: "what is the mean of profit" output: "what is the profit"
Can anyone give me some suggestions? Thanks all in advance.

I think I understand what you are trying to achieve, but what problem are you having? I've run your code and it appears to work perfectly at removing adjectives.
A couple of things are throwing me off though. For the input/output below, you can expect the word 'more' to be removed, as it is an adjective with the tag 'JJR'. Your post suggests that you were not expecting it to be removed.
input: "who earned more than average income" output: "who earned more than income"
Also, I'm not sure why you were expecting the word 'mean' to be removed in the below input/output, as it isn't an adjective.
input: "what is the mean of profit" output: "what is the profit"
A great place to check your sentences is Parts of Speech
Below would be your actual outputs, removing the adjectives correctly, and it seems to be doing just that.
input: "who has the highest revenue" output: "who has the revenue"
input: "who earned more than average income" output: "who earned than income"
input: "what is the mean of profit" output: "what is the mean of profit"
If you are simply trying to remove any descriptive elements pertaining to the noun, I would have to ask more about your problem. Your examples all ended with a noun, and this appears to be the noun you are focusing on. Will this be the case with all sentences that this code would handle? If so, you might consider iterating through your sentence backwards. You can easily identify the noun. As you step through, you would then look to see if the noun has a determiner (a, an, the) with tag 'DT', as you wouldn't want to remove that from what I see. You continue to step through removing everything until you reach an adjective or another noun. I don't know what your actual rules are for removing words on this one, but working backwards may help.
EDIT:
I tinkered with this a bit and got the below code to work exactly as you wanted on the outputs. You can populate tags in the 'stop_tags' variable if there are other speech tags you want it to stop on.
import nltk
def remove_adj(sentence):
    stop_tags = ["JJ", "JJR", "JJS", "NN"]
    tokens = nltk.word_tokenize(sentence)
    # Walk the tagged tokens from the end of the sentence
    tags = list(reversed(nltk.pos_tag(tokens)))
    noun_located = False
    stop_reached = False
    final_sent = ''
    for word,pos in tags:
        if noun_located == False and pos == 'NN':
            # The last noun of the sentence is always kept
            noun_located = True
            final_sent+=f' {word}'
        elif stop_reached == False and pos in stop_tags:
            # First adjective or noun before it: drop it and stop skipping
            stop_reached = True
        elif stop_reached == True:
            final_sent+=f' {word}'
    # Words were collected back to front, so restore the original order
    final_sent = ' '.join(reversed(final_sent.split(' ')))
    return final_sent
x = remove_adj('what is the mean of profit')
print(x)
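The backwards walk can also be tried without a tagger. The sketch below is a variant of the function above that takes already tagged (word, tag) pairs; the name remove_modifier is hypothetical, and the tags in the usage example are assigned by hand to match the ones discussed in this answer, not produced by nltk.

```python
def remove_modifier(tagged):
    """Apply the answer's right-to-left rule to (word, tag) pairs."""
    stop_tags = {"JJ", "JJR", "JJS", "NN"}
    kept = []
    noun_located = False
    stop_reached = False
    for word, pos in reversed(tagged):
        if not noun_located and pos == "NN":
            # The last noun of the sentence is always kept
            noun_located = True
            kept.append(word)
        elif not stop_reached and pos in stop_tags:
            # First adjective or noun before it is dropped
            stop_reached = True
        elif stop_reached:
            kept.append(word)
        # Anything between the last noun and the stop word is skipped
    return " ".join(reversed(kept))

# Hand-assigned tags (an assumption; a real tagger may disagree)
example = [("what", "WP"), ("is", "VBZ"), ("the", "DT"),
           ("mean", "NN"), ("of", "IN"), ("profit", "NN")]
print(remove_modifier(example))
```

Collecting the kept words in a list avoids the trailing space that the string concatenation version leaves behind.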

Related

Finding the singular or plural form of a word with regex

Let's assume I have the sentence:
sentence = "A cow runs on the grass"
If I want to replace the word cow with "some" special token, I can do:
to_replace = "cow"
# A <SPECIAL> runs on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace)})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Additionally, if I want to replace its plural form, I could do:
sentence = "The cows run on the grass"
to_replace = "cow"
# The <SPECIAL> run on the grass
sentence = re.sub(rf"(?!\B\w)({re.escape(to_replace) + 's?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
This performs the replacement even though to_replace stays in its singular form cow, because the optional s? also matches the plural.
My question is how to apply the same idea in a more general way, i.e., find and replace words which can be singular, plural ending with s, or plural ending with es (note that I'm intentionally ignoring many edge cases that could appear, as discussed in the comments of the question). Another way to frame the question: how can I add multiple optional suffixes to a word, so that it works for the following examples:
to_replace = "cow"
sentence1 = "The cow runs on the grass"
sentence2 = "The cows run on the grass"
# --------------
to_replace = "gas"
sentence3 = "There are many natural gases"
I suggest using regular Python logic; avoid stretching regexes too far if you don't need to:
phrase = "There are many cows in the field cowes"
for word in phrase.split():
    if word == "cow" or word == "cow" + "s" or word == "cow" + "es":
        phrase = phrase.replace(word, "replacement")
print(phrase)
Output:
There are many replacement in the field replacement
Apparently, for the use-case I posted, I can make the suffix optional. So it could go as:
re.sub(rf"(?!\B\w)({re.escape(e_obj) + '(s|es)?'})(?<!\w\B)", "<SPECIAL>", sentence, count=1)
Note that this would not work for many edge cases discussed in the comments!
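As a rough sketch, the optional suffix idea can be wrapped in a small helper and run against the three example sentences. The function name replace_word_forms and its token parameter are hypothetical, and the pattern uses plain \b word boundaries instead of the lookarounds from the question, which is a simplification.

```python
import re

def replace_word_forms(sentence, word, token="<SPECIAL>"):
    """Replace the first occurrence of word, word+'s' or word+'es'."""
    # (?:s|es)? makes either plural suffix optional; \b keeps the match
    # from ending in the middle of a longer word such as "cowboy"
    pattern = rf"\b{re.escape(word)}(?:s|es)?\b"
    return re.sub(pattern, token, sentence, count=1)

print(replace_word_forms("The cow runs on the grass", "cow"))
print(replace_word_forms("The cows run on the grass", "cow"))
print(replace_word_forms("There are many natural gases", "gas"))
```

Irregular plurals (for example "mice") are still not covered, in line with the edge cases mentioned above.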

Sentiment analysis with own lexicon

Hi, I am supposed to do a sentiment analysis of the three sentences below. I am wondering how to start this task because I am really stuck.
I am supposed to write a function that determines the sentiment of a comment by using a word list as an argument. The sentiment lexicon is given as follows:
• positive words: "good", "awesome", "excellent", "great"
• negative words: "bad", "broke", "terrible", "poor".
Then I need to count the positive and negative words in the comment in order to find out whether it is positive or negative, and print "positive comment" or "negative comment".
import string
def splitandremovepunc(s):
    # Translation table that deletes every punctuation character
    t = s.maketrans("", "", string.punctuation)
    return s.translate(t).split()
lst = ("Good for the price, but poor Bluetooth connections.")
lst2 = ("Excellent product. Awesome quality and good customer service.")
lst3 = ("The quality is terrible. I would not buy this product again.")
print(splitandremovepunc(lst))
print(splitandremovepunc(lst2))
print(splitandremovepunc(lst3))
One of the most basic ways to do this is simply counting the number of negative and positive words in each string, like so:
lst_split = ("Good for the price, but poor Bluetooth connections.").lower().split()
positive_word_list = ['good', 'great']
negative_word_list = ['bad', 'poor']
positive_score = 0
negative_score = 0
for word in positive_word_list:
    positive_score += lst_split.count(word)
for word in negative_word_list:
    negative_score += lst_split.count(word)
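The counting idea can be turned into the function the question asks for. This is a minimal sketch using the lexicon from the question; the name classify_comment is hypothetical, and returning "neutral comment" on a tie is an assumption, since the question does not say what should happen when the counts are equal.

```python
import string

POSITIVE_WORDS = {"good", "awesome", "excellent", "great"}
NEGATIVE_WORDS = {"bad", "broke", "terrible", "poor"}

def classify_comment(comment, positive=POSITIVE_WORDS, negative=NEGATIVE_WORDS):
    """Label a comment by comparing positive and negative word counts."""
    # Lowercase and strip punctuation so "Good" and "poor," are matched
    table = str.maketrans("", "", string.punctuation)
    words = comment.lower().translate(table).split()
    positive_score = sum(word in positive for word in words)
    negative_score = sum(word in negative for word in words)
    if positive_score > negative_score:
        return "positive comment"
    if negative_score > positive_score:
        return "negative comment"
    return "neutral comment"  # assumption: tie handling is not specified

print(classify_comment("Excellent product. Awesome quality and good customer service."))
print(classify_comment("The quality is terrible. I would not buy this product again."))
```

Passing the word lists as arguments keeps the function usable with a different lexicon.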
May I propose VADER (Valence Aware Dictionary and sEntiment Reasoner)? It’s a sentiment lexicon and you can use it like this:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
sentiment_analyzer = SentimentIntensityAnalyzer()
result = sentiment_analyzer.polarity_scores('Good for the price, but poor Bluetooth connections.')
result is a dictionary that includes information concerning the sentence's sentiment polarity. The first three values (neg, neu, pos) are the portion of the text that is negative, neutral and positive. The last value stands for compound score, which represents the overall negativity or positivity score, normalised, ranging from -1 to +1.

How to count the number of words in the list, provided there is more than one?

For example I have a text,
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
I need to count how often a word occurs in this text; the word is entered with input().
Using a list, dict, or set is a required condition.
It is also not clear to me how to ignore punctuation marks.
My solution, but perhaps there is a cleaner way.
text = list(text.split(' '))
word = input('Enter a word: ')
for i in text:
    if text.count(word) < 2:
        break
    if word in text:
        print(f'{word} - {text.count(word)}')
        break
Output:
this - 2
the - 7
The word 'moment' occurs only once in the text, so it should not be printed.
You can think of this as two steps:
1. Clean the input
2. Find the count
A fast way to clean the input is to strip all the punctuation first using translate combined with string.punctuation:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
Now you have all the text with no punctuation and can split it into words and count:
import string
clean = text.translate(str.maketrans('', '', string.punctuation)).split()
word = "this"
count = clean.count(word)
if count > 1:
    print(f'{word} - {count}')
# prints: this - 2
Since you are using count you don't need to loop. Just be careful not to call count multiple times if you don't need to. Each time you do, it needs to look through the whole list. Notice above the code calls it once and saves it so we can use the count in multiple places.
You can use collections.Counter() to get a dictionary of the number of occurrences of each element in a list:
import collections
text = '''
Wales’ greatest moment. Lille is so close to the Belgian border, this was
essentially a home game for one of the tournament favourites. Their confident supporters mingled
with their new Welsh fans on the streets, buying into the carnival spirit - perhaps
more relaxed than some might have been before a quarter-final because they
thought this was their time.
In the driving rain, Wales produced the best performance in their history to carry
the nation into uncharted territory. Nobody could quite believe it.
'''
word = input('Enter a word: ')
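# Remove punctuation from text, keeping letters, spaces and newlines
# (removing the newlines would glue words on adjacent lines together)
for char in text:
    if char.lower() not in "abcdefghijklmnopqrstuvwxyz \n":
        text = text.replace(char, "")
wordcount = collections.Counter(text.split())
print(f"{word} - {wordcount[word]}")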
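To report every word that occurs more than once, rather than a single word typed in by the user, the two steps can be combined into one helper. This is a sketch only: the name repeated_words and the min_count parameter are hypothetical, and matching is case-sensitive, so "This" and "this" are counted separately.

```python
import string
from collections import Counter

def repeated_words(text, min_count=2):
    """Return a dict of words that occur at least min_count times."""
    # Step 1: clean the input by deleting ASCII punctuation
    table = str.maketrans("", "", string.punctuation)
    words = text.translate(table).split()
    # Step 2: count every word once, then filter
    counts = Counter(words)
    return {word: n for word, n in counts.items() if n >= min_count}

sample = "In the rain, the team won. This was it, this was the time."
print(repeated_words(sample))
```

Note that string.punctuation only covers ASCII characters, so a typographic apostrophe such as the one in "Wales’" stays attached to its word.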

Lowercase all text except elements in a list

I have a text like this: s = "I am Enrolled in a course, MPhil since 2014. I LOVE this SO MuCH."
And a list of words list = ["MPhil", "MuCH"]
I am looking for a regex code that is able to lowercase all the text except the elements of the list.
I found this regex solution that is able to lowercase all except the words between '':
s = re.sub(r"\b(?<!')(\w+)(?!')\b", lambda match: match.group(1).lower(), s)
But I don't know how to turn it into my case.
I tried splitting the text and checking whether each word is in the list, but I didn't find that very practical.
If someone could give me a hint or suggest something, I'd be thankful.
Just see whether the word you've matched is in the set of words to keep as-is:
import re
words_to_keep = {"MPhil", "MuCH"}
def replace_if_not_in_keeplist(match):
    word = match.group()
    if word in words_to_keep:
        return word
    return word.lower()
s = "I am Enrolled in a course, MPhil since 2014. I LOVE this SO MuCH."
s2 = re.sub(r"\w+", replace_if_not_in_keeplist, s)
print(s)
print(s2)
outputs
I am Enrolled in a course, MPhil since 2014. I LOVE this SO MuCH.
i am enrolled in a course, MPhil since 2014. i love this so MuCH.

How to slice a string input at a certain unknown index

A string is given as an input (e.g. "What is your name?"). The input always contains a question, which I want to extract. The problem I am trying to solve is that the question is always surrounded by unneeded text.
So the input could be (but not limited to) the following:
1- "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"
2- "What is your\nlastname and email?\ndasf?lkjas"
3- "askjdmk.\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"
(Notice that in the third input, the question starts with the word "Given" and ends with "yourself?")
The above input examples are produced by the pytesseract OCR library, which scans an image and converts it into text.
I only want to extract the question from the garbage input and nothing else.
I tried to use the string method find('?', 1) to get the index of the end of the question (assuming for now that the first question mark always ends the question and is not part of the text I want to discard). But I can't figure out how to get the index of the first letter of the question. I tried to loop in reverse and stop at the first \n, but the question isn't always preceded by \n.
def extractQuestion(q):
    index_end_q = q.find('?', 1)
    index_first_letter_of_q = 0 # TODO
    question = '\n ' . join(q[index_first_letter_of_q :index_end_q ])
A way to find the question's first word index would be to search for the first word that has an actual meaning (you're interested in English words I suppose). A way to do that would be using pyenchant:
#!/usr/bin/env python
import enchant
GLOSSARY = enchant.Dict("en_US")
def isWord(word):
    return True if GLOSSARY.check(word) else False
sentences = [
    "eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
    "What is your\nlastname and email?\ndasf?lkjas",
    "\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
for sentence in sentences:
    for i,w in enumerate(sentence.split()):
        if isWord(w):
            print('index: {} => {}'.format(i, w))
            break
The above piece of code gives as a result:
index: 3 => What
index: 0 => What
index: 0 => Given
You could try a regular expression like \b[A-Z][a-z][^?]+\?, meaning:
The start of a word \b with an upper case letter [A-Z] followed by a lower case letter [a-z],
then a sequence of non-questionmark-characters [^?]+,
followed by a literal question mark \?.
This can still have some false positives or misses, e.g. if a question actually starts with an acronym, or if there is a name in the middle of the question, but for your examples it works quite well.
>>> tests = ["eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn",
"What is your\nlastname and email?\ndasf?lkjas",
"\nGiven your skills\nhow would you rate yourself?\nand your name? dasf?"]
>>> import re
>>> p = r"\b[A-Z][a-z][^?]+\?"
>>> [re.search(p, t).group() for t in tests]
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
If that's one blob of text, you can use findall instead of search:
>>> text = "\n".join(tests)
>>> re.findall(p, text)
['What is your name?',
'What is your\nlastname and email?',
'Given your skills\nhow would you rate yourself?']
Actually, this also seems to work reasonably well for questions with names in them:
>>> t = "asdGARBAGEasd\nHow did you like St. Petersburg? more stuff with ?"
>>> re.search(p, t).group()
'How did you like St. Petersburg?'
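For use outside the interactive session, the pattern can be wrapped in a function that also handles input without any match. This is a minimal sketch; the name extract_question and the choice to return None when nothing matches are assumptions, not part of the original answer.

```python
import re

# Capitalised word start, then everything up to the first question mark
QUESTION_PATTERN = re.compile(r"\b[A-Z][a-z][^?]+\?")

def extract_question(text):
    """Return the first question found in text, or None if there is none."""
    match = QUESTION_PATTERN.search(text)
    return match.group() if match else None

print(extract_question("eo000 ATATAT EG\n\nWhat is your name?\nkgda dasflkjasn"))
```

Checking the match before calling group() avoids the AttributeError that re.search(...).group() raises when the OCR output contains no question.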
