NLTK punkt sentence tokenizer splitting on numeric bullets - python

I am using NLTK's PunktSentenceTokenizer to split paragraphs into sentences. I have paragraphs as follows:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
Output:
['1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work']
I tried to add sentence starters using the code below, but that didn't work either.
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')
I would really appreciate it if anybody could point me in the right direction.
Thanks in advance :)

The use of regular expressions can provide a solution to this type of problem, as illustrated by the code below:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
import re
reSentenceEnd = re.compile("\.|$")
reAtLeastTwoLetters = re.compile("[a-zA-Z]{2}")
previousMatch = 0
sentenceStart = 0
end = len(paragraphs)
while(True):
candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)
# A sentence must contain at least two consecutive letters:
if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]) :
print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
sentenceStart = candidateSentenceEnd.end()
if candidateSentenceEnd.end() == end:
break
previousMatch=candidateSentenceEnd.start() + 1
The output is:
Candidate is very poor in mathematics.
Interpersonal skills are good.
Very enthusiastic about social work
Many tokenizers (including NLTK and spaCy) can handle regular expressions, though adapting this code to their frameworks might not be trivial.
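If you would rather stay with NLTK, one workaround (a sketch of my own, not part of the answer above) is to strip the numeric bullets with a regular expression first and then let the default Punkt tokenizer split the cleaned text; this assumes nltk and its punkt sentence model are installed:

import re
import nltk

paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"

# Drop the leading "1. ", "2. ", ... markers before tokenizing.
cleaned = re.sub(r"\b\d+\.\s+", "", paragraphs)
print(nltk.sent_tokenize(cleaned))
# Expected: ['Candidate is very poor in mathematics.', 'Interpersonal skills are good.',
#            'Very enthusiastic about social work']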

Related

How to pick up adjectives or nouns out of a text?

For example:
sentence = 'An old lady lives in a small red house. She has three cute cats, and their names match with their colors: White, Cinnamon, and Chocolate. They are poor but happy.'
So I hope to get 2 lists like these:
adj = ['old','small','red','cute','White','poor','happy']
noun = ['lady','house','cats','names','colors','Cinnamon','Chocolate']
I saw someone mention NLTK, but I haven't used the package, so I would appreciate some instructions.
What you need is called part-of-speech (POS) tagging; you could check:
NLTK: https://www.nltk.org/book/ch05.html, https://www.geeksforgeeks.org/part-speech-tagging-stop-words-using-nltk-python/
Spacy: https://spacy.io/usage/linguistic-features#pos-tagging
If that is not enough, there are plenty of additional beginner tutorials out there if you google 'POS tagging + python'.
By the way, I would recommend spaCy as it is more modern.
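For instance, here is a minimal sketch with spaCy (assuming the small English model en_core_web_sm has been downloaded with python -m spacy download en_core_web_sm; the exact lists depend on the model's tagging, so colour names used as cat names may land in either list):

import spacy

nlp = spacy.load("en_core_web_sm")
sentence = ('An old lady lives in a small red house. She has three cute cats, '
            'and their names match with their colors: White, Cinnamon, and Chocolate. '
            'They are poor but happy.')

doc = nlp(sentence)
# Collect tokens by their coarse POS tag.
adj = [token.text for token in doc if token.pos_ == "ADJ"]
noun = [token.text for token in doc if token.pos_ in ("NOUN", "PROPN")]
print(adj)   # e.g. ['old', 'small', 'red', 'cute', 'poor', 'happy']
print(noun)  # e.g. ['lady', 'house', 'cats', 'names', 'colors', 'White', 'Cinnamon', 'Chocolate']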

Autodetect and translate two or more languages in a sentence using python

I have the following example sentence:
text_to_translate1=" I want to go for swimming however the 天气 (weather) is not good"
As you can see, there are two languages in the sentence (i.e., English and Chinese).
I want to translate it. The result I want is the following:
I want to go for swimming however the weather(weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
The code that I follow:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to translate only the words that aren't English, in place: loop over all the words, check whether each one is English, translate it if it isn't, and then merge the words back together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
    # For each word, check if it is English (plain ASCII)
    if not word.isascii():
        new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
    else:
        new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)

Issues in lemmatization (nltk)

I am using the NLTK lemmatizer as follows.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
mystring = "the sand rock needed to be mixed and shaked well before using it for construction works"
splits = mystring.split()
mystring = " ".join(lemmatizer.lemmatize(w) for w in splits)
print(mystring)
I am expecting the output to be
sand rock need to be mix and shake well before use it for construction work
However, in the output I get (shown below), it seems that words such as needed, mixed, shaked, and using have not been changed to their base forms.
sand rock needed to be mixed and shaked well before using it for construction work
Is there a way to resolve this problem?
You can replace the second-to-last line with this:
mystring = " ".join(lemmatizer.lemmatize(w, pos='v') for w in splits)
Here pos is the part-of-speech tag; 'v' tells the lemmatizer to treat each word as a verb.
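If you don't want to force every word to be treated as a verb, a common extension (a sketch of my own, assuming the wordnet and averaged_perceptron_tagger NLTK data packages are downloaded) is to POS-tag the sentence first and map the Penn Treebank tags to WordNet tags:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map Penn Treebank tags (JJ..., VB..., RB...) to WordNet POS constants.
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
mystring = "the sand rock needed to be mixed and shaked well before using it for construction works"
tagged = nltk.pos_tag(mystring.split())
print(" ".join(lemmatizer.lemmatize(word, pos=wordnet_pos(tag)) for word, tag in tagged))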

Remove word extension in python

I've got a text with several words. I want to remove all the derivational extensions of the words; for example, I want to remove the extensions -ed and -ing and keep the initial verb, so if I have 'verifying' or 'verified' I want to keep 'verify'. I found the strip method in Python, which removes a specific string from the beginning or end of a string, but that is not exactly what I want. Is there any library which does such a thing in Python?
I've tried the code from the proposed post and noticed weird trimming in several words. For example, I've got the following text:
We goin all the way
Think ive caught on to a really good song ! Im writing
Lookin back on the stuff i did when i was lil makes me laughh
I sneezed on the beat and the beat got sicka
#nashnewvideo http://t.co/10cbUQswHR
Homee
So much respect for this man , truly amazing guy #edsheeran
http://t.co/DGxvXpo1OM
What a day ..
RT #edsheeran: Having some food with #ShawnMendes
#VoiceSave christina
Im gunna make the sign my signature pose
You all are so beautiful .. soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
And after using the following code (I also remove non-Latin characters and URLs), I get this output:
we goin all the way
think ive caught on to a realli good song im write
lookin back on the stuff i did when i wa lil make me laughh
i sneez on the beat and the beat got sicka
nashnewvideo
home
so much respect for thi man truli amaz guy
what a day
rt have some food with
voicesav christina
im gunna make the sign my signatur pose
you all are so beauti soooo beauti
thought that wa a realli awesom quot
beauti thing dont ask for attent
For example, it trims 'beautiful' to 'beauti', 'quote' to 'quot', and 'really' to 'realli'. My code is the following:
reader = csv.reader(f)
print doc
for row in reader:
    text = re.sub(r"(?:\#|https?\://)\S+", "", row[2])
    filter(lambda x: x in string.printable, text)
    out = text.translate(string.maketrans("", ""), string.punctuation)
    out = re.sub("[\W\d]", " ", out.strip())
    word_list = out.split()
    str1 = ""
    for verb in word_list:
        verb = verb.lower()
        verb = nltk.stem.porter.PorterStemmer().stem_word(verb)
        str1 = str1 + " " + verb + " "
    list.append(str1)
    str1 = "\n"
Instead of a stemmer you can use a lemmatizer. Here's an example with Python NLTK:
from nltk.stem import WordNetLemmatizer

s = """
You all are so beautiful soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
"""

wnl = WordNetLemmatizer()
print(" ".join([wnl.lemmatize(i) for i in s.split()]))
# You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
In some cases, it may not do what you expect:
print(wnl.lemmatize('going'))  # going
Then you can combine both approaches: stemming and lemmatization.
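For instance, one way to combine them (a sketch of my own, assuming this kind of fallback is what is meant) is to lemmatize first and fall back to the Porter stemmer only when the lemmatizer leaves the word unchanged:

from nltk.stem import PorterStemmer, WordNetLemmatizer

wnl = WordNetLemmatizer()
porter = PorterStemmer()

def normalize(word):
    # Try the lemmatizer (as a verb) first; if it changes nothing, stem instead.
    lemma = wnl.lemmatize(word, pos='v')
    return lemma if lemma != word else porter.stem(word)

print(normalize('going'))      # 'go'     -- handled by the lemmatizer
print(normalize('beautiful'))  # 'beauti' -- falls through to the Porter stemmer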
Your question is a little bit general, but if you have static text that is already defined, the best way is to write your own stemmer, because the Porter and Lancaster stemmers follow their own rules for stripping affixes, and the WordNet lemmatizer only removes an affix if the resulting word is in its dictionary.
You can write something like:
import re
def stem(word):
for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
if word.endswith(suffix):
return word[:-len(suffix)]
return word
def stemmer(phrase):
for word in phrase:
if stem(word):
print re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', word)
so for "processing processes" you will have:
>> stemmer('processing processes')
[('process', 'ing'),('process', 'es')]

Select sentence having a selected word in it

Suppose I have a paragraph:
text = '''Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.[4][5] By the 1870s the scientific community and much of the general public had accepted evolution as a fact. However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.[6][7] In modified form, Darwin's scientific discovery is the unifying theory of the life sciences, explaining the diversity of life.[8][9]'''
If, say, I enter a word (favoured), how can I remove the entire sentence the word is in?
The method I used earlier was tedious: I would use sent_tokenize to break the paragraph (which is over 13000 words) and, since I had to check for more than 1000 words, I would run a loop to check for each word in each sentence. This takes a lot of time, as there are over 400 sentences.
Instead, I want to check for those 1000 words in the paragraph, and when a word is found, select all the words before it up to the previous full stop and all the words after it up to the next full stop.
This removes all sentences (things bounded by a .) that contain the word somewhere.
def remove_sentence(input, word):
    return ".".join(sentence for sentence in input.split(".")
                    if word not in sentence)
>>> remove_sentence(text, "published")
"[4][5] By the 1870s the scientific community and much of the general public had accepted evolution as a fact. However, many favoured competing explanations and it was not until the emergence of the modern evolutionary synthesis from the 1930s to the 1950s that a broad consensus developed in which natural selection was the basic mechanism of evolution.[6][7] In modified form, Darwin's scientific discovery is the unifying theory of the life sciences, explaining the diversity of life.[8][9]"
>>>
>>> remove_sentence(text, "favoured")
"Darwin published his theory of evolution with compelling evidence in his 1859 book On the Origin of Species, overcoming scientific rejection of earlier concepts of transmutation of species.[4][5] By the 1870s the scientific community and much of the general public had accepted evolution as a fact.[6][7] In modified form, Darwin's scientific discovery is the unifying theory of the life sciences, explaining the diversity of life.[8][9]"
I'm not sure I understand your question, but you can do something like:
text = 'whatever....'
sentences = text.split('.')
good_sentences = [e for e in sentences if 'my_word' not in e]
Is that what you are looking for?
You might be interested in trying something similar to the following program:
import re

SENTENCES = ('This is a sentence.',
             'Hello, world!',
             'Where do you want to go today?',
             'The apple does not fall far from the tree.',
             'Sally sells sea shells by the sea shore.',
             'The Jungle Book has several stories in it.',
             'Have you ever been up to the moon?',
             'Thank you for helping with my problem!')

BAD_WORDS = frozenset(map(str.lower, ('to', 'sea')))

def main():
    for index, sentence in enumerate(SENTENCES):
        if frozenset(words(sentence.lower())) & BAD_WORDS:
            print('Delete:', repr(sentence))

words = lambda sentence: (m.group() for m in re.finditer(r'\w+', sentence))

if __name__ == '__main__':
    main()
Reason
You start out with the sentences that you want to filter and the words you want to find.
You compare each sentence's set of words with the set of words you are looking for.
If there is an intersection, the sentence you are looking at is one to remove.
Output
Delete: 'Where do you want to go today?'
Delete: 'Sally sells sea shells by the sea shore.'
Delete: 'Have you ever been up to the moon?'
