Issues in lemmatization (nltk) - python

I am using the NLTK lemmatizer as follows:
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
mystring = "the sand rock needed to be mixed and shaked well before using it for construction works"
splits=mystring.split()
mystring = " ".join(lemmatizer.lemmatize(w) for w in splits)
print(mystring)
I am expecting the output to be
sand rock need to be mix and shake well before use it for construction work
However, in the output I get (shown below), it seems that words such as needed, mixed, shaked, and using have not been changed to their base forms.
sand rock needed to be mixed and shaked well before using it for construction work
Is there a way to resolve this problem?

You can replace the second-to-last line with this:
mystring = " ".join(lemmatizer.lemmatize(w, pos='v') for w in splits)
Here pos is the part-of-speech tag; 'v' tells the lemmatizer to treat each word as a verb. The default is 'n' (noun), which is why the verb forms were left unchanged.
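If the sentence mixes parts of speech, forcing pos='v' for every word can mangle nouns. A sketch that tags each word first and maps the tag (the wordnet_pos helper below is our own, not an NLTK function, and it assumes the averaged_perceptron_tagger data is downloaded):
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def wordnet_pos(treebank_tag):
    # Map a Treebank tag from nltk.pos_tag to a WordNet POS constant
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

lemmatizer = WordNetLemmatizer()
splits = "the sand rock needed to be mixed and shaked well before using it for construction works".split()
print(" ".join(lemmatizer.lemmatize(w, wordnet_pos(t)) for w, t in nltk.pos_tag(splits)))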

Related

Autodetect and translate two or more languages in a sentence using python

I have the following example sentence
text_to_translate1=" I want to go for swimming however the 天气 (weather) is not good"
As you can see, the sentence contains two languages (English and Chinese).
I want to translate it; the result I want is the following:
I want to go for swimming however the weather(weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
The code I use:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to send only the words that aren't English and translate them in place: loop over all words, check whether each one is English (approximated below with str.isascii()), translate the ones that are not, and then merge all the words back together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
# For each word, check if it is English
if not word.isascii():
new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
else:
new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)
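Note that translating word by word discards sentence context, so this works best for isolated foreign words like the one above; for longer runs of non-English text, you may want to group consecutive non-ASCII words and send each group to the translator as one unit.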

How to lemmatise nouns?

I am trying to lemmatise words like "Escalation" to "Escalate" using nltk.stem's WordNetLemmatizer.
from nltk.stem import WordNetLemmatizer
word_lem = WordNetLemmatizer()
print(word_lem.lemmatize("escalation", pos="n"))
Which pos tag should be used to get a result like "escalate"?
First, please notice that:
Stemming usually refers to a crude heuristic process that chops off the ends of words in the hope of achieving this goal correctly most of the time, and often includes the removal of derivational affixes. Lemmatization usually refers to doing things properly with the use of a vocabulary and morphological analysis of words, normally aiming to remove inflectional endings only and to return the base or dictionary form of a word, which is known as the lemma.
Now, if you want to obtain a canonical form for both "escalation" and "escalate", you can use a stemmer, e.g., the Porter stemmer.
from nltk.stem import PorterStemmer
ps = PorterStemmer()
print(ps.stem("escalate"))
print(ps.stem("escalation"))
The result is escal, which is not a dictionary word, but it is the same for both words.
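If you specifically want to map "escalation" to the verb "escalate", WordNet's derivational links are another option; a minimal sketch using NLTK's WordNet interface:
from nltk.corpus import wordnet as wn
# Follow derivational links from the noun "escalation" to related lemmas
for lemma in wn.lemmas("escalation"):
    for related in lemma.derivationally_related_forms():
        print(related.name())  # should include "escalate"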

NLTK punkt sentence tokenizer splitting on numeric bullets

I am using nltk PunktSentenceTokenizer for splitting paragraphs into sentences. I have paragraphs as follows:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
Output:
['1.', 'Candidate is very poor in mathematics.', '2.', 'Interpersonal skills are good.', '3.', 'Very enthusiastic about social work']
I tried to add sentence starters using the code below, but that did not work either.
from nltk.tokenize.punkt import PunktSentenceTokenizer
tokenizer = PunktSentenceTokenizer()
tokenizer._params.sent_starters.add('1.')
I would really appreciate it if anybody could point me in the right direction.
Thanks in advance :)
The use of regular expressions can provide a solution to this type of problem, as illustrated by the code below:
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
import re
reSentenceEnd = re.compile("\.|$")
reAtLeastTwoLetters = re.compile("[a-zA-Z]{2}")
previousMatch = 0
sentenceStart = 0
end = len(paragraphs)
while(True):
candidateSentenceEnd = reSentenceEnd.search(paragraphs, previousMatch)
# A sentence must contain at least two consecutive letters:
if reAtLeastTwoLetters.search(paragraphs[sentenceStart:candidateSentenceEnd.end()]) :
print(paragraphs[sentenceStart:candidateSentenceEnd.end()])
sentenceStart = candidateSentenceEnd.end()
if candidateSentenceEnd.end() == end:
break
previousMatch=candidateSentenceEnd.start() + 1
The output is:
Candidate is very poor in mathematics.
Interpersonal skills are good.
Very enthusiastic about social work
Many tokenizers (including NLTK's and spaCy's) can handle regular expressions, though adapting this code to their frameworks might not be trivial.
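Alternatively, for the specific numbered-bullet format in the question, a simpler sketch (assuming bullets always look like "1. ", "2. ", ...) is to split on them directly:
import re
paragraphs = "1. Candidate is very poor in mathematics. 2. Interpersonal skills are good. 3. Very enthusiastic about social work"
# Split on numeric bullets: optional whitespace, digits, a dot, then whitespace
sentences = [s for s in re.split(r"\s*\d+\.\s+", paragraphs) if s]
print(sentences)
# ['Candidate is very poor in mathematics.', 'Interpersonal skills are good.', 'Very enthusiastic about social work']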

How to identify the subject of a sentence?

Can Python + NLTK be used to identify the subject of a sentence? From what I have learned so far, a sentence can be broken into a head and its dependents. For example, in "I shot an elephant", I and elephant are dependents of shot. But how do I discern that the subject of this sentence is I?
You can use spaCy.
Code
import spacy
nlp = spacy.load('en_core_web_sm')  # the old 'en' shortcut was removed in spaCy 3
sent = "I shot an elephant"
doc = nlp(sent)
sub_toks = [tok for tok in doc if tok.dep_ == "nsubj"]
print(sub_toks)  # [I]
As the NLTK book (exercise 29) says, "One common way of defining the subject of a sentence S in English is as the noun phrase that is the child of S and the sibling of VP."
Look at a tree example: indeed, "I" is the noun phrase that is the child of S and the sibling of VP, while "elephant" is not.
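To make that definition concrete, here is a toy grammar sketch (the CFG below is our own illustration, not from the book):
import nltk
grammar = nltk.CFG.fromstring("""
S -> NP VP
NP -> 'I' | Det N
VP -> V NP
Det -> 'an'
N -> 'elephant'
V -> 'shot'
""")
parser = nltk.ChartParser(grammar)
for tree in parser.parse("I shot an elephant".split()):
    # The subject is the NP that is a child of S and a sibling of VP
    print(tree[0])  # (NP I)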
English has two voices: active and passive. Let's take the more commonly used one: the active voice.
It follows the subject-verb-object model. To mark the subject, write a rule set over POS tags. Tag the sentence: I[NOUN] shot[VERB] an elephant[NOUN]. The first noun is the subject, then comes the verb, and then the object.
To make it more complicated, take the sentence "I shot an elephant with a gun". Here prepositions or subordinate conjunctions like with, at, in can be given roles. The sentence is tagged as I[NOUN] shot[VERB] an elephant[NOUN] with[IN] a gun[NOUN], and you can say that the word introduced by with takes an instrumental role. You can build a rule-based system to assign a role to every word in the sentence, as sketched below.
Also look at the patterns in the passive voice and write rules for those.
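A minimal sketch of such a rule set using NLTK's POS tagger (the first-noun-before-the-verb heuristic below is our own and only covers simple active-voice sentences):
import nltk
# Requires the punkt and averaged_perceptron_tagger data packages

def naive_subject(sentence):
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    for word, tag in tagged:
        if tag.startswith("V"):
            return None  # reached the verb without finding a noun or pronoun
        if tag.startswith("NN") or tag == "PRP":
            return word  # the first noun/pronoun before the verb is the subject

print(naive_subject("I shot an elephant"))  # I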
rake_nltk (pip install rake_nltk) is a Python library that wraps NLTK and uses the RAKE keyword-extraction algorithm.
from rake_nltk import Rake
rake = Rake()
rake.extract_keywords_from_text("Can Python + NLTK be used to identify the subject of a sentence?")  # returns None; results are stored on the rake object
ranked_phrases = rake.get_ranked_phrases()
print(ranked_phrases)
# The keywords ordered by rank:
# ['used', 'subject', 'sentence', 'python', 'nltk', 'identify']
By default the stopword list from NLTK is used. You can provide your custom stopwords and punctuation characters by passing them to the constructor; note that stopwords expects the words themselves, not a filename:
with open('mystopwords.txt') as f:
    my_stopwords = f.read().split()
rake = Rake(stopwords=my_stopwords, punctuations=',;:!#$%^*/\\')
By default string.punctuation is used for punctuation.
The constructor also accepts a language keyword which can be any language supported by nltk.
The Stanford CoreNLP toolkit can also be used to extract subject-relation-object information from a sentence.
Code using spaCy: here doc is the sentence, and dep is 'nsubj' for the subject or 'dobj' for the object.
import spacy

nlp = spacy.load('en_core_web_lg')

def get_subject_object_phrase(doc, dep):
    doc = nlp(doc)
    for token in doc:
        if dep in token.dep_:
            # Return the full phrase spanned by the token's subtree
            subtree = list(token.subtree)
            start = subtree[0].i
            end = subtree[-1].i + 1
            return str(doc[start:end])
You can paper over Python 2 encoding issues by doing something like doc = nlp(text.decode('utf8')), but this will likely bring you more bugs in the future.
Credits: https://github.com/explosion/spaCy/issues/380
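A quick usage sketch for the function above (assuming the usual dependency parse for this sentence):
print(get_subject_object_phrase("I shot an elephant", "nsubj"))  # I
print(get_subject_object_phrase("I shot an elephant", "dobj"))   # an elephant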

Remove word extension in python

I've got a text with several words, and I want to remove all the derivational extensions of the words: for example, I want to remove the extensions -ed and -ing and keep the initial verb, so verifying or verified becomes verify. I found the method strip in Python, which removes a specific string from the beginning or end of a string, but that is not exactly what I want. Is there any library in Python which does such a thing?
I tried the code from the proposed post and noticed weird trimming of several words. For example, I've got the following text:
We goin all the way
Think ive caught on to a really good song ! Im writing
Lookin back on the stuff i did when i was lil makes me laughh
I sneezed on the beat and the beat got sicka
#nashnewvideo http://t.co/10cbUQswHR
Homee
So much respect for this man , truly amazing guy #edsheeran
http://t.co/DGxvXpo1OM
What a day ..
RT #edsheeran: Having some food with #ShawnMendes
#VoiceSave christina
Im gunna make the sign my signature pose
You all are so beautiful .. soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
After running my code (which also removes non-Latin characters and URLs), I get the following:
we goin all the way
think ive caught on to a realli good song im write
lookin back on the stuff i did when i wa lil make me laughh
i sneez on the beat and the beat got sicka
nashnewvideo
home
so much respect for thi man truli amaz guy
what a day
rt have some food with
voicesav christina
im gunna make the sign my signatur pose
you all are so beauti soooo beauti
thought that wa a realli awesom quot
beauti thing dont ask for attent
For example, it trims beautiful to beauti, quote to quot, and really to realli. My code is the following:
reader = csv.reader(f)
results = []
for row in reader:
    # Strip hashtags and URLs
    text = re.sub(r"(?:\#|https?\://)\S+", "", row[2])
    # Keep only printable characters (the original discarded this result)
    text = filter(lambda x: x in string.printable, text)
    # Remove punctuation (Python 2 str.translate), then digits and non-word chars
    out = text.translate(string.maketrans("", ""), string.punctuation)
    out = re.sub(r"[\W\d]", " ", out.strip())
    word_list = out.split()
    str1 = ""
    for verb in word_list:
        verb = verb.lower()
        verb = nltk.stem.porter.PorterStemmer().stem(verb)
        str1 = str1 + " " + verb + " "
    results.append(str1)
Instead of a stemmer, you can use a lemmatizer. Here's an example with Python NLTK:
from nltk.stem import WordNetLemmatizer
s = """
You all are so beautiful soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
"""
wnl = WordNetLemmatizer()
print " ".join([wnl.lemmatize(i) for i in s.split()]) #You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
In some cases, it may not do what you expect:
print(wnl.lemmatize('going'))  # going
Then you can combine both approaches: stemming and lemmatization.
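As in the first question above, supplying a verb POS tag usually resolves such cases; a small sketch:
from nltk.stem import WordNetLemmatizer
wnl = WordNetLemmatizer()
print(wnl.lemmatize('going', pos='v'))     # go
print(wnl.lemmatize('verified', pos='v'))  # verify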
Your question is a little bit general, but if you have a static text that is already defined, the best way is to write your own stemmer, because the Porter and Lancaster stemmers follow their own rules for stripping affixes, while the WordNet lemmatizer only removes affixes if the resulting word is in its dictionary.
You can write something like:
import re

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def stemmer(phrase):
    # Iterate over the words of the phrase, not its characters
    for word in phrase.split():
        if stem(word) != word:
            # Non-greedy (.*?) so 'processes' splits as ('process', 'es'), not ('processe', 's')
            print(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', word))
so for "processing processes" you will have:
>> stemmer('processing processes')
[('process', 'ing'),('process', 'es')]
