Remove word extensions in Python

I've got a text with several words. I want to remove all the derivational suffixes from the words: for example, I want to remove suffixes such as -ed and -ing and keep the initial verb, so if I have "verifying" or "verified" I want to keep "verify". I found the strip method in Python, which removes specific characters from the beginning or end of a string, but that is not exactly what I want. Is there a library which does such a thing in Python, for example?
I've tried the code from the proposed post and I've noticed weird trimming in several words. For example, I've got the following text:
We goin all the way
Think ive caught on to a really good song ! Im writing
Lookin back on the stuff i did when i was lil makes me laughh
I sneezed on the beat and the beat got sicka
#nashnewvideo http://t.co/10cbUQswHR
Homee
So much respect for this man , truly amazing guy #edsheeran
http://t.co/DGxvXpo1OM
What a day ..
RT #edsheeran: Having some food with #ShawnMendes
#VoiceSave christina
Im gunna make the sign my signature pose
You all are so beautiful .. soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
After running my code (which also removes non-Latin characters and URLs), the output is:
we goin all the way
think ive caught on to a realli good song im write
lookin back on the stuff i did when i wa lil make me laughh
i sneez on the beat and the beat got sicka
nashnewvideo
home
so much respect for thi man truli amaz guy
what a day
rt have some food with
voicesav christina
im gunna make the sign my signatur pose
you all are so beauti soooo beauti
thought that wa a realli awesom quot
beauti thing dont ask for attent
For example, it trims "beautiful" to "beauti", "quote" to "quot", and "really" to "realli". My code is the following:
import csv
import re
import string
import nltk

stemmer = nltk.stem.porter.PorterStemmer()
stemmed_rows = []
reader = csv.reader(f)  # f is the already-opened CSV file
for row in reader:
    # drop hashtags and urls
    text = re.sub(r"(?:\#|https?\://)\S+", "", row[2])
    # keep only printable (latin) characters
    text = filter(lambda x: x in string.printable, text)
    # strip punctuation, then replace remaining non-word characters and digits
    out = text.translate(string.maketrans("", ""), string.punctuation)
    out = re.sub(r"[\W\d]", " ", out.strip())
    str1 = ""
    for verb in out.split():
        verb = stemmer.stem(verb.lower())
        str1 = str1 + " " + verb + " "
    stemmed_rows.append(str1)
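For reference, a minimal check of the Porter stemmer in isolation (assuming NLTK is installed) reproduces the same truncations; the values in the comment are simply the ones shown in my output above:
from nltk.stem.porter import PorterStemmer

porter = PorterStemmer()
for w in ['beautiful', 'quote', 'really', 'amazing']:
    print(w, '->', porter.stem(w))  # beauti, quot, realli, amaz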

Instead of a stemmer you can use a lemmatizer. Here's an example with Python's NLTK:
from nltk.stem import WordNetLemmatizer
s = """
You all are so beautiful soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
"""
wnl = WordNetLemmatizer()
print " ".join([wnl.lemmatize(i) for i in s.split()]) #You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
In some cases, it may not do what you expect:
print wnl.lemmatize('going') #going
Then you can combine both approaches: stemming and lemmatization.
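A minimal sketch of one way to combine them (assuming NLTK with its WordNet data is available; normalize is a hypothetical helper name): lemmatize first, and only fall back to the Porter stemmer for words WordNet does not know at all.
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer

wnl = WordNetLemmatizer()
porter = PorterStemmer()

def normalize(word):
    # try the lemmatizer as a verb first (handles inflections like 'going'),
    # then with the default noun POS
    lemma = wnl.lemmatize(word, pos='v')
    if lemma == word:
        lemma = wnl.lemmatize(word)
    # fall back to the stemmer only for words WordNet does not know,
    # so dictionary words like 'beautiful' are left intact
    if lemma == word and not wordnet.synsets(word):
        return porter.stem(word)
    return lemma

for w in ['going', 'verifying', 'needed', 'beautiful']:
    print(w, '->', normalize(w))  # go, verify, need, beautiful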

Your question is a little bit general, but if you have a static text that is already defined, the best way is to write your own stemmer, because the Porter and Lancaster stemmers follow their own rules for stripping affixes, while the WordNet lemmatizer only removes an affix if the resulting word is in its dictionary.
You can write something like:
import re

def stem(word):
    # strip the first matching suffix, if any
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def stemmer(phrase):
    # collect (stem, suffix) pairs; the non-greedy stem makes the regex
    # split off the longest listed suffix ('es' rather than just 's')
    pairs = []
    for word in phrase.split():
        pairs.extend(re.findall(r'^(.*?)(ing|ly|ed|ious|ies|ive|es|s|ment)$', word))
    return pairs
so for "processing processes" you will have:
>> stemmer('processing processes')
[('process', 'ing'),('process', 'es')]

Related

Autodetect and translate two or more languages in a sentence using Python

I have the following example sentence:
text_to_translate1 = " I want to go for swimming however the 天气 (weather) is not good"
As you can see, there are two languages in the sentence (i.e., English and Chinese).
I want to translate it. The result that I want is the following:
I want to go for swimming however the weather (weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
The code that I use:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to send only the words that aren't English and translate them in place: loop over all the words, check whether each one is English, translate it if it isn't, and then merge all the words back together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
# For each word, check if it is English
if not word.isascii():
new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
else:
new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)

How do I extract text in complete sentences/paragraphs with Python?

I am currently trying to convert a PDF into text for the purposes of ML, but whenever I do so, it returns the text in broken lines, which is making the text less readable.
Here is what I am currently doing to convert the text:
import fitz, spacy

with fitz.open("asset/example2.pdf") as doc:
    text_ = ""
    for page in doc:
        text_ += page.getText()
and here are the results:
Animals - Animals have
always been near my
heart and it has led me to
involve myself in animal
rights events and
protests. It still stays as
a dream of mine to go
volunteer at an animal
sanctuary one day.
Food Travel - Through a
diet change, I have
found my love for food
and exploring different
food cultures across the
world. I feel confident
saying that I could write
an extensive
encyclopaedia for great
vegan restaurants.
What would be the best way to approach this?
I don't quite understand what result you are looking for, but if you would like all the text to be on one line you can use text.replace('\n', ''). You may also find text.split(separator) and separator.join(list) useful for formatting your string, for example:
string = 'This is my \nfirst sentence. This \nsecond sentence\n.'
print(string)
string = string.replace('\n', '')
sentenceList = string.split('.')
string = '.\n'.join(sentenceList)
print(string)
I hope this answers your question.
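Since the question already imports spaCy, another option is to rejoin the broken lines and let spaCy's sentence segmentation split the result back into full sentences. A minimal sketch (assuming the en_core_web_sm model is installed and text_ holds the text extracted above):
import spacy

nlp = spacy.load("en_core_web_sm")
# glue the hard-wrapped PDF lines back together, then segment into sentences
doc = nlp(text_.replace("\n", " "))
for sent in doc.sents:
    print(sent.text.strip())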

Issues in lemmatization (nltk)

I am using the NLTK lemmatizer as follows.
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
mystring = "the sand rock needed to be mixed and shaked well before using it for construction works"
splits=mystring.split()
mystring = " ".join(lemmatizer.lemmatize(w) for w in splits)
print(mystring)
I am expecting the output to be
sand rock need to be mix and shake well before use it for construction work
However, in the output I get (shown below), it seems that words such as needed, mixed, shaked and using have not been changed to their base forms.
sand rock needed to be mixed and shaked well before using it for construction work
Is there a way to resolve this problem?
You can replace the second-to-last line with this:
mystring = " ".join(lemmatizer.lemmatize(w, pos='v') for w in splits)
pos is the part-of-speech tag.
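Passing pos='v' treats every token as a verb, which works for this sentence but can mangle nouns. A sketch of one way to pick the tag per word with nltk.pos_tag (assuming the averaged_perceptron_tagger and wordnet data have been downloaded; wordnet_pos is a hypothetical helper):
import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

def wordnet_pos(treebank_tag):
    # map Penn Treebank tags to WordNet POS constants
    if treebank_tag.startswith('V'):
        return wordnet.VERB
    if treebank_tag.startswith('J'):
        return wordnet.ADJ
    if treebank_tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN

mystring = "the sand rock needed to be mixed and shaked well before using it for construction works"
tagged = nltk.pos_tag(mystring.split())
print(" ".join(lemmatizer.lemmatize(w, wordnet_pos(t)) for w, t in tagged))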

How to capitalize every beginning of a sentence in a text in python? [duplicate]

This question already has answers here: How to capitalize the first letter of every sentence? (15 answers). Closed 2 years ago.
I want to create a function that takes a string (a text) as input and capitalizes every letter that comes after a punctuation mark. The thing is, strings don't work like lists, so I don't really know how to do it. I tried this, but it doesn't seem to be working:
def capitalize(strin):
    listrin = list(strin)
    listrin[0] = listrin[0].upper()
    ponctuation = ['.', '!', '?']
    strout = ''
    for x in range(len(listrin)):
        if listrin[x] in ponctuation:
            if x != len(listrin):
                if listrin[x+1] != " ":
                    listrin[x+1] = listrin[x+1].upper()
                elif listrin[x+2] != " ":
                    listrin[x+1] = listrin[x+1].upper()
    for y in range(len(listrin)):
        strout = strout + listrin[y]
    return strout
For now, I am trying to solve it with this string: 'hello! how are you? please remember capitalization. EVERY time.'
I use a regexp to do this:
>>> import re
>>> line = 'hi. hello! how are you? fine! me too, haha. haha.'
>>> re.sub(r"(?:^|(?:[.!?]\s+))(.)",lambda m: m.group(0).upper(), line)
'Hi. Hello! How are you? Fine! Me too, haha. Haha.'
The most basic approach is to split the text on the punctuation, which gives you a list. Then loop over the items of the list, strip() them and capitalize() them. Something like the code below might solve your problem:
import re

input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
sentence = re.split(pass_your_punctuation_list_here, input_sen)
for i in sentence:
    print(i.strip().capitalize(), end='')
However, it is better to use the NLTK library:
from nltk.tokenize import sent_tokenize
input_sen = 'hello! how are you? please remember capitalization. EVERY time.'
sentences = sent_tokenize(input_sen)
sentences = [sent.capitalize() for sent in sentences]
print(sentences)
It is better to use NLTK or some other NLP library than to write rules and regexes by hand, because a library takes care of many cases we don't account for. It solves the problem of sentence boundary disambiguation (see the sketch after the quoted definition below).
Sentence boundary disambiguation (SBD), also known as sentence breaking, is the problem in natural language processing of deciding where sentences begin and end. Often natural language processing tools require their input to be divided into sentences for a number of reasons. However sentence boundary identification is challenging because punctuation marks are often ambiguous. For example, a period may denote an abbreviation, decimal point, an ellipsis, or an email address – not the end of a sentence. About 47% of the periods in the Wall Street Journal corpus denote abbreviations. As well, question marks and exclamation marks may appear in embedded quotations, emoticons, computer code, and slang. Languages like Japanese and Chinese have unambiguous sentence-ending markers.
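As a rough illustration of that ambiguity (assuming the punkt tokenizer data is downloaded), a naive split on '.' breaks at abbreviations and decimal points, while sent_tokenize typically splits only at the real sentence boundaries:
from nltk.tokenize import sent_tokenize

text = "Please see Dr. Smith at 5 p.m. tomorrow. It costs $3.5 per visit."
print(text.split('.'))     # also breaks at 'Dr.', 'p.m.' and '3.5'
print(sent_tokenize(text)) # typically yields the two actual sentences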
Hope it helps.

Cannot define rule priority in grako grammar for handling special tokens

I am trying to analyze some documents with a grammar generated via Grako that should parse simple sentences for further analysis, but I face some difficulties with special tokens.
The (Grako-style) EBNF looks like:
abbr::str = "etc." | "feat.";
word::str = /[^.]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
I used the above grammar on the following content:
This is a sentence. This is a sentence feat. an abbrevation. I don't know feat. etc. feat. know English.
The result using a simple NodeWalker:
[
'This is a sentence.',
'This is a sentence feat.',
'an abbrevation.',
"I don't know feat.",
'etc. feat. know English.'
]
My expectation:
[
'This is a sentence.',
'This is a sentence feat. an abbrevation.',
"I don't know feat. etc. feat. know English."
]
I have no clue why this happens, especially in the last sentence, where the abbreviations are part of the sentence while they are not in the prior sentences. To be clear, I want the abbr rule in the sentence definition to have higher priority than the word rule, but I don't know how to achieve this. I played around with negative and positive lookahead without success. I know how to achieve my expected results with regular expressions, but a context-free grammar is required for the further analysis, so I want to put everything in one grammar for the sake of readability. It has been a while since I last used grammars this way, but I don't remember running into this kind of problem. I searched for a while via Google with no success, so maybe the community can share some insight.
Thanks in advance.
Code I used for testing, if required:
from grako.model import NodeWalker, ModelBuilderSemantics
from parser import MyParser

class MyWalker(NodeWalker):
    def walk_Page(self, node):
        content = [self.walk(c) for c in node.content]
        print(content)

    def walk_Sentence(self, node):
        return ' '.join(node.content) + "."

    def walk_str(self, node):
        return node

def main(filename: str):
    parser = MyParser(semantics=ModelBuilderSemantics())
    with open(filename, 'r', encoding='utf-8') as src:
        result = parser.parse(src.read(), 'page')
    walker = MyWalker()
    walker.walk(result)
Packages used:
Python 3.5.2
Grako 3.16.5
The problem is with the regular expression you're using for the word rule. Regular expressions will match whatever you tell them to, and that regexp is also consuming the whitespace between words.
This modified grammar does what you want:
##grammar:: Pages
abbr::str = "etc." | "feat.";
word::str = /[^.\s]+/;
sentence::Sentence = content:{abbr | word} ".";
page::Page = content:{sentence};
start = page ;
A --trace run revealed the problem right away.
