How do I extract text in complete sentences/paragraphs with Python?

I am currently trying to convert a PDF into text for the purposes of ML, but whenever I do so, it returns the text in broken lines, which makes the text less readable.
Here is what I am currently doing to convert the text:
import fitz, spacy

with fitz.open("asset/example2.pdf") as doc:
    text_ = ""
    for page in doc:
        text_ += page.getText()
and here are the results:
Animals - Animals have
always been near my
heart and it has led me to
involve myself in animal
rights events and
protests. It still stays as
a dream of mine to go
volunteer at an animal
sanctuary one day.
Food Travel - Through a
diet change, I have
found my love for food
and exploring different
food cultures across the
world. I feel confident
saying that I could write
an extensive
encyclopaedia for great
vegan restaurants.
What would be the best way to approach this?

I don't quite understand what result you are looking for, but if you would like all the text to be on one line you can use text.replace('\n', ''). You may also find text.split(separator) and separator.join(list) useful for formatting your string, for example:
string = 'This is my \nfirst sentence. This \nsecond sentence\n.'
print(string)
string = string.replace('\n', '')    # drop the hard line breaks
sentenceList = string.split('.')     # split into sentences
string = '.\n'.join(sentenceList)    # one sentence per line
print(string)
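Since you already import spacy, a minimal sketch along the same lines (assuming the en_core_web_sm model is installed via python -m spacy download en_core_web_sm) could collapse the broken lines and let spaCy split the flowed text back into sentences:

import spacy

nlp = spacy.load("en_core_web_sm")

# `text_` here stands in for the string extracted from the PDF, with its hard line breaks.
text_ = ("Animals - Animals have\nalways been near my\nheart and it has led me to\n"
         "involve myself in animal\nrights events and\nprotests. It still stays as\n"
         "a dream of mine to go\nvolunteer at an animal\nsanctuary one day.")

flowed = " ".join(text_.split())                       # join the broken lines into one string
sentences = [sent.text for sent in nlp(flowed).sents]  # let spaCy find the sentence boundaries
print("\n".join(sentences))                            # one full sentence per line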
I hope this answers your question.

Related

Autodetect and translate two or more languages in a sentence using Python

I have the following example sentence
text_to_translate1=" I want to go for swimming however the 天气 (weather) is not good"
As you can see there exist two languages in the sentence (i.e, English and Chinese).
I want a translator. The result that I want is the following:
I want to go for swimming however the weather(weather) is not good
I used the DeepL translator, but it cannot autodetect two languages in one sentence.
Here is the code I am using:
import deepl
auth_key = "xxxxx"
translator = deepl.Translator(auth_key)
result = translator.translate_text(text_to_translate1, target_lang="EN-US")
print(result.text)
print(result.detected_source_lang)
The result is the following:
I want to go for swimming however the 天气 (weather) is not good
EN
Any ideas?
I do not have access to the DeepL Translator, but according to their API you can supply a source language as a parameter:
result = translator.translate_text(text_to_translate1, source_lang="ZH", target_lang="EN-US")
If this doesn't work, my next best idea (if your target language is only English) is to translate only the words that aren't English, in place. You can do this by looping over all the words, checking whether each one is English, and translating it if it isn't. You can then merge all the words back together.
I would do it like so:
text_to_translate1 = "I want to go for swimming however the 天气 (weather) is not good"
new_sentence = []
# Split the sentence into words
for word in text_to_translate1.split(" "):
    # For each word, check if it is English
    if not word.isascii():
        new_sentence.append(translator.translate_text(word, target_lang="EN-US").text)
    else:
        new_sentence.append(word)
# Merge the sentence back together
translated_sentence = " ".join(new_sentence)
print(translated_sentence)

Search key phrases in text

I'm looking for a fast solution which allows me to find predefined phrases (1-5 words) in a (not big) text.
There can be up to 1000 phrases, so I suppose the simple find() function is not a good solution.
Could you advise what should I use?
Thanks in advance.
Update
Why I don't want to use brute-force search:
I believe it is not fast enough.
The text can have inclusions inside the phrases, i.e. the phrase can be Bank America, but the text has bank of America.
The phrases can be changed a little bit: apostrophes, -s endings, etc.
I'm not sure about your goal, but you can easily find predefined phrases in text like this:
predefined_phrases = ["hello", "unicorns with a big mouth!", "Sweet donats"]
isnt_big_text = "A big mouse fly by unicorns with a big mouth! with hello wold."
for phrase in predefined_phrases:
    if phrase in isnt_big_text:
        print("Phrase '%s' found in text" % phrase)

Using a regular expression to find all noun phrases in a paragraph following the occurrence of a specific phrase

I have a data frame of paragraphs, which I have split (or can split) into word tokens and sentence tokens, and I am looking to find all the noun phrases following any instance where the phrase "contribute to" or "donate to" occurs.
Or really some form of that, so:
"Contributions are welcome to be made to the charity of your choice."
---> would return: "the charity of your choice"
and
"blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation"
---> would return: "ABC Foundation"
I've created a regular expression work-around that captures the correct phrase about 90% of the time... see below:
text = nltk.Text(nltk.word_tokenize(x))
donation = TokenSearcher(text).findall(r"<\.> <.*>{,15}? <donat.*|contrib.*> <.*>*? <to> (<.*>+?) <\.|\,|\;> ")
donation = [' '.join(tokens) for tokens in donation]
return donation
I'd like to clean up that regular expression to get rid of the "{,15}" requirements because it's missing some of the values that I need. However, I'm not too polished with the "greedy" expressions and can't get it to work correctly.
so this phrase:
While she lived a full life , had many achievements and made many
**contributions** , FirstName is remembered by most for her cheerful smile ,
colorful track suits , and beautiful necklaces hand made by daughter FirstName .
FirstName always cherished her annual visit home for Thanksgiving to visit
brother FirstName LastName
is returning "visit brother FirstName LastName" because of the earlier mention of contributions, even though the word "to" comes more than 15 words later.
Looks like you're struggling with how to restrict your search criteria to a single sentence. So just use the NLTK to break your text into sentences (which it can do far better than just looking at periods), and your problem disappears.
sents = nltk.sent_tokenize(x)  # `x` is a single string, as in your example
recipients = []
for sent in sents:
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())
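As a quick check, a self-contained sketch with a made-up paragraph (assuming NLTK's punkt tokenizer data has been downloaded) behaves like this:

import re
import nltk

x = ("While she lived a full life, had many achievements and made many contributions, "
     "FirstName is remembered by most for her cheerful smile. "
     "She asked that donations be made to ABC Foundation.")

recipients = []
for sent in nltk.sent_tokenize(x):
    m = re.search(r"\b(contrib|donat).*?\bto\b([^.,;]*)", sent)
    if m:
        recipients.append(m.group(2).strip())

print(recipients)  # ['ABC Foundation'] -- the "contributions" sentence has no "to", so it is skipped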
For further work, I also recommend that you use a better tool than Text, which is intended for simple interactive explorations. If you do want to do more with your texts, the nltk's PlaintextCorpusReader is your friend.
(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)
If it works and does what you need, let me know and I will explain my regexp.
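As a rough sanity check, here is a sketch applying that pattern to the two example sentences from the question (re.IGNORECASE is added because the first example capitalizes "Contributions"):

import re

pattern = r"(?:contrib|donat|gifts)(?=[^\.]+\bto\b[^\.]+).*to\s([^\.]+)"

examples = [
    "Contributions are welcome to be made to the charity of your choice.",
    "blah blah blah donations, in honor of Firstname Lastname, can be made to ABC Foundation",
]
for text in examples:
    m = re.search(pattern, text, re.IGNORECASE)
    if m:
        print(m.group(1))
# the charity of your choice
# ABC Foundation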

Remove word extension in python

I've got a text with several words. I want to remove all the derivative extensions of the words. For example, I want to remove the extensions -ed and -ing and keep the initial verb: if I have verifying or verified, I want to keep verify. I found the method strip in Python, which removes a specific string from the beginning or end of a string, but that is not exactly what I want. Is there any library in Python which does such a thing?
I've tried the code from the proposed post and I've noticed weird trimming of several words. For example, I've got the following text:
We goin all the way
Think ive caught on to a really good song ! Im writing
Lookin back on the stuff i did when i was lil makes me laughh
I sneezed on the beat and the beat got sicka
#nashnewvideo http://t.co/10cbUQswHR
Homee
So much respect for this man , truly amazing guy #edsheeran
http://t.co/DGxvXpo1OM
What a day ..
RT #edsheeran: Having some food with #ShawnMendes
#VoiceSave christina
Im gunna make the sign my signature pose
You all are so beautiful .. soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
And after using my code below (which also removes non-Latin characters and URLs), I get:
we goin all the way
think ive caught on to a realli good song im write
lookin back on the stuff i did when i wa lil make me laughh
i sneez on the beat and the beat got sicka
nashnewvideo
home
so much respect for thi man truli amaz guy
what a day
rt have some food with
voicesav christina
im gunna make the sign my signatur pose
you all are so beauti soooo beauti
thought that wa a realli awesom quot
beauti thing dont ask for attent
For example, it trims beautiful to beauti, quote to quot, and really to realli. My code is the following:
reader = csv.reader(f)
print doc
for row in reader:
    text = re.sub(r"(?:\#|https?\://)\S+", "", row[2])
    filter(lambda x: x in string.printable, text)
    out = text.translate(string.maketrans("",""), string.punctuation)
    out = re.sub("[\W\d]", " ", out.strip())
    word_list = out.split()
    str1 = ""
    for verb in word_list:
        verb = verb.lower()
        verb = nltk.stem.porter.PorterStemmer().stem_word(verb)
        str1 = str1+" "+verb+" "
    list.append(str1)
    str1 = "\n"
Instead of a stemmer you can use a lemmatizer. Here's an example with Python NLTK:
from nltk.stem import WordNetLemmatizer
s = """
You all are so beautiful soooo beautiful
Thought that was a really awesome quote
Beautiful things don't ask for attention
"""
wnl = WordNetLemmatizer()
print " ".join([wnl.lemmatize(i) for i in s.split()]) #You all are so beautiful soooo beautiful Thought that wa a really awesome quote Beautiful thing don't ask for attention
In some cases, it may not do what you expect:
print wnl.lemmatize('going') #going
Then you can combine both approaches: stemming and lemmatization.
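For the going case specifically, a small sketch (assuming the WordNet data has been downloaded with nltk.download('wordnet')) shows that passing a part-of-speech tag to the lemmatizer already helps:

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# The default part of speech is 'n' (noun); pos='v' treats the word as a verb.
print(wnl.lemmatize('going', pos='v'))   # go
print(wnl.lemmatize('was', pos='v'))     # be
print(wnl.lemmatize('caught', pos='v'))  # catch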
Your question is a little bit general, but if you have a static, already-defined text, the best way is to write your own stemmer, because the Porter and Lancaster stemmers follow their own rules for stripping affixes, and the WordNet lemmatizer only removes affixes if the resulting word is in its dictionary.
You can write something like:
import re

def stem(word):
    for suffix in ['ing', 'ly', 'ed', 'ious', 'ies', 'ive', 'es', 's', 'ment']:
        if word.endswith(suffix):
            return word[:-len(suffix)]
    return word

def stemmer(phrase):
    # split the phrase into words before stemming each one
    for word in phrase.split():
        if stem(word):
            print re.findall(r'^(.*)(ing|ly|ed|ious|ies|ive|es|s|ment)$', word)
So for "processing processes" you will have:
>>> stemmer('processing processes')
[('process', 'ing')]
[('process', 'es')]

Python package to find pre-defined keywords/tags in a file / url / string

Are there any Python packages that can take a list of keywords / tags and match them up to a given string / file / URL?
Specifically using stemming and/or some other synonym way of matching.
i.e. my pre saved keywords:
Ski,
Bike,
Climb
my text:
Skiing in the mountains is great
Should get tagged with Ski
Skiing and mountain biking is fun
Should get tagged with Ski And Bike
And if I've got a synonyms file somewhere mapping Bike to MTB
MTB is a great way to spend the day
Should get tagged Bike
See Thesaurus (you can also try different modules, such as synonym module).
Also you can test sentences for containing specific strings using in:
>>> 'Ski' in 'Skiing in the mountains is great'
True
>>> 'Bike' in 'Skiing in the mountains is great'
False
I don't know of any package to do that, but it is actually very simple with plain Python using the standard re (regex) package. Something like:
import re

key_words = ['ski', 'bike', 'climb']
input = "Skiing and mountain biking is fun"
input_words = input.split()  # split on space
input_words = [word.lower() for word in input_words]
input_tags = []
for word in input_words:
    for key in key_words:
        if re.search(key, word):
            input_tags.append(key)
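To also cover the stemming and synonym part of the question, here is a rough sketch (not a ready-made package) using NLTK's PorterStemmer and a hypothetical synonyms dictionary standing in for your synonyms file:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

keywords = ["Ski", "Bike", "Climb"]
synonyms = {"mtb": "Bike"}   # hypothetical synonym map, e.g. loaded from your synonyms file

# Index the keywords by their stem so inflected forms ("skiing", "biking") still match.
stem_to_tag = {stemmer.stem(k.lower()): k for k in keywords}

def tag_text(text):
    tags = set()
    for word in text.lower().split():
        word = word.strip(".,!?")
        if word in synonyms:             # synonym lookup first
            tags.add(synonyms[word])
        else:
            stem = stemmer.stem(word)
            if stem in stem_to_tag:
                tags.add(stem_to_tag[stem])
    return sorted(tags)

print(tag_text("Skiing in the mountains is great"))     # ['Ski']
print(tag_text("Skiing and mountain biking is fun"))    # ['Bike', 'Ski']
print(tag_text("MTB is a great way to spend the day"))  # ['Bike']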
