How to extract compound words from a multiline string in Python

I have a text string from which I want to extract specific words (fruits) that may appear in it; the words are stored in a set. (The real set is very large; I have simplified the code here.) I managed to extract single-word fruits with this simple code:
# The text string.
text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
# Remove noisy punctuation and lowercase the text string.
prep_text = text.replace(",", "")
prep_text = prep_text.replace(".", "")
prep_text = prep_text.replace(":", "")
prep_text = prep_text.lower()
# The word set.
fruits = {"apple", "banana", "orange", "blueberry",
          "pine apple"}
# Extracting single fruits.
extracted_fruits = []
for word in prep_text.split():
    if word in fruits:
        extracted_fruits.append(word)
print(extracted_fruits)
# Out: ['apple', 'apple', 'banana']
# Missing: 'pine apple', 'blueberry-orange'
# False: the second 'apple'
But if the text string contains a fruit compound separated by a space (here: "pine apple"), it is not extracted. Instead, just "apple" is extracted from it, even though I don't want this occurrence of "apple" because it is part of the compound. I know this happens because I used split() on prep_text. The hyphenated combination "blueberry-orange" is not extracted either, and I want to get it as well. Other hyphenated words that don't include fruits should not be extracted, though.
If I could use a variable fruit for each item in the fruits set, I would solve it with f-strings like:
fruit = # How can I get a single fruit element and similarly all fruit elements from 'fruits'?
hyphenated_fruit = f"{fruit}-{fruit}"
for word in prep_text.split():
    if word == hyphenated_fruit:
        extracted_fruits.append(word)
I can't use the actual strings "blueberry" and "orange" as variables though, because other fruits could also appear hyphenated in a different text string. Moreover, I don't want to just add "blueberry-orange" to the set - I'm searching for a way without changing the set.
Is there a way to add "pine apple" and "blueberry-orange" as well to the extracted_fruits list?
I appreciate any help and tips. Thanks a lot in advance!

The quickest (and dirtiest?) approach might be to use this regex:
(banana|orange|(pine\s*)?apple|blueberry)(-\s*(banana|orange|(pine\s*)?apple|blueberry))?
It will match pineapple, pine apple, and pine and apple separated by several spaces, newlines, or tabs.
One disadvantage is, of course, that it matches things like orange-orange and there's repeated text in there. You can construct the regex programmatically to fix that, though.
It's just a start, but it may be good enough for your use case. You can probably extend it with a few more capabilities as well.
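Building the pattern from the set could look roughly like the sketch below. The helper name build_fruit_pattern is made up for illustration, and the sketch assumes that compounds in the set are separated by whitespace and that hyphenated combinations contain no spaces around the hyphen. It removes the repeated text from the pattern source, but it still accepts repeats such as orange-orange.

```python
import re

text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""

fruits = {"apple", "banana", "orange", "blueberry", "pine apple"}


def build_fruit_pattern(fruits):
    """Hypothetical helper: compile one regex from the fruit set."""
    # Longest names first, so "pine apple" is tried before "apple".
    ordered = sorted(fruits, key=len, reverse=True)
    # Allow any run of whitespace (including newlines) inside a compound.
    parts = [r"\s+".join(re.escape(w) for w in name.split()) for name in ordered]
    one = "(?:" + "|".join(parts) + ")"
    # One fruit, optionally followed by further hyphen-joined fruits.
    return re.compile(rf"\b{one}(?:-{one})*\b", re.IGNORECASE)


pattern = build_fruit_pattern(fruits)
# Lowercase each match and collapse inner whitespace to a single space.
extracted = [" ".join(m.group().lower().split()) for m in pattern.finditer(text)]
print(extracted)
# ['apple', 'pine apple', 'banana', 'blueberry-orange']
```

The set itself stays unchanged; only the pattern is derived from it, so new fruits are picked up by rebuilding the pattern.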

Related

How to extract every string between two substring in a paragraph?

After web scraping, I get the following:
[<p>xxx<p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, <p>xxxxx</p>, <p>2.orange</p>, <p>aaa</p>, <p>xxxxx</p>,<p>3.banana</p>, <p>aaa</p>, <p>xxxxx</p>]
In this list, the "xxxx" entries are useless values. I can see that the result I want always lies between two substrings: Substring1 = "<p>1" / "<p>2" / "<p>3" and Substring2 = "</p>, <p>aaa".
Assume this pattern repeats hundreds of times. How do I get the result in Python? Many thanks!
My target result is :
apple
orange
banana
I have tried split and slicing with [sub1:sub2], but neither works.

From what I infer from your question (assuming the words you're looking for follow a marker of the form <p>number.), a regex would do the job:
import re
# html_string is the scraped markup as one string; the dot is escaped so it matches a literal "."
print(re.findall(r'<p>\d+\.([^<]+)', html_string))
# ['apple', 'orange', 'banana']
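For a version that runs on its own, here is a small sketch that applies the same idea to the sample from the question. It assumes the scraped list has already been turned into a single string (the name html_string and the way it is built here are assumptions), and it additionally requires the closing </p> tag.

```python
import re

# Assumption: the scraped list has been converted to one string.
html_string = (
    "[<p>xxx<p>, <p>1.apple</p>, <p>aaa</p>, <p>xxxxx</p>, <p>xxxxx</p>, "
    "<p>2.orange</p>, <p>aaa</p>, <p>xxxxx</p>,<p>3.banana</p>, <p>aaa</p>, <p>xxxxx</p>]"
)

# "<p>", a number, a literal dot, then everything up to the closing tag.
pattern = re.compile(r"<p>\d+\.\s*([^<]+)</p>")
items = [item.strip() for item in pattern.findall(html_string)]
print(items)
# ['apple', 'orange', 'banana']
```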

How do I find predefined words in a wall of text?

I have an idea for a personal project during this quarantine phase. I am learning Python on my own, so I thought I would see what I can do.
Question: I want to search large, irregular blocks of text for words hidden in them, much like a word search. I may already know the words I am looking for.
For example, I want to find
fruits = ["banana", "apples", "oranges"]
in
Text = "sdasfdsfdscbananassafdfdsafscdfbfnhujuyjhtrgrfeaddaDWEAFSERGRapplesfsgrgvscfaefcwecfrvtbhytofsdasrangesdaeubfuenijnzcjbvnkMDLOwkdpoaDPOSKPKFEOFJsfjuf"
How could I do that?
Also, it's my first time posting here, so I am not really confident about this community.
Sorry, and thank you.

Loop through all the words you want to find and check if they're in your text:
for fruit in fruits:
    if fruit in Text:
        print(f"Found {fruit}")
Or, using a list comprehension:
found = [fruit for fruit in fruits if fruit in Text]
print(found)
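If you also want to know where each word sits, str.find gives the index of the first occurrence. This is only a sketch: the helper name find_positions is made up, and ignoring case is an assumption, since the sample text mixes upper and lower case.

```python
fruits = ["banana", "apples", "oranges"]
Text = "sdasfdsfdscbananassafdfdsafscdfbfnhujuyjhtrgrfeaddaDWEAFSERGRapplesfsgrgvscfaefcwecfrvtbhytofsdasrangesdaeubfuenijnzcjbvnkMDLOwkdpoaDPOSKPKFEOFJsfjuf"


def find_positions(text, words):
    """Hypothetical helper: map each word to its first index, or -1 if absent."""
    haystack = text.lower()  # assumption: matching should ignore case
    return {word: haystack.find(word.lower()) for word in words}


positions = find_positions(Text, fruits)
for word, pos in positions.items():
    if pos >= 0:
        print(f"{word} found at index {pos}")
    else:
        print(f"{word} not found")
```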

fruits = ["banana", "apples", "oranges"]
text = "sdasfdsfdscbananassafdfdsafscdfbfnhujuyjhtrgrfeaddaDWEAFSERGRapplesfsgrgvscfaefcwecfrvtbhytofsdasrangesdaeubfuenijnzcjbvnkMDLOwkdpoaDPOSKPKFEOFJsfjuf"
for fruit in fruits:
    if fruit in text:
        print(fruit, "True")
    else:
        print(fruit, "False")

Check how many words from a given list occur in list of text/strings

I have a list of text data which contains reviews, something like this:
1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'
2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".',
3. 'This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis\' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.
I have a separate list of words, and I want to know which of them occur in these reviews:
['food','science','good','buy','feedback'....]
I want to know which of these words are present in each review and select the reviews that contain a certain number of them. For example, select only reviews that contain at least 3 words from this list, display those reviews, and also show which of the words were encountered in each one.
I have the code for selecting reviews containing at least 3 of the words, but how do I get the second part, which tells me exactly which words were encountered? Here is my initial code:
keywords = list(words)
text = list(df.summary.values)
sentences=[]
for element in text:
    if len(set(keywords) & set(element.split(' '))) >= 3:
        sentences.append(element)

To answer the second part, allow me to revisit how to approach the first part. A handy approach here is to cast your review strings into sets of word strings.
Like this:
review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))
Now the review_1 set contains one of every word. Then take your list of words, convert it to a set, and do an intersection.
words = ['food','science','good','buy','feedback']  # ... and the rest of your list
words = set(words)
matches = review_1.intersection(words)
The resulting set, matches, contains all the words that are common. The length of this is the number of matches.
Note that this does not work if you care about how many times each word matches. For example, if the word "food" is found twice in the review and "science" is found once, does that count as matching three words?
If so, let me know via comment and I can write some code to update the answer to include that scenario.
EDIT: Updating to include comment question
If you want to keep a count of how many times each word repeats, then hang onto the review as a list of words. Only cast it to a set when performing the intersection. Then use the list's count method to count the number of times each match appears in the review. In the example below, I use a dictionary to store the results.
review_1 = "I have bought several of the Vitality canned dog food products and"
review_words = review_1.split(" ")  # keep the list so repeats can be counted
words = set(['food','science','good','buy','feedback'])  # ... and the rest of your list
matches = set(review_words).intersection(words)
match_counts = dict()
for match in matches:
    # Count in the review's word list, not in the keyword set.
    match_counts[match] = review_words.count(match)

You can use set intersection for finding the common words:
def filter_reviews(data, *, trigger_words = frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
    for review in data:
        words = review.split() # use whatever method is appropriate to get the words
        common = trigger_words.intersection(words)
        if len(common) >= 3:
            yield review, common
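A possible way to call it is sketched below. The sample reviews are made up for illustration, the min_matches parameter is an addition that is not in the function above, and the regex tokenization is just one assumed way to get the words (it lowercases and drops punctuation so that "Good." still counts as "good").

```python
import re

TRIGGER_WORDS = frozenset({"food", "science", "good", "buy", "feedback"})


def filter_reviews(data, *, trigger_words=TRIGGER_WORDS, min_matches=3):
    """Yield (review, matched_words) for reviews with enough trigger words."""
    for review in data:
        # Assumed tokenization: lowercase, keep only runs of letters.
        words = re.findall(r"[a-z]+", review.lower())
        common = trigger_words.intersection(words)
        if len(common) >= min_matches:
            yield review, common


# Hypothetical sample reviews, made up for illustration.
reviews = [
    "Good food, would buy again.",
    "The science behind it is unclear.",
    "I left feedback: good food, bad packaging.",
]

selected = list(filter_reviews(reviews))
for review, common in selected:
    print(sorted(common), "->", review)
```

Because the function is a generator, each selected review arrives together with the set of words that triggered it, which covers the second part of the question.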

How to find words before/after certain keywords?

I am using the following code to count the number of phrases in a doc file:
phrases = ['yellow bananas']
clean_text = " ".join(re.findall(r'\w+(?:-\w+)*', doc))
for phrase in phrases:
    if phrase in clean_text:
        if phrase not in list_of_phrases:
            list_of_phrases[phrase] = clean_text.count(phrase)
        else:
            list_of_phrases[phrase] += clean_text.count(phrase)
The question is: instead of getting the whole sentence, is it possible to get one, two, three, etc. words before and after the keywords I am searching for?
EDIT:
Sample doc:
Yellow bananas are nice. I like fruits. Nobody knows how many fruits there are out there. There are yellow bananas and many other fruits. Bananas, apples, oranges, mangos.
The output would be a count of the phrases that contain the keyword (e.g. 'yellow bananas' in this case) together with 1, 2, 3, etc. words before and after it.
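One way this could be approached, as a sketch only: tokenize the document with the same regex as above, slide over the tokens, and collect a window around each occurrence of the phrase. The helper name phrase_contexts and its parameters are made up for illustration, and lowercasing the text is an assumption. Like clean_text, it ignores sentence boundaries.

```python
import re
from collections import Counter

doc = (
    "Yellow bananas are nice. I like fruits. Nobody knows how many fruits "
    "there are out there. There are yellow bananas and many other fruits. "
    "Bananas, apples, oranges, mangos."
)


def phrase_contexts(doc, phrase, n_before=1, n_after=1):
    """Hypothetical helper: the phrase plus up to n words on each side, per hit."""
    tokens = re.findall(r"\w+(?:-\w+)*", doc.lower())
    target = phrase.lower().split()
    k = len(target)
    hits = []
    for i in range(len(tokens) - k + 1):
        if tokens[i:i + k] == target:
            before = tokens[max(0, i - n_before):i]
            after = tokens[i + k:i + k + n_after]
            hits.append(" ".join(before + target + after))
    return hits


contexts = phrase_contexts(doc, "yellow bananas", n_before=2, n_after=2)
counts = Counter(contexts)  # how often each extended phrase occurs
for context, count in counts.items():
    print(count, context)
# 1 yellow bananas are nice
# 1 there are yellow bananas and many
```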

How to use backreferences as index to substitute via list?

I have a list
fruits = ['apple', 'banana', 'cherry']
I would like to replace all these elements in a text by their index in the list. I know that I can go through the list and use the string's replace method, like this:
text = "I like to eat apple, but banana are fine too."
for i, fruit in enumerate(fruits):
    text = text.replace(fruit, str(i))
What about using a regular expression? With \number we can backreference a match. But
import re
text = "I like to eat apple, but banana are fine too."
text = re.sub('apple|banana|cherry', fruits.index('\1'), text)
doesn't work. I get an error that \x01 is not in fruits. But \1 should refer to 'apple'.
I am interested in the most efficient way to do the replacement, but I would also like to understand regex better. How can I get the matched string from the backreference in a regex?
Thanks a lot.

Using Regex.
Ex:
import re
text = "I like to eat apple, but banana are fine too."
fruits = ['apple', 'banana', 'cherry']
pattern = re.compile("|".join(fruits))
text = pattern.sub(lambda x: str(fruits.index(x.group())), text)
print(text)
Output:
I like to eat 0, but 1 are fine too.
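As for why the original attempt fails: in a normal (non-raw) string literal '\1' is the single character chr(1), and fruits.index('\1') is evaluated before re.sub is even called, so it never sees a match. A callable replacement receives the match object instead. Below is a slightly more defensive sketch of the same idea; the dictionary name index_of is made up, and escaping the names and sorting them longest-first are precautions that this particular list does not strictly need.

```python
import re

fruits = ["apple", "banana", "cherry"]
text = "I like to eat apple, but banana are fine too."

# Build the fruit -> index mapping once, so each match is a dict lookup
# instead of a linear fruits.index() scan.
index_of = {fruit: str(i) for i, fruit in enumerate(fruits)}

# Escape the names and try longer ones first, so that no name is
# shadowed by a shorter one that happens to be its prefix.
alternatives = sorted(fruits, key=len, reverse=True)
pattern = re.compile("|".join(re.escape(name) for name in alternatives))

# The callable gets the match object; group(0) is the matched text.
result = pattern.sub(lambda m: index_of[m.group(0)], text)
print(result)
# I like to eat 0, but 1 are fine too.
```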
