How do I find predefined words in a wall of text? - python

I have an idea for a personal project during this quarantine phase. I am learning Python on my own, so I thought I would see what I can do.
Question: I want to decipher a large block of text of irregular length and find words in it, a bit like a word search. I may already know the words I am looking for.
For example, I want to find
fruits = ["banana", "apples", "oranges"]
in
Text = "sdasfdsfdscbananassafdfdsafscdfbfnhujuyjhtrgrfeaddaDWEAFSERGRapplesfsgrgvscfaefcwecfrvtbhytofsdasrangesdaeubfuenijnzcjbvnkMDLOwkdpoaDPOSKPKFEOFJsfjuf"
How could I do that?
Also, it's my first time posting here, so I am not really confident about this community.
Sorry, and thank you.

Loop through all the words you want to find and check if they're in your text:
for fruit in fruits:
    if fruit in Text:
        print(f"Found {fruit}")
Or, using a list comprehension:
found = [fruit for fruit in fruits if fruit in Text]
print(found)

fruits = ["banana", "apples", "oranges"]
text = "sdasfdsfdscbananassafdfdsafscdfbfnhujuyjhtrgrfeaddaDWEAFSERGRapplesfsgrgvscfaefcwecfrvtbhytofsdasrangesdaeubfuenijnzcjbvnkMDLOwkdpoaDPOSKPKFEOFJsfjuf"
for fruit in fruits:
    if fruit in text:
        print(fruit, "True")
    else:
        print(fruit, "False")
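If the position of each word matters too, str.find returns the index of the first occurrence, or -1 when the word is absent. Here is a minimal sketch that reuses fruits and text from above; the positions dictionary is a name introduced here purely for illustration.

```python
fruits = ["banana", "apples", "oranges"]
text = "sdasfdsfdscbananassafdfdsafscdfbfnhujuyjhtrgrfeaddaDWEAFSERGRapplesfsgrgvscfaefcwecfrvtbhytofsdasrangesdaeubfuenijnzcjbvnkMDLOwkdpoaDPOSKPKFEOFJsfjuf"

# str.find gives the index of the first match, or -1 if there is none.
# "positions" is a hypothetical name, not part of the original answers.
positions = {fruit: text.find(fruit) for fruit in fruits}

for fruit, index in positions.items():
    if index == -1:
        print(f"{fruit}: not found")
    else:
        print(f"{fruit}: found at index {index}")
```

With the sample text, "oranges" is reported as not found, because the text only contains "ranges". The check is also case-sensitive; lowercasing both the text and the words first would make it case-insensitive.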

Related

How to extract compound words from a multiline string in Python

I have a text string from which I want to extract specific words (fruits) that may appear in it; the respective words are stored in a set. (Actually, the set is very large, but I tried to simplify the code). I achieved extracting single fruit words with this simple code:
# The text string.
text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
# Remove noisy punctuation and lowercase the text string.
prep_text = text.replace(",", "")
prep_text = prep_text.replace(".", "")
prep_text = prep_text.replace(":", "")
prep_text = prep_text.lower()
# The word set.
fruits = {"apple", "banana", "orange", "blueberry",
          "pine apple"}
# Extracting single fruits.
extracted_fruits = []
for word in prep_text.split():
    if word in fruits:
        extracted_fruits.append(word)
print(extracted_fruits)
# Out: ['apple', 'apple', 'banana']
# Missing: 'pine apple', 'blueberry-orange'
# False: the second 'apple'
But if the text string contains a fruit compound separated by a space (here: "pine apple"), it is not extracted (or rather, only "apple" is extracted from it, even though I don't want this occurrence of "apple" because it's part of the compound). I know this is because I used split() on prep_text. The hyphenated combination "blueberry-orange" is not extracted either, and I want to get it as well. Other hyphenated words that don't include fruits should not be extracted, though.
If I could use a variable fruit for each item in the fruits set, I would solve it with f-strings like:
fruit = ...  # How can I get a single fruit element and similarly all fruit elements from 'fruits'?
hyphenated_fruit = f"{fruit}-{fruit}"
for word in prep_text.split():
    if word == hyphenated_fruit:
        extracted_fruits.append(word)
I can't use the actual strings "blueberry" and "orange" as variables though, because other fruits could also appear hyphenated in a different text string. Moreover, I don't want to just add "blueberry-orange" to the set - I'm searching for a way without changing the set.
Is there a way to add "pine apple" and "blueberry-orange" as well to the extracted_fruits list?
I appreciate any help and tips. Thanks a lot in advance!
The quickest (and dirtiest?) approach might be to use this regex:
(banana|orange|(pine\s*)?apple|blueberry)(-\s*(banana|orange|(pine\s*)?apple|blueberry))?
It will match pineapple, pine apple, pine and apple separated by several spaces, and pine and apple separated by newlines or tabs.
One disadvantage is, of course, that it matches things like orange-orange and there's repeated text in there. You can construct the regex programmatically to fix that, though.
It's just a start but may be good enough for your use case. You can grow it to add more capabilities for a bit, I think.
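As a rough sketch of what building the regex programmatically could look like: the helper below generates the alternation from the fruits set instead of typing it out twice. The name build_fruit_pattern is made up for this example, and the choices to try longer names first and to require at least one whitespace character inside a compound are assumptions, not part of the answer above.

```python
import re

fruits = {"apple", "banana", "orange", "blueberry", "pine apple"}

def build_fruit_pattern(fruits):
    """Compile one regex from the set (hypothetical helper, for illustration)."""
    alternatives = []
    # Longest names first, so "pine apple" is tried before plain "apple".
    for fruit in sorted(fruits, key=lambda f: (-len(f), f)):
        # Allow any run of whitespace (spaces, tabs, newlines) inside a compound.
        parts = [re.escape(part) for part in fruit.split()]
        alternatives.append(r"\s+".join(parts))
    one = "(?:" + "|".join(alternatives) + ")"
    # One fruit, optionally followed by more fruits joined with hyphens.
    return re.compile(rf"\b{one}(?:-{one})*\b")

text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""

pattern = build_fruit_pattern(fruits)
# Normalize inner whitespace so a compound split across lines prints cleanly.
extracted = [" ".join(m.group().split()) for m in pattern.finditer(text.lower())]
print(extracted)
# ['apple', 'pine apple', 'banana', 'blueberry-orange']
```

This leaves the set untouched and removes the repeated text from the pattern. It still matches things like orange-orange; filtering those out would need an extra check on each match.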

How to find words before/after certain keywords?

I am using the following code to count the number of phrases in a doc file:
phrases = ['yellow bananas']
clean_text = " ".join(re.findall(r'\w+(?:-\w+)*', doc))
for phrase in phrases:
    if phrase in clean_text:
        if phrase not in list_of_phrases:
            list_of_phrases[phrase] = clean_text.count(phrase)
        else:
            list_of_phrases[phrase] += clean_text.count(phrase)
The question is, is it possible instead of getting the whole sentence, to get one, two, three etc words before/after the keywords I am searching for?
EDIT:
Sample doc:
Yellow bananas are nice. I like fruits. Nobody knows how many fruits there are out there. There are yellow bananas and many other fruits. Bananas, apples, oranges, mangos.
The output would be a count of the phrases that contain the keyword, e.g. 'yellow bananas' in this case, together with 1, 2, 3, etc. words before and after the keyword.
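One possible way to approach this, offered only as a sketch and not taken from the original thread: split the document into words with the same regex as above, then slide over the word list and take a window of n words on each side of every match. The function name phrase_contexts and the n_words parameter are assumptions made for this example, as is lowercasing the text so that "Yellow bananas" also matches.

```python
import re

def phrase_contexts(doc, phrase, n_words=2):
    """Return (words_before, words_after) for every occurrence of phrase.

    Hypothetical helper: the name and the n_words parameter are illustrative.
    """
    # Same tokenization as in the question, applied to lowercased text.
    words = re.findall(r"\w+(?:-\w+)*", doc.lower())
    target = phrase.lower().split()
    k = len(target)
    contexts = []
    for i in range(len(words) - k + 1):
        if words[i:i + k] == target:
            before = words[max(0, i - n_words):i]
            after = words[i + k:i + k + n_words]
            contexts.append((before, after))
    return contexts

doc = ("Yellow bananas are nice. I like fruits. Nobody knows how many fruits "
       "there are out there. There are yellow bananas and many other fruits. "
       "Bananas, apples, oranges, mangos.")

contexts = phrase_contexts(doc, "yellow bananas", n_words=2)
for before, after in contexts:
    print(" ".join(before), "| yellow bananas |", " ".join(after))
```

Counting the resulting context strings, for example with collections.Counter, would then give the per-phrase counts asked for.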

Drop all strings that are a subset of another string in the same list

I'm working on a scraping project, and for some reason, on some paragraphs I get both the complete paragraph and the same paragraph divided into segments. So, if the paragraph is "My house is green. I like it.", I sometimes get:
["My house is green. I like it.", "My house is green.", "I like it."]
So, when I turn everything into text I will get that paragraph duplicated. Is there any way I can check which strings are a subset of other strings in a list?
My desired output in this case would be to be left only with ["My house is green. I like it."]
An efficient approach is to iterate through the list sorted by the lengths of phrases in reverse order, and add each possible sub-phrase to a set, so that you can use the set to efficiently check if the current phrase is a sub-phrase of a previous, longer phrase:
output = []
seen = set()
for phrase in sorted(l, key=len, reverse=True):
    words = tuple(phrase.split())
    if words not in seen:
        output.append(phrase)
        seen.update({words[i: i + n + 1] for n in range(len(words)) for i in range(len(words) - n)})
so that given:
l = ["My house is green. I like it.", "My house is green.", "I like it."]
output becomes:
['My house is green. I like it.']
I would take the longest string out of the list like this:
arr = ["My house is green. I like it.", "My house is green.", "I like it."]
print(max(arr, key=len))
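The longest string can't be a substring of any of the others, by definition. Note that this keeps only one string, so it assumes every other entry is a fragment of that same paragraph.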
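For lists that hold several unrelated paragraphs, a plain substring check is another option. The sketch below is quadratic in the number of strings, so it only suits short lists; drop_contained is a name invented for this example, and the extra "Another paragraph." entry is added just to show that unrelated strings survive.

```python
def drop_contained(strings):
    """Keep only the strings that are not contained in a different string.

    Hypothetical helper for illustration; O(n^2) comparisons.
    """
    # Remove exact duplicates first while preserving the original order.
    unique = list(dict.fromkeys(strings))
    return [s for s in unique
            if not any(s != other and s in other for other in unique)]

arr = ["My house is green. I like it.", "My house is green.",
       "I like it.", "Another paragraph."]
result = drop_contained(arr)
print(result)
# ['My house is green. I like it.', 'Another paragraph.']
```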

Python: how to check if a string contains an element from a list and bring out that element?

I have a list of sentences (exg) and a list of fruit names (fruit_list). I have code to check whether the sentences contain elements from fruit_list, as follows:
exg = ["I love apple.", "there are lots of health benefits of apple.",
       "apple is especially hight in Vitamin C,", "alos provide Vitamin A as a powerful antioxidant!"]
fruit_list = ["pear", "banana", "mongo", "blueberry", "kiwi", "apple", "orange"]
for j in range(0, len(exg)):
    sentence = exg[j]
    if any(word in sentence for word in fruit_list):
        print(sentence)
The output is below: only the sentences that contain "apple" are printed.
I love apple.
there are lots of health benefits of apple.
apple is especially hight in Vitamin C,
But I'd love to print out which word from fruit_list was found in the sentences. In this example, I'd love to get the word "apple" as output, instead of the sentences that contain the word apple.
Hope this makes sense. Please send help and thank you so so much!
Try using in to check each word in fruit_list; then you can use that word as a variable later.
To isolate which word was found, you'll need a different method than any(). any() only cares whether it can find a word from fruit_list. It does not care which word it was or where in the list it was found.
exg = ["I love apple.", "there are lots of health benefits of apple.",
       "apple is especially hight in Vitamin C,", "alos provide Vitamin A as a powerful antioxidant!"]
fruit_list = ["pear", "banana", "mongo", "blueberry", "kiwi", "apple", "orange"]
# You can remove the 0 from range, because it starts at 0 by default
# You can also loop through sentence directly
for sentence in exg:
    for word in fruit_list:
        if word in sentence:
            print("Found word:", word, " in:", sentence)
Result:
Found word: apple in: I love apple.
Found word: apple in: there are lots of health benefits of apple.
Found word: apple in: apple is especially hight in Vitamin C,
Instead of any with a generator expression, you can use a for clause with break:
for j in range(0, len(exg)):
    sentence = exg[j]
    for word in fruit_list:
        if word in sentence:
            print(f'{word}: {sentence}')
            break
Result:
apple: I love apple.
apple: there are lots of health benefits of apple.
apple: apple is especially hight in Vitamin C,
More idiomatic is to iterate the list rather than a range for indexing:
for sentence in exg:
    for word in fruit_list:
        if word in sentence:
            print(f'{word}: {sentence}')
            break
This will do the job.
for j in range(0, len(fruit_list)):
    fruit = fruit_list[j]
    if any(fruit in sentence for sentence in exg):
        print(fruit)
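If you want to keep the matches rather than just print them, a comprehension can collect them per sentence. This is only a sketch building on the answers above; found and unique_words are names introduced here for illustration.

```python
exg = ["I love apple.", "there are lots of health benefits of apple.",
       "apple is especially hight in Vitamin C,",
       "alos provide Vitamin A as a powerful antioxidant!"]
fruit_list = ["pear", "banana", "mongo", "blueberry", "kiwi", "apple", "orange"]

# Map every sentence to the list of fruit names it contains (possibly empty).
found = {sentence: [word for word in fruit_list if word in sentence]
         for sentence in exg}

# Flatten to the distinct words that were found anywhere.
unique_words = sorted({word for words in found.values() for word in words})
print(unique_words)
# ['apple']
```

Keep in mind that in does substring matching, so "apple" would also be found inside "pineapple"; matching whole words only would need something like a regex with \b word boundaries.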

Python: Searchable dictionary within a dictionary?

I'm new to Python, not used to handling this kind of stuff.
The code I've written:
search = str.lower(input("What are you looking for?" + " "))
knowledge = {"apple": 123, "test": "cats"}
def return_the_input(search):
    if search in knowledge:
        print(knowledge.get(search))
    else:
        print("No.")
return_the_input(search)
So what I'd like it to do is ask what you're looking for (apple), and then apple would display something similar to the ls command in Unix. So it would look like this:
"What are you looking for?" --input apple. Apple would then print out other values, like
Butter
Sauce
Snacks
And would then ask "What next?" --input butter
And all of the information I have on apple butter would display.
So do I set this up in the code as
knowledge = {"apple":{"butter": "info on butter here", "sauce": "info on sauce here"}, "cats":{et cetera}}
And then act upon that somehow to get the formatting I want? I assume maybe with some kind of for loop, or just print statements?
Your idea would work. But then what if an item belonged under two categories? A better solution would be to store all of the pieces of knowledge and then have a data structure mapping search words to knowledge.
knowledge = [
    [("apple", "butter"), "made from cooked down apple sauce"],
    [("peanut", "butter"), "made from crushed peanuts"],
]
def make_terms_from_knowledge(info):
    # Map each search word to the indexes of the entries that mention it.
    search_terms = {}
    for n, item in enumerate(info):
        for i in item[0]:
            search_terms.setdefault(i, []).append(n)
    return search_terms
terms = make_terms_from_knowledge(knowledge)
print(terms["butter"])
for entry in terms["butter"]:
    print(knowledge[entry][1])
You will need to regenerate terms whenever knowledge changes length. This is very simple but should get you thinking down other paths plus show some Python tips.
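For comparison, the nested dictionary from the question can also be walked directly with two small lookups. This is a minimal sketch under the assumption that every topic maps to a dictionary of sub-topics; list_subtopics and get_info are hypothetical names, and the "No." fallback mirrors the question's code.

```python
knowledge = {
    "apple": {"butter": "info on butter here", "sauce": "info on sauce here"},
}

def list_subtopics(knowledge, topic):
    """Return the sub-keys stored under a topic, like a directory listing."""
    return sorted(knowledge.get(topic.lower(), {}))

def get_info(knowledge, topic, subtopic):
    """Return the stored text, or "No." when either key is missing."""
    return knowledge.get(topic.lower(), {}).get(subtopic.lower(), "No.")

print(list_subtopics(knowledge, "apple"))
# ['butter', 'sauce']
print(get_info(knowledge, "apple", "butter"))
# info on butter here
```

In an interactive script, the first answer typed at the prompt would be passed as topic and the second as subtopic.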
