How to find words before/after certain keywords? - python

I am using the following code to count the number of phrases in a doc file (list_of_phrases is a dict of phrase counts):

import re

phrases = ['yellow bananas']
list_of_phrases = {}
clean_text = " ".join(re.findall(r'\w+(?:-\w+)*', doc))
for phrase in phrases:
    if phrase in clean_text:
        if phrase not in list_of_phrases:
            list_of_phrases[phrase] = clean_text.count(phrase)
        else:
            list_of_phrases[phrase] += clean_text.count(phrase)
The question is: instead of getting the whole sentence, is it possible to get one, two, three, etc. words before/after the keywords I am searching for?
EDIT:
Sample doc:
Yellow bananas are nice. I like fruits. Nobody knows how many fruits there are out there. There are yellow bananas and many other fruits. Bananas, apples, oranges, mangos.
Output would be a count of phrases that contain the keyword (e.g. 'yellow bananas' in this case) together with 1, 2, 3 etc. words before and after the keyword.
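Since the thread doesn't show an answer here, one way to do this is a sketch (not from the thread; phrase_contexts and the window parameter are made-up names) that builds a regex around each phrase and captures up to window words on either side:

```python
import re

def phrase_contexts(doc, phrases, window=2):
    # Same normalization as above: words and hyphenated words joined by spaces.
    clean_text = " ".join(re.findall(r'\w+(?:-\w+)*', doc))
    word = r'\w+(?:-\w+)*'
    contexts = {}
    for phrase in phrases:
        # Capture up to `window` words before and after each occurrence.
        pattern = (r'((?:%s ){0,%d})%s((?: %s){0,%d})'
                   % (word, window, re.escape(phrase), word, window))
        contexts[phrase] = [(before.strip(), after.strip())
                            for before, after in
                            re.findall(pattern, clean_text, flags=re.IGNORECASE)]
    return contexts

doc = ("Yellow bananas are nice. I like fruits. Nobody knows how many fruits "
       "there are out there. There are yellow bananas and many other fruits. "
       "Bananas, apples, oranges, mangos.")
print(phrase_contexts(doc, ['yellow bananas']))
```

len(contexts[phrase]) then gives the phrase count, and each tuple holds the (before, after) context, e.g. ('There are', 'and many') for the second occurrence.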

Related

How to extract compound words from a multiline string in Python

I have a text string from which I want to extract specific words (fruits) that may appear in it; the respective words are stored in a set. (Actually, the set is very large, but I tried to simplify the code). I achieved extracting single fruit words with this simple code:
# The text string.
text = """
My friend loves healthy food: Yesterday, he enjoyed
an apple, a pine apple and a banana. But what he
likes most, is a blueberry-orange cake.
"""
# Remove noisy punctuation and lowercase the text string.
prep_text = text.replace(",", "")
prep_text = prep_text.replace(".", "")
prep_text = prep_text.replace(":", "")
prep_text = prep_text.lower()
# The word set.
fruits = {"apple", "banana", "orange", "blueberry",
"pine apple"}
# Extracting single fruits.
extracted_fruits = []
for word in prep_text.split():
    if word in fruits:
        extracted_fruits.append(word)
print(extracted_fruits)
# Out: ['apple', 'apple', 'banana']
# Missing: 'pine apple', 'blueberry-orange'
# False: the second 'apple'
But if the text string contains a fruit compound separated by a space (here: "pine apple"), it is not extracted; rather, just "apple" is extracted from it, even though I don't want this occurrence of "apple" because it's part of the compound. I know this is because I used split() on prep_text. The hyphenated combination "blueberry-orange", which I want to get as well, isn't extracted either. Other hyphenated words that don't include fruits should not be extracted, though.
If I could use a variable fruit for each item in the fruits set, I would solve it with f-strings like:
fruit = # How can I get a single fruit element and similarly all fruit elements from 'fruits'?
hyphenated_fruit = f"{fruit}-{fruit}"
for word in prep_text.split():
    if word == hyphenated_fruit:
        extracted_fruits.append(word)
I can't use the actual strings "blueberry" and "orange" as variables though, because other fruits could also appear hyphenated in a different text string. Moreover, I don't want to just add "blueberry-orange" to the set - I'm searching for a way without changing the set.
Is there a way to add "pine apple" and "blueberry-orange" as well to the extracted_fruits list?
I appreciate any help and tips. Thanks a lot in advance!
The quickest (and dirtiest?) approach might be to use this regex:
(banana|orange|(pine\s*)?apple|blueberry)(-\s*(banana|orange|(pine\s*)?apple|blueberry))?
It will match pineapple and pine apple, as well as variants with multiple spaces, newlines, or tabs between pine and apple.
One disadvantage is, of course, that it matches things like orange-orange and there's repeated text in there. You can construct the regex programmatically to fix that, though.
It's just a start, but it may be good enough for your use case. You can grow it to add more capabilities bit by bit, I think.
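For instance, the alternation can be built from the set itself (a sketch; sorting longest-first lets "pine apple" win over plain "apple", and turning spaces into \s* covers the compound spellings):

```python
import re

fruits = {"apple", "banana", "orange", "blueberry", "pine apple"}

# Longest alternatives first; spaces inside multi-word fruits become \s*
# so that "pineapple", "pine apple", and "pine\napple" all match.
alternatives = "|".join(
    re.escape(fruit).replace("\\ ", r"\s*").replace(" ", r"\s*")
    for fruit in sorted(fruits, key=len, reverse=True)
)
pattern = re.compile(rf"\b(?:{alternatives})(?:-(?:{alternatives}))?\b")

text = "he enjoyed an apple, a pine apple and a blueberry-orange cake"
print(pattern.findall(text))  # ['apple', 'pine apple', 'blueberry-orange']
```

The double replace is just defensive: re.escape stopped escaping spaces in Python 3.7, so this handles both behaviors. The optional hyphenated group still matches things like orange-orange, but the pattern is no longer written out by hand.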

Find fuzzy search occurrences of a huge list of phrases in a large list of sentences

The phrases are at most 4 words long, and the sentences at most 40 words long.
Let's say the phrases are:
three-course meal
came with us
in the end
we ate much
and the sentences are :
We had a three-course meal.
Brad came to dinner with us.
In the end, we all felt like we ate too much.
Then,
sentence 1 should match with three-course meal
sentence 2 should match with came with us
sentence 3 should match with in the end and we ate much
Earlier questions like - Find occurrences of huge list of phrases in text do not tackle fuzzy search which is needed in my case.
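A minimal token-level sketch using the standard library's difflib (the window sizes and the 0.7 threshold are assumptions tuned to this example; for a truly huge list of phrases you would need some form of indexing rather than this O(phrases × sentences × windows) scan):

```python
import re
from difflib import SequenceMatcher

def fuzzy_find(phrases, sentences, threshold=0.7):
    # Token-level fuzzy matching: slide word windows of length n..n+2
    # over each sentence and score them against the phrase's word list.
    results = []
    for sentence in sentences:
        words = re.findall(r"[\w-]+", sentence.lower())
        found = []
        for phrase in phrases:
            p_words = phrase.lower().split()
            n = len(p_words)
            best = 0.0
            for m in range(n, n + 3):
                for i in range(len(words) - m + 1):
                    ratio = SequenceMatcher(None, words[i:i + m], p_words).ratio()
                    best = max(best, ratio)
            if best >= threshold:
                found.append(phrase)
        results.append(found)
    return results
```

Comparing token lists (rather than raw strings) lets "came to dinner with us" still score 0.75 against "came with us", because the intervening words only dilute the ratio instead of breaking it.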

Finding unique example sentences from a list for a list of words

I have a list of 3000 (mostly unique) words sorted by their frequency in English. I also have a list of 3000 unique sentences. Ideally I would like to use Python to generate a list of one example sentence for the use of each word. So each word would have a sentence, which contains that word, paired with it. But no sentence should be paired with more than one word and no word should have more than one sentence associated with it.
But here is the catch, this is a messy dataset, so many words are going to appear in more than one sentence, some words will only appear in one sentence, and many words will not appear in any of the sentences. So I'm not going to get my ideal result. Instead, what I would like is an optimal list with the greatest number of sentences matched with words. And then a list of sentences that were omitted. Also, ideally, the sorted list should prefer to find sentences for lower frequency words than for higher frequency ones. (Since it will be easier to go back and find replacement sentences for higher frequency words.)
Here is an abbreviated example to help clarify:
words = ["the", "cat", "dog", "fish", "runs"]
sentences = ["the dog and cat are friends", "the dog runs all the time", "the dog eats fish", "I love to eat fish", "Granola is yummy too"]
output = ["", "the dog and cat are friends", "the dog eats fish", "I love to eat fish", "the dog runs all the time"]
omitted = ["Granola is yummy too"]
As you can see:
"Granola is yummy too" was omitted because it doesn't contain any of the words.
"the dog and cat are friends" was matched with "cat" because it is the only sentence that contains "cat"
"the dog runs all the time" was matched with "runs" because it is the only sentence that contains "runs"
"the dog eats fish" was matched with "dog" because "dog" is less frequent than "the" in English
"I love to eat fish" was matched with "fish" because the only other sentence with "fish" was already used
"the" didn't have any sentences left that matched with it
I'm not sure where to even start writing the code for this. (I'm a linguist who dabbles in coding on the side, not a professional coder.) So any help would be greatly appreciated!
...where to even start...
Here is a kind of naive approach without any attempt to optimize.
make a dictionary with the words as the keys and a list for the value
{'word1':[], 'word2':[], ...}
for each item in the dictionary
iterate over the sentences and append a sentence to the item's list if the word is in the sentence
Or maybe:
make a set of the words
make an empty dictionary
for each sentence
find the intersection of the words-in-the-sentence with the set of words
add an item to the dictionary using the sentence for the key and the intersection for the value
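Building on those sketches, one way to turn the dictionary into an actual assignment is classic augmenting-path bipartite matching, processing the rarest words first (a sketch; the naive lowercase split tokenization is an assumption):

```python
words = ["the", "cat", "dog", "fish", "runs"]          # most -> least frequent
sentences = ["the dog and cat are friends",
             "the dog runs all the time",
             "the dog eats fish",
             "I love to eat fish",
             "Granola is yummy too"]

# Which of the target words each sentence contains.
contains = {s: set(s.lower().split()) & set(words) for s in sentences}

match = {}  # sentence -> word assigned to it

def try_assign(word, visited):
    # One augmenting-path step: take a free sentence, or evict a
    # previously assigned word if it can be re-seated elsewhere.
    for s in sentences:
        if word in contains[s] and s not in visited:
            visited.add(s)
            if s not in match or try_assign(match[s], visited):
                match[s] = word
                return True
    return False

# Rarest words first, so low-frequency words get priority.
for word in reversed(words):
    try_assign(word, set())

by_word = {w: s for s, w in match.items()}
output = [by_word.get(w, "") for w in words]
omitted = [s for s in sentences if not contains[s]]
```

On the abbreviated example this reproduces the desired result: "cat" evicts "dog" from the friends sentence, "dog" in turn evicts "fish", "fish" falls back to "I love to eat fish", and "the" ends up with an empty slot.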

Drop all strings that are a subset of another string in the same list

I'm working on a scraping project and for some reason on some paragraphs I get both the complete paragraph and also the same paragraph divided in segments. So, if the paragraph is "My house is green. I like it.", I sometimes get:
["My house is green. I like it.", "My house is green.", "I like it."]
So, when I turn everything into text I will get that paragraph duplicated. Is there any way I can check which strings are a subset of other strings in a list?
My desired output in this case would be to be left only with ["My house is green. I like it."]
An efficient approach is to iterate through the list sorted by the lengths of phrases in reverse order, and add each possible sub-phrase to a set, so that you can use the set to efficiently check if the current phrase is a sub-phrase of a previous, longer phrase:
output = []
seen = set()
for phrase in sorted(l, key=len, reverse=True):
    words = tuple(phrase.split())
    if words not in seen:
        output.append(phrase)
        seen.update({words[i: i + n + 1] for n in range(len(words)) for i in range(len(words) - n)})
so that given:
l = ["My house is green. I like it.", "My house is green.", "I like it."]
output becomes:
['My house is green. I like it.']
I would take the longest string out of the list like this:
arr = ["My house is green. I like it.", "My house is green.", "I like it."]
print(max(arr, key=len))
The longest string can't be a substring of the others, by definition.
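The max approach only returns a single string, though; if the list can contain several independent paragraphs, a simple quadratic substring filter keeps every string that is not contained in a different one (note this is raw substring containment, which can also match across word boundaries, and exact duplicates are both kept):

```python
l = ["My house is green. I like it.", "My house is green.", "I like it."]

# Keep only strings that are not a substring of some other string.
output = [s for s in l
          if not any(s != other and s in other for other in l)]
print(output)  # ['My house is green. I like it.']
```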

Python: how to check if a string contains an element from a list and bring out that element?

I have a list of sentences (exg) and a list of fruit names (fruit_list). I have a code to check whether sentences contain elements from the fruit_list, as following:
exg = ["I love apple.", "there are lots of health benefits of apple.",
       "apple is especially hight in Vitamin C,", "alos provide Vitamin A as a powerful antioxidant!"]
fruit_list = ["pear", "banana", "mongo", "blueberry", "kiwi", "apple", "orange"]
for j in range(0, len(exg)):
    sentence = exg[j]
    if any(word in sentence for word in fruit_list):
        print(sentence)
Output is below: only the sentences that contain the word "apple"
I love apple.
there are lots of health benefits of apple.
apple is especially hight in Vitamin C,
But I'd love to print out which word was an element of the fruit_list and was found in sentences. In this example, I'd love to have an output of the word "apple", instead of sentences contains the word apple.
Hope this makes sense. Please send help and thank you so so much!
Try using in to check for each word in fruit_list; then you can use the matched fruit as a variable later.
In order to isolate which word was found, you'll need a different method than any(). any() only cares whether it can find some word from fruit_list in the sentence; it does not care which word it was or where in the list it was found.
exg = ["I love apple.", "there are lots of health benefits of apple.",
       "apple is especially hight in Vitamin C,", "alos provide Vitamin A as a powerful antioxidant!"]
fruit_list = ["pear", "banana", "mongo", "blueberry", "kiwi", "apple", "orange"]
# You can remove the 0 from range, because it starts at 0 by default
# You can also loop through the sentences directly
for sentence in exg:
    for word in fruit_list:
        if word in sentence:
            print("Found word:", word, " in:", sentence)
Result:
Found word: apple in: I love apple.
Found word: apple in: there are lots of health benefits of apple.
Found word: apple in: apple is especially hight in Vitamin C,
Instead of any with a generator expression, you can use a for clause with break:
for j in range(0, len(exg)):
    sentence = exg[j]
    for word in fruit_list:
        if word in sentence:
            print(f'{word}: {sentence}')
            break
Result:
apple: I love apple.
apple: there are lots of health benefits of apple.
apple: apple is especially hight in Vitamin C,
More idiomatic is to iterate the list rather than a range for indexing:
for sentence in exg:
    for word in fruit_list:
        if word in sentence:
            print(f'{word}: {sentence}')
            break
This will do the job.
for j in range(0, len(fruit_list)):
    fruit = fruit_list[j]
    if any(fruit in sentence for sentence in exg):
        print(fruit)
