Detecting which words are the same between two pieces of text - python

I need some python advice to implement an algorithm.
What I need is to detect which words from text 1 are in text 2:
Text 1: "Mary had a dog. The dog's name was Ethan. He used to run down
the meadow, enjoying the flower's scent."
Text 2: "Mary had a cat. The cat's name was Coco. He used to run down
the street, enjoying the blue sky."
I'm thinking I could use some pandas datatype to check repetitions, but I'm not sure.
Any ideas on how to implement this would be very helpful. Thank you very much in advance.

Since you do not show any work of your own, I'll just give an overall algorithm.
First, split each text into its words. This can be done in several ways. You could remove any punctuation then split on spaces. You need to decide if an apostrophe as in dog's is part of the word--you probably want to leave apostrophes in. But remove periods, commas, and so forth.
Second, place the words for each text into a set.
Third, use the built-in set operations to find which words are in both sets.
This will answer your actual question. If you want a different question that involves the counts or positions of the words, you should make that clear.
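The three steps above could be sketched as follows. This is only a minimal sketch: it assumes a "word" is a run of word characters or apostrophes and that the comparison should ignore case, and the helper names words_of and shared_words are made up for illustration.

```python
import re

def words_of(text):
    # Assumption: a word is a run of letters, digits, underscores or apostrophes.
    # Lower-casing makes "The" and "the" compare equal.
    return set(re.findall(r"[\w']+", text.lower()))

def shared_words(text1, text2):
    # Set intersection keeps only the words present in both texts.
    return words_of(text1) & words_of(text2)

text1 = ("Mary had a dog. The dog's name was Ethan. He used to run down "
         "the meadow, enjoying the flower's scent.")
text2 = ("Mary had a cat. The cat's name was Coco. He used to run down "
         "the street, enjoying the blue sky.")

print(sorted(shared_words(text1, text2)))
```

Because sets discard duplicates, this says nothing about how often or where a word occurs, which matches the question as asked.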

You can use a dictionary to store the words from the first text and then simply look each word up while iterating over the second text. But this will take extra space.
So the best way is to use regular expressions.

First extract the words from both strings into lists. I assume you would want to ignore any trailing periods or commas. Add one of the lists to a set (for expected constant-time lookup). For each word in the other list, check whether it's also present in the set; that gets you the words common to both texts. I assumed that duplicate elements are counted only once. Following is the code for doing this:
def get_words(text):
    words = text.split()
    for i in range(len(words)):
        words[i] = words[i].strip('.,')
    return words

def common_words(text1, text2):
    words1 = get_words(text1)
    words2 = set(get_words(text2))
    common = set()
    for word in words1:
        if word in words2:
            common.add(word)
    return common
For your example, it would return:
{'enjoying', 'had', 'to', 'Mary', 'used', 'the', 'The', 'was', 'down', 'name', 'He', 'run', 'a'}
Note that the words "the" and "The" are counted as distinct. If you don't want that, you can convert all words to lower case: words[i] = words[i].strip('.,').lower()

Related

How to extract list of words out of a string with no spaces

I have a dataset in which one of the columns contains sentences, and in some of the sentences the words are stuck together. I want to extract the words from ingredients_list whenever they appear in a row. I use this code to extract the words:
import re
ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
pat = '|'.join(r"\b{}\b".format(x) for x in ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
ing_l = ing_l.replace("", "Unknown")
It works great, but it doesn't extract a word from ingredients_list if that word is stuck to another one. For example, in the sentence "breadmilkcoffee" it fails to extract "milk" from among the stuck-together words.
I asked a related question to help me order the words I extract: "Sort the values of first list using second list with different length in Python".
But I still can't extract all the words.
Do you have any solution to this problem? Thanks a lot.
You are using the \b special character, which asserts that the pattern appears at a word boundary.
Removing this should allow you to match items in ingredients_list when they are not separated by a space from the rest of the string.
import re
ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
pat = '|'.join(ingredients_list)
ing_l = df['ingredients'].str.findall(pat, flags=re.I).str.join(' ')
ing_l = ing_l.replace("", "Unknown")
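The effect of \b can be checked without pandas. Here is a minimal sketch that uses only the re module on the sample string from the question. The variable names with_boundaries and without_boundaries are made up for illustration, and re.escape is added as a precaution in case an ingredient name ever contains a regex metacharacter.

```python
import re

ingredients_list = ['water', 'milk', 'yeast', 'banana', 'sugar', 'ananas']
text = "breadmilkcoffee"

# With \b each name must stand alone as a word, so nothing matches here.
with_boundaries = '|'.join(r"\b{}\b".format(re.escape(x)) for x in ingredients_list)
print(re.findall(with_boundaries, text, flags=re.I))     # []

# Without \b a name may appear anywhere inside a longer run of letters.
without_boundaries = '|'.join(re.escape(x) for x in ingredients_list)
print(re.findall(without_boundaries, text, flags=re.I))  # ['milk']
```

The trade-off is that the pattern without \b also matches inside unrelated words, for example "sugar" inside "sugary".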

Is there a way to find which word is after a certain word in a string?

We can check if a certain word/sentence is in a string by doing if "the example word" in string, but I also want to find the group of words that come after the word we just found. For example:
string = "The earth is shaped like Big Chungus"
Suppose I want to find out the series of words that are after "The", then how would I approach doing it?
Assuming you are doing this in Python.
Say you have the string as follows:
string = "The earth is shaped like Big Chungus"
You can use the split method to break the string into a list of words:
a=string.split()
print(a)
a will be as follows:
['The', 'earth', 'is', 'shaped', 'like', 'Big', 'Chungus']
Say you want to get all the words after a particular word. You can use list slicing in combination with the join method. In this case, to get all the words after The, you can do as follows:
print(' '.join(a[1:]))
This will give output as follows:
earth is shaped like Big Chungus
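The slice a[1:] works here because "The" happens to be the first word. For a keyword at an arbitrary position, one option is to look up its index first. This is a minimal sketch: the helper name words_after is made up for illustration, and it only considers the first occurrence of the keyword.

```python
def words_after(text, keyword):
    words = text.split()
    if keyword not in words:
        return ""  # keyword absent: nothing follows it
    # list.index returns the position of the first occurrence.
    position = words.index(keyword)
    return ' '.join(words[position + 1:])

string = "The earth is shaped like Big Chungus"
print(words_after(string, "The"))     # earth is shaped like Big Chungus
print(words_after(string, "shaped"))  # like Big Chungus
```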

Split Big String by Specific Word In Python

I want to split a big string by a word, and that word repeats throughout the big string.
Example of what I expect:
We have tried to split it with the code below:
string.split("RFF+AAJ:")
So we need a set of lists like the ones described in the screenshot above.
You can get your result with the help of a regex:
import re
string = 'helloisworldisbyeishi'
re.split('(is)', string) # split on 'is'; the capturing group keeps each 'is' in the result
Output
['hello', 'is', 'world', 'is', 'bye', 'is', 'hi']
I hope it may help you.
split returns a single list containing the pieces of the original string. So the list here contains the part before the first "RFF+AAJ:", then the part between the two "RFF+AAJ:"s, and finally the part after the second "RFF+AAJ:". If you want to have the three parts separately, use:
parts = string.split("RFF+AAJ:")
first = parts[0]
second = parts[1]
third = parts[2]
And the elements will be stored in first, second and third.
If you want to create lists, use first = list(first) # and so on.
Hope that helped.
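One detail worth noting when combining the two answers: str.split treats its argument as literal text, whereas re.split treats it as a pattern, in which + is a quantifier. The sketch below shows both on a made-up sample string. The question's real data is not shown, so the segment contents here are purely hypothetical and only the "RFF+AAJ:" separator comes from the question.

```python
import re

# Hypothetical sample; only the "RFF+AAJ:" separator comes from the question.
string = "headerRFF+AAJ:first partRFF+AAJ:second part"

# str.split: the separator is matched literally and dropped.
parts = string.split("RFF+AAJ:")
print(parts)  # ['header', 'first part', 'second part']

# re.split: escape the separator so that "+" is literal; the capturing
# group keeps each separator in the result.
kept = re.split("(" + re.escape("RFF+AAJ:") + ")", string)
print(kept)   # ['header', 'RFF+AAJ:', 'first part', 'RFF+AAJ:', 'second part']
```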

An efficient way to find elements of a list that contain substrings from another list

list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
Assume I have two string lists, I want to use a new list to store the matched pairs of those two lists just like below:
list3=[["happy new year","happy"],["happy new year","new"],["a happy story","happy"],["hold on","hold"]]
which means I need to get all pairs of strings in one list with their substrings in another list.
Actually, this is about data from ancient Chinese scripts. The first list contains the names of people from the 10th to 13th centuries, and the second list contains the titles of all the poems from that period. Ancient Chinese people often recorded their social relations in the titles of their works. For example, someone may write a poem titled "For my friend Wang Anshi". In this case, the person "Wang Anshi" in the first list should be matched with this title. There are also cases like "For my friend Wang Anshi and Su Shi", where the title contains more than one person. So basically it's a huge job involving 30,000 people and 160,000 poems.
Following is my code:
list3 = []
for i in list1:
    for j in list2:
        if str(i).count(str(j)) > 0:
            list3.append([i,j])
I use str(i) because Python always takes my Chinese strings as float. This code does work, but it is far too slow. I must figure out another way to do it. Thanks!
Use a regular expression to do the searching, via the re module. A regular expression engine can work out matching elements in a search through text much better than a nested for loop can.
I'm going to use better variable names here to make it clearer which list goes where: titles are the poem titles you are searching through, and names are the things you are trying to match. matches are the (title, name) pairs you want to produce:
import re
titles = ["happy new year", "game over", "a happy story", "hold on"]
names = ["happy", "new", "hold"]
by_reverse_length = sorted(names, key=len, reverse=True)
pattern = "|".join(map(re.escape, by_reverse_length))
any_name = re.compile("({})".format(pattern))
matches = []
for title in titles:
    for match in any_name.finditer(title):
        matches.append((title, match.group()))
The above produces your required output:
>>> matches
[('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]
The names are sorted by length, in reverse, so that longer names are found before shorter with the same prefix; e.g. Hollander is found before Holland is found before Holl.
The pattern string is created from your names to form a ...|...|... alternatives pattern, any one of those patterns can match, but the regex engine will find those listed earlier in the sequence over those put later, hence the need to reverse sort by length. The (...) parentheses around the whole pattern of names tells the regular expression engine to capture that part of the text, in a group. The match.group() call in the loop can then extract the matched text.
The re.escape() function call is there to prevent 'meta characters' in the names, characters with special meaning such as ^, $, (, ), etc, from being interpreted as their special regular expression meanings.
The re.finditer() function (and method on compiled patterns) then finds non-overlapping matches in order from left to right, so it'll never match shorter substrings, and gives us the opportunity to extract the match object for each. This gives you more options if you want to know about starting positions of the matches and other metadata as well, should you want those. Otherwise, re.findall() could also be used here.
If you are going to use the above on text with Western alphabets and not on Chinese, then you probably also want to add word boundary markers, \b:
any_name = re.compile(r"\b({})\b".format(pattern))  # raw string, so \b is a word boundary and not a backspace character
otherwise substrings part of a larger word can be matched. Since Chinese has no word boundary characters (such as spaces and punctuation) you don't want to use \b in such texts.
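Here is a short sketch of the difference the word boundaries make. The English sample titles and the helper name pairs are made up for illustration and do not come from the question.

```python
import re

names = ["new", "hold"]
titles = ["happy new year", "renewed threshold", "hold on"]

pattern = "|".join(map(re.escape, sorted(names, key=len, reverse=True)))
loose = re.compile("({})".format(pattern))
# Raw string, so that \b reaches the regex engine as a word boundary.
strict = re.compile(r"\b({})\b".format(pattern))

def pairs(regex):
    # Collect (title, matched name) for every match in every title.
    return [(title, m.group()) for title in titles for m in regex.finditer(title)]

print(pairs(loose))   # also finds "new" in "renewed" and "hold" in "threshold"
print(pairs(strict))  # only whole-word matches
```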
If the lists are longer, it might be worth building a sort of "index" of the sentences a given word appears in. Creating the index takes about as long as finding the first word from list2 in all the sentences in list1 (it has to loop over all the words in all the sentences), and once created, you can get the sentences containing a word much faster in O(1).
import collections
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
index = collections.defaultdict(list)
for sentence in list1:
    for word in sentence.split():
        index[word].append(sentence)
res = [[sentence, word] for word in list2 for sentence in index[word]]
Result:
[['happy new year', 'happy'],
['a happy story', 'happy'],
['happy new year', 'new'],
['hold on', 'hold']]
This uses str.split to split the words at spaces, but if the sentences are more complex, e.g. if they contain punctuation, you might use a regular expression with word boundaries \b instead, and possibly normalize the sentences (e.g. convert to lowercase or apply a stemmer, not sure if this is applicable to Chinese, though).
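For sentences with punctuation or mixed case, the index could be built from regex tokens instead of str.split. This is a minimal sketch with made-up sample sentences. It assumes that lower-casing is an acceptable normalization and that \w+ is a good enough definition of a word, which will not hold for unsegmented Chinese text.

```python
import collections
import re

list1 = ["Happy new year!", "Game over.", "A happy story", "Hold on, please"]
list2 = ["happy", "new", "hold"]

index = collections.defaultdict(list)
for sentence in list1:
    # \w+ picks out runs of word characters, dropping punctuation;
    # the set avoids listing a sentence twice for a repeated word.
    for word in set(re.findall(r"\w+", sentence.lower())):
        index[word].append(sentence)

# index.get avoids creating empty entries for words that never occur.
res = [[sentence, word] for word in list2 for sentence in index.get(word, [])]
print(res)
```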
This can be done quite easily in an absolutely straightforward way.
Option A: Finding "all" possible combinations: To find all strings in one list that contain substrings from another list, loop over all your strings of your list1 (the strings to assess) and for each element check whether it contains a substring of list2:
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]
[(string, substring) for string in list1 for substring in list2 if substring in string]
>>> [('happy new year', 'happy'), ('happy new year', 'new'), ('a happy story', 'happy'), ('hold on', 'hold')]
(I do think the title of your question is a bit misleading, though, as you are not only asking for elements of a list that contain a substring of another list, but as per your code example you are looking for 'all possible combinations'.)
Thus option B: Finding "any" combination: Much simpler and faster. If you really only need what the question says, you can improve performance by finding only the 'any' matches. Note the any() call: a bare generator expression is always truthy, so without any() every string would be kept:
[string for string in list1 if any(substring in string for substring in list2)]
Option B will also allow you to improve performance. In case the lists are very long, you can run B first, create a subset (only strings that will actually produce a match with a substring), and then expand again to catch 'all' instead of any.
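The two-stage idea could look like the sketch below, which reuses the sample lists from the question. Whether it is actually faster depends on how many strings have no match at all, so treat it as something to measure and not as a guaranteed win. The names candidates and pairs are made up for illustration.

```python
list1 = ["happy new year", "game over", "a happy story", "hold on"]
list2 = ["happy", "new", "hold"]

# Stage 1 ("any"): any() stops at the first substring that matches,
# so strings without a match are discarded cheaply.
candidates = [s for s in list1 if any(sub in s for sub in list2)]

# Stage 2 ("all"): expand only the surviving strings into pairs.
pairs = [(s, sub) for s in candidates for sub in list2 if sub in s]

print(candidates)
print(pairs)
```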

Derive words from string based on key words

I have a string (text_string) from which I want to find words based on my so called key_words. I want to store the result in a list called expected_output.
The expected output is always the word after the keyword (the number of spaces between the keyword and the output word doesn't matter). The expected_output word is then all characters until the next space.
Please see the example below:
text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy","coding"]
expected_output = ['yes_no!?.', 'without', 'yes']
expected_output explanation:
yes_no!?. (since it comes after happy. All signs are included until the next space.)
without (since it comes after coding. The number of spaces surrounding the word doesn't matter)
yes (since it comes after happy)
You can solve it using a regex, for example:
import re
expected_output = re.findall(r'(?:{0})\s+?([^\s]+)'.format('|'.join(key_words)), text_string)
Explanation
(?:{0}) Is getting your key_words list and creating a non-capturing group with all the words inside this list.
\s+? Add a lazy quantifier so it will get all spaces after any of the former occurrences up to the next character which isn't a space
([^\s]+) Will capture the text right after your key_words until a next space is found
Note: in case you're running this many times (inside a loop, for example), you ought to call re.compile on the regex string beforehand in order to improve performance.
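A precompiled version might look like the sketch below. The names alternatives and next_word are made up for illustration. As an extra precaution that goes beyond the answer above, re.escape is applied to each keyword, and the equivalent \S is used in place of [^\s].

```python
import re

text_string = "happy yes_no!?. why coding without paus happy yes"
key_words = ["happy", "coding"]

# re.escape guards against keywords containing regex metacharacters.
alternatives = '|'.join(map(re.escape, key_words))
# Compile once, then reuse the compiled pattern wherever it is needed.
next_word = re.compile(r'(?:{0})\s+(\S+)'.format(alternatives))

print(next_word.findall(text_string))  # ['yes_no!?.', 'without', 'yes']
```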
We will use Python's re module to split your string on whitespace.
Then, the idea is to go over each word, and look if that word is part of your keywords. If yes, we set take_it to True, so that next time the loop is processed, the word will be added to taken which stores all the words you're looking for.
import re
def find_next_words(text, keywords):
    take_it = False
    taken = []
    for word in re.split(r'\s+', text):
        if take_it:
            taken.append(word)
        take_it = word in keywords
    return taken
print(find_next_words("happy yes_no!?. why coding without paus happy yes", ["happy", "coding"]))
results in ['yes_no!?.', 'without', 'yes']
