Python -- Search Substring Full Word(s)

I want to find the number of times a substring occurs in a string. I was doing this:
termCount = content.count(term)
But if I search for "Ford" it returns results like:
"Ford Motors" Result: 1 Correct
"cannot afford Ford" Result: 2 Incorrect
"ford is good" Result: 1 Correct
The search term can have multiple words, like "Ford Motors" or "Ford Auto". For example, if I search for "Ford Motor":
"Ford Motors" Result: 1 Correct
"cannot afford Ford Motor" Result: 1 Correct
"Ford Motorway" Result: 1 Incorrect
What I want is to search case-insensitively and as a whole: if I search for a term it should be matched as a whole word, or as a whole phrase in the case of multiple words, not as part of another word. I also need the count of the matches. How do I achieve this?

You can use a regex: in this case, use re.findall and take the length of the matched list:
re.findall(r'\byour_term\b', s)
Demo
>>> s="Ford Motors cannot afford Ford Motor Ford Motorway Ford Motor."
>>> import re
>>> def counter(str, term):
...     return len(re.findall(r'\b{}\b'.format(term), str))
...
>>> counter(s,'Ford Motor')
2
>>> counter(s,'Ford')
4
>>> counter(s,'Fords')
0
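The demo above is case-sensitive, while the question asks for case-insensitive matching. A minimal sketch of that variant (my addition, with a hypothetical name counter_ci), passing re.IGNORECASE and using re.escape so regex metacharacters in a term are matched literally:
import re

def counter_ci(text, term):
    # escape the term so characters like '.' are treated literally,
    # and ignore case so 'ford' and 'Ford' both count
    pattern = r'\b{}\b'.format(re.escape(term))
    return len(re.findall(pattern, text, re.IGNORECASE))

print(counter_ci("cannot afford Ford, ford is good", "ford"))  # 2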

I would split the strings on spaces so that we have independent words, and then do the count from there.
terms = ['Ford Motors', 'cannot afford Ford', 'ford is good']
splitWords = []
for term in terms:
    # take each string in the list and split it into words,
    # then add these words to a list called splitWords
    splitWords.extend(term.lower().split())
print(splitWords.count("ford"))
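Note that this split-based count only works for single words; a multi-word term like "Ford Motor" would never match. One way to extend it (a sketch, my addition) is to count adjacent word pairs built with zip:
splitWords = "cannot afford Ford Motor Ford Motorway".lower().split()
# build overlapping word pairs, e.g. ('ford', 'motor')
pairs = list(zip(splitWords, splitWords[1:]))
print(pairs.count(('ford', 'motor')))  # 1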

Related

Pandas isin() does not return anything even when the keywords exist in the dataframe

I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
text
19152    I think she is the Corona Virus....
19154    Boy you hate to see that. I mean seeing how it was contained and all.
19155    Tell her it’s just the fake flu, it will go away in a few days.
19235    Is this fake news?
...      ...
20540    She’ll believe it’s just alternative facts.
Expected results: I'd like to select rows that have the exact keywords in my list ('fake', 'false', 'lie'). For example, in the above df, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences containing "believe" (e.g., row 20540) because "lie" is a substring of "believe"!
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything.
import re
df[df.text.apply(lambda x: any(i for i in re.findall(r'\w+', x) if i in keywords))]
Output:
text
2 Tell her it’s just the fake flu, it will go aw...
3 Is this fake news?
If text is as follows,
df1 = pd.DataFrame()
df1['text'] = [
    "Dear Kellyanne, Please seek the help of Paula White I believe ...",
    "trump saying it was under controll was a lie, ...",
    "Her mouth should hanve been ... All the lies she has told ...",
    "she'll believe ...",
    "I do believe in ...",
    "This value is false ...",
    "This value is fake ...",
    "This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First, a simple way is this:
df1[df1.text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
text
5 This value is false ...
6 This value is fake ...
It won't catch words like "believe", but it also can't catch "lie," because of the trailing punctuation.
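As an aside, the True if ... else False wrapper can be shortened with .any(), which gives the same boolean mask (my simplification, not part of the original answer):
df1[df1.text.apply(lambda x: pd.Series(x.split()).isin(keywords).any())]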
Second, if we remove the special characters from the text data like this:
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
Now it can catch the word "lie,".
text
1 trump saying it was under controll was a lie, ...
5 This value is false ...
6 This value is fake ...
Third, it still can't catch the word "lies". That can be solved by using a library that reduces different forms of a word to the same base form (lemmatization). You can find how to tokenize here: tokenize-words-in-a-list-of-sentences-python.
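For instance, a minimal sketch with NLTK's WordNetLemmatizer (my choice of library, assuming nltk is installed and the wordnet corpus has been downloaded):
import re
from nltk.stem import WordNetLemmatizer

keywords = ['misleading', 'fake', 'false', 'lie']
lemmatizer = WordNetLemmatizer()
text = "All the lies she has told"
tokens = re.findall(r'[a-zA-Z]+', text.lower())
# lemmatize each token, so 'lies' becomes 'lie'
lemmas = [lemmatizer.lemmatize(token) for token in tokens]
print(any(lemma in keywords for lemma in lemmas))  # True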
I think splitting into words and then matching is a simpler, more straightforward approach. For example, if the df and keywords are
df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']
df
text
0 lama abc
1 cow def
2 foo bar
3 spam egg
This should return the correct result
df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]
text
0 lama abc
2 foo bar
Explanation
First, split df['text'] into words:
splits = df['text'].str.findall(r'\w+')
splits is
0 [lama, abc]
1 [cow, def]
2 [foo, bar]
3 [spam, egg]
Name: text, dtype: object
Then we need to check whether any word in a row appears in the keywords:
# this is answer for a single row, if words is the split list of that row
any(word in keywords for word in words)
# for the entire dataframe, use a Series, `splits` from above is word split lists for every line
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows
0 True
1 False
2 True
3 False
dtype: bool
Now we can find the correct rows with
df.loc[rows]
text
0 lama abc
2 foo bar
Be aware that this approach can consume much more memory, since it needs to generate a split list for each line. So if you have huge data sets, this might be a problem.
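If speed also becomes a concern, one small tweak (my suggestion, not from the answer above) is to turn keywords into a set so each membership test is O(1) instead of a scan of the list:
keyword_set = set(keywords)
rows = pd.Series(any(word in keyword_set for word in words)
                 for words in df['text'].str.findall(r'\w+'))
df.loc[rows]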
I believe it's because pd.Series.isin() checks whether each whole string in the column is one of the given values, not whether a string in the column contains a specific word. I just tested this code snippet:
s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
               'hippo'], name='animal')
s.isin(['cow', 'lama'])
And as I suspected, the first string returns False even though it contains the word 'lama'.
Maybe try using regex? See this: searching a word in the column pandas dataframe python
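For completeness, a sketch of that regex route (my addition, not taken from the linked post): wrapping the keywords in word boundaries keeps 'lie' from matching 'believe':
import re

keywords = ['fake', 'false', 'lie']
# \b(?:fake|false|lie)\b matches the keywords only as whole words
pattern = r'\b(?:' + '|'.join(map(re.escape, keywords)) + r')\b'
df1[df1['text'].str.contains(pattern, case=False, regex=True)]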

Extract words in a paragraph that are similar to words in list

I have the following string:
"The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
List of words to be extracted:
["town","teddy","chicken","boy went"]
NB: town and teddy are wrongly spelt in the given sentence.
I have tried the following but I get other words that are not part of the answer:
import difflib
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town","teddy","chicken","boy went"]
[difflib.get_close_matches(x.lower().strip(), sent.split()) for x in list1 ]
I am getting the following result:
[['twn', 'to'], ['tddy'], ['chicken.', 'picked'], ['went']]
instead of:
'twn', 'tddy', 'chicken','boy went'
Notice in the documentation for difflib.get_close_matches():
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Return a list of the best "good enough" matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).
Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.
Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.
At the moment, you are using the default n and cutoff arguments.
You can specify either (or both), to narrow down the returned matches.
For example, you could use a cutoff score of 0.75:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), cutoff=0.75) for x in list1]
Or, you could specify that only at most 1 match should be returned:
result = [difflib.get_close_matches(x.lower().strip(), sent.split(), n=1) for x in list1]
In either case, you could use a list comprehension to flatten the lists of lists (since difflib.get_close_matches() always returns a list):
matches = [r[0] for r in result]
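Note that indexing with r[0] raises an IndexError for any term with no close match; a small guard (my addition) avoids that:
matches = [r[0] for r in result if r]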
Since you also want to check for close matches of bigrams, you can do so by extracting pairings of adjacent "words", and pass them to difflib.get_close_matches() as part of the possibilities argument.
Here is a full working example of this in action:
import difflib
import re
sent = "The boy went to twn and bought sausage and chicken. He then picked a tddy for his sister"
list1 = ["town", "teddy", "chicken", "boy went"]
# this extracts overlapping pairings of "words"
# i.e. ['The boy', 'boy went', 'went to', 'to twn', ...
pairs = re.findall(r'(?=(\b[^ ]+ [^ ]+\b))', sent)
# we pass the sent.split() list as before
# and concatenate the new pairs list to the end of it also
result = [difflib.get_close_matches(x.lower().strip(), sent.split() + pairs, n=1) for x in list1]
matches = [r[0] for r in result]
print(matches)
# ['twn', 'tddy', 'chicken.', 'boy went']
If you read the Python documentation for difflib.get_close_matches() (https://docs.python.org/3/library/difflib.html), you'll see it returns a list of the best matches.
Method signature:
difflib.get_close_matches(word, possibilities, n=3, cutoff=0.6)
Here n is the maximum number of close matches to return, so you can pass it as 1.
>>> [difflib.get_close_matches(x.lower().strip(), sent.split(),1)[0] for x in list1]
['twn', 'tddy', 'chicken.', 'went']

Check how many words from a given list occur in list of text/strings

I have a list of text data which contains reviews, something like this:
1. 'I have bought several of the Vitality canned dog food products and have found them all to be of good quality. The product looks more like a stew than a processed meat and it smells better. My Labrador is finicky and she appreciates this product better than most.'
2. 'Product arrived labeled as Jumbo Salted Peanuts...the peanuts were actually small sized unsalted. Not sure if this was an error or if the vendor intended to represent the product as "Jumbo".'
3. 'This is a confection that has been around a few centuries. It is a light, pillowy citrus gelatin with nuts - in this case Filberts. And it is cut into tiny squares and then liberally coated with powdered sugar. And it is a tiny mouthful of heaven. Not too chewy, and very flavorful. I highly recommend this yummy treat. If you are familiar with the story of C.S. Lewis' "The Lion, The Witch, and The Wardrobe" - this is the treat that seduces Edmund into selling out his Brother and Sisters to the Witch.'
I have a separate list of words which I want to know exist in these reviews:
['food','science','good','buy','feedback'....]
I want to know which of these words are present in each review and select reviews that contain a certain number of them. For example, let's say we only select reviews that contain at least 3 of the words from this list: it should display all those reviews, but also show which of the words were encountered in each review while selecting it.
I have the code for selecting reviews containing at least 3 of the words, but how do I get the second part, which tells me which words exactly were encountered? Here is my initial code:
keywords = list(words)
text = list(df.summary.values)
sentences = []
for element in text:
    if len(set(keywords) & set(element.split(' '))) >= 3:
        sentences.append(element)
To answer the second part, allow me to revisit how to approach the first part. A handy approach here is to cast your review strings into sets of word strings.
Like this:
review_1 = "I have bought several of the Vitality canned dog food products and"
review_1 = set(review_1.split(" "))
Now the review_1 set contains one of every word. Then take your list of words, convert it to a set, and do an intersection.
words = ['food','science','good','buy','feedback'....]
words = set(['food','science','good','buy','feedback'....])
matches = review_1.intersection(words)
The resulting set, matches, contains all the words that are common. The length of this is the number of matches.
Now, this does not work if you care about how many times each word matches. For example, if the word "food" is found twice in the review and "science" is found once, does that count as matching three words?
If so, let me know via comment and I can write some code to update the answer to include that scenario.
EDIT: Updating to include comment question
If you want to keep a count of how many times each word repeats, then hang onto the review list. Only cast it to set when performing the intersection. Then, use the 'count' list method to count the number of times each match appears in the review. In the example below, I use a dictionary to store the results.
review_1 = "I have bought several of the Vitality canned dog food products and"
review_words = review_1.split(" ")   # keep the word list so repeats can be counted
words = set(['food','science','good','buy','feedback'....])
matches = set(review_words).intersection(words)
match_counts = dict()
for match in matches:
    match_counts[match] = review_words.count(match)
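An equivalent sketch using collections.Counter (my variant, with concrete example values in place of the elided lists above):
from collections import Counter

review_1 = "I have bought dog food and more dog food"
words = {'food', 'science', 'good', 'buy', 'feedback'}
counts = Counter(review_1.split(" "))
# keep only the keywords that actually appear, with their frequencies
match_counts = {word: counts[word] for word in words if counts[word]}
print(match_counts)  # {'food': 2}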
You can use set intersection for finding the common words:
def filter_reviews(data, *, trigger_words=frozenset({'food', 'science', 'good', 'buy', 'feedback'})):
    for review in data:
        words = review.split()  # use whatever method is appropriate to get the words
        common = trigger_words.intersection(words)
        if len(common) >= 3:
            yield review, common
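A usage sketch with made-up reviews (my addition):
reviews = [
    "good food to buy with feedback included",  # hypothetical review text
    "terrible packaging",
]
for review, common in filter_reviews(reviews):
    print(sorted(common), '->', review)
# ['buy', 'feedback', 'food', 'good'] -> good food to buy with feedback included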

Python: fastest way to re.findall twice?

I like regular expressions. I often find myself using multiple regex statements to narrow in on the value I need when trying to get a substring from a large block of text.
So far, my approach has been the following:
Use resultOfRegex1 = re.findall(firstRegex, myString) for my first regex
Check to see that resultOfRegex1[0] exists
Use resultOfRegex2 = re.findall(secondRegex, resultOfRegex1[0]) for my second regex
Check to see that resultOfRegex2[0] exists, and print that value
But I feel like this is much more verbose and costly than it has to be. Is there an easier/faster way to match one regex and then match another regex based on the result of the first?
The whole point of groups is to allow extraction of subgroups from an overall match.
For example, instead of two searches done in the following fashion:
>>> import re
>>> s = 'The winning team scored 15 points and used only 2 timeouts'
>>> score_clause = re.search(r'scored \d+ point', s).group(0)
>>> re.search(r'\d+', score_clause).group(0)
'15'
Do a single search with a sub-group:
>>> re.search(r'scored (\d+) point', s).group(1)
'15'
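The same extraction reads nicely with a named group as well (a stylistic variant, my addition):
>>> re.search(r'scored (?P<points>\d+) point', s).group('points')
'15'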
One other thought: if you want to make decisions about whether to continue a findall-style search based on the first match, a reasonable choice would be to use re.finditer and extract values as needed:
>>> game_results = '''\
10 point victory: 1 in first period, 6 in second period, 3 in third period.
5 point victory: 0 in first period, 5 in second period, 0 in third period.
12 point victory: 5 in first period, 3 in second period, 4 in third period.
7 point victory: 3 in first period, 0 in second period, 4 in third period.
'''.splitlines()
>>> # Show period-by-period scores for games won by 8 or more points
>>> for game_result in game_results:
...     it = re.finditer(r'\d+', game_result)
...     if int(next(it).group(0)) >= 8:
...         print('Big win:', [int(mo.group(0)) for mo in it])
...
Big win: [1, 6, 3]
Big win: [5, 3, 4]

Search in a string and obtain the 2 words before and after the match in Python

I'm using Python to search some words (also multi-token) in a description (string).
To do that I'm using a regex like this
result = re.search(word, description, re.IGNORECASE)
if result:
    print("Found: " + result.group())
But what I need is to obtain the 2 words before and after the match. For example, if I have something like this:
Parking here is horrible, this shop sucks.
"here is" is the phrase that I'm looking for. So after I match it with my regex I need the 2 words (if they exist) before and after the match.
In the example:
Parking here is horrible, this
"Parking" and "horrible, this" are the words that I need.
ATTENTION: the description can be very long and the pattern "here is" can appear multiple times.
How about string operations?
line = 'Parking here is horrible, this shop sucks.'
before, term, after = line.partition('here is')
before = before.rsplit(maxsplit=2)[-2:]
after = after.split(maxsplit=2)[:2]
Result:
>>> before
['Parking']
>>> after
['horrible,', 'this']
Try this regex: ((?:[a-z,]+\s+){0,2})here is\s+((?:[a-z,]+\s*){0,2})
with re.findall and re.IGNORECASE set
Demo
I would do it like this (edit: added anchors to cover most cases):
(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)
Like this you will always have 4 groups (might have to be trimmed) with the following behavior:
If group 1 is empty, there was no word before (group 2 is empty too)
If group 2 is empty, there was only one word before (group 1)
If group 1 and 2 are not empty, they are the words before in order
If group 3 is empty, there was no word after
If group 4 is empty, there was only one word after
If group 3 and 4 are not empty, they are the words after in order
Corrected demo link
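A quick check of the group layout described above (my demo, since the original demo link is not reproduced here):
import re

pattern = r'(\S+\s+|^)(\S+\s+|)here is(\s+\S+|)(\s+\S+|$)'
m = re.search(pattern, 'Parking here is horrible, this shop sucks.')
print(m.groups())
# ('Parking ', '', ' horrible,', ' this')  -> group 2 empty: one word before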
Based on your clarification, this becomes a bit more complicated. The solution below deals with scenarios where the searched pattern may in fact also be in the two preceding or two subsequent words.
import re

line = "Parking here is horrible, here is great here is mediocre here is here is "
print(line)
pattern = "here is"
r = re.search(pattern, line, re.IGNORECASE)
output = []
if r:
    while line:
        before, match, line = line.partition(pattern)
        if match:
            if not output:
                before = before.split()[-2:]
            else:
                before = ' '.join([pattern, before]).split()[-2:]
            after = line.split()[:2]
            output.append((before, after))
print(output)
Output from my example would be:
[(['Parking'], ['horrible,', 'here']), (['is', 'horrible,'], ['great', 'here']), (['is', 'great'], ['mediocre', 'here']), (['is', 'mediocre'], ['here', 'is']), (['here', 'is'], [])]
