Delete words with regex patterns in Python from a dataframe - python

I'm playing around with regular expression in Python for the below data.
Random
0 helloooo
1 hahaha
2 kebab
3 shsh
4 title
5 miss
6 were
7 laptop
8 welcome
9 pencil
I would like to delete the words that contain repeated letters (e.g. blaaaa), repeated pairs of letters (e.g. hahaha), or the same letter on both sides of a single letter (e.g. title, kebab, were).
Here is the code:
import pandas as pd
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)
print(df)
Below is the output for the above with a Warning message:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Random
0 hahaha
1 kebab
2 shsh
3 title
4 were
5 laptop
6 welcome
7 pencil
However, I expect to see this:
Random
0 laptop
1 welcome
2 pencil

You can use Series.str.contains directly to create a mask and disable the user warning before and enable it after:
import pandas as pd
import warnings
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
df = df[~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]  # keep only the rows that do not match
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning
Output:
>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>
7 laptop
8 welcome
9 pencil
Name: Random, dtype: object
Your regex has a problem: the quantifier is placed outside the group, so the group captures only a single letter and \1 looks for the wrong repeated string. Also, the \b word boundary is superfluous. The ([a-z]+)[a-z]?\1 pattern matches one or more letters, then one optional letter, and then the same substring that the group captured.
See the regex demo.
We can safely disable the user warning because we deliberately use the capturing group here, as we need to use a backreference in this regex pattern. The warning needs re-enabling to avoid using capturing groups in other parts of our code where it is not necessary.
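To see what each pattern actually catches, here is a minimal sketch using only the standard re module on the question's word list (the variable names are mine, not from the answer above):

```python
import re

words = ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
         'miss', 'were', 'laptop', 'welcome', 'pencil']

# Pattern from the question: the group holds one letter, and \b pins the match to a word end
original = re.compile(r"([a-z])+\1{1,}\b")
# Pattern from the answer: a run of letters, an optional single letter, then the same run again
fixed = re.compile(r"([a-z]+)[a-z]?\1")

caught_by_original = [w for w in words if original.search(w)]
caught_by_fixed = [w for w in words if fixed.search(w)]
kept = [w for w in words if not fixed.search(w)]

print(caught_by_original)  # ['helloooo', 'miss']
print(kept)                # ['laptop', 'welcome', 'pencil']
```

The original pattern only fires when a letter is doubled right at the end of a word, which is why only helloooo and miss were dropped.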

IIUC, you can use something like the pattern r'(\w+)(\w)?\1', i.e., one or more word characters, an optional word character, and then the text captured by the first group again. This gives the right result:
df[~df.Random.str.contains(r'(\w+)(\w)?\1')]
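This call triggers the same UserWarning about match groups. A minimal sketch, assuming you would rather scope the suppression to one statement than toggle the global filter: the standard library's warnings.catch_warnings context manager restores the previous filters on exit. The names mask and result are just for illustration.

```python
import warnings

import pandas as pd

df = pd.DataFrame({'Random': ['helloooo', 'hahaha', 'kebab', 'shsh', 'title',
                              'miss', 'were', 'laptop', 'welcome', 'pencil']})

# Filters changed inside the block are undone when the block exits
with warnings.catch_warnings():
    warnings.simplefilter("ignore", UserWarning)
    mask = df['Random'].str.contains(r'(\w+)(\w)?\1')

result = df[~mask].reset_index(drop=True)
print(result)
```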

Related

Pandas isin() does not return anything even when the keywords exist in the dataframe

I'd like to search for a list of keywords in a text column and select all rows where the exact keywords exist. I know this question has many duplicates, but I can't understand why the solution is not working in my case.
keywords = ['fake', 'false', 'lie']
df1:
text
19152 I think she is the Corona Virus....
19154 Boy you hate to see that. I mean seeing how it was contained and all.
19155 Tell her it’s just the fake flu, it will go away in a few days.
19235 Is this fake news?
... ...
20540 She’ll believe it’s just alternative facts.
Expected results: I'd like to select the rows that contain the exact keywords in my list ('fake', 'false', 'lie'). For example, for the df above, it should return rows 19155 and 19235.
str.contains()
df1[df1['text'].str.contains("|".join(keywords))]
The problem with str.contains() is that the result is not limited to the exact keywords. For example, it returns sentences with believe (e.g., row 20540) because lie is a substring of "believe"!
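One way to keep the regex approach but rule out substring hits is to wrap the alternation in \b word boundaries. A minimal sketch with plain re (the sample sentences are made up for illustration):

```python
import re

keywords = ['fake', 'false', 'lie']

# \b(?:fake|false|lie)\b : a keyword only counts when it stands alone as a word
pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, keywords)) + r")\b")

texts = [
    "Is this fake news?",
    "She'll believe it's just alternative facts.",
    "It was a lie, plain and simple.",
]
matches = [bool(pattern.search(t)) for t in texts]
print(matches)  # [True, False, True]
```

The same pattern string (pattern.pattern) can be passed to df1['text'].str.contains; because the group is non-capturing, pandas does not emit the match-groups warning.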
pandas.Series.isin
To find the rows including the exact keywords, I used pd.Series.isin:
df1[df1.text.isin(keywords)]
#df1[df1['text'].isin(keywords)]
Even though I see there are matches in df1, it doesn't return anything.
import re
df[df.text.apply(lambda x: any(w in keywords for w in re.findall(r'\w+', x)))]
Output:
text
2 Tell her it’s just the fake flu, it will go aw...
3 Is this fake news?
If text is as follows,
df1 = pd.DataFrame()
df1['text'] = [
"Dear Kellyanne, Please seek the help of Paula White I believe ...",
"trump saying it was under controll was a lie, ...",
"Her mouth should hanve been ... All the lies she has told ...",
"she'll believe ...",
"I do believe in ...",
"This value is false ...",
"This value is fake ...",
"This song is fakelove ..."
]
keywords = ['misleading', 'fake', 'false', 'lie']
First, a simple way is to split each text on whitespace and test the resulting words against the keywords:
df1[df1.text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
text
5 This value is false ...
6 This value is fake ...
This does not match words like "believe", but it also misses "lie," because of the trailing punctuation.
Second, if you replace the special characters in the text with spaces first, like this:
new_text = df1.text.apply(lambda x: re.sub("[^0-9a-zA-Z]+", " ", x))
df1[new_text.apply(lambda x: True if pd.Series(x.split()).isin(keywords).sum() else False)]
Now it catches the word "lie,".
text
1 trump saying it was under controll was a lie, ...
5 This value is false ...
6 This value is fake ...
Third, it still can't catch the word "lies". This can be solved with a library that reduces the different forms of a word to a common base form (stemming or lemmatization). For how to tokenize, see tokenize-words-in-a-list-of-sentences-python.
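As a rough stand-in for a real stemming library, here is a deliberately crude sketch that only strips a trailing "s" before the lookup. crude_stem and mentions_keyword are hypothetical helpers written for this illustration, not part of any library, and the rule will mishandle plenty of English words.

```python
import re

keywords = {'misleading', 'fake', 'false', 'lie'}

def crude_stem(word):
    # Hypothetical helper: drops one trailing "s" from words longer than 3 letters.
    # This is NOT a real stemmer; it is only here to show where stemming would plug in.
    return word[:-1] if word.endswith('s') and len(word) > 3 else word

def mentions_keyword(text):
    # Lowercase, keep alphanumeric runs only, then compare the stemmed tokens
    tokens = re.findall(r'[a-z0-9]+', text.lower())
    return any(crude_stem(t) in keywords for t in tokens)

texts = [
    "trump saying it was under controll was a lie, ...",
    "All the lies she has told ...",
    "she'll believe ...",
    "This song is fakelove ...",
]
flags = [mentions_keyword(t) for t in texts]
print(flags)  # [True, True, False, False]
```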
I think splitting the text into words and then matching is a better, more straightforward approach. For example, if the df and keywords are
df = pd.DataFrame({'text': ['lama abc', 'cow def', 'foo bar', 'spam egg']})
keywords = ['foo', 'lama']
df
text
0 lama abc
1 cow def
2 foo bar
3 spam egg
This should return the correct result
df.loc[pd.Series(any(word in keywords for word in words) for words in df['text'].str.findall(r'\w+'))]
text
0 lama abc
2 foo bar
Explanation
First, split df['text'] into words:
splits = df['text'].str.findall(r'\w+')
splits is
0 [lama, abc]
1 [cow, def]
2 [foo, bar]
3 [spam, egg]
Name: text, dtype: object
Then we need to check whether any word in a row appears in the keywords:
# this is answer for a single row, if words is the split list of that row
any(word in keywords for word in words)
# for the entire dataframe, use a Series, `splits` from above is word split lists for every line
rows = pd.Series(any(word in keywords for word in words) for words in splits)
rows
0 True
1 False
2 True
3 False
dtype: bool
Now we can find the correct rows with
df.loc[rows]
text
0 lama abc
2 foo bar
Be aware this approach could consume much more memory as it needs to generate the split list on each line. So if you have huge data sets, this might be a problem.
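If memory is a concern, one possible variation (a sketch under my own assumptions, not something the answer above proposes) is to scan each text lazily with re.finditer and stop at the first hit, and to keep the keywords in a set for constant-time lookups. has_keyword is a hypothetical helper name.

```python
import re

keywords = {'foo', 'lama'}  # a set makes each membership test O(1) on average

def has_keyword(text):
    # finditer yields words one at a time; any() stops at the first keyword found,
    # so no per-row list of all words is ever built
    return any(m.group() in keywords for m in re.finditer(r'\w+', text))

texts = ['lama abc', 'cow def', 'foo bar', 'spam egg']
flags = [has_keyword(t) for t in texts]
print(flags)  # [True, False, True, False]
```

With a dataframe, the same function could be applied as df[df['text'].map(has_keyword)].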
I believe it's because pd.Series.isin() checks whether the whole string in each cell equals one of the given values, not whether the string in the cell contains a specific word. I just tested this code snippet:
s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama',
'hippo'], name='animal')
s.isin(['cow', 'lama'])
As I expected, the first string returns False even though it contains the word 'lama'.
Maybe try using regex? See this: searching a word in the column pandas dataframe python
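Put differently, isin behaves like a chain of whole-value equality tests. A small sketch of that equivalence on the same Series (the variable names are mine):

```python
import pandas as pd

s = pd.Series(['lama abc', 'cow', 'lama', 'beetle', 'lama', 'hippo'], name='animal')

whole_cell = s.isin(['cow', 'lama'])
# The same result spelled out as explicit equality comparisons
by_equality = (s == 'cow') | (s == 'lama')

print(whole_cell.tolist())   # [False, True, True, False, True, False]
print(by_equality.tolist())  # [False, True, True, False, True, False]
```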

How to match this pattern using regex in Python

I have a list of names with different notations:
for example:
myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01', 'AB2000_2', 'ABC2000_02', 'AB2000_A1']
The standardized versions of those notations are, for example:
'ab2000' is 'ABC2000'
'ab2000_1' is 'ABC2000_01'
'AB2000_A1' is 'ABC2000_A1'
What I tried is to separate the different characters of the string using compile.
input:
compiled = re.compile(r'[A-Za-z]+|\d+|\W+')
compiled.findall("AB2000_2000_A1")
output:
characters = ['AB', '2000', '2000', 'A', '1']
Then applying:
characters = list(set(characters))
Finally, I try to match the values of that list against the main components of the string: an alphabetic part followed by a numeric part followed by an alphanumeric part.
But as you can see in the previous output, I can't capture 'A1' as a single element using \W+. My desired output is:
characters = ['AB', '2000', '2000', 'A1']
Any idea how to fix that?
Or any better idea for solving my problem in general? Thank you in advance.
Use the following pattern with optional groups and capturing groups:
r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?'
and re.I flag.
Note that (?:_([A-Z\d]+))? must be written twice in order to capture both the third and the fourth group. If you instead wrote it once and repeated it with "*", the capturing group would keep only its last match and the earlier one would be lost.
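The difference between writing the optional group twice and repeating it once with "*" can be checked with a short sketch (the names repeated and explicit are mine):

```python
import re

name = 'AB2000_2000_A1'

# One capturing group repeated with *: only the last repetition is remembered
repeated = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))*', re.I)
# The optional group written out twice: each suffix gets its own group
explicit = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)

print(repeated.match(name).groups())  # ('AB', '2000', 'A1')
print(explicit.match(name).groups())  # ('AB', '2000', '2000', 'A1')
```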
To test it, I ran the following test:
import re
myList = ['ab2000', 'abc2000_2000', 'AB2000', 'ab2000_1', 'ABC2000_01',
          'AB2000_2', 'ABC2000_02', 'AB2000_A1', 'AB2000_2000_A1']
pat = re.compile(r'([A-Z]+)(\d+)(?:_([A-Z\d]+))?(?:_([A-Z\d]+))?', re.I)
for tt in myList:
    print(f'{tt:16} ', end=' ')
    mtch = pat.match(tt)
    if mtch:
        # print only the groups that actually matched
        for it in mtch.groups():
            if it is not None:
                print(f'{it:5}', end=' ')
    print()
getting:
ab2000 ab 2000
abc2000_2000 abc 2000 2000
AB2000 AB 2000
ab2000_1 ab 2000 1
ABC2000_01 ABC 2000 01
AB2000_2 AB 2000 2
ABC2000_02 ABC 2000 02
AB2000_A1 AB 2000 A1
AB2000_2000_A1 AB 2000 2000 A1

How to apply regex for multiple phrases on a dataframe column?

Hello I have a dataframe where I want to remove a specific set of characters 'fwd', 're', 'RE' from every row that starts with these phrases or contains these phrases. The issue I am facing is that I do not know how to apply regex for each case.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd:RE:Re: Please take action on the action needed items
4 Fix all the mistakes please
5 Fwd:Re: Take action on the attachments in this email
6 Fwd:RE: Action is required
I want a result dataframe like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 Fix all the mistakes please
5 Take action on the attachments in this email
6 Action is required
To get rid of 'Fwd' I used df['msg'].str.replace(r'^Fwd: ','')
If these prefixes can occur in any order, and any number of times, at the start of the string, you could use a repeating pattern:
^(?:(?:Fwd|R[eE]):)+\s*
^ Start of string
(?: Non capturing group
(?:Fwd|R[eE]): Match either Fwd, Re or RE, followed by a colon
)+ Close non capturing group and repeat 1+ times
\s* Match trailing whitespaces
Regex demo
In the replacement use an empty string.
You could also make the pattern case insensitive using re.IGNORECASE and use (?:fwd|re) if you want to match all possible variations.
For example
str.replace(r'^(?:(?:Fwd|R[eE]):)+\s*', '', regex=True)  # pandas 2.0+ defaults to regex=False
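For the case-insensitive variant mentioned above, a minimal sketch with plain re on a few of the question's subjects (the names prefix, subjects and cleaned are mine):

```python
import re

# One or more "fwd:" / "re:" prefixes in any letter case, then optional whitespace
prefix = re.compile(r'^(?:(?:fwd|re):)+\s*', re.IGNORECASE)

subjects = [
    'Fwd: Please look at the attached documents and take action',
    'NSN for the ones who care',
    'Fwd:RE:Re: Please take action on the action needed items',
    'Redemption!',
]
cleaned = [prefix.sub('', s) for s in subjects]
for line in cleaned:
    print(line)
```

Because the pattern is anchored with ^ and requires the colon, a subject such as 'Redemption!' is left alone. On a dataframe column, Series.str.replace accepts a flags argument, so something like df['summary'].str.replace(r'^(?:(?:fwd|re):)+\s*', '', flags=re.IGNORECASE, regex=True) should behave the same way.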
The key concept in this case, I believe, is the | operator, which works as an "either/or" between alternatives in the pattern. It's very useful for cases like this.
This is how I would solve the problem:
import pandas as pd
df = pd.DataFrame({'index':[0,1,2,3,4,5,6,7],
'summary':['Fwd: Please look at the attached documents and take action ',
'NSN for the ones who care',
'News for all team members ',
'Fwd:RE:Re: Please take action on the action needed items',
'Fix all the mistakes please ',
'Fwd:Re: Take action on the attachments in this email',
'Fwd:RE: Action is required',
'Redemption!']})
df['clean'] = df['summary'].str.replace(r'(?:^Fwd:|R[eE]:)\s*', '', regex=True)
print(df)
Output:
index ... clean
0 0 ... Please look at the attached documents and tak...
1 1 ... NSN for the ones who care
2 2 ... News for all team members
3 3 ... Please take action on the action needed items
4 4 ... Fix all the mistakes please
5 5 ... Take action on the attachments in this email
6 6 ... Action is required
7 7 ... Redemption!
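One thing to watch with |: it has the lowest precedence in a regex, so an anchor or quantifier written next to one alternative applies to that alternative only, unless the alternatives are grouped. A small sketch with plain re (variable names are mine):

```python
import re

subject = 'Fwd: Action is required'

# Ungrouped: \s* belongs only to the R[eE]: branch, so the space after "Fwd:" survives
ungrouped = re.sub(r'^Fwd:|R[eE]:\s*', '', subject)
# Grouped: \s* applies to whichever branch matched
grouped = re.sub(r'(?:^Fwd:|R[eE]:)\s*', '', subject)

print(repr(ungrouped))  # ' Action is required'
print(repr(grouped))    # 'Action is required'
```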

Extract hashtags from columns of a pandas dataframe

I have a dataframe df. I want to extract the hashtags from the tweets where Max == 45:
Max Tweets
42 via #VIE_unlike at #fashion
42 Ny trailer #katamaritribute #ps3
45 Saved a baby bluejay from dogs #fb
45 #Niley #Niley #Niley
I'm trying something like this, but it gives an empty dataframe:
df.loc[df['Max'] == 45, [hsh for hsh in 'tweets' if hsh.startswith('#')]]
Is there something in pandas that I can use to do this efficiently?
You can use pd.Series.str.findall:
In [956]: df.Tweets.str.findall(r'#.*?(?=\s|$)')
Out[956]:
0 [#fashion]
1 [#katamaritribute, #ps3]
2 [#fb]
3 [#Niley, #Niley, #Niley]
This returns a column of lists.
If you want to filter first and then find, you can do so quite easily using boolean indexing:
In [957]: df.Tweets[df.Max == 45].str.findall(r'#.*?(?=\s|$)')
Out[957]:
2 [#fb]
3 [#Niley, #Niley, #Niley]
Name: Tweets, dtype: object
The regex used here is:
#.*?(?=\s|$)
To understand it, break it down:
#.*? - carries out a non-greedy match for a word starting with a hashtag
(?=\s|$) - lookahead for the end of the word or end of the sentence
If a # can appear in the middle of a word that is not a hashtag, that would yield false positives, which you wouldn't want. In that case, you can modify your regex to include a lookbehind:
(?:(?<=\s)|(?<=^))#.*?(?=\s|$)
The regex lookbehind asserts that either a space or the start of the sentence must precede a # character.
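A quick way to compare the two patterns is to run both on a string that has a # inside a word. This is a sketch with plain re; the sample tweet is invented for the illustration.

```python
import re

loose = re.compile(r'#.*?(?=\s|$)')
strict = re.compile(r'(?:(?<=\s)|(?<=^))#.*?(?=\s|$)')

tweet = "#Niley rocks, see page#2 and #fb"

print(loose.findall(tweet))   # ['#Niley', '#2', '#fb']
print(strict.findall(tweet))  # ['#Niley', '#fb']
```

The same strict pattern can be passed to df.Tweets.str.findall in place of the simpler one.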

getting words between m and n characters

I am trying to get all names that start with a capital letter and end with a full stop on the same line, where the number of characters is between 3 and 5.
My text is as follows:
King. Great happinesse
Rosse. That now Sweno, the Norwayes King,
Craues composition:
Nor would we deigne him buriall of his men,
Till he disbursed, at Saint Colmes ynch,
Ten thousand Dollars, to our generall vse
King. No more that Thane of Cawdor shall deceiue
Our Bosome interest: Goe pronounce his present death,
And with his former Title greet Macbeth
Rosse. Ile see it done
King. What he hath lost, Noble Macbeth hath wonne.
I am testing it out on this link. I am trying to get all words between 3 and 5 characters long but haven't succeeded.
Does this produce your desired output?
import re
re.findall(r'[A-Z].{2,4}\.', text)
When text contains the text in your question it will produce this output:
['King.', 'Rosse.', 'King.', 'Rosse.', 'King.']
The regex pattern matches a capital letter followed by 2 to 4 arbitrary characters and a literal dot. You can tighten that up if required, e.g. the pattern [A-Z][a-z]{2,4}\. would match an uppercase character followed by 2 to 4 lowercase characters followed by a literal dot/period.
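To illustrate what the tighter pattern buys you, here is a small sketch on a made-up two-line text where the loose pattern picks up a phrase that is not a name:

```python
import re

text = "King. Go on.\nRosse. Ile see it done"

loose = re.findall(r'[A-Z].{2,4}\.', text)
tight = re.findall(r'[A-Z][a-z]{2,4}\.', text)

print(loose)  # ['King.', 'Go on.', 'Rosse.']
print(tight)  # ['King.', 'Rosse.']
```

The loose version accepts "Go on." because . also matches the space, while [a-z] does not.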
If you don't want duplicates you can use a set to get rid of them:
>>> set(re.findall(r'[A-Z].{2,4}\.', text))
{'Rosse.', 'King.'}
You may have your own reasons for wanting to use regexes here, but Python provides a rich set of string methods and (IMO) it's easier to understand the code using these:
matched_words = []
with open('text.txt') as f:
    for line in f:
        for word in line.split():
            # the name itself (without the trailing dot) must be 3 to 5 characters long
            if word[0].isupper() and word[-1] == '.' and 3 <= len(word) - 1 <= 5:
                matched_words.append(word)
print(matched_words)
