How to apply regex for multiple phrases on a dataframe column? - python

Hello I have a dataframe where I want to remove a specific set of characters 'fwd', 're', 'RE' from every row that starts with these phrases or contains these phrases. The issue I am facing is that I do not know how to apply regex for each case.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd:RE:Re: Please take action on the action needed items
4 Fix all the mistakes please
5 Fwd:Re: Take action on the attachments in this email
6 Fwd:RE: Action is required
I want a result dataframe like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 Fix all the mistakes please
5 Take action on the attachments in this email
6 Action is required
To get rid of 'Fwd' I used df['msg'].str.replace(r'^Fwd: ','')

If they can be anywhere in the string, you could use a repeating pattern:
^(?:(?:Fwd|R[eE]):)+\s*
^ Start of string
(?: Non capturing group
(?:Fwd|R[eE]): match either Fwd, Re or RE
)+ Close non capturing group and repeat 1+ times
\s* Match trailing whitespaces
Regex demo
In the replacement use an empty string.
You could also make the pattern case insensitive using re.IGNORECASE and use (?:fwd|re) if you want to match all possible variations.
For example
str.replace(r'^(?:(?:Fwd|R[eE]):)+\s*','')

The key concept in this case I believe is using the | operator which works as Either or Or for the pattern. It's very useful for these cases.
This is how I would solve the problem:
import pandas as pd
df = pd.DataFrame({'index':[0,1,2,3,4,5,6,7],
'summary':['Fwd: Please look at the attached documents and take action ',
'NSN for the ones who care',
'News for all team members ',
'Fwd:RE:Re: Please take action on the action needed items',
'Fix all the mistakes please ',
'Fwd:Re: Take action on the attachments in this email',
'Fwd:RE: Action is required',
'Redemption!']})
df['clean'] = df['summary'].str.replace(r'^Fwd:|R[eE]:\s*','')
print(df)
Output:
index ... clean
0 0 ... Please look at the attached documents and tak...
1 1 ... NSN for the ones who care
2 2 ... News for all team members
3 3 ... Please take action on the action needed items
4 4 ... Fix all the mistakes please
5 5 ... Take action on the attachments in this email
6 6 ... Action is required
7 7 ... Redemption!

Related

Delete words with regex patterns in Python from a dataframe

I'm playing around with regular expression in Python for the below data.
Random
0 helloooo
1 hahaha
2 kebab
3 shsh
4 title
5 miss
6 were
7 laptop
8 welcome
9 pencil
I would like to delete the words which have patterns of repeated letters (e.g. blaaaa), repeated pair of letters (e.g. hahaha) and any words which have the same adjacent letters around one letter (e.g.title, kebab, were).
Here is the code:
import pandas as pd
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
df = df.loc[~df.agg(lambda x: x.str.contains(r"([a-z])+\1{1,}\b"), axis=1).any(1)].reset_index(drop=True)
print(df)
Below is the output for the above with a Warning message:
UserWarning: This pattern has match groups. To actually get the groups, use str.extract.
Random
0 hahaha
1 kebab
2 shsh
3 title
4 were
5 laptop
6 welcome
7 pencil
However, I expect to see this:
Random
0 laptop
1 welcome
2 pencil
You can use Series.str.contains directly to create a mask and disable the user warning before and enable it after:
import pandas as pd
import warnings
data = {'Random' : ['helloooo', 'hahaha', 'kebab', 'shsh', 'title', 'miss', 'were', 'laptop', 'welcome', 'pencil']}
df = pd.DataFrame(data)
warnings.filterwarnings("ignore", 'This pattern has match groups') # Disable the warning
df['Random'] = df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
warnings.filterwarnings("always", 'This pattern has match groups') # Enable the warning
Output:
>>> df['Random'][~df['Random'].str.contains(r"([a-z]+)[a-z]?\1")]
# =>
7 laptop
8 welcome
9 pencil
Name: Random, dtype: object
The regex you have contains an issue: the quantifier is put outside of the group, and \1 was looking for the wrong repeated string. Also, the \b word boundary is superflous. The ([a-z]+)[a-z]?\1 pattern matches for one or more letters, then any one optional letter, and the same substring right after it.
See the regex demo.
We can safely disable the user warning because we deliberately use the capturing group here, as we need to use a backreference in this regex pattern. The warning needs re-enabling to avoid using capturing groups in other parts of our code where it is not necessary.
IIUC, you can use sth like the pattern r'(\w+)(\w)?\1', i.e., one or more letters, an optional letter, and the letters from the first match. This gives the right result:
df[~df.Random.str.contains(r'(\w+)(\w)?\1')]

Get unique id using regex

I have following text:
This is the foo test the date purchase id is /STAR2015A. This is another foo test the purchase is /STAR2022M. Yet another foo test, get it back by if u dont like, purchase id is /STAR2039K. You wont be surprised if i write another id /STAR2050L.
I want to get all the unique purchase ids. It starts with /STAR every time and ends with letter A-M. Also, the number ranges from 2010 - 2050. I tried following but it doesnt return any result:
import re
dset = []
text = "This is the foo test the date purchase id is /STAR2015A. This is another foo test the purchase is /STAR2022M. Yet another foo test, get it back by if u dont like, purchase id is /STAR2039K. You wont be surprised if i write another id /STAR2050L. "
pattern = re.findall("[^\/STAR[20][10-50][A-M]]",text)
print(pattern)
Let me know how to solve this.
You could use
/STAR20(?:[1-4]\d|50)[A-M]
/STAR20 Match literally
(?: Non capture group
[1-4]\d Match 10 - 49
| or
50 Match 50
) Close group
[A-M] Match A - M
Regex demo | Python demo
Example
result = re.findall(r"/STAR20(?:[1-4]\d|50)[A-M]", text)

Removing rows from a DataFrame based on words in a string

Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the hashtags: cashtags = ["$AAPL", "$FB", $AMZN"]
Basically, I want to go through all the lines in this column of the DataFrame and keep the rows with a unique cashtag, regardless if it is in caps or not, and delete all others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0,len(df)-1):
print(i, end = "\r")
tweet = df["Current"][i]
count = 0
for word in cashtags:
count += str(tweet).count(word)
df["Word_count"][i] = count
However if I do this I will be deleting rows that I don't want to. For example, rows where the unique cashtag is mentioned several times ([3],[5])
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$w+)(?!.*/1) see e.g. here for a detailed explanation, but the general structure is:
\$w+: find a dollar sign followed by one or more letters/numbers (or
an _), if you just wanted to count how many tags you had this is all you need
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this will remove cases where you have the same element more than once so you need to add a negative lookahead meaning don't match
(?!.*/1): Is a negative lookahead, this means don't match if it is followed by the same match later on. This will mean that only the last tag is counted in the string.
Using this, you can then use pandas DataFrame.str methods, specifically DataFrame.str.count (the re.I does a case insensitive match)
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?

Python Regex to extract codes from a string

I have a string like -
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2)
CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
I want to extract coupon codes "CODEONE" and "CODETWO" from this string if the following if condition gets true -
if "coupon code" in string:
Please help how i can extract these coupon codes? Actually i need a generic RE for this because i may have other strings where location of the codes may occur at different place and it is also possible that there is only one code
This might help.
import re
Srting = "$33.53 with 2 coupon codes : \r\n\r\n1) CODEONE\r\n\r\n2) CODETWO \r\n\r\nBoth coupons only work if you buy 1 by 1"
for i in re.findall("\d+\)(.*)", Srting):
print(i.strip())
Output:
CODEONE
CODETWO

A Python regex to find soccer team fixtures in string

I am using the Requests module to access the HTML from my target website and then using Beautiful Soup to select a specific element on the website. The element in question is a table that contains the results thus far of the English Premier League 2016/2017 season. The table contains the match date, the teams involved, the full-time score and the half-time score. I want to use Python to parse the HTML of the table element and extract the fixtures listed on there. The teams are always listed as:
Team A - Team B
A team name can be 1-3 separate strings (e.g. Burnley, Manchester United, West Ham United.
My attempt so far is:
import re
teamsRegex = re.compile(r'((\w+\s)+-(\s\w+)+)')
My logic here is that the first team can be 1-3 separate strings in length and each string is always followed by a white space. Therefore, the pattern (\w+\s)+ represents a string of any length followed by a white space and can be repeated 1 or many times. The second team name will always begin with a white space following the "-" character and again can be a string of any length, repeated 1 or many times (\s\w+)+.
I'm sort of achieving the desired results but the above is not entirely correct. I am returned a list with my desired result at index 0 followed by the first string of index 0 as index 1, and the last string in index 0 as index 2.
Example string:
'Burnley - Swansea City align=center width=45> 0 - 1 align=center> (0-0)'
Regex finds:
[('Burnley - Swansea City', 'Burnley ', ' City'), ('0 - 1', '0 ', ' 1')]
I would just like it to find [('Burnley - Swansea City')]
Many thanks in anticipation of any help!
r'(?:[A-Z][a-z]*\s)+-(?:\s[A-Z][a-z]*)+'
Here you have two non-capturing (?:, so you'll get the full match only) groups to match the teams' names. I chose to use letters explicitly, so the expressions only match words beginning with capital letters and exclude digits. You should change that if the teams' names can contain digits (like "BVB 09").
Depending on the HTML file's content one could add a final lookahead (?= align) to increase specifity.
Edit:
To match up to three capitals and optional '&'s, try this :
r'(?:[A-Z&]{1,3}[a-z]*\s)+-(?:\s[A-Z&]{1,3}[a-z]*)+'

Categories