Get unique id using regex - python

I have the following text:
This is the foo test the date purchase id is /STAR2015A. This is another foo test the purchase is /STAR2022M. Yet another foo test, get it back by if u dont like, purchase id is /STAR2039K. You wont be surprised if i write another id /STAR2050L.
I want to get all the unique purchase ids. Each id starts with /STAR and ends with a letter A-M, and the number ranges from 2010 to 2050. I tried the following, but it doesn't return any results:
import re
dset = []
text = "This is the foo test the date purchase id is /STAR2015A. This is another foo test the purchase is /STAR2022M. Yet another foo test, get it back by if u dont like, purchase id is /STAR2039K. You wont be surprised if i write another id /STAR2050L. "
pattern = re.findall("[^\/STAR[20][10-50][A-M]]",text)
print(pattern)
Let me know how to solve this.

You could use
/STAR20(?:[1-4]\d|50)[A-M]
/STAR20 Match literally
(?: Non capture group
[1-4]\d Match 10 - 49
| or
50 Match 50
) Close group
[A-M] Match A - M
Example
result = re.findall(r"/STAR20(?:[1-4]\d|50)[A-M]", text)
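Note that re.findall returns every match, including repeats, while the question asks for unique ids. A minimal sketch (with a shortened sample text) that deduplicates while preserving first-seen order, relying on dicts keeping insertion order in Python 3.7+:
import re

text = ("This is the foo test the date purchase id is /STAR2015A. "
        "This is another foo test the purchase is /STAR2022M. "
        "Here /STAR2015A appears a second time.")

# Collect all matches, then drop repeats while keeping
# the order of first appearance.
ids = re.findall(r"/STAR20(?:[1-4]\d|50)[A-M]", text)
unique_ids = list(dict.fromkeys(ids))
print(unique_ids)  # ['/STAR2015A', '/STAR2022M']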


python keyword search in csv comments

I am trying to do a multiple keyword search in a csv file, just in the comments column. For some reason, when I try to search I get this error message: 'DataFrame' object has no attribute 'description'.
For example:
table1.csv
id_Acco, user_name, post_time, comments
1543603, SameDavie, "2020/09/06", The car in the house
1543595, Johntim, "2020/09/11", You can filter the data
1558245, ACAtesdfgsf, "2020/09/19", if you’re looking at a ship
1558245, TDRtesdfgsf, "2020/09/19", you can filter the table to show
Output
id_Acco, user_name, post_time, comments
1543603, SameDavie, "2020/09/06", The car in the house
1543595, Johntim, "2020/09/11", You can filter the data
1558245, TDRtesdfgsf, "2020/09/19", you can filter the table to show
Code:
df = pd.read_csv('table1.csv')
df[df.description.str.contains('house| filter | table | car')]
df.to_csv('forum_fraud_date_keyword.csv')
The error occurs because your DataFrame has no description column; the text lives in the comments column. You can filter on it using a regex with .str.contains():
df = df.loc[df.comments.str.contains(r'\b(?:house|filter|table|car)\b')]
Here we use a raw string (r'...') so that backslashes in the regex are not interpreted as Python escape sequences.
We put \b around the four target words so that only whole words match rather than partial strings: e.g. carmen won't be matched by car, and tablespoon won't be matched by table. If you do want to match partial strings, remove the pair of \b from the regex above.
Result:
print(df)
id_Acco, user_name, post_time, comments
0 1543603, SameDavie, "2020/09/06", The car in the house
1 1543595, Johntim, "2020/09/11", You can filter the data
3 1558245, TDRtesdfgsf, "2020/09/19", you can filter the table to show
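For reference, a minimal end-to-end sketch of the whole flow, assuming the column is named comments as in your header (case=False and na=False are optional hardening: they make the match case-insensitive and treat missing comments as non-matches instead of raising):
import pandas as pd

df = pd.read_csv('table1.csv')

# Keep only the rows whose comments contain one of the whole words.
mask = df['comments'].str.contains(
    r'\b(?:house|filter|table|car)\b', case=False, na=False)
df[mask].to_csv('forum_fraud_date_keyword.csv', index=False)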

How to apply regex for multiple phrases on a dataframe column?

Hello, I have a dataframe where I want to remove a specific set of prefixes, 'Fwd', 'Re', 'RE', from every row that starts with or repeats these phrases. The issue I am facing is that I do not know how to apply a regex that handles each case.
my dataframe looks like this:
summary
0 Fwd: Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Fwd:RE:Re: Please take action on the action needed items
4 Fix all the mistakes please
5 Fwd:Re: Take action on the attachments in this email
6 Fwd:RE: Action is required
I want a result dataframe like this:
summary
0 Please look at the attached documents and take action
1 NSN for the ones who care
2 News for all team members
3 Please take action on the action needed items
4 Fix all the mistakes please
5 Take action on the attachments in this email
6 Action is required
To get rid of 'Fwd' I used df['msg'].str.replace(r'^Fwd: ','')
If these prefixes can be repeated at the start of the string, you could use a repeating pattern:
^(?:(?:Fwd|R[eE]):)+\s*
^ Start of string
(?: Non capturing group
(?:Fwd|R[eE]): match either Fwd, Re or RE
)+ Close non capturing group and repeat 1+ times
\s* Match trailing whitespaces
In the replacement use an empty string.
You could also make the pattern case-insensitive by passing flags=re.IGNORECASE (or case=False) to str.replace, and write the group as (?:fwd|re) if you want to match all possible variations.
For example
df['summary'].str.replace(r'^(?:(?:Fwd|R[eE]):)+\s*', '', regex=True)
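A minimal runnable sketch of that replacement on a few of the sample rows (regex=True is needed in recent pandas versions so the pattern is treated as a regular expression rather than a literal):
import pandas as pd

df = pd.DataFrame({'summary': [
    'Fwd: Please look at the attached documents and take action',
    'Fwd:RE:Re: Please take action on the action needed items',
    'NSN for the ones who care',
]})

# Strip one or more leading Fwd:/Re:/RE: prefixes plus any
# whitespace that follows them.
df['summary'] = df['summary'].str.replace(
    r'^(?:(?:Fwd|R[eE]):)+\s*', '', regex=True)
print(df)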
The key concept in this case, I believe, is the | operator, which works as an either/or for the pattern. It's very useful for these cases.
This is how I would solve the problem:
import pandas as pd

df = pd.DataFrame({'index': [0, 1, 2, 3, 4, 5, 6, 7],
                   'summary': ['Fwd: Please look at the attached documents and take action ',
                               'NSN for the ones who care',
                               'News for all team members ',
                               'Fwd:RE:Re: Please take action on the action needed items',
                               'Fix all the mistakes please ',
                               'Fwd:Re: Take action on the attachments in this email',
                               'Fwd:RE: Action is required',
                               'Redemption!']})
df['clean'] = df['summary'].str.replace(r'^Fwd:|R[eE]:\s*', '', regex=True)
print(df)
print(df)
Output:
index ... clean
0 0 ... Please look at the attached documents and tak...
1 1 ... NSN for the ones who care
2 2 ... News for all team members
3 3 ... Please take action on the action needed items
4 4 ... Fix all the mistakes please
5 5 ... Take action on the attachments in this email
6 6 ... Action is required
7 7 ... Redemption!

Removing rows from a DataFrame based on words in a string

Novice programmer here seeking help.
I have a Dataframe that looks like this:
Current
0 "Invest in $APPL, $FB and $AMZN"
1 "Long $AAPL, Short $AMZN"
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I also have a list with all the cashtags: cashtags = ["$AAPL", "$FB", "$AMZN"]
Basically, I want to go through all the rows in this column of the DataFrame, keep the rows that mention exactly one unique cashtag (regardless of whether it is in caps or not), and delete all the others.
Desired Output:
Desired
2 "$AAPL earnings announcement soon"
3 "$FB is releasing a new product. Will $FB's product be good?"
4 "$Fb doing good today"
5 "$AMZN high today. Will $amzn continue like this?"
I've tried to basically count how many times the word appears in the string and add that value to a new column so that I can delete the rows based on the number.
for i in range(0, len(df)):
    print(i, end="\r")
    tweet = df["Current"][i]
    count = 0
    for word in cashtags:
        count += str(tweet).count(word)
    df["Word_count"][i] = count
However, if I do this I will be deleting rows that I don't want to delete, for example rows where the single cashtag is mentioned several times (rows 3 and 5).
How can I achieve my desired output?
Rather than summing the count of each cashtag, you should sum its presence or absence, since you don't care how many times each cashtag occurs, only how many cashtags.
for tag in cashtags:
    count += tag in tweet
Or more succinctly: sum(tag in tweet for tag in cashtags)
To make the comparison case insensitive, you can upper case the tweets beforehand. Additionally, it would be more idiomatic to filter on a temporary series and avoid explicitly looping over the dataframe (though you may need to read up more about Pandas to understand how this works):
df[df.Current.apply(lambda tweet: sum(tag in tweet.upper() for tag in cashtags)) == 1]
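Putting the pieces together, a minimal sketch on the sample tweets (the cashtags list is the corrected one from the question; upper-casing each tweet handles the mixed capitalization):
import pandas as pd

df = pd.DataFrame({'Current': [
    "Invest in $APPL, $FB and $AMZN",
    "Long $AAPL, Short $AMZN",
    "$AAPL earnings announcement soon",
    "$FB is releasing a new product. Will $FB's product be good?",
    "$Fb doing good today",
    "$AMZN high today. Will $amzn continue like this?",
]})
cashtags = ["$AAPL", "$FB", "$AMZN"]

# Count how many distinct cashtags appear in each tweet,
# then keep the rows that mention exactly one.
n_tags = df.Current.apply(
    lambda tweet: sum(tag in tweet.upper() for tag in cashtags))
print(df[n_tags == 1])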
If you ever want to generalise your question to any tag, then this is a good place for a regular expression.
You want to match against (\$\w+)(?!.*\1). The general structure is:
\$\w+: find a dollar sign followed by one or more letters/numbers (or
an _). If you just wanted to count how many tags you had, this is all you need,
e.g.
df.Current.str.count(r'\$\w+')
will print
0 3
1 2
2 1
3 2
4 1
5 2
but this counts the same tag more than once when it repeats, so you need to add a negative lookahead:
(?!.*\1): a negative lookahead, meaning don't match if the same tag occurs again later in the string. This way only the last occurrence of each tag is counted.
Using this, you can then use the pandas string methods, specifically Series.str.count (passing re.I as the flags argument makes the match case-insensitive):
import re
df[df.Current.str.count(r'(\$\w+)(?!.*\1)', re.I) == 1]
which will give you your desired output
Current
2 $AAPL earnings announcement soon
3 $FB is releasing a new product. Will $FB's pro...
4 $Fb doing good today
5 $AMZN high today. Will $amzn continue like this?

How to remove duplicated words in csv rows in python?

I am working with a csv file and I have many rows that contain duplicated words; I want to remove any duplicates without losing the order of the sentences.
csv file example (userID and description are the column names):
userID, description
12, hello world hello world
13, I will keep the 2000 followers same I will keep the 2000 followers same
14, I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car
.
.
I would like to have the output as:
userID, description
12, hello world
13, I will keep the 2000 followers same
14, I paid $2000 to the car
.
.
I have already tried several similar posts, but none of them fixed my problem or changed anything. (The order of my output file matters, since I don't want to lose the ordering.) It would be great if you could provide a code sample that I can run on my side and learn from.
Thank you
(I am using Python 3.7.)
To remove duplicates, I'd suggest a solution involving the OrderedDict data structure:
from collections import OrderedDict

df['description'] = (df['description'].str.split()
                     .apply(lambda words: ' '.join(OrderedDict.fromkeys(words))))
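Since you are on Python 3.7, where plain dicts also preserve insertion order, the same idea works without the import:
df['description'] = df['description'].str.split().apply(
    lambda words: ' '.join(dict.fromkeys(words)))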
The code below works for me:
import pandas as pd

a = pd.Series(["hello world hello world",
               "I will keep the 2000 followers same I will keep the 2000 followers same",
               "I paid $2000 to the car I paid $2000 to the car I paid $2000 to the car"])
a.apply(lambda x: " ".join([w for i, w in enumerate(x.split())
                            if x.split().index(w) == i]))
Basically the idea is, for each word, to keep it only if its position is the first occurrence in the list (the string split on spaces). If the word has already occurred, the .index() function returns an index smaller than the current position, and the word is eliminated.
This will give you:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
dtype: object
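Note that calling x.split() and .index() inside the comprehension rescans the sentence for every word, which is quadratic in the number of words. A seen-set version of the same idea (dedupe_words is a hypothetical helper name) does it in one pass:
def dedupe_words(sentence):
    seen = set()
    kept = []
    for word in sentence.split():
        if word not in seen:  # keep first occurrence only
            seen.add(word)
            kept.append(word)
    return " ".join(kept)

a.apply(dedupe_words)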
Solution taken from How can I tell if a string repeats itself in Python?:
def principal_period(s):
    i = (s + s).find(s, 1)
    return s[:i]

df['description'].apply(principal_period)
Output:
0 hello world
1 I will keep the 2000 followers same
2 I paid $2000 to the car
Name: description, dtype: object
Since this uses apply on string, it might be slow.
Answer taken from How can I tell if a string repeats itself in Python?
import pandas as pd

def principal_period(s):
    s += ' '
    i = (s + s).find(s, 1, -1)
    return None if i == -1 else s[:i]

df = pd.read_csv(r'path\to\filename_in.csv')
df['description'] = df['description'].apply(principal_period)
df.to_csv(r'output\path\filename_out.csv')
Explanation:
I have added a space at the end to account for the fact that the repeating sentences are delimited by a space. The function then looks for the second occurrence of the string when it is added to itself, excluding the first and last characters from the search to avoid matching the leading and trailing copies when there is no repetition. This efficiently finds the position where the second occurrence starts, which is where the first, shortest repeating unit ends, and that repeating unit is returned.
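A quick check of the helper on one of the sample rows; note that the returned unit keeps the trailing delimiter space, which you can strip off if needed:
print(repr(principal_period("hello world hello world")))
# 'hello world '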

regular expression in python - can't find a specific string

I started to learn regex in python and I've got the following task:
I need to write a script taking those 2 strings:
string_1 = 'merchant ID 1234, device ID 45678, serial# 123456789'
string_2 = 'merchant ID 8765, user ID 531476, serial# 87654321'
and displaying only the strings which have both merchant ID #### and device ID #### in them.
To check for the first condition I wrote the following line:
ex_1 = re.findall(r'\merchant\b\s\ID\b\s\d+', string_1)
print (ex_1)
output: ['merchant ID 1234'] - works fine!
The problem is that I can't get the other condition to match for some reason:
ex_2 = re.findall(r'\device\b\s\ID\b\s\d+', string_1)
output: [] - empty list.
What am I doing wrong?
Because in:
ex_2 = re.findall(r'\device\b\s\ID\b\s\d+', string_1)
the leading \d is the digit class, so \device matches a digit followed by evice, while \m in \merchant is still just m (which is why your first pattern happened to work). You should remove the backslashes before \device and \ID:
>>> re.findall(r'device\b\sID\b\s\d+', string_1)
['device ID 45678']
You can match both ids in one pattern. Use parentheses for grouping and | for alternation:
(merchant ID \d+|device ID \d+)
e.g.
>>> re.findall(r'(merchant ID \d+|device ID \d+)', string_1)
['merchant ID 1234', 'device ID 45678']
Be careful with the backslash character: \d is a special sequence that matches a digit, so '\device' matches [0-9] + 'evice'.
With Pythex you can test your regex, and consult a great cheatsheet.
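Since the task is to display only the strings that contain both ids, here is a minimal sketch combining the two checks (re.search is enough because you only need presence, not every match):
import re

strings = [
    'merchant ID 1234, device ID 45678, serial# 123456789',
    'merchant ID 8765, user ID 531476, serial# 87654321',
]

for s in strings:
    # Print the string only if both ids are present.
    if re.search(r'merchant ID \d+', s) and re.search(r'device ID \d+', s):
        print(s)
# Only the first string is printed.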
